  1. Mar 17, 2022
  2. Mar 16, 2022
  3. Mar 15, 2022
    • samples/bpf, xdpsock: Fix race when running for fix duration of time · 8fa42d78
      Niklas Söderlund authored
      When running xdpsock for a fixed duration of time before terminating
      using --duration=<n>, there is a race condition that may cause xdpsock
      to terminate immediately.
      
      When running for a fixed duration of time, the check that determines when
      to terminate execution is in is_benchmark_done() and runs in the context
      of the poller thread:
      
          if (opt_duration > 0) {
                  unsigned long dt = (get_nsecs() - start_time);
      
                  if (dt >= opt_duration)
                          benchmark_done = true;
          }
      
      However, start_time is only set after the poller thread has been
      created. This leaves a small window, when the poller thread is starting
      and calls is_benchmark_done() for the first time, in which start_time is
      not yet set. In that case start_time still has its initial value of 0,
      so dt is measured from 0 rather than from the application's start time.
      The duration check then passes at once and sets benchmark_done, which
      in turn terminates the xdpsock application.
      
      Fix this by setting start_time before creating the poller thread.
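
The effect of an unset start_time can be reproduced in isolation. The sketch below is a simplified stand-in for the duration check, not the xdpsock code itself; the function name duration_elapsed() and its parameters are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the duration check in is_benchmark_done().
 * now_ns plays the role of get_nsecs(), start_ns of start_time and
 * duration_ns of opt_duration.
 */
bool duration_elapsed(uint64_t now_ns, uint64_t start_ns, uint64_t duration_ns)
{
	if (duration_ns == 0)
		return false;	/* no duration requested: never stop on time */

	/* With start_ns still 0, the difference is the raw clock value. */
	return now_ns - start_ns >= duration_ns;
}
```

With a clock value of a few seconds and start_ns left at 0, a one-second duration is reported as elapsed on the very first call, which is the early termination described above.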
      
      Fixes: d3f11b01 ("samples/bpf: xdpsock: Add duration option to specify how long to run")
      Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220315102948.466436-1-niklas.soderlund@corigine.com
    • bpf, sockmap: Fix double uncharge the mem of sk_msg · 2486ab43
      Wang Yufen authored
      If tcp_bpf_sendmsg is running during a tear down operation, psock may be
      freed.
      
      tcp_bpf_sendmsg()
       tcp_bpf_send_verdict()
        sk_msg_return()
        tcp_bpf_sendmsg_redir()
         if (unlikely(!psock))
           sk_msg_free()
      
      The memory of msg has already been uncharged in tcp_bpf_send_verdict() by
      sk_msg_return(), and would be uncharged again by sk_msg_free(). When psock
      is NULL, we can simply return an error code. This then triggers
      sk_msg_free_nocharge() in the error path of __SK_REDIRECT and has
      the side effect of propagating an error up to user space. This is a
      slight change in behavior from the user's side, but it looks the same as
      an error thrown by the redirect on the socket.
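
The accounting imbalance can be illustrated with a toy model. This is a minimal sketch, not kernel code: the toy_* types and helpers are hypothetical and only mimic the roles of sk_msg_return(), sk_msg_free() and sk_msg_free_nocharge().

```c
/* Toy model: 'charged' stands for memory accounted to the socket. */
struct toy_sock { long charged; };
struct toy_msg { long size; };

void toy_charge(struct toy_sock *sk, long bytes) { sk->charged += bytes; }
void toy_uncharge(struct toy_sock *sk, long bytes) { sk->charged -= bytes; }

/* Frees the message and uncharges it (role of sk_msg_free()). */
void toy_msg_free(struct toy_sock *sk, struct toy_msg *msg)
{
	toy_uncharge(sk, msg->size);
	msg->size = 0;
}

/* Frees the message without touching the accounting. */
void toy_msg_free_nocharge(struct toy_msg *msg)
{
	msg->size = 0;
}

/* Redirect with a missing psock; returns the final accounting balance. */
long toy_redirect_no_psock(int fixed)
{
	struct toy_sock sk = { 0 };
	struct toy_msg msg = { 100 };

	toy_charge(&sk, msg.size);	/* message was charged when built */
	toy_uncharge(&sk, msg.size);	/* role of sk_msg_return() */

	if (fixed)
		toy_msg_free_nocharge(&msg);	/* error path, no second uncharge */
	else
		toy_msg_free(&sk, &msg);	/* uncharges a second time */

	return sk.charged;
}
```

In this model the unfixed path ends with a negative balance, while the fixed path ends at zero.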
      
      This issue can trigger the following warning:
      WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       worker_thread+0x30/0x350
       ? process_one_work+0x3c0/0x3c0
       kthread+0xe6/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com
    • bpf, sockmap: Fix more uncharged while msg has more_data · 84472b43
      Wang Yufen authored
      In tcp_bpf_send_verdict(), if msg has more data after
      tcp_bpf_sendmsg_redir():
      
      tcp_bpf_send_verdict()
       tosend = msg->sg.size  //msg->sg.size = 22220
       case __SK_REDIRECT:
        sk_msg_return()  //uncharged msg->sg.size(22220) to sk->sk_forward_alloc
        tcp_bpf_sendmsg_redir() //after tcp_bpf_sendmsg_redir, msg->sg.size=11000
       goto more_data;
       tosend = msg->sg.size  //msg->sg.size = 11000
       case __SK_REDIRECT:
        sk_msg_return()  //uncharged msg->sg.size(11000) to sk->sk_forward_alloc
      
      The msg->sg.size (11000) has been uncharged twice. To fix this, charge the
      remaining msg->sg.size before the goto more_data.
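
The arithmetic in the trace above can be checked with a small model. This is only a sketch under simplifying assumptions: toy_send_verdict() is a hypothetical function that tracks a single balance and two passes through the redirect case.

```c
/*
 * Toy model of two passes through the redirect case.
 * total:      initial message size, already charged to the socket
 * sent_first: bytes consumed by the first redirect
 * fixed:      non-zero to charge the remainder before the second pass
 * Returns the accounting balance after both passes.
 */
long toy_send_verdict(long total, long sent_first, int fixed)
{
	long charged = total;
	long size = total;

	/* First pass: the whole message is uncharged, part of it is sent. */
	charged -= size;
	size -= sent_first;

	/* The fix: charge what is left before looping to more_data. */
	if (fixed)
		charged += size;

	/* Second pass: the remainder is uncharged and sent. */
	charged -= size;

	return charged;
}
```

With the sizes from the trace (22220 bytes, of which 11000 remain after the first redirect), the unfixed flow ends 11000 bytes below zero and the fixed flow balances.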
      
      This issue can trigger the following warnings:
      WARNING: CPU: 0 PID: 9860 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
      Call Trace:
       <TASK>
       inet_csk_destroy_sock+0x55/0x110
       __tcp_close+0x279/0x470
       tcp_close+0x1f/0x60
       inet_release+0x3f/0x80
       __sock_release+0x3d/0xb0
       sock_close+0x11/0x20
       __fput+0x92/0x250
       task_work_run+0x6a/0xa0
       do_exit+0x33b/0xb60
       do_group_exit+0x2f/0xa0
       get_signal+0xb6/0x950
       arch_do_signal_or_restart+0xac/0x2a0
       ? vfs_write+0x237/0x290
       exit_to_user_mode_prepare+0xa9/0x200
       syscall_exit_to_user_mode+0x12/0x30
       do_syscall_64+0x46/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      
      WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       worker_thread+0x30/0x350
       ? process_one_work+0x3c0/0x3c0
       kthread+0xe6/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-4-wangyufen@huawei.com
    • bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is full · 9c34e38c
      Wang Yufen authored
      tcp_bpf_sendmsg() may run while the sk msg is full. When sk_msg_alloc()
      returns -ENOMEM, tcp_bpf_sendmsg() goes to wait_for_memory. If part of the
      memory has already been allocated by sk_msg_alloc(), that is, msg_tx->sg.size
      is greater than osize after sk_msg_alloc(), a memleak occurs. To fix this, use
      sk_msg_trim() to release the allocated memory, then go to wait_for_memory.

      Other call paths of sk_msg_alloc() have a similar issue, such as
      tls_sw_sendmsg(), so handle the sk_msg_trim() logic inside sk_msg_alloc(),
      as Cong Wang suggested.
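
The idea of rolling back a partial allocation inside the allocator can be sketched as follows. This is an illustrative model, not the kernel implementation: toy_msg_alloc(), its chunked growth and the budget counter are all assumptions.

```c
#include <errno.h>

struct toy_alloc_msg { int size; };

/*
 * Grow msg to 'want' bytes in steps of at most 'chunk', taking memory
 * from *budget. On failure, everything taken by this call is given back
 * and the message is trimmed to its original size.
 */
int toy_msg_alloc(struct toy_alloc_msg *msg, int want, int chunk, int *budget)
{
	int osize = msg->size;

	while (msg->size < want) {
		int left = want - msg->size;
		int use = left < chunk ? left : chunk;

		if (*budget < use) {
			/* Roll back the partial allocation before failing. */
			*budget += msg->size - osize;
			msg->size = osize;
			return -ENOMEM;
		}
		*budget -= use;
		msg->size += use;
	}
	return 0;
}
```

Because the rollback lives in the allocator, every caller that waits for memory and retries starts from the original size without needing its own cleanup.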
      
      This issue can trigger the following warnings:
      WARNING: CPU: 3 PID: 7950 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
      Call Trace:
       <TASK>
       inet_csk_destroy_sock+0x55/0x110
       __tcp_close+0x279/0x470
       tcp_close+0x1f/0x60
       inet_release+0x3f/0x80
       __sock_release+0x3d/0xb0
       sock_close+0x11/0x20
       __fput+0x92/0x250
       task_work_run+0x6a/0xa0
       do_exit+0x33b/0xb60
       do_group_exit+0x2f/0xa0
       get_signal+0xb6/0x950
       arch_do_signal_or_restart+0xac/0x2a0
       exit_to_user_mode_prepare+0xa9/0x200
       syscall_exit_to_user_mode+0x12/0x30
       do_syscall_64+0x46/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      
      WARNING: CPU: 3 PID: 2094 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       kthread+0xe6/0x110
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-3-wangyufen@huawei.com
    • bpf, sockmap: Fix memleak in sk_psock_queue_msg · 938d3480
      Wang Yufen authored
      If tcp_bpf_sendmsg is running during a tear down operation we may enqueue
      data on the ingress msg queue while tear down is trying to free it.
      
       sk1 (redirect sk2)                         sk2
       -------------------                      ---------------
      tcp_bpf_sendmsg()
       tcp_bpf_send_verdict()
        tcp_bpf_sendmsg_redir()
         bpf_tcp_ingress()
                                                sock_map_close()
                                                 lock_sock()
          lock_sock() ... blocking
                                                 sk_psock_stop
                                                  sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
                                                 release_sock(sk);
          lock_sock()
          sk_mem_charge()
          get_page()
          sk_psock_queue_msg()
           sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED);
            drop_sk_msg()
          release_sock()
      
      On the drop_sk_msg() path, the msg has already charged memory from sk via
      sk_mem_charge() and holds sg pages that still need to be put. To fix this,
      use sk_msg_free() and then kfree() the msg.
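
A toy model of the enqueue path shows what the drop branch has to undo. The structures and the function toy_queue_msg() are hypothetical; they only mirror the order of operations in the diagram above.

```c
#include <stdbool.h>

struct toy_psock { bool tx_enabled; int queued; };
struct toy_ingress_sock { long charged; int page_refs; };

/*
 * Returns true if the message was queued. When the psock has been
 * stopped, the charge and the page reference taken earlier are released
 * before the message is dropped.
 */
bool toy_queue_msg(struct toy_psock *psock, struct toy_ingress_sock *sk,
		   long bytes)
{
	sk->charged += bytes;	/* role of sk_mem_charge() */
	sk->page_refs += 1;	/* role of get_page() */

	if (psock->tx_enabled) {
		psock->queued++;
		return true;
	}

	/* Drop path: undo the charge and the page reference. */
	sk->charged -= bytes;
	sk->page_refs -= 1;
	return false;
}
```

Without the two lines in the drop path, a stopped psock would leave both the charge and the page reference behind.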
      
      This issue can trigger the following warnings:
      WARNING: CPU: 0 PID: 9202 at net/core/stream.c:205 sk_stream_kill_queues+0xc8/0xe0
      Call Trace:
       <IRQ>
       inet_csk_destroy_sock+0x55/0x110
       tcp_rcv_state_process+0xe5f/0xe90
       ? sk_filter_trim_cap+0x10d/0x230
       ? tcp_v4_do_rcv+0x161/0x250
       tcp_v4_do_rcv+0x161/0x250
       tcp_v4_rcv+0xc3a/0xce0
       ip_protocol_deliver_rcu+0x3d/0x230
       ip_local_deliver_finish+0x54/0x60
       ip_local_deliver+0xfd/0x110
       ? ip_protocol_deliver_rcu+0x230/0x230
       ip_rcv+0xd6/0x100
       ? ip_local_deliver+0x110/0x110
       __netif_receive_skb_one_core+0x85/0xa0
       process_backlog+0xa4/0x160
       __napi_poll+0x29/0x1b0
       net_rx_action+0x287/0x300
       __do_softirq+0xff/0x2fc
       do_softirq+0x79/0x90
       </IRQ>
      
      WARNING: CPU: 0 PID: 531 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x175/0x1b0
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       ? process_one_work+0x3c0/0x3c0
       worker_thread+0x30/0x350
       ? process_one_work+0x3c0/0x3c0
       kthread+0xe6/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 9635720b ("bpf, sockmap: Fix memleak on ingress msg enqueue")
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-2-wangyufen@huawei.com
  4. Mar 12, 2022
  5. Mar 11, 2022
  6. Mar 10, 2022
    • bpf, test_run: Use kvfree() for memory allocated with kvmalloc() · 743bec1b
      Yihao Han authored
      The memory is allocated with kvmalloc(), so the corresponding release
      function should be kvfree() rather than kfree().
      
      Generated by: scripts/coccinelle/api/kfree_mismatch.cocci
      
      Fixes: b530e9e1 ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
      Signed-off-by: Yihao Han <hanyihao@vivo.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20220310092828.13405-1-hanyihao@vivo.com
    • bpf: Initialise retval in bpf_prog_test_run_xdp() · eecbfd97
      Toke Høiland-Jørgensen authored
      The kernel test robot pointed out that the newly added
      bpf_test_run_xdp_live() runner doesn't set the retval in the caller (by
      design), which means that the variable can be passed uninitialised to
      bpf_test_finish(). Fix this by initialising the variable properly.
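
The pattern can be shown with a minimal sketch. The functions toy_run() and toy_test_run() are hypothetical and only imitate a runner that leaves its output parameter untouched in one mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical runner: in "live" mode it never writes *retval. */
int toy_run(bool live, uint32_t *retval)
{
	if (!live)
		*retval = 2;	/* stands for the program's return code */
	return 0;
}

/* Caller that later hands retval on to a reporting step. */
uint32_t toy_test_run(bool live)
{
	uint32_t retval = 0;	/* initialised so the live path is defined */

	toy_run(live, &retval);
	return retval;
}
```

Without the initialiser, the value returned on the live path would be indeterminate.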
      
      Fixes: b530e9e1 ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220310110228.161869-1-toke@redhat.com
    • bpftool: Restore support for BPF offload-enabled feature probing · f655c088
      Niklas Söderlund authored
      Commit 1a56c18e ("bpftool: Stop supporting BPF offload-enabled
      feature probing") removed the support to probe for BPF offload features.
      This is still useful for NFP NICs, which can support
      offloading of BPF programs.
      
      The reason for the dropped support was that libbpf, starting with v1.0,
      would drop support for passing the ifindex to the BPF prog/map/helper
      feature probing APIs. In order to keep this useful feature for NFP,
      restore the functionality by moving it directly into bpftool.
      
      The code restored is a simplified version of the code that existed in
      libbpf which supported passing the ifindex. The simplification is that it
      only targets the cases where ifindex is given and calls into libbpf for
      the cases where it's not.
      
      Before restoring support for probing offload features:
      
        # bpftool feature probe dev ens4np0
        Scanning system call availability...
        bpf() syscall is available
      
        Scanning eBPF program types...
      
        Scanning eBPF map types...
      
        Scanning eBPF helper functions...
        eBPF helpers supported for program type sched_cls:
        eBPF helpers supported for program type xdp:
      
        Scanning miscellaneous eBPF features...
        Large program size limit is NOT available
        Bounded loop support is NOT available
        ISA extension v2 is NOT available
        ISA extension v3 is NOT available
      
      With support for probing offload features restored:
      
        # bpftool feature probe dev ens4np0
        Scanning system call availability...
        bpf() syscall is available
      
        Scanning eBPF program types...
        eBPF program_type sched_cls is available
        eBPF program_type xdp is available
      
        Scanning eBPF map types...
        eBPF map_type hash is available
        eBPF map_type array is available
      
        Scanning eBPF helper functions...
        eBPF helpers supported for program type sched_cls:
        	- bpf_map_lookup_elem
        	- bpf_get_prandom_u32
        	- bpf_perf_event_output
        eBPF helpers supported for program type xdp:
        	- bpf_map_lookup_elem
        	- bpf_get_prandom_u32
        	- bpf_perf_event_output
        	- bpf_xdp_adjust_head
        	- bpf_xdp_adjust_tail
      
        Scanning miscellaneous eBPF features...
        Large program size limit is NOT available
        Bounded loop support is NOT available
        ISA extension v2 is NOT available
        ISA extension v3 is NOT available
      
      Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20220310121846.921256-1-niklas.soderlund@corigine.com
    • Merge branch 'Add support for transmitting packets using XDP in bpf_prog_run()' · de55c9a1
      Alexei Starovoitov authored

      Toke Høiland-Jørgensen says:
      
      ====================
      
      This series adds support for transmitting packets using XDP in
      bpf_prog_run(), by enabling a new "live packet" mode which will handle
      the XDP program return codes and redirect the packets to the stack or other
      devices.
      
      The primary use case for this is testing the redirect map types and the
      ndo_xdp_xmit driver operation without an external traffic generator. But it
      turns out to also be useful for creating a programmable traffic generator
      in XDP, as well as injecting frames into the stack. A sample traffic
      generator, which was included in previous versions of the series but has
      now moved to xdp-tools, transmits up to 9 Mpps/core on my test machine.
      
      To transmit the frames, the new mode instantiates a page_pool structure in
      bpf_prog_run() and initialises the pages to contain XDP frames with the
      data passed in by userspace. These frames can then be handled as though
      they came from the hardware XDP path, and the existing page_pool code takes
      care of returning and recycling them. The setup is optimised for high
      performance with a high number of repetitions to support stress testing and
      the traffic generator use case; see patch 1 for details.
      
      v11:
      - Fix override of return code in xdp_test_run_batch()
      - Add Martin's ACKs to remaining patches
      
      v10:
      - Only propagate memory allocation errors from xdp_test_run_batch()
      - Get rid of BPF_F_TEST_XDP_RESERVED; batch_size can be used to probe
      - Check that batch_size is unset in non-XDP test_run funcs
      - Lower the number of repetitions in the selftest to 10k
      - Count number of recycled pages in the selftest
      - Fix a few other nits from Martin, carry forward ACKs
      
      v9:
      - XDP_DROP packets in the selftest to ensure pages are recycled
      - Fix a few issues reported by the kernel test robot
      - Rewrite the documentation of the batch size to make it a bit clearer
      - Rebase to newest bpf-next
      
      v8:
      - Make the batch size configurable from userspace
      - Don't interrupt the packet loop on errors in do_redirect (this can be
        caught from the tracepoint)
      - Add documentation of the feature
      - Add reserved flag userspace can use to probe for support (kernel didn't
        check flags previously)
      - Rebase to newest bpf-next, disallow live mode for jumbo frames
      
      v7:
      - Extend the local_bh_disable() to cover the full test run loop, to prevent
        running concurrently with the softirq. Fixes a deadlock with veth xmit.
      - Reinstate the forwarding sysctl setting in the selftest, and bump up the
        number of packets being transmitted to trigger the above bug.
      - Update commit message to make it clear that user space can select the
        ingress interface.
      
      v6:
      - Fix meta vs data pointer setting and add a selftest for it
      - Add local_bh_disable() around code passing packets up the stack
      - Create a new netns for the selftest and use a TC program instead of the
        forwarding hack to count packets being XDP_PASS'ed from the test prog.
      - Check for the correct ingress ifindex in the selftest
      - Rebase and drop patches 1-5 that were already merged
      
      v5:
      - Rebase to current bpf-next
      
      v4:
      - Fix a few code style issues (Alexei)
      - Also handle the other return codes: XDP_PASS builds skbs and injects them
        into the stack, and XDP_TX is turned into a redirect out the same
        interface (Alexei).
      - Drop the last patch adding an xdp_trafficgen program to samples/bpf; this
        will live in xdp-tools instead (Alexei).
      - Add a separate bpf_test_run_xdp_live() function to test_run.c instead of
        entangling the new mode in the existing bpf_test_run().
      
      v3:
      - Reorder patches to make sure they all build individually (Patchwork)
      - Remove a couple of unused variables (Patchwork)
      - Remove unlikely() annotation in slow path and add back John's ACK that I
        accidentally dropped for v2 (John)
      
      v2:
      - Split up __xdp_do_redirect to avoid passing two pointers to it (John)
      - Always reset context pointers before each test run (John)
      - Use get_mac_addr() from xdp_sample_user.h instead of rolling our own (Kumar)
      - Fix wrong offset for metadata pointer
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add selftest for XDP_REDIRECT in BPF_PROG_RUN · 55fcacca
      Toke Høiland-Jørgensen authored

      This adds a selftest for the XDP_REDIRECT facility in BPF_PROG_RUN, that
      redirects packets into a veth and counts them using an XDP program on the
      other side of the veth pair and a TC program on the local side of the veth.
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20220309105346.100053-6-toke@redhat.com
    • selftests/bpf: Move open_netns() and close_netns() into network_helpers.c · a3033884
      Toke Høiland-Jørgensen authored

      These will also be used by the xdp_do_redirect test being added in the next
      commit.
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20220309105346.100053-5-toke@redhat.com
    • libbpf: Support batch_size option to bpf_prog_test_run · 24592ad1
      Toke Høiland-Jørgensen authored

      Add support for setting the new batch_size parameter to BPF_PROG_TEST_RUN
      to libbpf; just add it as an option and pass it through to the kernel.
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20220309105346.100053-4-toke@redhat.com