  1. Jan 28, 2022
    • compiler_types: define __user as __attribute__((btf_type_tag("user"))) · 7472d5a6
      Yonghong Song authored
      
      
      The __user attribute is currently mainly used by sparse for type checking.
      The attribute indicates whether a memory access targets the user memory
      address space or not. Such information is important when tracing kernel
      internal functions or data structures, as accessing user memory often
      requires different mechanisms than accessing kernel memory. For example,
      perf-probe needs an explicit command-line specification to indicate that
      a particular argument or string is in user-space memory ([1], [2], [3]).
      Currently, vmlinux BTF is available in the kernels of many distributions.
      If __user attribute information were available in vmlinux BTF, such
      explicit annotations from users would no longer be necessary, as the
      kernel could figure it out by itself with vmlinux BTF.
      
      Besides the above possible use for perf-probe, another use case is
      the bpf verifier. Currently, for BPF_PROG_TYPE_TRACING bpf programs,
      users can write direct dereferences like
        p->m1->m2
      where "p" could be a function parameter. Without __user information in
      BTF, the verifier will assume p->m1 accesses kernel memory and will
      generate normal loads. Suppose "p" is actually tagged with __user in the
      source code. In that case, p->m1 actually accesses user memory, so a
      direct load is not right and may produce an incorrect result. In such
      cases, bpf_probe_read_user() is the correct way to read p->m1.
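
      The access pattern above can be sketched in plain C with the helper
      stubbed out. This is an illustration only: bpf_probe_read_user() here is
      a local memcpy() stand-in (real BPF programs get the helper from the
      kernel), and struct inner/struct outer, read_m2() and demo() are made-up
      names for this sketch:

```c
#include <string.h>

/* Illustration only: in a real BPF_PROG_TYPE_TRACING program,
 * bpf_probe_read_user() is the kernel helper; here it is stubbed with
 * memcpy() so the access pattern compiles and runs standalone. */
struct inner { int m2; };
struct outer { struct inner *m1; };

static long bpf_probe_read_user(void *dst, unsigned int size, const void *user_ptr)
{
        memcpy(dst, user_ptr, size);    /* stand-in for the real helper */
        return 0;
}

/* Reading p->m1->m2 when p is tagged __user: every hop through user
 * memory goes through bpf_probe_read_user() instead of a direct load. */
static int read_m2(struct outer *p)
{
        struct inner *m1;
        int m2;

        if (bpf_probe_read_user(&m1, sizeof(m1), &p->m1))
                return -1;
        if (bpf_probe_read_user(&m2, sizeof(m2), &m1->m2))
                return -1;
        return m2;
}

/* self-check used below */
static int demo(void)
{
        struct inner i = { .m2 = 7 };
        struct outer o = { .m1 = &i };

        return read_m2(&o);
}
```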
      
      To support encoding __user information in BTF, a new attribute
        __attribute__((btf_type_tag("<arbitrary_string>")))
      is implemented in clang ([4]). For example, if we have
        #define __user __attribute__((btf_type_tag("user")))
      during kernel compilation, the "user" attribute information will
      be preserved in DWARF. After pahole converts DWARF to BTF, the __user
      information will be available in vmlinux BTF.
      
      The following is an example with latest upstream clang (clang14) and
      pahole 1.23:
      
        [$ ~] cat test.c
        #define __user __attribute__((btf_type_tag("user")))
        int foo(int __user *arg) {
                return *arg;
        }
        [$ ~] clang -O2 -g -c test.c
        [$ ~] pahole -JV test.o
        ...
        [1] INT int size=4 nr_bits=32 encoding=SIGNED
        [2] TYPE_TAG user type_id=1
        [3] PTR (anon) type_id=2
        [4] FUNC_PROTO (anon) return=1 args=(3 arg)
        [5] FUNC foo type_id=4
        [$ ~]
      
      You can see that for the function argument "int __user *arg", its type
      is described as
        PTR -> TYPE_TAG(user) -> INT
      The kernel can use this information for bpf verification or other
      use cases.
      
      Current btf_type_tag is only supported in clang (>= clang14) and
      pahole (>= 1.23). gcc support is also proposed and under development ([5]).
      
        [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2
        [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
        [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
        [4] https://reviews.llvm.org/D111199
        [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/
      
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127154600.652613-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • cgroup/bpf: fast path skb BPF filtering · 46531a30
      Pavel Begunkov authored
      
      
      Even though a static key protects against the overhead of cgroup-bpf
      skb filtering when nothing is attached, in many cases it's not enough,
      as registering a filter for one type will ruin the fast path for all
      others. This is observed in production servers I've looked at, but also
      on laptops, where registration is done during init by systemd or
      something else.

      Add a per-socket fast-path check guarding against such overhead. This
      affects both the receive and transmit paths of TCP, UDP and other
      protocols. It showed a ~1% tx/s improvement in small-payload UDP send
      benchmarks using a real NIC in a server environment, and the number
      jumps to 2-3% for preemptible kernels.
      
      Reviewed-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: fix a clang compilation error · cdb5ed97
      Yonghong Song authored
      
      
      When building selftests/bpf with clang
        make -j LLVM=1
        make -C tools/testing/selftests/bpf -j LLVM=1
      I hit the following compilation error:
      
        trace_helpers.c:152:9: error: variable 'found' is used uninitialized whenever 'while' loop exits because its condition is false [-Werror,-Wsometimes-uninitialized]
                while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        trace_helpers.c:161:7: note: uninitialized use occurs here
                if (!found)
                     ^~~~~
        trace_helpers.c:152:9: note: remove the condition if it is always true
                while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                       1
        trace_helpers.c:145:12: note: initialize the variable 'found' to silence this warning
                bool found;
                          ^
                           = false
      
      It is possible that for a sane /proc/self/maps we may never hit the above
      issue in practice. But let us initialize the variable 'found' properly to
      silence the compilation error.
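
      A minimal sketch of the fixed pattern, modeled on the trace_helpers.c
      loop quoted above (function and variable names here are illustrative,
      not the selftests' exact code):

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the fix: 'found' must be initialized, because the while
 * condition can be false on the very first fscanf(), which would leave
 * the post-loop test reading an indeterminate value. */
static bool find_exec_mapping(FILE *f, size_t *out_start)
{
        size_t start, end, base;
        char perms[5];
        bool found = false;     /* the fix: was 'bool found;' */

        while (fscanf(f, "%zx-%zx %4s %zx %*[^\n]\n",
                      &start, &end, perms, &base) == 4) {
                if (perms[2] == 'x') {
                        *out_start = start;
                        found = true;
                        break;
                }
        }
        return found;           /* well-defined even if the loop never ran */
}

/* helper for exercising the parser on an in-memory maps snippet */
static bool find_exec_mapping_in(const char *text)
{
        size_t start;
        FILE *f = fmemopen((void *)text, strlen(text), "r");
        bool ret = f && find_exec_mapping(f, &start);

        if (f)
                fclose(f);
        return ret;
}
```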
      
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127163726.1442032-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests, xsk: Fix bpf_res cleanup test · 3b22523b
      Magnus Karlsson authored
      After commit 710ad98c ("veth: Do not record rx queue hint in veth_xmit"),
      veth no longer receives traffic on the same queue as it was sent on. This
      breaks the bpf_res test for the AF_XDP selftests, as the socket tied to
      queue 1 will not receive traffic anymore.
      
      Modify the test so that two sockets are tied to queue id 0 using a shared
      umem instead. When killing the first socket, enter the second socket into
      the xskmap so that traffic will flow to it. This still tests that the
      resources are not cleaned up until after the second socket dies, without
      having to rely on veth supporting rx_queue hints.
      
      Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220125082945.26179-1-magnus.karlsson@gmail.com
    • Merge branch 'xsk-batching' · 33372bc2
      Daniel Borkmann authored
      
      
      Maciej Fijalkowski says:
      
      ====================
      Unfortunately, similar scalability issues that were addressed for XDP
      processing in ice, exist for XDP in the zero-copy driver used by AF_XDP.
      Let's resolve them in mostly the same way as we did in [0] and utilize
      the Tx batching API from XSK buffer pool.
      
      Move the array of Tx descriptors that is used with the batching approach
      to the XSK buffer pool. This means that future users of this API will not
      have to carry the array on their own side; they can simply refer to the
      pool's tx_desc array.

      We also improve the Rx side, where we extend ice_alloc_rx_buf_zc() to
      handle the ring wrap and bump the Rx tail more frequently. By doing so,
      the Rx side is adjusted to the Tx side, which was needed for the l2fwd
      scenario.
      
      Here are the performance improvements that this set brings, measured
      with the xdpsock app in busy-poll mode for the 1-core and 2-core modes.
      Both Tx and Rx rings were sized to 1k entries and the busy-poll budget
      was 256.
      
      ----------------------------------------------------------------
           |      txonly       |      l2fwd      |      rxdrop
      ----------------------------------------------------------------
      1C   |       149%        |       14%       |        3%
      ----------------------------------------------------------------
      2C   |       134%        |       20%       |        5%
      ----------------------------------------------------------------
      
      Next step will be to introduce batching onto Rx side.
      
      v5:
      * collect acks
      * fix typos
      * correct comments showing cache line boundaries in ice_tx_ring struct
      v4 - address Alexandr's review:
      * new patch (2) for making sure ring size is pow(2) when attaching
        xsk socket
      * don't open code ALIGN_DOWN (patch 3)
      * resign from storing tx_thresh in ice_tx_ring (patch 4)
      * scope variables in a better way for Tx batching (patch 7)
      v3:
      * drop likely() that was wrapping napi_complete_done (patch 1)
      * introduce configurable Tx threshold (patch 2)
      * handle ring wrap on Rx side when allocating buffers (patch 3)
      * respect NAPI budget when cleaning Tx descriptors in ZC (patch 6)
      v2:
      * introduce new patch that resets @next_dd and @next_rs fields
      * use batching API for AF_XDP Tx on ice side
      
        [0]: https://lore.kernel.org/bpf/20211015162908.145341-8-anthony.l.nguyen@intel.com/
      ====================
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • ice: xsk: Borrow xdp_tx_active logic from i40e · 59e92bfe
      Maciej Fijalkowski authored
      One of the things that commit 5574ff7b ("i40e: optimize AF_XDP Tx
      completion path") introduced was the @xdp_tx_active field. Its usage
      from i40e can be adapted to the ice driver and give us positive
      performance results.
      
      If the descriptor that @next_dd points to has been sent by the HW (its DD
      bit is set), then we are sure that at least a quarter of the ring is
      ready to be cleaned. If @xdp_tx_active is 0, which means that the related
      xdp_ring is not used for XDP_{TX, REDIRECT} workloads, then we know how
      many XSK entries should be placed in the completion queue, IOW walking
      through the ring can be skipped.
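
      The decision above can be condensed into a small sketch;
      can_bulk_complete() is a hypothetical name for illustration, not a
      function in the driver:

```c
#include <stdbool.h>

/* Sketch (names illustrative, not the driver's): the Tx cleaning fast
 * path is taken when the descriptor at @next_dd has its DD bit set and
 * no XDP_TX/XDP_REDIRECT frames are in flight, i.e. every entry in the
 * ready quarter is an XSK entry and the ring walk can be skipped. */
static bool can_bulk_complete(bool dd_bit_set, unsigned int xdp_tx_active)
{
        return dd_bit_set && xdp_tx_active == 0;
}
```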
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-9-maciej.fijalkowski@intel.com
    • ice: xsk: Improve AF_XDP ZC Tx and use batching API · 126cdfe1
      Maciej Fijalkowski authored
      Apply the logic that was done for regular XDP in commit 9610bd98
      ("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
      of that, introduce batching to Tx that is inspired by i40e's
      implementation, with adjustments to the cleaning logic - take into
      account the NAPI budget in ice_clean_xdp_irq_zc().
      
      Separating the stats structs onto separate cache lines seemed to improve
      the performance.
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-8-maciej.fijalkowski@intel.com
    • ice: xsk: Avoid potential dead AF_XDP Tx processing · 86e3f78c
      Maciej Fijalkowski authored
      Commit 9610bd98 ("ice: optimize XDP_TX workloads") introduced
      @next_dd and @next_rs to the ice_tx_ring struct. Currently, their state
      is not restored in ice_clean_tx_ring(), which was not causing any
      trouble, as the XDP rings are gone after we're done with the XDP prog on
      the interface.

      For the upcoming usage of the mentioned fields in AF_XDP, this might
      expose us to a potentially dead Tx side. The scenario would look like
      the following (based on xdpsock):
      
      - two xdpsock instances are spawned in Tx mode
      - one of them is killed
      - XDP prog is kept on interface due to the other xdpsock still running
        * this means that XDP rings stayed in place
      - xdpsock is launched again on same queue id that was terminated on
      - @next_dd and @next_rs setting is bogus, therefore transmit side is
        broken
      
      To protect us from the above, restore the initial @next_rs and @next_dd
      values when cleaning the Tx ring.
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-7-maciej.fijalkowski@intel.com
    • i40e: xsk: Move tmp desc array from driver to pool · d1bc532e
      Magnus Karlsson authored
      
      
      Move desc_array from the driver to the pool. The reason behind this is
      that we can then reuse this array as a temporary storage for descriptors
      in all zero-copy drivers that use the batched interface. This will make
      it easier to add batching to more drivers.
      
      i40e is the only driver that has a batched Tx zero-copy
      implementation, so no need to touch any other driver.
      
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com
    • ice: Make Tx threshold dependent on ring length · 3dd411ef
      Maciej Fijalkowski authored
      
      
      XDP_TX workloads use the concept of a Tx threshold, which indicates the
      interval at which the RS bit is set on descriptors, which in turn tells
      the HW to generate an interrupt to signal the completion of Tx on the
      HW side. It is currently based on a constant value of 32, which might
      not work out well for various ring sizes combined with, for example, a
      batch size that can be set via SO_BUSY_POLL_BUDGET.

      Internal tests based on AF_XDP showed that the most convenient setup of
      the mentioned threshold is to make it equal to a quarter of the ring
      length.
      
      Make use of recently introduced ICE_RING_QUARTER macro and use this
      value as a substitute for ICE_TX_THRESH.
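
      A minimal illustration of the idea; RING_QUARTER and tx_thresh here are
      hypothetical stand-ins for the driver's macros, not the exact
      ICE_RING_QUARTER definition:

```c
/* Sketch: tie the RS-bit threshold to the ring length instead of the
 * fixed 32; RING_QUARTER mirrors the idea of the ICE_RING_QUARTER
 * macro mentioned above (exact driver definition not reproduced). */
#define RING_QUARTER(len)       ((len) >> 2)

static unsigned int tx_thresh(unsigned int ring_len)
{
        return RING_QUARTER(ring_len);  /* e.g. a 1024-entry ring -> 256 */
}
```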
      
      Also align the ethtool -G callback so that the next_dd/next_rs fields
      are kept up to date with the ring size.
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-5-maciej.fijalkowski@intel.com
    • ice: xsk: Handle SW XDP ring wrap and bump tail more often · 3876ff52
      Maciej Fijalkowski authored
      
      
      Currently, if ice_clean_rx_irq_zc() processed the whole ring and
      next_to_use != 0, then ice_alloc_rx_buf_zc() would not refill the whole
      ring even if the XSK buffer pool had enough free entries (either from
      the fill ring or the internal recycle mechanism) - this is because the
      ring wrap is not handled.

      Improve the logic in ice_alloc_rx_buf_zc() to address the problem above.
      Do not clamp the count of buffers that is passed to
      xsk_buff_alloc_batch() in the case when next_to_use + buffer count >=
      rx_ring->count, but rather split it and make two calls to the mentioned
      function - one for the part up until the wrap and one for the part after
      the wrap.
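
      The split can be sketched as follows; alloc_batch_stub() stands in for
      xsk_buff_alloc_batch() so the arithmetic runs standalone, and
      refill_with_wrap() is a hypothetical name, not the driver's code:

```c
/* Sketch of the refill split (allocator stubbed so the arithmetic is
 * testable standalone): when next_to_use + count would run past the end
 * of the ring, allocate in two contiguous chunks instead of clamping. */
static unsigned int alloc_batch_stub(unsigned int count)
{
        return count;           /* stand-in for xsk_buff_alloc_batch() */
}

static unsigned int refill_with_wrap(unsigned int next_to_use,
                                     unsigned int ring_count,
                                     unsigned int count)
{
        unsigned int total;

        if (next_to_use + count >= ring_count) {
                unsigned int until_wrap = ring_count - next_to_use;

                total = alloc_batch_stub(until_wrap);           /* up to the wrap */
                total += alloc_batch_stub(count - until_wrap);  /* after the wrap */
        } else {
                total = alloc_batch_stub(count);
        }
        return total;
}
```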
      
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-4-maciej.fijalkowski@intel.com
    • ice: xsk: Force rings to be sized to power of 2 · 296f13ff
      Maciej Fijalkowski authored
      
      
      With the upcoming introduction of batching to the XSK data path, it is
      best performance-wise for the ring descriptor count to be a power of 2.

      Check whether the ring sizes to which the user is going to attach the
      XSK socket fulfill the condition above. For the Tx side, although the
      check is done against the Tx queue while in the end the socket will be
      attached to the XDP queue, it is fine, since XDP queues get their
      ring->count setting from Tx queues.
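
      The check itself is the classic bit trick (the kernel provides it as
      is_power_of_2()); ring_size_ok() is an illustrative name for this
      sketch:

```c
#include <stdbool.h>

/* A power of two has exactly one bit set, so n & (n - 1) clears that
 * bit and yields zero; zero itself must be excluded explicitly. */
static bool ring_size_ok(unsigned int count)
{
        return count != 0 && (count & (count - 1)) == 0;
}
```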
      
      Suggested-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-3-maciej.fijalkowski@intel.com
    • ice: Remove likely for napi_complete_done · a4e18669
      Maciej Fijalkowski authored
      
      
      Remove the likely before napi_complete_done as this is the unlikely case
      when busy-poll is used. Removing this has a positive performance impact
      for busy-poll and no negative impact to the regular case.
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-2-maciej.fijalkowski@intel.com
  2. Jan 27, 2022
    • bpf: remove unused static inlines · 8033c6c2
      Jakub Kicinski authored
      
      
      Remove two dead stubs: sk_msg_clear_meta() was never
      used, and the use of xskq_cons_is_full() was replaced by
      xsk_tx_writeable() in v5.10.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20220126185412.2776254-1-kuba@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: fix uprobe offset calculation in selftests · ff943683
      Andrii Nakryiko authored
      
      
      Fix how selftests determine the relative offset of a function that is
      uprobed. Previously, there was an assumption that the uprobed function
      is always in the first executable region, which is not always the case
      (libbpf CI hits this case now), so the get_base_addr() approach in
      isolation doesn't work anymore. Teach get_uprobe_offset() to determine
      the correct memory mapping and calculate the uprobe offset correctly.

      While at it, I merged the two implementations of the
      get_uprobe_offset() helper, moving the powerpc64-specific logic inside
      (I had to add an extra {} block to avoid an unused variable error for
      insn).
      
      Also ensured that uprobed functions are never inlined, but are still
      static (and thus local to each selftest), by using a no-op asm volatile
      block internally. I didn't want to keep them global __weak, because some
      tests use uprobe's ref counter offset (to test USDT-like logic) which is
      not compatible with non-refcounted uprobe. So it's nicer to have each
      test uprobe target local to the file and guaranteed to not be inlined or
      skipped by the compiler (which can happen with static functions,
      especially if compiling selftests with -O2).
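
      A simplified userspace sketch of the corrected lookup
      (get_uprobe_offset() below is a standalone re-creation, not the
      selftests' exact code), including a local never-inlined target that
      uses the asm volatile trick described above:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>

/* Sketch of the corrected approach: instead of assuming the first
 * executable region, find the mapping that actually contains the
 * function address and add that mapping's file offset. */
static long get_uprobe_offset(const void *addr)
{
        size_t start = 0, end = 0, base = 0;
        char perms[5];
        FILE *f = fopen("/proc/self/maps", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%zx-%zx %4s %zx %*[^\n]\n",
                      &start, &end, perms, &base) == 4) {
                if (perms[2] == 'x' &&
                    (uintptr_t)addr >= start && (uintptr_t)addr < end) {
                        fclose(f);
                        return (uintptr_t)addr - start + base;
                }
        }
        fclose(f);
        return -1;
}

/* local, never-inlined uprobe target, per the asm volatile trick above */
static void __attribute__((noinline)) uprobe_target(void)
{
        __asm__ __volatile__ ("");
}
```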
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220126193058.3390292-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Fix a clang compilation error · e5465a90
      Yonghong Song authored
      
      
      When compiling the kernel and selftests/bpf with the latest llvm as
      below:
        make -j LLVM=1
        make -C tools/testing/selftests/bpf -j LLVM=1
      I hit the following compilation error:
        /.../prog_tests/log_buf.c:215:6: error: variable 'log_buf' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
                if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data_good"))
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        /.../prog_tests/log_buf.c:264:7: note: uninitialized use occurs here
                free(log_buf);
                     ^~~~~~~
        /.../prog_tests/log_buf.c:215:2: note: remove the 'if' if its condition is always false
                if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data_good"))
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        /.../prog_tests/log_buf.c:205:15: note: initialize the variable 'log_buf' to silence this warning
                char *log_buf;
                             ^
                              = NULL
        1 error generated.
      
      The compiler rightfully detected that log_buf is uninitialized in one of
      the failure paths, as indicated above.

      Properly initializing the 'log_buf' variable fixed the issue.
      
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220126181940.4105997-1-yhs@fb.com
  3. Jan 26, 2022
  4. Jan 25, 2022
    • Merge branch 'Fix the incorrect register read for syscalls on x86_64' · 74bb0f0c
      Andrii Nakryiko authored
      
      
      Kenta Tada says:
      
      ====================
      Currently, rcx is read as the fourth parameter of a syscall on x86_64.
      But the x86_64 Linux system call convention actually uses r10. This
      series adds wrappers for users who want to access syscall params when
      analyzing user space.
      
      Changelog:
      ----------
      v1 -> v2:
      - Rebase to current bpf-next
      https://lore.kernel.org/bpf/20211222213924.1869758-1-andrii@kernel.org/
      
      v2 -> v3:
      - Modify the definition of SYSCALL macros for only targeted archs.
      - Define __BPF_TARGET_MISSING variants for completeness.
      - Remove CORE variants. These macros will not be used.
      - Add a selftest.
      
      v3 -> v4:
      - Modify a selftest not to use serial tests.
      - Modify a selftest to use ASSERT_EQ().
      - Extract syscall wrapper for all the other tests.
      - Add CORE variants.
      
      v4 -> v5:
      - Modify the CORE variant macro not to read memory directly.
      - Remove the unnecessary comment.
      - Add a selftest for the CORE variant.
      ====================
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • selftests/bpf: Add a test to confirm PT_REGS_PARM4_SYSCALL · 77fc0330
      Kenta Tada authored
      
      
      Add a selftest to verify the behavior of PT_REGS_xxx
      and the CORE variant.
      
      Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220124141622.4378-4-Kenta.Tada@sony.com
    • libbpf: Fix the incorrect register read for syscalls on x86_64 · d084df3b
      Kenta Tada authored
      
      
      Currently, rcx is read as the fourth parameter of a syscall on x86_64.
      But the x86_64 Linux system call convention actually uses r10. This
      commit adds a wrapper for users who want to access syscall params when
      analyzing user space.
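
      The distinction can be illustrated with a mock register layout
      (mock_pt_regs and both macros below are invented for this sketch; they
      are not libbpf's PT_REGS_PARM4_SYSCALL definition):

```c
/* Illustration with a mock register struct (not the kernel's pt_regs):
 * on x86_64 the 'syscall' instruction clobbers rcx, so the kernel
 * receives the 4th syscall argument in r10, not rcx. */
struct mock_pt_regs {
        unsigned long di, si, dx, r10, r8, r9, cx;
};

#define PARM4_SYSCALL(regs)     ((regs)->r10)   /* correct */
#define PARM4_OLD(regs)         ((regs)->cx)    /* what was read before the fix */
```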
      
      Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220124141622.4378-3-Kenta.Tada@sony.com
    • selftests/bpf: Extract syscall wrapper · 78a20541
      Kenta Tada authored
      
      
      Extract the helper that sets up SYS_PREFIX for fentry and kprobe
      selftests that use __x64_sys_* attach functions.
      
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220124141622.4378-2-Kenta.Tada@sony.com
    • libbpf: Mark bpf_object__open_xattr() deprecated · fc763870
      Christy Lee authored
      
      
      Mark bpf_object__open_xattr() as deprecated; use
      bpf_object__open_file() instead.

      Closes: https://github.com/libbpf/libbpf/issues/287
      
      Signed-off-by: Christy Lee <christylee@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220125010917.679975-1-christylee@fb.com
    • Merge branch 'deprecate bpf_object__open_buffer() API' · bfc0a2e9
      Andrii Nakryiko authored
      
      
      Christy Lee says:
      
      ====================
      
      Deprecate bpf_object__open_buffer() API, replace all usage
      with bpf_object__open_mem().
      
      ====================
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • perf: Stop using bpf_object__open_buffer() API · 5a34d98b
      Christy Lee authored
      
      
      The bpf_object__open_buffer() API is deprecated; use the unified
      opts-based bpf_object__open_mem() API in perf instead. This requires
      at least libbpf 0.0.6.
      
      Signed-off-by: Christy Lee <christylee@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220125005923.418339-3-christylee@fb.com
    • libbpf: Mark bpf_object__open_buffer() API deprecated · 9f45f70a
      Christy Lee authored
      
      
      Deprecate bpf_object__open_buffer() API in favor of the unified
      opts-based bpf_object__open_mem() API.
      
      Signed-off-by: Christy Lee <christylee@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220125005923.418339-2-christylee@fb.com
    • Merge branch 'Add bpf_copy_from_user_task helper and sleepable bpf iterator programs' · c45c79e5
      Alexei Starovoitov authored
      
      
      Kenny Yu says:
      
      ====================
      
      This patch series makes the following changes:
      * Adds a new bpf helper `bpf_copy_from_user_task` to read user space
        memory from a different task.
      * Adds the ability to create sleepable bpf iterator programs.
      
      As an example of how this will be used, at Meta we are using bpf task
      iterator programs and this new bpf helper to read C++ async stack traces of
      a running process for debugging C++ binaries in production.
      
      Changes since v6:
      * Split first patch into two patches: first patch to add support
        for bpf iterators to use sleepable helpers, and the second to add
        the new bpf helper.
      * Simplify implementation of `bpf_copy_from_user_task` based on feedback.
      * Add to docs that the destination buffer will be zero-ed on error.
      
      Changes since v5:
      * Rename `bpf_access_process_vm` to `bpf_copy_from_user_task`.
      * Change return value to be all-or-nothing. If we get a partial read,
        memset all bytes to 0 and return -EFAULT.
      * Add to docs that the helper can only be used by sleepable BPF programs.
      * Fix nits in selftests.
      
      Changes since v4:
      * Make `flags` into u64.
      * Use `user_ptr` arg name to be consistent with `bpf_copy_from_user`.
      * Add an extra check in selftests to verify access_process_vm calls
        succeeded.
      
      Changes since v3:
      * Check if `flags` is 0 and return -EINVAL if not.
      * Rebase on latest bpf-next branch and fix merge conflicts.
      
      Changes since v2:
      * Reorder arguments in `bpf_access_process_vm` to match existing related
        bpf helpers (e.g. `bpf_probe_read_kernel`, `bpf_probe_read_user`,
        `bpf_copy_from_user`).
      * `flags` argument is provided for future extensibility and is not
        currently used, and we always invoke `access_process_vm` with no flags.
      * Merge bpf helper patch and `bpf_iter_run_prog` patch together for better
        bisectability in case of failures.
      * Clean up formatting and comments in selftests.
      
      Changes since v1:
      * Fixed "Invalid wait context" issue in `bpf_iter_run_prog` by using
        `rcu_read_lock_trace()` for sleepable bpf iterator programs.
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add test for sleepable bpf iterator programs · 45105c2e
      Kenny Yu authored
      
      
      This adds a test for bpf iterator programs to make use of sleepable
      bpf helpers.
      
      Signed-off-by: Kenny Yu <kennyyu@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220124185403.468466-5-kennyyu@fb.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    • libbpf: Add "iter.s" section for sleepable bpf iterator programs · a8b77f74
      Kenny Yu authored
      
      
      This adds a new bpf section "iter.s" to allow bpf iterator programs to
      be sleepable.
      
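      With this section, a sleepable iterator program is declared with
      SEC("iter.s/task") rather than SEC("iter/task"). A toy model of the
      section-name classification a loader might perform (an illustration,
      not libbpf's actual SEC_DEF table):

```c
#include <stdbool.h>
#include <string.h>

/* Toy model of section-name classification: a "iter.s/" prefix marks a
 * sleepable iterator program, "iter/" a regular (non-sleepable) one. */
struct sec_props {
	bool is_iter;
	bool sleepable;
};

static struct sec_props classify_section(const char *sec)
{
	struct sec_props p = { false, false };

	if (strncmp(sec, "iter.s/", strlen("iter.s/")) == 0) {
		p.is_iter = true;
		p.sleepable = true;
	} else if (strncmp(sec, "iter/", strlen("iter/")) == 0) {
		p.is_iter = true;
	}
	return p;
}
```
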
      Signed-off-by: Kenny Yu <kennyyu@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220124185403.468466-4-kennyyu@fb.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    • bpf: Add bpf_copy_from_user_task() helper · 376040e4
      Kenny Yu authored
      
      
      This adds a helper for bpf programs to read the memory of other
      tasks.
      
      As an example use case at Meta, we are using a bpf task iterator program
      and this new helper to print C++ async stack traces for all threads of
      a given process.
      
      Signed-off-by: Kenny Yu <kennyyu@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220124185403.468466-3-kennyyu@fb.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    • bpf: Add support for bpf iterator programs to use sleepable helpers · b77fb25d
      Kenny Yu authored
      
      
      This patch allows bpf iterator programs to use sleepable helpers by
      changing `bpf_iter_run_prog` to use the appropriate synchronization.
      With sleepable bpf iterator programs, we can no longer use
      `rcu_read_lock()` and must use `rcu_read_lock_trace()` instead
      to protect the bpf program.
      
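      The dispatch described above can be modeled in userspace C. The rcu_*
      functions below are stubs that only record which lock was taken,
      standing in for the real kernel primitives; this is a sketch of the
      control flow, not the actual `bpf_iter_run_prog` implementation.

```c
#include <stdbool.h>

/* Stub lock primitives that record which lock was most recently taken. */
static const char *last_lock;

static void rcu_read_lock(void)         { last_lock = "rcu"; }
static void rcu_read_unlock(void)       { }
static void rcu_read_lock_trace(void)   { last_lock = "rcu_trace"; }
static void rcu_read_unlock_trace(void) { }

struct bpf_prog { bool sleepable; };

/* Sketch of the dispatch: sleepable iterator programs are protected by
 * RCU-tasks-trace, non-sleepable ones by plain RCU. */
static int run_iter_prog(struct bpf_prog *prog, int (*run)(void))
{
	int ret;

	if (prog->sleepable) {
		rcu_read_lock_trace();
		ret = run();
		rcu_read_unlock_trace();
	} else {
		rcu_read_lock();
		ret = run();
		rcu_read_unlock();
	}
	return ret;
}
```
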
      Signed-off-by: Kenny Yu <kennyyu@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220124185403.468466-2-kennyyu@fb.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · caaba961
      Jakub Kicinski authored
      
      
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2022-01-24
      
      We've added 80 non-merge commits during the last 14 day(s) which contain
      a total of 128 files changed, 4990 insertions(+), 895 deletions(-).
      
      The main changes are:
      
      1) Add XDP multi-buffer support and implement it for the mvneta driver,
         from Lorenzo Bianconi, Eelco Chaudron and Toke Høiland-Jørgensen.
      
      2) Add unstable conntrack lookup helpers for BPF by using the BPF kfunc
         infra, from Kumar Kartikeya Dwivedi.
      
      3) Extend BPF cgroup programs to export custom ret value to userspace via
         two helpers bpf_get_retval() and bpf_set_retval(), from YiFei Zhu.
      
      4) Add support for AF_UNIX iterator batching, from Kuniyuki Iwashima.
      
      5) Complete missing UAPI BPF helper description and change bpf_doc.py script
         to enforce consistent & complete helper documentation, from Usama Arif.
      
      6) Deprecate libbpf's legacy BPF map definitions and streamline XDP APIs to
         follow tc-based APIs, from Andrii Nakryiko.
      
      7) Support BPF_PROG_QUERY for BPF programs attached to sockmap, from Di Zhu.
      
      8) Deprecate libbpf's bpf_map__def() API and replace users with proper getters
         and setters, from Christy Lee.
      
      9) Extend libbpf's btf__add_btf() with an additional hashmap for strings to
         reduce overhead, from Kui-Feng Lee.
      
      10) Fix bpftool and libbpf error handling related to libbpf's hashmap__new()
          utility function, from Mauricio Vásquez.
      
      11) Add support to BTF program names in bpftool's program dump, from Raman Shukhau.
      
      12) Fix resolve_btfids build to pick up host flags, from Connor O'Brien.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (80 commits)
        selftests, bpf: Do not yet switch to new libbpf XDP APIs
        selftests, xsk: Fix rx_full stats test
        bpf: Fix flexible_array.cocci warnings
        xdp: disable XDP_REDIRECT for xdp frags
        bpf: selftests: add CPUMAP/DEVMAP selftests for xdp frags
        bpf: selftests: introduce bpf_xdp_{load,store}_bytes selftest
        net: xdp: introduce bpf_xdp_pointer utility routine
        bpf: generalise tail call map compatibility check
        libbpf: Add SEC name for xdp frags programs
        bpf: selftests: update xdp_adjust_tail selftest to include xdp frags
        bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature
        bpf: introduce frags support to bpf_prog_test_run_xdp()
        bpf: move user_size out of bpf_test_init
        bpf: add frags support to xdp copy helpers
        bpf: add frags support to the bpf_xdp_adjust_tail() API
        bpf: introduce bpf_xdp_get_buff_len helper
        net: mvneta: enable jumbo frames if the loaded XDP program support frags
        bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf program
        net: mvneta: add frags support to XDP_TX
        xdp: add frags support to xdp_return_{buff/frame}
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20220124221235.18993-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>