  1. May 22, 2020
    • bpf: Verifier track null pointer branch_taken with JNE and JEQ · cac616db
      John Fastabend authored
      
      
      Currently, when considering the branches that may be taken for a jump
      instruction if the register being compared is a pointer the verifier
      assumes both branches may be taken. But, if the jump instruction
      is comparing if a pointer is NULL we have this information in the
      verifier encoded in the reg->type so we can do better in these cases.
      Specifically, these two common cases can be handled.
      
       * If the instruction is BPF_JEQ and we are comparing against a
         zero value. The test is 'if ptr == 0 goto +X'. Using the
         type information in reg->type we can decide if the ptr is not
         null. This allows us to avoid pushing both branches onto the
         stack and instead only use the != 0 case. For example,
         PTR_TO_SOCK and PTR_TO_SOCK_OR_NULL encode the null pointer.
         Note if the type is PTR_TO_SOCK_OR_NULL we cannot learn anything.
         Also, if the value is non-zero we learn nothing, because it
         could be any arbitrary value, a different pointer for example.

       * If the instruction is BPF_JNE and we are comparing against a zero
         value then a similar analysis as above can be done. The test in
         asm looks like 'if ptr != 0 goto +X'. Again using the type
         information if the non null type is set (from above PTR_TO_SOCK)
         we know the jump is taken.
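
      The two cases above can be sketched as a small userspace model. This
      is only an illustration of the decision table, not the verifier code:
      the enum names and ptr_branch_taken() are hypothetical stand-ins for
      reg->type, the BPF jump opcodes and is_branch_taken().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the verifier's register types. */
enum reg_type {
	SCALAR_VALUE,
	PTR_TO_SOCK,         /* pointer known to be non-NULL */
	PTR_TO_SOCK_OR_NULL, /* pointer that may still be NULL */
};

/* Hypothetical stand-ins for the BPF jump opcodes. */
enum jmp_op { OP_JEQ, OP_JNE, OP_OTHER };

static bool type_is_pointer(enum reg_type t)
{
	return t != SCALAR_VALUE;
}

static bool type_not_null(enum reg_type t)
{
	return t == PTR_TO_SOCK;
}

/*
 * Returns 1 if the jump is always taken, 0 if it is never taken,
 * and -1 if nothing can be concluded and both branches must be walked.
 */
static int ptr_branch_taken(enum reg_type type, uint64_t imm, enum jmp_op op)
{
	if (!type_is_pointer(type))
		return -1; /* scalars are out of scope for this sketch */
	if (!type_not_null(type))
		return -1; /* the *_OR_NULL type tells us nothing */
	if (imm != 0)
		return -1; /* could be any value, e.g. another pointer */

	switch (op) {
	case OP_JEQ:
		return 0; /* 'if ptr == 0 goto +X' is never taken */
	case OP_JNE:
		return 1; /* 'if ptr != 0 goto +X' is always taken */
	default:
		return -1;
	}
}
```

      For instance, ptr_branch_taken(PTR_TO_SOCK, 0, OP_JNE) yields 1, so
      only the taken branch needs to be explored.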
      
      In this patch we extend is_branch_taken() to consider this extra
      information and to return only the branch that will be taken. This
      resolves a verifier issue reported with C code like the following.
      See progs/test_sk_lookup_kern.c in selftests.
      
       sk = bpf_sk_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
       bpf_printk("sk=%d\n", sk ? 1 : 0);
       if (sk)
         bpf_sk_release(sk);
       return sk ? TC_ACT_OK : TC_ACT_UNSPEC;
      
      In the above the bpf_printk() will resolve the pointer from
      PTR_TO_SOCK_OR_NULL to PTR_TO_SOCK. Then the second test guarding
      the release will cause the verifier to walk both paths resulting
      in an unreleased sock reference. See verifier/ref_tracking.c
      in selftests for an assembly version of the above.
      
      After the above additional logic is added the C code above passes
      as expected.
      
      Reported-by: Andrey Ignatov <rdna@fb.com>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/159009164651.6313.380418298578070501.stgit@john-Precision-5820-Tower
    • Merge branch 'af_xdp-common-alloc' · 79917b24
      Alexei Starovoitov authored
      
      
      Björn Töpel says:
      
      ====================
      Overview
      ========
      
      Driver adoption for AF_XDP has been slow. The amount of code required
      to properly support AF_XDP is substantial and the driver/core APIs are
      vague or even non-existent. Drivers have to manually adjust data
      offsets and update AF_XDP handles differently for different modes
      (aligned/unaligned).
      
      This series attempts to improve the situation by introducing an AF_XDP
      buffer allocation API. The implementation is based on a single core
      (single producer/consumer) buffer pool for the AF_XDP UMEM.
      
      A buffer is allocated using the xsk_buff_alloc() function, and
      returned using xsk_buff_free(). If a buffer is disassociated with the
      pool, e.g. when a buffer is passed to an AF_XDP socket, a buffer is
      said to be released. Currently, the release function is only used by
      the AF_XDP internals and not visible to the driver.
      
      Drivers using this API should register the XDP memory model with the
      new MEM_TYPE_XSK_BUFF_POOL type, which will supersede the
      MEM_TYPE_ZERO_COPY type.
      
      The buffer type is struct xdp_buff, and follows the lifetime of
      regular xdp_buffs, i.e.  the lifetime of an xdp_buff is restricted to
      a NAPI context. In other words, the API is not replacing xdp_frames.
      
      DMA mapping/synching is folded into the buffer handling as well.
      
      @JeffK The Intel drivers changes should go through the bpf-next tree,
             and not your regular Intel tree, since multiple (non-Intel)
             drivers are affected.
      
      The outline of the series is as follows:
      
      Patch 1 is a fix for xsk_umem_xdp_frame_sz().
      
      Patches 2 to 4 are restructurings/cleanups. The XSKMAP implementation is
      moved to net/xdp/. Functions/defines/enums that are only used by the
      AF_XDP internals are moved from the global include/net/xdp_sock.h to
      net/xdp/xsk.h. We are also introducing a new "driver include file",
      include/net/xdp_sock_drv.h, which is the only file NIC driver
      developers adding AF_XDP zero-copy support should care about.
      
      Patch 5 adds the new API, and migrates the "copy-mode"/skb-mode AF_XDP
      path to the new API.
      
      Patches 6 to 11 migrate the existing zero-copy drivers to the new API.
      
      Patch 12 removes the MEM_TYPE_ZERO_COPY memory type, and the "handle"
      member of struct xdp_buff.
      
      Patch 13 simplifies the xdp_return_{frame,frame_rx_napi,buff}
      functions.
      
      Patch 14 is a performance patch, where some functions are inlined.
      
      Finally, patch 15 updates the MAINTAINERS file to correctly mirror the
      new file layout.
      
      Note that this series removes the "handle" member from struct
      xdp_buff, which reduces the xdp_buff size.
      
      After this series, the diff stat of drivers/net/ is:
        27 files changed, 419 insertions(+), 1288 deletions(-)
      
      This series is a first step of simplifying the driver side of
      AF_XDP. I think more of the AF_XDP logic can be moved from the drivers
      to the AF_XDP core, e.g. the "need wakeup" set/clear functionality.
      
      Statistics when allocation fails can now be added to the socket
      statistics via the XDP_STATISTICS getsockopt(). This will be added in
      a follow up series.
      
      Performance
      ===========
      
      As a nice side effect, performance is up a bit as well.
      
        * i40e: 3% higher pps for rxdrop, zero-copy, aligned and unaligned
          (40 GbE, 64B packets).
        * mlx5: RX +0.8 Mpps, TX +0.4 Mpps
      
      Changelog
      =========
      
      v4->v5:
        * Fix various kdoc and GCC warnings (W=1). (Jakub)
      
      v3->v4:
          * mlx5: Remove unused variable num_xsk_frames. (Jakub)
          * i40e: Made i40e_fd_handle_status() static. (kbuild test robot)
      
      v2->v3:
        * Added xsk_umem_xdp_frame_sz() fix to the series. (Björn)
        * Initialize struct xdp_buff member frame_sz. (Björn)
        * Add API to query the DMA address of a frame. (Maxim)
        * Do DMA sync for CPU till the end of the frame to handle possible
          growth (frame_sz). (Maxim)
        * mlx5: Handle frame_sz, use xsk_buff_xdp_get_frame_dma, use
          xsk_buff API for DMA sync on TX, add performance numbers. (Maxim)
      
      v1->v2:
        * mlx5: Fix DMA address handling, set XDP metadata to invalid. (Maxim)
        * ixgbe: Fixed xdp_buff data_end update. (Björn)
        * Swapped SoBs in patch 4. (Maxim)
      
      rfc->v1:
        * Fixed build errors/warnings for m68k and riscv. (kbuild test
          robot)
        * Added headroom/chunk size getter. (Maxim/Björn)
        * mlx5: Put back the sanity check for XSK params, use XSK API to get
          the total headroom size. (Maxim)
        * Fixed spelling in commit message. (Björn)
        * Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)
        * Sorted file entries. (Joe)
        * Added xdp_return_{frame,frame_rx_napi,buff} simplification (Björn)
      
      Thanks for all the comments/input/help!
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • MAINTAINERS, xsk: Update AF_XDP section after moves/adds · 28bee21d
      Björn Töpel authored
      
      
      Update MAINTAINERS to correctly mirror the current AF_XDP socket file
      layout. Also, add the AF_XDP files of libbpf.
      
      rfc->v1: Sorted file entries. (Joe)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Joe Perches <joe@perches.com>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-16-bjorn.topel@gmail.com
    • xsk: Explicitly inline functions and move definitions · 26062b18
      Björn Töpel authored
      
      
      In order to reduce the number of function calls, the struct
      xsk_buff_pool definition is moved to xsk_buff_pool.h. The functions
      xp_get_dma(), xp_dma_sync_for_cpu(), xp_dma_sync_for_device(),
      xp_validate_desc() and various helper functions are explicitly
      inlined.
      
      Further, move xp_get_handle() and xp_release() to xsk.c, to allow for
      the compiler to perform inlining.
      
      rfc->v1: Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-15-bjorn.topel@gmail.com
    • xdp: Simplify xdp_return_{frame, frame_rx_napi, buff} · 82c41671
      Björn Töpel authored
      
      
      The xdp_return_{frame,frame_rx_napi,buff} functions are never used,
      except in xdp_convert_zc_to_xdp_frame(), by the MEM_TYPE_XSK_BUFF_POOL
      memory type.
      
      To simplify and reduce code, change so that
      xdp_convert_zc_to_xdp_frame() calls xsk_buff_free() directly since the
      type is known, and remove MEM_TYPE_XSK_BUFF_POOL from the switch
      statement in the __xdp_return() function.
      
      Suggested-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-14-bjorn.topel@gmail.com
    • xsk: Remove MEM_TYPE_ZERO_COPY and corresponding code · 0807892e
      Björn Töpel authored
      
      
      There are no users of MEM_TYPE_ZERO_COPY. Remove all corresponding
      code, including the "handle" member of struct xdp_buff.
      
      rfc->v1: Fixed spelling in commit message. (Björn)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-13-bjorn.topel@gmail.com
    • mlx5, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL · 39d6443c
      Björn Töpel authored
      
      
      Use the new MEM_TYPE_XSK_BUFF_POOL API in lieu of MEM_TYPE_ZERO_COPY in
      mlx5e. It allows to drop a lot of code from the driver (which is now
      common in AF_XDP core and was related to XSK RX frame allocation, DMA
      mapping, etc.) and slightly improve performance (RX +0.8 Mpps, TX +0.4
      Mpps).
      
      rfc->v1: Put back the sanity check for XSK params, use XSK API to get
               the total headroom size. (Maxim)
      
      v1->v2: Fix DMA address handling, set XDP metadata to invalid. (Maxim)
      
      v2->v3: Handle frame_sz, use xsk_buff_xdp_get_frame_dma, use xsk_buff
              API for DMA sync on TX, add performance numbers. (Maxim)
      
      v3->v4: Remove unused variable num_xsk_frames. (Jakub)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-12-bjorn.topel@gmail.com
    • ixgbe, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL · 7117132b
      Björn Töpel authored
      
      
      Remove MEM_TYPE_ZERO_COPY in favor of the new MEM_TYPE_XSK_BUFF_POOL
      APIs.
      
      v1->v2: Fixed xdp_buff data_end update. (Björn)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Link: https://lore.kernel.org/bpf/20200520192103.355233-11-bjorn.topel@gmail.com
    • ice, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL · 175fc430
      Björn Töpel authored
      
      
      Remove MEM_TYPE_ZERO_COPY in favor of the new MEM_TYPE_XSK_BUFF_POOL
      APIs.
      
      v4->v5: Fixed "warning: Excess function parameter 'alloc' description
              in 'ice_alloc_rx_bufs_zc'" and "warning: Excess function
              parameter 'xdp' description in
              'ice_construct_skb_zc'". (Jakub)
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Link: https://lore.kernel.org/bpf/20200520192103.355233-10-bjorn.topel@gmail.com
    • i40e, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL · 3b4f0b66
      Björn Töpel authored
      
      
      Remove MEM_TYPE_ZERO_COPY in favor of the new MEM_TYPE_XSK_BUFF_POOL
      APIs. The AF_XDP zero-copy rx_bi ring is now simply a struct xdp_buff
      pointer.
      
      v4->v5: Fixed "warning: Excess function parameter 'bi' description in
              'i40e_construct_skb_zc'". (Jakub)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Link: https://lore.kernel.org/bpf/20200520192103.355233-9-bjorn.topel@gmail.com
    • i40e: Separate kernel allocated rx_bi rings from AF_XDP rings · be1222b5
      Björn Töpel authored
      
      
      Continuing the path to support MEM_TYPE_XSK_BUFF_POOL, the AF_XDP
      zero-copy/sk_buff rx_bi rings are now separate. Functions to properly
      allocate the different rings are added as well.
      
      v3->v4: Made i40e_fd_handle_status() static. (kbuild test robot)
      v4->v5: Fix kdoc for i40e_clean_programming_status(). (Jakub)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Link: https://lore.kernel.org/bpf/20200520192103.355233-8-bjorn.topel@gmail.com
    • i40e: Refactor rx_bi accesses · e1675f97
      Björn Töpel authored
      
      
      As a first step to migrate i40e to the new MEM_TYPE_XSK_BUFF_POOL
      APIs, code that accesses the rx_bi (SW/shadow ring) is refactored to
      use an accessor function.
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: intel-wired-lan@lists.osuosl.org
      Link: https://lore.kernel.org/bpf/20200520192103.355233-7-bjorn.topel@gmail.com
    • xsk: Introduce AF_XDP buffer allocation API · 2b43470a
      Björn Töpel authored
      
      
      In order to simplify AF_XDP zero-copy enablement for NIC driver
      developers, a new AF_XDP buffer allocation API is added. The
      implementation is based on a single core (single producer/consumer)
      buffer pool for the AF_XDP UMEM.
      
      A buffer is allocated using the xsk_buff_alloc() function, and
      returned using xsk_buff_free(). If a buffer is disassociated with the
      pool, e.g. when a buffer is passed to an AF_XDP socket, a buffer is
      said to be released. Currently, the release function is only used by
      the AF_XDP internals and not visible to the driver.
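
      The alloc/free lifecycle described above can be pictured with a toy,
      single-threaded free-list pool. This is a userspace sketch under
      stated assumptions, not the xsk_buff_pool implementation: struct
      buf_pool, POOL_SIZE, pool_alloc() and pool_free() are hypothetical
      names that only mirror the roles of xsk_buff_alloc() and
      xsk_buff_free().

```c
#include <stddef.h>

#define POOL_SIZE 4 /* hypothetical pool capacity */

struct buf {
	int id;
};

/* Single producer/consumer pool: no locking, one owner at a time. */
struct buf_pool {
	struct buf bufs[POOL_SIZE];
	struct buf *free_list[POOL_SIZE];
	size_t free_cnt;
};

static void pool_init(struct buf_pool *p)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		p->bufs[i].id = (int)i;
		p->free_list[i] = &p->bufs[i];
	}
	p->free_cnt = POOL_SIZE;
}

/* Plays the role of an allocation; returns NULL when the pool is empty. */
static struct buf *pool_alloc(struct buf_pool *p)
{
	if (p->free_cnt == 0)
		return NULL;
	return p->free_list[--p->free_cnt];
}

/* Plays the role of a free; hands the buffer back to the pool. */
static void pool_free(struct buf_pool *p, struct buf *b)
{
	if (p->free_cnt < POOL_SIZE)
		p->free_list[p->free_cnt++] = b;
}
```

      In this model a "released" buffer is simply one that was allocated
      and then handed to another owner instead of being passed back to
      pool_free().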
      
      Drivers using this API should register the XDP memory model with the
      new MEM_TYPE_XSK_BUFF_POOL type.
      
      The API is defined in net/xdp_sock_drv.h.
      
      The buffer type is struct xdp_buff, and follows the lifetime of
      regular xdp_buffs, i.e.  the lifetime of an xdp_buff is restricted to
      a NAPI context. In other words, the API is not replacing xdp_frames.
      
      In addition to introducing the API and implementations, the AF_XDP
      core is migrated to use the new APIs.
      
      rfc->v1: Fixed build errors/warnings for m68k and riscv. (kbuild test
               robot)
               Added headroom/chunk size getter. (Maxim/Björn)
      
      v1->v2: Swapped SoBs. (Maxim)
      
      v2->v3: Initialize struct xdp_buff member frame_sz. (Björn)
              Add API to query the DMA address of a frame. (Maxim)
              Do DMA sync for CPU till the end of the frame to handle
              possible growth (frame_sz). (Maxim)
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-6-bjorn.topel@gmail.com
    • xsk: Move defines only used by AF_XDP internals to xsk.h · 89e4a376
      Björn Töpel authored
      
      
      Move the XSK_NEXT_PG_CONTIG_{MASK,SHIFT}, and
      XDP_UMEM_USES_NEED_WAKEUP defines from xdp_sock.h to the AF_XDP
      internal xsk.h file. Also, start using the BIT{,_ULL} macro instead of
      explicit shifts.
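
      As a rough illustration of the BIT{,_ULL} style, the sketch below
      redefines the two macros for userspace and uses them for a flag and
      a mask. The EXAMPLE_* names and values are hypothetical and are not
      the definitions in xsk.h.

```c
/* Userspace equivalents of the kernel's BIT() and BIT_ULL() helpers. */
#define BIT(nr)     (1UL << (nr))
#define BIT_ULL(nr) (1ULL << (nr))

/* Hypothetical flag, previously written as an explicit (1 << 1). */
#define EXAMPLE_NEED_WAKEUP BIT(1)

/* Hypothetical shift/mask pair, previously (1ULL << EXAMPLE_CONTIG_SHIFT). */
#define EXAMPLE_CONTIG_SHIFT 0
#define EXAMPLE_CONTIG_MASK  BIT_ULL(EXAMPLE_CONTIG_SHIFT)
```

      One benefit of BIT_ULL() is that the operand is always 64 bits wide,
      so a bit position of 32 or more does not overflow a plain int shift.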
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-5-bjorn.topel@gmail.com
    • xsk: Move driver interface to xdp_sock_drv.h · a71506a4
      Magnus Karlsson authored
      
      
      Move the AF_XDP zero-copy driver interface to its own include file
      called xdp_sock_drv.h. This, hopefully, will make it more clear for
      NIC driver implementors to know what functions to use for zero-copy
      support.
      
      v4->v5: Fix -Wmissing-prototypes by including header file. (Jakub)
      
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-4-bjorn.topel@gmail.com
    • xsk: Move xskmap.c to net/xdp/ · d20a1676
      Björn Töpel authored
      
      
      The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
      kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
      code. Also, move AF_XDP struct definitions, and function declarations
      only used by AF_XDP internals into net/xdp/xsk.h.
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
    • xsk: Fix xsk_umem_xdp_frame_sz() · 44ac082b
      Björn Töpel authored
      To calculate the "data_hard_end" for an XDP buffer coming from AF_XDP
      zero-copy mode, the return value of xsk_umem_xdp_frame_sz() is added
      to "data_hard_start".
      
      Currently, the chunk size of the UMEM is returned by
      xsk_umem_xdp_frame_sz(). This is not correct, if the fixed UMEM
      headroom is non-zero. Fix this by returning the chunk_size without the
      UMEM headroom.
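
      A worked example of the arithmetic, as a simplified model rather than
      the kernel code. It assumes that data_hard_start sits "headroom" bytes
      into the chunk; struct umem_model and the helper names are
      hypothetical.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a UMEM; not struct xdp_umem. */
struct umem_model {
	uint32_t chunk_size; /* size of one UMEM chunk in bytes */
	uint32_t headroom;   /* fixed UMEM headroom at the chunk start */
};

/* Old behaviour: the full chunk size is reported as the frame size. */
static uint32_t frame_sz_before_fix(const struct umem_model *u)
{
	return u->chunk_size;
}

/* Fixed behaviour: the chunk size without the UMEM headroom. */
static uint32_t frame_sz_after_fix(const struct umem_model *u)
{
	return u->chunk_size - u->headroom;
}

/* Offsets are relative to the start of the chunk. */
static uint32_t data_hard_end(uint32_t data_hard_start, uint32_t frame_sz)
{
	return data_hard_start + frame_sz;
}
```

      With a 2048-byte chunk and 256 bytes of headroom, the old value puts
      data_hard_end at offset 2304, past the end of the chunk, while the
      fixed value puts it at 2048, exactly the end of the chunk.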
      
      Fixes: 2a637c5b ("xdp: For Intel AF_XDP drivers add XDP frame_sz")
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200520192103.355233-2-bjorn.topel@gmail.com
  2. May 20, 2020
    • selftests/bpf: Convert bpf_iter_test_kern{3, 4}.c to define own bpf_iter_meta · dda18a5c
      Andrii Nakryiko authored
      b9f4c01f ("selftest/bpf: Make bpf_iter selftest compilable against old vmlinux.h")
      missed the fact that bpf_iter_test_kern{3,4}.c are not just including
      bpf_iter_test_kern_common.h and need similar bpf_iter_meta re-definition
      explicitly.
      
      Fixes: b9f4c01f ("selftest/bpf: Make bpf_iter selftest compilable against old vmlinux.h")
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200519192341.134360-1-andriin@fb.com
    • selftest/bpf: Make bpf_iter selftest compilable against old vmlinux.h · b9f4c01f
      Andrii Nakryiko authored
      
      
      It's good to be able to compile bpf_iter selftest even on systems that don't
      have the very latest vmlinux.h, e.g., for libbpf tests against older kernels in
      Travis CI. To that end, re-define bpf_iter_meta and corresponding bpf_iter
      context structs in each selftest. To avoid type clashes with vmlinux.h, rename
      vmlinux.h's definitions to get them out of the way.
      
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20200518234516.3915052-1-andriin@fb.com
    • tools/bpf: sync bpf.h · fb53d3b6
      Alexei Starovoitov authored
      
      
      Sync tools/include/uapi/linux/bpf.h from include/uapi.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'getpeername' · 0e5633ac
      Alexei Starovoitov authored
      
      
      Daniel Borkmann says:
      
      ====================
      Trivial patch to add get{peer,sock}name cgroup attach types to the BPF
      sock_addr programs in order to enable rewriting sockaddr structs from
      both calls along with libbpf and bpftool support as well as selftests.
      
      Thanks!
      
      v1 -> v2:
        - use __u16 for ports in start_server_with_port() signature and in
          expected_{local,peer} ports in the test case (Andrey)
        - Added both Andrii's and Andrey's ACKs
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, testing: Add get{peer, sock}name selftests to test_progs · 566fc3f5
      Daniel Borkmann authored
      
      
      Extend the existing connect_force_port test to assert get{peer,sock}name programs
      as well. The workflow for e.g. IPv4 is as follows: i) server binds to concrete
      port, ii) client calls getsockname() on server fd which exposes 1.2.3.4:60000 to
      client, iii) client connects to service address 1.2.3.4:60000, binds to concrete
      local address (127.0.0.1:22222) and remaps service address to a concrete backend
      address (127.0.0.1:60123), iv) client then calls getsockname() on its own fd to
      verify local address (127.0.0.1:22222) and getpeername() on its own fd which then
      publishes service address (1.2.3.4:60000) instead of actual backend. Same workflow
      is done for IPv6 just with different address/port tuples.
      
        # ./test_progs -t connect_force_port
        #14 connect_force_port:OK
        Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/3343da6ad08df81af715a95d61a84fb4a960f2bf.1589841594.git.daniel@iogearbox.net
    • bpf, bpftool: Enable get{peer, sock}name attach types · 05ee19c1
      Daniel Borkmann authored
      
      
      Make bpftool aware and add the new get{peer,sock}name attach types to its
      cli, documentation and bash completion to allow attachment/detachment of
      sock_addr programs there.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/9765b3d03e4c29210c4df56a9cc7e52f5f7bb5ef.1589841594.git.daniel@iogearbox.net
    • bpf, libbpf: Enable get{peer, sock}name attach types · f15ed018
      Daniel Borkmann authored
      
      
      Trivial patch to add the new get{peer,sock}name attach types to the section
      definitions in order to hook them up to sock_addr cgroup program type.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/7fcd4b1e41a8ebb364754a5975c75a7795051bd2.1589841594.git.daniel@iogearbox.net
    • bpf: Add get{peer, sock}name attach types for sock_addr · 1b66d253
      Daniel Borkmann authored
      As stated in 983695fa ("bpf: fix unconnected udp hooks"), the objective
      for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
      transparent to applications. In Cilium we make use of these hooks [0] in
      order to enable E-W load balancing for existing Kubernetes service types
      for all Cilium managed nodes in the cluster. Those backends can be local
      or remote. The main advantage of this approach is that it operates as close
      as possible to the socket, and therefore allows to avoid packet-based NAT
      given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
      
      This also allows to expose NodePort services on loopback addresses in the
      host namespace, for example. As another advantage, this also efficiently
      blocks bind requests for applications in the host namespace for exposed
      ports. However, one missing item is that we also need to perform reverse
      xlation for inet{,6}_getname() hooks such that we can return the service
      IP/port tuple back to the application instead of the remote peer address.
      
      The vast majority of applications does not bother about getpeername(), but
      in a few occasions we've seen breakage when validating the peer's address
      since it returns unexpectedly the backend tuple instead of the service one.
      Therefore, this trivial patch allows to customise and adds a getpeername()
      as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
      to address this situation.
      
      Simple example:
      
        # ./cilium/cilium service list
        ID   Frontend     Service Type   Backend
        1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
      
      Before; curl's verbose output example, no getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      After; with getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
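
      The before/after difference can be modelled in plain C as a lookup
      from the backend tuple back to the service tuple. This is a
      deliberately simplified sketch and not the Cilium or kernel code:
      struct tuple, the services table and reverse_xlate() are
      hypothetical, and a real hook would more likely key the lookup on
      the socket than on the backend address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct tuple {
	const char *ip;
	uint16_t port;
};

struct svc_entry {
	struct tuple frontend; /* service address the application used */
	struct tuple backend;  /* address the connection was xlated to */
};

/* Hypothetical table mirroring the 'cilium service list' example. */
static const struct svc_entry services[] = {
	{ { "1.2.3.4", 80 }, { "10.0.0.10", 80 } },
};

/*
 * Model of what a getpeername() hook would report: the frontend tuple
 * if the peer is a known backend, otherwise the peer unchanged.
 */
static struct tuple reverse_xlate(struct tuple peer)
{
	for (size_t i = 0; i < sizeof(services) / sizeof(services[0]); i++) {
		const struct svc_entry *e = &services[i];

		if (e->backend.port == peer.port &&
		    strcmp(e->backend.ip, peer.ip) == 0)
			return e->frontend;
	}
	return peer;
}
```

      Passing the backend tuple 10.0.0.10:80 returns 1.2.3.4:80, which
      matches the address curl reports in the "after" output.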
      
      Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
      peer to the context similar as in inet{,6}_getname() fashion, but API-wise
      this is suboptimal as it always enforces programs having to test for ctx->peer
      which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
      Similarly, the checked return code is on tnum_range(1, 1), but if a use case
      comes up in future, it can easily be changed to return an error code instead.
      Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
      
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
  3. May 19, 2020
  4. May 16, 2020