  1. Jul 16, 2021
    • af_unix: Set TCP_ESTABLISHED for datagram sockets too · 83301b53
      Cong Wang authored
      
      
      Currently only unix stream sockets set TCP_ESTABLISHED;
      datagram sockets can set it too when they connect to their
      peer socket. At least __ip4_datagram_connect() does the same.
      
      This will be used to determine whether an AF_UNIX datagram
      socket can be redirected to in sockmap.
      
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210704190252.11866-5-xiyou.wangcong@gmail.com
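
      For illustration, a minimal sketch of the idea (assuming af_unix
      internals; this is not the verbatim kernel diff): on a successful
      datagram connect, mark the socket with the same state marker TCP
      uses, so a later sockmap check can simply test sk->sk_state.

      /* inside unix_dgram_connect(), once the peer is resolved: */
      unix_peer(sk) = other;
      sk->sk_state = TCP_ESTABLISHED; /* reset to TCP_CLOSE on disconnect */
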
    • af_unix: Implement ->read_sock() for sockmap · 29df44fa
      Cong Wang authored
      
      
      Implement ->read_sock() for AF_UNIX datagram sockets; it is
      very similar to udp_read_sock().
      
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210704190252.11866-4-xiyou.wangcong@gmail.com
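
      A hedged sketch of what such a datagram ->read_sock() can look like,
      modeled on udp_read_sock() as the commit says (locking and some error
      handling elided):

      static int unix_read_sock(struct sock *sk, read_descriptor_t *desc,
                                sk_read_actor_t recv_actor)
      {
              int copied = 0;

              while (1) {
                      struct sk_buff *skb;
                      int used, err;

                      skb = skb_recv_datagram(sk, 0, 1, &err);
                      if (!skb)
                              return err;
                      used = recv_actor(desc, skb, 0, skb->len);
                      if (used <= 0) {
                              if (!copied)
                                      copied = used;
                              kfree_skb(skb);
                              break;
                      }
                      copied += used;
                      kfree_skb(skb);
                      if (!desc->count)
                              break;
              }
              return copied;
      }
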
    • sock_map: Lift socket state restriction for datagram sockets · 0c48eefa
      Cong Wang authored
      
      
      TCP and other connection-oriented sockets have accept()
      for each incoming connection on the server side, hence
      they can just insert the fd's from accept() into sockmap,
      which are of course established.
      
      Now that datagram sockets are beginning to support sockmap and
      redirection, the restriction is no longer applicable to
      them, as they have no accept(). So we have to lift this
      restriction for them. This is fine, because inside
      bpf_sk_redirect_map() we still have another socket status
      check, sock_map_redirect_allowed(), as a guard.
      
      This also means they do not have to be removed from
      sockmap when disconnecting.
      
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com
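
      A hedged sketch of the guard being referred to (the in-tree
      implementation may differ in detail): a redirect target still has
      to be in a usable state even though sockmap insertion no longer
      checks it.

      static bool sock_map_redirect_allowed(const struct sock *sk)
      {
              if (sk_is_tcp(sk))
                      return sk->sk_state != TCP_LISTEN;
              else
                      return sk->sk_state == TCP_ESTABLISHED;
      }
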
    • sock_map: Relax config dependency to CONFIG_NET · 17edea21
      Cong Wang authored
      
      
      Currently sock_map still has a Kconfig dependency on CONFIG_INET,
      but there is no actual functional dependency on it after we
      introduced ->psock_update_sk_prot().
      
      We have to extend it to CONFIG_NET now as we are going to
      support AF_UNIX.
      
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
    • Merge branch 'Add bpf_get_func_ip helper' · 1554a080
      Alexei Starovoitov authored
      
      
      Jiri Olsa says:
      
      ====================
      Add a bpf_get_func_ip helper that returns the IP address of the
      caller function for trampoline and kprobe programs.

      There are two specific implementations of the bpf_get_func_ip
      helper, one for trampoline progs and one for kprobe/kretprobe
      progs.
      
      The trampoline helper call is replaced/inlined by the verifier
      with a simple move instruction. The kprobe/kretprobe one is an
      actual helper call that returns the prepared caller address.
      
      Also available at:
        https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
        bpf/get_func_ip
      
      v4 changes:
        - dropped jit/x86 check for get_func_ip tracing check [Alexei]
        - added code to bpf_get_func_ip_tracing [Alexei]
          and tested that it works without inlining [Alexei]
        - changed has_get_func_ip to check_get_func_ip [Andrii]
        - replaced test assert loop with explicit asserts [Andrii]
        - added bpf_program__attach_kprobe_opts function
          and use it for offset setup [Andrii]
        - used bpf_program__set_autoload(false) for test6 [Andrii]
        - added Masami's ack
      
      v3 changes:
        - resend with Masami in cc and v3 in each patch subject
      
      v2 changes:
        - use kprobe_running to get kprobe instead of cpu var [Masami]
        - added support to add kprobe on function+offset
          and test for that [Alan]
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add test for bpf_get_func_ip in kprobe+offset probe · 8237e754
      Jiri Olsa authored
      
      
      Adding a test for bpf_get_func_ip in a kprobe+offset probe.
      Because the offset value is arch specific, the new test is
      enabled only for the x86_64 architecture.
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-9-jolsa@kernel.org
    • libbpf: Allow specification of "kprobe/function+offset" · a2488b5f
      Alan Maguire authored
      
      
      kprobes can be placed on most instructions in a function, not
      just entry, and ftrace and bpftrace support the function+offset
      notation for probe placement.  Adding parsing of func_name
      into func+offset to bpf_program__attach_kprobe() allows the
      user to specify
      
      SEC("kprobe/bpf_fentry_test5+0x6")
      
      ...for example, and the offset can be passed to perf_event_open_probe()
      to support kprobe attachment.
      
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-8-jolsa@kernel.org
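
      A hedged sketch of the parsing step described (the exact libbpf
      code may differ, e.g. in the sscanf format used):

      char *func = NULL;
      long offset = 0;
      int n;

      /* split "bpf_fentry_test5+0x6" into function name and offset */
      n = sscanf(func_name, "%m[a-zA-Z0-9_.]+%li", &func, &offset);
      if (n < 1)
              return ERR_PTR(-EINVAL);        /* no function name */
      /* n == 1: plain "func"; n == 2: "func+offset" */
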
    • libbpf: Add bpf_program__attach_kprobe_opts function · ac0ed488
      Jiri Olsa authored
      
      
      Adding bpf_program__attach_kprobe_opts that does the same
      as bpf_program__attach_kprobe, but takes opts argument.
      
      Currently the opts struct holds just a retprobe bool, but we will
      add a new field in the following patch.

      The function is not exported, so there's no need to add a size
      field to struct bpf_program_attach_kprobe_opts for now.
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-7-jolsa@kernel.org
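
      A hedged sketch of the internal API shape this describes (the
      struct currently holds only the retprobe flag; the offset field
      arrives in the next patch):

      struct bpf_program_attach_kprobe_opts {
              bool retprobe;
      };

      /* the existing entry point becomes a thin wrapper: */
      struct bpf_link *bpf_program__attach_kprobe(struct bpf_program *prog,
                                                  bool retprobe,
                                                  const char *func_name)
      {
              struct bpf_program_attach_kprobe_opts opts = {
                      .retprobe = retprobe,
              };

              return bpf_program__attach_kprobe_opts(prog, func_name, &opts);
      }
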
    • selftests/bpf: Add test for bpf_get_func_ip helper · 5d8b583d
      Jiri Olsa authored
      
      
      Adding test for bpf_get_func_ip helper for fentry, fexit,
      kprobe, kretprobe and fmod_ret programs.
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-6-jolsa@kernel.org
    • bpf: Add bpf_get_func_ip helper for kprobe programs · 9ffd9f3f
      Jiri Olsa authored
      
      
      Adding bpf_get_func_ip helper for BPF_PROG_TYPE_KPROBE programs,
      so it's now possible to call bpf_get_func_ip from both kprobe and
      kretprobe programs.
      
      Taking the caller's address from 'struct kprobe::addr', which is
      defined for both kprobe and kretprobe.
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-5-jolsa@kernel.org
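
      A minimal usage sketch (the probed function is just an example):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>

      SEC("kprobe/bpf_fentry_test1")
      int BPF_KPROBE(handle_entry)
      {
              __u64 ip = bpf_get_func_ip(ctx); /* kprobe address */

              bpf_printk("kprobe hit at %llx", ip);
              return 0;
      }

      char LICENSE[] SEC("license") = "GPL";
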
    • bpf: Add bpf_get_func_ip helper for tracing programs · 9b99edca
      Jiri Olsa authored
      
      
      Adding bpf_get_func_ip helper for BPF_PROG_TYPE_TRACING programs,
      specifically for all trampoline attach types.
      
      The trampoline's caller IP address is stored at the (ctx - 8)
      address, so there's no reason to actually call the helper; instead,
      fix up the call instruction and return the [ctx - 8] value directly.
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-4-jolsa@kernel.org
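
      A minimal usage sketch; per the commit, the verifier rewrites the
      helper call into a load of the ip the trampoline stored at ctx - 8:

      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(handle_fentry, int a)
      {
              __u64 ip = bpf_get_func_ip(ctx);
              /* effectively: ip = *(u64 *)(ctx - 8); */

              bpf_printk("traced function at %llx", ip);
              return 0;
      }
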
    • bpf: Enable BPF_TRAMP_F_IP_ARG for trampolines with call_get_func_ip · 1e37392c
      Jiri Olsa authored
      
      
      Enabling BPF_TRAMP_F_IP_ARG for trampolines that actually need it.
      
      The BPF_TRAMP_F_IP_ARG flag adds 3 extra instructions to the trampoline
      code and is used only by programs with the bpf_get_func_ip helper,
      which is added in the following patch and sets the call_get_func_ip bit.
      
      This patch ensures that BPF_TRAMP_F_IP_ARG flag is used only for
      trampolines that have programs with call_get_func_ip set.
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-3-jolsa@kernel.org
    • bpf, x86: Store caller's ip in trampoline stack · 7e6f3cd8
      Jiri Olsa authored
      
      
      Store the caller's ip on the trampoline's stack. Trampoline programs
      can reach the IP at the (ctx - 8) address, so there's no change in
      the program's arguments interface.

      The IP address is taken from [fp + 8], which is the return address
      from the initial 'call fentry' call into the trampoline.

      This IP address will be returned via the bpf_get_func_ip helper,
      which is added in the following patches.
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-2-jolsa@kernel.org
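
      A simplified sketch of the layout being described (x86-64; only
      the slots relevant here are shown):

      /*
       * traced_func:
       *   call trampoline          pushes the return address, so inside
       *                            the trampoline [fp + 8] = caller's ip
       *
       * trampoline stack:
       *   [fp + 8]  return address into traced_func (the ip we want)
       *   [fp]      saved frame pointer
       *   ...       ip slot the trampoline fills in; BPF progs read it
       *             as *(u64 *)(ctx - 8), ctx being the args array
       */
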
    • Merge branch 'bpf-timers' · 76283171
      Daniel Borkmann authored
      
      
      Alexei Starovoitov says:
      
      ====================
      The first request to support timers in bpf was made in 2013, before the
      sys_bpf syscall was added. That use case was periodic sampling. It was
      addressed by attaching bpf programs to perf_events. Then during XDP
      development timers were requested to do garbage collection and health
      checks. They were worked around by implementing timers in user space
      and triggering progs with the BPF_PROG_RUN command. The user space
      timers and perf_event+bpf timers are not armed by the bpf program.
      They're done asynchronously vs program execution. The XDP program
      cannot send a packet and arm the timer at the same time. The tracing
      prog cannot record an event and arm the timer right away. This large
      class of use cases remained unaddressed. The jiffy based and hrtimer
      based timers are an essential part of kernel development, and with
      this patch set the hrtimer based timers will be available to bpf
      programs.
      
      TLDR: bpf timers are a wrapper around hrtimers with all the extra
      safety added to make sure bpf progs cannot crash the kernel.
      
      v6->v7:
      - address Andrii's comments and add his Acks.
      
      v5->v6:
      - address code review feedback from Martin and add his Acks.
      - add usercnt > 0 check to bpf_timer_init and remove timers_cancel_and_free
      second loop in map_free callbacks.
      - add cond_resched_rcu.
      
      v4->v5:
      - Martin noticed the following issues:
      . prog could be reallocated by bpf_patch_insn_data().
      Fixed by passing 'aux' into bpf_timer_set_callback, since 'aux' is stable
      during insn patching.
      . Added missing rcu_read_lock.
      . Removed redundant record_map.
      - Discovered few bugs with stress testing:
      . One cpu does htab_free_prealloced_timers->bpf_timer_cancel_and_free->hrtimer_cancel
      while another is trying to do something with the timer like bpf_timer_start/set_callback.
      Those ops try to acquire bpf_spin_lock that is already taken by bpf_timer_cancel_and_free,
      so both cpus spin forever. The same problem existed in bpf_timer_cancel().
      One bpf prog on one cpu might call bpf_timer_cancel and wait, while another cpu is in
      the timer callback that tries to do bpf_timer_*() helper on the same timer.
      The fix is to do drop_prog_refcnt() and unlock. And only then hrtimer_cancel.
      Because of this had to add callback_fn != NULL check to bpf_timer_cb().
      Also removed redundant bpf_prog_inc/put from bpf_timer_cb() and replaced
      with rcu_dereference_check similar to recent rcu_read_lock-removal from drivers.
      bpf_timer_cb is in softirq.
      . Managed to hit refcnt==0 while doing bpf_prog_put from bpf_timer_cancel_and_free().
      That exposed the issue that bpf_prog_put wasn't ready to be called from irq context.
      Fixed similar to bpf_map_put which is irq ready.
      - Refactored BPF_CALL_1(bpf_spin_lock) into __bpf_spin_lock_irqsave() to
      make the main logic more clear, since Martin and Yonghong brought up this concern.
      
      v3->v4:
      1.
      Split callback_fn from bpf_timer_start into bpf_timer_set_callback as
      suggested by Martin. That makes bpf timer api match one to one to
      kernel hrtimer api and provides greater flexibility.
      2.
      Martin also discovered the following issue with uref approach:
      bpftool prog load xdp_timer.o /sys/fs/bpf/xdp_timer type xdp
      bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_timer dev lo
      rm /sys/fs/bpf/xdp_timer
      nc -6 ::1 8888
      bpftool net detach xdpgeneric dev lo
      The timer callback stays active in the kernel though the prog was detached
      and map usercnt == 0.
      It happened because 'bpftool prog load' pinned the prog only.
      The map usercnt went to zero. Subsequent attach and runs didn't
      affect map usercnt. The timer was able to start and bpf_prog_inc itself.
      When the prog was detached the prog stayed active.
      To address this issue added
      if (!atomic64_read(&(t->map->usercnt))) return -EPERM;
      to the first patch.
      Which means that timers are allowed only in the maps that are held
      by user space with open file descriptor or maps pinned in bpffs.
      3.
      Discovered that timers in inner maps were broken.
      The inner map pointers are dynamic. Therefore changed bpf_timer_init()
      to accept explicit map pointer supplied by the program instead
      of hidden map pointer supplied by the verifier.
      To make sure that pointer to a timer actually belongs to that map
      added the verifier check in patch 3.
      4.
      Addressed Yonghong's feedback. Improved comments and added
      dynamic in_nmi() check.
      Added Acks.
      
      v2->v3:
      The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
      called to make sure callback code doesn't disappear when timer is active and
      drop refcnt when timer cb is done. That led to a ton of race conditions between
      callback running and concurrent bpf_timer_init/start/cancel on another cpu,
      and concurrent bpf_map_update/delete_elem, and map destroy.
      
      Then the v2.5 approach skipped prog refcnt altogether. Instead it
      remembered all timers that a bpf prog armed in a linked list and
      canceled them when the prog refcnt went to zero. The race conditions
      disappeared, but timers in map-in-map could not be supported cleanly,
      since timers in inner maps have the inner map's lifetime, which
      doesn't match the prog's lifetime.
      
      This v3 approach makes timers be owned by maps. It allows timers in
      inner maps to be supported from the start. This approach relies on the
      "user refcnt" scheme used in prog_array that stores bpf programs for
      bpf_tail_call. The bpf_timer_start() increments the prog refcnt, but
      unlike the 1st approach the timer callback doesn't decrement the
      refcnt. The ops->map_release_uref is responsible for cancelling the
      timers and dropping the prog refcnt when the user space reference to a
      map is dropped. That addressed all the races and simplified locking.
      
      Andrii presented a use case where specifying callback_fn in
      bpf_timer_init() is inconvenient vs specifying it in bpf_timer_start().
      The bpf_timer_init() is typically called outside of the timer callback,
      while bpf_timer_start() most likely will be called from the callback.
      timer_cb() { ... bpf_timer_start(timer_cb); ...} looks like recursion
      and an infinite loop to the verifier. The verifier had to be made
      smarter to recognize such async callbacks. Patches 7,8,9 addressed that.
      
      Patches 1 and 2 are refactoring.
      Patch 3 implements bpf timer helpers and locking.
      Patch 4 implements the map side of bpf timer support.
      Patch 5 prevents pointer mismatch in bpf_timer_init.
      Patch 6 adds support for BTF in inner maps.
      Patch 7 teaches the check_cfg() pass to understand async callbacks.
      Patch 8 teaches the do_check() pass to understand async callbacks.
      Patch 9 teaches the check_max_stack_depth() pass to understand async callbacks.
      Patches 10 and 11 are the tests.
      
      v1->v2:
      - Addressed great feedback from Andrii and Toke.
      - Fixed race between parallel bpf_timer_*() ops.
      - Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
      - Disallowed mmap and global timers.
      - Allow spin_lock and bpf_timer in an element.
      - Fixed memory leaks due to map destruction and LRU eviction.
      - A ton more tests.
      ====================
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Add a test with bpf_timer in inner map. · 61f71e74
      Alexei Starovoitov authored
      
      
      Check that map-in-map supports bpf timers.
      
      Check that indirect "recursion" of timer callbacks works:
      timer_cb1() { bpf_timer_set_callback(timer_cb2); }
      timer_cb2() { bpf_timer_set_callback(timer_cb1); }
      
      Check that
        bpf_map_release
          htab_free_prealloced_timers
            bpf_timer_cancel_and_free
              hrtimer_cancel
      works while timer cb is running.
      "while true; do ./test_progs -t timer_mim; done"
      is a great stress test. It caught missing timer cancel in htab->extra_elems.
      
      timer_mim_reject.c is a negative test that checks
      that timer<->map mismatch is prevented.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-12-alexei.starovoitov@gmail.com
    • selftests/bpf: Add bpf_timer test. · 3540f7c6
      Alexei Starovoitov authored
      
      
      Add bpf_timer test that creates timers in preallocated and
      non-preallocated hash, in array and in lru maps.
      Let array timer expire once and then re-arm it for 35 seconds.
      Arm lru timer into the same callback.
      Then arm and re-arm hash timers 10 times each.
      At the last invocation of prealloc hash timer cancel the array timer.
      Force timer free via LRU eviction and direct bpf_map_delete_elem.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-11-alexei.starovoitov@gmail.com
    • bpf: Teach stack depth check about async callbacks. · 7ddc80a4
      Alexei Starovoitov authored
      
      
      Teach max stack depth checking algorithm about async callbacks
      that don't increase bpf program stack size.
      Also add sanity check that bpf_tail_call didn't sneak into async cb.
      It's impossible, since PTR_TO_CTX is not available in async cb,
      hence the program cannot contain bpf_tail_call(ctx,...);
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
    • bpf: Implement verifier support for validation of async callbacks. · bfc6bb74
      Alexei Starovoitov authored
      
      
      bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
      PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
      and pass them into the helpers as function callbacks.
      In case of bpf_for_each_map_elem() the callback is invoked synchronously
      and the verifier treats it as a normal subprogram call by adding another
      bpf_func_state and new frame in __check_func_call().
      bpf_timer_set_callback() doesn't invoke the callback directly.
      The subprogram will be called asynchronously from bpf_timer_cb().
      Teach the verifier to validate such async callbacks as special kind
      of jump by pushing verifier state into stack and let pop_stack() process it.
      
      Special care needs to be taken during state pruning.
      The call insn doing bpf_timer_set_callback has to be a prune_point.
      Otherwise short timer callbacks might not have prune points in front of
      bpf_timer_set_callback(), which means is_state_visited() will be called
      after this call insn is processed in __check_func_call(), which means
      that another async_cb state will be pushed to be walked later and the
      verifier will eventually hit the BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
      Since push_async_cb() looks like another push_stack() branch, the
      infinite loop detection will trigger a false positive. To recognize
      this case, mark such states as in_async_callback_fn.
      To distinguish an infinite loop in an async callback from the same
      callback called with different arguments for a different map and
      timer, add async_entry_cnt to bpf_func_state.
      
      Enforce return zero from async callbacks.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
    • bpf: Relax verifier recursion check. · 86fc6ee6
      Alexei Starovoitov authored
      
      
      In the following bpf subprogram:
      static int timer_cb(void *map, void *key, void *value)
      {
          bpf_timer_set_callback(.., timer_cb);
      }
      
      the 'timer_cb' is a pointer to a function.
      ld_imm64 insn is used to carry this pointer.
      bpf_pseudo_func() returns true for such ld_imm64 insn.
      
      Unlike bpf_for_each_map_elem(), the bpf_timer_set_callback() is
      asynchronous. Relax the control flow check to allow such "recursion",
      which is seen as an infinite loop by check_cfg(). The distinction
      between bpf_for_each_map_elem() and bpf_timer_set_callback() is done
      in the follow up patch.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-8-alexei.starovoitov@gmail.com
    • bpf: Remember BTF of inner maps. · 40ec00ab
      Alexei Starovoitov authored
      
      
      BTF is required for 'struct bpf_timer' to be recognized inside map value.
      The bpf timers are supported inside inner maps.
      Remember 'struct btf *' in inner_map_meta to make it available
      to the verifier in the sequence:
      
      struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
      if (inner_map)
          timer = bpf_map_lookup_elem(&inner_map, ...);
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-7-alexei.starovoitov@gmail.com
    • bpf: Prevent pointer mismatch in bpf_timer_init. · 3e8ce298
      Alexei Starovoitov authored
      
      
      bpf_timer_init() arguments are:
      1. pointer to a timer (which is embedded in map element).
      2. pointer to a map.
      Make sure that pointer to a timer actually belongs to that map.
      
      Use map_uid (which is unique id of inner map) to reject:
      inner_map1 = bpf_map_lookup_elem(outer_map, key1)
      inner_map2 = bpf_map_lookup_elem(outer_map, key2)
      if (inner_map1 && inner_map2) {
          timer = bpf_map_lookup_elem(inner_map1);
          if (timer)
              // mismatch would have been allowed
              bpf_timer_init(timer, inner_map2);
      }
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
    • bpf: Add map side support for bpf timers. · 68134668
      Alexei Starovoitov authored
      
      
      Restrict bpf timers to array, hash (both preallocated and kmalloced),
      and lru map types. Per-cpu maps with timers don't make sense, since
      'struct bpf_timer' is a part of the map value. bpf timers in per-cpu
      maps would mean that the number of timers depends on the number of
      possible cpus and the timers would not be accessible from all cpus.
      lpm map support can be added in the future.
      The timers in inner maps are supported.
      
      The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
      bpf_timer in a given map element.
      
      Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
      that map element indeed contains 'struct bpf_timer'.
      
      Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
      map element data is reused in preallocated htab and lru maps.
      
      Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a
      single map element. There can be one of each, but not more than one.
      Due to the 'one bpf_timer in one element' restriction do not support
      timers in global data, since global data is a map with a single
      element, but from the bpf program side it's seen as many global
      variables, and a restriction of a single global timer would be odd.
      The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps
      with timers, since user space could have corrupted the mmaped element
      and crashed the kernel. Maps with timers cannot be read-only. Due to
      these restrictions, search for bpf_timer in the datasec BTF in case it
      was placed in global data, to report a clear error.
      
      The previous patch allowed 'struct bpf_timer' as a first field in a map
      element only. Relax this restriction.
      
      Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
      the timer when lru map deletes an element as a part of it eviction algorithm.
      
      Make sure that a bpf program cannot access 'struct bpf_timer' via
      direct load/store. The timer operations are done through helpers only.
      This is similar to 'struct bpf_spin_lock'.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
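
      For illustration, a map value that is allowed under these rules (at
      most one bpf_spin_lock and one bpf_timer per element):

      struct map_elem {
              struct bpf_spin_lock lock;  /* at most one per element */
              int payload;
              struct bpf_timer timer;     /* at most one per element; no
                                           * longer has to be the first
                                           * field */
      };
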
    • bpf: Introduce bpf timers. · b00628b1
      Alexei Starovoitov authored
      
      
      Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
      in hash/array/lru maps as a regular field and helpers to operate on it:
      
      // Initialize the timer.
      // First 4 bits of 'flags' specify clockid.
      // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
      long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);
      
      // Configure the timer to call 'callback_fn' static function.
      long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);
      
      // Arm the timer to expire 'nsec' nanoseconds from the current time.
      long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);
      
      // Cancel the timer and wait for callback_fn to finish if it was running.
      long bpf_timer_cancel(struct bpf_timer *timer);
      
      Here is how a BPF program might look:
      struct map_elem {
          int counter;
          struct bpf_timer timer;
      };
      
      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1000);
          __type(key, int);
          __type(value, struct map_elem);
      } hmap SEC(".maps");
      
      static int timer_cb(void *map, int *key, struct map_elem *val);
      /* val points to particular map element that contains bpf_timer. */
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(test1, int a)
      {
          struct map_elem *val;
          int key = 0;
      
          val = bpf_map_lookup_elem(&hmap, &key);
          if (val) {
              bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
              bpf_timer_set_callback(&val->timer, timer_cb);
              bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */, 0);
          }
      }
      
      This patch adds helper implementations that rely on hrtimers
      to call bpf functions as timers expire.
      The following patches add necessary safety checks.
      
      Only programs with CAP_BPF are allowed to use bpf_timer.
      
      The amount of timers used by the program is constrained by
      the memcg recorded at map creation time.
      
      The bpf_timer_init() helper needs an explicit 'map' argument because
      inner maps are dynamic and not known at load time, while
      bpf_timer_set_callback() receives a hidden 'aux->prog' argument
      supplied by the verifier.
      
      The prog pointer is needed to do refcnting of bpf program to make sure that
      program doesn't get freed while the timer is armed. This approach relies on
      "user refcnt" scheme used in prog_array that stores bpf programs for
      bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
      paired with bpf_timer_cancel() that will drop the prog refcnt. The
      ops->map_release_uref is responsible for cancelling the timers and dropping
      prog refcnt when user space reference to a map reaches zero.
      This uref approach is done to make sure that Ctrl-C of user space process will
      not leave timers running forever unless the user space explicitly pinned a map
      that contained timers in bpffs.
      
      bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
      have user references (is not held by open file descriptor from user space and
      not pinned in bpffs).
      
      The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
      and free the timer if given map element had it allocated.
      "bpftool map update" command can be used to cancel timers.
      
      The 'struct bpf_timer' is explicitly __attribute__((aligned(8)))
      because an unnamed '__u64 :64' bitfield provides 8 bytes of padding
      but only 1-byte alignment.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
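
      A sketch of a callback matching the example above (the re-arm policy
      is made up for illustration; async callbacks must return 0):

      static int timer_cb(void *map, int *key, struct map_elem *val)
      {
              val->counter++;
              if (val->counter < 2)
                      /* re-arm once more, 1 usec out */
                      bpf_timer_start(&val->timer, 1000, 0);
              return 0;
      }
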
    • bpf: Factor out bpf_spin_lock into helpers. · c1b3fed3
      Alexei Starovoitov authored
      
      
      Move ____bpf_spin_lock/unlock into helpers to make it clearer that the
      quadruple underscore bpf_spin_lock/unlock are the irqsave/restore
      variants.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-3-alexei.starovoitov@gmail.com
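
      A hedged sketch of the resulting shape (reconstructed from the
      commit description, not the verbatim diff):

      static inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock)
      {
              unsigned long flags;

              local_irq_save(flags);
              __bpf_spin_lock(lock);
              __this_cpu_write(irqsave_flags, flags);
      }

      BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
      {
              __bpf_spin_lock_irqsave(lock);
              return 0;
      }
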
    • bpf: Prepare bpf_prog_put() to be called from irq context. · d809e134
      Alexei Starovoitov authored
      
      
      Currently bpf_prog_put() is called from task context only.
      With the addition of bpf timers the timer related helpers will start
      calling bpf_prog_put() from irq-saved regions and in rare cases might
      drop the refcnt to zero.
      To address this case, first, convert bpf_prog_free_id() to be irq-save
      (this is similar to bpf_map_free_id), and, second, defer the calls
      that are not appropriate in irq context into a work queue.
      For example:
      bpf_audit_prog() calls kmalloc and wake_up_interruptible, and
      bpf_prog_kallsyms_del_all()->bpf_ksym_del() uses spin_unlock_bh().
      They are not safe with irqs disabled.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-2-alexei.starovoitov@gmail.com
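
      A hedged sketch of the defer pattern described (function names are
      illustrative, not the exact kernel code):

      static void __bpf_prog_put(struct bpf_prog *prog)
      {
              if (atomic64_dec_and_test(&prog->aux->refcnt)) {
                      if (in_irq() || irqs_disabled())
                              /* audit/kallsyms cleanup uses kmalloc,
                               * wake_up_interruptible and spin_unlock_bh;
                               * punt it to process context */
                              schedule_work(&prog->aux->work);
                      else
                              bpf_prog_put_deferred(&prog->aux->work);
              }
      }
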
    • selftests/bpf: Remove unused variable in tc_tunnel prog · de587d56
      Tobias Klauser authored
      The variable buf is unused since commit 005edd16 ("selftests/bpf:
      convert bpf tunnel test to BPF_ADJ_ROOM_MAC"). Remove it to fix the
      following warning:
      
          test_tc_tunnel.c:531:7: warning: unused variable 'buf' [-Wunused-variable]
      
      Fixes: 005edd16 ("selftests/bpf: convert bpf tunnel test to BPF_ADJ_ROOM_MAC")
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210713102719.8890-1-tklauser@distanz.ch
  2. Jul 15, 2021
  3. Jul 13, 2021
    • libbpf: Fix reuse of pinned map on older kernel · 97eb3138
      Martynas Pumputis authored
      
      
      When loading a BPF program with a pinned map, the loader checks
      whether the pinned map can be reused, i.e. whether their properties
      match. To derive the properties of the pinned map, the loader invokes
      BPF_OBJ_GET_INFO_BY_FD and then does the comparison.
      
      Unfortunately, on kernels < 4.12 BPF_OBJ_GET_INFO_BY_FD is not
      available, so loading the program fails with the following error:
      
      	libbpf: failed to get map info for map FD 5: Invalid argument
      	libbpf: couldn't reuse pinned map at
      		'/sys/fs/bpf/tc/globals/cilium_call_policy': parameter
      		mismatch"
      	libbpf: map 'cilium_call_policy': error reusing pinned map
      	libbpf: map 'cilium_call_policy': failed to create:
      		Invalid argument(-22)
      	libbpf: failed to load object 'bpf_overlay.o'
      
      To fix this, fall back to deriving the map properties via
      /proc/$PID/fdinfo/$MAP_FD if BPF_OBJ_GET_INFO_BY_FD fails with EINVAL,
      which can be used as an indicator that the kernel doesn't support
      the latter.
      
      Signed-off-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210712125552.58705-1-m@lambda.lt
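
      A hedged sketch of the fallback (field names follow the bpf fdinfo
      format; parsing is simplified):

      #include <errno.h>
      #include <stdio.h>
      #include <unistd.h>

      static int map_info_from_fdinfo(int fd, __u32 *type, __u32 *key_size,
                                      __u32 *value_size, __u32 *max_entries)
      {
              char file[64], buf[128];
              FILE *fp;

              snprintf(file, sizeof(file), "/proc/%d/fdinfo/%d", getpid(), fd);
              fp = fopen(file, "r");
              if (!fp)
                      return -errno;
              while (fgets(buf, sizeof(buf), fp)) {
                      sscanf(buf, "map_type:\t%u", type);
                      sscanf(buf, "key_size:\t%u", key_size);
                      sscanf(buf, "value_size:\t%u", value_size);
                      sscanf(buf, "max_entries:\t%u", max_entries);
              }
              fclose(fp);
              return 0;
      }
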
  4. Jul 08, 2021
    • samples/bpf: xdp_redirect_cpu_user: Cpumap qsize set larger default · eff94154
      Jesper Dangaard Brouer authored
      
      
      Experience from production shows a queue size of 192 is too small, as
      this caused packet drops during cpumap-enqueue on the RX-CPU.  This
      can be diagnosed with the xdp_monitor sample program.
      
      This bpftrace program was used to diagnose the problem in more detail:
      
       bpftrace -e '
        tracepoint:xdp:xdp_cpumap_kthread { @deq_bulk = lhist(args->processed,0,10,1); @drop_net = lhist(args->drops,0,10,1) }
        tracepoint:xdp:xdp_cpumap_enqueue { @enq_bulk = lhist(args->processed,0,10,1); @enq_drops = lhist(args->drops,0,10,1); }'
      
      Watch out for the @enq_drops counter. The @drop_net counter can
      increase when the netstack gets invalid packets, so don't despair, it
      can be natural, and that counter will likely disappear in newer
      kernels as it was a source of confusion (look at the netstat info for
      the reason behind the netstack @drop_net counters).
      
      The production system was configured with CPU power-saving C6 state.
      Learn more in this blogpost[1].
      
      And wakeup latency in usec for the states are:
      
       # grep -H . /sys/devices/system/cpu/cpu0/cpuidle/*/latency
       /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
       /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
       /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
       /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:133
      
      The deepest state takes 133 usec to wake up from (133/10^6 sec). The
      link speed is 25Gbit/s ((25*10^9/8) bytes/sec). How many bytes can
      arrive within 133 usec at this speed: (25*10^9/8)*(133/10^6) = 415625
      bytes. With MTU-size packets this is 275 packets, and with
      minimum-size Ethernet frames (incl. interframe gap overhead) of 84
      bytes it is 4948 packets. Clearly the default queue size is too small.

      Set the default cpumap queue to 2048: the worst case (small packets)
      at 10Gbit/s is 1979 packets within the 133 usec wakeup time, plus 64
      packets before the kthread wakeup call (due to xdp_do_flush), for a
      worst case of 2043 packets.
      
      Thus, if a packet burst on the RX-CPU enqueues packets to a remote
      cpumap CPU that is in a deep-sleep state, it can overrun the cpumap
      queue.
      
      The production system was also configured to avoid deep-sleep via:
       tuned-adm profile network-latency
      
      [1] https://jeremyeder.com/2013/08/30/oh-did-you-expect-the-cpu/
      
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/162523477604.786243.13372630844944530891.stgit@firesoul
    • Merge branch 'Generic XDP improvements' · e0bc8927
      Alexei Starovoitov authored
      
      
      Kumar Kartikeya says:
      
      ====================
      
      This small series makes some improvements to generic XDP mode and
      brings it closer to native XDP. Patch 1 splits out generic XDP
      processing into reusable parts, patch 2 adds pointer friendly wrappers
      for bitops (to avoid having to cast the address of a local pointer
      back and forth to unsigned long *), patch 3 implements generic cpumap
      support (details in the commit) and patch 4 allows devmap bpf prog
      execution before generic_xdp_tx is called.

      Patch 5 just updates a couple of selftests to adapt to the changes in
      behavior (in that specifying a devmap/cpumap prog fd in generic mode
      is now allowed).
      
      Changelog:
      ----------
      v5 -> v6
      v5: https://lore.kernel.org/bpf/20210701002759.381983-1-memxor@gmail.com
       * Put rcpu->prog check before RCU-bh section to avoid do_softirq (Jesper)
      
      v4 -> v5
      v4: https://lore.kernel.org/bpf/20210628114746.129669-1-memxor@gmail.com
       * Add comments and examples for new bitops macros (Alexei)
      
      v3 -> v4
      v3: https://lore.kernel.org/bpf/20210622202835.1151230-1-memxor@gmail.com
       * Add detach now that attach of XDP program succeeds (Toke)
       * Clean up the test to use new ASSERT macros
      
      v2 -> v3
      v2: https://lore.kernel.org/bpf/20210622195527.1110497-1-memxor@gmail.com
       * list_for_each_entry -> list_for_each_entry_safe (due to deletion of skb)
      
      v1 -> v2
      v1: https://lore.kernel.org/bpf/20210620233200.855534-1-memxor@gmail.com
       * Move __ptr_{set,clear,test}_bit to bitops.h (Toke)
         Also changed argument order to match the bit op they wrap.
       * Remove map value size checking functions for cpumap/devmap (Toke)
       * Rework prog run for skb in cpu_map_kthread_run (Toke)
       * Set skb->dev to dst->dev after devmap prog has run
       * Don't set xdp rxq that will be overwritten in cpumap prog run
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Tidy xdp attach selftests · 36246d5a
      Kumar Kartikeya Dwivedi authored
      
      
      Support for cpumap and devmap entry progs in previous commits means the
      test needs to be updated for the new semantics. Also take this
      opportunity to convert it from CHECK macros to the new ASSERT macros.
      
      Since xdp_cpumap_attach has no subtest, put the sole test inside the
      test_xdp_cpumap_attach function.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-6-memxor@gmail.com
    • bpf: devmap: Implement devmap prog execution for generic XDP · 2ea5eaba
      Kumar Kartikeya Dwivedi authored
      
      
      This lifts the restriction on running devmap BPF progs in generic
      redirect mode. To match native XDP behavior, the prog is invoked right
      before generic_xdp_tx is called, and only the XDP_PASS/XDP_ABORTED/
      XDP_DROP actions are supported.

      We also return 0 even if the devmap program drops the packet, as
      semantically the redirect has already succeeded and the devmap prog is
      the last point before TX of the packet to the device, where it can
      deliver a verdict on the packet.

      This also means it must take care of freeing the skb, as
      xdp_do_generic_redirect callers only do that in case an error is
      returned.

      Since devmap entry progs are now supported, remove the check in
      generic_xdp_install entirely.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com
    • bpf: cpumap: Implement generic cpumap · 11941f8a
      Kumar Kartikeya Dwivedi authored
      
      
      This change implements CPUMAP redirect support for generic XDP programs.
      The idea is to reuse the cpu map entry's queue that is used to push
      native xdp frames for redirecting skb to a different CPU. This will
      match native XDP behavior (in that RPS is invoked again for packet
      reinjected into networking stack).
      
      To be able to determine whether the incoming skb is from the driver or
      cpumap, we reuse skb->redirected bit that skips generic XDP processing
      when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
      it has been lifted and it is always available.
      
      From the redirect side, we add the skb to the ptr_ring with its lowest
      bit set to 1.  This should be safe as the skb is not 1-byte aligned.
      This allows the kthread to discern between xdp_frames and sk_buffs. On
      consumption of the ptr_ring item, the lowest bit is unset.
      
      In the end, the skb is simply added to the list that kthread is anyway
      going to maintain for xdp_frames converted to skb, and then received
      again by using netif_receive_skb_list.
      
      Bulking optimization for generic cpumap is left as an exercise for a
      future patch for now.
      
      Since cpumap entry progs are now supported, also remove check in
      generic_xdp_install for the cpumap.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
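
      A hedged sketch of the tagging scheme described (helper names are
      illustrative):

      /* skbs enter the ring with bit 0 set; xdp_frames keep it clear */
      static void *cpu_map_skb_to_ptr(struct sk_buff *skb)
      {
              return (void *)((unsigned long)skb | 1UL);
      }

      static bool cpu_map_ptr_is_skb(void *ptr)
      {
              return (unsigned long)ptr & 1UL;
      }

      static struct sk_buff *cpu_map_ptr_to_skb(void *ptr)
      {
              return (struct sk_buff *)((unsigned long)ptr & ~1UL);
      }
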
    • bitops: Add non-atomic bitops for pointers · cb0f8003
      Kumar Kartikeya Dwivedi authored
      
      
      cpumap needs to set, clear, and test the lowest bit in the skb pointer
      in various places. To make these checks less noisy, add pointer
      friendly bitop macros that also do some typechecking to sanitize the
      argument.

      These wrap the non-atomic bitops __set_bit, __clear_bit, and test_bit
      but for pointer arguments. The pointer's address has to be passed in,
      and it is treated as an unsigned long *, since the width and
      representation of pointers and unsigned long match on the targets
      Linux supports. They are prefixed with a double underscore to indicate
      the lack of atomicity.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-3-memxor@gmail.com
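
      A hedged sketch of what one such wrapper looks like (the in-tree
      macro may differ slightly):

      #define __ptr_set_bit(nr, addr)                          \
              ({                                               \
                      typecheck_pointer(*(addr));              \
                      __set_bit(nr, (unsigned long *)(addr));  \
              })

      /* usage: struct sk_buff *skb = ...; __ptr_set_bit(0, &skb); */
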
    • net: core: Split out code to run generic XDP prog · fe21cb91
      Kumar Kartikeya Dwivedi authored
      
      
      This helper can later be utilized in code that runs cpumap and devmap
      programs in generic redirect mode, to adjust the skb based on changes
      made to the xdp_buff.

      When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever
      a generic redirect path invokes a devmap/cpumap prog (if set), it must
      __skb_pull again, as we expect the mac header to be pulled.

      It also drops the skb_reset_mac_len call after do_xdp_generic, as the
      mac_header and network_header are advanced by the same offset, so the
      difference (mac_len) remains constant.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
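
      A hedged sketch of how a generic-redirect caller uses the split-out
      helper (the helper name follows the commit; surrounding code is
      illustrative):

      u32 act = bpf_prog_run_generic_xdp(skb, &xdp, xdp_prog);

      if (act == XDP_REDIRECT || act == XDP_TX)
              /* the helper did __skb_push(skb, mac_len); a devmap/cpumap
               * prog caller expects the mac header pulled, so undo it */
              __skb_pull(skb, skb->mac_len);
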
    • Merge branch 'bpf: support input xdp_md context in BPF_PROG_TEST_RUN' · a080cdcc
      Alexei Starovoitov authored
      
      
      Zvi Effron says:
      
      ====================
      
      This patchset adds support for passing an xdp_md via ctx_in/ctx_out in
      bpf_attr for BPF_PROG_TEST_RUN of XDP programs.
      
      Patch 1 adds a function to validate XDP meta data lengths.
      
      Patch 2 adds initial support for passing XDP meta data in addition to
      packet data.
      
      Patch 3 adds support for also specifying the ingress interface and
      rx queue.
      
      Patch 4 adds selftests to ensure functionality is correct.
      
      Changelog:
      ----------
      v7->v8
      v7: https://lore.kernel.org/bpf/20210624211304.90807-1-zeffron@riotgames.com/
      
       * Fix too long comment line in patch 3
      
      v6->v7
      v6: https://lore.kernel.org/bpf/20210617232904.1899-1-zeffron@riotgames.com/
      
       * Add Yonghong Song's Acked-by to commit message in patch 1
       * Add Yonghong Song's Acked-by to commit message in patch 2
       * Extracted the post-update of the xdp_md context into a function (again)
       * Validate that the rx queue was registered with XDP info
       * Decrement the reference count on a found netdevice on failure to find
        a valid rx queue
       * Decrement the reference count on a found netdevice after the XDP
        program is run
       * Drop Yonghong Song's Acked-By for patch 3 because of patch changes
       * Improve a comment in the selftests
       * Drop Yonghong Song's Acked-By for patch 4 because of patch changes
      
      v5->v6
      v5: https://lore.kernel.org/bpf/20210616224712.3243-1-zeffron@riotgames.com/
      
       * Correct commit messages in patches 1 and 3
       * Add Acked-by to commit message in patch 4
       * Use gotos instead of returns to correctly free resources in
        bpf_prog_test_run_xdp
       * Rename xdp_metalen_valid to xdp_metalen_invalid
       * Improve the function signature for xdp_metalen_invalid
       * Merged declaration of ingress_ifindex and rx_queue_index into one line
      
      v4->v5
      v4: https://lore.kernel.org/bpf/20210604220235.6758-1-zeffron@riotgames.com/
      
       * Add new patch to introduce xdp_metalen_valid inline function to avoid
        duplicated code from net/core/filter.c
       * Correct size of bad_ctx in selftests
       * Make all declarations reverse Christmas tree
       * Move data check from xdp_convert_md_to_buff to bpf_prog_test_run_xdp
       * Merge xdp_convert_buff_to_md into bpf_prog_test_run_xdp
       * Fix line too long
       * Extracted common checks in selftests to a helper function
       * Removed redundant assignment in selftests
       * Reordered test cases in selftests
       * Check data against 0 instead of data_meta in selftests
       * Made selftests use EINVAL instead of hardcoded 22
       * Dropped "_" from XDP function name
       * Changed casts in XDP program from unsigned long to long
       * Added a comment explaining the use of the loopback interface in selftests
       * Change parameter order in xdp_convert_md_to_buff to be input first
       * Assigned xdp->ingress_ifindex and xdp->rx_queue_index to local variables in
        xdp_convert_md_to_buff
       * Made use of "meta data" versus "metadata" consistent in comments and commit
        messages
      
      v3->v4
      v3: https://lore.kernel.org/bpf/20210602190815.8096-1-zeffron@riotgames.com/
      
       * Clean up nits
       * Validate xdp_md->data_end in bpf_prog_test_run_xdp
       * Remove intermediate metalen variables
      
      v2 -> v3
      v2: https://lore.kernel.org/bpf/20210527201341.7128-1-zeffron@riotgames.com/
      
       * Check errno first in selftests
       * Use DECLARE_LIBBPF_OPTS
       * Rename tattr to opts in selftests
       * Remove extra new line
       * Rename convert_xdpmd_to_xdpb to xdp_convert_md_to_buff
       * Rename convert_xdpb_to_xdpmd to xdp_convert_buff_to_md
       * Move declaration of device and rxqueue in xdp_convert_md_to_buff to
        patch 2
       * Reorder the kfree calls in bpf_prog_test_run_xdp
      
      v1 -> v2
      v1: https://lore.kernel.org/bpf/20210524220555.251473-1-zeffron@riotgames.com
      
       * Fix null pointer dereference with no context
       * Use the BPF skeleton and replace CHECK with ASSERT macros
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add test for xdp_md context in BPF_PROG_TEST_RUN · 939b9c68
      Zvi Effron authored
      
      
      Add a test for using xdp_md as a context to BPF_PROG_TEST_RUN for XDP
      programs.
      
      The test uses a BPF program that takes in a return value from XDP
      meta data, then reduces the size of the XDP meta data by 4 bytes.
      
      Test cases validate the possible failure cases for passing in invalid
      xdp_md contexts, that the return value is successfully passed
      in, and that the adjusted meta data is successfully copied out.
      
      Co-developed-by: Cody Haas <chaas@riotgames.com>
      Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Cody Haas <chaas@riotgames.com>
      Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Zvi Effron <zeffron@riotgames.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210707221657.3985075-5-zeffron@riotgames.com
    • bpf: Support specifying ingress via xdp_md context in BPF_PROG_TEST_RUN · ec94670f
      Zvi Effron authored
      
      
      Support specifying the ingress_ifindex and rx_queue_index of xdp_md
      contexts for BPF_PROG_TEST_RUN.
      
      The intended use case is to allow testing XDP programs that make decisions
      based on the ingress interface or RX queue.
      
      If ingress_ifindex is specified, look up the device by the provided index
      in the current namespace and use its xdp_rxq for the xdp_buff. If the
      rx_queue_index is out of range, or is non-zero when the ingress_ifindex is
      0, return -EINVAL.
      
      Co-developed-by: Cody Haas <chaas@riotgames.com>
      Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Cody Haas <chaas@riotgames.com>
      Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Zvi Effron <zeffron@riotgames.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210707221657.3985075-4-zeffron@riotgames.com
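
      A hedged user space usage sketch (the packet buffer and prog_fd are
      assumed to exist; the selftests in this series use the same
      opts-based interface):

      struct xdp_md ctx_in = {
              .data_end = sizeof(pkt),
              .ingress_ifindex = 1,   /* loopback */
              .rx_queue_index = 0,
      };
      DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
              .data_in = pkt,
              .data_size_in = sizeof(pkt),
              .ctx_in = &ctx_in,
              .ctx_size_in = sizeof(ctx_in),
      );
      int err = bpf_prog_test_run_opts(prog_fd, &opts);
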