  1. Jul 16, 2021
    • bpf: Introduce bpf timers. · b00628b1
      Alexei Starovoitov authored
      
      
      Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
      in hash/array/lru maps as a regular field and helpers to operate on it:
      
      // Initialize the timer.
      // First 4 bits of 'flags' specify clockid.
      // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
      long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);
      
      // Configure the timer to call 'callback_fn' static function.
      long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);
      
      // Arm the timer to expire 'nsec' nanoseconds from the current time.
      long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);
      
      // Cancel the timer and wait for callback_fn to finish if it was running.
      long bpf_timer_cancel(struct bpf_timer *timer);
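
The clockid rule stated for bpf_timer_init() can be sketched in plain C as follows. This is a hedged illustration, not the kernel implementation: the function name, the SK_* constants, and the choice to reject any bits above the low four are assumptions made for the example; only the "first 4 bits" rule and the three allowed clocks come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Linux clock id values, redefined locally so the sketch builds anywhere. */
enum {
    SK_CLOCK_REALTIME  = 0,
    SK_CLOCK_MONOTONIC = 1,
    SK_CLOCK_BOOTTIME  = 7,
};

#define SK_CLOCKID_BITS 4
#define SK_CLOCKID_MASK ((1u << SK_CLOCKID_BITS) - 1)

/* Hypothetical helper: returns 0 and stores the clock id, or -EINVAL. */
static int timer_flags_to_clockid(uint64_t flags, int *clockid)
{
    int id = (int)(flags & SK_CLOCKID_MASK);

    /* Assumption: no flag bits above the clock id are defined yet. */
    if (flags & ~(uint64_t)SK_CLOCKID_MASK)
        return -EINVAL;
    if (id != SK_CLOCK_REALTIME && id != SK_CLOCK_MONOTONIC &&
        id != SK_CLOCK_BOOTTIME)
        return -EINVAL;

    *clockid = id;
    return 0;
}
```

Keeping the clock id in the low bits leaves the remaining bits of 'flags' free for future options without changing the helper signature.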
      
      Here is what a BPF program might look like:
      struct map_elem {
          int counter;
          struct bpf_timer timer;
      };
      
      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1000);
          __type(key, int);
          __type(value, struct map_elem);
      } hmap SEC(".maps");
      
      static int timer_cb(void *map, int *key, struct map_elem *val);
      /* val points to particular map element that contains bpf_timer. */
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(test1, int a)
      {
          struct map_elem *val;
          int key = 0;
      
          val = bpf_map_lookup_elem(&hmap, &key);
          if (val) {
              bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
              bpf_timer_set_callback(&val->timer, timer_cb);
              bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */, 0);
          }
          return 0;
      }
      
      This patch adds helper implementations that rely on hrtimers
      to call bpf functions as timers expire.
      The following patches add necessary safety checks.
      
      Only programs with CAP_BPF are allowed to use bpf_timer.
      
      The number of timers used by the program is constrained by
      the memcg recorded at map creation time.
      
      The bpf_timer_init() helper needs an explicit 'map' argument because inner
      maps are dynamic and not known at load time, while bpf_timer_set_callback()
      receives a hidden 'aux->prog' argument supplied by the verifier.
      
      The prog pointer is needed to do refcnting of bpf program to make sure that
      program doesn't get freed while the timer is armed. This approach relies on
      "user refcnt" scheme used in prog_array that stores bpf programs for
      bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
      paired with bpf_timer_cancel() that will drop the prog refcnt. The
      ops->map_release_uref is responsible for cancelling the timers and dropping
      prog refcnt when user space reference to a map reaches zero.
      This uref approach is done to make sure that Ctrl-C of user space process will
      not leave timers running forever unless the user space explicitly pinned a map
      that contained timers in bpffs.
      
      bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
      have user references (is not held by open file descriptor from user space and
      not pinned in bpffs).
      
      The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
      and free the timer if given map element had it allocated.
      "bpftool map update" command can be used to cancel timers.
      
      The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
      '__u64 :64' provides 8 bytes of padding but only 1-byte alignment.
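
The effect of the attribute can be checked with a small userspace sketch. The struct names below are hypothetical stand-ins for 'struct bpf_timer' and the 'map_elem' example, and a struct made only of unnamed bit-fields relies on a GCC/Clang extension. The alignment of the variant without the attribute is ABI-dependent, so the sketch only exposes it through a function rather than asserting a value.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque 16-byte blob: unnamed bit-fields reserve space without naming it. */
struct timer_unaligned {
    uint64_t :64;
    uint64_t :64;
};

/* Same layout, but with the alignment requested explicitly. */
struct timer_aligned {
    uint64_t :64;
    uint64_t :64;
} __attribute__((aligned(8)));

/* Hypothetical map value embedding the timer after a 4-byte field. */
struct elem {
    int counter;
    struct timer_aligned timer;
};

/* ABI-dependent; print this to see what your compiler chooses. */
static size_t unaligned_timer_alignment(void)
{
    return alignof(struct timer_unaligned);
}

/* With aligned(8) the timer always starts on an 8-byte boundary. */
static size_t timer_offset_in_elem(void)
{
    return offsetof(struct elem, timer);
}
```

With the attribute in place, the timer's offset inside a map value is predictable regardless of how the compiler treats unnamed bit-fields.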
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
      b00628b1
    • bpf: Factor out bpf_spin_lock into helpers. · c1b3fed3
      Alexei Starovoitov authored
      
      
      Move ____bpf_spin_lock/unlock into helpers to make it clearer that the
      quadruple-underscore bpf_spin_lock/unlock are irqsave/restore variants.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-3-alexei.starovoitov@gmail.com
      c1b3fed3
    • bpf: Prepare bpf_prog_put() to be called from irq context. · d809e134
      Alexei Starovoitov authored
      
      
      Currently bpf_prog_put() is called from task context only.
      With the addition of bpf timers, the timer-related helpers will start
      calling bpf_prog_put() from an irq-saved region and in rare cases might
      drop the refcnt to zero.
      To address this case, first convert bpf_prog_free_id() to be irq-save
      (similar to bpf_map_free_id), and second, defer calls that are not
      irq-appropriate to a work queue.
      For example:
      bpf_audit_prog() calls kmalloc and wake_up_interruptible, and
      bpf_prog_kallsyms_del_all() reaches spin_unlock_bh() via bpf_ksym_del().
      They are not safe with irqs disabled.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-2-alexei.starovoitov@gmail.com
      d809e134
    • selftests/bpf: Remove unused variable in tc_tunnel prog · de587d56
      Tobias Klauser authored
      The variable buf is unused since commit 005edd16 ("selftests/bpf:
      convert bpf tunnel test to BPF_ADJ_ROOM_MAC"). Remove it to fix the
      following warning:
      
          test_tc_tunnel.c:531:7: warning: unused variable 'buf' [-Wunused-variable]
      
      Fixes: 005edd16 ("selftests/bpf: convert bpf tunnel test to BPF_ADJ_ROOM_MAC")
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210713102719.8890-1-tklauser@distanz.ch
      de587d56
  2. Jul 15, 2021
  3. Jul 13, 2021
    • libbpf: Fix reuse of pinned map on older kernel · 97eb3138
      Martynas Pumputis authored
      
      
      When loading a BPF program with a pinned map, the loader checks whether
      the pinned map can be reused, i.e. whether its properties match. To derive
      the properties of the pinned map, the loader invokes BPF_OBJ_GET_INFO_BY_FD
      and then does the comparison.
      
      Unfortunately, on < 4.12 kernels the BPF_OBJ_GET_INFO_BY_FD is not
      available, so loading the program fails with the following error:
      
      	libbpf: failed to get map info for map FD 5: Invalid argument
      	libbpf: couldn't reuse pinned map at
      		'/sys/fs/bpf/tc/globals/cilium_call_policy': parameter
      		mismatch"
      	libbpf: map 'cilium_call_policy': error reusing pinned map
      	libbpf: map 'cilium_call_policy': failed to create:
      		Invalid argument(-22)
      	libbpf: failed to load object 'bpf_overlay.o'
      
      To fix this, fall back to deriving the map properties from
      /proc/$PID/fdinfo/$MAP_FD if BPF_OBJ_GET_INFO_BY_FD fails with EINVAL,
      which can be used as an indicator that the kernel doesn't support
      the latter.
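
A rough userspace sketch of the fdinfo fallback is shown below. It is not the libbpf code: the struct and function names are hypothetical, and the field names (map_type, key_size, value_size, max_entries, map_flags) are assumed to be the "name:<tab>value" lines the kernel prints for a BPF map file descriptor.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the properties compared on map reuse. */
struct map_props {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
};

/* Update 'p' from one "name:\tvalue" line; returns 1 if the line was used. */
static int parse_fdinfo_line(const char *line, struct map_props *p)
{
    int flags;

    if (sscanf(line, "map_type:\t%u", &p->type) == 1)
        return 1;
    if (sscanf(line, "key_size:\t%u", &p->key_size) == 1)
        return 1;
    if (sscanf(line, "value_size:\t%u", &p->value_size) == 1)
        return 1;
    if (sscanf(line, "max_entries:\t%u", &p->max_entries) == 1)
        return 1;
    if (sscanf(line, "map_flags:\t%i", &flags) == 1) { /* %i accepts 0x... */
        p->map_flags = (unsigned int)flags;
        return 1;
    }
    return 0; /* unrelated line such as "pos:" or "mnt_id:" */
}

/* Read /proc/self/fdinfo/<fd>; returns 0 on success, -1 if it cannot be opened. */
static int map_props_from_fdinfo(int fd, struct map_props *p)
{
    char path[64], line[256];
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
    fp = fopen(path, "r");
    if (!fp)
        return -1;

    memset(p, 0, sizeof(*p));
    while (fgets(line, sizeof(line), fp))
        parse_fdinfo_line(line, p);
    fclose(fp);
    return 0;
}
```

A caller would try the info syscall first and only use map_props_from_fdinfo() when that attempt fails with EINVAL, as the commit message describes.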
      
      Signed-off-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210712125552.58705-1-m@lambda.lt
      97eb3138
  4. Jul 08, 2021
    • samples/bpf: xdp_redirect_cpu_user: Cpumap qsize set larger default · eff94154
      Jesper Dangaard Brouer authored
      
      
      Experience from production shows queue size of 192 is too small, as
      this caused packet drops during cpumap-enqueue on RX-CPU.  This can be
      diagnosed with xdp_monitor sample program.
      
      This bpftrace program was used to diagnose the problem in more detail:
      
       bpftrace -e '
        tracepoint:xdp:xdp_cpumap_kthread { @deq_bulk = lhist(args->processed,0,10,1); @drop_net = lhist(args->drops,0,10,1) }
        tracepoint:xdp:xdp_cpumap_enqueue { @enq_bulk = lhist(args->processed,0,10,1); @enq_drops = lhist(args->drops,0,10,1); }'
      
      Watch out for the @enq_drops counter. The @drop_net counter can increase
      when netstack gets invalid packets, so don't despair: it can be
      natural, and that counter will likely disappear in newer kernels as it
      was a source of confusion (look at netstat info for the reason behind the
      netstack @drop_net counters).
      
      The production system was configured with CPU power-saving C6 state.
      Learn more in this blogpost[1].
      
      And wakeup latency in usec for the states are:
      
       # grep -H . /sys/devices/system/cpu/cpu0/cpuidle/*/latency
       /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
       /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
       /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
       /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:133
      
      The deepest state takes 133 usec to wake up from (133/10^6 sec). The link
      speed is 25Gbit/s ((25*10^9/8) in bytes/sec). How many bytes can arrive
      within 133 usec at this speed: (25*10^9/8)*(133/10^6) = 415625 bytes. With
      MTU size packets this is 275 packets, and with minimum Ethernet size (incl
      inter-frame gap overhead) of 84 bytes it is 4948 packets. Clearly the
      default queue size is too small.
      
      Set the default cpumap queue size to 2048, as the worst case (small
      packets) at 10Gbit/s is 1979 packets with 133 usec wakeup time, plus 64
      packets before the kthread wakeup call (due to xdp_do_flush), giving a
      worst case of 2043 packets.
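
The arithmetic above can be reproduced with a few lines of C. The function names are hypothetical, and rounding to the nearest packet is an assumption chosen because it matches the quoted figures for the 84-byte case.

```c
#include <stdint.h>

/* Bytes that can arrive on the link during one wakeup window. */
static uint64_t bytes_in_window(uint64_t link_bits_per_sec, uint64_t window_usec)
{
    return link_bits_per_sec / 8 * window_usec / 1000000;
}

/* Packets of a given on-wire size that fit in 'bytes', rounded to nearest. */
static uint64_t packets_in_window(uint64_t bytes, uint64_t wire_bytes_per_packet)
{
    return (bytes + wire_bytes_per_packet / 2) / wire_bytes_per_packet;
}

/* Burst during the wakeup window plus packets already pending before wakeup. */
static uint64_t worst_case_queue_depth(uint64_t link_bits_per_sec,
                                       uint64_t wakeup_usec,
                                       uint64_t wire_bytes_per_packet,
                                       uint64_t pending_before_wakeup)
{
    uint64_t bytes = bytes_in_window(link_bits_per_sec, wakeup_usec);

    return packets_in_window(bytes, wire_bytes_per_packet) + pending_before_wakeup;
}
```

For example, worst_case_queue_depth(10000000000ULL, 133, 84, 64) evaluates to 2043, which fits within the new default of 2048, while the 25Gbit/s small-packet case (4948 packets) does not.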
      
      Thus, if a packet burst on the RX-CPU enqueues packets to a remote
      cpumap CPU that is in deep-sleep state, it can overrun the cpumap queue.
      
      The production system was also configured to avoid deep-sleep via:
       tuned-adm profile network-latency
      
      [1] https://jeremyeder.com/2013/08/30/oh-did-you-expect-the-cpu/
      
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/162523477604.786243.13372630844944530891.stgit@firesoul
      eff94154
    • Merge branch 'Generic XDP improvements' · e0bc8927
      Alexei Starovoitov authored
      
      
      Kumar Kartikeya says:
      
      ====================
      
      This small series makes some improvements to generic XDP mode and brings it
      closer to native XDP. Patch 1 splits out generic XDP processing into reusable
      parts, patch 2 adds pointer friendly wrappers for bitops (to avoid casting
      the address of a local pointer to unsigned long * and back), patch 3
      implements generic cpumap support (details in commit) and patch 4 allows
      devmap bpf prog execution before generic_xdp_tx is called.
      
      Patch 5 just updates a couple of selftests to adapt to changes in behavior (in
      that specifying devmap/cpumap prog fd in generic mode is now allowed).
      
      Changelog:
      ----------
      v5 -> v6
      v5: https://lore.kernel.org/bpf/20210701002759.381983-1-memxor@gmail.com
       * Put rcpu->prog check before RCU-bh section to avoid do_softirq (Jesper)
      
      v4 -> v5
      v4: https://lore.kernel.org/bpf/20210628114746.129669-1-memxor@gmail.com
       * Add comments and examples for new bitops macros (Alexei)
      
      v3 -> v4
      v3: https://lore.kernel.org/bpf/20210622202835.1151230-1-memxor@gmail.com
       * Add detach now that attach of XDP program succeeds (Toke)
       * Clean up the test to use new ASSERT macros
      
      v2 -> v3
      v2: https://lore.kernel.org/bpf/20210622195527.1110497-1-memxor@gmail.com
       * list_for_each_entry -> list_for_each_entry_safe (due to deletion of skb)
      
      v1 -> v2
      v1: https://lore.kernel.org/bpf/20210620233200.855534-1-memxor@gmail.com
       * Move __ptr_{set,clear,test}_bit to bitops.h (Toke)
         Also changed argument order to match the bit op they wrap.
       * Remove map value size checking functions for cpumap/devmap (Toke)
       * Rework prog run for skb in cpu_map_kthread_run (Toke)
       * Set skb->dev to dst->dev after devmap prog has run
       * Don't set xdp rxq that will be overwritten in cpumap prog run
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e0bc8927
    • bpf: Tidy xdp attach selftests · 36246d5a
      Kumar Kartikeya Dwivedi authored
      
      
      Support for cpumap and devmap entry progs in previous commits means the
      test needs to be updated for the new semantics. Also take this
      opportunity to convert it from CHECK macros to the new ASSERT macros.
      
      Since xdp_cpumap_attach has no subtest, put the sole test inside the
      test_xdp_cpumap_attach function.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-6-memxor@gmail.com
      36246d5a
    • bpf: devmap: Implement devmap prog execution for generic XDP · 2ea5eaba
      Kumar Kartikeya Dwivedi authored
      
      
      This lifts the restriction on running devmap BPF progs in generic
      redirect mode. To match native XDP behavior, it is invoked right before
      generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
      XDP_DROP actions.
      
      We also return 0 even if devmap program drops the packet, as
      semantically redirect has already succeeded and the devmap prog is the
      last point before TX of the packet to device where it can deliver a
      verdict on the packet.
      
      This also means it must take care of freeing the skb, as
      xdp_do_generic_redirect callers only do that in case an error is
      returned.
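
The return-value and ownership rule described above can be modelled with a small sketch. Everything here is a hypothetical stand-in (the enum, the struct, and the function name); it only illustrates that a drop verdict still returns 0 and that the callee then owns the job of freeing the packet.

```c
#include <stdbool.h>

/* Stand-ins for XDP_ABORTED, XDP_DROP and XDP_PASS. */
enum verdict { V_ABORTED, V_DROP, V_PASS };

/* Stand-in for an skb; the flags record what happened to it. */
struct pkt {
    bool freed;
    bool sent;
};

/* Model of the generic redirect path once a devmap prog has produced 'v'. */
static int redirect_with_devmap_prog(struct pkt *p, enum verdict v)
{
    switch (v) {
    case V_PASS:
        p->sent = true;   /* stand-in for handing the packet to TX */
        break;
    case V_ABORTED:
    case V_DROP:
    default:
        p->freed = true;  /* callee frees: the caller only frees on error */
        break;
    }
    return 0;             /* the redirect itself succeeded either way */
}
```

Returning 0 on a drop keeps the caller's error path, which frees the packet, from running a second time on memory the callee has already released.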
      
      Since devmap entry prog is supported, remove the check in
      generic_xdp_install entirely.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com
      2ea5eaba
    • bpf: cpumap: Implement generic cpumap · 11941f8a
      Kumar Kartikeya Dwivedi authored
      
      
      This change implements CPUMAP redirect support for generic XDP programs.
      The idea is to reuse the cpu map entry's queue that is used to push
      native xdp frames for redirecting skb to a different CPU. This will
      match native XDP behavior (in that RPS is invoked again for packet
      reinjected into networking stack).
      
      To be able to determine whether the incoming skb is from the driver or
      cpumap, we reuse skb->redirected bit that skips generic XDP processing
      when it is set. To always make use of this, the CONFIG_NET_REDIRECT guard
      on it has been lifted and it is always available.
      
      From the redirect side, we add the skb to ptr_ring with its lowest bit
      set to 1.  This should be safe as skb is not 1-byte aligned. This allows
      kthread to discern between xdp_frames and sk_buff. On consumption of the
      ptr_ring item, the lowest bit is unset.
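
The low-bit tagging scheme can be illustrated in userspace with two unrelated struct types sharing one array of slots. The type and function names are hypothetical; the sketch assumes, as the text does, that the tagged object is at least 2-byte aligned so its lowest address bit is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct frame  { int len; };  /* stand-in for struct xdp_frame */
struct packet { int len; };  /* stand-in for struct sk_buff */

#define ITEM_IS_PACKET ((uintptr_t)1)

/* Producer side: mark a packet pointer before it goes into the ring. */
static void *tag_packet(struct packet *p)
{
    /* The scheme only works if the lowest address bit is unused. */
    assert(((uintptr_t)p & ITEM_IS_PACKET) == 0);
    return (void *)((uintptr_t)p | ITEM_IS_PACKET);
}

/* Consumer side: report the kind and return the pointer with the tag cleared. */
static void *untag_item(void *item, bool *is_packet)
{
    uintptr_t v = (uintptr_t)item;

    *is_packet = (v & ITEM_IS_PACKET) != 0;
    return (void *)(v & ~ITEM_IS_PACKET);
}
```

A consumer loop would call untag_item() on every slot it dequeues and branch on is_packet before casting the result to the matching type.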
      
      In the end, the skb is simply added to the list that kthread is anyway
      going to maintain for xdp_frames converted to skb, and then received
      again by using netif_receive_skb_list.
      
      Bulking optimization for generic cpumap is left for a future patch.
      
      Since cpumap entry progs are now supported, also remove check in
      generic_xdp_install for the cpumap.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
      11941f8a
    • bitops: Add non-atomic bitops for pointers · cb0f8003
      Kumar Kartikeya Dwivedi authored
      
      
      cpumap needs to set, clear, and test the lowest bit in skb pointer in
      various places. To make these checks less noisy, add pointer friendly
      bitop macros that also do some typechecking to sanitize the argument.
      
      These wrap the non-atomic bitops __set_bit, __clear_bit, and test_bit
      but for pointer arguments. Pointer's address has to be passed in and it
      is treated as an unsigned long *, since width and representation of
      pointer and unsigned long match on targets Linux supports. They are
      prefixed with double underscore to indicate lack of atomicity.
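
A userspace approximation of such wrappers might look like the sketch below. It is not the kernel's implementation: the macro names are hypothetical (no leading underscores, which are reserved outside the kernel), the argument type checking is omitted, and the value is routed through uintptr_t instead of treating the pointer's address as unsigned long *. Each macro evaluates its 'pp' argument more than once.

```c
#include <stdint.h>

/* Bit 'nr' as an integer wide enough to hold a pointer value. */
#define PTR_BIT(nr)            ((uintptr_t)1 << (nr))

/* 'pp' is the address of a void * variable. None of these are atomic. */
#define ptr_set_bit(nr, pp)    (*(pp) = (void *)((uintptr_t)*(pp) | PTR_BIT(nr)))
#define ptr_clear_bit(nr, pp)  (*(pp) = (void *)((uintptr_t)*(pp) & ~PTR_BIT(nr)))
#define ptr_test_bit(nr, pp)   (((uintptr_t)*(pp) & PTR_BIT(nr)) != 0)
```

The argument order (bit number first, then address) follows the order of the bit operations being wrapped, so a call reads like ptr_set_bit(0, &slot).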
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-3-memxor@gmail.com
      cb0f8003
    • net: core: Split out code to run generic XDP prog · fe21cb91
      Kumar Kartikeya Dwivedi authored
      
      
      This helper can later be utilized in code that runs cpumap and devmap
      programs in generic redirect mode and adjust skb based on changes made
      to xdp_buff.
      
      When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
      generic redirect path invokes devmap/cpumap prog if set, it must
      __skb_pull again as we expect mac header to be pulled.
      
      It also drops the skb_reset_mac_len call after do_xdp_generic, as the
      mac_header and network_header are advanced by the same offset, so the
      difference (mac_len) remains constant.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
      fe21cb91
    • Merge branch 'bpf: support input xdp_md context in BPF_PROG_TEST_RUN' · a080cdcc
      Alexei Starovoitov authored
      
      
      Zvi Effron says:
      
      ====================
      
      This patchset adds support for passing an xdp_md via ctx_in/ctx_out in
      bpf_attr for BPF_PROG_TEST_RUN of XDP programs.
      
      Patch 1 adds a function to validate XDP meta data lengths.
      
      Patch 2 adds initial support for passing XDP meta data in addition to
      packet data.
      
      Patch 3 adds support for also specifying the ingress interface and
      rx queue.
      
      Patch 4 adds selftests to ensure functionality is correct.
      
      Changelog:
      ----------
      v7->v8
      v7: https://lore.kernel.org/bpf/20210624211304.90807-1-zeffron@riotgames.com/
      
       * Fix too long comment line in patch 3
      
      v6->v7
      v6: https://lore.kernel.org/bpf/20210617232904.1899-1-zeffron@riotgames.com/
      
       * Add Yonghong Song's Acked-by to commit message in patch 1
       * Add Yonghong Song's Acked-by to commit message in patch 2
       * Extracted the post-update of the xdp_md context into a function (again)
       * Validate that the rx queue was registered with XDP info
       * Decrement the reference count on a found netdevice on failure to find
        a valid rx queue
       * Decrement the reference count on a found netdevice after the XDP
        program is run
       * Drop Yonghong Song's Acked-By for patch 3 because of patch changes
       * Improve a comment in the selftests
       * Drop Yonghong Song's Acked-By for patch 4 because of patch changes
      
      v5->v6
      v5: https://lore.kernel.org/bpf/20210616224712.3243-1-zeffron@riotgames.com/
      
       * Correct commit messages in patches 1 and 3
       * Add Acked-by to commit message in patch 4
       * Use gotos instead of returns to correctly free resources in
        bpf_prog_test_run_xdp
       * Rename xdp_metalen_valid to xdp_metalen_invalid
       * Improve the function signature for xdp_metalen_invalid
       * Merged declaration of ingress_ifindex and rx_queue_index into one line
      
      v4->v5
      v4: https://lore.kernel.org/bpf/20210604220235.6758-1-zeffron@riotgames.com/
      
       * Add new patch to introduce xdp_metalen_valid inline function to avoid
        duplicated code from net/core/filter.c
       * Correct size of bad_ctx in selftests
       * Make all declarations reverse Christmas tree
       * Move data check from xdp_convert_md_to_buff to bpf_prog_test_run_xdp
       * Merge xdp_convert_buff_to_md into bpf_prog_test_run_xdp
       * Fix line too long
       * Extracted common checks in selftests to a helper function
       * Removed redundant assignment in selftests
       * Reordered test cases in selftests
       * Check data against 0 instead of data_meta in selftests
       * Made selftests use EINVAL instead of hardcoded 22
       * Dropped "_" from XDP function name
       * Changed casts in XDP program from unsigned long to long
       * Added a comment explaining the use of the loopback interface in selftests
       * Change parameter order in xdp_convert_md_to_buff to be input first
       * Assigned xdp->ingress_ifindex and xdp->rx_queue_index to local variables in
        xdp_convert_md_to_buff
       * Made use of "meta data" versus "metadata" consistent in comments and commit
        messages
      
      v3->v4
      v3: https://lore.kernel.org/bpf/20210602190815.8096-1-zeffron@riotgames.com/
      
       * Clean up nits
       * Validate xdp_md->data_end in bpf_prog_test_run_xdp
       * Remove intermediate metalen variables
      
      v2 -> v3
      v2: https://lore.kernel.org/bpf/20210527201341.7128-1-zeffron@riotgames.com/
      
       * Check errno first in selftests
       * Use DECLARE_LIBBPF_OPTS
       * Rename tattr to opts in selftests
       * Remove extra new line
       * Rename convert_xdpmd_to_xdpb to xdp_convert_md_to_buff
       * Rename convert_xdpb_to_xdpmd to xdp_convert_buff_to_md
       * Move declaration of device and rxqueue in xdp_convert_md_to_buff to
        patch 2
       * Reorder the kfree calls in bpf_prog_test_run_xdp
      
      v1 -> v2
      v1: https://lore.kernel.org/bpf/20210524220555.251473-1-zeffron@riotgames.com
      
       * Fix null pointer dereference with no context
       * Use the BPF skeleton and replace CHECK with ASSERT macros
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a080cdcc
    • selftests/bpf: Add test for xdp_md context in BPF_PROG_TEST_RUN · 939b9c68
      Zvi Effron authored
      
      
      Add a test for using xdp_md as a context to BPF_PROG_TEST_RUN for XDP
      programs.
      
      The test uses a BPF program that takes in a return value from XDP
      meta data, then reduces the size of the XDP meta data by 4 bytes.
      
      Test cases validate the possible failure cases for passing in invalid
      xdp_md contexts, that the return value is successfully passed
      in, and that the adjusted meta data is successfully copied out.
      
      Co-developed-by: Cody Haas <chaas@riotgames.com>
      Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Cody Haas <chaas@riotgames.com>
      Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Zvi Effron <zeffron@riotgames.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210707221657.3985075-5-zeffron@riotgames.com
      939b9c68
    • bpf: Support specifying ingress via xdp_md context in BPF_PROG_TEST_RUN · ec94670f
      Zvi Effron authored
      
      
      Support specifying the ingress_ifindex and rx_queue_index of xdp_md
      contexts for BPF_PROG_TEST_RUN.
      
      The intended use case is to allow testing XDP programs that make decisions
      based on the ingress interface or RX queue.
      
      If ingress_ifindex is specified, look up the device by the provided index
      in the current namespace and use its xdp_rxq for the xdp_buff. If the
      rx_queue_index is out of range, or is non-zero when the ingress_ifindex is
      0, return -EINVAL.
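
The index rules in the paragraph above can be sketched as a pure function over a caller-supplied device table. The struct, the table lookup, and the function names are hypothetical stand-ins for the kernel's per-namespace device lookup, and returning -ENODEV for an unknown ifindex is an assumption, since the text only specifies the -EINVAL cases.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a network device and its RX queue count. */
struct fake_dev {
    uint32_t ifindex;
    uint32_t real_num_rx_queues;
};

/* Linear search standing in for a lookup by interface index. */
static const struct fake_dev *find_dev(const struct fake_dev *devs, size_t n,
                                       uint32_t ifindex)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].ifindex == ifindex)
            return &devs[i];
    return NULL;
}

/* Returns 0 if the (ifindex, queue) pair is acceptable, else a negative errno. */
static int check_ingress(const struct fake_dev *devs, size_t n,
                         uint32_t ingress_ifindex, uint32_t rx_queue_index)
{
    const struct fake_dev *dev;

    /* A queue index without an interface makes no sense. */
    if (ingress_ifindex == 0)
        return rx_queue_index ? -EINVAL : 0;

    dev = find_dev(devs, n, ingress_ifindex);
    if (!dev)
        return -ENODEV; /* assumption: error code not stated in the text */

    if (rx_queue_index >= dev->real_num_rx_queues)
        return -EINVAL; /* out of range for this device */

    return 0;
}
```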
      
      Co-developed-by: Cody Haas <chaas@riotgames.com>
      Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Cody Haas <chaas@riotgames.com>
      Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Zvi Effron <zeffron@riotgames.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210707221657.3985075-4-zeffron@riotgames.com
      ec94670f
    • bpf: Support input xdp_md context in BPF_PROG_TEST_RUN · 47316f4a
      Zvi Effron authored
      
      
      Support passing an xdp_md via ctx_in/ctx_out in bpf_attr for
      BPF_PROG_TEST_RUN.
      
      The intended use case is to pass some XDP meta data to the test runs of
      XDP programs that are used as tail calls.
      
      For programs that use bpf_prog_test_run_xdp, support xdp_md input and
      output. Unlike with an actual xdp_md during a non-test run, data_meta must
      be 0 because it must point to the start of the provided user data. From
      the initial xdp_md, use data and data_end to adjust the pointers in the
      generated xdp_buff. All other non-zero fields are prohibited (with
      EINVAL). If the user has set ctx_out/ctx_size_out, copy the (potentially
      different) xdp_md back to the userspace.
      
      We require all fields of input xdp_md except the ones we explicitly
      support to be set to zero. The expectation is that in the future we might
      add support for more fields and we want to fail explicitly if the user
      runs the program on the kernel where we don't yet support them.
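
The "reject everything we do not explicitly support" policy can be sketched as a validation function. The struct below only mirrors the field names of xdp_md for illustration, the function name is hypothetical, and the bound of data_end against the supplied buffer size is an assumption (the text only says data and data_end are used to adjust the pointers).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in with the same field names as the user-visible xdp_md. */
struct xdp_md_sketch {
    uint32_t data;
    uint32_t data_end;
    uint32_t data_meta;
    uint32_t ingress_ifindex;
    uint32_t rx_queue_index;
    uint32_t egress_ifindex;
};

/* Returns 0 if 'ctx' is acceptable as test-run input, else -EINVAL. */
static int check_xdp_md_input(const struct xdp_md_sketch *ctx,
                              uint32_t data_size_in)
{
    if (!ctx)
        return 0; /* no context supplied: nothing to validate */

    /* Meta data must start at the beginning of the user buffer. */
    if (ctx->data_meta != 0)
        return -EINVAL;

    /* Assumed bounds: data <= data_end <= size of the supplied buffer. */
    if (ctx->data > ctx->data_end || ctx->data_end > data_size_in)
        return -EINVAL;

    /* Fields not supported by this patch must stay zero. */
    if (ctx->ingress_ifindex || ctx->rx_queue_index || ctx->egress_ifindex)
        return -EINVAL;

    return 0;
}
```

Failing on unknown non-zero fields means a program that relies on a field added later gets an explicit error on an older kernel instead of silently running with that field ignored.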
      
      Co-developed-by: Cody Haas <chaas@riotgames.com>
      Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Cody Haas <chaas@riotgames.com>
      Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Zvi Effron <zeffron@riotgames.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210707221657.3985075-3-zeffron@riotgames.com
      47316f4a
    • bpf: Add function for XDP meta data length check · 7445cf31
      Zvi Effron authored
      
      
      This commit prepares to use the XDP meta data length check in multiple
      places by making it into a static inline function instead of a literal.
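
As a rough idea of what such a helper could look like, the sketch below wraps a length check in a static inline function. The commit text does not spell out the rule, so the condition used here (a multiple of 4 bytes and at most 32 bytes) is an assumption, as are the function and macro names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_XDP_META_MAX 32 /* assumed upper bound, in bytes */

/* Returns true when 'metalen' is NOT an acceptable meta data length. */
static inline bool metalen_invalid(unsigned long metalen)
{
    return (metalen & (sizeof(uint32_t) - 1)) || metalen > SKETCH_XDP_META_MAX;
}
```

Naming the predicate after the failing case lets call sites read as "if (metalen_invalid(len)) return -EINVAL;" without a negation.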
      
      Co-developed-by: Cody Haas <chaas@riotgames.com>
      Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Cody Haas <chaas@riotgames.com>
      Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
      Signed-off-by: Zvi Effron <zeffron@riotgames.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210707221657.3985075-2-zeffron@riotgames.com
      7445cf31
  5. Jul 02, 2021
  6. Jul 01, 2021
    • Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next · dbe69e43
      Linus Torvalds authored
      Pull networking updates from Jakub Kicinski:
       "Core:
      
         - BPF:
            - add syscall program type and libbpf support for generating
              instructions and bindings for in-kernel BPF loaders (BPF loaders
              for BPF), this is a stepping stone for signed BPF programs
            - infrastructure to migrate TCP child sockets from one listener to
              another in the same reuseport group/map to improve flexibility
              of service hand-off/restart
            - add broadcast support to XDP redirect
      
         - allow bypass of the lockless qdisc to improve performance (for
           pktgen: +23% with one thread, +44% with 2 threads)
      
         - add a simpler version of "DO_ONCE()" which does not require jump
           labels, intended for slow-path usage
      
         - virtio/vsock: introduce SOCK_SEQPACKET support
      
         - add getsockopt to retrieve netns cookie
      
         - ip: treat lowest address of an IPv4 subnet as ordinary unicast
           address allowing reclaiming of precious IPv4 addresses
      
         - ipv6: use prandom_u32() for ID generation
      
         - ip: add support for more flexible field selection for hashing
           across multi-path routes (w/ offload to mlxsw)
      
         - icmp: add support for extended RFC 8335 PROBE (ping)
      
         - seg6: add support for SRv6 End.DT46 behavior
      
         - mptcp:
            - DSS checksum support (RFC 8684) to detect middlebox meddling
            - support Connection-time 'C' flag
            - time stamping support
      
         - sctp: Packetization Layer Path MTU Discovery (RFC 8899)
      
         - xfrm: speed up state addition with seq set
      
         - WiFi:
            - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
            - aggregation handling improvements for some drivers
            - minstrel improvements for no-ack frames
            - deferred rate control for TXQs to improve reaction times
            - switch from round robin to virtual time-based airtime scheduler
      
         - add trace points:
            - tcp checksum errors
            - openvswitch - action execution, upcalls
            - socket errors via sk_error_report
      
        Device APIs:
      
         - devlink: add rate API for hierarchical control of max egress rate
           of virtual devices (VFs, SFs etc.)
      
         - don't require RCU read lock to be held around BPF hooks in NAPI
           context
      
         - page_pool: generic buffer recycling
      
        New hardware/drivers:
      
         - mobile:
            - iosm: PCIe Driver for Intel M.2 Modem
            - support for Qualcomm MSM8998 (ipa)
      
         - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
      
         - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
      
         - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
      
         - NXP SJA1110 Automotive Ethernet 10-port switch
      
         - Qualcomm QCA8327 switch support (qca8k)
      
         - Mikrotik 10/25G NIC (atl1c)
      
        Driver changes:
      
         - ACPI support for some MDIO, MAC and PHY devices from Marvell and
           NXP (our first foray into MAC/PHY description via ACPI)
      
         - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
      
         - Mellanox/Nvidia NIC (mlx5)
            - NIC VF offload of L2 bridging
            - support IRQ distribution to Sub-functions
      
         - Marvell (prestera):
            - add flower and match all
            - devlink trap
            - link aggregation
      
         - Netronome (nfp): connection tracking offload
      
         - Intel 1GE (igc): add AF_XDP support
      
         - Marvell DPU (octeontx2): ingress ratelimit offload
      
         - Google vNIC (gve): new ring/descriptor format support
      
         - Qualcomm mobile (rmnet & ipa): inline checksum offload support
      
         - MediaTek WiFi (mt76)
            - mt7915 MSI support
            - mt7915 Tx status reporting
            - mt7915 thermal sensors support
            - mt7921 decapsulation offload
            - mt7921 enable runtime pm and deep sleep
      
         - Realtek WiFi (rtw88)
            - beacon filter support
            - Tx antenna path diversity support
            - firmware crash information via devcoredump
      
         - Qualcomm WiFi (wcn36xx)
            - Wake-on-WLAN support with magic packets and GTK rekeying
      
         - Micrel PHY (ksz886x/ksz8081): add cable test support"
      
      * tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
        tcp: change ICSK_CA_PRIV_SIZE definition
        tcp_yeah: check struct yeah size at compile time
        gve: DQO: Fix off by one in gve_rx_dqo()
        stmmac: intel: set PCI_D3hot in suspend
        stmmac: intel: Enable PHY WOL option in EHL
        net: stmmac: option to enable PHY WOL with PMT enabled
        net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
        net: use netdev_info in ndo_dflt_fdb_{add,del}
        ptp: Set lookup cookie when creating a PTP PPS source.
        net: sock: add trace for socket errors
        net: sock: introduce sk_error_report
        net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
        net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
        net: dsa: include fdb entries pointing to bridge in the host fdb list
        net: dsa: include bridge addresses which are local in the host fdb list
        net: dsa: sync static FDB entries on foreign interfaces to hardware
        net: dsa: install the host MDB and FDB entries in the master's RX filter
        net: dsa: reference count the FDB addresses at the cross-chip notifier level
        net: dsa: introduce a separate cross-chip notifier type for host FDBs
        net: dsa: reference count the MDB entries at the cross-chip notifier level
        ...
      dbe69e43
    • Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a6eaf385
      Linus Torvalds authored
      Pull scheduler fixes from Ingo Molnar:
      
       - Fix a small inconsistency (bug) in load tracking, caught by a new
         warning that several people reported.
      
       - Flip CONFIG_SCHED_CORE to default-disabled, and update the Kconfig
         help text.
      
      * tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/core: Disable CONFIG_SCHED_CORE by default
        sched/fair: Ensure _sum and _avg values stay consistent
    • Merge tag 'microblaze-v5.14' of git://git.monstr.eu/linux-2.6-microblaze · f4cc74c9
      Linus Torvalds authored
      Pull microblaze updates from Michal Simek:
      
       - Remove unused PAGE_UP/DOWN macros
      
       - Fix trivial spelling mistake
      
      * tag 'microblaze-v5.14' of git://git.monstr.eu/linux-2.6-microblaze:
        arch: microblaze: Fix spelling mistake "vesion" -> "version"
        microblaze: Cleanup unused functions
    • Merge tag 'safesetid-5.14' of git://github.com/micah-morton/linux · 92183137
      Linus Torvalds authored
      Pull SafeSetID update from Micah Morton:
       "One very minor code cleanup change that marks a variable as
        __initdata"
      
      * tag 'safesetid-5.14' of git://github.com/micah-morton/linux:
        LSM: SafeSetID: Mark safesetid_initialized as __initdata
    • Merge tag 'Smack-for-5.14' of git://github.com/cschaufler/smack-next · 5c874a5b
      Linus Torvalds authored
      Pull smack updates from Casey Schaufler:
       "There is nothing more significant than an improvement to a byte count
        check in smackfs.
      
        All changes have been in next for weeks"
      
      * tag 'Smack-for-5.14' of git://github.com/cschaufler/smack-next:
        Smack: fix doc warning
        Revert "Smack: Handle io_uring kernel thread privileges"
        smackfs: restrict bytes count in smk_set_cipso()
        security/smack/: fix misspellings using codespell tool
    • Merge tag 'audit-pr-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit · 290fe0fa
      Linus Torvalds authored
      Pull audit updates from Paul Moore:
       "Another merge window, another small audit pull request.
      
        Four patches in total: one is cosmetic, one removes an unnecessary
        initialization, one renames some enum values to prevent name
        collisions, and one converts list_del()/list_add() to list_move().
      
        None of these are earth shattering and all pass the audit-testsuite
        tests while merging cleanly on top of your tree from earlier today"
      
      * tag 'audit-pr-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
        audit: remove unnecessary 'ret' initialization
        audit: remove trailing spaces and tabs
        audit: Use list_move instead of list_del/list_add
        audit: Rename enum audit_state constants to avoid AUDIT_DISABLED redefinition
        audit: add blank line after variable declarations
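
The list_del()/list_add() to list_move() conversion in this pull can be pictured with a small userspace sketch. This is not the kernel's <linux/list.h>: the struct and helpers below are simplified stand-ins written for illustration, and list_len() is a hypothetical helper that exists only so the result can be checked. Setting the removed entry's pointers to NULL is an assumption standing in for the kernel's pointer poisoning.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert 'entry' right after 'head'. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink 'entry' and clear its pointers (assumed stand-in for poisoning). */
static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = NULL;
    entry->prev = NULL;
}

/* Unlink 'entry' from its current list and insert it after 'head'. */
static void list_move(struct list_head *entry, struct list_head *head)
{
    assert(entry != head);
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    list_add(entry, head);
}

/* Hypothetical helper: count the entries on a list, excluding the head. */
static int list_len(const struct list_head *head)
{
    int n = 0;
    const struct list_head *pos;

    for (pos = head->next; pos != head; pos = pos->next)
        n++;
    return n;
}
```

In this sketch, calling list_del(&n) followed by list_add(&n, &dst) leaves both lists in the same state as a single list_move(&n, &dst); the single call states the intent directly and skips the intermediate pointer clearing.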
    • Merge tag 'selinux-pr-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 6bd344e5
      Linus Torvalds authored
      Pull SELinux updates from Paul Moore:
      
       - The slow_avc_audit() function is now non-blocking so we can remove
         the AVC_NONBLOCKING tricks; this also includes the 'flags' variant of
         avc_has_perm().
      
       - Use kmemdup() instead of kcalloc()+copy when copying parts of the
         SELinux policydb.
      
       - The InfiniBand device name is now passed by reference when possible
         in the SELinux code, removing a strncpy().
      
       - Minor cleanups including: constification of avtab function args,
         removal of useless LSM/XFRM function args, SELinux kdoc fixes, and
         removal of redundant assignments.
      
      * tag 'selinux-pr-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: kill 'flags' argument in avc_has_perm_flags() and avc_audit()
        selinux: slow_avc_audit has become non-blocking
        selinux: Fix kernel-doc
        selinux: use __GFP_NOWARN with GFP_NOWAIT in the AVC
        lsm_audit,selinux: pass IB device name by reference
        selinux: Remove redundant assignment to rc
        selinux: Corrected comment to match kernel-doc comment
        selinux: delete selinux_xfrm_policy_lookup() useless argument
        selinux: constify some avtab function arguments
        selinux: simplify duplicate_policydb_cond_list() by using kmemdup()
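
The kmemdup() change listed in this pull replaces a zeroing allocation followed by a copy with a single duplicate call. The userspace sketch below illustrates the pattern only: memdup_sketch() is a hypothetical stand-in for the kernel's kmemdup(), struct item_sketch is an invented element type, and neither function reproduces the actual SELinux policydb code. The sketch also assumes the element count is small enough that the size multiplication cannot overflow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element type, used only for illustration. */
struct item_sketch {
    int key;
    int value;
};

/* Hypothetical userspace stand-in for kmemdup(): allocate and copy. */
static void *memdup_sketch(const void *src, size_t len)
{
    void *dst;

    assert(src != NULL);
    dst = malloc(len);
    if (dst)
        memcpy(dst, src, len);
    return dst;
}

/* Before: zero the new array, then overwrite every byte of it. */
static struct item_sketch *dup_items_calloc(const struct item_sketch *src,
                                            size_t n)
{
    struct item_sketch *dst = calloc(n, sizeof(*dst));

    if (!dst)
        return NULL;
    memcpy(dst, src, n * sizeof(*dst));
    return dst;
}

/* After: one call that allocates and copies, with no redundant zeroing. */
static struct item_sketch *dup_items_memdup(const struct item_sketch *src,
                                            size_t n)
{
    return memdup_sketch(src, n * sizeof(*src));
}
```

Both functions return an independent copy that the caller must free(). The second avoids clearing memory that is about to be overwritten in full.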
    • Merge tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 44b6ed4c
      Linus Torvalds authored
      Pull clang feature updates from Kees Cook:
      
       - Add CC_HAS_NO_PROFILE_FN_ATTR so that noinstr functions can be
         excluded from profiling instrumentation, paving the way for PGO
         and fixing GCOV. (Nick Desaulniers)
      
       - x86_64 LTO coverage is expanded to 32-bit x86. (Nathan Chancellor)
      
       - Small fixes to CFI. (Mark Rutland, Nathan Chancellor)
      
      * tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute
        Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
        compiler_attributes.h: cleanups for GCC 4.9+
        compiler_attributes.h: define __no_profile, add to noinstr
        x86, lto: Enable Clang LTO for 32-bit as well
        CFI: Move function_nocfi() into compiler.h
        MAINTAINERS: Add Clang CFI section