  1. Nov 22, 2023
• dpll: Fix potential msg memleak when genlmsg_put_reply failed · b6fe6f03
      Hao Ge authored
We should free the skb resource if genlmsg_put_reply() fails.
      
Fixes: 9d71b54b ("dpll: netlink: Add DPLL framework base functions")
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20231121013709.73323-1-gehao@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
• Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · b2d66643
      Jakub Kicinski authored
      
      
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2023-11-21
      
      We've added 19 non-merge commits during the last 4 day(s) which contain
      a total of 18 files changed, 1043 insertions(+), 416 deletions(-).
      
      The main changes are:
      
      1) Fix BPF verifier to validate callbacks as if they are called an unknown
         number of times in order to fix not detecting some unsafe programs,
         from Eduard Zingerman.
      
      2) Fix bpf_redirect_peer() handling which missed proper stats accounting
         for veth and netkit and also generally fix missing stats for the latter,
         from Peilin Ye, Daniel Borkmann et al.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: check if max number of bpf_loop iterations is tracked
        bpf: keep track of max number of bpf_loop callback iterations
        selftests/bpf: test widening for iterating callbacks
        bpf: widening for callback iterators
        selftests/bpf: tests for iterating callbacks
        bpf: verify callbacks as if they are called unknown number of times
        bpf: extract setup_func_entry() utility function
        bpf: extract __check_reg_arg() utility function
        selftests/bpf: fix bpf_loop_bench for new callback verification scheme
        selftests/bpf: track string payload offset as scalar in strobemeta
        selftests/bpf: track tcp payload offset as scalar in xdp_synproxy
        selftests/bpf: Add netkit to tc_redirect selftest
        selftests/bpf: De-veth-ize the tc_redirect test case
        bpf, netkit: Add indirect call wrapper for fetching peer dev
        bpf: Fix dev's rx stats for bpf_redirect_peer traffic
        veth: Use tstats per-CPU traffic counters
        netkit: Add tstats per-CPU traffic counters
        net: Move {l,t,d}stats allocation to core and convert veth & vrf
        net, vrf: Move dstats structure to core
      ====================
      
      Link: https://lore.kernel.org/r/20231121193113.11796-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
• docs: netdev: try to guide people on dealing with silence · 495ec91b
      Jakub Kicinski authored
      
      
There have been more than a few threads which went idle before
the merge window, and now people have come back to them and started
asking about next steps.
      
      We currently tell people to be patient and not to repost too
      often. Our "not too often", however, is still a few orders of
      magnitude faster than other subsystems. Or so I feel after
      hearing people talk about review rates at LPC.
      
      Clarify in the doc that if the discussion went idle for a week
      on netdev, 95% of the time there's no point waiting longer.
      
      Link: https://lore.kernel.org/r/20231120200109.620392-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
• net: usb: ax88179_178a: fix failed operations during ax88179_reset · 0739af07
      Jose Ignacio Tornos Martinez authored
      Using generic ASIX Electronics Corp. AX88179 Gigabit Ethernet device,
      the following test cycle has been implemented:
          - power on
          - check logs
          - shutdown
          - after detecting the system shutdown, disconnect power
          - after approximately 60 seconds of sleep, power is restored
      Running some cycles, sometimes error logs like this appear:
          kernel: ax88179_178a 2-9:1.0 (unnamed net_device) (uninitialized): Failed to write reg index 0x0001: -19
          kernel: ax88179_178a 2-9:1.0 (unnamed net_device) (uninitialized): Failed to read reg index 0x0001: -19
          ...
These failed operations happen during ax88179_reset execution, so
the initialization may not be correct.
      
      In order to avoid this, we need to increase the delay after reset and
      clock initial operations. By using these larger values, many cycles
      have been run and no failed operations appear.
      
It would be better to check some status register to verify when the
operation has finished, but I have not found any available information
(neither in the public datasheets nor in the manufacturer's driver). The
only available information for the necessary delays is the manufacturer's
driver (original values), but the proposed values are not enough for the
tested devices.
      
Fixes: e2ca90c2 ("ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver")
Reported-by: Herb Wei <weihao.bj@ieisystem.com>
Tested-by: Herb Wei <weihao.bj@ieisystem.com>
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Link: https://lore.kernel.org/r/20231120120642.54334-1-jtornosm@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. Nov 21, 2023
• MAINTAINERS: Add indirect_call_wrapper.h to NETWORKING [GENERAL] · 9c6dc131
      Simon Horman authored
      
      
indirect_call_wrapper.h is not, strictly speaking, networking specific.
However, its git history indicates that in practice changes go through
netdev, and thus the netdev maintainers have effectively been taking
responsibility for it.
      
      Formalise this by adding it to the NETWORKING [GENERAL] section in the
      MAINTAINERS file.
      
      It is not clear how many other files under include/linux fall into this
      category and it would be interesting, as a follow-up, to audit that and
      propose further updates to the MAINTAINERS file as appropriate.
      
      Link: https://lore.kernel.org/netdev/20231116010310.4664dd38@kernel.org/
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231120-indirect_call_wrapper-maintainer-v1-1-0a6bb1f7363e@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
• Merge branch 'hv_netvsc-fix-race-of-netvsc-vf-register-and-slave-bit' · 54d4434d
      Paolo Abeni authored
      
      
      Haiyang Zhang says:
      
      ====================
      hv_netvsc: fix race of netvsc, VF register, and slave bit
      
      There are some races between netvsc probe, set notifier, VF register,
      and slave bit setting.
      This patch set fixes them.
      ====================
      
      Link: https://lore.kernel.org/r/1700411023-14317-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
• hv_netvsc: Mark VF as slave before exposing it to user-mode · c807d6cd
      Long Li authored
When a VF is being exposed from the kernel, it should be marked as "slave"
before being exposed to user-mode. The VF is not usable without netvsc
running as master. User-mode should never see a VF without the "slave"
flag.
      
This commit moves the code that sets the slave flag so that it runs
before the VF is exposed to user-mode.
      
      Cc: stable@vger.kernel.org
Fixes: 0c195567 ("netvsc: transparent VF management")
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
• hv_netvsc: Fix race of register_netdevice_notifier and VF register · 85520856
      Haiyang Zhang authored
      If VF NIC is registered earlier, NETDEV_REGISTER event is replayed,
      but NETDEV_POST_INIT is not.
      
Move register_netdevice_notifier() earlier, so the callback
function is set before probing.
      
      Cc: stable@vger.kernel.org
Fixes: e04e7a7b ("hv_netvsc: Fix a deadlock by getting rtnl lock earlier in netvsc_probe()")
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
• hv_netvsc: fix race of netvsc and VF register_netdevice · d30fb712
      Haiyang Zhang authored
The rtnl lock also needs to be held before rndis_filter_device_add(),
which advertises the nvsp_2_vsc_capability / sriov bit and triggers
VF NIC offering and registering. If the VF NIC finishes register_netdev()
earlier, it may cause name-based config failure.
      
To fix this issue, move the call to rtnl_lock() before
rndis_filter_device_add(), so the VF will be registered later than the
netvsc / synthetic NIC, and gets a numbered name (ethX) after netvsc.
      
      Cc: stable@vger.kernel.org
Fixes: e04e7a7b ("hv_netvsc: Fix a deadlock by getting rtnl lock earlier in netvsc_probe()")
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
• ipv4: Correct/silence an endian warning in __ip_do_redirect · c0e29262
      Kunwu Chan authored
      net/ipv4/route.c:783:46: warning: incorrect type in argument 2 (different base types)
      net/ipv4/route.c:783:46:    expected unsigned int [usertype] key
      net/ipv4/route.c:783:46:    got restricted __be32 [usertype] new_gw
      
Fixes: 969447f2 ("ipv4: use new_gw for redirect neigh lookup")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Link: https://lore.kernel.org/r/20231119141759.420477-1-chentao@kylinos.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
• Merge branch 'verify-callbacks-as-if-they-are-called-unknown-number-of-times' · acb12c85
      Alexei Starovoitov authored
      
      
      Eduard Zingerman says:
      
      ====================
      verify callbacks as if they are called unknown number of times
      
This series updates the verifier logic for callback function handling.
Current master simulates callback body execution exactly once,
which leads to the verifier not detecting unsafe programs like the one below:
      
          static int unsafe_on_zero_iter_cb(__u32 idx, struct num_context *ctx)
          {
              ctx->i = 0;
              return 0;
          }
      
          SEC("?raw_tp")
          int unsafe_on_zero_iter(void *unused)
          {
              struct num_context loop_ctx = { .i = 32 };
              __u8 choice_arr[2] = { 0, 1 };
      
              bpf_loop(100, unsafe_on_zero_iter_cb, &loop_ctx, 0);
              return choice_arr[loop_ctx.i];
          }
      
      This was reported previously in [0].
The basic idea of the fix is to schedule the callback entry state for
verification in env->head until some identical, previously visited
state in the current DFS state traversal is found. This is the same
logic as with open coded iterators, and it builds on top of recent
fixes [1] for those.
      
      The series is structured as follows:
      - patches #1,2,3 update strobemeta, xdp_synproxy selftests and
        bpf_loop_bench benchmark to allow convergence of the bpf_loop
        callback states;
      - patches #4,5 just shuffle the code a bit;
      - patch #6 is the main part of the series;
      - patch #7 adds test cases for #6;
- patch #8 extends patch #6 with the same speculative scalar widening
  logic as used for open coded iterators;
      - patch #9 adds test cases for #8;
      - patch #10 extends patch #6 to track maximal number of callback
        executions specifically for bpf_loop();
      - patch #11 adds test cases for #10.
      
      Veristat results comparing this series to master+patches #1,2,3 using selftests
      show the following difference:
      
      File                       Program        States (A)  States (B)  States (DIFF)
      -------------------------  -------------  ----------  ----------  -------------
      bpf_loop_bench.bpf.o       benchmark               1           2  +1 (+100.00%)
      pyperf600_bpf_loop.bpf.o   on_event              322         407  +85 (+26.40%)
      strobemeta_bpf_loop.bpf.o  on_event              113         151  +38 (+33.63%)
      xdp_synproxy_kern.bpf.o    syncookie_tc          341         291  -50 (-14.66%)
      xdp_synproxy_kern.bpf.o    syncookie_xdp         344         301  -43 (-12.50%)
      
      Veristat results comparing this series to master using Tetragon BPF
      files [2] also show some differences.
      States diff varies from +2% to +15% on 23 programs out of 186,
      no new failures.
      
      Changelog:
      - V3 [5] -> V4, changes suggested by Andrii:
        - validate mark_chain_precision() result in patch #10;
        - renaming s/cumulative_callback_depth/callback_unroll_depth/.
      - V2 [4] -> V3:
        - fixes in expected log messages for test cases:
          - callback_result_precise;
          - parent_callee_saved_reg_precise_with_callback;
          - parent_stack_slot_precise_with_callback;
        - renamings (suggested by Alexei):
          - s/callback_iter_depth/cumulative_callback_depth/
          - s/is_callback_iter_next/calls_callback/
          - s/mark_callback_iter_next/mark_calls_callback/
        - prepare_func_exit() updated to exit with -EFAULT when
          callee->in_callback_fn is true but calls_callback() is not true
          for callsite;
        - test case 'bpf_loop_iter_limit_nested' rewritten to use return
          value check instead of verifier log message checks
          (suggested by Alexei).
      - V1 [3] -> V2, changes suggested by Andrii:
        - small changes for error handling code in __check_func_call();
        - callback body processing log is now matched in relevant
          verifier_subprog_precision.c tests;
        - R1 passed to bpf_loop() is now always marked as precise;
        - log level 2 message for bpf_loop() iteration termination instead of
          iteration depth messages;
        - __no_msg macro removed;
        - bpf_loop_iter_limit_nested updated to avoid using __no_msg;
        - commit message for patch #3 updated according to Alexei's request.
      
      [0] https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
      [1] https://lore.kernel.org/bpf/20231024000917.12153-1-eddyz87@gmail.com/
      [2] git@github.com:cilium/tetragon.git
      [3] https://lore.kernel.org/bpf/20231116021803.9982-1-eddyz87@gmail.com/T/#t
      [4] https://lore.kernel.org/bpf/20231118013355.7943-1-eddyz87@gmail.com/T/#t
      [5] https://lore.kernel.org/bpf/20231120225945.11741-1-eddyz87@gmail.com/T/#t
      ====================
      
      Link: https://lore.kernel.org/r/20231121020701.26440-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• selftests/bpf: check if max number of bpf_loop iterations is tracked · 57e2a52d
      Eduard Zingerman authored
      
      
Check that even if bpf_loop() callback simulation does not converge to
a specific state, verification can proceed via "brute force" simulation
of the maximal number of callback calls.
      
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-12-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• bpf: keep track of max number of bpf_loop callback iterations · bb124da6
      Eduard Zingerman authored
      
      
In some cases the verifier can't infer convergence of the bpf_loop()
iteration, e.g. for the following program:
      
          static int cb(__u32 idx, struct num_context* ctx)
          {
              ctx->i++;
              return 0;
          }
      
          SEC("?raw_tp")
          int prog(void *_)
          {
              struct num_context ctx = { .i = 0 };
              __u8 choice_arr[2] = { 0, 1 };
      
              bpf_loop(2, cb, &ctx, 0);
              return choice_arr[ctx.i];
          }
      
Each 'cb' simulation would eventually return to 'prog' and reach the
'return choice_arr[ctx.i]' statement, at which point ctx.i would be
marked precise, forcing the verifier to track a multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.
      
This commit allows "brute force" handling for such cases by limiting
the number of callback body simulations using the 'umax' value of the
first bpf_loop() parameter.
      
For this, extend bpf_func_state with a 'callback_depth' field.
Increment this field when a callback visiting state is pushed to the
state traversal stack. For frame #N, its 'callback_depth' field counts
how many times a callback with frame depth N+1 has been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.
      
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• selftests/bpf: test widening for iterating callbacks · 9f3330aa
      Eduard Zingerman authored
      
      
A test case to verify that imprecise scalar widening is applied to the
callback entry state when the callback call is simulated repeatedly.
      
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-10-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• bpf: widening for callback iterators · cafe2c21
      Eduard Zingerman authored
      
      
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback-based
loops behave identically to open coded iterators, e.g. allowing
verification of programs like the one below:
      
        struct ctx { u32 i; };
        int cb(u32 idx, struct ctx* ctx)
        {
                ++ctx->i;
                return 0;
        }
        ...
        struct ctx ctx = { .i = 0 };
        bpf_loop(100, cb, &ctx, 0);
        ...
      
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• selftests/bpf: tests for iterating callbacks · 958465e2
      Eduard Zingerman authored
      
      
A set of test cases to check the behavior of the callback handling
logic, verifying that the verifier catches the following situations:
      - program not safe on second callback iteration;
      - program not safe on zero callback iterations;
      - infinite loop inside a callback.
      
      Verify that callback logic works for bpf_loop, bpf_for_each_map_elem,
      bpf_user_ringbuf_drain, bpf_find_vma.
      
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• bpf: verify callbacks as if they are called unknown number of times · ab5cfac1
      Eduard Zingerman authored
      
      
Prior to this patch, callbacks were handled as regular function calls,
and execution of the callback body was modeled exactly once.
      This patch updates callbacks handling logic as follows:
      - introduces a function push_callback_call() that schedules callback
        body verification in env->head stack;
      - updates prepare_func_exit() to reschedule callback body verification
        upon BPF_EXIT;
- as with calls to bpf_*_iter_next(), calls to callback-invoking
  functions are marked as checkpoints;
      - is_state_visited() is updated to stop callback based iteration when
        some identical parent state is found.
      
Paths where the callback function is invoked zero times are now
verified first, which makes it necessary to modify some selftests:
      - the following negative tests required adding release/unlock/drop
        calls to avoid previously masked unrelated error reports:
        - cb_refs.c:underflow_prog
        - exceptions_fail.c:reject_rbtree_add_throw
        - exceptions_fail.c:reject_with_cp_reference
      - the following precision tracking selftests needed change in expected
        log trace:
        - verifier_subprog_precision.c:callback_result_precise
          (note: r0 precision is no longer propagated inside callback and
                 I think this is a correct behavior)
        - verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
        - verifier_subprog_precision.c:parent_stack_slot_precise_with_callback
      
Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• bpf: extract setup_func_entry() utility function · 58124a98
      Eduard Zingerman authored
      
      
      Move code for simulated stack frame creation to a separate utility
      function. This function would be used in the follow-up change for
      callbacks handling.
      
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• bpf: extract __check_reg_arg() utility function · 683b96f9
      Eduard Zingerman authored
      
      
      Split check_reg_arg() into two utility functions:
      - check_reg_arg() operating on registers from current verifier state;
      - __check_reg_arg() operating on a specific set of registers passed as
        a parameter;
      
      The __check_reg_arg() function would be used by a follow-up change for
      callbacks handling.
      
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• selftests/bpf: fix bpf_loop_bench for new callback verification scheme · f40bfd16
      Eduard Zingerman authored
      
      
This is a preparatory change. A follow-up patch "bpf: verify callbacks
as if they are called unknown number of times" changes the logic for
callbacks handling. While previously callbacks were verified as a
single function call, the new scheme takes into account that callbacks
could be executed an unknown number of times.
      
      This has dire implications for bpf_loop_bench:
      
          SEC("fentry/" SYS_PREFIX "sys_getpgid")
          int benchmark(void *ctx)
          {
                  for (int i = 0; i < 1000; i++) {
                          bpf_loop(nr_loops, empty_callback, NULL, 0);
                          __sync_add_and_fetch(&hits, nr_loops);
                  }
                  return 0;
          }
      
Without the callbacks change, the verifier sees this as 1000 calls to
empty_callback(). With the callbacks change, however, things become
exponential:
      - i=0: state exploring empty_callback is scheduled with i=0 (a);
      - i=1: state exploring empty_callback is scheduled with i=1;
        ...
      - i=999: state exploring empty_callback is scheduled with i=999;
      - state (a) is popped from stack;
      - i=1: state exploring empty_callback is scheduled with i=1;
        ...
      
Avoid this issue by rewriting the outer loop as bpf_loop().
Unfortunately, this adds a function call to a loop at runtime, which
negatively affects performance:
      
                  throughput               latency
         before:  149.919 ± 0.168 M ops/s, 6.670 ns/op
         after :  137.040 ± 0.187 M ops/s, 7.297 ns/op
      
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• selftests/bpf: track string payload offset as scalar in strobemeta · 87eb0152
      Eduard Zingerman authored
      
      
This change prepares strobemeta for an update in the callbacks
verification logic. To allow bpf_loop() verification to converge when
multiple callback iterations are considered:
      - track offset inside strobemeta_payload->payload directly as scalar
        value;
      - at each iteration make sure that remaining
        strobemeta_payload->payload capacity is sufficient for execution of
        read_{map,str}_var functions;
- make sure that the offset is tracked as an unbound scalar between
  iterations, otherwise the verifier won't be able to infer that the
  bpf_loop callback reaches identical states.
      
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• selftests/bpf: track tcp payload offset as scalar in xdp_synproxy · 977bc146
      Eduard Zingerman authored
      
      
This change prepares syncookie_{tc,xdp} for an update in the callbacks
verification logic. To allow bpf_loop() verification to converge when
multiple callback iterations are considered:
      - track offset inside TCP payload explicitly, not as a part of the
        pointer;
      - make sure that offset does not exceed MAX_PACKET_OFF enforced by
        verifier;
- make sure that the offset is tracked as an unbound scalar between
  iterations, otherwise the verifier won't be able to infer that the
  bpf_loop callback reaches identical states.
      
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
• Merge branch 'bpf_redirect_peer fixes' · fcb905d8
      Martin KaFai Lau authored
      
      
      Daniel Borkmann says:
      
      ====================
      This fixes bpf_redirect_peer stats accounting for veth and netkit,
      and adds tstats in the first place for the latter. Utilise indirect
      call wrapper for bpf_redirect_peer, and improve test coverage of the
      latter also for netkit devices. Details in the patches, thanks!
      
      The series was targeted at bpf originally, and is done here as well,
      so it can trigger BPF CI. Jakub, if you think directly going via net
      is better since the majority of the diff touches net anyway, that is
      fine, too.
      
      Thanks!
      
      v2 -> v3:
        - Add kdoc for pcpu_stat_type (Simon)
        - Reject invalid type value in netdev_do_alloc_pcpu_stats (Simon)
        - Add Reviewed-by tags from list
      v1 -> v2:
        - Move stats allocation/freeing into net core (Jakub)
        - As prepwork for the above, move vrf's dstats over into the core
        - Add a check into stats alloc to enforce tstats upon
          implementing ndo_get_peer_dev
        - Add Acked-by tags from list
      
      Daniel Borkmann (6):
        net, vrf: Move dstats structure to core
        net: Move {l,t,d}stats allocation to core and convert veth & vrf
        netkit: Add tstats per-CPU traffic counters
        bpf, netkit: Add indirect call wrapper for fetching peer dev
        selftests/bpf: De-veth-ize the tc_redirect test case
        selftests/bpf: Add netkit to tc_redirect selftest
      ====================
      
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• selftests/bpf: Add netkit to tc_redirect selftest · adfeae2d
      Daniel Borkmann authored
      
      
Extend the existing tc_redirect selftest to also cover netkit devices
for exercising the bpf_redirect_peer() code paths, so that we have both
veth and netkit covered. All tests still pass after this change.
      
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-9-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• selftests/bpf: De-veth-ize the tc_redirect test case · eee82da7
      Daniel Borkmann authored
      
      
No functional changes to the test case; just rename various functions,
variables, etc., to remove the veth part of their names, making the
test more generic and reusable later on (e.g. for netkit).
      
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-8-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• bpf, netkit: Add indirect call wrapper for fetching peer dev · 2c225425
  Daniel Borkmann authored

ndo_get_peer_dev is used in the tcx BPF fast path, so make use of an
indirect call wrapper to optimize the bpf_redirect_peer() internal
handling a bit. Add a small skb_get_peer_dev() wrapper which utilizes
the INDIRECT_CALL_1() macro instead of open coding the indirect call.
      
Future work could add a peer pointer directly into struct net_device
and convert veth and netkit over to use it, so that ndo_get_peer_dev
can eventually be removed.
      
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• bpf: Fix dev's rx stats for bpf_redirect_peer traffic · 024ee930
  Peilin Ye authored
      Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
      is not accounted for in the RX stats of supported devices (that is, veth
      and netkit), confusing user space metrics collectors such as cAdvisor [0],
      as reported by Youlun.
      
      Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
      RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
      @tstats per-CPU counters (instead of @lstats, or @dstats).
      
      To make this more fool-proof, error out when ndo_get_peer_dev is set but
      @tstats are not selected.
      
        [0] Specifically, the "container_network_receive_{byte,packet}s_total"
            counters are affected.
      
Fixes: 9aa1206e ("bpf: Add redirect_peer helper")
Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• veth: Use tstats per-CPU traffic counters · 6f2684bf
  Peilin Ye authored

      Currently veth devices use the lstats per-CPU traffic counters, which only
      cover TX traffic. veth_get_stats64() actually populates RX stats of a veth
      device from its peer's TX counters, based on the assumption that a veth
      device can _only_ receive packets from its peer, which is no longer true:
      
      For example, recent CNIs (like Cilium) can use the bpf_redirect_peer() BPF
      helper to redirect traffic from NIC's tc ingress to veth's tc ingress (in
      a different netns), skipping veth's peer device. Unfortunately, this kind
      of traffic isn't currently accounted for in veth's RX stats.
      
      In preparation for the fix, use tstats (instead of lstats) to maintain
      both RX and TX counters for each veth device. We'll use RX counters for
      bpf_redirect_peer() traffic, and keep using TX counters for the usual
      "peer-to-peer" traffic. In veth_get_stats64(), calculate RX stats by
      _adding_ RX count to peer's TX count, in order to cover both kinds of
      traffic.
      
      veth_stats_rx() might need a name change (perhaps to "veth_stats_xdp()")
      for less confusion, but let's leave it to another patch to keep the fix
      minimal.
      
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-5-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• netkit: Add tstats per-CPU traffic counters · ae165827
  Daniel Borkmann authored

Add dev->tstats traffic accounting to netkit; tstats contains per-CPU
RX and TX counters.
      
      The dev's TX counters are bumped upon pass/unspec as well as redirect
      verdicts, in other words, on everything except for drops.
      
      The dev's RX counters are bumped upon successful __netif_rx(), as well
      as from skb_do_redirect() (not part of this commit here).
      
Using dev->lstats, which has just a single packets/bytes counter, and
inferring a device's RX counters from the peer dev's lstats is not
possible, given that skb_do_redirect() can also bump the device's stats.
      
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-4-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• net: Move {l,t,d}stats allocation to core and convert veth & vrf · 34d21de9
  Daniel Borkmann authored

Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc.): it all happens in the core.
      
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
• net, vrf: Move dstats structure to core · 79e0c5be
  Daniel Borkmann authored

      Just move struct pcpu_dstats out of the vrf into the core, and streamline
      the field names slightly, so they better align with the {t,l}stats ones.
      
      No functional change otherwise. A conversion of the u64s to u64_stats_t
      could be done at a separate point in future. This move is needed as we are
      moving the {t,l,d}stats allocation/freeing to the core.
      
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
  3. Nov 20, 2023
  4. Nov 17, 2023
• MAINTAINERS: Add netdev subsystem profile link · 76df934c
  Kees Cook authored

      The netdev subsystem has had a subsystem process document for a while
      now. Link it appropriately in MAINTAINERS with the P: tag.
      
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
• Merge branch 'rxrpc-ack-fixes' · 3c15504a
  David S. Miller authored

      David Howells says:
      
      ====================
      rxrpc: ACK handling fixes
      
      Here are a couple of patches to fix ACK handling in AF_RXRPC:
      
       (1) Allow RTT determination to use an ACK of any type as the response from
           which to calculate RTT, provided ack.serial matches the serial number
           of the outgoing packet.
      
       (2) Defer the response to a PING ACK packet (or any ACK with the
           REQUEST_ACK flag set) until after we've parsed the packet so that we
           carry up to date information if the Tx or Rx rings are advanced.
      ====================
      
Signed-off-by: David S. Miller <davem@davemloft.net>
• rxrpc: Defer the response to a PING ACK until we've parsed it · 1a01319f
  David Howells authored
Defer the generation of a PING RESPONSE ACK in response to a PING ACK until
we've parsed the PING ACK, so that we pick up any changes to the packet
queue and can update ackinfo accordingly.
      
      This is also applied to an ACK generated in response to an ACK with the
      REQUEST_ACK flag set.
      
      Note that whilst the problem was added in commit 248f219c, it didn't
      really matter at that point because the ACK was proposed in softirq mode
      and generated asynchronously later in process context, taking the latest
      values at the time.  But this fix is only needed since the move to parse
      incoming packets in an I/O thread rather than in softirq and generate the
      ACK at point of proposal (b0346843).
      
Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
• rxrpc: Fix RTT determination to use any ACK as a source · 3798680f
  David Howells authored
      Fix RTT determination to be able to use any type of ACK as the response
      from which RTT can be calculated provided its ack.serial is non-zero and
      matches the serial number of an outgoing DATA or ACK packet.  This
      shouldn't be limited to REQUESTED-type ACKs as these can have other types
      substituted for them for things like duplicate or out-of-order packets.
      
Fixes: 4700c4d8 ("rxrpc: Fix loss of RTT samples due to interposed ACK")
Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
• kselftest: rtnetlink: fix ip route command typo · 75a50c4f
  Paolo Abeni authored
      The blamed commit below introduced a typo causing 'gretap' test-case
      failures:
      
      ./rtnetlink.sh  -t kci_test_gretap -v
      COMMAND: ip link add name test-dummy0 type dummy
      COMMAND: ip link set test-dummy0 up
      COMMAND: ip netns add testns
      COMMAND: ip link help gretap 2>&1 | grep -q '^Usage:'
      COMMAND: ip -netns testns link add dev gretap00 type gretap seq key 102 local 172.16.1.100 remote 172.16.1.200
      COMMAND: ip -netns testns addr add dev gretap00 10.1.1.100/24
      COMMAND: ip -netns testns link set dev gretap00 ups
          Error: either "dev" is duplicate, or "ups" is a garbage.
      COMMAND: ip -netns testns link del gretap00
      COMMAND: ip -netns testns link add dev gretap00 type gretap external
      COMMAND: ip -netns testns link del gretap00
      FAIL: gretap
      
      Fix it by using the correct keyword.
      
Fixes: 9c2a19f7 ("kselftest: rtnetlink.sh: add verbose flag")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>