  1. Nov 21, 2023
    • bpf: verify callbacks as if they are called unknown number of times · ab5cfac1
      Eduard Zingerman authored

      Prior to this patch, callbacks were handled as regular function calls:
      execution of the callback body was modeled exactly once.
      This patch updates callbacks handling logic as follows:
      - introduces a function push_callback_call() that schedules callback
        body verification in env->head stack;
      - updates prepare_func_exit() to reschedule callback body verification
        upon BPF_EXIT;
      - like calls to bpf_*_iter_next(), calls to callback-invoking functions
        are marked as checkpoints;
      - is_state_visited() is updated to stop callback based iteration when
        some identical parent state is found.
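
A minimal user-space sketch of why modeling the callback body exactly once is insufficient. `toy_loop()` is a hypothetical stand-in for a callback-invoking helper such as bpf_loop(); all names here are illustrative and none come from the patch. A bound that holds after one invocation need not hold after several:

```c
#include <stddef.h>

#define BUF_LEN 4

struct toy_ctx {
    size_t idx; /* advanced by one on every callback invocation */
};

/* Callback body: each call moves idx forward by one. */
static int toy_advance(unsigned int i, void *data)
{
    struct toy_ctx *ctx = data;

    (void)i;
    ctx->idx++;
    return 0; /* 0 means "keep iterating" */
}

/* Hypothetical stand-in for a callback-invoking helper. */
static void toy_loop(unsigned int nr, int (*cb)(unsigned int, void *),
                     void *data)
{
    for (unsigned int i = 0; i < nr; i++)
        if (cb(i, data))
            break;
}

/* Returns 1 if idx would still index buf[BUF_LEN] after nr iterations. */
static int toy_idx_in_bounds(unsigned int nr)
{
    struct toy_ctx ctx = { .idx = 0 };

    toy_loop(nr, toy_advance, &ctx);
    return ctx.idx < BUF_LEN;
}
```

With zero or one iteration the index stays in bounds, while `BUF_LEN` iterations push it out of bounds, which is the kind of path an "exactly once" model would never visit.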
      
      Paths with the callback function invoked zero times are now verified
      first, which makes it necessary to modify some selftests:
      - the following negative tests required adding release/unlock/drop
        calls to avoid previously masked unrelated error reports:
        - cb_refs.c:underflow_prog
        - exceptions_fail.c:reject_rbtree_add_throw
        - exceptions_fail.c:reject_with_cp_reference
      - the following precision tracking selftests needed change in expected
        log trace:
        - verifier_subprog_precision.c:callback_result_precise
          (note: r0 precision is no longer propagated inside callback and
                 I think this is a correct behavior)
        - verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
        - verifier_subprog_precision.c:parent_stack_slot_precise_with_callback
      
      Reported-by: Andrew Werner <awerner32@gmail.com>
      Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: extract setup_func_entry() utility function · 58124a98
      Eduard Zingerman authored

      Move code for simulated stack frame creation to a separate utility
      function. This function would be used in the follow-up change for
      callbacks handling.
      
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: extract __check_reg_arg() utility function · 683b96f9
      Eduard Zingerman authored

      Split check_reg_arg() into two utility functions:
      - check_reg_arg() operating on registers from current verifier state;
      - __check_reg_arg() operating on a specific set of registers passed as
        a parameter;
      
      The __check_reg_arg() function would be used by a follow-up change for
      callbacks handling.
      
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: fix bpf_loop_bench for new callback verification scheme · f40bfd16
      Eduard Zingerman authored

      This is a preparatory change. A follow-up patch "bpf: verify callbacks
      as if they are called unknown number of times" changes the logic for
      callback handling. While previously callbacks were verified as a
      single function call, the new scheme takes into account that callbacks
      could be executed an unknown number of times.
      
      This has dire implications for bpf_loop_bench:
      
          SEC("fentry/" SYS_PREFIX "sys_getpgid")
          int benchmark(void *ctx)
          {
                  for (int i = 0; i < 1000; i++) {
                          bpf_loop(nr_loops, empty_callback, NULL, 0);
                          __sync_add_and_fetch(&hits, nr_loops);
                  }
                  return 0;
          }
      
      Without the callbacks change, the verifier sees it as 1000 calls to
      empty_callback(). However, with the callbacks change things become
      exponential:
      - i=0: state exploring empty_callback is scheduled with i=0 (a);
      - i=1: state exploring empty_callback is scheduled with i=1;
        ...
      - i=999: state exploring empty_callback is scheduled with i=999;
      - state (a) is popped from stack;
      - i=1: state exploring empty_callback is scheduled with i=1;
        ...
      
      Avoid this issue by rewriting outer loop as bpf_loop().
      Unfortunately, this adds a function call to a loop at runtime, which
      negatively affects performance:
      
                  throughput               latency
         before:  149.919 ± 0.168 M ops/s, 6.670 ns/op
         after :  137.040 ± 0.187 M ops/s, 7.297 ns/op
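
The rewrite described above, with the outer `for` loop turned into a second bpf_loop() call, might look roughly like the sketch below. It is written as plain user-space C so it can be compiled on its own: `toy_bpf_loop()` is a hypothetical stand-in for the real helper, `outer_loop` is an assumed callback name, and the value of `nr_loops` is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t hits;
static const uint32_t nr_loops = 10; /* arbitrary value for the sketch */

/* User-space stand-in for bpf_loop(): calls cb up to nr times, stops early
 * when cb returns non-zero, and returns the number of iterations performed.
 */
static long toy_bpf_loop(uint32_t nr, long (*cb)(uint32_t, void *),
                         void *ctx, uint64_t flags)
{
    uint32_t i;

    (void)flags;
    for (i = 0; i < nr; i++)
        if (cb(i, ctx))
            return (long)i + 1;
    return (long)nr;
}

static long empty_callback(uint32_t index, void *data)
{
    (void)index;
    (void)data;
    return 0;
}

/* Former body of the for loop, now a callback itself. */
static long outer_loop(uint32_t index, void *data)
{
    (void)index;
    (void)data;
    toy_bpf_loop(nr_loops, empty_callback, NULL, 0);
    hits += nr_loops;
    return 0;
}

static int benchmark(void)
{
    toy_bpf_loop(1000, outer_loop, NULL, 0);
    return 0;
}
```

The extra call per outer iteration is the runtime cost reflected in the throughput numbers above.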
      
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231121020701.26440-4-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: track string payload offset as scalar in strobemeta · 87eb0152
      Eduard Zingerman authored

      This change prepares strobemeta for the update in callback verification
      logic. To allow bpf_loop() verification to converge when multiple
      callback iterations are considered:
      - track offset inside strobemeta_payload->payload directly as scalar
        value;
      - at each iteration make sure that remaining
        strobemeta_payload->payload capacity is sufficient for execution of
        read_{map,str}_var functions;
      - make sure that offset is tracked as unbound scalar between
        iterations, otherwise verifier won't be able to infer that bpf_loop
        callback reaches identical states.
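
The pattern in the list above can be sketched in user-space C as follows. This is an illustration under assumptions, not the strobemeta code: the struct, the sizes and `toy_read_str()` are hypothetical. The offset is kept as a plain integer next to the buffer, and the remaining capacity is re-checked on every call before anything is written:

```c
#include <stddef.h>

#define TOY_PAYLOAD_SIZE 64
#define TOY_MAX_STR_LEN  16 /* worst-case bytes one call may write */

struct toy_payload {
    char   payload[TOY_PAYLOAD_SIZE];
    size_t off; /* scalar offset, not folded into a pointer */
};

/* Copies src (truncated to TOY_MAX_STR_LEN - 1 chars plus NUL) at the
 * current offset. Returns bytes written, or -1 if the remaining capacity
 * cannot hold a worst-case write.
 */
static int toy_read_str(struct toy_payload *p, const char *src)
{
    size_t len;

    /* Checked on every iteration, independent of earlier ones. */
    if (p->off > TOY_PAYLOAD_SIZE - TOY_MAX_STR_LEN)
        return -1;

    for (len = 0; len < TOY_MAX_STR_LEN - 1 && src[len]; len++)
        p->payload[p->off + len] = src[len];
    p->payload[p->off + len] = '\0';

    p->off += len + 1;
    return (int)(len + 1);
}
```

Because each call proves its own bound from the scalar offset alone, the safety argument does not depend on how many iterations ran before it.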
      
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231121020701.26440-3-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: track tcp payload offset as scalar in xdp_synproxy · 977bc146
      Eduard Zingerman authored

      This change prepares syncookie_{tc,xdp} for the update in callback
      verification logic. To allow bpf_loop() verification to converge when
      multiple callback iterations are considered:
      - track offset inside TCP payload explicitly, not as a part of the
        pointer;
      - make sure that offset does not exceed MAX_PACKET_OFF enforced by
        verifier;
      - make sure that offset is tracked as unbound scalar between
        iterations, otherwise verifier won't be able to infer that bpf_loop
        callback reaches identical states.
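
A rough user-space illustration of tracking the position as an explicit offset rather than as an advancing pointer. The names are hypothetical, and `TOY_MAX_PACKET_OFF` is assumed to mirror the verifier's MAX_PACKET_OFF limit of 0xffff; this is not the selftest code:

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PACKET_OFF 0xffff /* assumed to mirror MAX_PACKET_OFF */

struct toy_pkt {
    const uint8_t *data;     /* fixed base, never advanced */
    const uint8_t *data_end;
    uint32_t off;            /* explicit scalar offset into the payload */
};

/* Reads one byte at the current offset and advances the offset.
 * Returns 0 on success, -1 if the offset is out of range.
 */
static int toy_next_byte(struct toy_pkt *pkt, uint8_t *out)
{
    size_t len = (size_t)(pkt->data_end - pkt->data);

    if (pkt->off >= TOY_MAX_PACKET_OFF) /* limit on the offset itself */
        return -1;
    if (pkt->off >= len)                /* bound against the packet end */
        return -1;

    *out = pkt->data[pkt->off];
    pkt->off++;
    return 0;
}
```

The base pointer stays constant across iterations; only the scalar changes, and both checks are repeated on every call.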
      
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231121020701.26440-2-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf_redirect_peer fixes' · fcb905d8
      Martin KaFai Lau authored

      Daniel Borkmann says:
      
      ====================
      This fixes bpf_redirect_peer stats accounting for veth and netkit,
      and adds tstats in the first place for the latter. Utilise indirect
      call wrapper for bpf_redirect_peer, and improve test coverage of the
      latter also for netkit devices. Details in the patches, thanks!
      
      The series was targeted at bpf originally, and is done here as well,
      so it can trigger BPF CI. Jakub, if you think directly going via net
      is better since the majority of the diff touches net anyway, that is
      fine, too.
      
      Thanks!
      
      v2 -> v3:
        - Add kdoc for pcpu_stat_type (Simon)
        - Reject invalid type value in netdev_do_alloc_pcpu_stats (Simon)
        - Add Reviewed-by tags from list
      v1 -> v2:
        - Move stats allocation/freeing into net core (Jakub)
        - As prepwork for the above, move vrf's dstats over into the core
        - Add a check into stats alloc to enforce tstats upon
          implementing ndo_get_peer_dev
        - Add Acked-by tags from list
      
      Daniel Borkmann (6):
        net, vrf: Move dstats structure to core
        net: Move {l,t,d}stats allocation to core and convert veth & vrf
        netkit: Add tstats per-CPU traffic counters
        bpf, netkit: Add indirect call wrapper for fetching peer dev
        selftests/bpf: De-veth-ize the tc_redirect test case
        selftests/bpf: Add netkit to tc_redirect selftest
      ====================
      
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • selftests/bpf: Add netkit to tc_redirect selftest · adfeae2d
      Daniel Borkmann authored

      Extend the existing tc_redirect selftest to also cover netkit devices
      for exercising the bpf_redirect_peer() code paths, so that we have both
      veth as well as netkit covered. All tests still pass after this change.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Link: https://lore.kernel.org/r/20231114004220.6495-9-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • selftests/bpf: De-veth-ize the tc_redirect test case · eee82da7
      Daniel Borkmann authored

      No functional changes to the test case, but just renaming various functions,
      variables, etc, to remove veth part of their name for making it more generic
      and reusable later on (e.g. for netkit).
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Link: https://lore.kernel.org/r/20231114004220.6495-8-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf, netkit: Add indirect call wrapper for fetching peer dev · 2c225425
      Daniel Borkmann authored

      ndo_get_peer_dev is used in the tcx BPF fast path, so make use of an
      indirect call wrapper to optimize the bpf_redirect_peer() internal
      handling a bit. Add a small skb_get_peer_dev() wrapper which
      utilizes the INDIRECT_CALL_1() macro instead of open coding.
      
      Future work could add a peer pointer directly into struct
      net_device and convert veth and netkit over to use it so
      that eventually ndo_get_peer_dev can be removed.
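
The idea behind an indirect call wrapper can be sketched in plain C. The macro and types below are toy stand-ins with hypothetical names, not the kernel's INDIRECT_CALL_1() or struct net_device. If the function pointer equals the one expected target, that target is called directly, otherwise the code falls back to the indirect call:

```c
#include <stddef.h>

struct toy_dev;
typedef struct toy_dev *(*toy_get_peer_fn)(struct toy_dev *dev);

struct toy_dev {
    toy_get_peer_fn get_peer; /* optional op, may be NULL */
    struct toy_dev *peer;
};

/* The one implementation we expect to see most often. */
static struct toy_dev *toy_netkit_peer(struct toy_dev *dev)
{
    return dev->peer;
}

/* Toy version of the pattern: direct call when f is the expected target,
 * plain indirect call otherwise.
 */
#define TOY_INDIRECT_CALL_1(f, f1, ...) \
    ((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Small wrapper in the spirit of skb_get_peer_dev(). */
static struct toy_dev *toy_get_peer_dev(struct toy_dev *dev)
{
    if (dev->get_peer)
        return TOY_INDIRECT_CALL_1(dev->get_peer, toy_netkit_peer, dev);
    return NULL;
}
```

The direct branch lets the compiler emit a regular call for the common target, which matters where indirect calls are expensive.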
      
      Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Fix dev's rx stats for bpf_redirect_peer traffic · 024ee930
      Peilin Ye authored

      Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
      is not accounted for in the RX stats of supported devices (that is, veth
      and netkit), confusing user space metrics collectors such as cAdvisor [0],
      as reported by Youlun.
      
      Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
      RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
      @tstats per-CPU counters (instead of @lstats, or @dstats).
      
      To make this more fool-proof, error out when ndo_get_peer_dev is set but
      @tstats are not selected.
      
        [0] Specifically, the "container_network_receive_{byte,packet}s_total"
            counters are affected.
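
The "error out" rule can be pictured with a small check like the one below. It is a sketch with hypothetical types, and the choice of `-EINVAL` as the error code is an assumption made for illustration; the commit message does not say which error is returned:

```c
#include <errno.h>
#include <stdbool.h>

enum toy_stat_type {
    TOY_STAT_NONE,
    TOY_STAT_LSTATS,
    TOY_STAT_TSTATS,
    TOY_STAT_DSTATS,
};

struct toy_netdev {
    bool has_get_peer_dev;        /* device implements the peer lookup op */
    enum toy_stat_type stat_type; /* which per-CPU counters it selected */
};

/* Devices with a peer lookup op must have selected tstats, so that
 * redirected traffic can be added to their RX counters.
 */
static int toy_check_peer_stats(const struct toy_netdev *dev)
{
    if (dev->has_get_peer_dev && dev->stat_type != TOY_STAT_TSTATS)
        return -EINVAL; /* assumed error code */
    return 0;
}
```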
      
      Fixes: 9aa1206e ("bpf: Add redirect_peer helper")
      Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • veth: Use tstats per-CPU traffic counters · 6f2684bf
      Peilin Ye authored

      Currently veth devices use the lstats per-CPU traffic counters, which only
      cover TX traffic. veth_get_stats64() actually populates RX stats of a veth
      device from its peer's TX counters, based on the assumption that a veth
      device can _only_ receive packets from its peer, which is no longer true:
      
      For example, recent CNIs (like Cilium) can use the bpf_redirect_peer() BPF
      helper to redirect traffic from NIC's tc ingress to veth's tc ingress (in
      a different netns), skipping veth's peer device. Unfortunately, this kind
      of traffic isn't currently accounted for in veth's RX stats.
      
      In preparation for the fix, use tstats (instead of lstats) to maintain
      both RX and TX counters for each veth device. We'll use RX counters for
      bpf_redirect_peer() traffic, and keep using TX counters for the usual
      "peer-to-peer" traffic. In veth_get_stats64(), calculate RX stats by
      _adding_ RX count to peer's TX count, in order to cover both kinds of
      traffic.
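
The accounting described above amounts to a simple sum. A sketch with hypothetical struct names, with per-CPU handling omitted:

```c
#include <stddef.h>
#include <stdint.h>

struct toy_tstats {
    uint64_t rx_packets; /* traffic redirected straight into this device */
    uint64_t tx_packets; /* usual traffic sent towards the peer */
};

struct toy_veth {
    struct toy_tstats stats;
    struct toy_veth *peer;
};

/* RX as reported to user space: own RX counter (redirected traffic)
 * plus the peer's TX counter (regular peer-to-peer traffic).
 */
static uint64_t toy_rx_packets(const struct toy_veth *dev)
{
    uint64_t rx = dev->stats.rx_packets;

    if (dev->peer)
        rx += dev->peer->stats.tx_packets;
    return rx;
}
```

With lstats only the second term existed, so redirected packets were invisible in the RX total.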
      
      veth_stats_rx() might need a name change (perhaps to "veth_stats_xdp()")
      for less confusion, but let's leave it to another patch to keep the fix
      minimal.
      
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Link: https://lore.kernel.org/r/20231114004220.6495-5-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • netkit: Add tstats per-CPU traffic counters · ae165827
      Daniel Borkmann authored

      Add dev->tstats traffic accounting to netkit. The latter contains per-CPU
      RX and TX counters.
      
      The dev's TX counters are bumped upon pass/unspec as well as redirect
      verdicts, in other words, on everything except for drops.
      
      The dev's RX counters are bumped upon successful __netif_rx(), as well
      as from skb_do_redirect() (not part of this commit here).
      
      Using dev->lstats with having just a single packets/bytes counter and
      inferring one another's RX counters from the peer dev's lstats is not
      possible given skb_do_redirect() can also bump the device's stats.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Link: https://lore.kernel.org/r/20231114004220.6495-4-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • net: Move {l,t,d}stats allocation to core and convert veth & vrf · 34d21de9
      Daniel Borkmann authored

      Move {l,t,d}stats allocation to the core and let netdevs pick the stats
      type they need. That way the driver doesn't have to bother with error
      handling (allocation failure checking, making sure free happens in the
      right spot, etc) - all happening in the core.
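
Conceptually, the driver only declares which stats type it wants and the core does the allocation and freeing. The sketch below uses hypothetical user-space types, and `calloc()` stands in for per-CPU allocation; the struct layouts are illustrative only:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

enum toy_stat_type {
    TOY_STAT_NONE,
    TOY_STAT_LSTATS,
    TOY_STAT_TSTATS,
    TOY_STAT_DSTATS,
};

struct toy_lstats { uint64_t packets, bytes; };
struct toy_tstats { uint64_t rx_packets, rx_bytes, tx_packets, tx_bytes; };
struct toy_dstats { uint64_t rx_packets, rx_drops, tx_packets, tx_drops; };

struct toy_netdev {
    enum toy_stat_type stat_type; /* picked by the driver */
    void *stats;                  /* owned by the core */
};

/* Core-side allocation: one place for size selection and error handling. */
static int toy_alloc_stats(struct toy_netdev *dev)
{
    size_t size;

    switch (dev->stat_type) {
    case TOY_STAT_NONE:
        return 0;
    case TOY_STAT_LSTATS:
        size = sizeof(struct toy_lstats);
        break;
    case TOY_STAT_TSTATS:
        size = sizeof(struct toy_tstats);
        break;
    case TOY_STAT_DSTATS:
        size = sizeof(struct toy_dstats);
        break;
    default:
        return -EINVAL; /* reject unknown type values */
    }

    dev->stats = calloc(1, size);
    return dev->stats ? 0 : -ENOMEM;
}

/* Core-side free, so drivers cannot forget it or do it at the wrong time. */
static void toy_free_stats(struct toy_netdev *dev)
{
    free(dev->stats);
    dev->stats = NULL;
}
```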
      
      Co-developed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Cc: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • net, vrf: Move dstats structure to core · 79e0c5be
      Daniel Borkmann authored

      Just move struct pcpu_dstats out of the vrf into the core, and streamline
      the field names slightly, so they better align with the {t,l}stats ones.
      
      No functional change otherwise. A conversion of the u64s to u64_stats_t
      could be done at a separate point in future. This move is needed as we are
      moving the {t,l,d}stats allocation/freeing to the core.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20231114004220.6495-2-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
  2. Nov 17, 2023
    • MAINTAINERS: Add netdev subsystem profile link · 76df934c
      Kees Cook authored

      The netdev subsystem has had a subsystem process document for a while
      now. Link it appropriately in MAINTAINERS with the P: tag.
      
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rxrpc-ack-fixes' · 3c15504a
      David S. Miller authored

      David Howells says:
      
      ====================
      rxrpc: ACK handling fixes
      
      Here are a couple of patches to fix ACK handling in AF_RXRPC:
      
       (1) Allow RTT determination to use an ACK of any type as the response from
           which to calculate RTT, provided ack.serial matches the serial number
           of the outgoing packet.
      
       (2) Defer the response to a PING ACK packet (or any ACK with the
           REQUEST_ACK flag set) until after we've parsed the packet so that we
           carry up to date information if the Tx or Rx rings are advanced.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Defer the response to a PING ACK until we've parsed it · 1a01319f
      David Howells authored

      Defer the generation of a PING RESPONSE ACK in response to a PING ACK until
      we've parsed the PING ACK, so that we pick up any changes to the packet
      queue and can update ackinfo.
      
      This is also applied to an ACK generated in response to an ACK with the
      REQUEST_ACK flag set.
      
      Note that whilst the problem was added in commit 248f219c, it didn't
      really matter at that point because the ACK was proposed in softirq mode
      and generated asynchronously later in process context, taking the latest
      values at the time.  But this fix is only needed since the move to parse
      incoming packets in an I/O thread rather than in softirq and generate the
      ACK at point of proposal (b0346843).
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Fix RTT determination to use any ACK as a source · 3798680f
      David Howells authored

      Fix RTT determination to be able to use any type of ACK as the response
      from which RTT can be calculated provided its ack.serial is non-zero and
      matches the serial number of an outgoing DATA or ACK packet.  This
      shouldn't be limited to REQUESTED-type ACKs as these can have other types
      substituted for them for things like duplicate or out-of-order packets.
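
The matching rule can be sketched as a lookup keyed only on the serial number, with the ACK type deliberately not consulted. The slot table and function names are hypothetical and do not reflect the rxrpc data structures:

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_SLOTS 4

struct toy_rtt_slot {
    uint32_t serial;  /* serial of an outgoing DATA/ACK packet, 0 = unused */
    uint64_t sent_ns; /* time the packet was sent */
};

/* Returns 1 and stores an RTT sample if ack_serial is non-zero and matches
 * an outstanding packet; returns 0 otherwise. The ACK type plays no role.
 */
static int toy_rtt_sample(struct toy_rtt_slot slots[TOY_NR_SLOTS],
                          uint32_t ack_serial, uint64_t now_ns,
                          uint64_t *rtt_ns)
{
    size_t i;

    if (ack_serial == 0)
        return 0;

    for (i = 0; i < TOY_NR_SLOTS; i++) {
        if (slots[i].serial == ack_serial) {
            *rtt_ns = now_ns - slots[i].sent_ns;
            slots[i].serial = 0; /* each probe yields one sample */
            return 1;
        }
    }
    return 0;
}
```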
      
      Fixes: 4700c4d8 ("rxrpc: Fix loss of RTT samples due to interposed ACK")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • kselftest: rtnetlink: fix ip route command typo · 75a50c4f
      Paolo Abeni authored

      The blamed commit below introduced a typo causing 'gretap' test-case
      failures:
      
      ./rtnetlink.sh  -t kci_test_gretap -v
      COMMAND: ip link add name test-dummy0 type dummy
      COMMAND: ip link set test-dummy0 up
      COMMAND: ip netns add testns
      COMMAND: ip link help gretap 2>&1 | grep -q '^Usage:'
      COMMAND: ip -netns testns link add dev gretap00 type gretap seq key 102 local 172.16.1.100 remote 172.16.1.200
      COMMAND: ip -netns testns addr add dev gretap00 10.1.1.100/24
      COMMAND: ip -netns testns link set dev gretap00 ups
          Error: either "dev" is duplicate, or "ups" is a garbage.
      COMMAND: ip -netns testns link del gretap00
      COMMAND: ip -netns testns link add dev gretap00 type gretap external
      COMMAND: ip -netns testns link del gretap00
      FAIL: gretap
      
      Fix it by using the correct keyword.
      
      Fixes: 9c2a19f7 ("kselftest: rtnetlink.sh: add verbose flag")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/ism: ism driver implies smc protocol · d565fa43
      Gerd Bayer authored

      Since commit a72178cf ("net/smc: Fix dependency of SMC on ISM")
      you can build the ism code without selecting the SMC network protocol.
      That leaves some ism functions reported as unused. Move these
      functions under the conditional compile with CONFIG_SMC.
      
      Also codify the suggestion to also configure the SMC protocol in ism's
      Kconfig - but with an "imply" rather than a "select", as SMC depends on
      other config options and this allows for a deliberate decision not to
      build SMC. Also, mention that in ISM's help.
      
      Fixes: a72178cf ("net/smc: Fix dependency of SMC on ISM")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Closes: https://lore.kernel.org/netdev/afd142a2-1fa0-46b9-8b2d-7652d41d3ab8@infradead.org/
      Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Fix some minor issues with bundle tracing · 0c3bd086
      David Howells authored

      Fix some superficial issues with the tracing of rxrpc_bundle structs,
      including:
      
       (1) Set the debug_id when the bundle is allocated rather than when it is
           set up so that the "NEW" trace line displays the correct bundle ID.
      
       (2) Show the refcount when emitting the "FREE" traceline.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmmac: dwmac-loongson: Add architecture dependency · 7fbd5fc2
      Jean Delvare authored

      Only present the DWMAC_LOONGSON option on architectures where it can
      actually be used.
      
      This follows the same logic as the DWMAC_INTEL option.
      
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Cc: Keguang Zhang <keguang.zhang@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • usb: aqc111: check packet for fixup for true limit · ccab434e
      Oliver Neukum authored

      If a device sends a packet whose length is between 0
      and sizeof(u64), the value passed to skb_trim()
      as length will wrap around, ending up as some very
      large value.
      
      The driver will then proceed to parse the header
      located at that position, which will either oops or
      process some random value.
      
      The fix is to check against sizeof(u64) rather than
      0, which the driver currently does. The issue exists
      since the introduction of the driver.
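
The wrap-around is ordinary unsigned arithmetic. A small user-space sketch with hypothetical helper names (this is not the driver code) showing the unchecked subtraction next to the guarded one:

```c
#include <stddef.h>
#include <stdint.h>

/* Unchecked: for pkt_len < sizeof(uint64_t) the unsigned subtraction wraps
 * around to a very large value.
 */
static size_t toy_trimmed_len_unchecked(size_t pkt_len)
{
    return pkt_len - sizeof(uint64_t);
}

/* Guarded: compare against the header size, not against 0.
 * Returns 0 and stores the trimmed length, or -1 for a too-short packet.
 */
static int toy_trimmed_len(size_t pkt_len, size_t *out)
{
    if (pkt_len < sizeof(uint64_t))
        return -1;
    *out = pkt_len - sizeof(uint64_t);
    return 0;
}
```

A check for `pkt_len != 0` alone lets lengths 1 through 7 through to the subtraction, which is exactly the range the fix closes.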
      
      Signed-off-by: Oliver Neukum <oneukum@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Nov 16, 2023
    • Merge tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 7475e51b
      Linus Torvalds authored

      Pull networking fixes from Paolo Abeni:
       "Including fixes from BPF and netfilter.
      
        Current release - regressions:
      
         - core: fix undefined behavior in netdev name allocation
      
         - bpf: do not allocate percpu memory at init stage
      
         - netfilter: nf_tables: split async and sync catchall in two
           functions
      
         - mptcp: fix possible NULL pointer dereference on close
      
        Current release - new code bugs:
      
         - eth: ice: dpll: fix initial lock status of dpll
      
        Previous releases - regressions:
      
         - bpf: fix precision backtracking instruction iteration
      
         - af_unix: fix use-after-free in unix_stream_read_actor()
      
         - tipc: fix kernel-infoleak due to uninitialized TLV value
      
         - eth: bonding: stop the device in bond_setup_by_slave()
      
         - eth: mlx5:
            - fix double free of encap_header
            - avoid referencing skb after free-ing in drop path
      
         - eth: hns3: fix VF reset
      
         - eth: mvneta: fix calls to page_pool_get_stats
      
        Previous releases - always broken:
      
         - core: set SOCK_RCU_FREE before inserting socket into hashtable
      
         - bpf: fix control-flow graph checking in privileged mode
      
         - eth: ppp: limit MRU to 64K
      
         - eth: stmmac: avoid rx queue overrun
      
         - eth: icssg-prueth: fix error cleanup on failing initialization
      
         - eth: hns3: fix out-of-bounds access may occur when coalesce info is
           read via debugfs
      
         - eth: cortina: handle large frames
      
        Misc:
      
         - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45"
      
      * tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
        macvlan: Don't propagate promisc change to lower dev in passthru
        net: sched: do not offload flows with a helper in act_ct
        net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors
        net/mlx5e: Check return value of snprintf writing to fw_version buffer
        net/mlx5e: Reduce the size of icosq_str
        net/mlx5: Increase size of irq name buffer
        net/mlx5e: Update doorbell for port timestamping CQ before the software counter
        net/mlx5e: Track xmit submission to PTP WQ after populating metadata map
        net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe
        net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload
        net/mlx5e: Fix pedit endianness
        net/mlx5e: fix double free of encap_header in update funcs
        net/mlx5e: fix double free of encap_header
        net/mlx5: Decouple PHC .adjtime and .adjphase implementations
        net/mlx5: DR, Allow old devices to use multi destination FTE
        net/mlx5: Free used cpus mask when an IRQ is released
        Revert "net/mlx5: DR, Supporting inline WQE when possible"
        bpf: Do not allocate percpu memory at init stage
        net: Fix undefined behavior in netdev name allocation
        dt-bindings: net: ethernet-controller: Fix formatting error
        ...
    • Merge tag 'for-linus-6.7a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 6eb1acd9
      Linus Torvalds authored

      Pull xen updates from Juergen Gross:
      
       - A fix in the Xen events driver avoiding the use of RCU after
         the call to rcu_report_dead() when taking a cpu down
      
       - A fix for running as Xen dom0 to line up ACPI's idea of power
         management capabilities with the one of Xen
      
       - A cleanup eliminating several kernel-doc warnings in Xen related
         code
      
       - A cleanup series of the Xen events driver
      
      * tag 'for-linus-6.7a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/events: remove some info_for_irq() calls in pirq handling
        xen/events: modify internal [un]bind interfaces
        xen/events: drop xen_allocate_irqs_dynamic()
        xen/events: remove some simple helpers from events_base.c
        xen/events: reduce externally visible helper functions
        xen/events: remove unused functions
        xen/events: fix delayed eoi list handling
        xen/shbuf: eliminate 17 kernel-doc warnings
        acpi/processor: sanitize _OSC/_PDC capabilities for Xen dom0
        xen/events: avoid using info_for_irq() in xen_send_IPI_one()
      6eb1acd9
    • Linus Torvalds's avatar
      Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · 372bed5f
      Linus Torvalds authored
      Pull virtio fixes from Michael Tsirkin:
       "Bugfixes all over the place"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
        vhost-vdpa: fix use after free in vhost_vdpa_probe()
        virtio_pci: Switch away from deprecated irq_set_affinity_hint
        riscv, qemu_fw_cfg: Add support for RISC-V architecture
        vdpa_sim_blk: allocate the buffer zeroed
        virtio_pci: move structure to a header
      372bed5f
    • Paolo Abeni's avatar
      Merge tag 'nf-23-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · cff088d9
      Paolo Abeni authored
      
      
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Remove unused variable causing compilation warning in nft_set_rbtree,
         from Yang Li. This unused variable is a leftover from the previous
         merge window.
      
      2) Possible return of an uninitialized value in nf_conntrack_bridge, from
         Linkui Xiao. This has been there since nf_conntrack_bridge became
         available.
      
      3) Fix incorrect pointer math in nft_byteorder, from Dan Carpenter.
         Problem has been there since 2016.
      
      4) Fix bogus error in destroy set element command. Problem is there
         since this new destroy command was added.
      
      5) Fix race condition in ipset between swap and destroy commands and
         add/del/test control plane. This problem is there since ipset was
         merged.
      
      6) Split async and sync catchall GC into two functions to fix unsafe
         iteration over RCU. This is a fix-for-fix that was included in
         the previous pull request.
      
      * tag 'nf-23-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: split async and sync catchall in two functions
        netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
        netfilter: nf_tables: bogus ENOENT when destroying element which does not exist
        netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()
        netfilter: nf_conntrack_bridge: initialize err to 0
        netfilter: nft_set_rbtree: Remove unused variable nft_net
      ====================
      
      Link: https://lore.kernel.org/r/20231115184514.8965-1-pablo@netfilter.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      cff088d9
    • Vlad Buslov's avatar
      macvlan: Don't propagate promisc change to lower dev in passthru · 7e1caeac
      Vlad Buslov authored
      Macvlan device in passthru mode sets its lower device promiscuous mode
      according to its MACVLAN_FLAG_NOPROMISC flag instead of synchronizing it to
      its own promiscuity setting. However, macvlan_change_rx_flags() function
      doesn't check the mode before propagating such changes to the lower device
      which can cause net_device->promiscuity counter overflow as illustrated by
      reproduction example [0] and resulting dmesg log [1]. Fix the issue by
      first verifying the mode in macvlan_change_rx_flags() function before
      propagating promiscuous mode change to the lower device.
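      The mode check described above can be sketched with a toy model. This is
      not the kernel code: the struct names, fields, and toy_change_rx_flags()
      are hypothetical stand-ins for the macvlan device, its lower device, and
      macvlan_change_rx_flags().

```c
#include <stdbool.h>

/* Toy model for illustration only; none of these names exist in the kernel. */
struct toy_lower_dev {
    unsigned int promiscuity;   /* stands in for net_device->promiscuity */
};

struct toy_macvlan {
    struct toy_lower_dev *lower;
    bool passthru;              /* device is in passthru mode */
    bool up;                    /* stands in for IFF_UP */
    bool promisc;               /* stands in for IFF_PROMISC on the macvlan */
};

/* Called after the macvlan's own rx flags changed. */
void toy_change_rx_flags(struct toy_macvlan *vlan, bool promisc_changed)
{
    if (!vlan->up || !promisc_changed)
        return;

    /* In passthru mode the lower device's promiscuity is driven by the
     * NOPROMISC flag, so the macvlan's own setting is not propagated. */
    if (vlan->passthru)
        return;

    if (vlan->promisc)
        vlan->lower->promiscuity++;
    else
        vlan->lower->promiscuity--;
}
```

      With the early return in place, toggling promisc on a passthru device no
      longer changes the lower device's counter, so the up/down sequence in
      [0] cannot drive it out of range in this model.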
      
      [0]:
      ip link add macvlan1 link enp8s0f0 type macvlan mode passthru
      ip link set macvlan1 promisc on
      ip l set dev macvlan1 up
      ip link set macvlan1 promisc off
      ip l set dev macvlan1 down
      ip l set dev macvlan1 up
      
      [1]:
      [ 5156.281724] macvlan1: entered promiscuous mode
      [ 5156.285467] mlx5_core 0000:08:00.0 enp8s0f0: entered promiscuous mode
      [ 5156.287639] macvlan1: left promiscuous mode
      [ 5156.288339] mlx5_core 0000:08:00.0 enp8s0f0: left promiscuous mode
      [ 5156.290907] mlx5_core 0000:08:00.0 enp8s0f0: entered promiscuous mode
      [ 5156.317197] mlx5_core 0000:08:00.0 enp8s0f0: promiscuity touches roof, set promiscuity failed. promiscuity feature of device might be broken.
      
      Fixes: efdbd2b3 ("macvlan: Propagate promiscuity setting to lower devices.")
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20231114175915.1649154-1-vladbu@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7e1caeac
    • Xin Long's avatar
      net: sched: do not offload flows with a helper in act_ct · 7cd5af0e
      Xin Long authored
      There is no hardware supporting ct helper offload. However, prior to this
      patch, a flower filter with a helper in the ct action can be successfully
      set into the HW, for example (eth1 is a bnxt NIC):
      
        # tc qdisc add dev eth1 ingress_block 22 ingress
        # tc filter add block 22 proto ip flower skip_sw ip_proto tcp \
          dst_port 21 ct_state -trk action ct helper ipv4-tcp-ftp
        # tc filter show dev eth1 ingress
      
          filter block 22 protocol ip pref 49152 flower chain 0 handle 0x1
            eth_type ipv4
            ip_proto tcp
            dst_port 21
            ct_state -trk
            skip_sw
            in_hw in_hw_count 1   <----
              action order 1: ct zone 0 helper ipv4-tcp-ftp pipe
               index 2 ref 1 bind 1
              used_hw_stats delayed
      
      This might cause the flower filter not to work as expected in the HW.
      
      This patch avoids the problem by returning -EOPNOTSUPP from
      tcf_ct_offload_act_setup() so that flows with a helper in act_ct are
      not offloaded.
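      As a rough illustration of the early return described above, assuming a
      toy action structure (toy_ct_action and toy_ct_offload_setup() are
      hypothetical and do not mirror the kernel's tcf_ct_offload_act_setup()
      signature):

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a ct action; only the helper pointer matters here. */
struct toy_ct_action {
    const void *helper;   /* non-NULL when a conntrack helper is attached */
};

/* Refuse to set up hardware offload when a helper is attached. */
int toy_ct_offload_setup(const struct toy_ct_action *act)
{
    if (act->helper)
        return -EOPNOTSUPP;   /* no hardware supports ct helper offload */

    /* ... the normal offload setup would follow here ... */
    return 0;
}
```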
      
      Fixes: a21b06e7 ("net: sched: add helper support in act_ct")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/f8685ec7702c4a448a1371a8b34b43217b583b9d.1699898008.git.lucien.xin@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7cd5af0e
    • Jakub Kicinski's avatar
      Merge branch 'mlx5-fixes-2023-11-13-manual' · bdc454fc
      Jakub Kicinski authored
      
      
      Saeed Mahameed says:
      
      ====================
      This series provides bug fixes to mlx5 driver.
      ====================
      
      Link: https://lore.kernel.org/r/20231114215846.5902-1-saeed@kernel.org/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bdc454fc
    • Rahul Rameshbabu's avatar
      net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors · 1b2bd0c0
      Rahul Rameshbabu authored
      Treat the operation as an error case when the return value is equal to
      the size of the fw_version buffer. In that case the full string,
      including its null terminator, did not fit in the buffer, so the result
      is malformed and should not be used. Provide a string with only the
      firmware version when forming the string with the board id fails. This
      logic for representors is identical to the normal flow with ethtool.
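      A minimal userspace sketch of the fallback described above.
      format_fw_version() is a hypothetical helper, not the driver function,
      and the >= comparison is this sketch's choice: in standard C, snprintf
      returns the length the complete string would need (excluding the
      terminator), so any value at or above the buffer size means the output
      was cut short.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "maj.min.sub (board_id)" into buf, falling back
 * to the bare "maj.min.sub" when the longer form does not fit. */
int format_fw_version(char *buf, size_t size, int maj, int min, int sub,
                      const char *board_id)
{
    int count = snprintf(buf, size, "%d.%d.%04d (%.16s)",
                         maj, min, sub, board_id);

    /* count is the untruncated length, so count >= size signals truncation. */
    if (count < 0 || (size_t)count >= size)
        count = snprintf(buf, size, "%d.%d.%04d", maj, min, sub);

    return count;
}
```

      The board id and version numbers passed in are up to the caller; the
      helper only decides which of the two forms fits.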
      
      Without this check, building with W=1 triggers -Wformat-truncation.
      
          drivers/net/ethernet/mellanox/mlx5/core/en_rep.c: In function 'mlx5e_rep_get_drvinfo':
          drivers/net/ethernet/mellanox/mlx5/core/en_rep.c:78:31: warning: '%.16s' directive output may be truncated writing up to 16 bytes into a region of size between 13 and 22 [-Wformat-truncation=]
            78 |                  "%d.%d.%04d (%.16s)",
               |                               ^~~~~
          drivers/net/ethernet/mellanox/mlx5/core/en_rep.c:77:9: note: 'snprintf' output between 12 and 37 bytes into a destination of size 32
            77 |         snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version),
               |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            78 |                  "%d.%d.%04d (%.16s)",
               |                  ~~~~~~~~~~~~~~~~~~~~~
            79 |                  fw_rev_maj(mdev), fw_rev_min(mdev),
               |                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            80 |                  fw_rev_sub(mdev), mdev->board_id);
               |                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fixes: cf83c8fd ("net/mlx5e: Add missing ethtool driver info for representors")
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4ab2e9
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-16-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1b2bd0c0
    • Rahul Rameshbabu's avatar
      net/mlx5e: Check return value of snprintf writing to fw_version buffer · 41e63c2b
      Rahul Rameshbabu authored
      Treat the operation as an error case when the return value is equal to
      the size of the fw_version buffer. In that case the full string,
      including its null terminator, did not fit in the buffer, so the result
      is malformed and should not be used. Provide a string with only the
      firmware version when forming the string with the board id fails.
      
      Without this check, building with W=1 triggers -Wformat-truncation.
      
          drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c: In function 'mlx5e_ethtool_get_drvinfo':
          drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c:49:31: warning: '%.16s' directive output may be truncated writing up to 16 bytes into a region of size between 13 and 22 [-Wformat-truncation=]
            49 |                  "%d.%d.%04d (%.16s)",
               |                               ^~~~~
          drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c:48:9: note: 'snprintf' output between 12 and 37 bytes into a destination of size 32
            48 |         snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version),
               |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            49 |                  "%d.%d.%04d (%.16s)",
               |                  ~~~~~~~~~~~~~~~~~~~~~
            50 |                  fw_rev_maj(mdev), fw_rev_min(mdev), fw_rev_sub(mdev),
               |                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            51 |                  mdev->board_id);
               |                  ~~~~~~~~~~~~~~~
      
      Fixes: 84e11edb ("net/mlx5e: Show board id in ethtool driver information")
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4ab2e9
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      41e63c2b
    • Saeed Mahameed's avatar
      net/mlx5e: Reduce the size of icosq_str · dce94142
      Saeed Mahameed authored
      icosq_str is unnecessarily long, which causes a -Wformat-truncation
      build warning with W=1. Looking closely, it doesn't need to be 255B,
      hence this patch reduces the size to 32B, which should be more than
      enough to host the string: "ICOSQ: 0x%x, ".
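      The 32B bound can be checked by measuring the longest string the format
      can produce. The sketch below assumes a 32-bit unsigned int;
      icosq_str_max_len() is a hypothetical helper, not part of the driver.

```c
#include <limits.h>
#include <stdio.h>

/* Longest output of "ICOSQ: 0x%x, " for any unsigned int, excluding the
 * null terminator. With a NULL buffer and size 0, snprintf only measures. */
int icosq_str_max_len(void)
{
    return snprintf(NULL, 0, "ICOSQ: 0x%x, ", UINT_MAX);
}
```

      With a 32-bit unsigned int this is 7 characters for "ICOSQ: ", 2 for
      "0x", 8 hex digits, and 2 for ", ", so 19 characters plus the
      terminator fit comfortably in 32 bytes.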
      
      While here, add a missing space in the formatted string.
      
      This fixes the following build warning:
      
      $ KCFLAGS='-Wall -Werror'
      $ make O=/tmp/kbuild/linux W=1 -s -j12 drivers/net/ethernet/mellanox/mlx5/core/
      
      drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c: In function 'mlx5e_reporter_rx_timeout':
      drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c:718:56:
      error: ', CQ: 0x' directive output may be truncated writing 8 bytes into a region of size between 0 and 255 [-Werror=format-truncation=]
        718 |                  "RX timeout on channel: %d, %sRQ: 0x%x, CQ: 0x%x",
            |                                                        ^~~~~~~~
      drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c:717:9: note: 'snprintf' output between 43 and 322 bytes into a destination of size 288
        717 |         snprintf(err_str, sizeof(err_str),
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        718 |                  "RX timeout on channel: %d, %sRQ: 0x%x, CQ: 0x%x",
            |                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        719 |                  rq->ix, icosq_str, rq->rqn, rq->cq.mcq.cqn);
            |                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fixes: 521f31af ("net/mlx5e: Allow RQ outside of channel context")
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4ab2e9
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-14-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dce94142
    • Rahul Rameshbabu's avatar
      net/mlx5: Increase size of irq name buffer · 3338bebf
      Rahul Rameshbabu authored
      Without the increased buffer size, the snprintf operation writing to the
      buffer triggers -Wformat-truncation with W=1.
      
          drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c: In function 'mlx5_irq_alloc':
          drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c:296:7: error: '@pci:' directive output may be truncated writing 5 bytes into a region of size between 1 and 32 [-Werror=format-truncation=]
            296 |    "%s@pci:%s", name, pci_name(dev->pdev));
                |       ^~~~~
          drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c:295:2: note: 'snprintf' output 6 or more bytes (assuming 37) into a destination of size 32
            295 |  snprintf(irq->name, MLX5_MAX_IRQ_NAME,
                |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            296 |    "%s@pci:%s", name, pci_name(dev->pdev));
                |    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fixes: ada9f5d0 ("IB/mlx5: Fix eq names to display nicely in /proc/interrupts")
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4ab2e9
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-13-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3338bebf
    • Rahul Rameshbabu's avatar
      net/mlx5e: Update doorbell for port timestamping CQ before the software counter · 92214be5
      Rahul Rameshbabu authored
      Previously, mlx5e_ptp_poll_ts_cq would update the device doorbell with the
      incremented consumer index after the relevant software counters in the
      kernel were updated. In the mlx5e_sq_xmit_wqe context, this would lead to
      either overrunning the device CQ or exceeding the expected software buffer
      size in the device CQ if the device CQ size was greater than the software
      buffer size. Update the relevant software counter only after updating the
      device CQ consumer index in the port timestamping napi_poll context.
      
      Log:
          mlx5_core 0000:08:00.0: cq_err_event_notifier:517:(pid 0): CQ error on CQN 0x487, syndrome 0x1
          mlx5_core 0000:08:00.0 eth2: mlx5e_cq_error_event: cqn=0x000487 event=0x04
      
      Fixes: 1880bc4e ("net/mlx5e: Add TX port timestamp support")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-12-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      92214be5
    • Rahul Rameshbabu's avatar
      net/mlx5e: Track xmit submission to PTP WQ after populating metadata map · 7e3f3ba9
      Rahul Rameshbabu authored
      Ensure the skb is available in metadata mapping to skbs before tracking the
      metadata index for detecting undelivered CQEs. If the metadata index is put
      in the tracking list before putting the skb in the map, the metadata index
      might be used for detecting undelivered CQEs before the relevant skb is
      available in the map, which can lead to a null-ptr-deref.
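      The ordering described above can be sketched with a toy, single-threaded
      model. All names here (toy_ptp, toy_submit(), toy_scan_undelivered())
      are hypothetical; the real driver also has to deal with concurrency
      between the xmit and napi_poll contexts, which this sketch ignores.

```c
#include <stddef.h>

#define TOY_SLOTS 4

struct toy_skb {
    int len;
};

struct toy_ptp {
    struct toy_skb *map[TOY_SLOTS];   /* metadata index -> skb */
    int tracked[TOY_SLOTS];           /* indices still awaiting a CQE */
    int ntracked;
};

/* Publish the skb in the map first, then start tracking its index. */
void toy_submit(struct toy_ptp *p, int idx, struct toy_skb *skb)
{
    p->map[idx] = skb;                  /* step 1: skb is reachable */
    p->tracked[p->ntracked++] = idx;    /* step 2: index becomes visible */
}

/* Walk tracked indices and dereference the mapped skbs. This is only safe
 * because every tracked index already has a map entry. */
int toy_scan_undelivered(const struct toy_ptp *p)
{
    int total = 0;

    for (int i = 0; i < p->ntracked; i++)
        total += p->map[p->tracked[i]]->len;
    return total;
}
```

      Swapping the two steps in toy_submit() would let a scan that runs
      between them follow a NULL map entry, which is the failure mode the
      commit describes.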
      
      Log:
          general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN
          KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
          CPU: 0 PID: 1243 Comm: kworker/0:2 Not tainted 6.6.0-rc4+ #108
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
          Workqueue: events mlx5e_rx_dim_work [mlx5_core]
          RIP: 0010:mlx5e_ptp_napi_poll+0x9a4/0x2290 [mlx5_core]
          Code: 8c 24 38 cc ff ff 4c 8d 3c c1 4c 89 f9 48 c1 e9 03 42 80 3c 31 00 0f 85 97 0f 00 00 4d 8b 3f 49 8d 7f 28 48 89 f9 48 c1 e9 03 <42> 80 3c 31 00 0f 85 8b 0f 00 00 49 8b 47 28 48 85 c0 0f 84 05 07
          RSP: 0018:ffff8884d3c09c88 EFLAGS: 00010206
          RAX: 0000000000000069 RBX: ffff8881160349d8 RCX: 0000000000000005
          RDX: ffffed10218f48cf RSI: 0000000000000004 RDI: 0000000000000028
          RBP: ffff888122707700 R08: 0000000000000001 R09: ffffed109a781383
          R10: 0000000000000003 R11: 0000000000000003 R12: ffff88810c7a7a40
          R13: ffff888122707700 R14: dffffc0000000000 R15: 0000000000000000
          FS:  0000000000000000(0000) GS:ffff8884d3c00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007f4f878dd6e0 CR3: 000000014d108002 CR4: 0000000000370eb0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
          <IRQ>
          ? die_addr+0x3c/0xa0
          ? exc_general_protection+0x144/0x210
          ? asm_exc_general_protection+0x22/0x30
          ? mlx5e_ptp_napi_poll+0x9a4/0x2290 [mlx5_core]
          ? mlx5e_ptp_napi_poll+0x8f6/0x2290 [mlx5_core]
          __napi_poll.constprop.0+0xa4/0x580
          net_rx_action+0x460/0xb80
          ? _raw_spin_unlock_irqrestore+0x32/0x60
          ? __napi_poll.constprop.0+0x580/0x580
          ? tasklet_action_common.isra.0+0x2ef/0x760
          __do_softirq+0x26c/0x827
          irq_exit_rcu+0xc2/0x100
          common_interrupt+0x7f/0xa0
          </IRQ>
          <TASK>
          asm_common_interrupt+0x22/0x40
          RIP: 0010:__kmem_cache_alloc_node+0xb/0x330
          Code: 41 5d 41 5e 41 5f c3 8b 44 24 14 8b 4c 24 10 09 c8 eb d5 e8 b7 43 ca 01 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 <41> 56 41 89 d6 41 55 41 89 f5 41 54 49 89 fc 53 48 83 e4 f0 48 83
          RSP: 0018:ffff88812c4079c0 EFLAGS: 00000246
          RAX: 1ffffffff083c7fe RBX: ffff888100042dc0 RCX: 0000000000000218
          RDX: 00000000ffffffff RSI: 0000000000000dc0 RDI: ffff888100042dc0
          RBP: ffff88812c4079c8 R08: ffffffffa0289f96 R09: ffffed1025880ea9
          R10: ffff888138839f80 R11: 0000000000000002 R12: 0000000000000dc0
          R13: 0000000000000100 R14: 000000000000008c R15: ffff8881271fc450
          ? cmd_exec+0x796/0x2200 [mlx5_core]
          kmalloc_trace+0x26/0xc0
          cmd_exec+0x796/0x2200 [mlx5_core]
          mlx5_cmd_do+0x22/0xc0 [mlx5_core]
          mlx5_cmd_exec+0x17/0x30 [mlx5_core]
          mlx5_core_modify_cq_moderation+0x139/0x1b0 [mlx5_core]
          ? mlx5_add_cq_to_tasklet+0x280/0x280 [mlx5_core]
          ? lockdep_set_lock_cmp_fn+0x190/0x190
          ? process_one_work+0x659/0x1220
          mlx5e_rx_dim_work+0x9d/0x100 [mlx5_core]
          process_one_work+0x730/0x1220
          ? lockdep_hardirqs_on_prepare+0x400/0x400
          ? max_active_store+0xf0/0xf0
          ? assign_work+0x168/0x240
          worker_thread+0x70f/0x12d0
          ? __kthread_parkme+0xd1/0x1d0
          ? process_one_work+0x1220/0x1220
          kthread+0x2d9/0x3b0
          ? kthread_complete_and_exit+0x20/0x20
          ret_from_fork+0x2d/0x70
          ? kthread_complete_and_exit+0x20/0x20
          ret_from_fork_asm+0x11/0x20
          </TASK>
          Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_ib ib_uverbs ib_core zram zsmalloc mlx5_core fuse
          ---[ end trace 0000000000000000 ]---
      
      Fixes: 3178308a ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-11-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7e3f3ba9
    • Rahul Rameshbabu's avatar
      net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe · 64f14d16
      Rahul Rameshbabu authored
      When SQ is a port timestamping SQ for PTP, do not access tx flags of skb
      after free-ing the skb. Free the skb only after all references that depend
      on it have been handled in the dropped WQE path.
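      A toy userspace sketch of the same rule, reading everything that depends
      on the buffer before releasing it. toy_skb, TOY_TX_IN_PROGRESS, and
      toy_drop() are hypothetical names, and free() stands in for the
      kernel's skb release.

```c
#include <stdbool.h>
#include <stdlib.h>

#define TOY_TX_IN_PROGRESS 0x1u

struct toy_skb {
    unsigned int tx_flags;
};

/* Drop path: returns whether a timestamp was pending for the dropped skb. */
bool toy_drop(struct toy_skb *skb)
{
    /* Read the flags while the skb is still valid ... */
    bool ts_pending = (skb->tx_flags & TOY_TX_IN_PROGRESS) != 0;

    /* ... and free it only after the last access. */
    free(skb);
    return ts_pending;
}
```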
      
      Fixes: 3178308a ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-10-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      64f14d16
    • Jianbo Liu's avatar
      net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload · bdf788cf
      Jianbo Liu authored
      As IPSec packet offload in switchdev mode is not supported with LAG,
      it's unnecessary to modify those sent-to-vport rules to the peer eswitch.
      
      Fixes: c6c2bf5d ("net/mlx5e: Support IPsec packet offload for TX in switchdev mode")
      Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-9-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bdf788cf
    • Vlad Buslov's avatar
      net/mlx5e: Fix pedit endianness · 0c101a23
      Vlad Buslov authored
      The referenced commit addressed an endianness issue in the mlx5 pedit
      implementation in an ad hoc manner instead of systematically treating
      integer values according to their types. This left pedit fields broken
      on big endian machines when their size is not equal to 4 and the bytes
      being modified are not the least significant ones, since the wrong bits
      are consumed during parsing. This leads to the following example error
      when applying pedit to source and destination MAC addresses:
      
      [Wed Oct 18 12:52:42 2023] mlx5_core 0001:00:00.1 p1v3_r: attempt to offload an unsupported field (cmd 0)
      [Wed Oct 18 12:52:42 2023] mask: 00000000330c5b68: 00 00 00 00 ff ff 00 00 00 00 ff ff 00 00 00 00  ................
      [Wed Oct 18 12:52:42 2023] mask: 0000000017d22fd9: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [Wed Oct 18 12:52:42 2023] mask: 000000008186d717: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [Wed Oct 18 12:52:42 2023] mask: 0000000029eb6149: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [Wed Oct 18 12:52:42 2023] mask: 000000007ed103e4: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [Wed Oct 18 12:52:42 2023] mask: 00000000db8101a6: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [Wed Oct 18 12:52:42 2023] mask: 00000000ec3c08a9: 00 00 00 00 00 00 00 00 00 00 00 00              ............
      
      Treat masks and values of pedit and filter match as network byte order,
      refactor pointers to them to void pointers instead of confusing u32
      pointers and only cast to pointer-to-integer when reading a value from
      them. Treat pedit mlx5_fields->field_mask as host byte order according to
      its type u32, change the constants in fields array accordingly.
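      To illustrate the approach of keeping the data behind a void pointer and
      converting only at the point of reading, here is a portable sketch.
      toy_read_be() is a hypothetical helper, not a function from the driver;
      it assembles a host-order integer from network-order bytes so the
      result does not depend on the machine's endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian (network byte order) field of 1 to 4 bytes located at
 * byte offset off from base and return it as a host-order integer. */
uint32_t toy_read_be(const void *base, size_t off, size_t size)
{
    const unsigned char *p = (const unsigned char *)base + off;
    uint32_t v = 0;

    /* Most significant byte comes first in network byte order. */
    for (size_t i = 0; i < size && i < 4; i++)
        v = (v << 8) | p[i];
    return v;
}
```

      Reading a 2-byte field this way yields the same value on little and big
      endian hosts, whereas dereferencing the same bytes through a u32
      pointer would place them in different bit positions depending on the
      host.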
      
      Fixes: 82198d8b ("net/mlx5e: Fix endianness when calculating pedit mask first bit")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20231114215846.5902-8-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0c101a23