  1. Oct 17, 2023
    • nfc: nci: fix possible NULL pointer dereference in send_acknowledge() · 7937609c
      Krzysztof Kozlowski authored
      
      
      Handle memory allocation failure from nci_skb_alloc() (calling
      alloc_skb()) to avoid possible NULL pointer dereference.
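
      The pattern can be sketched in userspace C. This is an illustration
      only, not the driver code: malloc() stands in for nci_skb_alloc(), and
      the names frame_alloc() and send_ack() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb; not a kernel type. */
struct frame {
    unsigned char data[2];
};

/* Stand-in for an allocator that may return NULL.
 * The fail flag lets a caller simulate an allocation failure. */
static struct frame *frame_alloc(int fail)
{
    return fail ? NULL : malloc(sizeof(struct frame));
}

/* Returns 0 on success, or -ENOMEM when the allocation fails,
 * instead of writing through a NULL pointer. */
static int send_ack(unsigned char ack, int simulate_failure)
{
    struct frame *f = frame_alloc(simulate_failure);

    if (!f)
        return -ENOMEM;

    memset(f->data, 0, sizeof(f->data));
    f->data[0] = ack;
    free(f);
    return 0;
}
```

      A caller would treat a negative return value as a failed send rather
      than assuming the buffer exists.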
      
      Reported-by: 黄思聪 <huangsicong@iie.ac.cn>
      Fixes: 391d8a2d ("NFC: Add NCI over SPI receive")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20231013184129.18738-1-krzysztof.kozlowski@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: Correct offload_xstats size · 503930f8
      Christoph Paasch authored
      rtnl_offload_xstats_get_size_hw_s_info_one() conditionalizes the
      size-computation for IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED based on whether
      or not the device has offload_xstats enabled.
      
      However, rtnl_offload_xstats_fill_hw_s_info_one() is adding the u8 for
      that field unconditionally.
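
      The mismatch can be modeled with plain arithmetic. The sketch below is
      a userspace model, not the rtnetlink code: the 4-byte attribute header
      and 4-byte alignment are assumptions of the model, and all function
      names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed attribute header size and alignment for this model. */
#define ATTR_HDR   4u
#define ATTR_ALIGN 4u

/* Aligned size of one attribute carrying `payload` bytes. */
static size_t attr_total_size(size_t payload)
{
    size_t len = ATTR_HDR + payload;

    return (len + ATTR_ALIGN - 1) & ~(size_t)(ATTR_ALIGN - 1);
}

/* Size estimate as described before the fix: the "used" u8 is
 * counted only when offload stats are enabled. */
static size_t size_estimate_old(bool enabled)
{
    size_t size = attr_total_size(0);   /* nest header */

    size += attr_total_size(1);         /* "request" u8 */
    if (enabled)
        size += attr_total_size(1);     /* "used" u8 */
    return size;
}

/* Size estimate that counts the "used" u8 unconditionally. */
static size_t size_estimate_fixed(void)
{
    return attr_total_size(0) + attr_total_size(1) + attr_total_size(1);
}

/* The fill side always emits both u8 attributes. */
static size_t bytes_filled(void)
{
    return attr_total_size(0) + attr_total_size(1) + attr_total_size(1);
}
```

      With stats disabled, the old estimate is smaller than what the fill
      side writes, so a buffer sized from it can be too small.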
      
      syzkaller triggered a WARNING in rtnl_stats_get due to this:
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 754 at net/core/rtnetlink.c:5982 rtnl_stats_get+0x2f4/0x300
      Modules linked in:
      CPU: 0 PID: 754 Comm: syz-executor148 Not tainted 6.6.0-rc2-g331b78eb12af #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      RIP: 0010:rtnl_stats_get+0x2f4/0x300 net/core/rtnetlink.c:5982
      Code: ff ff 89 ee e8 7d 72 50 ff 83 fd a6 74 17 e8 33 6e 50 ff 4c 89 ef be 02 00 00 00 e8 86 00 fa ff e9 7b fe ff ff e8 1c 6e 50 ff <0f> 0b eb e5 e8 73 79 7b 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90
      RSP: 0018:ffffc900006837c0 EFLAGS: 00010293
      RAX: ffffffff81cf7f24 RBX: ffff8881015d9000 RCX: ffff888101815a00
      RDX: 0000000000000000 RSI: 00000000ffffffa6 RDI: 00000000ffffffa6
      RBP: 00000000ffffffa6 R08: ffffffff81cf7f03 R09: 0000000000000001
      R10: ffff888101ba47b9 R11: ffff888101815a00 R12: ffff8881017dae00
      R13: ffff8881017dad00 R14: ffffc90000683ab8 R15: ffffffff83c1f740
      FS:  00007fbc22dbc740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000046 CR3: 000000010264e003 CR4: 0000000000170ef0
      Call Trace:
       <TASK>
       rtnetlink_rcv_msg+0x677/0x710 net/core/rtnetlink.c:6480
       netlink_rcv_skb+0xea/0x1c0 net/netlink/af_netlink.c:2545
       netlink_unicast+0x430/0x500 net/netlink/af_netlink.c:1342
       netlink_sendmsg+0x4fc/0x620 net/netlink/af_netlink.c:1910
       sock_sendmsg+0xa8/0xd0 net/socket.c:730
       ____sys_sendmsg+0x22a/0x320 net/socket.c:2541
       ___sys_sendmsg+0x143/0x190 net/socket.c:2595
       __x64_sys_sendmsg+0xd8/0x150 net/socket.c:2624
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      RIP: 0033:0x7fbc22e8d6a9
      Code: 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 37 0d 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffc4320e778 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004007d0 RCX: 00007fbc22e8d6a9
      RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000004007d0
      R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc4320e898
      R13: 00007ffc4320e8a8 R14: 00000000004004a0 R15: 00007fbc22fa5a80
       </TASK>
      ---[ end trace 0000000000000000 ]---
      
      Which didn't happen prior to commit bf9f1baa ("net: add dedicated
      kmem_cache for typical/small skb->head") as the skb always was large
      enough.
      
      Fixes: 0e7788fd ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats")
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Link: https://lore.kernel.org/r/20231013041448.8229-1-cpaasch@apple.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/smc: return the right falback reason when prefix checks fail · 4abbd2e3
      Dust Li authored
      In smc_listen_work(), if smc_listen_prfx_check() failed,
      the real reason, SMC_CLC_DECL_DIFFPREFIX, was dropped and
      SMC_CLC_DECL_NOSMCDEV was returned instead.

      Although this is also a kind of SMC_CLC_DECL_NOSMCDEV, returning
      the real reason is much friendlier for debugging.
      
      Fixes: e49300a6 ("net/smc: add listen processing for SMC-Rv2")
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Link: https://lore.kernel.org/r/20231012123729.29307-1-dust.li@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. Oct 16, 2023
    • Merge branch 'ovs-selftests' · 883f0dc0
      David S. Miller authored
      
      
      From: Aaron Conole <aconole@redhat.com>
      To: netdev@vger.kernel.org
      Cc: dev@openvswitch.org, linux-kselftest@vger.kernel.org,
      	linux-kernel@vger.kernel.org, Pravin B Shelar <pshelar@ovn.org>,
      	"David S. Miller" <davem@davemloft.net>,
      	Eric Dumazet <edumazet@google.com>,
      	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
      	Adrian Moreno <amorenoz@redhat.com>,
      	Eelco Chaudron <echaudro@redhat.com>,
      	shuah@kernel.org
      Subject: [PATCH net v2 0/4] selftests: openvswitch: Minor fixes for some systems
      Date: Wed, 11 Oct 2023 15:49:35 -0400
      Message-ID: <20231011194939.704565-1-aconole@redhat.com>
      
      A number of corner cases were caught when trying to run the selftests on
      older systems.  Missed skip conditions, some error cases, and outdated
      python setups would all report failures but the issue would actually be
      related to some other condition rather than the selftest suite.
      
      Address these individual cases.
      ====================
      
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Fix the ct_tuple for v4 · 8eff0e06
      Aaron Conole authored
      The ct_tuple v4 data structure decode / encode routines were using
      the v6 IP address decode and relying on default encode. This could
      cause exceptions during encode / decode depending on how a ct4
      tuple would appear in a netlink message.
      
      Caught during code review.
      
      Fixes: e52b07aa ("selftests: openvswitch: add flow dump support")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Skip drop testing on older kernels · 76035fd1
      Aaron Conole authored
      Kernels that don't have support for openvswitch drop reasons also
      won't have the drop counter reasons, so we should skip the test
      completely.  It previously wasn't possible to build a test case
      for this without polluting the datapath, so we introduce a mechanism
      to clear all the flows from a datapath allowing us to test for
      explicit drop actions, and then clear the flows to build the
      original test case.
      
      Fixes: 42420291 ("selftests: openvswitch: add explicit drop testcase")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Catch cases where the tests are killed · af846afa
      Aaron Conole authored
      In case of a fatal signal or an early abort, at least clean up the
      current test case.
      
      Fixes: 25f16c87 ("selftests: add openvswitch selftest suite")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Add version check for pyroute2 · 92e37f20
      Aaron Conole authored
      Paolo Abeni reports that on some systems the pyroute2 version isn't
      new enough to run the test suite.  Ensure that we support a minimum
      version of 0.6 for all cases (which does include the existing ones).
      The 0.6.1 version was released in May of 2021, so should be
      propagated to most installations at this point.
      
      The alternative that Paolo proposed was to only skip when the
      add-flow is being run.  This would be okay for most cases, except
      if a future test case is added that needs to do flow dump without
      an associated add (just guessing).  In that case, it could also be
      broken and we would need additional skip logic anyway.  Just draw
      a line in the sand now.
      
      Fixes: 25f16c87 ("selftests: add openvswitch selftest suite")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Closes: https://lore.kernel.org/lkml/8470c431e0930d2ea204a9363a60937289b7fdbe.camel@redhat.com/
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation · fc8b2a61
      Willem de Bruijn authored
      Syzbot reported two new paths to hit an internal WARNING using the
      new virtio gso type VIRTIO_NET_HDR_GSO_UDP_L4.
      
          RIP: 0010:skb_checksum_help+0x4a2/0x600 net/core/dev.c:3260
          skb len=64521 gso_size=344
      and
      
          RIP: 0010:skb_warn_bad_offload+0x118/0x240 net/core/dev.c:3262
      
      Older virtio types have historically had loose restrictions, leading
      to many entirely impractical fuzzer generated packets causing
      problems deep in the kernel stack. Ideally, we would have had strict
      validation for all types from the start.
      
      New virtio types can have tighter validation. Limit UDP GSO packets
      inserted via virtio to the same limits imposed by the UDP_SEGMENT
      socket interface:
      
      1. must use checksum offload
      2. checksum offload matches UDP header
      3. no more segments than UDP_MAX_SEGMENTS
      4. UDP GSO does not take modifier flags, notably SKB_GSO_TCP_ECN
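
      The four rules above can be sketched as a standalone validator. This is
      a simplified model and not the virtio_net header parsing code: the
      struct, its field names, and the segment limit of 64 are assumptions
      made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define UDP_CSUM_OFFSET 6u   /* offset of the checksum field in a UDP header */
#define MAX_SEGMENTS    64u  /* assumed segment limit for this sketch */

/* Hypothetical, flattened view of a UDP GSO packet description. */
struct udp_gso_pkt {
    bool     needs_csum;   /* checksum offload requested */
    uint16_t csum_offset;  /* where the checksum is to be stored */
    uint32_t payload_len;  /* bytes to be segmented */
    uint16_t gso_size;     /* bytes per segment */
    bool     ecn_flag;     /* TCP ECN modifier flag set */
};

static bool udp_gso_pkt_valid(const struct udp_gso_pkt *p)
{
    uint32_t nsegs;

    if (!p->needs_csum)                       /* rule 1 */
        return false;
    if (p->csum_offset != UDP_CSUM_OFFSET)    /* rule 2 */
        return false;
    if (p->gso_size == 0)                     /* avoid division by zero */
        return false;

    /* Round up: a partial last segment still counts. */
    nsegs = (p->payload_len + p->gso_size - 1) / p->gso_size;
    if (nsegs > MAX_SEGMENTS)                 /* rule 3 */
        return false;
    if (p->ecn_flag)                          /* rule 4 */
        return false;
    return true;
}
```

      Under these assumptions, the reported length of 64521 with gso_size 344
      would need 188 segments and is rejected by rule 3.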
      
      Fixes: 860b7f27 ("linux/virtio_net.h: Support USO offload in vnet header.")
      Reported-by: <syzbot+01cdbc31e9c0ae9b33ac@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/netdev/0000000000005039270605eb0b7f@google.com/
      Reported-by: <syzbot+c99d835ff081ca30f986@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/netdev/0000000000005426680605eb0b9f@google.com/
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Oct 15, 2023
    • qed: fix LL2 RX buffer allocation · 2f3389c7
      Manish Chopra authored
      The driver allocates the LL2 rx buffers from the kmalloc()
      area to construct the skb using slab_build_skb().

      The allocation size seems to have overlooked both the
      skb_shared_info size and the device placement padding
      bytes, which results in the panic below when doing
      skb_put() for a standard MTU sized frame.
      
      skbuff: skb_over_panic: text:ffffffffc0b0225f len:1514 put:1514
      head:ff3dabceaf39c000 data:ff3dabceaf39c042 tail:0x62c end:0x566
      dev:<NULL>
      …
      skb_panic+0x48/0x4a
      skb_put.cold+0x10/0x10
      qed_ll2b_complete_rx_packet+0x14f/0x260 [qed]
      qed_ll2_rxq_handle_completion.constprop.0+0x169/0x200 [qed]
      qed_ll2_rxq_completion+0xba/0x320 [qed]
      qed_int_sp_dpc+0x1a7/0x1e0 [qed]
      
      This patch fixes this by accounting for the skb_shared_info and device
      placement padding sizes when allocating the buffers.
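
      The numbers in the panic line can be checked with a small model of the
      bounds test. This is a simplified sketch, not the skb implementation:
      the struct and helper names are hypothetical, and the 320-byte trailer
      used in the test values is an assumed size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified linear buffer: all fields are offsets from the buffer head. */
struct buf_model {
    size_t data;  /* where the frame starts (after placement padding) */
    size_t tail;  /* end of the data written so far */
    size_t end;   /* first byte that no longer belongs to the data area */
};

/* Appending len bytes is only safe if tail does not move past end. */
static bool put_fits(const struct buf_model *b, size_t len)
{
    return b->tail + len <= b->end;
}

/* Allocation size that leaves room for the padding, the frame,
 * and a trailer stored behind the data area. */
static size_t rx_buf_size(size_t frame_len, size_t placement_pad,
                          size_t trailer_size)
{
    return placement_pad + frame_len + trailer_size;
}
```

      In the trace, data sits at offset 0x42 and end at 0x566, so a 1514-byte
      put would move tail to 0x62c, which is past end.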
      
      Cc: David S. Miller <davem@davemloft.net>
      Fixes: 0a7fb11c ("qed: Add Light L2 support")
      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Oct 14, 2023
    • Merge tag 'mlx5-fixes-2023-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 2d1c882d
      Jakub Kicinski authored
      
      
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2023-10-12
      
      This series provides bug fixes to mlx5 driver.
      
      * tag 'mlx5-fixes-2023-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
        net/mlx5e: Fix VF representors reporting zero counters to "ip -s" command
        net/mlx5e: Don't offload internal port if filter device is out device
        net/mlx5e: Take RTNL lock before triggering netdev notifiers
        net/mlx5e: XDP, Fix XDP_REDIRECT mpwqe page fragment leaks on shutdown
        net/mlx5e: RX, Fix page_pool allocation failure recovery for legacy rq
        net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq
        net/mlx5: Handle fw tracer change ownership event based on MTRC
        net/mlx5: Bridge, fix peer entry ageing in LAG mode
        net/mlx5: E-switch, register event handler before arming the event
        net/mlx5: Perform DMA operations in the right locations
      ====================
      
      Link: https://lore.kernel.org/r/20231012195127.129585-1-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'intel-wired-lan-driver-updates-2023-10-11-i40e-ice' · aeae0ef0
      Jakub Kicinski authored
      Jacob Keller says:
      
      ====================
      Intel Wired LAN Driver Updates 2023-10-11 (i40e, ice)
      
      This series contains fixes for the i40e and ice drivers.
      
      Jesse adds handling to the ice driver which resets the device when loading
      on a crash kernel, preventing stale transactions from causing machine check
      exceptions which could prevent capturing crash data.
      
      Mateusz fixes a bug in the ice driver 'Safe mode' logic for handling the
      device when the DDP is missing.
      
      Michal fixes a crash when probing the i40e driver in the event that HW
      registers are reporting invalid/unexpected values.
      
      The following are changes since commit a950a592:
        net/smc: Fix pos miscalculation in statistics
      
      I'm covering for Tony Nguyen while he's out, and don't have access to create
      a pull request branch on his net-queue, so these are sent via mail only.
      ====================
      
      Link: https://lore.kernel.org/r/20231011233334.336092-1-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: Fix safe mode when DDP is missing · 42066c4d
      Mateusz Pacuszka authored
      One thing is broken in safe mode: ice_deinit_features()
      is executed even though ice_init_features() was not,
      causing a stack trace during pci_unregister_driver().

      Add a check at the top of the function.
      
      Fixes: 5b246e53 ("ice: split probe into smaller functions")
      Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com>
      Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Link: https://lore.kernel.org/r/20231011233334.336092-4-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: reset first in crash dump kernels · 0288c3e7
      Jesse Brandeburg authored
      When the system boots into the crash dump kernel after a panic, the ice
      networking device may still have pending transactions that can cause errors
      or machine checks when the device is re-enabled. This can prevent the crash
      dump kernel from loading the driver or collecting the crash data.
      
      To avoid this issue, perform a function level reset (FLR) on the ice device
      via PCIe config space before enabling it on the crash kernel. This will
      clear any outstanding transactions and stop all queues and interrupts.
      Restore the config space after the FLR, otherwise it was found in testing
      that the driver wouldn't load successfully.
      
      The following sequence causes the original issue:
      - Load the ice driver with modprobe ice
      - Enable SR-IOV with 2 VFs: echo 2 > /sys/class/net/eth0/device/sriov_num_vfs
      - Trigger a crash with echo c > /proc/sysrq-trigger
      - Load the ice driver again (or let it load automatically) with modprobe ice
      - The system crashes again during pcim_enable_device()
      
      Fixes: 837f08fd ("ice: Add basic driver framework for Intel(R) E800 Series")
      Reported-by: Vishal Agrawal <vagrawal@redhat.com>
      Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Link: https://lore.kernel.org/r/20231011233334.336092-3-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • i40e: prevent crash on probe if hw registers have invalid values · fc6f716a
      Michal Schmidt authored
      The hardware provides the indexes of the first and the last available
      queue and VF. From the indexes, the driver calculates the numbers of
      queues and VFs. In theory, a faulty device might say the last index is
      smaller than the first index. In that case, the driver's calculation
      would underflow, it would attempt to write to non-existent registers
      outside of the ioremapped range and crash.
      
      I ran into this not by having a faulty device, but by an operator error.
      I accidentally ran a QE test meant for i40e devices on an ice device.
      The test used 'echo i40e > /sys/...ice PCI device.../driver_override',
      bound the driver to the device and crashed in one of the wr32 calls in
      i40e_clear_hw.
      
      Add checks to prevent underflows in the calculations of num_queues and
      num_vfs. With this fix, the wrong device probing reports errors and
      returns a failure without crashing.
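
      The underflow and a guard against it can be sketched as follows. This
      is an illustration with a hypothetical helper, not the i40e code;
      returning an error for an inverted range is a choice made for this
      sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Number of entries in the inclusive index range [first, last].
 * Rejects an inverted range instead of letting the unsigned
 * subtraction wrap around to a huge value. */
static int range_count(uint32_t first, uint32_t last, uint32_t *count)
{
    if (last < first)
        return -EINVAL;

    *count = last - first + 1;
    return 0;
}
```

      Without the check, first = 5 and last = 2 would give
      2 - 5 + 1 = 0xFFFFFFFE in 32-bit unsigned arithmetic, and a loop over
      that count would run far past the valid register range.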
      
      Fixes: 838d41d9 ("i40e: clear all queues and interrupts")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Link: https://lore.kernel.org/r/20231011233334.336092-2-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'nf-23-10-12' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · f50ee3a0
      Jakub Kicinski authored
      
      
      Florian Westphal says:
      
      ====================
      netfilter updates for net
      
      Patch 1, from Pablo Neira Ayuso, fixes a performance regression
      (since 6.4) when a large pending set update has to be canceled towards
      the end of the transaction.
      
      Patch 2 from myself, silences an incorrect compiler warning reported
      with a few (older) compiler toolchains.
      
      Patch 3, from Kees Cook, adds __counted_by annotation to
      nft_pipapo set backend type.  I took this for net instead of -next
      given infra is already in place and no actual code change is made.
      
      Patch 4, from Pablo Neira Ayuso, disables timeout resets on
      stateful element reset.  The reset should only affect internal object
      state, e.g. reset a quota or counter, but not affect a pending timeout.
      
      Patches 5 and 6 fix NULL dereferences in 'inner header' match,
      control plane doesn't test for netlink attribute presence before
      accessing them. Broken since feature was added in 6.2, fixes from
      Xingyuan Mo.
      
      Last patch, from myself, fixes a bogus rule match when skb has
      a 0-length mac header, in this case we'd fetch data from network
      header instead of canceling rule evaluation.  This is a day 0 bug,
      present since nftables was merged in 3.13.
      
      * tag 'nf-23-10-12' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nft_payload: fix wrong mac header matching
        nf_tables: fix NULL pointer dereference in nft_expr_inner_parse()
        nf_tables: fix NULL pointer dereference in nft_inner_init()
        netfilter: nf_tables: do not refresh timeout when resetting element
        netfilter: nf_tables: Annotate struct nft_pipapo_match with __counted_by
        netfilter: nfnetlink_log: silence bogus compiler warning
        netfilter: nf_tables: do not remove elements if set backend implements .abort
      ====================
      
      Link: https://lore.kernel.org/r/20231012085724.15155-1-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ti: icssg-prueth: Fix tx_total_bytes count · 2c0d808f
      MD Danish Anwar authored
      ICSSG HW stats on TX side considers 8 preamble bytes as data bytes. Due
      to this the tx_bytes of ICSSG interface doesn't match the rx_bytes of the
      link partner. There is no public errata available yet.
      
      As a workaround to fix this, decrease tx_bytes by 8 bytes for every tx
      frame.
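
      The workaround is a per-frame subtraction. A minimal sketch, assuming a
      hypothetical helper that takes the raw hardware byte counter and the
      number of transmitted frames:

```c
#include <assert.h>
#include <stdint.h>

#define PREAMBLE_BYTES 8u  /* bytes the hardware counts per tx frame */

/* Removes the preamble bytes from a raw tx byte counter. */
static uint64_t corrected_tx_bytes(uint64_t hw_tx_bytes, uint64_t tx_frames)
{
    return hw_tx_bytes - tx_frames * PREAMBLE_BYTES;
}
```

      For example, ten 64-byte frames counted by hardware as 720 bytes are
      reported as 640 bytes, matching what the link partner receives.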
      
      Fixes: c1e10d5d ("net: ti: icssg-prueth: Add ICSSG Stats")
      Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
      Link: https://lore.kernel.org/r/20231012064626.977466-1-danishanwar@ti.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • docs: fix info about representor identification · a258c804
      Mateusz Polchlopek authored
      Update the "How are representors identified?" documentation
      subchapter. For newer kernels, drivers should use
      SET_NETDEV_DEVLINK_PORT instead of the ndo_get_devlink_port()
      callback.
      
      Fixes: 7712b3e9 ("Merge branch 'net-fix-netdev-to-devlink_port-linkage-and-expose-to-user'")
      Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
      Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
      Link: https://lore.kernel.org/r/20231012123144.15768-1-mateusz.polchlopek@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: specs: devlink: fix reply command values · 0f4d44f6
      Jiri Pirko authored
      Make sure that the command values used for replies are correct. This is
      only affecting generated userspace helpers, no change on kernel code.
      
      Fixes: 7199c862 ("netlink: specs: devlink: add commands that do per-instance dump")
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20231012115811.298129-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/smc: fix smc clc failed issue when netdevice not in init_net · c68681ae
      Albert Huang authored
      If the netdevice is within a container and communicates externally
      through network technologies such as VxLAN, we won't be able to find
      routing information in the init_net namespace. To address this issue,
      we need to add a struct net parameter to the smc_ib_find_route function.
      This allows us to locate the routing information within the corresponding
      net namespace, ensuring the correct completion of the SMC CLC interaction.
      
      Fixes: e5c4744c ("net/smc: add SMC-Rv2 connection establishment")
      Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Link: https://lore.kernel.org/r/20231011074851.95280-1-huangjie.albert@bytedance.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: allow again tcp_disconnect() when threads are waiting · 419ce133
      Paolo Abeni authored
      As reported by Tom, .NET and applications built on top of it rely
      on connect(AF_UNSPEC) to asynchronously cancel pending I/O operations
      on a TCP socket.
      
      The blamed commit below caused a regression, as such cancellation
      can now fail.
      
      As suggested by Eric, this change addresses the problem by explicitly
      causing blocking I/O operations to terminate immediately (with an error)
      when a concurrent disconnect() is executed.

      Instead of tracking the number of threads blocked on a given socket,
      track the number of disconnect() calls issued on that socket. If the
      counter changes after a blocking operation releases and re-acquires the
      socket lock, error out the current operation.
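
      The counter scheme can be sketched as a single-threaded model. This is
      not the kernel implementation: the struct and function names are
      hypothetical, the callback merely stands for whatever runs while the
      lock is dropped, and the error code is an arbitrary choice for the
      sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical socket model holding only the disconnect counter. */
struct sock_model {
    unsigned int disconnects;  /* bumped by every disconnect */
};

static void model_disconnect(struct sock_model *sk)
{
    sk->disconnects++;
}

/* Models a blocking operation that releases the socket lock while it
 * waits. If a disconnect happened in the meantime, the snapshot taken
 * before waiting no longer matches and the operation errors out. */
static int model_blocking_op(struct sock_model *sk,
                             void (*while_waiting)(struct sock_model *))
{
    unsigned int seen = sk->disconnects;  /* snapshot under the lock */

    if (while_waiting)
        while_waiting(sk);                /* lock dropped, others may run */

    if (sk->disconnects != seen)          /* lock re-acquired: compare */
        return -ECONNABORTED;
    return 0;
}
```

      Compared with counting waiters, the snapshot needs no bookkeeping on
      the waiter side beyond one local variable.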
      
      Fixes: 4faeee0c ("tcp: deny tcp_disconnect() when threads are waiting")
      Reported-by: Tom Deseyn <tdeseyn@redhat.com>
      Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: fix over-shifted variable · 242e3450
      Jesse Brandeburg authored
      Since the introduction of the ice driver the code has been
      double-shifting the RSS enabling field, because the define already has
      shifts in it and can't have the regular pattern of "a << shiftval &
      mask" applied.
      
      Most places in the code got it right, but one line was still wrong. Fix
      this one location for easy backports to stable. An in-progress patch
      fixes the defines to "standard" and will be applied as part of the
      regular -next process sometime after this one.
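
      The effect of the double shift can be shown with made-up defines. The
      field position, mask, and value below are hypothetical and are not the
      ice driver's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2-bit field at bits 7:6. */
#define FIELD_S      6
#define FIELD_M      (0x3u << FIELD_S)
/* Hypothetical value define that already includes the shift. */
#define FIELD_VAL_B  (0x2u << FIELD_S)

/* Applies the usual "a << shiftval & mask" pattern to a value that is
 * already shifted: the bits leave the field and the mask clears them. */
static uint8_t encode_double_shift(void)
{
    return (uint8_t)((FIELD_VAL_B << FIELD_S) & FIELD_M);
}

/* Uses the pre-shifted value as is. */
static uint8_t encode_correct(void)
{
    return (uint8_t)(FIELD_VAL_B & FIELD_M);
}
```

      Here the double-shifted encoding yields 0 while the intended encoding
      is 0x80, so the field silently ends up cleared.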
      
      Fixes: d76a60ba ("ice: Add support for VLANs and offloads")
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      CC: stable@vger.kernel.org
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20231010203101.406248-1-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: bcm_sf2: Fix possible memory leak in bcm_sf2_mdio_register() · 61b40cef
      Jinjie Ruan authored
      In bcm_sf2_mdio_register(), the class_find_device() will call get_device()
      to increment reference count for priv->master_mii_bus->dev if
      of_mdio_find_bus() succeeds. If mdiobus_alloc() or mdiobus_register()
      fails, get_device() will have been called twice without the reference
      count being decremented. The same happens if bcm_sf2_mdio_register()
      succeeds but bcm_sf2_sw_probe() later fails, or if bcm_sf2_sw_probe()
      succeeds. If the reference count is never decremented to zero, the
      resources related to the device will not be freed.
      
      So remove the get_device() in bcm_sf2_mdio_register(), and call
      put_device() if mdiobus_alloc() or mdiobus_register() fails and in
      bcm_sf2_mdio_unregister() to solve the issue.
      
      And as Simon suggested, unwind from errors for bcm_sf2_mdio_register() and
      just return 0 if it succeeds to make it cleaner.
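
      The balanced get/put with error unwinding can be sketched with a plain
      counter. This is a userspace model with hypothetical names, not the
      bcm_sf2 code; the failure flags only simulate the two failing steps.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device model that tracks only its reference count. */
struct dev_model {
    int refs;
};

static void dev_get(struct dev_model *d) { d->refs++; }
static void dev_put(struct dev_model *d) { d->refs--; }

/* Stand-in for a lookup that returns the device with a reference held. */
static struct dev_model *find_bus(struct dev_model *d)
{
    dev_get(d);
    return d;
}

/* Takes exactly one reference via the lookup and drops it on every
 * error path; on success the reference is kept until unregister. */
static int model_register(struct dev_model *d, int fail_alloc,
                          int fail_register)
{
    struct dev_model *bus = find_bus(d);
    int err;

    if (fail_alloc) {
        err = -ENOMEM;
        goto err_put;
    }
    if (fail_register) {
        err = -EIO;
        goto err_put;
    }
    return 0;

err_put:
    dev_put(bus);
    return err;
}

/* Drops the reference kept by a successful model_register(). */
static void model_unregister(struct dev_model *d)
{
    dev_put(d);
}
```

      After any failure, or after a register and unregister pair, the count
      returns to its starting value, which is what allows the device to be
      freed.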
      
      Fixes: 461cd1b0 ("net: dsa: bcm_sf2: Register our slave MDIO bus")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Suggested-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Link: https://lore.kernel.org/r/20231011032419.2423290-1-ruanjinjie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'selftests-fib_tests-fixes-for-multipath-list-receive-tests' · dda5e1ee
      Jakub Kicinski authored
      
      
      Ido Schimmel says:
      
      ====================
      selftests: fib_tests: Fixes for multipath list receive tests
      
      Fix two issues in recently added FIB multipath list receive tests.
      ====================
      
      Link: https://lore.kernel.org/r/20231010132113.3014691-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: fib_tests: Count all trace point invocations · aa13e524
      Ido Schimmel authored
      The tests rely on the IPv{4,6} FIB trace points being triggered once for
      each forwarded packet. If receive processing is deferred to the
      ksoftirqd task these invocations will not be counted and the tests will
      fail. Fix by specifying the '-a' flag to prevent perf from filtering on
      the mausezahn task.
      
      Before:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (.68)                               [FAIL]
      
       # ./fib_tests.sh -t ipv6_mpath_list
      
       IPv6 multipath list receive tests
           TEST: Multipath route hit ratio (.27)                               [FAIL]
      
      After:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (1.00)                              [ OK ]
      
       # ./fib_tests.sh -t ipv6_mpath_list
      
       IPv6 multipath list receive tests
           TEST: Multipath route hit ratio (.99)                               [ OK ]
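
The effect of the '-a' flag can be illustrated with a small model. This is a hypothetical sketch, not the selftest code: the event list, the task names and the count_hits() helper are assumptions made for illustration, and the 68/32 split is chosen only to mirror the IPv4 "Before" ratio. It shows why counting trace point invocations from a single task under-reports when part of the receive processing runs in ksoftirqd.

```python
from typing import Iterable, Optional, Tuple

# (task that triggered the trace point, trace point name)
Event = Tuple[str, str]


def count_hits(events: Iterable[Event], tracepoint: str,
               only_task: Optional[str] = None) -> int:
    """Count trace point invocations, optionally restricted to one task."""
    return sum(
        1
        for task, name in events
        if name == tracepoint and (only_task is None or task == only_task)
    )


# Hypothetical run: 100 packets, 32 of which are processed by ksoftirqd.
events = ([("mausezahn", "fib:fib_table_lookup")] * 68
          + [("ksoftirqd/0", "fib:fib_table_lookup")] * 32)
packets = 100

filtered = count_hits(events, "fib:fib_table_lookup", only_task="mausezahn")
system_wide = count_hits(events, "fib:fib_table_lookup")

ratio_filtered = filtered / packets        # ksoftirqd invocations are missed
ratio_system_wide = system_wide / packets  # every invocation is counted
```

In this model only the system-wide count reaches one invocation per packet, which is the condition the tests rely on.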
      
      Fixes: 8ae9efb8 ("selftests: fib_tests: Add multipath list receive tests")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/netdev/202309191658.c00d8b8-oliver.sang@intel.com/
      Tested-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Tested-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
      Link: https://lore.kernel.org/r/20231010132113.3014691-3-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      aa13e524
    • Ido Schimmel's avatar
      selftests: fib_tests: Disable RP filter in multipath list receive test · dbb13378
      Ido Schimmel authored
      The test relies on the fib:fib_table_lookup trace point being triggered
      once for each forwarded packet. If RP filter is not disabled, the trace
      point will be triggered twice for each packet (for source validation and
      forwarding), potentially masking actual bugs. Fix by explicitly
      disabling RP filter.
      
      Before:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (1.99)                              [ OK ]
      
      After:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (.99)                               [ OK ]
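
The masking effect described above can be sketched as simple arithmetic. This is a model under stated assumptions, not the test script: it assumes the reported ratio is the number of trace point invocations divided by the number of packets sent, and the helper names lookups_per_packet() and hit_ratio() are hypothetical.

```python
def lookups_per_packet(rp_filter_enabled: bool) -> int:
    """Trace point invocations caused by one forwarded packet."""
    forwarding = 1
    source_validation = 1 if rp_filter_enabled else 0
    return forwarding + source_validation


def hit_ratio(packets_hitting_route: int, packets_sent: int,
              rp_filter_enabled: bool) -> float:
    """Invocations observed per packet sent."""
    hits = packets_hitting_route * lookups_per_packet(rp_filter_enabled)
    return hits / packets_sent


# All packets hit the route: the ratio doubles when RP filter is enabled.
healthy_with_rp = hit_ratio(100, 100, rp_filter_enabled=True)
healthy_without_rp = hit_ratio(100, 100, rp_filter_enabled=False)

# Only half of the packets hit the route (a real bug): with RP filter
# enabled the ratio still looks like one invocation per packet.
buggy_with_rp = hit_ratio(50, 100, rp_filter_enabled=True)
buggy_without_rp = hit_ratio(50, 100, rp_filter_enabled=False)
```

With RP filter enabled, the buggy case is indistinguishable from a healthy run without it, which is why the double counting can hide a genuine failure.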
      
      Fixes: 8ae9efb8 ("selftests: fib_tests: Add multipath list receive tests")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/netdev/202309191658.c00d8b8-oliver.sang@intel.com/
      Tested-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Tested-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
      Link: https://lore.kernel.org/r/20231010132113.3014691-2-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dbb13378
    • Kuniyuki Iwashima's avatar
      tcp: Fix listen() warning with v4-mapped-v6 address. · 8702cf12
      Kuniyuki Iwashima authored
      syzbot reported a warning [0] introduced by commit c48ef9c4 ("tcp: Fix
      bind() regression for v4-mapped-v6 non-wildcard address.").
      
      After the cited commit, a v4 socket's address matches the corresponding
      v4-mapped-v6 tb2 in inet_bind2_bucket_match_addr(), not vice versa.
      
      During X.X.X.X -> ::ffff:X.X.X.X order bind()s, the second bind() uses
      bhash and conflicts properly without checking bhash2 so that we need not
      check if a v4-mapped-v6 sk matches the corresponding v4 address tb2 in
      inet_bind2_bucket_match_addr().  However, the repro shows that we need
      to check that in a no-conflict case.
      
      The repro bind()s two sockets to the 2-tuples using SO_REUSEPORT and calls
      listen() for the first socket:
      
        from socket import *
      
        s1 = socket()
        s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
        s1.bind(('127.0.0.1', 0))
      
        s2 = socket(AF_INET6)
        s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
        s2.bind(('::ffff:127.0.0.1', s1.getsockname()[1]))
      
        s1.listen()
      
      The second socket should belong to the first socket's tb2, but the second
      bind() creates another tb2 bucket because inet_bind2_bucket_find() returns
      NULL in inet_csk_get_port() as the v4-mapped-v6 sk does not match the
      corresponding v4 address tb2.
      
        bhash2[] -> tb2(::ffff:X.X.X.X) -> tb2(X.X.X.X)
      
      Then, listen() for the first socket calls inet_csk_get_port(), where the
      v4 address matches the v4-mapped-v6 tb2 and WARN_ON() is triggered.
      
      To avoid that, we need to check if v4-mapped-v6 sk address matches with
      the corresponding v4 address tb2 in inet_bind2_bucket_match().
      
      The same checks are needed in inet_bind2_bucket_addr_match() too, so we
      can move all checks there and call it from inet_bind2_bucket_match().
      
      Note that now tb->family is just an address family of tb->(v6_)?rcv_saddr
      and not of sockets in the bucket.  This could be refactored later by
      defining tb->rcv_saddr as tb->v6_rcv_saddr.s6_addr32[3] and prepending
      ::ffff: when creating v4 tb2.
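
The difference between one-way and symmetric matching can be modelled in user space with Python's ipaddress module. This is only an illustration of the idea, not the kernel logic: the function names one_way_match() and symmetric_match() are hypothetical, addresses are passed as strings, and wildcard addresses are not handled.

```python
import ipaddress
from typing import Union

Address = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def canonical(addr: str) -> Address:
    """Return the IPv4 form of a v4-mapped-v6 address, else the address itself."""
    parsed = ipaddress.ip_address(addr)
    if isinstance(parsed, ipaddress.IPv6Address) and parsed.ipv4_mapped is not None:
        return parsed.ipv4_mapped
    return parsed


def one_way_match(sk_addr: str, tb2_addr: str) -> bool:
    """Model of the behaviour after the cited commit: a v4 socket address
    matches a v4-mapped-v6 bucket, but not the other way round."""
    sk = ipaddress.ip_address(sk_addr)
    tb2 = ipaddress.ip_address(tb2_addr)
    if sk == tb2:
        return True
    return (isinstance(sk, ipaddress.IPv4Address)
            and isinstance(tb2, ipaddress.IPv6Address)
            and tb2.ipv4_mapped == sk)


def symmetric_match(sk_addr: str, tb2_addr: str) -> bool:
    """Model of the fixed behaviour: compare both sides in canonical form."""
    return canonical(sk_addr) == canonical(tb2_addr)


v4, mapped = "127.0.0.1", "::ffff:127.0.0.1"
old_v4_sk_vs_mapped_tb2 = one_way_match(v4, mapped)
old_mapped_sk_vs_v4_tb2 = one_way_match(mapped, v4)
new_mapped_sk_vs_v4_tb2 = symmetric_match(mapped, v4)
```

In this model the second bind() of the repro corresponds to old_mapped_sk_vs_v4_tb2: the lookup misses the existing bucket, so a second bucket is created for what is effectively the same address.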
      
      [0]:
      WARNING: CPU: 0 PID: 5049 at net/ipv4/inet_connection_sock.c:587 inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587
      Modules linked in:
      CPU: 0 PID: 5049 Comm: syz-executor288 Not tainted 6.6.0-rc2-syzkaller-00018-g2cf0f7156238 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023
      RIP: 0010:inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587
      Code: 7c 24 08 e8 4c b6 8a 01 31 d2 be 88 01 00 00 48 c7 c7 e0 94 ae 8b e8 59 2e a3 f8 2e 2e 2e 31 c0 e9 04 fe ff ff e8 ca 88 d0 f8 <0f> 0b e9 0f f9 ff ff e8 be 88 d0 f8 49 8d 7e 48 e8 65 ca 5a 00 31
      RSP: 0018:ffffc90003abfbf0 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: ffff888026429100 RCX: 0000000000000000
      RDX: ffff88807edcbb80 RSI: ffffffff88b73d66 RDI: ffff888026c49f38
      RBP: ffff888026c49f30 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff9260f200
      R13: ffff888026c49880 R14: 0000000000000000 R15: ffff888026429100
      FS:  00005555557d5380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000045ad50 CR3: 0000000025754000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       inet_csk_listen_start+0x155/0x360 net/ipv4/inet_connection_sock.c:1256
       __inet_listen_sk+0x1b8/0x5c0 net/ipv4/af_inet.c:217
       inet_listen+0x93/0xd0 net/ipv4/af_inet.c:239
       __sys_listen+0x194/0x270 net/socket.c:1866
       __do_sys_listen net/socket.c:1875 [inline]
       __se_sys_listen net/socket.c:1873 [inline]
       __x64_sys_listen+0x53/0x80 net/socket.c:1873
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f3a5bce3af9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffc1a1c79e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a5bce3af9
      RDX: 00007f3a5bce3af9 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 00007f3a5bd565f0 R08: 0000000000000006 R09: 0000000000000006
      R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000001
      R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001
       </TASK>
      
      Fixes: c48ef9c4 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address.")
      Reported-by: <syzbot+71e724675ba3958edb31@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=71e724675ba3958edb31
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20231010013814.70571-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8702cf12
  5. Oct 13, 2023
    • Jiri Wiesner's avatar
      bonding: Return pointer to data after pull on skb · d93f3f99
      Jiri Wiesner authored
      Since 429e3d12 ("bonding: Fix extraction of ports from the packet
      headers"), header offsets used to compute a hash in bond_xmit_hash() are
      relative to skb->data and not skb->head. If the tail of the header buffer
      of an skb really needs to be advanced and the operation is successful, the
      pointer to the data must be returned (and not a pointer to the head of the
      buffer).
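
Why the returned pointer matters can be shown with a toy buffer. This is a sketch under stated assumptions, not the bonding code: the Skb class, the 16-byte headroom, the port offset and the read_u16() helper are all hypothetical. The point is only that an offset computed relative to the start of the packet data reads the wrong bytes when applied to the start of the buffer.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    buf: bytes     # whole buffer, starting at "head"
    data_off: int  # where the packet data starts inside buf


def from_head(skb: Skb) -> bytes:
    """View starting at the head of the buffer (includes the headroom)."""
    return skb.buf


def from_data(skb: Skb) -> bytes:
    """View starting at the packet data."""
    return skb.buf[skb.data_off:]


def read_u16(base: bytes, offset: int) -> int:
    """Read a big-endian 16-bit field at the given offset."""
    return int.from_bytes(base[offset:offset + 2], "big")


headroom = bytes(16)                       # unused space before the packet
packet = bytes(4) + (443).to_bytes(2, "big")
skb = Skb(buf=headroom + packet, data_off=len(headroom))

port_offset = 4                            # relative to the packet data
port_via_data = read_u16(from_data(skb), port_offset)
port_via_head = read_u16(from_head(skb), port_offset)
```

Here port_via_data recovers the port, while port_via_head reads bytes from the headroom instead.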
      
      Fixes: 429e3d12 ("bonding: Fix extraction of ports from the packet headers")
      Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d93f3f99
    • Linus Torvalds's avatar
      Merge tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · e8c127b0
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from CAN and BPF.
      
        We have a regression in TC currently under investigation, otherwise
        the things that stand out most are probably the TCP and AF_PACKET
        fixes, with both issues coming from 6.5.
      
        Previous releases - regressions:
      
         - af_packet: fix fortified memcpy() without flex array.
      
         - tcp: fix crashes trying to free half-baked MTU probes
      
         - xdp: fix zero-size allocation warning in xskq_create()
      
         - can: sja1000: always restart the tx queue after an overrun
      
         - eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
      
         - eth: nfp: avoid rmmod nfp crash issues
      
         - eth: octeontx2-pf: fix page pool frag allocation warning
      
        Previous releases - always broken:
      
         - mctp: perform route lookups under a RCU read-side lock
      
         - bpf: s390: fix clobbering the caller's backchain in the trampoline
      
         - phy: lynx-28g: cancel the CDR check work item on the remove path
      
         - dsa: qca8k: fix qca8k driver for Turris 1.x
      
         - eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
      
         - eth: ixgbe: fix crash with empty VF macvlan list"
      
      * tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
        rswitch: Fix imbalance phy_power_off() calling
        rswitch: Fix renesas_eth_sw_remove() implementation
        octeontx2-pf: Fix page pool frag allocation warning
        nfc: nci: assert requested protocol is valid
        af_packet: Fix fortified memcpy() without flex array.
        net: tcp: fix crashes trying to free half-baked MTU probes
        net/smc: Fix pos miscalculation in statistics
        nfp: flower: avoid rmmod nfp crash issues
        net: usb: dm9601: fix uninitialized variable use in dm9601_mdio_read
        ethtool: Fix mod state of verbose no_mask bitset
        net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
        mctp: perform route lookups under a RCU read-side lock
        net: skbuff: fix kernel-doc typos
        s390/bpf: Fix unwinding past the trampoline
        s390/bpf: Fix clobbering the caller's backchain in the trampoline
        net/mlx5e: Again mutually exclude RX-FCS and RX-port-timestamp
        net/smc: Fix dependency of SMC on ISM
        ixgbe: fix crash with empty VF macvlan list
        net/mlx5e: macsec: use update_pn flag instead of PN comparation
        net: phy: mscc: macsec: reject PN update requests
        ...
      e8c127b0
    • Linus Torvalds's avatar
      Merge tag 'soc-fixes-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 9a5a1494
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "AngeloGioacchino Del Regno is stepping in as co-maintainer for the
        MediaTek SoC platform and starts by sending some dts fixes for the
        mt8195 platform that had been pending for a while.
      
        On the ixp4xx platform, Krzysztof Halasa steps down as co-maintainer,
        reflecting that Linus Walleij has been handling this on his own for
        the past few years.
      
        Generic RISC-V kernels are now marked as incompatible with the RZ/Five
        platform that requires custom hacks both for managing its DMA bounce
        buffers and for addressing low virtual memory.
      
        Finally, there is one bugfix for the AMDTEE firmware driver to prevent
        a use-after-free bug"
      
      * tag 'soc-fixes-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        IXP4xx MAINTAINERS entries
        arm64: dts: mediatek: mt8195: Set DSU PMU status to fail
        arm64: dts: mediatek: fix t-phy unit name
        arm64: dts: mediatek: mt8195-demo: update and reorder reserved memory regions
        arm64: dts: mediatek: mt8195-demo: fix the memory size to 8GB
        MAINTAINERS: Add Angelo as MediaTek SoC co-maintainer
        soc: renesas: Make ARCH_R9A07G043 (riscv version) depend on NONPORTABLE
        tee: amdtee: fix use-after-free vulnerability in amdtee_close_session
      9a5a1494
    • Linus Torvalds's avatar
      Merge tag 'pmdomain-v6.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm · 9b1ad4ba
      Linus Torvalds authored
      Pull pmdomain fix from Ulf Hansson:
      
       - imx: scu-pd: Correct the DMA2 channel
      
      * tag 'pmdomain-v6.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
        pmdomain: imx: scu-pd: correct DMA2 channel
      9b1ad4ba
    • Amir Tzin's avatar
      net/mlx5e: Fix VF representors reporting zero counters to "ip -s" command · 80f12414
      Amir Tzin authored
      Although the vf_vport entry of struct mlx5e_stats is never updated, its
      values are mistakenly copied to the caller structure in the VF
      representor .ndo_get_stats64 callback mlx5e_rep_get_stats(). Remove the
      redundant entry and use the updated one, rep_stats, instead.
      
      Fixes: 64b68e36 ("net/mlx5: Refactor and expand rep vport stat group")
      Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
      Signed-off-by: Amir Tzin <amirtz@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      80f12414
    • Jianbo Liu's avatar
      net/mlx5e: Don't offload internal port if filter device is out device · 06b4eac9
      Jianbo Liu authored
      In the cited commit, if the routing device is an ovs internal port, the
      out device is set to uplink, and packets go out after encapsulation.
      
      If the filter device is the uplink, it can trigger the following syndrome:
      mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 3966): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xcdb051), err(-22)
      
      Fix this issue by not offloading internal port if filter device is out
      device. In this case, packets are not forwarded to the root table to
      be processed, the termination table is used instead to forward them
      from uplink to uplink.
      
      Fixes: 100ad4e2 ("net/mlx5e: Offload internal port as encap route device")
      Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
      Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      06b4eac9
    • Lama Kayal's avatar
      net/mlx5e: Take RTNL lock before triggering netdev notifiers · c51c6734
      Lama Kayal authored
      Hold RTNL lock when calling xdp_set_features() with a registered netdev,
      as the call triggers the netdev notifiers. This could happen when
      switching from nic profile to uplink representor for example.
      
      A similar fix for a similar scenario was previously introduced in
      commit 72cc6549 ("net/mlx5e: Take RTNL lock when needed before calling
      xdp_set_features()").
      
      This fixes the following assertion and warning call trace:
      
      RTNL: assertion failed at net/core/dev.c (1961)
      WARNING: CPU: 13 PID: 2529 at net/core/dev.c:1961
      call_netdevice_notifiers_info+0x7c/0x80
      Modules linked in: rpcrdma rdma_ucm ib_iser libiscsi
      scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
      ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink
      nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5
      auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
      CPU: 13 PID: 2529 Comm: devlink Not tainted
      6.5.0_for_upstream_min_debug_2023_09_07_20_04 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
      rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      RIP: 0010:call_netdevice_notifiers_info+0x7c/0x80
      Code: 8f ff 80 3d 77 0d 16 01 00 75 c5 ba a9 07 00 00 48
      c7 c6 c4 bb 0d 82 48 c7 c7 18 c8 06 82 c6 05 5b 0d 16 01 01 e8 44 f6 8c
      ff <0f> 0b eb a2 0f 1f 44 00 00 55 48 89 e5 41 54 48 83 e4 f0 48 83 ec
      RSP: 0018:ffff88819930f7f0 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffffffff8309f740 RCX: 0000000000000027
      RDX: ffff88885fb5b5c8 RSI: 0000000000000001 RDI: ffff88885fb5b5c0
      RBP: 0000000000000028 R08: ffff88887ffabaa8 R09: 0000000000000003
      R10: ffff88887fecbac0 R11: ffff88887ff7bac0 R12: ffff88819930f810
      R13: ffff88810b7fea40 R14: ffff8881154e8fd8 R15: ffff888107e881a0
      FS:  00007f3ad248f800(0000) GS:ffff88885fb40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000563b85f164e0 CR3: 0000000113b5c006 CR4: 0000000000370ea0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ? __warn+0x79/0x120
       ? call_netdevice_notifiers_info+0x7c/0x80
       ? report_bug+0x17c/0x190
       ? handle_bug+0x3c/0x60
       ? exc_invalid_op+0x14/0x70
       ? asm_exc_invalid_op+0x16/0x20
       ? call_netdevice_notifiers_info+0x7c/0x80
       call_netdevice_notifiers+0x2e/0x50
       mlx5e_set_xdp_feature+0x21/0x50 [mlx5_core]
       mlx5e_build_rep_params+0x97/0x130 [mlx5_core]
       mlx5e_init_ul_rep+0x9f/0x100 [mlx5_core]
       mlx5e_netdev_init_profile+0x76/0x110 [mlx5_core]
       mlx5e_netdev_attach_profile+0x1f/0x90 [mlx5_core]
       mlx5e_netdev_change_profile+0x92/0x160 [mlx5_core]
       mlx5e_vport_rep_load+0x329/0x4a0 [mlx5_core]
       mlx5_esw_offloads_rep_load+0x9e/0xf0 [mlx5_core]
       esw_offloads_enable+0x4bc/0xe90 [mlx5_core]
       mlx5_eswitch_enable_locked+0x3c8/0x570 [mlx5_core]
       ? kmalloc_trace+0x25/0x80
       mlx5_devlink_eswitch_mode_set+0x224/0x680 [mlx5_core]
       ? devlink_get_from_attrs_lock+0x9e/0x110
       devlink_nl_cmd_eswitch_set_doit+0x60/0xe0
       genl_family_rcv_msg_doit+0xd0/0x120
       genl_rcv_msg+0x180/0x2b0
       ? devlink_get_from_attrs_lock+0x110/0x110
       ? devlink_nl_cmd_eswitch_get_doit+0x290/0x290
       ? devlink_pernet_pre_exit+0xf0/0xf0
       ? genl_family_rcv_msg_dumpit+0xf0/0xf0
       netlink_rcv_skb+0x54/0x100
       genl_rcv+0x24/0x40
       netlink_unicast+0x1fc/0x2c0
       netlink_sendmsg+0x232/0x4a0
       sock_sendmsg+0x38/0x60
       ? _copy_from_user+0x2a/0x60
       __sys_sendto+0x110/0x160
       ? handle_mm_fault+0x161/0x260
       ? do_user_addr_fault+0x276/0x620
       __x64_sys_sendto+0x20/0x30
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f3ad231340a
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3
      0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f
      05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      RSP: 002b:00007ffd70aad4b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000c36b00 RCX: 00007f3ad231340a
      RDX: 0000000000000038 RSI: 0000000000c36b00 RDI: 0000000000000003
      RBP: 0000000000c36910 R08: 00007f3ad2625200 R09: 000000000000000c
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      ---[ end trace 0000000000000000 ]---
      ------------[ cut here ]------------
      
      Fixes: 4d5ab0ad ("net/mlx5e: take into account device reconfiguration for xdp_features flag")
      Signed-off-by: Lama Kayal <lkayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      c51c6734
    • Dragos Tatulea's avatar
      net/mlx5e: XDP, Fix XDP_REDIRECT mpwqe page fragment leaks on shutdown · aaab619c
      Dragos Tatulea authored
      When mlx5e_xdp_xmit is called without the XDP_XMIT_FLUSH set it is
      possible that it leaves a mpwqe session open. That is ok during runtime:
      the session will be closed on the next call to mlx5e_xdp_xmit. But
      having a mpwqe session still open at XDP sq close time is problematic:
      the pc counter is not updated before flushing the contents of the
      xdpi_fifo. This results in leaking page fragments.
      
      The fix is to always close the mpwqe session at the end of
      mlx5e_xdp_xmit, regardless of the XDP_XMIT_FLUSH flag being set or not.
      
      Fixes: 5e0d2eef ("net/mlx5e: XDP, Support Enhanced Multi-Packet TX WQE")
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      aaab619c
    • Dragos Tatulea's avatar
      net/mlx5e: RX, Fix page_pool allocation failure recovery for legacy rq · ef9369e9
      Dragos Tatulea authored
      When a page allocation fails during refill in mlx5e_refill_rx_wqes, the
      page will be released again on the next refill call. This triggers the
      page_pool negative page fragment count warning below:
      
       [  338.326070] WARNING: CPU: 4 PID: 0 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
        ...
       [  338.328993] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329094] Call Trace:
       [  338.329097]  <IRQ>
       [  338.329100]  ? __warn+0x7d/0x120
       [  338.329105]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329173]  ? report_bug+0x155/0x180
       [  338.329179]  ? handle_bug+0x3c/0x60
       [  338.329183]  ? exc_invalid_op+0x13/0x60
       [  338.329187]  ? asm_exc_invalid_op+0x16/0x20
       [  338.329192]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329259]  mlx5e_post_rx_wqes+0x210/0x5a0 [mlx5_core]
       [  338.329327]  ? mlx5e_poll_rx_cq+0x88/0x6f0 [mlx5_core]
       [  338.329394]  mlx5e_napi_poll+0x127/0x6b0 [mlx5_core]
       [  338.329461]  __napi_poll+0x25/0x1a0
       [  338.329465]  net_rx_action+0x28a/0x300
       [  338.329468]  __do_softirq+0xcd/0x279
       [  338.329473]  irq_exit_rcu+0x6a/0x90
       [  338.329477]  common_interrupt+0x82/0xa0
       [  338.329482]  </IRQ>
      
      This patch fixes the legacy rq case by releasing all allocated fragments
      and then setting the skip flag on all released fragments. It is
      important to note that the number of released fragments will be higher
      than the number of allocated fragments when an allocation error occurs.
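
The double release and the role of the skip flag can be sketched with a toy reference counter. This is a simplified model, not the mlx5e implementation: the Frag class, the refill() helper, the four-fragment ring and the underflow counter (standing in for the page_pool warning) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frag:
    refs: int = 0               # toy page fragment reference count
    skip_release: bool = True   # nothing allocated yet, nothing to release
    underflows: int = 0         # stands in for the negative count warning


def release(frag: Frag) -> None:
    if frag.skip_release:
        return
    frag.refs -= 1
    if frag.refs < 0:
        frag.underflows += 1


def refill(frags: List[Frag], fail_at: Optional[int], fixed: bool) -> bool:
    # Deferred release: drop the pages from the previous round first.
    for frag in frags:
        release(frag)
    for i, frag in enumerate(frags):
        if fail_at is not None and i == fail_at:
            # Allocation failed: undo the fragments allocated in this round.
            for done in frags[:i]:
                release(done)
            if fixed:
                # Mark every released fragment so the next refill skips it.
                for f in frags:
                    f.skip_release = True
            return False
        frag.refs = 1
        frag.skip_release = False
    return True


def run(fixed: bool) -> int:
    frags = [Frag() for _ in range(4)]
    refill(frags, None, fixed)  # normal refill
    refill(frags, 2, fixed)     # allocation of the third fragment fails
    refill(frags, None, fixed)  # next refill releases the old pages again
    return sum(f.underflows for f in frags)


underflows_without_fix = run(fixed=False)
underflows_with_fix = run(fixed=True)
```

In the failing round of this model six releases are performed for two allocations, matching the note that more fragments are released than allocated when an allocation error occurs.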
      
      Fixes: 3f93f829 ("net/mlx5e: RX, Defer page release in legacy rq for better recycling")
      Tested-by: Chris Mason <clm@fb.com>
      Reported-by: Chris Mason <clm@fb.com>
      Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      ef9369e9
    • Dragos Tatulea's avatar
      net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq · be43b748
      Dragos Tatulea authored
      When a page allocation fails during refill in mlx5e_post_rx_mpwqes, the
      page will be released again on the next refill call. This triggers the
      page_pool negative page fragment count warning below:
      
       [ 2436.447717] WARNING: CPU: 1 PID: 2419 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       ...
       [ 2436.447895] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.447991] Call Trace:
       [ 2436.447975]  mlx5e_post_rx_mpwqes+0x1d5/0xcf0 [mlx5_core]
       [ 2436.447994]  <IRQ>
       [ 2436.447996]  ? __warn+0x7d/0x120
       [ 2436.448009]  ? mlx5e_handle_rx_cqe_mpwrq+0x109/0x1d0 [mlx5_core]
       [ 2436.448002]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.448044]  ? mlx5e_poll_rx_cq+0x87/0x6e0 [mlx5_core]
       [ 2436.448061]  ? report_bug+0x155/0x180
       [ 2436.448065]  ? handle_bug+0x36/0x70
       [ 2436.448067]  ? exc_invalid_op+0x13/0x60
       [ 2436.448070]  ? asm_exc_invalid_op+0x16/0x20
       [ 2436.448079]  mlx5e_napi_poll+0x122/0x6b0 [mlx5_core]
       [ 2436.448077]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.448113]  ? generic_exec_single+0x35/0x100
       [ 2436.448117]  __napi_poll+0x25/0x1a0
       [ 2436.448120]  net_rx_action+0x28a/0x300
       [ 2436.448122]  __do_softirq+0xcd/0x279
       [ 2436.448126]  irq_exit_rcu+0x6a/0x90
       [ 2436.448128]  sysvec_apic_timer_interrupt+0x6e/0x90
       [ 2436.448130]  </IRQ>
      
      This patch fixes the striding rq case by setting the skip flag on all
      the wqe pages that were expected to have new pages allocated.
      
      Fixes: 4c2a1323 ("net/mlx5e: RX, Defer page release in striding rq for better recycling")
      Tested-by: Chris Mason <clm@fb.com>
      Reported-by: Chris Mason <clm@fb.com>
      Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      be43b748
    • Maher Sanalla's avatar
      net/mlx5: Handle fw tracer change ownership event based on MTRC · 92fd3963
      Maher Sanalla authored
      Currently, whenever fw issues a change ownership event, the PF that owns
      the fw tracer drops its ownership directly and the other PFs try to pick
      up the ownership based on what the MTRC register reports.
      
      In some cases, the driver releases the ownership of the tracer and
      reacquires it later on. Whenever the driver releases ownership of the
      tracer, fw issues a change ownership event. This event can be delayed and
      arrive after the driver has reacquired ownership of the tracer. The late
      event then triggers the tracer owner PF to release the ownership again,
      leading to a scenario where no PF owns the tracer.
      
      To prevent the scenario described above, when handling a change
      ownership event, do not drop ownership of the tracer directly. Instead,
      read the fw MTRC register to retrieve the up-to-date owner of the tracer
      and set it accordingly at the driver level.
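
The delayed-event scenario can be replayed with a small state model. This is a sketch, not the driver code: the Mtrc and PfTracer classes and both handler functions are hypothetical stand-ins, and the firmware register is reduced to a single "owner" field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mtrc:
    """Stand-in for the firmware MTRC register: which PF owns the tracer."""
    owner: Optional[int] = None


@dataclass
class PfTracer:
    pf_id: int
    owns: bool = False


def handle_event_drop(pf: PfTracer) -> None:
    """Old behaviour: the owner drops ownership as soon as an event arrives."""
    if pf.owns:
        pf.owns = False


def handle_event_read_mtrc(pf: PfTracer, mtrc: Mtrc) -> None:
    """New behaviour: trust the register, not the possibly stale event."""
    pf.owns = mtrc.owner == pf.pf_id


def delayed_event_scenario(use_mtrc: bool) -> bool:
    mtrc = Mtrc(owner=0)
    pf0 = PfTracer(pf_id=0, owns=True)
    # PF0 releases the tracer; fw queues a change ownership event (delayed).
    mtrc.owner, pf0.owns = None, False
    # PF0 reacquires the tracer before the event is delivered.
    mtrc.owner, pf0.owns = 0, True
    # The late event finally arrives.
    if use_mtrc:
        handle_event_read_mtrc(pf0, mtrc)
    else:
        handle_event_drop(pf0)
    return pf0.owns


owner_kept_old = delayed_event_scenario(use_mtrc=False)
owner_kept_new = delayed_event_scenario(use_mtrc=True)
```

With the old handler the late event leaves the driver believing nobody owns the tracer; reading the register keeps the driver state in line with the firmware state.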
      
      Fixes: f53aaa31 ("net/mlx5: FW tracer, implement tracer logic")
      Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      92fd3963
    • Vlad Buslov's avatar
      net/mlx5: Bridge, fix peer entry ageing in LAG mode · 7a3ce807
      Vlad Buslov authored
      
      
      With the current implementation in single FDB LAG mode, all packets are
      processed by eswitch 0 rules. As such, 'peer' FDB entries receive the
      packets for rules of other eswitches and are responsible for updating the
      main entry by sending SWITCHDEV_FDB_ADD_TO_BRIDGE notification from their
      background update wq task. However, this introduces a race condition when
      a non-zero eswitch instance decides to delete an FDB entry and sends
      SWITCHDEV_FDB_DEL_TO_BRIDGE notification, but another eswitch's update task
      refreshes the same entry concurrently while its async delete work is still
      pending on the workqueue. In such a case another SWITCHDEV_FDB_ADD_TO_BRIDGE
      event may be generated and the entry will remain stuck in FDB marked as
      'offloaded' since no more SWITCHDEV_FDB_DEL_TO_BRIDGE notifications are
      sent for deleting the peer entries.
      
      Fix the issue by synchronously marking deleted entries with
      MLX5_ESW_BRIDGE_FLAG_DELETED flag and skipping them in background update
      job.
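
The idea of marking entries synchronously and skipping them in the background job can be sketched as follows. This is a sequential toy model of the interleaving, not the bridge offload code: FLAG_DELETED is a hypothetical stand-in for MLX5_ESW_BRIDGE_FLAG_DELETED, and the notification log uses shortened names.

```python
from dataclasses import dataclass
from typing import List

FLAG_DELETED = 1 << 0  # hypothetical stand-in for MLX5_ESW_BRIDGE_FLAG_DELETED


@dataclass
class FdbEntry:
    flags: int = 0


def request_delete(entry: FdbEntry, log: List[str]) -> None:
    """Mark the entry synchronously; the actual removal happens later."""
    entry.flags |= FLAG_DELETED
    log.append("DEL_TO_BRIDGE")


def background_update(entry: FdbEntry, log: List[str], skip_deleted: bool) -> None:
    """Refresh an entry that saw traffic, unless it is marked as deleted."""
    if skip_deleted and entry.flags & FLAG_DELETED:
        return
    log.append("ADD_TO_BRIDGE")


def race(skip_deleted: bool) -> List[str]:
    entry = FdbEntry()
    log: List[str] = []
    request_delete(entry, log)                   # delete work still pending
    background_update(entry, log, skip_deleted)  # concurrent refresh by a peer
    return log


log_without_fix = race(skip_deleted=False)
log_with_fix = race(skip_deleted=True)
```

Without the check, the log ends with an ADD notification after the DEL, which corresponds to the entry that stays stuck as 'offloaded'.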
      
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      7a3ce807
    • Shay Drory's avatar
      net/mlx5: E-switch, register event handler before arming the event · 7624e58a
      Shay Drory authored
      Currently, mlx5 registers the event handler for the vport context change
      event some time after arming the event. This can lead to missing an
      event, which results in wrong rules in the FDB.
      Hence, register the event handler before arming the event.
      
      This solution is valid since FW sends the vport context change event
      only on vports which SW armed, and SW arms the vport when enabling
      it, which is done after the FDB has been created.
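
The ordering problem can be shown with a toy event source. This is a generic illustration, not the E-switch code: the VportEvents class, its methods and the event string are hypothetical. Events are only emitted once the source is armed, and an event that fires while no handler is registered is lost.

```python
from typing import Callable, List, Tuple


class VportEvents:
    """Toy event source: events fire only once armed."""

    def __init__(self) -> None:
        self.handlers: List[Callable[[str], None]] = []
        self.armed = False
        self.missed: List[str] = []

    def register(self, handler: Callable[[str], None]) -> None:
        self.handlers.append(handler)

    def arm(self) -> None:
        self.armed = True

    def fire(self, event: str) -> None:
        if not self.armed:
            return  # an unarmed source emits nothing
        if not self.handlers:
            self.missed.append(event)  # nobody is listening yet
            return
        for handler in self.handlers:
            handler(event)


def run(register_first: bool) -> Tuple[List[str], List[str]]:
    source = VportEvents()
    handled: List[str] = []
    if register_first:
        source.register(handled.append)
        source.arm()
        source.fire("vport context change")
    else:
        source.arm()
        source.fire("vport context change")
        source.register(handled.append)
    return handled, source.missed


arm_first = run(register_first=False)
register_first = run(register_first=True)
```

Registering first closes the window in which an armed event has no handler.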
      
      Fixes: 6933a937 ("net/mlx5: E-Switch, Use async events chain")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      7624e58a