  1. Oct 16, 2023
  2. Oct 15, 2023
    • Manish Chopra's avatar
      qed: fix LL2 RX buffer allocation · 2f3389c7
      Manish Chopra authored
      The driver allocates the LL2 RX buffers from the kmalloc()
      area and constructs the skb using slab_build_skb().
      
      The allocation size overlooked both the skb_shared_info size
      and the device placement padding bytes, which results in the
      panic below when doing skb_put() for a standard MTU-sized frame.
      
      skbuff: skb_over_panic: text:ffffffffc0b0225f len:1514 put:1514
      head:ff3dabceaf39c000 data:ff3dabceaf39c042 tail:0x62c end:0x566
      dev:<NULL>
      …
      skb_panic+0x48/0x4a
      skb_put.cold+0x10/0x10
      qed_ll2b_complete_rx_packet+0x14f/0x260 [qed]
      qed_ll2_rxq_handle_completion.constprop.0+0x169/0x200 [qed]
      qed_ll2_rxq_completion+0xba/0x320 [qed]
      qed_int_sp_dpc+0x1a7/0x1e0 [qed]
      
      This patch fixes this by accounting for the skb_shared_info and
      device placement padding bytes when allocating the buffers.
      
      Cc: David S. Miller <davem@davemloft.net>
      Fixes: 0a7fb11c ("qed: Add Light L2 support")
      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f3389c7
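The numbers in the skb_over_panic line above can be checked by hand. The following is a minimal sketch, assuming that tail and end are printed as offsets from head and that the skb was empty before the put (len and put are both 1514). The helper name is hypothetical and is not part of the driver.

```python
# Values copied from the skb_over_panic line in the commit message.
HEAD = 0xff3dabceaf39c000
DATA = 0xff3dabceaf39c042
PUT_LEN = 1514   # bytes appended by skb_put(), as reported by "put:1514"
END = 0x566      # end of the usable buffer, as an offset from head


def tail_after_put(head: int, data: int, put_len: int) -> int:
    """Offset of the tail from head once put_len bytes are appended at data."""
    return (data - head) + put_len


tail = tail_after_put(HEAD, DATA, PUT_LEN)
overrun = tail - END  # how far the frame runs past the end of the buffer

print(hex(tail))   # matches "tail:0x62c" in the report
print(overrun)
```

The tail lands past the end offset, which is the condition skb_put() reports as skb_over_panic.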
  3. Oct 14, 2023
    • Jakub Kicinski's avatar
      Merge tag 'mlx5-fixes-2023-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 2d1c882d
      Jakub Kicinski authored
      
      
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2023-10-12
      
      This series provides bug fixes to mlx5 driver.
      
      * tag 'mlx5-fixes-2023-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
        net/mlx5e: Fix VF representors reporting zero counters to "ip -s" command
        net/mlx5e: Don't offload internal port if filter device is out device
        net/mlx5e: Take RTNL lock before triggering netdev notifiers
        net/mlx5e: XDP, Fix XDP_REDIRECT mpwqe page fragment leaks on shutdown
        net/mlx5e: RX, Fix page_pool allocation failure recovery for legacy rq
        net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq
        net/mlx5: Handle fw tracer change ownership event based on MTRC
        net/mlx5: Bridge, fix peer entry ageing in LAG mode
        net/mlx5: E-switch, register event handler before arming the event
        net/mlx5: Perform DMA operations in the right locations
      ====================
      
      Link: https://lore.kernel.org/r/20231012195127.129585-1-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2d1c882d
    • Jakub Kicinski's avatar
      Merge branch 'intel-wired-lan-driver-updates-2023-10-11-i40e-ice' · aeae0ef0
      Jakub Kicinski authored
      Jacob Keller says:
      
      ====================
      Intel Wired LAN Driver Updates 2023-10-11 (i40e, ice)
      
      This series contains fixes for the i40e and ice drivers.
      
      Jesse adds handling to the ice driver which resets the device when loading
      on a crash kernel, preventing stale transactions from causing machine check
      exceptions which could prevent capturing crash data.
      
      Mateusz fixes a bug in the ice driver 'Safe mode' logic for handling the
      device when the DDP is missing.
      
      Michal fixes a crash when probing the i40e driver in the event that HW
      registers are reporting invalid/unexpected values.
      
      The following are changes since commit a950a592:
        net/smc: Fix pos miscalculation in statistics
      
      I'm covering for Tony Nguyen while he's out, and don't have access to create
      a pull request branch on his net-queue, so these are sent via mail only.
      ====================
      
      Link: https://lore.kernel.org/r/20231011233334.336092-1-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      aeae0ef0
    • Mateusz Pacuszka's avatar
      ice: Fix safe mode when DDP is missing · 42066c4d
      Mateusz Pacuszka authored
      One thing is broken in safe mode: ice_deinit_features() is
      executed even though ice_init_features() was not, causing a
      stack trace during pci_unregister_driver().
      
      Add a check at the top of the function.
      
      Fixes: 5b246e53 ("ice: split probe into smaller functions")
      Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com>
      Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Link: https://lore.kernel.org/r/20231011233334.336092-4-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      42066c4d
    • Jesse Brandeburg's avatar
      ice: reset first in crash dump kernels · 0288c3e7
      Jesse Brandeburg authored
      When the system boots into the crash dump kernel after a panic, the ice
      networking device may still have pending transactions that can cause errors
      or machine checks when the device is re-enabled. This can prevent the crash
      dump kernel from loading the driver or collecting the crash data.
      
      To avoid this issue, perform a function level reset (FLR) on the ice device
      via PCIe config space before enabling it on the crash kernel. This will
      clear any outstanding transactions and stop all queues and interrupts.
      Restore the config space after the FLR, otherwise it was found in testing
      that the driver wouldn't load successfully.
      
      The following sequence causes the original issue:
      - Load the ice driver with modprobe ice
      - Enable SR-IOV with 2 VFs: echo 2 > /sys/class/net/eth0/device/sriov_num_vfs
      - Trigger a crash with echo c > /proc/sysrq-trigger
      - Load the ice driver again (or let it load automatically) with modprobe ice
      - The system crashes again during pcim_enable_device()
      
      Fixes: 837f08fd ("ice: Add basic driver framework for Intel(R) E800 Series")
      Reported-by: Vishal Agrawal <vagrawal@redhat.com>
      Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Link: https://lore.kernel.org/r/20231011233334.336092-3-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0288c3e7
    • Michal Schmidt's avatar
      i40e: prevent crash on probe if hw registers have invalid values · fc6f716a
      Michal Schmidt authored
      The hardware provides the indexes of the first and the last available
      queue and VF. From the indexes, the driver calculates the numbers of
      queues and VFs. In theory, a faulty device might say the last index is
      smaller than the first index. In that case, the driver's calculation
      would underflow, it would attempt to write to non-existent registers
      outside of the ioremapped range and crash.
      
      I ran into this not by having a faulty device, but by an operator error.
      I accidentally ran a QE test meant for i40e devices on an ice device.
      The test used 'echo i40e > /sys/...ice PCI device.../driver_override',
      bound the driver to the device and crashed in one of the wr32 calls in
      i40e_clear_hw.
      
      Add checks to prevent underflows in the calculations of num_queues and
      num_vfs. With this fix, the wrong device probing reports errors and
      returns a failure without crashing.
      
      Fixes: 838d41d9 ("i40e: clear all queues and interrupts")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Link: https://lore.kernel.org/r/20231011233334.336092-2-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fc6f716a
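The underflow described in the i40e commit above can be illustrated with a small model. This is only a sketch, assuming the count is computed as last - first + 1 in unsigned 32-bit arithmetic. The function names are hypothetical and the real check in the driver may be shaped differently.

```python
U32_MASK = 0xFFFFFFFF


def count_unchecked(first: int, last: int) -> int:
    """Model of C u32 arithmetic: the result wraps when last < first."""
    return (last - first + 1) & U32_MASK


def count_checked(first: int, last: int) -> int:
    """Treat an inverted range as empty instead of letting it wrap."""
    if last < first:
        return 0
    return last - first + 1


# A device (or a wrongly bound driver) reporting last < first:
print(count_unchecked(10, 2))  # a huge bogus count
print(count_checked(10, 2))    # 0
```

With the unchecked count, a loop over that many queues or VFs would touch registers far outside the mapped range, which is the crash the commit describes.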
    • Jakub Kicinski's avatar
      Merge tag 'nf-23-10-12' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · f50ee3a0
      Jakub Kicinski authored
      
      
      Florian Westphal says:
      
      ====================
      netfilter updates for net
      
      Patch 1, from Pablo Neira Ayuso, fixes a performance regression
      (since 6.4) when a large pending set update has to be canceled towards
      the end of the transaction.
      
      Patch 2 from myself, silences an incorrect compiler warning reported
      with a few (older) compiler toolchains.
      
      Patch 3, from Kees Cook, adds __counted_by annotation to
      nft_pipapo set backend type.  I took this for net instead of -next
      given infra is already in place and no actual code change is made.
      
      Patch 4, from Pablo Neira Ayuso, disables timeout resets on
      stateful element reset.  The reset should only affect internal object
      state, e.g. reset a quota or counter, but not affect a pending timeout.
      
      Patches 5 and 6 fix NULL dereferences in 'inner header' match,
      control plane doesn't test for netlink attribute presence before
      accessing them. Broken since feature was added in 6.2, fixes from
      Xingyuan Mo.
      
      Last patch, from myself, fixes a bogus rule match when skb has
      a 0-length mac header, in this case we'd fetch data from network
      header instead of canceling rule evaluation.  This is a day 0 bug,
      present since nftables was merged in 3.13.
      
      * tag 'nf-23-10-12' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nft_payload: fix wrong mac header matching
        nf_tables: fix NULL pointer dereference in nft_expr_inner_parse()
        nf_tables: fix NULL pointer dereference in nft_inner_init()
        netfilter: nf_tables: do not refresh timeout when resetting element
        netfilter: nf_tables: Annotate struct nft_pipapo_match with __counted_by
        netfilter: nfnetlink_log: silence bogus compiler warning
        netfilter: nf_tables: do not remove elements if set backend implements .abort
      ====================
      
      Link: https://lore.kernel.org/r/20231012085724.15155-1-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f50ee3a0
    • MD Danish Anwar's avatar
      net: ti: icssg-prueth: Fix tx_total_bytes count · 2c0d808f
      MD Danish Anwar authored
      ICSSG HW stats on TX side considers 8 preamble bytes as data bytes. Due
      to this the tx_bytes of ICSSG interface doesn't match the rx_bytes of the
      link partner. There is no public errata available yet.
      
      As a workaround to fix this, decrease tx_bytes by 8 bytes for every tx
      frame.
      
      Fixes: c1e10d5d ("net: ti: icssg-prueth: Add ICSSG Stats")
      Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
      Link: https://lore.kernel.org/r/20231012064626.977466-1-danishanwar@ti.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2c0d808f
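The workaround in the icssg-prueth commit above amounts to a per-frame correction. The following is a minimal sketch of that arithmetic, assuming the hardware byte counter and a transmitted-frame count are both available. The function and parameter names are hypothetical and are not taken from the driver.

```python
PREAMBLE_BYTES = 8  # bytes the hardware wrongly counts as data per TX frame


def corrected_tx_bytes(raw_tx_bytes: int, tx_frames: int) -> int:
    """Subtract the preamble bytes the hardware added for every TX frame."""
    return raw_tx_bytes - PREAMBLE_BYTES * tx_frames


# Ten 64-byte frames: the hardware would report 10 * (64 + 8) bytes.
raw = 10 * (64 + PREAMBLE_BYTES)
print(corrected_tx_bytes(raw, 10))
```

After the correction, the value should agree with the rx_bytes seen by the link partner for the same frames.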
    • Mateusz Polchlopek's avatar
      docs: fix info about representor identification · a258c804
      Mateusz Polchlopek authored
      Update the "How are representors identified?" documentation
      subchapter. For newer kernels, the driver should use
      SET_NETDEV_DEVLINK_PORT instead of the ndo_get_devlink_port()
      callback.
      
      Fixes: 7712b3e9 ("Merge branch 'net-fix-netdev-to-devlink_port-linkage-and-expose-to-user'")
      Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
      Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
      Link: https://lore.kernel.org/r/20231012123144.15768-1-mateusz.polchlopek@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a258c804
    • Jiri Pirko's avatar
      netlink: specs: devlink: fix reply command values · 0f4d44f6
      Jiri Pirko authored
      Make sure that the command values used for replies are correct. This
      only affects generated userspace helpers; there is no change to kernel code.
      
      Fixes: 7199c862 ("netlink: specs: devlink: add commands that do per-instance dump")
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20231012115811.298129-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0f4d44f6
    • Albert Huang's avatar
      net/smc: fix smc clc failed issue when netdevice not in init_net · c68681ae
      Albert Huang authored
      If the netdevice is within a container and communicates externally
      through network technologies such as VxLAN, we won't be able to find
      routing information in the init_net namespace. To address this issue,
      we need to add a struct net parameter to the smc_ib_find_route function.
      This allows us to locate the routing information within the corresponding
      net namespace, ensuring the correct completion of the SMC CLC interaction.
      
      Fixes: e5c4744c ("net/smc: add SMC-Rv2 connection establishment")
      Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Link: https://lore.kernel.org/r/20231011074851.95280-1-huangjie.albert@bytedance.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c68681ae
    • Paolo Abeni's avatar
      tcp: allow again tcp_disconnect() when threads are waiting · 419ce133
      Paolo Abeni authored
      As reported by Tom, .NET and applications built on top of it rely
      on connect(AF_UNSPEC) to asynchronously cancel pending I/O operations
      on a TCP socket.
      
      The blamed commit below caused a regression, as such cancellation
      can now fail.
      
      As suggested by Eric, this change addresses the problem by explicitly
      causing blocking I/O operations to terminate immediately (with an error)
      when a concurrent disconnect() is executed.
      
      Instead of tracking the number of threads blocked on a given socket,
      track the number of disconnect() calls issued on that socket. If the
      counter changes after a blocking operation releases and re-acquires the
      socket lock, error out of the current operation.
      
      Fixes: 4faeee0c ("tcp: deny tcp_disconnect() when threads are waiting")
      Reported-by: Tom Deseyn <tdeseyn@redhat.com>
      Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      419ce133
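The counter scheme described in the tcp_disconnect() commit above can be modelled in a few lines. This is a single-threaded sketch of the idea only: the class, attribute, and exception names are hypothetical, and the error the kernel actually returns is not stated in the commit message.

```python
class Disconnected(Exception):
    """Raised when a disconnect happened while the operation was waiting."""


class Sock:
    def __init__(self) -> None:
        self.disconnects = 0  # number of disconnect() calls seen so far

    def disconnect(self) -> None:
        self.disconnects += 1


def blocking_op(sock: Sock, wait) -> str:
    seen = sock.disconnects      # snapshot taken while holding the lock
    wait()                       # stands in for releasing and re-taking the lock
    if sock.disconnects != seen:
        raise Disconnected("socket was disconnected while waiting")
    return "ok"


s = Sock()
print(blocking_op(s, lambda: None))   # nothing happened during the wait

try:
    blocking_op(s, s.disconnect)      # a disconnect arrives during the wait
    interrupted = False
except Disconnected:
    interrupted = True
print(interrupted)
```

Comparing a snapshot with the current counter lets the waiter detect any number of intervening disconnects without the socket having to track who is blocked on it.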
    • Jesse Brandeburg's avatar
      ice: fix over-shifted variable · 242e3450
      Jesse Brandeburg authored
      Since the introduction of the ice driver the code has been
      double-shifting the RSS enabling field, because the define already has
      shifts in it and can't have the regular pattern of "a << shiftval &
      mask" applied.
      
      Most places in the code got it right, but one line was still wrong. Fix
      this one location for easy backports to stable. An in-progress patch
      fixes the defines to "standard" and will be applied as part of the
      regular -next process sometime after this one.
      
      Fixes: d76a60ba ("ice: Add support for VLANs and offloads")
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      CC: stable@vger.kernel.org
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20231010203101.406248-1-jacob.e.keller@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      242e3450
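The double shift described in the ice commit above is easy to reproduce with made-up values. In this sketch all constant names and bit positions are hypothetical and do not correspond to the driver's real defines. It only shows why a define that already contains the shift cannot be fed through the usual "a << shiftval & mask" pattern.

```python
FIELD_SHIFT = 4
FIELD_MASK = 0x3 << FIELD_SHIFT    # the field occupies bits 4 and 5
FIELD_ENABLE = 0x1 << FIELD_SHIFT  # this define is already shifted into place

# Applying the regular pattern shifts the value a second time,
# moving the bit out of the field so the mask clears it.
wrong = (FIELD_ENABLE << FIELD_SHIFT) & FIELD_MASK

# A pre-shifted define only needs the mask.
right = FIELD_ENABLE & FIELD_MASK

print(hex(wrong))
print(hex(right))
```

In this model the over-shifted value silently becomes zero, so the feature bit is never set.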
    • Jinjie Ruan's avatar
      net: dsa: bcm_sf2: Fix possible memory leak in bcm_sf2_mdio_register() · 61b40cef
      Jinjie Ruan authored
      In bcm_sf2_mdio_register(), class_find_device() calls get_device()
      to increment the reference count of priv->master_mii_bus->dev if
      of_mdio_find_bus() succeeds. If mdiobus_alloc() or mdiobus_register()
      fails, get_device() has been called twice without the reference count
      of the device being decremented. The same happens if
      bcm_sf2_mdio_register() succeeds but bcm_sf2_sw_probe() fails, or if
      bcm_sf2_sw_probe() succeeds. If the reference count never drops to
      zero, the resources related to dev are not freed.
      
      So remove the get_device() in bcm_sf2_mdio_register(), and call
      put_device() if mdiobus_alloc() or mdiobus_register() fails and in
      bcm_sf2_mdio_unregister() to solve the issue.
      
      And as Simon suggested, unwind from errors for bcm_sf2_mdio_register() and
      just return 0 if it succeeds to make it cleaner.
      
      Fixes: 461cd1b0 ("net: dsa: bcm_sf2: Register our slave MDIO bus")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Suggested-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Link: https://lore.kernel.org/r/20231011032419.2423290-1-ruanjinjie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      61b40cef
    • Jakub Kicinski's avatar
      Merge branch 'selftests-fib_tests-fixes-for-multipath-list-receive-tests' · dda5e1ee
      Jakub Kicinski authored
      
      
      Ido Schimmel says:
      
      ====================
      selftests: fib_tests: Fixes for multipath list receive tests
      
      Fix two issues in recently added FIB multipath list receive tests.
      ====================
      
      Link: https://lore.kernel.org/r/20231010132113.3014691-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dda5e1ee
    • Ido Schimmel's avatar
      selftests: fib_tests: Count all trace point invocations · aa13e524
      Ido Schimmel authored
      The tests rely on the IPv{4,6} FIB trace points being triggered once for
      each forwarded packet. If receive processing is deferred to the
      ksoftirqd task these invocations will not be counted and the tests will
      fail. Fix by specifying the '-a' flag to prevent perf from filtering on
      the mausezahn task.
      
      Before:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (.68)                               [FAIL]
      
       # ./fib_tests.sh -t ipv6_mpath_list
      
       IPv6 multipath list receive tests
           TEST: Multipath route hit ratio (.27)                               [FAIL]
      
      After:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (1.00)                              [ OK ]
      
       # ./fib_tests.sh -t ipv6_mpath_list
      
       IPv6 multipath list receive tests
           TEST: Multipath route hit ratio (.99)                               [ OK ]
      
      Fixes: 8ae9efb8 ("selftests: fib_tests: Add multipath list receive tests")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/netdev/202309191658.c00d8b8-oliver.sang@intel.com/
      Tested-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Tested-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
      Link: https://lore.kernel.org/r/20231010132113.3014691-3-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      aa13e524
    • Ido Schimmel's avatar
      selftests: fib_tests: Disable RP filter in multipath list receive test · dbb13378
      Ido Schimmel authored
      The test relies on the fib:fib_table_lookup trace point being triggered
      once for each forwarded packet. If RP filter is not disabled, the trace
      point will be triggered twice for each packet (for source validation and
      forwarding), potentially masking actual bugs. Fix by explicitly
      disabling RP filter.
      
      Before:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (1.99)                              [ OK ]
      
      After:
      
       # ./fib_tests.sh -t ipv4_mpath_list
      
       IPv4 multipath list receive tests
           TEST: Multipath route hit ratio (.99)                               [ OK ]
      
      Fixes: 8ae9efb8 ("selftests: fib_tests: Add multipath list receive tests")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/netdev/202309191658.c00d8b8-oliver.sang@intel.com/
      Tested-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Tested-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
      Link: https://lore.kernel.org/r/20231010132113.3014691-2-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dbb13378
    • Kuniyuki Iwashima's avatar
      tcp: Fix listen() warning with v4-mapped-v6 address. · 8702cf12
      Kuniyuki Iwashima authored
      syzbot reported a warning [0] introduced by commit c48ef9c4 ("tcp: Fix
      bind() regression for v4-mapped-v6 non-wildcard address.").
      
      After the cited commit, a v4 socket's address matches the corresponding
      v4-mapped-v6 tb2 in inet_bind2_bucket_match_addr(), not vice versa.
      
      During X.X.X.X -> ::ffff:X.X.X.X order bind()s, the second bind() uses
      bhash and conflicts properly without checking bhash2 so that we need not
      check if a v4-mapped-v6 sk matches the corresponding v4 address tb2 in
      inet_bind2_bucket_match_addr().  However, the repro shows that we need
      to check that in a no-conflict case.
      
      The repro bind()s two sockets to the 2-tuples using SO_REUSEPORT and calls
      listen() for the first socket:
      
        from socket import *
      
        s1 = socket()
        s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
        s1.bind(('127.0.0.1', 0))
      
        s2 = socket(AF_INET6)
        s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
        s2.bind(('::ffff:127.0.0.1', s1.getsockname()[1]))
      
        s1.listen()
      
      The second socket should belong to the first socket's tb2, but the second
      bind() creates another tb2 bucket because inet_bind2_bucket_find() returns
      NULL in inet_csk_get_port() as the v4-mapped-v6 sk does not match the
      corresponding v4 address tb2.
      
        bhash2[] -> tb2(::ffff:X.X.X.X) -> tb2(X.X.X.X)
      
      Then, listen() for the first socket calls inet_csk_get_port(), where the
      v4 address matches the v4-mapped-v6 tb2 and WARN_ON() is triggered.
      
      To avoid that, we need to check if v4-mapped-v6 sk address matches with
      the corresponding v4 address tb2 in inet_bind2_bucket_match().
      
      The same checks are needed in inet_bind2_bucket_addr_match() too, so we
      can move all checks there and call it from inet_bind2_bucket_match().
      
      Note that now tb->family is just an address family of tb->(v6_)?rcv_saddr
      and not of sockets in the bucket.  This could be refactored later by
      defining tb->rcv_saddr as tb->v6_rcv_saddr.s6_addr32[3] and prepending
      ::ffff: when creating v4 tb2.
      
      [0]:
      WARNING: CPU: 0 PID: 5049 at net/ipv4/inet_connection_sock.c:587 inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587
      Modules linked in:
      CPU: 0 PID: 5049 Comm: syz-executor288 Not tainted 6.6.0-rc2-syzkaller-00018-g2cf0f7156238 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023
      RIP: 0010:inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587
      Code: 7c 24 08 e8 4c b6 8a 01 31 d2 be 88 01 00 00 48 c7 c7 e0 94 ae 8b e8 59 2e a3 f8 2e 2e 2e 31 c0 e9 04 fe ff ff e8 ca 88 d0 f8 <0f> 0b e9 0f f9 ff ff e8 be 88 d0 f8 49 8d 7e 48 e8 65 ca 5a 00 31
      RSP: 0018:ffffc90003abfbf0 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: ffff888026429100 RCX: 0000000000000000
      RDX: ffff88807edcbb80 RSI: ffffffff88b73d66 RDI: ffff888026c49f38
      RBP: ffff888026c49f30 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff9260f200
      R13: ffff888026c49880 R14: 0000000000000000 R15: ffff888026429100
      FS:  00005555557d5380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000045ad50 CR3: 0000000025754000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       inet_csk_listen_start+0x155/0x360 net/ipv4/inet_connection_sock.c:1256
       __inet_listen_sk+0x1b8/0x5c0 net/ipv4/af_inet.c:217
       inet_listen+0x93/0xd0 net/ipv4/af_inet.c:239
       __sys_listen+0x194/0x270 net/socket.c:1866
       __do_sys_listen net/socket.c:1875 [inline]
       __se_sys_listen net/socket.c:1873 [inline]
       __x64_sys_listen+0x53/0x80 net/socket.c:1873
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f3a5bce3af9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffc1a1c79e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a5bce3af9
      RDX: 00007f3a5bce3af9 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 00007f3a5bd565f0 R08: 0000000000000006 R09: 0000000000000006
      R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000001
      R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001
       </TASK>
      
      Fixes: c48ef9c4 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address.")
      Reported-by: <syzbot+71e724675ba3958edb31@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=71e724675ba3958edb31
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20231010013814.70571-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8702cf12
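The listen() commit above turns on the match between a v4 address and its v4-mapped-v6 form working in both directions. The following is a userspace sketch of such a symmetric comparison using Python's standard ipaddress module. The helper name is hypothetical and this is not the kernel's bucket-matching code.

```python
import ipaddress


def same_v4_address(a: str, b: str) -> bool:
    """True if a and b name the same IPv4 address, whether written as
    plain v4 or as v4-mapped-v6 (::ffff:a.b.c.d), in either order."""

    def as_v4(text: str):
        addr = ipaddress.ip_address(text)
        if addr.version == 6:
            return addr.ipv4_mapped  # None unless the address is v4-mapped
        return addr

    va, vb = as_v4(a), as_v4(b)
    return va is not None and va == vb


print(same_v4_address("127.0.0.1", "::ffff:127.0.0.1"))
print(same_v4_address("::ffff:127.0.0.1", "127.0.0.1"))
print(same_v4_address("127.0.0.1", "::1"))
```

Normalising both sides before comparing is what makes the result independent of argument order, which is the property the one-directional match lacked.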
  4. Oct 13, 2023
    • Jiri Wiesner's avatar
      bonding: Return pointer to data after pull on skb · d93f3f99
      Jiri Wiesner authored
      Since 429e3d12 ("bonding: Fix extraction of ports from the packet
      headers"), header offsets used to compute a hash in bond_xmit_hash() are
      relative to skb->data and not skb->head. If the tail of the header buffer
      of an skb really needs to be advanced and the operation is successful, the
      pointer to the data must be returned (and not a pointer to the head of the
      buffer).
      
      Fixes: 429e3d12 ("bonding: Fix extraction of ports from the packet headers")
      Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d93f3f99
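The difference between head-relative and data-relative offsets in the bonding commit above can be shown with a toy buffer. This sketch assumes a buffer with some headroom before the packet data. The layout, values, and function names are hypothetical.

```python
HEADROOM = b"\x00" * 4                     # unused space before the packet
PACKET = bytes([0xAA, 0xBB, 0xCC, 0xDD])   # bytes the hash should read
BUF = HEADROOM + PACKET                    # "head" is the start of BUF
DATA_OFF = len(HEADROOM)                   # "data" starts after the headroom


def read_from_head(buf: bytes, offset: int) -> int:
    """Wrong base: applies a data-relative offset to the head of the buffer."""
    return buf[offset]


def read_from_data(buf: bytes, data_off: int, offset: int) -> int:
    """Right base: the offset is relative to where the packet data starts."""
    return buf[data_off + offset]


print(hex(read_from_head(BUF, 1)))            # reads headroom, not the packet
print(hex(read_from_data(BUF, DATA_OFF, 1)))  # reads the intended byte
```

When the offsets are computed relative to data, returning a head pointer makes every subsequent read land in the headroom.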
    • Linus Torvalds's avatar
      Merge tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · e8c127b0
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from CAN and BPF.
      
        We have a regression in TC currently under investigation, otherwise
        the things that stand out most are probably the TCP and AF_PACKET
        fixes, with both issues coming from 6.5.
      
        Previous releases - regressions:
      
         - af_packet: fix fortified memcpy() without flex array.
      
         - tcp: fix crashes trying to free half-baked MTU probes
      
         - xdp: fix zero-size allocation warning in xskq_create()
      
         - can: sja1000: always restart the tx queue after an overrun
      
         - eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
      
         - eth: nfp: avoid rmmod nfp crash issues
      
         - eth: octeontx2-pf: fix page pool frag allocation warning
      
        Previous releases - always broken:
      
         - mctp: perform route lookups under a RCU read-side lock
      
         - bpf: s390: fix clobbering the caller's backchain in the trampoline
      
         - phy: lynx-28g: cancel the CDR check work item on the remove path
      
         - dsa: qca8k: fix qca8k driver for Turris 1.x
      
         - eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
      
         - eth: ixgbe: fix crash with empty VF macvlan list"
      
      * tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
        rswitch: Fix imbalance phy_power_off() calling
        rswitch: Fix renesas_eth_sw_remove() implementation
        octeontx2-pf: Fix page pool frag allocation warning
        nfc: nci: assert requested protocol is valid
        af_packet: Fix fortified memcpy() without flex array.
        net: tcp: fix crashes trying to free half-baked MTU probes
        net/smc: Fix pos miscalculation in statistics
        nfp: flower: avoid rmmod nfp crash issues
        net: usb: dm9601: fix uninitialized variable use in dm9601_mdio_read
        ethtool: Fix mod state of verbose no_mask bitset
        net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
        mctp: perform route lookups under a RCU read-side lock
        net: skbuff: fix kernel-doc typos
        s390/bpf: Fix unwinding past the trampoline
        s390/bpf: Fix clobbering the caller's backchain in the trampoline
        net/mlx5e: Again mutually exclude RX-FCS and RX-port-timestamp
        net/smc: Fix dependency of SMC on ISM
        ixgbe: fix crash with empty VF macvlan list
        net/mlx5e: macsec: use update_pn flag instead of PN comparation
        net: phy: mscc: macsec: reject PN update requests
        ...
      e8c127b0
    • Merge tag 'soc-fixes-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 9a5a1494
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "AngeloGioacchino Del Regno is stepping in as co-maintainer for the
        MediaTek SoC platform and starts by sending some dts fixes for the
        mt8195 platform that had been pending for a while.
      
        On the ixp4xx platform, Krzysztof Halasa steps down as co-maintainer,
        reflecting that Linus Walleij has been handling this on his own for
        the past few years.
      
        Generic RISC-V kernels are now marked as incompatible with the RZ/Five
        platform that requires custom hacks both for managing its DMA bounce
        buffers and for addressing low virtual memory.
      
        Finally, there is one bugfix for the AMDTEE firmware driver to prevent
        a use-after-free bug"
      
      * tag 'soc-fixes-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        IXP4xx MAINTAINERS entries
        arm64: dts: mediatek: mt8195: Set DSU PMU status to fail
        arm64: dts: mediatek: fix t-phy unit name
        arm64: dts: mediatek: mt8195-demo: update and reorder reserved memory regions
        arm64: dts: mediatek: mt8195-demo: fix the memory size to 8GB
        MAINTAINERS: Add Angelo as MediaTek SoC co-maintainer
        soc: renesas: Make ARCH_R9A07G043 (riscv version) depend on NONPORTABLE
        tee: amdtee: fix use-after-free vulnerability in amdtee_close_session
      9a5a1494
    • Merge tag 'pmdomain-v6.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm · 9b1ad4ba
      Linus Torvalds authored
      Pull pmdomain fix from Ulf Hansson:
      
       - imx: scu-pd: Correct the DMA2 channel
      
      * tag 'pmdomain-v6.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
        pmdomain: imx: scu-pd: correct DMA2 channel
      9b1ad4ba
    • net/mlx5e: Fix VF representors reporting zero counters to "ip -s" command · 80f12414
      Amir Tzin authored
      Although the vf_vport entry of struct mlx5e_stats is never updated, its
      values are mistakenly copied to the caller's structure in the VF
      representor .ndo_get_stats64 callback mlx5e_rep_get_stats(). Remove the
      redundant entry and use the updated one, rep_stats, instead.
      
      Fixes: 64b68e36 ("net/mlx5: Refactor and expand rep vport stat group")
      Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
      Signed-off-by: Amir Tzin <amirtz@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      80f12414
    • net/mlx5e: Don't offload internal port if filter device is out device · 06b4eac9
      Jianbo Liu authored
      In the cited commit, if the routing device is an OVS internal port, the
      out device is set to the uplink, and packets go out after encapsulation.
      
      If the filter device is the uplink, this can trigger the following syndrome:
      mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 3966): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xcdb051), err(-22)
      
      Fix this issue by not offloading the internal port if the filter device
      is the out device. In this case, packets are not forwarded to the root
      table to be processed; the termination table is used instead to forward
      them from uplink to uplink.
      
      Fixes: 100ad4e2 ("net/mlx5e: Offload internal port as encap route device")
      Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
      Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      06b4eac9
    • net/mlx5e: Take RTNL lock before triggering netdev notifiers · c51c6734
      Lama Kayal authored
      Hold the RTNL lock when calling xdp_set_features() with a registered
      netdev, as the call triggers the netdev notifiers. This can happen, for
      example, when switching from the NIC profile to the uplink representor.
      
      A similar scenario was previously fixed in commit 72cc6549
      ("net/mlx5e: Take RTNL lock when needed before calling
      xdp_set_features()").
      
      This fixes the following assertion and warning call trace:
      
      RTNL: assertion failed at net/core/dev.c (1961)
      WARNING: CPU: 13 PID: 2529 at net/core/dev.c:1961
      call_netdevice_notifiers_info+0x7c/0x80
      Modules linked in: rpcrdma rdma_ucm ib_iser libiscsi
      scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
      ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink
      nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5
      auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
      CPU: 13 PID: 2529 Comm: devlink Not tainted
      6.5.0_for_upstream_min_debug_2023_09_07_20_04 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
      rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      RIP: 0010:call_netdevice_notifiers_info+0x7c/0x80
      Code: 8f ff 80 3d 77 0d 16 01 00 75 c5 ba a9 07 00 00 48
      c7 c6 c4 bb 0d 82 48 c7 c7 18 c8 06 82 c6 05 5b 0d 16 01 01 e8 44 f6 8c
      ff <0f> 0b eb a2 0f 1f 44 00 00 55 48 89 e5 41 54 48 83 e4 f0 48 83 ec
      RSP: 0018:ffff88819930f7f0 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffffffff8309f740 RCX: 0000000000000027
      RDX: ffff88885fb5b5c8 RSI: 0000000000000001 RDI: ffff88885fb5b5c0
      RBP: 0000000000000028 R08: ffff88887ffabaa8 R09: 0000000000000003
      R10: ffff88887fecbac0 R11: ffff88887ff7bac0 R12: ffff88819930f810
      R13: ffff88810b7fea40 R14: ffff8881154e8fd8 R15: ffff888107e881a0
      FS:  00007f3ad248f800(0000) GS:ffff88885fb40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000563b85f164e0 CR3: 0000000113b5c006 CR4: 0000000000370ea0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ? __warn+0x79/0x120
       ? call_netdevice_notifiers_info+0x7c/0x80
       ? report_bug+0x17c/0x190
       ? handle_bug+0x3c/0x60
       ? exc_invalid_op+0x14/0x70
       ? asm_exc_invalid_op+0x16/0x20
       ? call_netdevice_notifiers_info+0x7c/0x80
       call_netdevice_notifiers+0x2e/0x50
       mlx5e_set_xdp_feature+0x21/0x50 [mlx5_core]
       mlx5e_build_rep_params+0x97/0x130 [mlx5_core]
       mlx5e_init_ul_rep+0x9f/0x100 [mlx5_core]
       mlx5e_netdev_init_profile+0x76/0x110 [mlx5_core]
       mlx5e_netdev_attach_profile+0x1f/0x90 [mlx5_core]
       mlx5e_netdev_change_profile+0x92/0x160 [mlx5_core]
       mlx5e_vport_rep_load+0x329/0x4a0 [mlx5_core]
       mlx5_esw_offloads_rep_load+0x9e/0xf0 [mlx5_core]
       esw_offloads_enable+0x4bc/0xe90 [mlx5_core]
       mlx5_eswitch_enable_locked+0x3c8/0x570 [mlx5_core]
       ? kmalloc_trace+0x25/0x80
       mlx5_devlink_eswitch_mode_set+0x224/0x680 [mlx5_core]
       ? devlink_get_from_attrs_lock+0x9e/0x110
       devlink_nl_cmd_eswitch_set_doit+0x60/0xe0
       genl_family_rcv_msg_doit+0xd0/0x120
       genl_rcv_msg+0x180/0x2b0
       ? devlink_get_from_attrs_lock+0x110/0x110
       ? devlink_nl_cmd_eswitch_get_doit+0x290/0x290
       ? devlink_pernet_pre_exit+0xf0/0xf0
       ? genl_family_rcv_msg_dumpit+0xf0/0xf0
       netlink_rcv_skb+0x54/0x100
       genl_rcv+0x24/0x40
       netlink_unicast+0x1fc/0x2c0
       netlink_sendmsg+0x232/0x4a0
       sock_sendmsg+0x38/0x60
       ? _copy_from_user+0x2a/0x60
       __sys_sendto+0x110/0x160
       ? handle_mm_fault+0x161/0x260
       ? do_user_addr_fault+0x276/0x620
       __x64_sys_sendto+0x20/0x30
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f3ad231340a
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3
      0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f
      05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      RSP: 002b:00007ffd70aad4b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000c36b00 RCX:00007f3ad231340a
      RDX: 0000000000000038 RSI: 0000000000c36b00 RDI: 0000000000000003
      RBP: 0000000000c36910 R08: 00007f3ad2625200 R09: 000000000000000c
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      ---[ end trace 0000000000000000 ]---
      ------------[ cut here ]------------
      
      Fixes: 4d5ab0ad ("net/mlx5e: take into account device reconfiguration for xdp_features flag")
      Signed-off-by: Lama Kayal <lkayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      c51c6734
    • net/mlx5e: XDP, Fix XDP_REDIRECT mpwqe page fragment leaks on shutdown · aaab619c
      Dragos Tatulea authored
      When mlx5e_xdp_xmit is called without the XDP_XMIT_FLUSH flag set, it
      may leave an mpwqe session open. That is fine at runtime: the session
      will be closed on the next call to mlx5e_xdp_xmit. But an mpwqe session
      that is still open at XDP sq close time is problematic: the pc counter
      is not updated before the contents of the xdpi_fifo are flushed, which
      leaks page fragments.
      
      The fix is to always close the mpwqe session at the end of
      mlx5e_xdp_xmit, regardless of the XDP_XMIT_FLUSH flag being set or not.
      
      Fixes: 5e0d2eef ("net/mlx5e: XDP, Support Enhanced Multi-Packet TX WQE")
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      aaab619c
    • net/mlx5e: RX, Fix page_pool allocation failure recovery for legacy rq · ef9369e9
      Dragos Tatulea authored
      When a page allocation fails during refill in mlx5e_refill_rx_wqes, the
      page will be released again on the next refill call. This triggers the
      page_pool negative page fragment count warning below:
      
       [  338.326070] WARNING: CPU: 4 PID: 0 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
        ...
       [  338.328993] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329094] Call Trace:
       [  338.329097]  <IRQ>
       [  338.329100]  ? __warn+0x7d/0x120
       [  338.329105]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329173]  ? report_bug+0x155/0x180
       [  338.329179]  ? handle_bug+0x3c/0x60
       [  338.329183]  ? exc_invalid_op+0x13/0x60
       [  338.329187]  ? asm_exc_invalid_op+0x16/0x20
       [  338.329192]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329259]  mlx5e_post_rx_wqes+0x210/0x5a0 [mlx5_core]
       [  338.329327]  ? mlx5e_poll_rx_cq+0x88/0x6f0 [mlx5_core]
       [  338.329394]  mlx5e_napi_poll+0x127/0x6b0 [mlx5_core]
       [  338.329461]  __napi_poll+0x25/0x1a0
       [  338.329465]  net_rx_action+0x28a/0x300
       [  338.329468]  __do_softirq+0xcd/0x279
       [  338.329473]  irq_exit_rcu+0x6a/0x90
       [  338.329477]  common_interrupt+0x82/0xa0
       [  338.329482]  </IRQ>
      
      This patch fixes the legacy rq case by releasing all allocated fragments
      and then setting the skip flag on all released fragments. It is
      important to note that the number of released fragments will be higher
      than the number of allocated fragments when an allocation error occurs.
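      
      The recovery path described above can be sketched in plain user-space C.
      This is a simplified model under stated assumptions, not the driver code:
      struct frag_slot, slot_release(), slot_alloc(), refill() and the
      allocation budget are all hypothetical names invented for illustration.
      It shows why the failing slot is released but never reallocated, so one
      more slot must be flagged as skipped than was allocated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of one RX WQE fragment slot. */
struct frag_slot {
    int  refs;  /* page fragment references held by this slot */
    bool skip;  /* set when the slot has already been released */
};

/* Release a slot's fragment unless it is flagged as already released. */
static void slot_release(struct frag_slot *s)
{
    if (s->skip)
        return;
    s->refs--;  /* a double release would drive this negative */
}

/* Allocate a fragment for a slot; fails once the budget is exhausted. */
static bool slot_alloc(struct frag_slot *s, int *budget)
{
    if (*budget <= 0)
        return false;
    (*budget)--;
    s->refs = 1;
    s->skip = false;
    return true;
}

/* Refill n slots. Returns n on success, 0 on allocation failure. */
static int refill(struct frag_slot *slots, int n, int budget)
{
    int i, j;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        slot_release(&slots[i]);
        if (!slot_alloc(&slots[i], &budget))
            goto err;
    }
    return n;

err:
    /* i slots were allocated, but i + 1 slots were released. */
    for (j = 0; j < i; j++)
        slot_release(&slots[j]);
    for (j = 0; j <= i; j++)
        slots[j].skip = true;
    return 0;
}
```

      In this model, a failed refill() leaves every touched slot flagged, so
      the next refill() call skips the release step for those slots instead of
      releasing them a second time.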
      
      Fixes: 3f93f829 ("net/mlx5e: RX, Defer page release in legacy rq for better recycling")
      Tested-by: Chris Mason <clm@fb.com>
      Reported-by: Chris Mason <clm@fb.com>
      Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      ef9369e9
    • net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq · be43b748
      Dragos Tatulea authored
      When a page allocation fails during refill in mlx5e_post_rx_mpwqes, the
      page will be released again on the next refill call. This triggers the
      page_pool negative page fragment count warning below:
      
       [ 2436.447717] WARNING: CPU: 1 PID: 2419 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       ...
       [ 2436.447895] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.447991] Call Trace:
       [ 2436.447975]  mlx5e_post_rx_mpwqes+0x1d5/0xcf0 [mlx5_core]
       [ 2436.447994]  <IRQ>
       [ 2436.447996]  ? __warn+0x7d/0x120
       [ 2436.448009]  ? mlx5e_handle_rx_cqe_mpwrq+0x109/0x1d0 [mlx5_core]
       [ 2436.448002]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.448044]  ? mlx5e_poll_rx_cq+0x87/0x6e0 [mlx5_core]
       [ 2436.448061]  ? report_bug+0x155/0x180
       [ 2436.448065]  ? handle_bug+0x36/0x70
       [ 2436.448067]  ? exc_invalid_op+0x13/0x60
       [ 2436.448070]  ? asm_exc_invalid_op+0x16/0x20
       [ 2436.448079]  mlx5e_napi_poll+0x122/0x6b0 [mlx5_core]
       [ 2436.448077]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.448113]  ? generic_exec_single+0x35/0x100
       [ 2436.448117]  __napi_poll+0x25/0x1a0
       [ 2436.448120]  net_rx_action+0x28a/0x300
       [ 2436.448122]  __do_softirq+0xcd/0x279
       [ 2436.448126]  irq_exit_rcu+0x6a/0x90
       [ 2436.448128]  sysvec_apic_timer_interrupt+0x6e/0x90
       [ 2436.448130]  </IRQ>
      
      This patch fixes the striding rq case by setting the skip flag on all
      the wqe pages that were expected to have new pages allocated.
      
      Fixes: 4c2a1323 ("net/mlx5e: RX, Defer page release in striding rq for better recycling")
      Tested-by: Chris Mason <clm@fb.com>
      Reported-by: Chris Mason <clm@fb.com>
      Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      be43b748
    • net/mlx5: Handle fw tracer change ownership event based on MTRC · 92fd3963
      Maher Sanalla authored
      Currently, whenever fw issues a change ownership event, the PF that owns
      the fw tracer drops its ownership directly and the other PFs try to pick
      up the ownership based on what the MTRC register reports.
      
      In some cases, the driver releases ownership of the tracer and reacquires
      it later on. Whenever the driver releases ownership of the tracer, fw
      issues a change ownership event. This event can be delayed and arrive
      after the driver has reacquired ownership of the tracer. The late event
      then triggers the tracer owner PF to release the ownership again, leading
      to a scenario where no PF owns the tracer.
      
      To prevent the scenario described above, when handling a change
      ownership event, do not drop ownership of the tracer directly, instead
      read the fw MTRC register to retrieve the up-to-date owner of the tracer
      and set it accordingly in driver level.
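      
      The difference between the two event-handling strategies can be
      illustrated with a small user-space model. This is only a sketch under
      assumptions: struct mtrc_reg, struct pf_tracer, on_event_drop() and
      on_event_read_reg() are hypothetical names, and the register is reduced
      to a single owner field. Non-owner PFs are not modelled in the old
      handler.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical stand-in for the firmware MTRC register. */
struct mtrc_reg {
    int owner_pf;  /* PF currently owning the tracer, or NO_OWNER */
};

/* Hypothetical per-PF driver view of the tracer. */
struct pf_tracer {
    int  pf_id;
    bool owner;
};

/* Old behaviour: the owner drops ownership on every event it receives. */
static void on_event_drop(struct pf_tracer *t, struct mtrc_reg *reg)
{
    if (t->owner) {
        t->owner = false;
        reg->owner_pf = NO_OWNER;
    }
}

/* Fixed behaviour: take the owner from the register, not from the event. */
static void on_event_read_reg(struct pf_tracer *t, const struct mtrc_reg *reg)
{
    assert(t->pf_id >= 0);
    t->owner = (reg->owner_pf == t->pf_id);
}
```

      With a delayed event arriving after PF 0 has reacquired the tracer,
      on_event_drop() leaves the model with no owner, while
      on_event_read_reg() keeps PF 0 as owner because the register still
      names it.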
      
      Fixes: f53aaa31 ("net/mlx5: FW tracer, implement tracer logic")
      Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      92fd3963
    • net/mlx5: Bridge, fix peer entry ageing in LAG mode · 7a3ce807
      Vlad Buslov authored
      
      
      With the current implementation, in single FDB LAG mode all packets are
      processed by eswitch 0 rules. As such, 'peer' FDB entries receive the
      packets for rules of other eswitches and are responsible for updating the
      main entry by sending a SWITCHDEV_FDB_ADD_TO_BRIDGE notification from their
      background update wq task. However, this introduces a race condition when
      a non-zero eswitch instance decides to delete an FDB entry and sends a
      SWITCHDEV_FDB_DEL_TO_BRIDGE notification, but another eswitch's update task
      refreshes the same entry concurrently while its async delete work is still
      pending on the workqueue. In that case another SWITCHDEV_FDB_ADD_TO_BRIDGE
      event may be generated, and the entry will remain stuck in the FDB marked as
      'offloaded', since no more SWITCHDEV_FDB_DEL_TO_BRIDGE notifications are
      sent for deleting the peer entries.
      
      Fix the issue by synchronously marking deleted entries with
      MLX5_ESW_BRIDGE_FLAG_DELETED flag and skipping them in background update
      job.
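      
      The idea of flagging an entry at delete time and skipping it in the
      background pass can be sketched as follows. This is a minimal user-space
      model, not the driver implementation: FDB_FLAG_DELETED is a hypothetical
      stand-in for MLX5_ESW_BRIDGE_FLAG_DELETED, and struct fdb_entry,
      fdb_entry_mark_deleted() and fdb_update_pass() are invented for
      illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MLX5_ESW_BRIDGE_FLAG_DELETED. */
#define FDB_FLAG_DELETED 0x1u

struct fdb_entry {
    unsigned int flags;
    bool hit;         /* traffic seen since the last update pass */
    int  add_events;  /* ADD_TO_BRIDGE-style notifications sent */
};

/* Called synchronously on the delete path, before async delete work runs. */
static void fdb_entry_mark_deleted(struct fdb_entry *e)
{
    assert(e != NULL);
    e->flags |= FDB_FLAG_DELETED;
}

/* Background update pass: refresh entries that saw traffic, skip deleted. */
static int fdb_update_pass(struct fdb_entry *entries, size_t n)
{
    int refreshed = 0;

    for (size_t i = 0; i < n; i++) {
        struct fdb_entry *e = &entries[i];

        if (e->flags & FDB_FLAG_DELETED)
            continue;
        if (e->hit) {
            e->hit = false;
            e->add_events++;
            refreshed++;
        }
    }
    return refreshed;
}
```

      Because the flag is set before the asynchronous delete work is queued,
      a concurrent update pass in this model can no longer generate an add
      notification for an entry that is about to be removed.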
      
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      7a3ce807
    • net/mlx5: E-switch, register event handler before arming the event · 7624e58a
      Shay Drory authored
      Currently, mlx5 registers the event handler for the vport context change
      event some time after arming the event. This can lead to a missed
      event, which results in wrong rules in the FDB.
      Hence, register the event handler before arming the event.
      
      This solution is valid since FW sends the vport context change event
      only on vports that SW has armed, and SW arms a vport when enabling
      it, which is done after the FDB has been created.
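      
      The ordering problem can be shown with a tiny user-space model. It is a
      sketch under assumptions, not mlx5 code: struct vport_event,
      register_handler(), arm() and fw_fire() are hypothetical names, and an
      event that fires with no handler registered is simply counted as missed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an armed vport event and its handler. */
struct vport_event {
    bool armed;
    void (*handler)(struct vport_event *ev);
    int handled;
    int missed;
};

static void count_handled(struct vport_event *ev)
{
    ev->handled++;
}

static void register_handler(struct vport_event *ev)
{
    ev->handler = count_handled;
}

static void arm(struct vport_event *ev)
{
    ev->armed = true;
}

/* Firmware side of the model: only armed events are delivered. */
static void fw_fire(struct vport_event *ev)
{
    assert(ev != NULL);
    if (!ev->armed)
        return;
    if (ev->handler)
        ev->handler(ev);
    else
        ev->missed++;  /* armed but nobody is listening yet */
}
```

      Arming first and registering later opens a window in which fw_fire()
      increments missed; registering first closes that window, since nothing
      is delivered before arm() is called.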
      
      Fixes: 6933a937 ("net/mlx5: E-Switch, Use async events chain")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      7624e58a
    • net/mlx5: Perform DMA operations in the right locations · 8698cb92
      Shay Drory authored
      The cited patch changed the mlx5 driver so that during probe DMA
      operations were performed before pci_enable_device(), and during
      teardown DMA operations were performed after pci_disable_device().
      DMA operations require PCI to be enabled. Hence, the above leads to
      the following oops on PPC systems[1].
      
      On s390x systems, as reported by Niklas Schnelle, this is a problem
      because mlx5_pci_init() is where the DMA and coherent mask is set but
      mlx5_cmd_init() already does a dma_alloc_coherent(). Thus a DMA
      allocation is done during probe before the correct mask is set. This
      causes probe to fail initialization of the cmdif SW structs on s390x
      after that is converted to the common dma-iommu code. This is because on
      s390x DMA addresses below 4 GiB are reserved on current machines and
      unlike the old s390x specific DMA API implementation common code
      enforces DMA masks.
      
      Fix it by performing the DMA operations during probe after
      pci_enable_device() and after the dma mask is set,
      and during teardown before pci_disable_device().
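      
      The required ordering can be sketched as a small state model. This is
      an illustration under assumptions rather than the driver's probe path:
      struct dev_state, dma_alloc(), dma_free(), probe() and remove_dev() are
      hypothetical names, and the model simply refuses DMA allocation until
      the device is enabled and the mask is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state relevant to DMA. */
struct dev_state {
    bool pci_enabled;
    bool dma_mask_set;
    bool cmd_buf_allocated;
};

/* In this model, DMA allocation needs an enabled device with a mask set. */
static int dma_alloc(struct dev_state *d)
{
    if (!d->pci_enabled || !d->dma_mask_set)
        return -1;
    d->cmd_buf_allocated = true;
    return 0;
}

/* Freeing after the device is disabled is the teardown ordering bug. */
static void dma_free(struct dev_state *d)
{
    assert(d->pci_enabled);
    d->cmd_buf_allocated = false;
}

/* Probe: enable the device and set the mask first, then do DMA work. */
static int probe(struct dev_state *d)
{
    d->pci_enabled = true;
    d->dma_mask_set = true;
    return dma_alloc(d);
}

/* Teardown: undo DMA work first, then disable the device. */
static void remove_dev(struct dev_state *d)
{
    dma_free(d);
    d->pci_enabled = false;
    d->dma_mask_set = false;
}
```

      Teardown mirrors probe in reverse, which is the property the fix
      restores.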
      
      [1]
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
      Modules linked in: xt_MASQUERADE nf_conntrack_netlink
      nfnetlink xfrm_user iptable_nat xt_addrtype xt_conntrack nf_nat
      nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 netconsole rpcsec_gss_krb5
      auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser ib_umad
      rdma_cm ib_ipoib iw_cm libiscsi scsi_transport_iscsi ib_cm ib_uverbs
      ib_core mlx5_core(-) ptp pps_core fuse vmx_crypto crc32c_vpmsum [last
      unloaded: mlx5_ib]
      CPU: 1 PID: 8937 Comm: modprobe Not tainted 6.5.0-rc3_for_upstream_min_debug_2023_07_31_16_02 #1
      Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
      NIP:  c000000000423388 LR: c0000000001e733c CTR: c0000000001e4720
      REGS: c0000000055636d0 TRAP: 0380   Not tainted (6.5.0-rc3_for_upstream_min_debug_2023_07_31_16_02)
      MSR:  8000000000009033  CR: 24008884  XER: 20040000
      CFAR: c0000000001e7338 IRQMASK: 0
      NIP [c000000000423388] __free_pages+0x28/0x160
      LR [c0000000001e733c] dma_direct_free+0xac/0x190
      Call Trace:
      [c000000005563970] [5deadbeef0000100] 0x5deadbeef0000100 (unreliable)
      [c0000000055639b0] [c0000000003d46cc] kfree+0x7c/0x150
      [c000000005563a40] [c0000000001e47c8] dma_free_attrs+0xa8/0x1a0
      [c000000005563aa0] [c008000000d0064c] mlx5_cmd_cleanup+0xa4/0x100 [mlx5_core]
      [c000000005563ad0] [c008000000cf629c] mlx5_mdev_uninit+0xf4/0x140 [mlx5_core]
      [c000000005563b00] [c008000000cf6448] remove_one+0x160/0x1d0 [mlx5_core]
      [c000000005563b40] [c000000000958540] pci_device_remove+0x60/0x110
      [c000000005563b80] [c000000000a35e80] device_remove+0x70/0xd0
      [c000000005563bb0] [c000000000a37a38] device_release_driver_internal+0x2a8/0x330
      [c000000005563c00] [c000000000a37b8c] driver_detach+0x8c/0x160
      [c000000005563c40] [c000000000a35350] bus_remove_driver+0x90/0x110
      [c000000005563c80] [c000000000a38948] driver_unregister+0x48/0x90
      [c000000005563cf0] [c000000000957e38] pci_unregister_driver+0x38/0x150
      [c000000005563d40] [c008000000eb6140] mlx5_cleanup+0x38/0x90 [mlx5_core]
      
      Fixes: 06cd555f ("net/mlx5: split mlx5_cmd_init() to probe and reload routines")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8698cb92
    • Merge tag 'pinctrl-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 82a040a8
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
       "Some pin control fixes for v6.6 which have been stacking up in my
        tree.
      
        Dmitry's fix to some locking in the core is the most substantial, that
        was a really neat fix.
      
        The rest is the usual assorted spray of minor driver fixes.
      
         - Drop some minor code causing warnings in the Lantiq driver
      
         - Fix out of bounds write in the Nuvoton driver
      
         - Fix lost IRQs with CONFIG_PM in the Starfive driver
      
         - Fix a locking issue in find_pinctrl()
      
         - Revert a regressive Tegra debug patch
      
         - Fix the Renesas RZN1 pin muxing"
      
      * tag 'pinctrl-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: renesas: rzn1: Enable missing PINMUX
        Revert "pinctrl: tegra: Add support to display pin function"
        pinctrl: avoid unsafe code pattern in find_pinctrl()
        pinctrl: starfive: jh7110: Add system pm ops to save and restore context
        pinctrl: starfive: jh7110: Fix failure to set irq after CONFIG_PM is enabled
        pinctrl: nuvoton: wpcm450: fix out of bounds write
        pinctrl: lantiq: Remove unsued declaration ltq_pinctrl_unregister()
      82a040a8
  5. Oct 12, 2023