Skip to content
  1. Feb 09, 2022
    • Vladimir Oltean's avatar
      net: dsa: felix: don't use devres for mdiobus · 209bdb7e
      Vladimir Oltean authored
      As explained in commits:
      74b6d7d1 ("net: dsa: realtek: register the MDIO bus under devres")
      5135e96a ("net: dsa: don't allocate the slave_mii_bus using devres")
      
      mdiobus_free() will panic when called from devm_mdiobus_free() <-
      devres_release_all() <- __device_release_driver(), and that mdiobus was
      not previously unregistered.
      
      The Felix VSC9959 switch is a PCI device, so the initial set of
      constraints that I thought would cause this (I2C or SPI buses which call
      ->remove on ->shutdown) do not apply. But there is one more which
      applies here.
      
      If the DSA master itself is on a bus that calls ->remove from ->shutdown
      (like dpaa2-eth, which is on the fsl-mc bus), there is a device link
      between the switch and the DSA master, and device_links_unbind_consumers()
      will unbind the felix switch driver on shutdown.
      
      So the same treatment must be applied to all DSA switch drivers, which
      is: either use devres for both the mdiobus allocation and registration,
      or don't use devres at all.
      
      The felix driver has the code structure in place for orderly mdiobus
      removal, so just replace devm_mdiobus_alloc_size() with the non-devres
      variant, and add manual free where necessary, to ensure that we don't
      let devres free a still-registered bus.
      
      Fixes: ac3a68d5
      
       ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      209bdb7e
    • Vladimir Oltean's avatar
      net: dsa: bcm_sf2: don't use devres for mdiobus · 08f1a208
      Vladimir Oltean authored
      As explained in commits:
      74b6d7d1 ("net: dsa: realtek: register the MDIO bus under devres")
      5135e96a ("net: dsa: don't allocate the slave_mii_bus using devres")
      
      mdiobus_free() will panic when called from devm_mdiobus_free() <-
      devres_release_all() <- __device_release_driver(), and that mdiobus was
      not previously unregistered.
      
      The Starfighter 2 is a platform device, so the initial set of
      constraints that I thought would cause this (I2C or SPI buses which call
      ->remove on ->shutdown) do not apply. But there is one more which
      applies here.
      
      If the DSA master itself is on a bus that calls ->remove from ->shutdown
      (like dpaa2-eth, which is on the fsl-mc bus), there is a device link
      between the switch and the DSA master, and device_links_unbind_consumers()
      will unbind the bcm_sf2 switch driver on shutdown.
      
      So the same treatment must be applied to all DSA switch drivers, which
      is: either use devres for both the mdiobus allocation and registration,
      or don't use devres at all.
      
      The bcm_sf2 driver has the code structure in place for orderly mdiobus
      removal, so just replace devm_mdiobus_alloc() with the non-devres
      variant, and add manual free where necessary, to ensure that we don't
      let devres free a still-registered bus.
      
      Fixes: ac3a68d5
      
       ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      08f1a208
    • Vladimir Oltean's avatar
      net: dsa: ar9331: register the mdiobus under devres · 50facd86
      Vladimir Oltean authored
      As explained in commits:
      74b6d7d1 ("net: dsa: realtek: register the MDIO bus under devres")
      5135e96a ("net: dsa: don't allocate the slave_mii_bus using devres")
      
      mdiobus_free() will panic when called from devm_mdiobus_free() <-
      devres_release_all() <- __device_release_driver(), and that mdiobus was
      not previously unregistered.
      
      The ar9331 is an MDIO device, so the initial set of constraints that I
      thought would cause this (I2C or SPI buses which call ->remove on
      ->shutdown) do not apply. But there is one more which applies here.
      
      If the DSA master itself is on a bus that calls ->remove from ->shutdown
      (like dpaa2-eth, which is on the fsl-mc bus), there is a device link
      between the switch and the DSA master, and device_links_unbind_consumers()
      will unbind the ar9331 switch driver on shutdown.
      
      So the same treatment must be applied to all DSA switch drivers, which
      is: either use devres for both the mdiobus allocation and registration,
      or don't use devres at all.
      
      The ar9331 driver doesn't have a complex code structure for mdiobus
      removal, so just replace of_mdiobus_register with the devres variant in
      order to be all-devres and ensure that we don't free a still-registered
      bus.
      
      Fixes: ac3a68d5
      
       ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Tested-by: default avatarOleksij Rempel <o.rempel@pengutronix.de>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      50facd86
    • Vladimir Oltean's avatar
      net: dsa: mv88e6xxx: don't use devres for mdiobus · f53a2ce8
      Vladimir Oltean authored
      As explained in commits:
      74b6d7d1 ("net: dsa: realtek: register the MDIO bus under devres")
      5135e96a ("net: dsa: don't allocate the slave_mii_bus using devres")
      
      mdiobus_free() will panic when called from devm_mdiobus_free() <-
      devres_release_all() <- __device_release_driver(), and that mdiobus was
      not previously unregistered.
      
      The mv88e6xxx is an MDIO device, so the initial set of constraints that
      I thought would cause this (I2C or SPI buses which call ->remove on
      ->shutdown) do not apply. But there is one more which applies here.
      
      If the DSA master itself is on a bus that calls ->remove from ->shutdown
      (like dpaa2-eth, which is on the fsl-mc bus), there is a device link
      between the switch and the DSA master, and device_links_unbind_consumers()
      will unbind the Marvell switch driver on shutdown.
      
      systemd-shutdown[1]: Powering off.
      mv88e6085 0x0000000008b96000:00 sw_gl0: Link is Down
      fsl-mc dpbp.9: Removing from iommu group 7
      fsl-mc dpbp.8: Removing from iommu group 7
      ------------[ cut here ]------------
      kernel BUG at drivers/net/phy/mdio_bus.c:677!
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00040-gdc05f73788e5 #15
      pc : mdiobus_free+0x44/0x50
      lr : devm_mdiobus_free+0x10/0x20
      Call trace:
       mdiobus_free+0x44/0x50
       devm_mdiobus_free+0x10/0x20
       devres_release_all+0xa0/0x100
       __device_release_driver+0x190/0x220
       device_release_driver_internal+0xac/0xb0
       device_links_unbind_consumers+0xd4/0x100
       __device_release_driver+0x4c/0x220
       device_release_driver_internal+0xac/0xb0
       device_links_unbind_consumers+0xd4/0x100
       __device_release_driver+0x94/0x220
       device_release_driver+0x28/0x40
       bus_remove_device+0x118/0x124
       device_del+0x174/0x420
       fsl_mc_device_remove+0x24/0x40
       __fsl_mc_device_remove+0xc/0x20
       device_for_each_child+0x58/0xa0
       dprc_remove+0x90/0xb0
       fsl_mc_driver_remove+0x20/0x5c
       __device_release_driver+0x21c/0x220
       device_release_driver+0x28/0x40
       bus_remove_device+0x118/0x124
       device_del+0x174/0x420
       fsl_mc_bus_remove+0x80/0x100
       fsl_mc_bus_shutdown+0xc/0x1c
       platform_shutdown+0x20/0x30
       device_shutdown+0x154/0x330
       kernel_power_off+0x34/0x6c
       __do_sys_reboot+0x15c/0x250
       __arm64_sys_reboot+0x20/0x30
       invoke_syscall.constprop.0+0x4c/0xe0
       do_el0_svc+0x4c/0x150
       el0_svc+0x24/0xb0
       el0t_64_sync_handler+0xa8/0xb0
       el0t_64_sync+0x178/0x17c
      
      So the same treatment must be applied to all DSA switch drivers, which
      is: either use devres for both the mdiobus allocation and registration,
      or don't use devres at all.
      
      The Marvell driver already has a good structure for mdiobus removal, so
      just plug in mdiobus_free and get rid of devres.
      
      Fixes: ac3a68d5
      
       ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Reported-by: default avatarRafael Richter <Rafael.Richter@gin.de>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: default avatarDaniel Klauer <daniel.klauer@gin.de>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      f53a2ce8
    • Mahesh Bandewar's avatar
      bonding: pair enable_port with slave_arr_updates · 23de0d7b
      Mahesh Bandewar authored
      When 803.2ad mode enables a participating port, it should update
      the slave-array. I have observed that the member links are participating
      and are part of the active aggregator while the traffic is egressing via
      only one member link (in a case where two links are participating). Via
      kprobes I discovered that slave-arr has only one link added while
      the other participating link wasn't part of the slave-arr.
      
      I couldn't see what caused that situation but the simple code-walk
      through provided me hints that the enable_port wasn't always associated
      with the slave-array update.
      
      Fixes: ee637714
      
       ("bonding: Simplify the xmit function for modes that use xmit_hash")
      Signed-off-by: default avatarMahesh Bandewar <maheshb@google.com>
      Acked-by: default avatarJay Vosburgh <jay.vosburgh@canonical.com>
      Link: https://lore.kernel.org/r/20220207222901.1795287-1-maheshb@google.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      23de0d7b
    • Tao Liu's avatar
      gve: Recording rx queue before sending to napi · 084cbb2e
      Tao Liu authored
      This caused a significant performance degredation when using generic XDP
      with multiple queues.
      
      Fixes: f5cedc84
      
       ("gve: Add transmit and receive support")
      Signed-off-by: default avatarTao Liu <xliutaox@google.com>
      Link: https://lore.kernel.org/r/20220207175901.2486596-1-jeroendb@google.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      084cbb2e
  2. Feb 08, 2022
  3. Feb 07, 2022
  4. Feb 06, 2022
    • Eric Dumazet's avatar
      net/smc: fix ref_tracker issue in smc_pnet_add() · 28f92221
      Eric Dumazet authored
      I added the netdev_tracker_alloc() right after ndev was
      stored into the newly allocated object:
      
        new_pe->ndev = ndev;
        if (ndev)
            netdev_tracker_alloc(ndev, &new_pe->dev_tracker, GFP_KERNEL);
      
      But I missed that later, we could end up freeing new_pe,
      then calling dev_put(ndev) to release the reference on ndev.
      
      The new_pe->dev_tracker would not be freed.
      
      To solve this issue, move the netdev_tracker_alloc() call to
      the point we know for sure new_pe will be kept.
      
      syzbot report (on net-next tree, but the bug is present in net tree)
      WARNING: CPU: 0 PID: 6019 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
      Modules linked in:
      CPU: 0 PID: 6019 Comm: syz-executor.3 Not tainted 5.17.0-rc2-syzkaller-00650-g5a8fb33e5305 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
      Code: 1d f4 70 a0 09 31 ff 89 de e8 4d bc 99 fd 84 db 75 e0 e8 64 b8 99 fd 48 c7 c7 20 0c 06 8a c6 05 d4 70 a0 09 01 e8 9e 4e 28 05 <0f> 0b eb c4 e8 48 b8 99 fd 0f b6 1d c3 70 a0 09 31 ff 89 de e8 18
      RSP: 0018:ffffc900043b7400 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000040000 RSI: ffffffff815fb318 RDI: fffff52000876e72
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff815f507e R11: 0000000000000000 R12: 1ffff92000876e85
      R13: 0000000000000000 R14: ffff88805c1c6600 R15: 0000000000000000
      FS:  00007f1ef6feb700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2d02b000 CR3: 00000000223f4000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       __refcount_dec include/linux/refcount.h:344 [inline]
       refcount_dec include/linux/refcount.h:359 [inline]
       ref_tracker_free+0x53f/0x6c0 lib/ref_tracker.c:119
       netdev_tracker_free include/linux/netdevice.h:3867 [inline]
       dev_put_track include/linux/netdevice.h:3884 [inline]
       dev_put_track include/linux/netdevice.h:3880 [inline]
       dev_put include/linux/netdevice.h:3910 [inline]
       smc_pnet_add_eth net/smc/smc_pnet.c:399 [inline]
       smc_pnet_enter net/smc/smc_pnet.c:493 [inline]
       smc_pnet_add+0x5fc/0x15f0 net/smc/smc_pnet.c:556
       genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
       genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
       genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: b6064524
      
       ("net/smc: add net device tracker to struct smc_pnetentry")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      28f92221
    • Pavel Parkhomenko's avatar
      net: phy: marvell: Fix MDI-x polarity setting in 88e1118-compatible PHYs · aec12836
      Pavel Parkhomenko authored
      When setting up autonegotiation for 88E1118R and compatible PHYs,
      a software reset of PHY is issued before setting up polarity.
      This is incorrect as changes of MDI Crossover Mode bits are
      disruptive to the normal operation and must be followed by a
      software reset to take effect. Let's patch m88e1118_config_aneg()
      to fix the issue mentioned before by invoking software reset
      of the PHY just after setting up MDI-x polarity.
      
      Fixes: 605f196e
      
       ("phy: Add support for Marvell 88E1118 PHY")
      Signed-off-by: default avatarPavel Parkhomenko <Pavel.Parkhomenko@baikalelectronics.ru>
      Reviewed-by: default avatarSerge Semin <fancer.lancer@gmail.com>
      Suggested-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aec12836
  5. Feb 05, 2022
  6. Feb 04, 2022
    • Samuel Mendoza-Jonas's avatar
      ixgbevf: Require large buffers for build_skb on 82599VF · fe68195d
      Samuel Mendoza-Jonas authored
      From 4.17 onwards the ixgbevf driver uses build_skb() to build an skb
      around new data in the page buffer shared with the ixgbe PF.
      This uses either a 2K or 3K buffer, and offsets the DMA mapping by
      NET_SKB_PAD + NET_IP_ALIGN. When using a smaller buffer RXDCTL is set to
      ensure the PF does not write a full 2K bytes into the buffer, which is
      actually 2K minus the offset.
      
      However on the 82599 virtual function, the RXDCTL mechanism is not
      available. The driver attempts to work around this by using the SET_LPE
      mailbox method to lower the maximm frame size, but the ixgbe PF driver
      ignores this in order to keep the PF and all VFs in sync[0].
      
      This means the PF will write up to the full 2K set in SRRCTL, causing it
      to write NET_SKB_PAD + NET_IP_ALIGN bytes past the end of the buffer.
      With 4K pages split into two buffers, this means it either writes
      NET_SKB_PAD + NET_IP_ALIGN bytes past the first buffer (and into the
      second), or NET_SKB_PAD + NET_IP_ALIGN bytes past the end of the DMA
      mapping.
      
      Avoid this by only enabling build_skb when using "large" buffers (3K).
      These are placed in each half of an order-1 page, preventing the PF from
      writing past the end of the mapping.
      
      [0]: Technically it only ever raises the max frame size, see
      ixgbe_set_vf_lpe() in ixgbe_sriov.c
      
      Fixes: f15c5ba5
      
       ("ixgbevf: add support for using order 1 pages to receive large frames")
      Signed-off-by: default avatarSamuel Mendoza-Jonas <samjonas@amazon.com>
      Tested-by: default avatarKonrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe68195d
    • Florian Westphal's avatar
      netfilter: ctnetlink: disable helper autoassign · d1ca60ef
      Florian Westphal authored
      When userspace, e.g. conntrackd, inserts an entry with a specified helper,
      its possible that the helper is lost immediately after its added:
      
      ctnetlink_create_conntrack
        -> nf_ct_helper_ext_add + assign helper
          -> ctnetlink_setup_nat
            -> ctnetlink_parse_nat_setup
               -> parse_nat_setup -> nfnetlink_parse_nat_setup
      	                       -> nf_nat_setup_info
                                       -> nf_conntrack_alter_reply
                                         -> __nf_ct_try_assign_helper
      
      ... and __nf_ct_try_assign_helper will zero the helper again.
      
      Set IPS_HELPER bit to bypass auto-assign logic, its unwanted, just like
      when helper is assigned via ruleset.
      
      Dropped old 'not strictly necessary' comment, it referred to use of
      rcu_assign_pointer() before it got replaced by RCU_INIT_POINTER().
      
      NB: Fixes tag intentionally incorrect, this extends the referenced commit,
      but this change won't build without IPS_HELPER introduced there.
      
      Fixes: 6714cf54
      
       ("netfilter: nf_conntrack: fix explicit helper attachment and NAT")
      Reported-by: default avatarPham Thanh Tuyen <phamtyn@gmail.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      d1ca60ef
    • Florian Westphal's avatar
      MAINTAINERS: netfilter: update git links · 1f6339e0
      Florian Westphal authored
      
      
      nf and nf-next have a new location.
      
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      1f6339e0
    • Florian Westphal's avatar
      netfilter: conntrack: re-init state for retransmitted syn-ack · 82b72cb9
      Florian Westphal authored
      
      
      TCP conntrack assumes that a syn-ack retransmit is identical to the
      previous syn-ack.  This isn't correct and causes stuck 3whs in some more
      esoteric scenarios.  tcpdump to illustrate the problem:
      
       client > server: Flags [S] seq 1365731894, win 29200, [mss 1460,sackOK,TS val 2083035583 ecr 0,wscale 7]
       server > client: Flags [S.] seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215367629 ecr 2082921663]
      
      Note the invalid/outdated synack ack number.
      Conntrack marks this syn-ack as out-of-window/invalid, but it did
      initialize the reply direction parameters based on this packets content.
      
       client > server: Flags [S] seq 1365731894, win 29200, [mss 1460,sackOK,TS val 2083036623 ecr 0,wscale 7]
      
      ... retransmit...
      
       server > client: Flags [S.], seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215368644 ecr 2082921663]
      
      and another bogus synack. This repeats, then client re-uses for a new
      attempt:
      
      client > server: Flags [S], seq 2375731741, win 29200, [mss 1460,sackOK,TS val 2083100223 ecr 0,wscale 7]
      server > client: Flags [S.], seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215430754 ecr 2082921663]
      
      ... but still gets a invalid syn-ack.
      
      This repeats until:
      
       server > client: Flags [S.], seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215437785 ecr 2082921663]
       server > client: Flags [R.], seq 145824454, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215443451 ecr 2082921663]
       client > server: Flags [S], seq 2375731741, win 29200, [mss 1460,sackOK,TS val 2083115583 ecr 0,wscale 7]
       server > client: Flags [S.], seq 162602410, ack 2375731742, win 65535, [mss 8952,wscale 5,TS val 3215445754 ecr 2083115583]
      
      This syn-ack has the correct ack number, but conntrack flags it as
      invalid: The internal state was created from the first syn-ack seen
      so the sequence number of the syn-ack is treated as being outside of
      the announced window.
      
      Don't assume that retransmitted syn-ack is identical to previous one.
      Treat it like the first syn-ack and reinit state.
      
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Acked-by: default avatarJozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      82b72cb9
    • Florian Westphal's avatar
      netfilter: conntrack: move synack init code to helper · cc4f9d62
      Florian Westphal authored
      
      
      It seems more readable to use a common helper in the followup fix rather
      than copypaste or goto.
      
      No functional change intended.  The function is only called for syn-ack
      or syn in repy direction in case of simultaneous open.
      
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Acked-by: default avatarJozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      cc4f9d62
    • Florian Westphal's avatar
      netfilter: nft_payload: don't allow th access for fragments · a9e8503d
      Florian Westphal authored
      Loads relative to ->thoff naturally expect that this points to the
      transport header, but this is only true if pkt->fragoff == 0.
      
      This has little effect for rulesets with connection tracking/nat because
      these enable ip defra. For other rulesets this prevents false matches.
      
      Fixes: 96518518
      
       ("netfilter: add nftables")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      a9e8503d
    • Florian Westphal's avatar
      netfilter: conntrack: don't refresh sctp entries in closed state · 77b33719
      Florian Westphal authored
      
      
      Vivek Thrivikraman reported:
       An SCTP server application which is accessed continuously by client
       application.
       When the session disconnects the client retries to establish a connection.
       After restart of SCTP server application the session is not established
       because of stale conntrack entry with connection state CLOSED as below.
      
       (removing this entry manually established new connection):
      
       sctp 9 CLOSED src=10.141.189.233 [..]  [ASSURED]
      
      Just skip timeout update of closed entries, we don't want them to
      stay around forever.
      
      Reported-and-tested-by: default avatarVivek Thrivikraman <vivek.thrivikraman@est.tech>
      Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1579
      
      
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      77b33719
    • Steen Hegelund's avatar
      net: sparx5: Fix get_stat64 crash in tcpdump · ed14fc7a
      Steen Hegelund authored
      This problem was found with Sparx5 when the tcpdump tool requests the
      do_get_stats64 (sparx5_get_stats64) statistic.
      
      The portstats pointer was incorrectly incremented when fetching priority
      based statistics.
      
      Fixes: af4b1102
      
       (net: sparx5: add ethtool configuration and statistics support)
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Link: https://lore.kernel.org/r/20220203102900.528987-1-steen.hegelund@microchip.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ed14fc7a
    • Kees Cook's avatar
      gcc-plugins/stackleak: Use noinstr in favor of notrace · dcb85f85
      Kees Cook authored
      While the stackleak plugin was already using notrace, objtool is now a
      bit more picky.  Update the notrace uses to noinstr.  Silences the
      following objtool warnings when building with:
      
      CONFIG_DEBUG_ENTRY=y
      CONFIG_STACK_VALIDATION=y
      CONFIG_VMLINUX_VALIDATION=y
      CONFIG_GCC_PLUGIN_STACKLEAK=y
      
        vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinst...
      dcb85f85
    • Linus Torvalds's avatar
      Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · eb2eb516
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, netfilter, and ieee802154.
      
        Current release - regressions:
      
         - Partially revert "net/smc: Add netlink net namespace support", fix
           uABI breakage
      
         - netfilter:
            - nft_ct: fix use after free when attaching zone template
            - nft_byteorder: track register operations
      
        Previous releases - regressions:
      
         - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
      
         - phy: qca8081: fix speeds lower than 2.5Gb/s
      
         - sched: fix use-after-free in tc_new_tfilter()
      
        Previous releases - always broken:
      
         - tcp: fix mem under-charging with zerocopy sendmsg()
      
         - tcp: add missing tcp_skb_can_collapse() test in
           tcp_shift_skb_data()
      
         - neigh: do not trigger immediate probes on NUD_FAILED from
           neigh_managed_work, avoid a deadlock
      
         - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
           false-positives
      
         - netfilter: nft_reject_bridge: fix for missing reply from prerouting
      
         - smc: forward wakeup to smc socket waitqueue after fallback
      
         - ieee802154:
            - return meaningful error codes from the netlink helpers
            - mcr20a: fix lifs/sifs periods
            - at86rf230, ca8210: stop leaking skbs on error paths
      
         - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
      
         - ax25: add refcount in ax25_dev to avoid UAF bugs
      
         - eth: mlx5e:
            - fix SFP module EEPROM query
            - fix broken SKB allocation in HW-GRO
            - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
      
         - eth: amd-xgbe:
            - fix skb data length underflow
            - ensure reset of the tx_timer_active flag, avoid Tx timeouts
      
         - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
      
         - eth: e1000e: handshake with CSME starts from Alder Lake platforms"
      
      * tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
        ax25: fix reference count leaks of ax25_dev
        net: stmmac: ensure PTP time register reads are consistent
        net: ipa: request IPA register values be retained
        dt-bindings: net: qcom,ipa: add optional qcom,qmp property
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
        tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
        net: sparx5: do not refer to skb after passing it on
        Partially revert "net/smc: Add netlink net namespace support"
        net/mlx5e: Avoid field-overflowing memcpy()
        net/mlx5e: Use struct_group() for memcpy() region
        net/mlx5e: Avoid implicit modify hdr for decap drop rule
        net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
        net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
        net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
        net/mlx5: E-Switch, Fix uninitialized variable modact
        net/mlx5e: Fix handling of wrong devices during bond netevent
        net/mlx5e: Fix broken SKB allocation in HW-GRO
        net/mlx5e: Fix wrong calculation of header index in HW_GRO
        ...
      eb2eb516
    • Linus Torvalds's avatar
      Merge tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 551007a8
      Linus Torvalds authored
      Pull selinux fix from Paul Moore:
       "One small SELinux patch to ensure that a policy structure field is
        properly reset after freeing so that we don't inadvertently do a
        double-free on certain error conditions"
      
      * tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: fix double free of cond_list on error paths
      551007a8
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of... · 25b20ae8
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
       "Important fixes to several tests and documentation clarification on
        running mainline kselftest on stable releases. A few notable fixes:
      
         - fix kselftest run hang due to child processes that haven't been
           terminated. Fix signals all child processes
      
         - fix false pass/fail results from vdso_test_abi, openat2, mincore
      
         - build failures when using -j (multiple jobs) option
      
         - exec test build failure due to incorrect build rule for a run-time
           created "pipe"
      
         - zram test fixes related to interaction with zram-generator to make
           sure zram test to coordinate deleted with zram-generator
      
         - zram test compression ratio calculation fix and skipping
           max_comp_streams.
      
         - increasing rtc test timeout
      
         - cpufreq test to write test results to stdout which will necessary
           on automated test systems"
      
      * tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kselftest: Fix vdso_test_abi return status
        selftests: skip mincore.check_file_mmap when fs lacks needed support
        selftests: openat2: Skip testcases that fail with EOPNOTSUPP
        selftests: openat2: Add missing dependency in Makefile
        selftests: openat2: Print also errno in failure messages
        selftests: futex: Use variable MAKE instead of make
        selftests/exec: Remove pipe from TEST_GEN_FILES
        selftests/zram: Adapt the situation that /dev/zram0 is being used
        selftests/zram01.sh: Fix compression ratio calculation
        selftests/zram: Skip max_comp_streams interface on newer kernel
        docs/kselftest: clarify running mainline tests on stables
        kselftest: signal all child processes
        selftests: cpufreq: Write test output to stdout as well
        selftests: rtc: Increase test timeout so that all tests run
      25b20ae8
    • Duoming Zhou's avatar
      ax25: fix reference count leaks of ax25_dev · 87563a04
      Duoming Zhou authored
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev
      to avoid UAF bugs") introduces refcount into ax25_dev, but there
      are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
      ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
      
      This patch uses ax25_dev_put() and adjusts the position of
      ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev.
      
      Fixes: d01ffb9e
      
       ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Signed-off-by: default avatarDuoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      87563a04
    • Yannick Vignon's avatar
      net: stmmac: ensure PTP time register reads are consistent · 80d46090
      Yannick Vignon authored
      Even if protected from preemption and interrupts, a small time window
      remains when the 2 register reads could return inconsistent values,
      each time the "seconds" register changes. This could lead to an about
      1-second error in the reported time.
      
      Add logic to ensure the "seconds" and "nanoseconds" values are consistent.
      
      Fixes: 92ba6888
      
       ("stmmac: add the support for PTP hw clock driver")
      Signed-off-by: default avatarYannick Vignon <yannick.vignon@nxp.com>
      Reviewed-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20220203160025.750632-1-yannick.vignon@oss.nxp.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      80d46090
    • Jakub Kicinski's avatar
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 77b1b8b4
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-02-03
      
      We've added 6 non-merge commits during the last 10 day(s) which contain
      a total of 7 files changed, 11 insertions(+), 236 deletions(-).
      
      The main changes are:
      
      1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
         flag which otherwise trips over KASAN, from Hou Tao.
      
      2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
         rename, from Alexei Starovoitov.
      
      3) Fix a possible race in inc_misses_counter() when IRQ would trigger
         during counter update, from He Fengqing.
      
      4) Fix tooling infra for cross-building with clang upon probing whether
         gcc provides the standard libraries, from Jean-Philippe Brucker.
      
      5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.
      
      6) Drop unneeded and outdated lirc.h header copy from tooling infra as
         BPF does not require it anymore, from Sean Young.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        tools: Ignore errors from `which' when searching a GCC toolchain
        tools headers UAPI: remove stale lirc.h
        bpf: Fix possible race in inc_misses_counter
        bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
      ====================
      
      Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      77b1b8b4
    • Mickaël Salaün's avatar
      printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin() · 1f2cfdd3
      Mickaël Salaün authored
      The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
      kernel/printk/sysctl.c introduced an incorrect __user attribute to the
      buffer argument.  I spotted this change in [1] as well as the kernel
      test robot.  Revert this change to please sparse:
      
        kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
        kernel/printk/sysctl.c:20:51:    expected void *
        kernel/printk/sysctl.c:20:51:    got void [noderef] __user *buffer
      
      Fixes: faaa357a ("printk: move printk sysctl to printk/sysctl.c")
      Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net
      
       [1]
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: default avatarMickaël Salaün <mic@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20220203145029.272640-1-mic@digikod.net
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f2cfdd3
    • Igor Pylypiv's avatar
      Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Igor Pylypiv authored
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when modprobe thread is calling async_schedule(),
      but it does not work if module dispatches init code to a worker thread
      which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      a node where device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without...
      67d6212a
    • Linus Torvalds's avatar
      Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 305e6c42
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
      
       - Eric's fix for a long standing cgroup1 permission issue where it only
         checks for uid 0 instead of CAP which inadvertently allows
         unprivileged userns roots to modify release_agent userhelper
      
       - Fixes for the fallout from Waiman's recent cpuset work
      
      * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
        cgroup-v1: Require capabilities to set release_agent
        cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
        cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
      305e6c42
    • Jakub Kicinski's avatar
      Merge branch 'net-ipa-enable-register-retention' · 0166556a
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: enable register retention
      
      With runtime power management in place, we sometimes need to issue
      a command to enable retention of IPA register values before power
      collapse.  This requires a new Device Tree property, whose presence
      will also be used to signal that the command is required.
      ====================
      
      Link: https://lore.kernel.org/r/20220201150205.468403-1-elder@linaro.org
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      0166556a
    • Alex Elder's avatar
      net: ipa: request IPA register values be retained · 34a08176
      Alex Elder authored
      In some cases, the IPA hardware needs to request the always-on
      subsystem (AOSS) to coordinate with the IPA microcontroller to
      retain IPA register values at power collapse.  This is done by
      issuing a QMP request to the AOSS microcontroller.  A similar
      request ondoes that request.
      
      We must get and hold the "QMP" handle early, because we might get
      back EPROBE_DEFER for that.  But the actual request should be sent
      while we know the IPA clock is active, and when we know the
      microcontroller is operational.
      
      Fixes: 1aac309d
      
       ("net: ipa: use autosuspend")
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      34a08176
    • Alex Elder's avatar
      dt-bindings: net: qcom,ipa: add optional qcom,qmp property · ac62a017
      Alex Elder authored
      
      
      For some systems, the IPA driver must make a request to ensure that
      its registers are retained across power collapse of the IPA hardware.
      On such systems, we'll use the existence of the "qcom,qmp" property
      as a signal that this request is required.
      
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ac62a017
  7. Feb 03, 2022
    • Waiman Long's avatar
      cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning · 2bdfd282
      Waiman Long authored
      It was found that a "suspicious RCU usage" lockdep warning was issued
      with the rcu_read_lock() call in update_sibling_cpumasks().  It is
      because the update_cpumasks_hier() function may sleep. So we have
      to release the RCU lock, call update_cpumasks_hier() and reacquire
      it afterward.
      
      Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
      instead of stating that in the comment.
      
      Fixes: 4716909c
      
       ("cpuset: Track cpusets that use parent's effective_cpus")
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Tested-by: default avatarPhil Auld <pauld@redhat.com>
      Reviewed-by: default avatarPhil Auld <pauld@redhat.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      2bdfd282
    • Nathan Chancellor's avatar
      tools/resolve_btfids: Do not print any commands when building silently · 7f3bdbc3
      Nathan Chancellor authored
      When building with 'make -s', there is some output from resolve_btfids:
      
      $ make -sj"$(nproc)" oldconfig prepare
        MKDIR     .../tools/bpf/resolve_btfids/libbpf/
        MKDIR     .../tools/bpf/resolve_btfids//libsubcmd
        LINK     resolve_btfids
      
      Silent mode means that no information should be emitted about what is
      currently being done. Use the $(silent) variable from Makefile.include
      to avoid defining the msg macro so that there is no information printed.
      
      Fixes: fbbb68de
      
       ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
      Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220201212503.731732-1-nathan@kernel.org
      7f3bdbc3
    • John Hubbard's avatar
      Revert "mm/gup: small refactoring: simplify try_grab_page()" · c36c04c2
      John Hubbard authored
      This reverts commit 54d516b1
      
      That commit did a refactoring that effectively combined fast and slow
      gup paths (again).  And that was again incorrect, for two reasons:
      
       a) Fast gup and slow gup get reference counts on pages in different
          ways and with different goals: see Linus' writeup in commit
          cd1adf1b ("Revert "mm/gup: remove try_get_page(), call
          try_get_compound_head() directly""), and
      
       b) try_grab_compound_head() also has a specific check for
          "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller
          can fall back to slow gup. This resulted in new failures, as
          recently report by Will McVicker [1].
      
      But (a) has problems too, even though they may not have been reported
      yet.  So just revert this.
      
      Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1]
      Fixes: 54d516b1
      
       ("mm/gup: small refactoring: simplify try_grab_page()")
      Reported-and-tested-by: default avatarWill McVicker <willmcvicker@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: stable@vger.kernel.org # 5.15
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c36c04c2
    • Linus Torvalds's avatar
      Merge tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · d394bb77
      Linus Torvalds authored
      Pull MIPS fixes from Thomas Bogendoerfer:
      
       - fix missed change for PTR->PTR_WD conversion
      
       - kernel-doc fixes
      
      * tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: KVM: fix vz.c kernel-doc notation
        MIPS: octeon: Fix missed PTR->PTR_WD conversion
      d394bb77
    • Hou Tao's avatar
      bpf: Use VM_MAP instead of VM_ALLOC for ringbuf · b293dcc4
      Hou Tao authored
      After commit 2fd3fb0be1d1 ("kasan, vmalloc: unpoison VM_ALLOC pages
      after mapping"), non-VM_ALLOC mappings will be marked as accessible
      in __get_vm_area_node() when KASAN is enabled. But now the flag for
      ringbuf area is VM_ALLOC, so KASAN will complain out-of-bound access
      after vmap() returns. Because the ringbuf area is created by mapping
      allocated pages, so use VM_MAP instead.
      
      After the change, info in /proc/vmallocinfo also changes from
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmalloc user
      to
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmap user
      
      Fixes: 457f4436
      
       ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: default avatar <syzbot+5ad567a418794b9b5983@syzkaller.appspotmail.com>
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220202060158.6260-1-houtao1@huawei.com
      b293dcc4
    • Daniel Borkmann's avatar
      net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work · 4a81f6da
      Daniel Borkmann authored
      syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
      
        kworker/0:16/14617 is trying to acquire lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        [...]
        but task is already holding lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
      
      The neighbor entry turned to NUD_FAILED state, where __neigh_event_send()
      triggered an immediate probe as per commit cd28ca0a ("neigh: reduce
      arp latency") via neigh_probe() given table lock was held.
      
      One option to fix this situation is to defer the neigh_probe() back to
      the neigh_timer_handler() similarly as pre cd28ca0a. For the case
      of NTF_MANAGED, this deferral is acceptable given this only happens on
      actual failure state and regular / expected state is NUD_VALID with the
      entry already present.
      
      The fix adds a parameter to __neigh_event_send() in order to communicate
      whether immediate probe is allowed or disallowed. Existing call-sites
      of neigh_event_send() default as-is to immediate probe. However, the
      neigh_managed_work() disables it via use of neigh_event_send_probe().
      
      [0] <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
        print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
        check_deadlock kernel/locking/lockdep.c:2999 [inline]
        validate_chain kernel/locking/lockdep.c:3788 [inline]
        __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
        lock_acquire kernel/locking/lockdep.c:5639 [inline]
        lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
        __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
        _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
        ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
        __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
        __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
        ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
        NF_HOOK_COND include/linux/netfilter.h:296 [inline]
        ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
        dst_output include/net/dst.h:451 [inline]
        NF_HOOK include/linux/netfilter.h:307 [inline]
        ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
        ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
        ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
        neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
        __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
        neigh_event_send include/net/neighbour.h:470 [inline]
        neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
        process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
        worker_thread+0x657/0x1110 kernel/workqueue.c:2454
        kthread+0x2e9/0x3a0 kernel/kthread.c:377
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        </TASK>
      
      Fixes: 7482e384
      
       ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
      Reported-by: default avatar <syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Tested-by: default avatar <syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4a81f6da