  1. Aug 16, 2023
    • netfilter: nf_tables: fix kdoc warnings after gc rework · 08713cb0
      Florian Westphal authored
      Jakub Kicinski says:
        We've got some new kdoc warnings here:
        net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc'
        net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc'
        include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set'
      
      Fixes: 5f68718b ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
      Fixes: f6c383b8 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: nf_tables: fix false-positive lockdep splat · b9f052dc
      Florian Westphal authored
      ->abort invocation may cause splat on debug kernels:
      
      WARNING: suspicious RCU usage
      net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage!
      [..]
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid
      [..]
       lockdep_rcu_suspicious+0x1ad/0x260
       nft_pipapo_abort+0x145/0x180
       __nf_tables_abort+0x5359/0x63d0
       nf_tables_abort+0x24/0x40
       nfnetlink_rcv+0x1a0a/0x22c0
       netlink_unicast+0x73c/0x900
       netlink_sendmsg+0x7f0/0xc20
       ____sys_sendmsg+0x48d/0x760
      
      Transaction mutex is held, so parallel updates are not possible.
      Switch to _protected and check mutex is held for lockdep enabled builds.
      
      Fixes: 212ed75d ("netfilter: nf_tables: integrate pipapo into commit protocol")
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled · e4dd0d3a
      Jason Xing authored
      In the real workload, I encountered an issue which could cause the RTO
      timer to retransmit the skb per 1ms with linear option enabled. The amount
      of lost-retransmitted skbs can go up to 1000+ instantly.
      
      The root cause is that if the icsk_rto happens to be zero in the 6th round
      (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero
      due to the changed calculation method in tcp_retransmit_timer() as follows:
      
      icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
      
      Above line could be converted to
      icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0
      
      Therefore, the timer expires so quickly without any doubt.
      
      I read through the RFC 6298 and found that the RTO value can be rounded
      up to a certain value, in Linux, say TCP_RTO_MIN as default, which is
      regarded as the lower bound in this patch as suggested by Eric.
      
      Fixes: 36e31b0a ("net: TCP thin linear timeouts")
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jason Xing <kernelxing@tencent.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
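      The arithmetic described in the RTO commit above can be sketched in
      userspace C. This is a minimal illustration, not the actual patch:
      next_rto_unclamped() mirrors the line quoted in the commit message, and
      next_rto_clamped() shows one way a lower bound keeps a zero RTO from
      staying at zero. The function names are hypothetical, and the millisecond
      constants are assumed stand-ins for the kernel's jiffies-based
      TCP_RTO_MIN and TCP_RTO_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins, in milliseconds, for the kernel's jiffies-based bounds. */
#define RTO_MIN_MS 200u
#define RTO_MAX_MS 120000u

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
	assert(lo <= hi);
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Mirrors the quoted line: exponential backoff with only an upper bound. */
uint32_t next_rto_unclamped(uint32_t rto_ms)
{
	return min_u32(rto_ms << 1, RTO_MAX_MS);
}

/* Backoff with both bounds: a zero RTO is lifted to the minimum. */
uint32_t next_rto_clamped(uint32_t rto_ms)
{
	return clamp_u32(rto_ms << 1, RTO_MIN_MS, RTO_MAX_MS);
}
```

      With these assumed constants, next_rto_unclamped(0) stays at 0, so the
      timer would keep firing immediately, while next_rto_clamped(0) yields
      the 200 ms floor.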
  2. Aug 15, 2023
    • net: veth: Page pool creation error handling for existing pools only · 8a519a57
      Liang Chen authored
      The failure handling procedure destroys page pools for all queues,
      including those that haven't had their page pool created yet. This patch
      introduces the necessary adjustments to prevent potential risks and
      inconsistency with the error handling behavior.
      
      Fixes: 0ebab78c ("net: veth: add page_pool for page recycling")
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Link: https://lore.kernel.org/r/20230812023016.10553-1-liangchen.linux@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
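      The error-handling idea in the veth commit above (unwind only what was
      actually created) can be sketched in userspace C. This is an
      illustration under stated assumptions, not the driver code: struct pool,
      struct queue, pool_create(), pool_destroy() and the fail_at parameter are
      all hypothetical names used to simulate a creation failure partway
      through the queues.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-queue page pool. */
struct pool { int id; };

/* Hypothetical queue: pool stays NULL until creation succeeds. */
struct queue { struct pool *pool; };

static int destroyed;	/* counts destroy calls, for illustration only */

static struct pool *pool_create(int id, int fail_at)
{
	struct pool *p;

	if (id == fail_at)
		return NULL;	/* simulated creation failure */
	p = malloc(sizeof(*p));
	if (p)
		p->id = id;
	return p;
}

static void pool_destroy(struct pool *p)
{
	assert(p != NULL);	/* must never see a pool that was not created */
	destroyed++;
	free(p);
}

/* Returns 0 on success, -1 on failure after unwinding created pools only. */
int create_pools(struct queue *q, int n, int fail_at)
{
	int i;

	for (i = 0; i < n; i++) {
		q[i].pool = pool_create(i, fail_at);
		if (!q[i].pool)
			goto err;
	}
	return 0;
err:
	/* i is the index that failed; walk back over indices 0..i-1 only. */
	while (--i >= 0) {
		pool_destroy(q[i].pool);
		q[i].pool = NULL;
	}
	return -1;
}

int destroyed_count(void)
{
	return destroyed;
}
```

      If creation fails at index 2 of 4 queues, only the two pools that exist
      are destroyed; queues 2 and 3 are never passed to pool_destroy().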
    • Merge branch 'octeon_ep-fixes-for-error-and-remove-paths' · f6f978fc
      Jakub Kicinski authored
      
      
      Michal Schmidt says:
      
      ====================
      octeon_ep: fixes for error and remove paths
      
      I have an Octeon card that's misconfigured in a way that exposes a
      couple of bugs in the octeon_ep driver's error paths. It can reproduce
      the issues that patches 1 & 4 are fixing. Patches 2 & 3 are a result of
      reviewing the nearby code.
      ====================
      
      Link: https://lore.kernel.org/r/20230810150114.107765-1-mschmidt@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeon_ep: cancel queued works in probe error path · 758c9107
      Michal Schmidt authored
      If it fails to get the device's MAC address, octep_probe exits while
      leaving the delayed work intr_poll_task queued. When the work later
      runs, it's a use after free.
      
      Move the cancelation of intr_poll_task from octep_remove into
      octep_device_cleanup. This does not change anything in the octep_remove
      flow, but octep_device_cleanup is called also in the octep_probe error
      path, where the cancelation is needed.
      
      Note that the cancelation of ctrl_mbox_task has to follow
      intr_poll_task's, because the ctrl_mbox_task may be queued by
      intr_poll_task.
      
      Fixes: 24d43332 ("octeon_ep: poll for control messages")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Link: https://lore.kernel.org/r/20230810150114.107765-5-mschmidt@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeon_ep: cancel ctrl_mbox_task after intr_poll_task · 607a7a45
      Michal Schmidt authored
      intr_poll_task may queue ctrl_mbox_task. The function
      octep_poll_non_ioq_interrupts_cn93_pf does this.
      
      When removing the driver and canceling these two works, cancel
      ctrl_mbox_task last to guarantee it does not run anymore.
      
      Fixes: 24d43332 ("octeon_ep: poll for control messages")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Link: https://lore.kernel.org/r/20230810150114.107765-4-mschmidt@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeon_ep: cancel tx_timeout_task later in remove sequence · 28458c80
      Michal Schmidt authored
      tx_timeout_task is canceled too early when removing the driver. Nothing
      prevents .ndo_tx_timeout from triggering and queuing the work again.
      
      Better cancel it after the netdev is unregistered.
      It's harmless for octep_tx_timeout_task to run in the window between the
      unregistration and cancelation, because it checks netif_running.
      
      Fixes: 862cd659 ("octeon_ep: Add driver framework and device initialization")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Link: https://lore.kernel.org/r/20230810150114.107765-3-mschmidt@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeon_ep: fix timeout value for waiting on mbox response · 519b2279
      Michal Schmidt authored
      The intention was to wait up to 500 ms for the mbox response.
      The third argument to wait_event_interruptible_timeout() is supposed to
      be the timeout duration. The driver mistakenly passed absolute time
      instead.
      
      Fixes: 577f0d1b ("octeon_ep: add separate mailbox command and response queues")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230810150114.107765-2-mschmidt@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
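      The mbox timeout bug described above (an absolute time passed where a
      duration is expected) can be modelled in userspace C. This is a sketch
      under assumptions, not kernel code: wait_deadline() is a hypothetical
      model of a wait primitive that takes a relative timeout, and
      TICKS_PER_SEC is an assumed tick rate standing in for the kernel's
      configuration-dependent HZ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tick rate; the kernel's HZ is configuration dependent. */
#define TICKS_PER_SEC 1000u

static uint64_t ms_to_ticks(uint64_t ms)
{
	return ms * TICKS_PER_SEC / 1000u;
}

/*
 * Hypothetical model of a wait primitive that takes a relative timeout:
 * returns the tick at which the wait would give up if the condition
 * never becomes true.
 */
uint64_t wait_deadline(uint64_t now, uint64_t timeout_ticks)
{
	assert(timeout_ticks <= UINT64_MAX - now);	/* no wraparound */
	return now + timeout_ticks;
}

/* Intended call: pass a duration of 500 ms. */
uint64_t deadline_with_duration(uint64_t now)
{
	return wait_deadline(now, ms_to_ticks(500));
}

/* Mistaken call: pass an absolute time where a duration is expected. */
uint64_t deadline_with_absolute(uint64_t now)
{
	return wait_deadline(now, now + ms_to_ticks(500));
}
```

      In this model, with the clock at tick 100000, the intended call gives up
      500 ticks later, while the mistaken call would wait 100500 ticks, and
      the excess grows with uptime.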
    • net: macb: In ZynqMP resume always configure PS GTR for non-wakeup source · 6c461e39
      Radhey Shyam Pandey authored
      On the Zynq UltraScale+ MPSoC Ubuntu platform, when systemctl issues
      suspend, the network manager brings down the interface and the system
      goes into suspend. When it wakes up, it enables the interface again.
      
      This leads to a xilinx-psgtr "PLL lock timeout" on interface bringup, as
      the power management controller powers down the entire FPD (including
      SERDES) if none of the FPD devices are in use, and SERDES is not
      initialized on resume.
      
      $ sudo rtcwake -m no -s 120 -v
      $ sudo systemctl suspend  <this does ifconfig eth1 down>
      $ ifconfig eth1 up
      xilinx-psgtr fd400000.phy: lane 0 (type 10, protocol 5): PLL lock timeout
      phy phy-fd400000.phy.0: phy poweron failed --> -110
      
      The macb driver is called in this way:
      1. macb_close: Stops the network interface. This function
         resets the MACB IP and disables the PHY and the network interface.
      
      2. macb_suspend: Called in the kernel suspend flow. Because the
         network interface has been disabled (netif_running(ndev) is
         false), it does nothing and returns directly.
      
      3. The system goes into the suspend state. Some time later, the system
         is woken up by the RTC wakeup device.
      
      4. macb_resume: Does nothing because the network interface has
         been disabled.
      
      5. macb_open: Called to enable the network interface again. The ethernet
         interface is initialized in this API, but the SERDES, which is powered
         off by PMUFW during FPD-off suspend, is not initialized again, so
         we hit the GT PLL lock issue on open.
      
      To resolve this PLL timeout issue, always do PS GTR initialization
      when the ethernet device is configured as a non-wakeup source.
      
      Fixes: f22bd29b ("net: macb: Fix ZynqMP SGMII non-wakeup source resume failure")
      Fixes: 8b73fa3a ("net: macb: Added ZynqMP-specific initialization")
      Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
      Link: https://lore.kernel.org/r/1691414091-2260697-1-git-send-email-radhey.shyam.pandey@amd.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. Aug 14, 2023
  4. Aug 13, 2023
    • net: phy: fix IRQ-based wake-on-lan over hibernate / power off · cc941e54
      Russell King (Oracle) authored
      Uwe reports:
      "Most PHYs signal WoL using an interrupt. So disabling interrupts [at
      shutdown] breaks WoL at least on PHYs covered by the marvell driver."
      
      Discussing with Ioana, the problem which was trying to be solved was:
      "The board in question is a LS1021ATSN which has two AR8031 PHYs that
      share an interrupt line. In case only one of the PHYs is probed and
      there are pending interrupts on the PHY#2 an IRQ storm will happen
      since there is no entity to clear the interrupt from PHY#2's registers.
      PHY#1's driver will get stuck in .handle_interrupt() indefinitely."
      
      Further confirmation that "the two AR8031 PHYs are on the same MDIO
      bus."
      
      With WoL using interrupts to wake the system, in such a case, the
      system will begin booting with an asserted interrupt. Thus, we need to
      cope with an interrupt asserted during boot.
      
      Solve this instead by disabling interrupts during PHY probe. This will
      ensure in Ioana's situation that both PHYs of the same type sharing an
      interrupt line on a common MDIO bus will have their interrupt outputs
      disabled when the driver probes the device, but before we hook in any
      interrupt handlers - thus avoiding the interrupt storm.
      
      A better fix would be for platform firmware to disable the interrupting
      devices at source during boot, before control is handed to the kernel.
      
      Fixes: e2f016cf ("net: phy: add a shutdown procedure")
      Link: 20230804071757.383971-1-u.kleine-koenig@pengutronix.de
      Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. Aug 11, 2023
    • net: pcs: Add missing put_device call in miic_create · 829c6524
      Xiang Yang authored
      The reference to pdev->dev is taken by of_find_device_by_node, so
      it should be released when it is no longer needed.
      
      Fixes: 7dc54d3b ("net: pcs: add Renesas MII converter driver")
      Signed-off-by: Xiang Yang <xiangyang3@huawei.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: set queues after driver_ok · 51b81317
      Jason Wang authored
      Commit 25266128 ("virtio-net: fix race between set queues and
      probe") tries to fix the race between set queues and probe by calling
      _virtnet_set_queues() before DRIVER_OK is set. This violates the virtio
      spec. Fix this by setting queues after virtio_device_ready().
      
      Note that rtnl needs to be held for userspace requests to change the
      number of queues. So we are serialized in this way.
      
      Fixes: 25266128 ("virtio-net: fix race between set queues and probe")
      Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'x86/bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9ebbb29d
      Jakub Kicinski authored
      
      
      Cross merge x86 fixes to fix clang linking errors:
      
      ld.lld: error: ./arch/x86/kernel/vmlinux.lds:221: at least one side of the expression must be absolute
      
      These will hopefully be downstream by the time we ship
      the next batch of fixes.
      
      * 'x86/bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86: Move gds_ucode_mitigated() declaration to header
        x86/speculation: Add cpu_show_gds() prototype
        driver core: cpu: Make cpu_show_not_affected() static
        x86/srso: Fix build breakage with the LLVM linker
        Documentation/srso: Document IBPB aspect and fix formatting
        driver core: cpu: Unify redundant silly stubs
        Documentation/hw-vuln: Unify filename specification in index
      
      Link: https://lore.kernel.org/all/CAHk-=wj_b+FGTnevQSBAtCWuhCk=0oQ_THvthBW2hzqpOTLFmg@mail.gmail.com/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 25aa0beb
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from netfilter, wireless and bpf.
      
        Still trending up in size but the good news is that the "current"
        regressions are resolved, AFAIK.
      
        We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
        handling on embedded this week (most not merged yet), not sure why.
        But those are all for older bugs.
      
        Current release - regressions:
      
         - tls: set MSG_SPLICE_PAGES consistently when handing encrypted data
           over to TCP
      
        Current release - new code bugs:
      
         - eth: mlx5: correct IDs on VFs internal to the device (IPU)
      
        Previous releases - regressions:
      
         - phy: at803x: fix WoL support / reporting on AR8032
      
         - bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from
           slaves, leading to BUG_ON()
      
         - tun: prevent tun_build_skb() from exceeding the packet size limit
      
         - wifi: rtw89: fix 8852AE disconnection caused by RX full flags
      
         - eth/PCI: enetc: fix probing after 6fffbc7a ("PCI: Honor
           firmware's device disabled status"), keep PCI devices around even
           if they are disabled / not going to be probed to be able to apply
           quirks on them
      
         - eth: prestera: fix handling IPv4 routes with nexthop IDs
      
        Previous releases - always broken:
      
         - netfilter: re-work garbage collection to avoid races between
           user-facing API and timeouts
      
         - tunnels: fix generating ipv4 PMTU error on non-linear skbs
      
         - nexthop: fix infinite nexthop bucket dump when using maximum
           nexthop ID
      
         - wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
      
        Misc:
      
         - unix: use consistent error code in SO_PEERPIDFD
      
         - ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for
           upcoming IETF RFC"
      
      * tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
        net: hns3: fix strscpy causing content truncation issue
        net: tls: set MSG_SPLICE_PAGES consistently
        ibmvnic: Ensure login failure recovery is safe from other resets
        ibmvnic: Do partial reset on login failure
        ibmvnic: Handle DMA unmapping of login buffs in release functions
        ibmvnic: Unmap DMA login rsp buffer on send login fail
        ibmvnic: Enforce stronger sanity checks on login response
        net: mana: Fix MANA VF unload when hardware is unresponsive
        netfilter: nf_tables: remove busy mark and gc batch API
        netfilter: nft_set_hash: mark set element as dead when deleting from packet path
        netfilter: nf_tables: adapt set backend to use GC transaction API
        netfilter: nf_tables: GC transaction API to avoid race with control plane
        selftests/bpf: Add sockmap test for redirecting partial skb data
        selftests/bpf: fix a CI failure caused by vsock sockmap test
        bpf, sockmap: Fix bug that strp_done cannot be called
        bpf, sockmap: Fix map type error in sock_map_del_link
        xsk: fix refcount underflow in error path
        ipv6: adjust ndisc_is_useropt() to also return true for PIO
        selftests: forwarding: bridge_mdb: Make test more robust
        selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
        ...
    • net: hns3: fix strscpy causing content truncation issue · 5e3d2061
      Hao Chen authored
      hns3_dbg_fill_content()/hclge_dbg_fill_content() aims to integrate some
      items into a string for content, and we add '\n' and '\0' in the last
      two bytes of content.
      
      strscpy() adds '\0' in the last byte of the destination buffer (one of
      the items), which ends the content print early and truncates some of
      the dump content.
      
      One error log is shown below:
      cat mac_list/uc
      UC MAC_LIST:
      
      Expected:
      UC MAC_LIST:
      FUNC_ID  MAC_ADDR            STATE
      pf       00:2b:19:05:03:00   ACTIVE
      
      The destination buffer is length-bounded and not required to be
      NUL-terminated, so just change strscpy() to memcpy() to fix it.
      
      Fixes: 1cf3d556 ("net: hns3: fix strncpy() not using dest-buf length as length issue")
      Signed-off-by: Hao Chen <chenhao418@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Link: https://lore.kernel.org/r/20230809020902.1941471-1-shaojijie@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
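      The truncation described in the hns3 commit above can be reproduced in a
      small userspace C sketch. This is an illustration under assumptions, not
      the driver code: copy_terminated() is a hypothetical stand-in for a
      NUL-terminating bounded copy such as strscpy(), and fill_line(), the
      32-byte line length and the 10-byte column width are made up for the
      example.

```c
#include <assert.h>
#include <string.h>

#define LINE_LEN 32

/*
 * Hypothetical stand-in for a NUL-terminating bounded copy: copies at
 * most size - 1 bytes and always writes a terminating '\0'.
 */
static void copy_terminated(char *dst, const char *src, size_t size)
{
	size_t n = strlen(src);

	assert(size > 0);
	if (n > size - 1)
		n = size - 1;
	memcpy(dst, src, n);
	dst[n] = '\0';
}

/* Builds "<a>   <b>   \n" in a space-padded line of LINE_LEN bytes. */
void fill_line(char *line, const char *a, const char *b, int use_memcpy)
{
	const size_t col = 10;	/* hypothetical column width */

	assert(strlen(a) <= col && strlen(b) <= LINE_LEN - 2 - col);

	memset(line, ' ', LINE_LEN);
	line[LINE_LEN - 2] = '\n';
	line[LINE_LEN - 1] = '\0';

	if (use_memcpy) {
		memcpy(line, a, strlen(a));	/* no terminator written */
		memcpy(line + col, b, strlen(b));
	} else {
		copy_terminated(line, a, col);	/* '\0' lands inside the line */
		copy_terminated(line + col, b, col);
	}
}
```

      With the terminating copy, the first item's '\0' ends the string right
      after "pf", so the rest of the line is never printed. With memcpy() the
      padding and the final "\n\0" pair written up front stay intact.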
    • net: tls: set MSG_SPLICE_PAGES consistently · 6b486676
      Jakub Kicinski authored
      We used to change the flags for the last segment, because
      non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
      That flag is no longer a thing so remove the setting.
      
      Since flags most likely don't have MSG_SPLICE_PAGES set
      this avoids passing parts of the sg as splice and parts
      as non-splice. Before commit under Fixes we'd have called
      tcp_sendpage() which would add the MSG_SPLICE_PAGES.
      
      Why this leads to trouble remains unclear but Tariq
      reports hitting the WARN_ON(!sendpage_ok()) due to
      page refcount of 0.
      
      Fixes: e117dcfd ("tls: Inline do_tcp_sendpages()")
      Reported-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/all/4c49176f-147a-4283-f1b1-32aac7b4b996@gmail.com/
      Tested-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20230808180917.1243540-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'dmaengine-fix-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine · 30813656
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
      
       - HAS_IOMEM fixes for fsl edma and intel idma
      
       - return-value fix, interrupt vector setting and typo fix for xilinx
         xdma
      
       - email updates for codeaurora email domain move
      
       - correct pause status for pl330 driver
      
       - idxd clear flag on disable fix
      
       - function documentation fix for owl dma
      
       - potential un-allocated memory fix for mcf driver
      
      * tag 'dmaengine-fix-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
        dmaengine: xilinx: xdma: Fix typo
        dmaengine: xilinx: xdma: Fix interrupt vector setting
        dmaengine: owl-dma: Modify mismatched function name
        dmaengine: idxd: Clear PRS disable flag when disabling IDXD device
        dmaengine: pl330: Return DMA_PAUSED when transaction is paused
        dmaengine: qcom_hidma: Update codeaurora email domain
        dmaengine: mcf-edma: Fix a potential un-allocated memory access
        dmaengine: xilinx: xdma: Fix Judgment of the return value
        idmaengine: make FSL_EDMA and INTEL_IDMA64 depends on HAS_IOMEM
    • Merge tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 3e91b0eb
      Jakub Kicinski authored
      
      
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The existing attempt to resolve races between control plane and GC work
      is error prone, as reported by Bien Pham <phamnnb@sea.com>, some places
      forgot to call nft_set_elem_mark_busy(), leading to double-deactivation
      of elements.
      
      This series contains the following patches:
      
      1) Do not skip expired elements during walk otherwise elements might
         never decrement the reference counter on data, leading to memleak.
      
      2) Add a GC transaction API to replace the former attempt to deal with
         races between control plane and GC. GC worker sets on NFT_SET_ELEM_DEAD_BIT
         on elements and it creates a GC transaction to remove the expired
         elements, GC transaction could abort in case of interference with
         control plane and retried later (GC async). Set backends such as
         rbtree and pipapo also perform GC from control plane (GC sync), in
         such case, element deactivation and removal is safe because mutex
         is held then collected elements are released via call_rcu().
      
      3) Adapt existing set backends to use the GC transaction API.
      
      4) Update rhash set backend to set on _DEAD bit to report deleted
         elements from datapath for GC.
      
      5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT.
      
      * tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: remove busy mark and gc batch API
        netfilter: nft_set_hash: mark set element as dead when deleting from packet path
        netfilter: nf_tables: adapt set backend to use GC transaction API
        netfilter: nf_tables: GC transaction API to avoid race with control plane
        netfilter: nf_tables: don't skip expired elements during walk
      ====================
      
      Link: https://lore.kernel.org/r/20230810070830.24064-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 62d02fca
      Jakub Kicinski authored
      
      
      Martin KaFai Lau says:
      
      ====================
      pull-request: bpf 2023-08-09
      
      We've added 5 non-merge commits during the last 7 day(s) which contain
      a total of 6 files changed, 102 insertions(+), 8 deletions(-).
      
      The main changes are:
      
      1) A bpf sockmap memleak fix and a fix in accessing the programs of
         a sockmap under the incorrect map type from Xu Kuohai.
      
      2) A refcount underflow fix in xsk from Magnus Karlsson.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: Add sockmap test for redirecting partial skb data
        selftests/bpf: fix a CI failure caused by vsock sockmap test
        bpf, sockmap: Fix bug that strp_done cannot be called
        bpf, sockmap: Fix map type error in sock_map_del_link
        xsk: fix refcount underflow in error path
      ====================
      
      Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.dev
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ibmvnic: Ensure login failure recovery is safe from other resets · 6db541ae
      Nick Child authored
      If a login request fails, the recovery process should be protected
      against parallel resets. It is a known issue that freeing and
      registering CRQ's in quick succession can result in a failover CRQ from
      the VIOS. Processing a failover during login recovery is dangerous for
      two reasons:
       1. This will result in two parallel initialization processes, which can
          cause serious issues during login.
       2. It is possible that the failover CRQ is received but never executed.
          We get notified of a pending failover through a transport event CRQ.
          The reset is not performed until an INIT CRQ request is received.
          Previously, if CRQ init fails during login recovery, then the ibmvnic
          irq is freed and the login process returned error. If failover_pending
          is true (a transport event was received), then the ibmvnic device
          would never be able to process the reset since it cannot receive the
          CRQ_INIT request due to the irq being freed. This left the device
          in an inoperable state.
      
      Therefore, the login failure recovery process must be hardened against
      these possible issues. Possible failovers (due to quick CRQ free and
      init) must be avoided and any issues during re-initialization should be
      dealt with instead of being propagated up the stack. This logic is
      similar to that of ibmvnic_probe().
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: Nick Child <nnac123@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-5-nnac123@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ibmvnic: Do partial reset on login failure · 23cc5f66
      Nick Child authored
      Perform a partial reset before sending a login request if any of the
      following are true:
       1. If a previous request times out. This can be dangerous because the
          VIOS could still receive the old login request at any point after
          the timeout. Therefore, it is best to re-register the CRQ's and
          sub-CRQ's before retrying.
       2. If the previous request returns an error that is not described in
          PAPR. PAPR provides procedures if the login returns with partial
          success or aborted return codes (section L.5.1) but other values
          do not have a defined procedure. Previously, these conditions
          just returned error from the login function rather than trying
          to resolve the issue.
          This can cause further issues since most callers of the login
          function are not prepared to handle an error when logging in. This
          improper cleanup can lead to the device being permanently DOWN'd.
          For example, if the VIOS believes that the device is already logged
          in then it will return INVALID_STATE (-7). If we never re-register
          CRQ's then it will always think that the device is already logged
          in. This leaves the device inoperable.
      
      The partial reset involves freeing the sub-CRQs, freeing the CRQ then
      registering and initializing a new CRQ and sub-CRQs. This essentially
      restarts all communication with VIOS to allow for a fresh login attempt
      that will be unhindered by any previous failed attempts.
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: Nick Child <nnac123@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-4-nnac123@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ibmvnic: Handle DMA unmapping of login buffs in release functions · d78a671e
      Nick Child authored
      Rather than leaving the DMA unmapping of the login buffers to the
      login response handler, move this work into the login release functions.
      Previously, these functions were only used for freeing the allocated
      buffers. This could lead to issues if there are more than one
      outstanding login buffer requests, which is possible if a login request
      times out.
      
      If a login request times out, then there is another call to send login.
      The send login function makes a call to the login buffer release
      function. In the past, this freed the buffers but did not DMA unmap.
      Therefore, the VIOS could still write to the old login (now freed)
      buffer. It is for this reason that it is a good idea to leave the DMA
      unmap call to the login buffers release function.
      
      Since the login buffer release functions now handle DMA unmapping,
      remove the duplicate DMA unmapping in handle_login_rsp().
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: Nick Child <nnac123@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-3-nnac123@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d78a671e
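The ordering described in d78a671e (unmap before free, done in the release function itself) can be illustrated with a small user-space sketch. Everything here is hypothetical: `struct login_buf`, `login_buf_alloc()`, `login_buf_release()` and `send_login()` are stand-ins rather than the driver's real structures, and the `dma_mapped` flag only models where the kernel would call `dma_map_single()` / `dma_unmap_single()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one login buffer; not the ibmvnic layout. */
struct login_buf {
	void *mem;        /* CPU-side allocation */
	size_t len;
	bool dma_mapped;  /* models "the device may still write here" */
};

/* Allocate and "map" a buffer. In the kernel the mapping step would be
 * dma_map_single(); here it is only a flag. */
static int login_buf_alloc(struct login_buf *b, size_t len)
{
	b->mem = calloc(1, len);
	if (!b->mem)
		return -1;
	b->len = len;
	b->dma_mapped = true;
	return 0;
}

/* Release = unmap first, then free. Safe to call on an empty buffer,
 * so every caller gets the unmap without having to remember it. */
static void login_buf_release(struct login_buf *b)
{
	if (!b->mem)
		return;
	if (b->dma_mapped)
		b->dma_mapped = false;  /* kernel: dma_unmap_single() */
	free(b->mem);
	b->mem = NULL;
	b->len = 0;
}

/* Resend path after a timeout: drop the old buffer before creating
 * a new one, so the device never holds a mapping to freed memory. */
static int send_login(struct login_buf *b, size_t len)
{
	login_buf_release(b);
	return login_buf_alloc(b, len);
}
```

Because the unmap lives in the release function, the timeout path and the normal teardown path share the same cleanup and cannot drift apart.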
    • Nick Child's avatar
      ibmvnic: Unmap DMA login rsp buffer on send login fail · 411c565b
      Nick Child authored
      If the LOGIN CRQ fails to send then we must DMA unmap the response
      buffer. Previously, if the CRQ failed then the memory was freed without
      DMA unmapping.
      
      Fixes: c98d9cc4 ("ibmvnic: send_login should check for crq errors")
      Signed-off-by: Nick Child <nnac123@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-2-nnac123@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      411c565b
    • Nick Child's avatar
      ibmvnic: Enforce stronger sanity checks on login response · db17ba71
      Nick Child authored
      Ensure that all offsets in a login response buffer are within the size
      of the allocated response buffer. Any offsets or lengths that surpass
      the allocation are likely the result of an incomplete response buffer.
      In these cases, a full reset is necessary.
      
      When attempting to log in, the ibmvnic device will allocate a response
      buffer and pass a reference to the VIOS. The VIOS will then send the
      ibmvnic device a LOGIN_RSP CRQ to signal that the buffer has been filled
      with data. If the ibmvnic device does not get a response in 20 seconds,
      the old buffer is freed and a new login request is sent. With two
      outstanding requests, any LOGIN_RSP CRQ could be for the older
      login request. If this is the case then the login response buffer (which
      is for the newer login request) could be incomplete and contain invalid
      data. Therefore, we must enforce strict sanity checks on the response
      buffer values.
      
      Testing has shown that the `off_rxadd_buff_size` value is filled in last
      by the VIOS and will be the smoking gun for these circumstances.
      
      Until VIOS can implement a mechanism for tracking outstanding response
      buffers and a method for mapping a LOGIN_RSP CRQ to a particular login
      response buffer, the best ibmvnic can do in this situation is perform a
      full reset.
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: Nick Child <nnac123@linux.ibm.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-1-nnac123@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      db17ba71
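The kind of check db17ba71 describes (every offset and length must fall inside the allocated response buffer) might look roughly like the sketch below. The names `struct rsp_field`, `range_in_buf()` and `login_rsp_sane()` are hypothetical and do not mirror the driver's actual login response layout; the point is only the overflow-safe comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (offset, length) pair read out of a response buffer. */
struct rsp_field {
	uint32_t off;
	uint32_t len;
};

/* True if [off, off + len) lies entirely within a buffer of buf_len
 * bytes. The comparison avoids computing off + len, which could wrap. */
static bool range_in_buf(uint32_t off, uint32_t len, size_t buf_len)
{
	return off <= buf_len && len <= buf_len - off;
}

/* A response is treated as sane only if every field is in bounds.
 * A caller would fall back to a full reset when this returns false. */
static bool login_rsp_sane(const struct rsp_field *f, size_t n,
			   size_t buf_len)
{
	for (size_t i = 0; i < n; i++) {
		if (!range_in_buf(f[i].off, f[i].len, buf_len))
			return false;
	}
	return true;
}
```

A partially written buffer tends to show up as a zero or garbage offset in whichever field is filled in last, which is why a single out-of-range field is enough to reject the whole response.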
    • Souradeep Chakrabarti's avatar
      net: mana: Fix MANA VF unload when hardware is unresponsive · a7dfeda6
      Souradeep Chakrabarti authored
      When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
      hardware to complete any inflight packets and set the pending send count
      to zero. But if the hardware has failed, mana_dealloc_queues()
      could wait forever.
      
      Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
      which is a somewhat arbitrary value that is more than long enough for
      functional hardware to complete any sends.
      
      Cc: stable@vger.kernel.org
      Fixes: ca9c54d2 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
      Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
      Link: https://lore.kernel.org/r/1691576525-24271-1-git-send-email-schakrabarti@linux.microsoft.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a7dfeda6
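The bounded wait introduced by a7dfeda6 can be sketched as a poll loop with a budget. This is a simplified user-space model, not the MANA code: `struct txq`, `wait_for_drain()` and the two `hw_*` callbacks are hypothetical, and a poll count stands in for the 120-second timeout so the example runs instantly.

```c
#include <stdbool.h>

/* Hypothetical queue state: number of sends the hardware still owes us. */
struct txq {
	int pending_sends;
};

/* Simulated hardware progress between two polls. */
typedef void (*hw_step_fn)(struct txq *q);

/* Poll until pending_sends reaches zero or the budget is exhausted.
 * Returns true if the queue drained, false on timeout. */
static bool wait_for_drain(struct txq *q, hw_step_fn step,
			   unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (q->pending_sends == 0)
			return true;
		step(q);  /* a driver would sleep here, e.g. usleep_range() */
	}
	return q->pending_sends == 0;
}

/* Working hardware completes one send per poll. */
static void hw_working(struct txq *q)
{
	if (q->pending_sends > 0)
		q->pending_sends--;
}

/* Failed hardware never completes anything. */
static void hw_dead(struct txq *q)
{
	(void)q;
}
```

With `hw_dead` the loop gives up after `max_polls` iterations instead of spinning forever, and the caller can then proceed with teardown knowing the sends will never complete.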
    • Arnd Bergmann's avatar
      x86: Move gds_ucode_mitigated() declaration to header · eb3515dc
      Arnd Bergmann authored
      The declaration got placed in the .c file of the caller, but that
      causes a warning for the definition:
      
      arch/x86/kernel/cpu/bugs.c:682:6: error: no previous prototype for 'gds_ucode_mitigated' [-Werror=missing-prototypes]
      
      Move it to a header where both sides can observe it instead.
      
      Fixes: 81ac7e5d ("KVM: Add GDS_NO support to KVM")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Tested-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/all/20230809130530.1913368-2-arnd%40kernel.org
      eb3515dc
    • Arnd Bergmann's avatar
      x86/speculation: Add cpu_show_gds() prototype · a57c27c7
      Arnd Bergmann authored
      The newly added function has two definitions but no prototypes:
      
      drivers/base/cpu.c:605:16: error: no previous prototype for 'cpu_show_gds' [-Werror=missing-prototypes]
      
      Add a declaration next to the other ones for this file to avoid the
      warning.
      
      Fixes: 8974eb58 ("x86/speculation: Add Gather Data Sampling mitigation")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Tested-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/all/20230809130530.1913368-1-arnd%40kernel.org
      a57c27c7
  6. Aug 10, 2023
    • Borislav Petkov (AMD)'s avatar
      driver core: cpu: Make cpu_show_not_affected() static · 6524c798
      Borislav Petkov (AMD) authored
      Fix a -Wmissing-prototypes warning and add the gather_data_sampling()
      stub macro call for real.
      
      Fixes: 0fddfe33 ("driver core: cpu: Unify redundant silly stubs")
      Closes: https://lore.kernel.org/oe-kbuild-all/202308101956.oRj1ls7s-lkp@intel.com
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Link: https://lore.kernel.org/r/202308101956.oRj1ls7s-lkp@intel.com
      6524c798
    • Nick Desaulniers's avatar
      x86/srso: Fix build breakage with the LLVM linker · cbe8ded4
      Nick Desaulniers authored
      The assertion added to verify the difference in bits set of the
      addresses of srso_untrain_ret_alias() and srso_safe_ret_alias() would fail
      to link in LLVM's ld.lld linker with the following error:
      
        ld.lld: error: ./arch/x86/kernel/vmlinux.lds:210: at least one side of
        the expression must be absolute
        ld.lld: error: ./arch/x86/kernel/vmlinux.lds:211: at least one side of
        the expression must be absolute
      
      Use ABSOLUTE to evaluate the expression referring to at least one of the
      symbols so that LLD can evaluate the linker script.
      
      Also, add linker version info to the comment about XOR being unsupported
      in either ld.bfd or ld.lld until somewhat recently.
      
      Fixes: fb3bd914 ("x86/srso: Add a Speculative RAS Overflow mitigation")
      Closes: https://lore.kernel.org/llvm/CA+G9fYsdUeNu-gwbs0+T6XHi4hYYk=Y9725-wFhZ7gJMspLDRA@mail.gmail.com/
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Reported-by: Daniel Kolesa <daniel@octaforge.org>
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Suggested-by: Sven Volkinsfeld <thyrc@gmx.net>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Link: https://github.com/ClangBuiltLinux/linux/issues/1907
      Link: https://lore.kernel.org/r/20230809-gds-v1-1-eaac90b0cbcc@google.com
      cbe8ded4
    • Borislav Petkov (AMD)'s avatar
      Documentation/srso: Document IBPB aspect and fix formatting · 09f9f37c
      Borislav Petkov (AMD) authored
      
      
      Add a note about the dependency of the User->User mitigation on the
      previous Spectre v2 IBPB selection.
      
      Make the layout moar pretty.
      
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/r/20230809102700.29449-4-bp@alien8.de
      09f9f37c
    • Borislav Petkov (AMD)'s avatar
      driver core: cpu: Unify redundant silly stubs · 0fddfe33
      Borislav Petkov (AMD) authored
      
      
      Make them all a weak function, aliasing to a single function which
      issues the "Not affected" string.
      
      No functional changes.
      
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
      Link: https://lore.kernel.org/r/20230809102700.29449-3-bp@alien8.de
      0fddfe33
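The weak-alias technique that 0fddfe33 describes can be shown in a few lines using the GCC/Clang `weak` and `alias` function attributes. This is a sketch under assumptions: the names `show_not_affected`, `SHOW_VULN_FALLBACK`, `show_meltdown` and `show_gds` are invented for illustration, the signatures are simplified, and the `alias` attribute requires a toolchain and object format that support it (for example ELF targets).

```c
#include <stddef.h>
#include <stdio.h>

/* The single shared implementation that all fallbacks resolve to. */
static int show_not_affected(char *buf, size_t len)
{
	return snprintf(buf, len, "Not affected\n");
}

/* Declare func as a weak alias of the shared implementation. A strong
 * definition of func elsewhere in the program would override it. */
#define SHOW_VULN_FALLBACK(func)				\
	int func(char *buf, size_t len)				\
		__attribute__((weak, alias("show_not_affected")))

SHOW_VULN_FALLBACK(show_meltdown);
SHOW_VULN_FALLBACK(show_gds);
```

Each fallback costs one line, and an architecture that needs a real handler only has to provide a normal (strong) definition with the same name.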
    • Borislav Petkov (AMD)'s avatar
      Documentation/hw-vuln: Unify filename specification in index · 182ac870
      Borislav Petkov (AMD) authored
      
      
      Most of the index.rst files in Documentation/ refer to other rst files
      without their file extension in the name. Do that here too.
      
      No functional changes.
      
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/r/20230809102700.29449-2-bp@alien8.de
      182ac870
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: remove busy mark and gc batch API · a2dd0233
      Pablo Neira Ayuso authored
      
      
      Ditch it, it has been replaced by the GC transaction API and it has no
      clients anymore.
      
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a2dd0233
    • Pablo Neira Ayuso's avatar
      netfilter: nft_set_hash: mark set element as dead when deleting from packet path · c92db303
      Pablo Neira Ayuso authored
      Set the NFT_SET_ELEM_DEAD_BIT flag on this element instead of
      performing element removal, which might race with an ongoing transaction.
      Enable gc when the dynamic flag is set, since dynset deletion requires
      garbage collection after this patch.
      
      Fixes: d0a8d877 ("netfilter: nft_dynset: support for element deletion")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c92db303
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: adapt set backend to use GC transaction API · f6c383b8
      Pablo Neira Ayuso authored
      Use the GC transaction API to replace the old and buggy gc API and the
      busy mark approach.
      
      No set elements are removed from async garbage collection anymore;
      instead, the _DEAD bit is set so that the set element is no longer
      visible from the lookup path. Async GC enqueues transaction work that
      might be aborted and retried later.
      
      The rbtree and pipapo set backends do not set the _DEAD bit from the
      sync GC path, since this runs in the control plane path where the mutex
      is held. In this case, set elements are deactivated, removed and then
      released via RCU callback; sync GC never fails.
      
      Fixes: 3c4287f6 ("nf_tables: Add set type for arbitrary concatenation of ranges")
      Fixes: 8d8540c4 ("netfilter: nft_set_rbtree: add timeout support")
      Fixes: 9d098292 ("netfilter: nft_hash: add support for timeouts")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f6c383b8
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: GC transaction API to avoid race with control plane · 5f68718b
      Pablo Neira Ayuso authored
      The set types rhashtable and rbtree use a GC worker to reclaim memory.
      From system work queue, in periodic intervals, a scan of the table is
      done.
      
      The major caveat here is that the nft transaction mutex is not held.
      This causes a race between control plane and GC when they attempt to
      delete the same element.
      
      We cannot grab the netlink mutex from the work queue, because the
      control plane has to wait for the GC work queue in case the set is to be
      removed, so we get the following deadlock:
      
         cpu 1                                cpu2
           GC work                            transaction comes in , lock nft mutex
             `acquire nft mutex // BLOCKS
                                              transaction asks to remove the set
                                              set destruction calls cancel_work_sync()
      
      cancel_work_sync will now block forever, because it is waiting for the
      mutex the caller already owns.
      
      This patch adds a new API that deals with garbage collection in two
      steps:
      
      1) Lockless GC of expired elements sets the NFT_SET_ELEM_DEAD_BIT
         so they are not visible via lookup. Annotate current GC sequence in
         the GC transaction. Enqueue GC transaction work as soon as it is
         full. If ruleset is updated, then GC transaction is aborted and
         retried later.
      
      2) GC work grabs the mutex. If GC sequence has changed then this GC
         transaction lost race with control plane, abort it as it contains
         stale references to objects and let GC try again later. If the
         ruleset is intact, then this GC transaction deactivates and removes
         the elements and it uses call_rcu() to destroy elements.
      
      Note that no elements are removed from the GC lockless path; the _DEAD
      bit is set and pointers are collected. GC catchall does not remove the
      elements anymore either. There is a new set->dead flag that is set to
      abort the GC transaction, to deal with the set->ops->destroy() path which
      removes the remaining elements in the set from commit_release, where no
      mutex is held.
      
      To deal with GC when mutex is held, which allows safe deactivate and
      removal, add sync GC API which releases the set element object via
      call_rcu(). This is used by rbtree and pipapo backends which also
      perform garbage collection from control plane path.
      
      Since element removal from sets can happen from control plane and
      element garbage collection/timeout, it is necessary to keep the set
      structure alive until all elements have been deactivated and destroyed.
      
      We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
      it is called with the transaction mutex held, but the aforementioned async
      work queue might be blocked on the very mutex that the nft_set_destroy()
      callchain is sitting on.
      
      This gives us the choice of ABBA deadlock or UaF.
      
      To avoid both, add set->refs refcount_t member. The GC API can then
      increment the set refcount and release it once the elements have been
      freed.
      
      Set backends are adapted to use the GC transaction API in a follow up
      patch entitled:
      
        ("netfilter: nf_tables: use gc transaction API in set backends")
      
      This is joint work with Florian Westphal.
      
      Fixes: cfed7e1b ("netfilter: nf_tables: add set garbage collection helpers")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      5f68718b
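The two-step scheme described in 5f68718b (mark and collect without the lock, then validate the GC sequence under the lock before removing) can be modelled in a single-threaded sketch. All names here are hypothetical simplifications: `struct elem`, `struct ruleset`, `struct gc_trans`, `gc_collect()` and `gc_commit()` do not match the nf_tables structures, the mutex and call_rcu() are omitted, and a `removed` flag stands in for actual element destruction.

```c
#include <stdbool.h>
#include <stddef.h>

#define GC_BATCH 4  /* hypothetical batch size */

struct elem {
	bool expired;
	bool dead;     /* models the _DEAD bit: hidden from lookups */
	bool removed;  /* models deactivate + remove + free */
};

/* The control plane bumps gc_seq whenever the ruleset is updated. */
struct ruleset {
	unsigned int gc_seq;
};

struct gc_trans {
	unsigned int seq;             /* GC sequence seen while collecting */
	struct elem *priv[GC_BATCH];  /* collected element pointers */
	size_t count;
};

/* Step 1 (lockless in the real code): mark expired elements dead and
 * collect pointers. Nothing is removed here. Elements left dead by an
 * aborted transaction are collected again on retry. */
static void gc_collect(const struct ruleset *rs, struct elem *elems,
		       size_t n, struct gc_trans *t)
{
	t->seq = rs->gc_seq;
	t->count = 0;
	for (size_t i = 0; i < n && t->count < GC_BATCH; i++) {
		if (elems[i].removed)
			continue;
		if (elems[i].expired || elems[i].dead) {
			elems[i].dead = true;
			t->priv[t->count++] = &elems[i];
		}
	}
}

/* Step 2 (would run with the transaction mutex held): if the sequence
 * changed, the pointers may be stale, so abort and let GC retry later.
 * Otherwise remove the collected elements. */
static bool gc_commit(const struct ruleset *rs, struct gc_trans *t)
{
	bool ok = (t->seq == rs->gc_seq);

	if (ok) {
		for (size_t i = 0; i < t->count; i++)
			t->priv[i]->removed = true;
	}
	t->count = 0;
	return ok;
}
```

An aborted transaction costs only a retry: the elements stay marked dead, so lookups keep ignoring them until a later pass commits successfully.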
    • Linus Torvalds's avatar
      Merge tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd · 374a7f47
      Linus Torvalds authored
      Pull smb server fixes from Steve French:
       "Two ksmbd server fixes, both also for stable:
      
         - improve buffer validation when multiple EAs returned
      
         - missing check for command payload size"
      
      * tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd:
        ksmbd: fix wrong next length validation of ea buffer in smb2_set_ea()
        ksmbd: validate command request size
      374a7f47