  1. Jan 12, 2023
    • net: ena: Update NUMA TPH hint register upon NUMA node update · e5fbeb3d
      David Arinzon authored
      [ Upstream commit a8ee104f ]
      
      The device supports a PCIe optimization hint, which indicates on
      which NUMA the queue is currently processed. This hint is utilized
      by PCIe in order to reduce its access time by accessing the
      correct NUMA resources and maintaining cache coherence.
      
      The driver calls the register update for the hint (called TPH -
      TLP Processing Hint) during the NAPI loop.
      
      Though the update is expected upon a NUMA change (when a queue
      is moved from one NUMA to the other), the current logic performs
      a register update when the queue is moved to a different CPU,
      but the CPU is not necessarily in a different NUMA.
      
      The changes include:
      1. Performing the TPH update only when the queue has moved to a
      different NUMA node.
      2. Moving the TPH update call so that it is triggered only when NAPI
      was scheduled from interrupt context, as opposed to a busy-polling
      loop. During busy-polling a queue switches CPUs far more often, so
      it is also much more likely to switch NUMA nodes; sending the device
      such frequent updates is unlikely to be beneficial.
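
      The two conditions above can be sketched as a small, self-contained
      C helper. This is only an illustration under stated assumptions:
      struct queue_state, maybe_update_tph() and the tph_writes counter
      are hypothetical names, and the counter merely stands in for the
      real register write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-queue state; names are illustrative, not the ENA driver's. */
struct queue_state {
	int numa_node;           /* NUMA node last reported to the device */
	unsigned int tph_writes; /* counts simulated hint-register writes */
};

/*
 * Update the hint only when the poll was scheduled from interrupt context
 * and the queue now runs on a different NUMA node. A CPU change inside the
 * same node, or any change seen while busy-polling, is ignored.
 */
static bool maybe_update_tph(struct queue_state *q, int cur_numa_node,
			     bool from_interrupt)
{
	assert(q);
	if (!from_interrupt)
		return false;
	if (cur_numa_node == q->numa_node)
		return false;
	q->numa_node = cur_numa_node;
	q->tph_writes++; /* stands in for the register write */
	return true;
}
```

      With this shape, a queue that hops between CPUs of the same node
      never triggers a write, which is the behaviour the change describes.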
      
      Fixes: 1738cd3e ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Set default value for RX interrupt moderation · 7840b93c
      David Arinzon authored
      [ Upstream commit e712f3e4 ]
      
      The RX ring can be NULL in XDP use cases where only TX queues
      are configured. In this scenario, the RX interrupt moderation
      value sent to the device remains at its default value of 0.
      
      This change sets the default RX interrupt moderation value to be
      the same as the TX value.
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Fix rx_copybreak value update · d09b7a9d
      David Arinzon authored
      [ Upstream commit c7062aae ]
      
      Make the upper bound on rx_copybreak tighter by making sure it is
      smaller than the minimum of mtu and ENA_PAGE_SIZE. With the current
      upper bound of mtu, rx_copybreak can be larger than a page. Such a
      large rx_copybreak brings no performance benefit to the user and
      therefore makes no sense.
      
      In addition, the value update was only reflected in the adapter
      structure and was not applied to each ring, so it never took effect.
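
      A minimal sketch of both parts of this fix, under stated
      assumptions: the structures, SKETCH_PAGE_SIZE (a stand-in for
      ENA_PAGE_SIZE), SKETCH_NUM_RINGS and set_rx_copybreak() are
      hypothetical, and treating the bound as inclusive is a choice made
      for the sketch, not something the commit message specifies.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed stand-in for ENA_PAGE_SIZE */
#define SKETCH_NUM_RINGS 4     /* assumed ring count, for illustration */

struct rx_ring {
	unsigned int rx_copybreak;
};

struct adapter {
	unsigned int mtu;
	unsigned int rx_copybreak;
	struct rx_ring rings[SKETCH_NUM_RINGS];
};

/* Returns 0 on success, -1 if val exceeds min(mtu, page size). */
static int set_rx_copybreak(struct adapter *a, unsigned int val)
{
	unsigned int bound;
	size_t i;

	assert(a);
	bound = a->mtu < SKETCH_PAGE_SIZE ? a->mtu : SKETCH_PAGE_SIZE;
	if (val > bound)
		return -1;

	a->rx_copybreak = val;
	/* Propagate to every ring; updating only the adapter has no effect. */
	for (i = 0; i < SKETCH_NUM_RINGS; i++)
		a->rings[i].rx_copybreak = val;
	return 0;
}
```

      With a jumbo mtu of 9000, a request of 5000 is rejected in this
      sketch because it exceeds the page size, even though it is below
      the mtu.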
      
      Fixes: 1738cd3e ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
      Signed-off-by: Osama Abboud <osamaabb@amazon.com>
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Use bitmask to indicate packet redirection · 0e7ad9b0
      David Arinzon authored
      [ Upstream commit 59811faa ]
      
      Redirecting packets with XDP Redirect is done in two phases:
      1. A packet is passed by the driver to the kernel using
         xdp_do_redirect().
      2. After finishing polling for new packets the driver lets the kernel
         know that it can now process the redirected packet using
         xdp_do_flush_map().
         The packets' redirection is handled in the NAPI context of the
         queue that called xdp_do_redirect().
      
      To avoid calling xdp_do_flush_map() each time, the driver first checks
      whether any packets were redirected, using
      	xdp_flags |= xdp_verdict;
      and
      	if (xdp_flags & XDP_REDIRECT)
      	    xdp_do_flush_map()
      
      essentially treating XDP instructions as a bitmask, which isn't the case:
          enum xdp_action {
      	    XDP_ABORTED = 0,
      	    XDP_DROP,
      	    XDP_PASS,
      	    XDP_TX,
      	    XDP_REDIRECT,
          };
      
      Given the current possible values of xdp_action, the current design
      doesn't have a bug (since XDP_REDIRECT = 100b), but it is still
      flawed.
      
      This patch makes the driver use a bitmask instead, to avoid future
      issues.
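
      The difference between OR-ing raw enum values and OR-ing dedicated
      flag bits can be sketched as follows. All SK_* names and helpers
      are hypothetical and chosen for this illustration; they are not
      claimed to be the identifiers used by the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the enum quoted above (sequential values, not bit flags). */
enum xdp_action_sketch {
	SK_XDP_ABORTED = 0,
	SK_XDP_DROP,
	SK_XDP_PASS,
	SK_XDP_TX,
	SK_XDP_REDIRECT,
};

/* Hypothetical one-bit-per-outcome flags. */
#define SK_FLAG_TX       (1u << 0)
#define SK_FLAG_REDIRECT (1u << 1)
#define SK_FLAG_DROP     (1u << 2)
#define SK_FLAG_ALL      (SK_FLAG_TX | SK_FLAG_REDIRECT | SK_FLAG_DROP)

/* Translate a verdict into its own bit so verdicts can be OR-ed safely. */
static unsigned int verdict_to_flag(enum xdp_action_sketch v)
{
	switch (v) {
	case SK_XDP_TX:
		return SK_FLAG_TX;
	case SK_XDP_REDIRECT:
		return SK_FLAG_REDIRECT;
	case SK_XDP_ABORTED:
	case SK_XDP_DROP:
		return SK_FLAG_DROP;
	default:
		return 0; /* PASS needs no follow-up action */
	}
}

static bool needs_flush(unsigned int flags)
{
	assert((flags & ~SK_FLAG_ALL) == 0); /* only known bits expected */
	return (flags & SK_FLAG_REDIRECT) != 0;
}

/* The pre-patch pattern: accumulate raw enum values, test against the enum. */
static bool naive_needs_flush(unsigned int raw_acc)
{
	return (raw_acc & SK_XDP_REDIRECT) != 0;
}
```

      The naive check happens to work for today's values, but a
      hypothetical future action with value 5 (101b) would also set the
      bit tested for XDP_REDIRECT, which is the kind of issue the bitmask
      avoids.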
      
      Fixes: a318c70a ("net: ena: introduce XDP redirect implementation")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Account for the number of processed bytes in XDP · 5d496498
      David Arinzon authored
      [ Upstream commit c7f5e34d ]
      
      The size of packets that were forwarded or dropped by XDP wasn't added
      to the total processed bytes statistic.
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Don't register memory info on XDP exchange · f17d9aec
      David Arinzon authored
      [ Upstream commit 9c9e5399 ]
      
      Since the queues aren't destroyed when we only exchange XDP programs,
      there's no need to register them again.
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Fix toeplitz initial hash value · a4aa727a
      David Arinzon authored
      [ Upstream commit 332b49ff ]
      
      On driver initialization, the RSS hash initial value is set to zero
      instead of the default value. This happens because we pass NULL as
      the RSS key parameter, which causes the RSS hash initial value to
      never be initialized.
      
      This patch fixes it by making sure the initial value is set, no matter
      what the value of the RSS key is.
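
      A minimal sketch of the idea, assuming a hypothetical rss_state
      structure and fill_hash_function() helper; the key length and the
      default initial value below are placeholders chosen for the
      example, not values taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SK_KEY_LEN 40                  /* assumed key length */
#define SK_DEFAULT_INIT_VAL 0xFFFFFFFFu /* assumed default initial value */

struct rss_state {
	uint32_t init_val;
	uint8_t key[SK_KEY_LEN];
	bool key_set;
};

/*
 * The key is optional; the initial value is not. Setting init_val outside
 * the `if (key)` branch is the point of the fix: a NULL key must not leave
 * the initial value at zero.
 */
static void fill_hash_function(struct rss_state *rss, const uint8_t *key,
			       uint32_t init_val)
{
	assert(rss);
	if (key) {
		memcpy(rss->key, key, SK_KEY_LEN);
		rss->key_set = true;
	}
	rss->init_val = init_val;
}
```

      Calling the helper with a NULL key at initialization still leaves
      init_val at the requested default in this sketch.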
      
      Fixes: 91a65b7d ("net: ena: fix potential crash when rxfh key is NULL")
      Signed-off-by: Nati Koler <nkoler@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: amd-xgbe: add missed tasklet_kill · 0bec17f1
      Jiguang Xiao authored
      [ Upstream commit d530ece7 ]
      
      The driver does not call tasklet_kill in several places.
      Add the calls to fix it.
      
      Fixes: 85b85c85 ("amd-xgbe: Re-issue interrupt if interrupt status not cleared")
      Signed-off-by: Jiguang Xiao <jiguang.xiao@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5e: Fix hw mtu initializing at XDP SQ allocation · cb2f7468
      Adham Faris authored
      [ Upstream commit 1e267ab8 ]
      
      The current XDP xmit functions (mlx5e_xmit_xdp_frame_mpwqe or
      mlx5e_xmit_xdp_frame) validate the XDP packet length by comparing it
      to the hw mtu (configured at XDP SQ allocation) before transmitting
      it. This check does not account for the Ethernet FCS length
      (calculated and filled in by the NIC). Hence, when we try sending
      packets with length > (hw-mtu - ethernet-fcs-size), the device port
      drops them and tx_errors_phy is incremented. The desired behavior is
      for the driver to catch these packets and drop them.
      
      Fix this behavior in the XDP SQ allocation function
      (mlx5e_alloc_xdpsq) by subtracting the Ethernet FCS size (4 bytes)
      from the current hw mtu value, since the Ethernet FCS is calculated
      and written to Ethernet frames by the NIC.
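
      The arithmetic can be sketched as below. The helper names
      xdpsq_hw_mtu() and xdp_frame_fits() are hypothetical, and the
      sample sizes in the note that follows are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_ETH_FCS_LEN 4u /* Ethernet FCS size in bytes, as stated above */

/* MTU stored for the XDP SQ: port hw mtu minus the FCS the NIC appends. */
static unsigned int xdpsq_hw_mtu(unsigned int port_hw_mtu)
{
	assert(port_hw_mtu >= SK_ETH_FCS_LEN);
	return port_hw_mtu - SK_ETH_FCS_LEN;
}

/* Length check done before transmitting: the frame must fit the SQ MTU. */
static bool xdp_frame_fits(unsigned int frame_len, unsigned int sq_hw_mtu)
{
	return frame_len <= sq_hw_mtu;
}
```

      For a port hw mtu of 1518, a 1516-byte frame passes a check against
      the unadjusted value but fails against the adjusted 1514, so the
      driver drops it instead of the port.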
      
      Fixes: d8bec2b2 ("net/mlx5e: Support bpf_xdp_adjust_head()")
      Signed-off-by: Adham Faris <afaris@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5e: Always clear dest encap in neigh-update-del · 6c72abb7
      Chris Mi authored
      [ Upstream commit 2951b2e1 ]
      
      The cited commit introduced a bug for flows with multiple
      encapsulations. If one dest encap becomes invalid, the slow path
      flag is set on the flow. But when other dest encaps become invalid,
      they are not cleared because the flow already has the slow path
      flag. When neigh-update-add runs, it will use an invalid encap.
      
      Fix it by checking the slow path flag after clearing the dest encap.
      
      Fixes: 9a5f9cc7 ("net/mlx5e: Fix possible use-after-free deleting fdb rule")
      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5e: TC, Refactor mlx5e_tc_add_flow_mod_hdr() to get flow attr · b36783bc
      Roi Dayan authored
      [ Upstream commit ff993167 ]
      
      In a later commit we are going to instantiate multiple attr instances
      per flow instead of a single attr.
      Make sure mlx5e_tc_add_flow_mod_hdr() uses the correct attr and not
      flow->attr.
      
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Stable-dep-of: 2951b2e1 ("net/mlx5e: Always clear dest encap in neigh-update-del")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5e: IPoIB, Don't allow CQE compression to be turned on by default · f8c10eeb
      Dragos Tatulea authored
      [ Upstream commit b12d581e ]
      
      mlx5e_build_nic_params will turn CQE compression on if the hardware
      capability is enabled and the slow_pci_heuristic condition is detected.
      As IPoIB doesn't support CQE compression, make sure to disable the
      feature in the IPoIB profile init.
      
      Please note that the feature is not exposed to the user for IPoIB
      interfaces, so it can't be subsequently turned on.
      
      Fixes: b797a684 ("net/mlx5e: Enable CQE compression when PCI is slower than link")
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5: Avoid recovery in probe flows · 7227bbb7
      Shay Drory authored
      [ Upstream commit 9078e843 ]
      
      Currently, recovery is done without considering whether the device is
      still in the probe flow.
      This may lead to recovery before the device has finished probing
      successfully, e.g. while mlx5_init_one() is running. The recovery
      flow uses functionality that is loaded only by mlx5_init_one(), and
      there is no point in running recovery before mlx5_init_one() has
      finished successfully.
      
      Fix it by waiting for probe flow to finish and checking whether the
      device is probed before trying to perform recovery.
      
      Fixes: 51d138c2 ("net/mlx5: Fix health error state handling")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5: Add forgotten cleanup calls into mlx5_init_once() error path · 9369b9af
      Jiri Pirko authored
      [ Upstream commit 2a35b2c2 ]
      
      There are two cleanup calls missing in the mlx5_init_once() error
      path. Add them, making the error path flow the same as
      mlx5_cleanup_once().
      
      Fixes: 52ec462e ("net/mlx5: Add reserved-gids support")
      Fixes: 7c39afb3 ("net/mlx5: PTP code migration to driver core section")
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5: E-Switch, properly handle ingress tagged packets on VST · d966f2ee
      Moshe Shemesh authored
      [ Upstream commit 1f0ae22a ]
      
      Fix SRIOV VST mode behavior to insert cvlan when a guest tag is already
      present in the frame. Previous VST mode behavior was to drop packets or
      override existing tag, depending on the device version.
      
      In this patch we fix this behavior by correctly building the HW steering
      rule with a push vlan action, or for older devices we ask the FW to stack
      the vlan when a vlan is already present.
      
      Fixes: 07bab950 ("net/mlx5: E-Switch, Refactor eswitch ingress acl codes")
      Fixes: dfcb1ed3 ("net/mlx5: E-Switch, Vport ingress/egress ACLs rules for VST mode")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa_sim: fix vringh initialization in vdpasim_queue_ready() · 6a37a01a
      Stefano Garzarella authored
      [ Upstream commit 794ec498 ]
      
      When we initialize vringh, we should pass the features and the
      number of elements in the virtqueue negotiated with the driver,
      otherwise operations with vringh may fail.
      
      This was discovered in a case where the driver sets a number of
      elements in the virtqueue different from the value returned by
      .get_vq_num_max().
      
      In vdpasim_vq_reset() it is safe to initialize the vringh with
      default values, since the virtqueue will not be used until
      vdpasim_queue_ready() is called again.
      
      Fixes: 2c53d0f6 ("vdpasim: vDPA device simulator")
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Message-Id: <20221110141335.62171-1-sgarzare@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Eugenio Pérez <eperezma@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vhost: fix range used in translate_desc() · e3462410
      Stefano Garzarella authored
      [ Upstream commit 98047313 ]
      
      vhost_iotlb_itree_first() requires `start` and `last` parameters
      to search for a mapping that overlaps the range.
      
      In translate_desc() we cyclically call vhost_iotlb_itree_first(),
      incrementing `addr` by the amount already translated, so rightly
      we move the `start` parameter passed to vhost_iotlb_itree_first(),
      but we should hold the `last` parameter constant.
      
      Let's fix it by saving the `last` parameter value before incrementing
      `addr` in the loop.
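
      The loop shape described here can be sketched with a plain array
      standing in for the interval tree. Everything below is a
      hypothetical illustration: struct sk_map, sk_first_overlap() and
      sk_translate() are invented names, and the mappings are assumed to
      be sorted by start address and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry covering the inclusive range [start, last]. */
struct sk_map {
	uint64_t start;
	uint64_t last;
};

/* First mapping overlapping [start, last]; maps are assumed sorted. */
static const struct sk_map *sk_first_overlap(const struct sk_map *maps,
					     size_t n, uint64_t start,
					     uint64_t last)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (maps[i].start <= last && maps[i].last >= start)
			return &maps[i];
	return NULL;
}

/* Returns the number of chunks used, or -1 if part of the range is unmapped. */
static int sk_translate(const struct sk_map *maps, size_t n, uint64_t addr,
			uint64_t len)
{
	uint64_t last, done = 0;
	int chunks = 0;

	assert(len > 0);
	last = addr + len - 1; /* saved once, before addr starts moving */

	while (done < len) {
		const struct sk_map *m = sk_first_overlap(maps, n, addr, last);
		uint64_t size;

		if (!m || m->start > addr)
			return -1;
		size = m->last - addr + 1;
		if (size > len - done)
			size = len - done;
		addr += size; /* `start` advances ... */
		done += size; /* ... while `last` stays constant */
		chunks++;
	}
	return chunks;
}
```

      Recomputing the end of the range from the advanced `addr` and the
      original length would push the search window past the requested
      range, which is the mistake the saved `last` value avoids.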
      
      Fixes: a9709d68 ("vhost: convert pre sorted vhost memory array to interval tree")
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Message-Id: <20221109102503.18816-3-sgarzare@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vringh: fix range used in iotlb_translate() · 13871f60
      Stefano Garzarella authored
      [ Upstream commit f85efa9b ]
      
      vhost_iotlb_itree_first() requires `start` and `last` parameters
      to search for a mapping that overlaps the range.
      
      In iotlb_translate() we cyclically call vhost_iotlb_itree_first(),
      incrementing `addr` by the amount already translated, so rightly
      we move the `start` parameter passed to vhost_iotlb_itree_first(),
      but we should hold the `last` parameter constant.
      
      Let's fix it by saving the `last` parameter value before incrementing
      `addr` in the loop.
      
      Fixes: 9ad9c49c ("vringh: IOTLB support")
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Message-Id: <20221109102503.18816-2-sgarzare@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vhost/vsock: Fix error handling in vhost_vsock_init() · e05d4c8c
      Yuan Can authored
      [ Upstream commit 7a4efe18 ]
      
      modprobe of vhost_vsock fails with the following log:
      
      modprobe: ERROR: could not insert 'vhost_vsock': Device or resource busy
      
      The reason is that vhost_vsock_init() returns the result of
      misc_register() directly without checking it. If misc_register()
      fails, vhost_vsock_init() returns without calling
      vsock_core_unregister() on vhost_transport, so vhost_vsock can never
      be installed afterwards.
      A simple call graph is shown below:
      
       vhost_vsock_init()
         vsock_core_register() # register vhost_transport
         misc_register()
           device_create_with_groups()
             device_create_groups_vargs()
               dev = kzalloc(...) # OOM happened
         # return without unregister vhost_transport
      
      Fix by calling vsock_core_unregister() when misc_register() returns error.
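
      The unwind pattern can be sketched in user-space C with stubs. The
      sk_* functions and the sk_misc_register_result knob are
      hypothetical stand-ins for vsock_core_register(),
      vsock_core_unregister() and misc_register(); they only model the
      "already registered" state that produces the busy error.

```c
#include <assert.h>
#include <errno.h>

static int sk_transport_registered;
static int sk_misc_register_result; /* test knob: 0 or a negative errno */

static int sk_core_register(void)
{
	if (sk_transport_registered)
		return -EBUSY; /* models "Device or resource busy" */
	sk_transport_registered = 1;
	return 0;
}

static void sk_core_unregister(void)
{
	assert(sk_transport_registered);
	sk_transport_registered = 0;
}

static int sk_misc_register(void)
{
	return sk_misc_register_result;
}

static int sk_module_init(void)
{
	int ret;

	ret = sk_core_register();
	if (ret < 0)
		return ret;

	ret = sk_misc_register();
	if (ret) {
		/* Undo the first step so a later load attempt can succeed. */
		sk_core_unregister();
		return ret;
	}
	return 0;
}
```

      Without the unregister call in the error branch, a second
      sk_module_init() in this sketch would return -EBUSY, mirroring the
      modprobe failure quoted above.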
      
      Fixes: 433fc58e ("VSOCK: Introduce vhost_vsock.ko")
      Signed-off-by: Yuan Can <yuancan@huawei.com>
      Message-Id: <20221108101705.45981-1-yuancan@huawei.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa_sim: fix possible memory leak in vdpasim_net_init() and vdpasim_blk_init() · 586e6fd7
      ruanjinjie authored
      [ Upstream commit aeca7ff2 ]
      
      With fault injection while probing the module, if device_register()
      fails in vdpasim_net_init() or vdpasim_blk_init(), the refcount of
      the kobject is not decreased to 0 and the name allocated in
      dev_set_name() is leaked.
      Fix this by calling put_device(), so that the name can be freed in
      the callback function kobject_cleanup().
      
      (vdpa_sim_net)
      unreferenced object 0xffff88807eebc370 (size 16):
        comm "modprobe", pid 3848, jiffies 4362982860 (age 18.153s)
        hex dump (first 16 bytes):
          76 64 70 61 73 69 6d 5f 6e 65 74 00 6b 6b 6b a5  vdpasim_net.kkk.
        backtrace:
          [<ffffffff8174f19e>] __kmalloc_node_track_caller+0x4e/0x150
          [<ffffffff81731d53>] kstrdup+0x33/0x60
          [<ffffffff83a5d421>] kobject_set_name_vargs+0x41/0x110
          [<ffffffff82d87aab>] dev_set_name+0xab/0xe0
          [<ffffffff82d91a23>] device_add+0xe3/0x1a80
          [<ffffffffa0270013>] 0xffffffffa0270013
          [<ffffffff81001c27>] do_one_initcall+0x87/0x2e0
          [<ffffffff813739cb>] do_init_module+0x1ab/0x640
          [<ffffffff81379d20>] load_module+0x5d00/0x77f0
          [<ffffffff8137bc40>] __do_sys_finit_module+0x110/0x1b0
          [<ffffffff83c4d505>] do_syscall_64+0x35/0x80
          [<ffffffff83e0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      (vdpa_sim_blk)
      unreferenced object 0xffff8881070c1250 (size 16):
        comm "modprobe", pid 6844, jiffies 4364069319 (age 17.572s)
        hex dump (first 16 bytes):
          76 64 70 61 73 69 6d 5f 62 6c 6b 00 6b 6b 6b a5  vdpasim_blk.kkk.
        backtrace:
          [<ffffffff8174f19e>] __kmalloc_node_track_caller+0x4e/0x150
          [<ffffffff81731d53>] kstrdup+0x33/0x60
          [<ffffffff83a5d421>] kobject_set_name_vargs+0x41/0x110
          [<ffffffff82d87aab>] dev_set_name+0xab/0xe0
          [<ffffffff82d91a23>] device_add+0xe3/0x1a80
          [<ffffffffa0220013>] 0xffffffffa0220013
          [<ffffffff81001c27>] do_one_initcall+0x87/0x2e0
          [<ffffffff813739cb>] do_init_module+0x1ab/0x640
          [<ffffffff81379d20>] load_module+0x5d00/0x77f0
          [<ffffffff8137bc40>] __do_sys_finit_module+0x110/0x1b0
          [<ffffffff83c4d505>] do_syscall_64+0x35/0x80
          [<ffffffff83e0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
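
      The put_device() fix described in this commit follows a general
      reference-counting rule, sketched here in user-space C. The sk_*
      names, the `fail` parameter and the sk_live_names counter are all
      hypothetical; they model a registration that allocates a name and
      then fails, not the actual driver-core implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sk_device {
	int refcount;
	char *name;
};

static int sk_live_names; /* outstanding name allocations, sketch only */

/* Drop one reference; the name is freed when the last one goes away. */
static void sk_put_device(struct sk_device *d)
{
	assert(d->refcount > 0);
	if (--d->refcount == 0 && d->name) {
		free(d->name);
		d->name = NULL;
		sk_live_names--;
	}
}

/*
 * Takes the initial reference and allocates the name. `fail` simulates an
 * error that happens after the name has already been allocated.
 */
static int sk_device_register(struct sk_device *d, const char *name, int fail)
{
	d->refcount = 1;
	d->name = malloc(strlen(name) + 1);
	if (!d->name)
		return -1;
	strcpy(d->name, name);
	sk_live_names++;
	return fail ? -1 : 0;
}

static int sk_module_init(struct sk_device *d, int fail)
{
	int ret = sk_device_register(d, "vdpasim_net", fail);

	if (ret)
		sk_put_device(d); /* drop the reference so the name is freed */
	return ret;
}
```

      Returning the error without the put would leave the name allocated
      with no owner, which matches the 16-byte objects in the reports
      above.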
      
      Fixes: 899c4d18 ("vdpa_sim_blk: add support for vdpa management tool")
      Fixes: a3c06ae1 ("vdpa_sim_net: Add support for user supported devices")
      
      Signed-off-by: ruanjinjie <ruanjinjie@huawei.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Message-Id: <20221110082348.4105476-1-ruanjinjie@huawei.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • nfc: Fix potential resource leaks · b63bc2db
      Miaoqian Lin authored
      [ Upstream commit df49908f ]
      
      nfc_get_device() takes a reference on the device; add the missing
      nfc_put_device() calls to release it when it is no longer needed.
      Also fix the style warning by using the error code EOPNOTSUPP instead
      of ENOTSUPP.
      
      Fixes: 5ce3f32b ("NFC: netlink: SE API implementation")
      Fixes: 29e76924 ("nfc: netlink: Add capability to reply to vendor_cmd with data")
      Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: dsa: mv88e6xxx: depend on PTP conditionally · 945e58bd
      Johnny S. Lee authored
      [ Upstream commit 30e72553 ]
      
      PTP hardware timestamping related objects are not linked when PTP
      support for MV88E6xxx (NET_DSA_MV88E6XXX_PTP) is disabled, therefore
      NET_DSA_MV88E6XXX should not depend on PTP_1588_CLOCK_OPTIONAL
      regardless of NET_DSA_MV88E6XXX_PTP.
      
      Instead, condition more strictly on how NET_DSA_MV88E6XXX_PTP's
      dependencies are met, making sure that it cannot be enabled when
      NET_DSA_MV88E6XXX=y and PTP_1588_CLOCK=m.
      
      In other words, this commit allows NET_DSA_MV88E6XXX to be built-in
      while PTP_1588_CLOCK is a module, as long as NET_DSA_MV88E6XXX_PTP is
      prevented from being enabled.
      
      Fixes: e5f31552 ("ethernet: fix PTP_1588_CLOCK dependencies")
      Signed-off-by: Johnny S. Lee <foss@jsl.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • qlcnic: prevent ->dcb use-after-free on qlcnic_dcb_enable() failure · 95df720e
      Daniil Tatianin authored
      [ Upstream commit 13a7c896 ]
      
      adapter->dcb would get silently freed inside qlcnic_dcb_enable() in
      case qlcnic_dcb_attach() would return an error, which always happens
      under OOM conditions. This would lead to use-after-free because both
      of the existing callers invoke qlcnic_dcb_get_info() on the obtained
      pointer, which is potentially freed at that point.
      
      Propagate errors from qlcnic_dcb_enable(), and instead free the dcb
      pointer at callsite using qlcnic_dcb_free(). This also removes the now
      unused qlcnic_clear_dcb_ops() helper, which was a simple wrapper around
      kfree() also causing memory leaks for partially initialized dcb.
      
      Found by Linux Verification Center (linuxtesting.org) with the SVACE
      static analysis tool.
      
      Fixes: 3c44bba1 ("qlcnic: Disable DCB operations from SR-IOV VFs")
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: sched: fix memory leak in tcindex_set_parms · 6c55953e
      Hawkins Jiawei authored
      [ Upstream commit 399ab7fe ]
      
      Syzkaller reports a memory leak as follows:
      ====================================
      BUG: memory leak
      unreferenced object 0xffff88810c287f00 (size 256):
        comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046
          [<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline]
          [<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline]
          [<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline]
          [<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline]
          [<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342
          [<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553
          [<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147
          [<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082
          [<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540
          [<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
          [<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345
          [<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921
          [<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline]
          [<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734
          [<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482
          [<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536
          [<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622
          [<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline]
          [<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline]
          [<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648
          [<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
      ====================================
      
      The kernel uses tcindex_change() to change the properties of an
      existing filter.
      
      The problem is that, during the change, if `old_r` is retrieved
      from `p->perfect`, the kernel uses tcindex_alloc_perfect_hash() to
      allocate new filter results and uses tcindex_filter_result_init()
      to clear the old filter result without destroying its tcf_exts
      structure, which triggers the above memory leak.
      
      To be more specific, there are only two sources for `old_r`,
      according to tcindex_lookup(): `old_r` is retrieved either from
      `p->perfect` or from `p->h`.
      
        * If `old_r` is retrieved from `p->perfect`, kernel uses
      tcindex_alloc_perfect_hash() to newly allocate the
      filter results. Then `r` is assigned with `cp->perfect + handle`,
      which is newly allocated. So condition `old_r && old_r != r` is
      true in this situation, and kernel uses tcindex_filter_result_init()
      to clear the old filter result, without destroying
      its tcf_exts structure
      
        * If `old_r` is retrieved from `p->h`, then `p->perfect` is NULL
      according to the tcindex_lookup(). Considering that `cp->h`
      is directly copied from `p->h` and `p->perfect` is NULL,
      `r` is assigned with `tcindex_lookup(cp, handle)`, whose value
      should be the same as `old_r`, so condition `old_r && old_r != r`
      is false in this situation, kernel ignores using
      tcindex_filter_result_init() to clear the old filter result.
      
      So only when `old_r` is retrieved from `p->perfect` does kernel use
      tcindex_filter_result_init() to clear the old filter result, which
      triggers the above memory leak.
      
      Considering that there already exists a tc_filter_wq workqueue
      to destroy the old tcindex_data by tcindex_partial_destroy_work()
      at the end of tcindex_set_parms(), this patch solves
      this memory leak bug by removing this old filter result
      clearing part and delegating it to the tc_filter_wq workqueue.
      
      Note that this patch doesn't introduce any other issues. If
      `old_r` is retrieved from `p->perfect`, this patch just
      delegates old filter result clearing part to the
      tc_filter_wq workqueue; If `old_r` is retrieved from `p->h`,
      kernel doesn't reach the old filter result clearing part, so
      removing this part has no effect.
      
      [Thanks to the suggestion from Jakub Kicinski, Cong Wang, Paolo Abeni
      and Dmitry Vyukov]
      
      Fixes: b9a24bb7 ("net_sched: properly handle failure case of tcf_exts_init()")
      Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/
      Reported-by: <syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com>
      Tested-by: <syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com>
      Cc: Cong Wang <cong.wang@bytedance.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: hns3: fix VF promisc mode not update when mac table full · d14a4b24
      Jian Shen authored
      [ Upstream commit 8ee57c7b ]
      
      Currently, the HCLGE_VPORT_STATE_PROMISC_CHANGE flag is not set
      for a VF when vport->overflow_promisc_flags changes, so the VF
      does not check whether to update promisc mode in this case.
      Set the flag in that case.
      
      Fixes: 1e6e7610 ("net: hns3: configure promisc mode for VF asynchronously")
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Signed-off-by: Hao Lan <lanhao@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d14a4b24
    • Jian Shen's avatar
      net: hns3: fix miss L3E checking for rx packet · 7ed205b9
      Jian Shen authored
      [ Upstream commit 7d89b53c ]
      
      For devices that support the RXD advanced layout, the driver
      returns directly if the hardware has finished the checksum
      calculation. This skips the L3E check for IP packets.
      Fix it.
      
      Fixes: 1ddc028a ("net: hns3: refactor out RX completion checksum")
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Signed-off-by: Hao Lan <lanhao@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7ed205b9
    • Peng Li's avatar
      net: hns3: extract macro to simplify ring stats update code · 47868cb7
      Peng Li authored
      [ Upstream commit e6d72f6a ]
      
      As the code to update ring stats is similar for the different ring
      stats types, this patch extracts a macro to simplify it.
      
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Stable-dep-of: 7d89b53c ("net: hns3: fix miss L3E checking for rx packet")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      47868cb7
    • Hao Chen's avatar
      net: hns3: refactor hns3_nic_reuse_page() · 7457c5a7
      Hao Chen authored
      [ Upstream commit e74a726d ]
      
      Split the rx copybreak handling out of hns3_nic_reuse_page() into a
      separate function to simplify the code.
      
      Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Stable-dep-of: 7d89b53c ("net: hns3: fix miss L3E checking for rx packet")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7457c5a7
    • Jie Wang's avatar
      net: hns3: add interrupts re-initialization while doing VF FLR · 4a6e9fb5
      Jie Wang authored
      [ Upstream commit 09e6b30e ]
      
      Currently, keep-alive messages between the PF and a VF may be lost, in
      which case the PF considers the VF not alive. The VF then does not reset
      during the PF FLR reset process. This makes the interrupt resources
      allocated to the VF invalid, and the VF can no longer receive from or
      respond to the PF.
      
      So this patch adds VF interrupt re-initialization during VF FLR so that
      the VF can recover in the above cases.
      
      Fixes: 862d969a ("net: hns3: do VF's pci re-initialization while PF doing FLR")
      Signed-off-by: Jie Wang <wangjie125@huawei.com>
      Signed-off-by: Hao Lan <lanhao@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4a6e9fb5
    • Jeff Layton's avatar
      nfsd: shut down the NFSv4 state objects before the filecache · 5e48ed80
      Jeff Layton authored
      [ Upstream commit 789e1e10 ]
      
      Currently, we shut down the filecache before trying to clean up the
      stateids that depend on it. This leads to the kernel trying to free an
      nfsd_file twice, and a refcount overput on the nf_mark.
      
      Change the shutdown procedure to tear down all of the stateids prior
      to shutting down the filecache.
      
      Reported-and-tested-by: Wang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Fixes: 5e113224 ("nfsd: nfsd_file cache entries should be per net namespace")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5e48ed80
    • Shawn Bohrer's avatar
      veth: Fix race with AF_XDP exposing old or uninitialized descriptors · 7e2825f5
      Shawn Bohrer authored
      [ Upstream commit fa349e39 ]
      
      When AF_XDP is used on a veth interface the RX ring is updated in two
      steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
      fills them, and places them in the RX ring, updating the cached_prod
      pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
      cached_prod pointer allowing user-space to see the recently filled in
      descriptors.  The rings are intended to be SPSC, however the existing
      order in veth_poll allows the xdp_do_flush() to run concurrently with
      another CPU creating a race condition that allows user-space to see old
      or uninitialized descriptors in the RX ring.  This bug has been observed
      in production systems.
      
      To summarize, we are expecting this ordering:
      
      CPU 0 __xsk_rcv_zc()
      CPU 0 __xsk_map_flush()
      CPU 2 __xsk_rcv_zc()
      CPU 2 __xsk_map_flush()
      
      But we are seeing this order:
      
      CPU 0 __xsk_rcv_zc()
      CPU 2 __xsk_rcv_zc()
      CPU 0 __xsk_map_flush()
      CPU 2 __xsk_map_flush()
      
      This occurs because we rely on NAPI to ensure that only one napi_poll
      handler is running at a time for the given veth receive queue.
      napi_schedule_prep() will prevent multiple instances from getting
      scheduled. However calling napi_complete_done() signals that this
      napi_poll is complete and allows subsequent calls to
      napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
      concurrent napi_poll before the xdp_do_flush() has been called.  For the
      veth driver a concurrent call to napi_schedule_prep() and
      __napi_schedule() can occur on a different CPU because the veth xmit
      path can additionally schedule a napi_poll creating the race.
      
      The fix, as suggested by Magnus Karlsson, is to simply move the
      xdp_do_flush() call before napi_complete_done().  This syncs the
      producer ring pointers before another instance of napi_poll can be
      scheduled on another CPU.  It will also slightly improve performance by
      moving the flush closer to when the descriptors were placed in the
      RX ring.
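
The ordering problem described above can be sketched with a toy ring. This is a simplified model, not the kernel's AF_XDP queue code: `toy_ring` and all `toy_*` functions are hypothetical names, and the model assumes that a receive first advances `cached_prod` and then writes the descriptor. Under that assumption, a flush from one poll that lands in the middle of another poll's receive publishes a slot that has not been filled yet.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8

/* Hypothetical, simplified producer ring; not the kernel's xsk queue. */
struct toy_ring {
	unsigned int cached_prod;   /* producer's private index */
	unsigned int prod;          /* index visible to the consumer */
	bool filled[RING_SIZE];     /* true once a descriptor has been written */
};

/* Step 1 of a receive: claim the next slot by advancing cached_prod. */
static unsigned int toy_reserve(struct toy_ring *r)
{
	return r->cached_prod++ % RING_SIZE;
}

/* Step 2 of a receive: write the descriptor into the claimed slot. */
static void toy_fill(struct toy_ring *r, unsigned int slot)
{
	r->filled[slot] = true;
}

/* Flush: publish everything up to cached_prod to the consumer. */
static void toy_flush(struct toy_ring *r)
{
	r->prod = r->cached_prod;
}

/* Returns true if the consumer can see a slot that was never filled. */
static bool toy_consumer_sees_garbage(const struct toy_ring *r)
{
	for (unsigned int i = 0; i < r->prod; i++)
		if (!r->filled[i % RING_SIZE])
			return true;
	return false;
}

/* Expected order: each poll finishes its receive, then flushes. */
static bool toy_serialized_order(void)
{
	struct toy_ring r = {0};
	unsigned int a = toy_reserve(&r);

	toy_fill(&r, a);
	toy_flush(&r);                    /* poll A flushes before poll B starts */

	unsigned int b = toy_reserve(&r);

	toy_fill(&r, b);
	toy_flush(&r);
	return toy_consumer_sees_garbage(&r);
}

/* Racy order: poll A's flush lands while poll B is mid-receive. */
static bool toy_racy_order(void)
{
	struct toy_ring r = {0};
	unsigned int a = toy_reserve(&r);

	toy_fill(&r, a);

	unsigned int b = toy_reserve(&r); /* poll B has claimed a slot ... */

	toy_flush(&r);                    /* ... and poll A publishes it unfilled */

	bool garbage = toy_consumer_sees_garbage(&r);

	toy_fill(&r, b);
	toy_flush(&r);
	return garbage;
}
```

In this model, flushing before the point at which a second poll may start (the role napi_complete_done() plays in the text) is what rules out the racy interleaving.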
      
      Fixes: d1396004 ("veth: Add XDP TX and REDIRECT")
      Suggested-by: Magnus Karlsson <magnus.karlsson@gmail.com>
      Signed-off-by: Shawn Bohrer <sbohrer@cloudflare.com>
      Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7e2825f5
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: honor set timeout and garbage collection updates · ac95cdaf
      Pablo Neira Ayuso authored
      [ Upstream commit 123b9961 ]
      
      Changes to the set timeout and garbage collection interval are ignored
      on set updates. Add a transaction to update the global set element
      timeout and garbage collection interval.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Suggested-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ac95cdaf
    • Ronak Doshi's avatar
      vmxnet3: correctly report csum_level for encapsulated packet · 49677ea1
      Ronak Doshi authored
      [ Upstream commit 3d8f2c42 ]
      
      Commit dacce2be ("vmxnet3: add geneve and vxlan tunnel offload
      support") added support for encapsulation offload. However, the
      patch did not correctly report the csum_level for encapsulated packets.
      
      This patch fixes this issue by reporting correct csum level for the
      encapsulated packet.
      
      Fixes: dacce2be ("vmxnet3: add geneve and vxlan tunnel offload support")
      Signed-off-by: Ronak Doshi <doshir@vmware.com>
      Acked-by: Peng Li <lpeng@vmware.com>
      Link: https://lore.kernel.org/r/20221220202556.24421-1-doshir@vmware.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      49677ea1
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: perform type checking for existing sets · 9d30cb44
      Pablo Neira Ayuso authored
      [ Upstream commit f6594c37 ]
      
      If a ruleset declares a set name that matches an existing set in the
      kernel, then validate that this declaration really refers to the same
      set, otherwise bail out with EEXIST.
      
      Currently, the kernel reports success when adding a set that already
      exists in the kernel. This usually results in EINVAL errors at a later
      stage, when the user adds elements to the set, if the set declaration
      mismatches the existing set representation in the kernel.
      
      Add a new function to check that the set declaration really refers to
      the same existing set in the kernel.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9d30cb44
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: add function to create set stateful expressions · c3bfb778
      Pablo Neira Ayuso authored
      [ Upstream commit a8fe4154 ]
      
      Add a helper function to allocate and initialize the stateful expressions
      that are defined in a set.
      
      This patch allows this code to be reused from the set update path, to
      check that the type of the update matches the existing set in the kernel.
      
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Stable-dep-of: f6594c37 ("netfilter: nf_tables: perform type checking for existing sets")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c3bfb778
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: consolidate set description · 996cd779
      Pablo Neira Ayuso authored
      [ Upstream commit bed4a63e ]
      
      Add the following fields to the set description:
      
      - key type
      - data type
      - object type
      - policy
      - gc_int: garbage collection interval
      - timeout: element timeout
      
      This prepares for stricter set type checks on updates in a follow up
      patch.
      
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Stable-dep-of: f6594c37 ("netfilter: nf_tables: perform type checking for existing sets")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      996cd779
    • Steven Price's avatar
      drm/panfrost: Fix GEM handle creation ref-counting · 4f1105ee
      Steven Price authored
      [ Upstream commit 4217c6ac ]
      
      panfrost_gem_create_with_handle() previously returned a BO but with the
      only reference being from the handle, which user space could in theory
      guess and release, causing a use-after-free. Additionally if the call to
      panfrost_gem_mapping_get() in panfrost_ioctl_create_bo() failed then
      a(nother) reference on the BO was dropped.
      
      The _create_with_handle() is a problematic pattern, so ditch it and
      instead create the handle in panfrost_ioctl_create_bo(). If the call to
      panfrost_gem_mapping_get() fails then this means that user space has
      indeed gone behind our back and freed the handle. In which case just
      return an error code.
      
      Reported-by: Rob Clark <robdclark@chromium.org>
      Fixes: f3ba9122 ("drm/panfrost: Add initial panfrost driver")
      Signed-off-by: Steven Price <steven.price@arm.com>
      Reviewed-by: Rob Clark <robdclark@gmail.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20221219140130.410578-1-steven.price@arm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4f1105ee
    • Jakub Kicinski's avatar
      bpf: pull before calling skb_postpull_rcsum() · df493f67
      Jakub Kicinski authored
      [ Upstream commit 54c3f1a8 ]
      
      Anand hit a BUG() when pulling off headers on egress to a SW tunnel.
      We get to skb_checksum_help() with an invalid checksum offset
      (commit d7ea0d9d ("net: remove two BUG() from skb_checksum_help()")
      converted those BUGs to WARN_ONs()).
      He points out oddness in how skb_postpull_rcsum() gets used.
      Indeed looks like we should pull before "postpull", otherwise
      the CHECKSUM_PARTIAL fixup from skb_postpull_rcsum() will not
      be able to do its job:
      
      	if (skb->ip_summed == CHECKSUM_PARTIAL &&
      	    skb_checksum_start_offset(skb) < 0)
      		skb->ip_summed = CHECKSUM_NONE;
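
Why the order matters can be sketched with a miniature model of the check quoted above. This is not kernel code: `toy_skb` and the `toy_*` helpers are hypothetical, and the model assumes the checksum start offset is measured relative to the current data position, so it only becomes negative after the data position has been advanced by the pull.

```c
enum toy_summed { TOY_CHECKSUM_NONE, TOY_CHECKSUM_PARTIAL };

/* Hypothetical miniature of an skb; offsets are relative to the buffer head. */
struct toy_skb {
	int data;                  /* offset of the current start of packet data */
	int csum_start;            /* offset where checksumming should start */
	enum toy_summed ip_summed;
};

/* Checksum start relative to the current data position (model assumption). */
static int toy_checksum_start_offset(const struct toy_skb *skb)
{
	return skb->csum_start - skb->data;
}

/* Models only the CHECKSUM_PARTIAL fixup quoted in the text. */
static void toy_postpull_fixup(struct toy_skb *skb)
{
	if (skb->ip_summed == TOY_CHECKSUM_PARTIAL &&
	    toy_checksum_start_offset(skb) < 0)
		skb->ip_summed = TOY_CHECKSUM_NONE;
}

/* Advance the data position past len bytes of headers. */
static void toy_pull(struct toy_skb *skb, int len)
{
	skb->data += len;
}

/* Fixed order: pull first, then run the fixup on the new data position. */
static enum toy_summed toy_pull_then_fixup(struct toy_skb skb, int len)
{
	toy_pull(&skb, len);
	toy_postpull_fixup(&skb);
	return skb.ip_summed;
}

/* Old order: the fixup sees the pre-pull data position and does nothing. */
static enum toy_summed toy_fixup_then_pull(struct toy_skb skb, int len)
{
	toy_postpull_fixup(&skb);
	toy_pull(&skb, len);
	return skb.ip_summed;
}
```

With a checksum start inside the pulled region, only the pull-first order resets the state to "none"; the other order leaves a stale partial-checksum state behind.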
      
      Reported-by: Anand Parthasarathy <anpartha@meta.com>
      Fixes: 6578171a ("bpf: add bpf_skb_change_proto helper")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20221220004701.402165-1-kuba@kernel.org
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      df493f67
    • Sasha Levin's avatar
      btrfs: fix an error handling path in btrfs_defrag_leaves() · d7e817e6
      Sasha Levin authored
      [ Upstream commit db0a4a7b ]
      
      All error handling paths lead to 'out', except this memory allocation
      failure.
      
      This is spurious. So branch to the error handling path also in this case.
      It will add a call to:
      
      	memset(&root->defrag_progress, 0,
      	       sizeof(root->defrag_progress));
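
The single-exit pattern described here can be sketched as follows. This is an illustration only, not the btrfs function: `toy_root`, `toy_progress`, `toy_defrag_leaves()` and the `fail_alloc` parameter are hypothetical, and the cleanup under `out` is reduced to the reset mentioned above.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the real structures. */
struct toy_progress {
	unsigned long objectid;
	unsigned long offset;
};

struct toy_root {
	struct toy_progress defrag_progress;
};

/*
 * fail_alloc simulates the allocation failure. Because the failure
 * branches to the shared label instead of returning directly, it also
 * gets the defrag_progress reset.
 */
static int toy_defrag_leaves(struct toy_root *root, int fail_alloc)
{
	int ret = 0;
	void *path = fail_alloc ? NULL : malloc(64);

	if (!path) {
		ret = -ENOMEM;
		goto out;       /* a bare "return -ENOMEM" would skip the reset */
	}

	/* ... the real work would go here ... */

out:
	free(path);             /* free(NULL) is a no-op */
	memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
	return ret;
}
```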
      
      Fixes: 6702ed49 ("Btrfs: Add run time btree defrag, and an ioctl to force btree defrag")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d7e817e6
    • minoura makoto's avatar
      SUNRPC: ensure the matching upcall is in-flight upon downcall · 4d69cdba
      minoura makoto authored
      [ Upstream commit b18cba09 ]
      
      Commit 9130b8db ("SUNRPC: allow for upcalls for the same uid
      but different gss service") introduced the `auth` argument to
      __gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
      since it (and auth->service) was not (yet) determined.
      
      When multiple upcalls with the same uid and different service are
      ongoing, it could happen that __gss_find_upcall(), which returns the
      first match found in the pipe->in_downcall list, could not find the
      correct gss_msg corresponding to the downcall we are looking for.
      Moreover, it might return a msg which is not sent to rpc.gssd yet.
      
      We could see a mount.nfs process hung in D state when multiple mount.nfs
      processes are executed in parallel.  The call trace below is from CentOS 7.9
      kernel-3.10.0-1160.24.1.el7.x86_64 but we observed the same hang w/
      elrepo kernel-ml-6.0.7-1.el7.
      
      PID: 71258  TASK: ffff91ebd4be0000  CPU: 36  COMMAND: "mount.nfs"
       #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
       #1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
       #2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
       #3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc [sunrpc]
       #4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
       #5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
       #6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
       #7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
       #8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
       #9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]
      
      The scenario is like this. Let's say there are two upcalls for
      services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.
      
      When rpc.gssd reads pipe to get the upcall msg corresponding to
      service B from pipe->pipe and then writes the response, in
      gss_pipe_downcall the msg corresponding to service A will be picked
      because only uid is used to find the msg and it is before the one for
      B in pipe->in_downcall.  And the process waiting for the msg
      corresponding to service A will be woken up.
      
      Actual scheduling of that process might be after rpc.gssd processes the
      next msg.  In rpc_pipe_generic_upcall it clears msg->errno (for A).
      The process is scheduled to see gss_msg->ctx == NULL and
      gss_msg->msg.errno == 0, therefore it cannot break the loop in
      gss_create_upcall and is never woken up after that.
      
      This patch adds a simple check to ensure that a msg which is not
      sent to rpc.gssd yet is not chosen as the matching upcall upon
      receiving a downcall.
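
The A/B scenario above can be reproduced in a small model. This is a sketch, not the SUNRPC code: `toy_upcall`, its `sent_to_daemon` flag and both lookup helpers are hypothetical, and the list stands in for pipe->in_downcall. It contrasts a uid-only lookup with one that also skips messages the daemon has not read yet.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified upcall message; not the kernel's structure. */
struct toy_upcall {
	unsigned int uid;
	char service;            /* 'A' or 'B' as in the scenario above */
	bool sent_to_daemon;     /* true once the daemon has read the message */
	struct toy_upcall *next; /* order in the in_downcall list */
};

/* Uid-only lookup: the first entry with a matching uid wins. */
static struct toy_upcall *toy_find_by_uid(struct toy_upcall *head,
					  unsigned int uid)
{
	for (struct toy_upcall *m = head; m; m = m->next)
		if (m->uid == uid)
			return m;
	return NULL;
}

/* Lookup with the extra check: skip entries the daemon has not read yet. */
static struct toy_upcall *toy_find_in_flight(struct toy_upcall *head,
					     unsigned int uid)
{
	for (struct toy_upcall *m = head; m; m = m->next) {
		if (m->uid != uid)
			continue;
		if (!m->sent_to_daemon)
			continue;
		return m;
	}
	return NULL;
}
```

With the list ordered A then B, and only B read by the daemon, the uid-only lookup returns A's message while the second lookup returns B's.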
      
      Signed-off-by: minoura makoto <minoura@valinux.co.jp>
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
      Tested-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
      Cc: Trond Myklebust <trondmy@hammerspace.com>
      Fixes: 9130b8db ("SUNRPC: allow for upcalls for same uid but different gss service")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4d69cdba