  1. Aug 31, 2020
  2. Aug 27, 2020
  3. Aug 25, 2020
  4. Aug 24, 2020
    • IB/mlx4: Adjust delayed work when a dup is observed · 785167a1
      Håkon Bugge authored
      When scheduling delayed work to clean up the cache, if the entry has
      already been scheduled for deletion, we adjust the delay.
      
      Fixes: 3cf69cc8 ("IB/mlx4: Add CM paravirtualization")
      Link: https://lore.kernel.org/r/20200803061941.1139994-7-haakon.bugge@oracle.com
      
      
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/mlx4: Add support for REJ due to timeout · 227a0e14
      Håkon Bugge authored
      A CM REJ packet with its reason equal to timeout is a special beast in the
      sense that it doesn't have a Remote Communication ID nor does it have a
      Remote Port GID.
      
      Using CX-3 virtual functions, either from a bare-metal machine or
      pass-through from a VM, MAD packets are proxied through the PF driver.
      
      Since the VF drivers have separate name spaces for MAD Transaction Ids
      (TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
      in a cache.
      
      This proxying does not handle said REJ packets.
      
      If the active side abandons its connection attempt after having sent a
      REQ, it will send a REJ with the reason being timeout. This example can be
      provoked by a simple user-verbs program, which ends up doing:
      
          rdma_connect(cm_id, &conn_param);
          rdma_destroy_id(cm_id);
      
      using the async librdmacm API.
      
      Having dynamic debug prints enabled in the mlx4_ib driver, we will then
      see:
      
      mlx4_ib_demux_cm_handler: Couldn't find an entry for pv_cm_id 0x0, attr_id 0x12
      
      The solution is to introduce a radix-tree. When a REQ packet is received
      and handled in mlx4_ib_demux_cm_handler(), we know the connecting peer's
      para-virtual cm_id and the destination slave. We then insert an entry into
      the tree with said information. We also schedule work to remove this entry
      from the tree and free it, in order to avoid a memory leak.
      
      When a REJ packet with reason timeout is received, we can look up the
      slave in the tree, and deliver the packet to the correct slave.
      
      When a duplicate REQ packet is received, the entry is already in the tree.
      In this case, we adjust the delayed work in order to avoid a premature
      eviction of the entry.
      
      When cleaning up, we simply traverse the tree and modify any delayed work
      to use a zero delay. A subsequent flush of the system_wq ensures that all
      entries are wiped out.
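
The lookup scheme described above can be sketched in user space as follows. This is an illustrative model only, not the driver code: a fixed-size array stands in for the radix-tree, an explicit expiry timestamp stands in for the delayed work, and every name (rej_tmo_table, rej_tmo_on_req(), and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REJ_TMO_SLOTS 16      /* hypothetical capacity of this sketch */
#define REJ_TMO_NONE  (-1)    /* returned when no entry matches */

struct rej_tmo_entry {
    int      in_use;
    uint32_t rem_pv_cm_id;    /* peer's para-virtual cm_id, taken from the REQ */
    int      slave;           /* destination slave of the REQ */
    uint64_t expires;         /* stand-in for the delayed work's deadline */
};

/* Must be zero-initialized before use. */
struct rej_tmo_table {
    struct rej_tmo_entry slot[REJ_TMO_SLOTS];
};

static struct rej_tmo_entry *rej_tmo_find(struct rej_tmo_table *t, uint32_t id)
{
    for (size_t i = 0; i < REJ_TMO_SLOTS; i++)
        if (t->slot[i].in_use && t->slot[i].rem_pv_cm_id == id)
            return &t->slot[i];
    return NULL;
}

/*
 * A REQ was demuxed: remember which slave it went to. A duplicate REQ
 * only pushes the deadline out, so the entry is not evicted too early.
 * Returns 0 on success, -1 if the table is full.
 */
int rej_tmo_on_req(struct rej_tmo_table *t, uint32_t id, int slave,
                   uint64_t now, uint64_t timeout)
{
    struct rej_tmo_entry *e = rej_tmo_find(t, id);

    assert(slave >= 0);
    if (e) {                          /* duplicate REQ: adjust the delay */
        e->expires = now + timeout;
        return 0;
    }
    for (size_t i = 0; i < REJ_TMO_SLOTS; i++) {
        if (!t->slot[i].in_use) {
            t->slot[i] = (struct rej_tmo_entry){ 1, id, slave, now + timeout };
            return 0;
        }
    }
    return -1;
}

/* A REJ with reason timeout arrived: find the slave to deliver it to. */
int rej_tmo_lookup_slave(struct rej_tmo_table *t, uint32_t id)
{
    struct rej_tmo_entry *e = rej_tmo_find(t, id);

    return e ? e->slave : REJ_TMO_NONE;
}

/* Stand-in for the delayed work firing: evict entries whose deadline passed. */
int rej_tmo_expire(struct rej_tmo_table *t, uint64_t now)
{
    int evicted = 0;

    for (size_t i = 0; i < REJ_TMO_SLOTS; i++) {
        if (t->slot[i].in_use && now >= t->slot[i].expires) {
            t->slot[i].in_use = 0;
            evicted++;
        }
    }
    return evicted;
}
```

In this model, cleanup corresponds to calling rej_tmo_expire() with a time past every deadline, which mirrors setting each delayed work to a zero delay and then flushing.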
      
      Fixes: 3cf69cc8 ("IB/mlx4: Add CM paravirtualization")
      Link: https://lore.kernel.org/r/20200803061941.1139994-6-haakon.bugge@oracle.com
      
      
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/mlx4: Fix starvation in paravirt mux/demux · 7fd1507d
      Håkon Bugge authored
      The mlx4 driver will proxy MAD packets through the PF driver. A VM or an
      instantiated VF will send its MAD packets to the PF driver using
      loop-back. The PF driver will be informed by an interrupt, but defer the
      handling and polling of CQEs to a worker thread running on an ordered
      work-queue.
      
      Consider the following scenario: within a short period of time, for
      example due to a network event, the VMs send many MAD packets to the PF
      driver. Let's say there are K VMs, each sending N packets.
      
      The interrupt from the first VM will start the worker thread, which will
      poll N CQEs. A common case here is where the PF driver will multiplex the
      packets received from the VMs out on the wire QP.
      
      But before the wire QP has returned a send CQE and associated interrupt,
      the other K - 1 VMs have sent their N packets as well.
      
      The PF driver has to multiplex K * N packets out on the wire QP. But the
      send-queue on the wire QP has a finite capacity.
      
      So, in this scenario, if K * N is larger than the send-queue capacity of
      the wire QP, we will get MAD packets dropped on the floor with this
      dynamic debug message:
      
      mlx4_ib_multiplex_mad: failed sending GSI to wire on behalf of slave 2 (-11)
      
      and this despite the fact that the wire send-queue could have capacity,
      but the PF driver isn't aware, because the wire send CQEs have not yet
      been polled.
      
      We can also have a similar scenario inbound, with a wire recv-queue larger
      than the tunnel QP's send-queue. If many remote peers send MAD packets to
      the very same VM, the PF driver could wrongly consider the tunnel
      send-queue destined for that VM to be full.
      
      This starvation is fixed by introducing separate work queues for the wire
      QPs vs. the tunnel QPs.
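
The outbound scenario can be illustrated with a small user-space model. It is a sketch under simplifying assumptions, not the driver logic: the hardware is assumed to complete every posted send immediately, a send-queue slot only becomes reusable once the driver has polled the wire send CQEs, and with separate work queues that polling is assumed to happen between the tunnel work items of successive VMs. The function name simulate_mux() and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Model K VMs each sending N MAD packets that the PF driver multiplexes
 * onto a wire QP whose send-queue holds sq_capacity entries.
 *
 * separate_wq == 0: one ordered work-queue. All K tunnel work items run
 *                   before the wire send CQEs are polled, so no slot is
 *                   reclaimed during the burst.
 * separate_wq != 0: the wire QP has its own work-queue. In this model its
 *                   worker reclaims all completed slots between tunnel
 *                   work items.
 *
 * Returns the number of packets that could not be posted (the driver
 * logs these failures with error -11).
 */
unsigned int simulate_mux(unsigned int k_vms, unsigned int n_pkts,
                          unsigned int sq_capacity, int separate_wq)
{
    unsigned int in_flight = 0;   /* slots the driver believes are busy */
    unsigned int dropped = 0;

    assert(sq_capacity > 0);

    for (unsigned int vm = 0; vm < k_vms; vm++) {
        for (unsigned int pkt = 0; pkt < n_pkts; pkt++) {
            if (in_flight < sq_capacity)
                in_flight++;      /* posted to the wire send-queue */
            else
                dropped++;        /* queue looks full to the driver */
        }
        if (separate_wq)
            in_flight = 0;        /* wire worker polled the send CQEs */
    }
    return dropped;
}
```

With K = 8, N = 100 and a capacity of 512, the single-queue model drops 8 * 100 - 512 = 288 packets, while the separate-queue model drops none, because the queue never holds more than N entries at a time.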
      
      With this fix, using a dual-ported HCA with 8 VFs instantiated, we could run
      cmtime on each of the 18 interfaces towards a similarly configured peer,
      each cmtime instance with 800 QPs (all in all 14400 QPs) without a single
      CM packet getting lost.
      
      Fixes: 3cf69cc8 ("IB/mlx4: Add CM paravirtualization")
      Link: https://lore.kernel.org/r/20200803061941.1139994-5-haakon.bugge@oracle.com
      
      
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/mlx4: Separate tunnel and wire bufs parameters · 0ae207fb
      Håkon Bugge authored
      Using CX-3 in virtualized mode, MAD packets are proxied through the PF
      driver. The feed is N tunnel QPs, and what is received from the VFs is
      multiplexed out on the wire QP. Since this is a many-to-one scenario, it
      is better to have separate initialization parameters for the two usages.
      
      The numbers of wire and tunnel bufs are raised to 2K and 512
      respectively. With this set of parameters, a system consisting of eight
      physical servers, each with eight VMs and 14 I/O servers (BM), can run
      switch fail-over without seeing:
      
      mlx4_ib_demux_mad: failed sending GSI to slave 3 via tunnel qp (-11)
      
      or
      
      mlx4_ib_multiplex_mad: failed sending GSI to wire on behalf of slave 2 (-11)
      
      Fixes: 3cf69cc8 ("IB/mlx4: Add CM paravirtualization")
      Link: https://lore.kernel.org/r/20200803061941.1139994-4-haakon.bugge@oracle.com
      
      
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/mlx4: Add support for MRA · e7d087fc
      Håkon Bugge authored
      Using CX-3 in virtualized mode, MAD packets are proxied through the PF
      driver. However, the handling lacks support for the MRA (Message Receipt
      Acknowledgment) packet. With dynamic debug enabled, we see tons of:
      
      mlx4_ib_multiplex_cm_handler: id{slave: 7, sl_cm_id: 0x8fcb45a0} is NULL! attr_id: 0x11
      
      Fixes: 3cf69cc8 ("IB/mlx4: Add CM paravirtualization")
      Link: https://lore.kernel.org/r/20200803061941.1139994-3-haakon.bugge@oracle.com
      
      
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/mlx4: Add and improve logging · 09461944
      Håkon Bugge authored
      Add missing check for success after call to mlx4_ib_send_to_wire() in
      mlx4_ib_multiplex_mad().
      
      Amend the existing pr_debug() in mlx4_ib_multiplex_cm_handler() and
      mlx4_ib_demux_cm_handler() to include attr_id on a lookup failure.
      
      Remove two noisy pr_debug() calls in mad.c.
      
      Fixes: 3cf69cc8 ("IB/mlx4: Add CM paravirtualization")
      Link: https://lore.kernel.org/r/20200803061941.1139994-2-haakon.bugge@oracle.com
      
      
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  5. Aug 19, 2020