  1. Jul 17, 2023
    • RDMA/core: Update CMA destination address on rdma_resolve_addr · 0e158630
      Shiraz Saleem authored
      Commit 8d037973 ("RDMA/core: Refactor rdma_bind_addr") introduces a
      regression on irdma devices in certain tests that use the RDMA CM,
      such as cmtime.
      
      No connections can be established, and the MAD QP experiences a fatal
      error on the active side.
      
      The CMA destination address is not updated with the dst_addr when the ULP
      on the active side calls rdma_bind_addr followed by rdma_resolve_addr.
      The id_priv state is 'bound' in resolve_prepare_src, so the update is
      skipped.
      
      This leaves the dgid passed into the irdma driver to create an Address
      Handle (AH) for the MAD QP at 0. The create AH descriptor as well as the
      ARP cache entry are invalid, and the HW throws an asynchronous event as
      a result.
      
      [ 1207.656888] resolve_prepare_src caller: ucma_resolve_addr+0xff/0x170 [rdma_ucm] daddr=200.0.4.28 id_priv->state=7
      [....]
      [ 1207.680362] ice 0000:07:00.1 rocep7s0f1: caller: irdma_create_ah+0x3e/0x70 [irdma] ah_id=0 arp_idx=0 dest_ip=0.0.0.0
      destMAC=00:00:64:ca:b7:52 ipvalid=1 raw=0000:0000:0000:0000:0000:ffff:0000:0000
      [ 1207.682077] ice 0000:07:00.1 rocep7s0f1: abnormal ae_id = 0x401 bool qp=1 qp_id = 1, ae_src=5
      [ 1207.691657] infiniband rocep7s0f1: Fatal error (1) on MAD QP (1)
      
      Fix this by updating the CMA destination address when the ULP calls
      rdma_resolve_addr with the CM state already bound.
      
      Fixes: 8d037973 ("RDMA/core: Refactor rdma_bind_addr")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230712234133.1343-1-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
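The bind-then-resolve sequence from the "RDMA/core: Update CMA destination address on rdma_resolve_addr" entry can be illustrated with a small userspace model. This is a sketch only, not the kernel patch: every `toy_*` name, the state enum, and the `update_when_bound` switch are hypothetical and exist purely to contrast the regressed and fixed behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the CM id and its state machine. */
enum toy_cm_state { TOY_IDLE, TOY_ADDR_BOUND, TOY_ADDR_QUERY };

struct toy_cm_id {
	enum toy_cm_state state;
	uint32_t dst_ip;	/* 0 models an unset destination (0.0.0.0) */
};

/* Models a bind call: the id becomes bound, no destination is known yet. */
static void toy_bind(struct toy_cm_id *id)
{
	assert(id->state == TOY_IDLE);
	id->state = TOY_ADDR_BOUND;
}

/*
 * Models the resolve path. update_when_bound == false reproduces the
 * regression (destination left at 0), true reproduces the fix.
 */
static void toy_resolve(struct toy_cm_id *id, uint32_t dst_ip,
			bool update_when_bound)
{
	if (id->state == TOY_ADDR_BOUND) {
		/* Already bound: the bind step is skipped, so the
		 * destination has to be stored explicitly. */
		if (update_when_bound)
			id->dst_ip = dst_ip;
	} else {
		/* Idle id: the destination is recorded while binding. */
		id->dst_ip = dst_ip;
	}
	id->state = TOY_ADDR_QUERY;
}
```

With `update_when_bound` set to false the model ends up with a zero destination, which corresponds to the `dest_ip=0.0.0.0` seen in the irdma_create_ah log line.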
    • RDMA/irdma: Fix data race on CQP request done · f0842bb3
      Shiraz Saleem authored
      KCSAN detects a data race on cqp_request->request_done memory location
      which is accessed locklessly in irdma_handle_cqp_op while being
      updated in irdma_cqp_ce_handler.
      
      Annotate the lockless intent with READ_ONCE/WRITE_ONCE to prevent
      compiler optimizations such as load fusing and to silence the KCSAN
      warning.
      
      [222808.417128] BUG: KCSAN: data-race in irdma_cqp_ce_handler [irdma] / irdma_wait_event [irdma]
      
      [222808.417532] write to 0xffff8e44107019dc of 1 bytes by task 29658 on cpu 5:
      [222808.417610]  irdma_cqp_ce_handler+0x21e/0x270 [irdma]
      [222808.417725]  cqp_compl_worker+0x1b/0x20 [irdma]
      [222808.417827]  process_one_work+0x4d1/0xa40
      [222808.417835]  worker_thread+0x319/0x700
      [222808.417842]  kthread+0x180/0x1b0
      [222808.417852]  ret_from_fork+0x22/0x30
      
      [222808.417918] read to 0xffff8e44107019dc of 1 bytes by task 29688 on cpu 1:
      [222808.417995]  irdma_wait_event+0x1e2/0x2c0 [irdma]
      [222808.418099]  irdma_handle_cqp_op+0xae/0x170 [irdma]
      [222808.418202]  irdma_cqp_cq_destroy_cmd+0x70/0x90 [irdma]
      [222808.418308]  irdma_puda_dele_rsrc+0x46d/0x4d0 [irdma]
      [222808.418411]  irdma_rt_deinit_hw+0x179/0x1d0 [irdma]
      [222808.418514]  irdma_ib_dealloc_device+0x11/0x40 [irdma]
      [222808.418618]  ib_dealloc_device+0x2a/0x120 [ib_core]
      [222808.418823]  __ib_unregister_device+0xde/0x100 [ib_core]
      [222808.418981]  ib_unregister_device+0x22/0x40 [ib_core]
      [222808.419142]  irdma_ib_unregister_device+0x70/0x90 [irdma]
      [222808.419248]  i40iw_close+0x6f/0xc0 [irdma]
      [222808.419352]  i40e_client_device_unregister+0x14a/0x180 [i40e]
      [222808.419450]  i40iw_remove+0x21/0x30 [irdma]
      [222808.419554]  auxiliary_bus_remove+0x31/0x50
      [222808.419563]  device_remove+0x69/0xb0
      [222808.419572]  device_release_driver_internal+0x293/0x360
      [222808.419582]  driver_detach+0x7c/0xf0
      [222808.419592]  bus_remove_driver+0x8c/0x150
      [222808.419600]  driver_unregister+0x45/0x70
      [222808.419610]  auxiliary_driver_unregister+0x16/0x30
      [222808.419618]  irdma_exit_module+0x18/0x1e [irdma]
      [222808.419733]  __do_sys_delete_module.constprop.0+0x1e2/0x310
      [222808.419745]  __x64_sys_delete_module+0x1b/0x30
      [222808.419755]  do_syscall_64+0x39/0x90
      [222808.419763]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      [222808.419829] value changed: 0x01 -> 0x03
      
      Fixes: 915cc7ac ("RDMA/irdma: Add miscellaneous utility definitions")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230711175253.1289-4-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
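The READ_ONCE/WRITE_ONCE annotation from the "Fix data race on CQP request done" entry can be sketched in userspace C. The `TOY_READ_ONCE` and `TOY_WRITE_ONCE` macros below are simplified stand-ins built on a volatile cast (using the GCC/Clang `__typeof__` extension); the kernel's real macros are more elaborate, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins: force exactly one access per use, so the
 * compiler cannot fuse, hoist, or drop the load or store. */
#define TOY_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct toy_cqp_request {
	bool request_done;	/* written by the completion side */
};

/* Completion side: publish that the request has finished. */
static void toy_complete(struct toy_cqp_request *req)
{
	TOY_WRITE_ONCE(req->request_done, true);
}

/* Waiter side: poll the flag a bounded number of times. Without the
 * annotated load, a compiler could read the flag once before the loop
 * and never look at memory again. */
static bool toy_wait_done(struct toy_cqp_request *req, unsigned int max_polls)
{
	assert(req);
	while (max_polls--) {
		if (TOY_READ_ONCE(req->request_done))
			return true;
	}
	return false;
}
```

The annotations only constrain the compiler and document the lockless access; they do not add ordering against other memory locations.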
    • RDMA/irdma: Fix data race on CQP completion stats · f2c30378
      Shiraz Saleem authored
      CQP completion statistics are read locklessly in irdma_wait_event and
      irdma_check_cqp_progress while they can be updated by the completion
      thread in irdma_sc_ccq_get_cqe_info on another CPU, as KCSAN reports.
      
      Make the completion statistics an atomic variable so that updates to
      it are coherent. This also avoids logic bugs from load/store tearing
      that compiler optimizations could otherwise introduce.
      
      [77346.170861] BUG: KCSAN: data-race in irdma_handle_cqp_op [irdma] / irdma_sc_ccq_get_cqe_info [irdma]
      
      [77346.171383] write to 0xffff8a3250b108e0 of 8 bytes by task 9544 on cpu 4:
      [77346.171483]  irdma_sc_ccq_get_cqe_info+0x27a/0x370 [irdma]
      [77346.171658]  irdma_cqp_ce_handler+0x164/0x270 [irdma]
      [77346.171835]  cqp_compl_worker+0x1b/0x20 [irdma]
      [77346.172009]  process_one_work+0x4d1/0xa40
      [77346.172024]  worker_thread+0x319/0x700
      [77346.172037]  kthread+0x180/0x1b0
      [77346.172054]  ret_from_fork+0x22/0x30
      
      [77346.172136] read to 0xffff8a3250b108e0 of 8 bytes by task 9838 on cpu 2:
      [77346.172234]  irdma_handle_cqp_op+0xf4/0x4b0 [irdma]
      [77346.172413]  irdma_cqp_aeq_cmd+0x75/0xa0 [irdma]
      [77346.172592]  irdma_create_aeq+0x390/0x45a [irdma]
      [77346.172769]  irdma_rt_init_hw.cold+0x212/0x85d [irdma]
      [77346.172944]  irdma_probe+0x54f/0x620 [irdma]
      [77346.173122]  auxiliary_bus_probe+0x66/0xa0
      [77346.173137]  really_probe+0x140/0x540
      [77346.173154]  __driver_probe_device+0xc7/0x220
      [77346.173173]  driver_probe_device+0x5f/0x140
      [77346.173190]  __driver_attach+0xf0/0x2c0
      [77346.173208]  bus_for_each_dev+0xa8/0xf0
      [77346.173225]  driver_attach+0x29/0x30
      [77346.173240]  bus_add_driver+0x29c/0x2f0
      [77346.173255]  driver_register+0x10f/0x1a0
      [77346.173272]  __auxiliary_driver_register+0xbc/0x140
      [77346.173287]  irdma_init_module+0x55/0x1000 [irdma]
      [77346.173460]  do_one_initcall+0x7d/0x410
      [77346.173475]  do_init_module+0x81/0x2c0
      [77346.173491]  load_module+0x1232/0x12c0
      [77346.173506]  __do_sys_finit_module+0x101/0x180
      [77346.173522]  __x64_sys_finit_module+0x3c/0x50
      [77346.173538]  do_syscall_64+0x39/0x90
      [77346.173553]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      [77346.173634] value changed: 0x0000000000000094 -> 0x0000000000000095
      
      Fixes: 915cc7ac ("RDMA/irdma: Add miscellaneous utility definitions")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230711175253.1289-3-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
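The idea of turning a locklessly shared counter into an atomic variable, as in the "Fix data race on CQP completion stats" entry, can be shown with C11 atomics as a userspace analogue. This is a sketch under assumptions: the struct, the field name `completions`, and the `toy_*` helpers are hypothetical, and relaxed ordering is chosen here only because the counter is used as a progress statistic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct toy_cqp_stats {
	_Atomic uint64_t completions;	/* hypothetical field name */
};

/* Completion thread: one indivisible increment, so a concurrent
 * reader can never observe a torn 64-bit value. */
static void toy_note_completion(struct toy_cqp_stats *s)
{
	assert(s);
	atomic_fetch_add_explicit(&s->completions, 1, memory_order_relaxed);
}

/* Waiter: one indivisible load of the current count. */
static uint64_t toy_completions(struct toy_cqp_stats *s)
{
	return atomic_load_explicit(&s->completions, memory_order_relaxed);
}
```

A waiter can compare two successive `toy_completions()` readings to decide whether progress was made, which is the kind of check the commit message attributes to irdma_check_cqp_progress.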
    • RDMA/irdma: Add missing read barriers · 4984eb51
      Shiraz Saleem authored
      On code inspection, there are many instances in the driver where
      CEQE and AEQE fields written to by HW are read without guaranteeing
      that the polarity bit has been read and checked first.
      
      Add a read barrier to avoid reordering of loads on the CEQE/AEQE fields
      prior to checking the polarity bit.
      
      Fixes: 3f49d684 ("RDMA/irdma: Implement HW Admin Queue OPs")
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Link: https://lore.kernel.org/r/20230711175253.1289-2-shiraz.saleem@intel.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
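The ordering requirement from the "Add missing read barriers" entry (check the polarity bit first, then read the other fields) can be sketched in userspace C, with a C11 acquire fence standing in for a kernel read barrier such as dma_rmb(). The entry layout, the bit position in `TOY_VALID_BIT`, and all `toy_*` names are assumptions made for illustration and do not reflect the real CEQE/AEQE format.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_VALID_BIT	(UINT64_C(1) << 63)	/* assumed polarity bit position */

struct toy_ceqe {
	_Atomic uint64_t hdr;	/* written last by the producer; holds the polarity bit */
	uint64_t payload;	/* only meaningful once the polarity bit matches */
};

/* Returns true and fills *out if the entry is valid for the expected
 * polarity, false if the producer has not written it yet. */
static bool toy_ceqe_poll(struct toy_ceqe *e, bool polarity, uint64_t *out)
{
	uint64_t hdr = atomic_load_explicit(&e->hdr, memory_order_relaxed);

	assert(out);
	if (!!(hdr & TOY_VALID_BIT) != polarity)
		return false;

	/* Read barrier: the payload load below must not be reordered
	 * before the polarity check above. */
	atomic_thread_fence(memory_order_acquire);

	*out = e->payload;
	return true;
}
```

Without the fence, the compiler or CPU may load `payload` before `hdr`, so a consumer could pair a fresh polarity bit with stale payload contents.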
  2. Jul 12, 2023
  3. Jul 10, 2023
  4. Jul 09, 2023