Skip to content
  1. Dec 08, 2023
  2. Dec 05, 2023
    • Bitao Hu's avatar
      nvme: fix deadlock between reset and scan · 839a40d1
      Bitao Hu authored
      
      
      If controller reset occurs when allocating namespace, both
      nvme_reset_work and nvme_scan_work will hang, as shown below.
      
      Test Scripts:
      
          for ((t=1;t<=128;t++))
          do
          nsid=`nvme create-ns /dev/nvme1 -s 14537724 -c 14537724 -f 0 -m 0 \
          -d 0 | awk -F: '{print($NF);}'`
          nvme attach-ns /dev/nvme1 -n $nsid -c 0
          done
          nvme reset /dev/nvme1
      
      We will find that both nvme_reset_work and nvme_scan_work hung:
      
          INFO: task kworker/u249:4:17848 blocked for more than 120 seconds.
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
          message.
          task:kworker/u249:4  state:D stack:    0 pid:17848 ppid:     2
          flags:0x00000028
          Workqueue: nvme-reset-wq nvme_reset_work [nvme]
          Call trace:
          __switch_to+0xb4/0xfc
          __schedule+0x22c/0x670
          schedule+0x4c/0xd0
          blk_mq_freeze_queue_wait+0x84/0xc0
          nvme_wait_freeze+0x40/0x64 [nvme_core]
          nvme_reset_work+0x1c0/0x5cc [nvme]
          process_one_work+0x1d8/0x4b0
          worker_thread+0x230/0x440
          kthread+0x114/0x120
          INFO: task kworker/u249:3:22404 blocked for more than 120 seconds.
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
          message.
          task:kworker/u249:3  state:D stack:    0 pid:22404 ppid:     2
          flags:0x00000028
          Workqueue: nvme-wq nvme_scan_work [nvme_core]
          Call trace:
          __switch_to+0xb4/0xfc
          __schedule+0x22c/0x670
          schedule+0x4c/0xd0
          rwsem_down_write_slowpath+0x32c/0x98c
          down_write+0x70/0x80
          nvme_alloc_ns+0x1ac/0x38c [nvme_core]
          nvme_validate_or_alloc_ns+0xbc/0x150 [nvme_core]
          nvme_scan_ns_list+0xe8/0x2e4 [nvme_core]
          nvme_scan_work+0x60/0x500 [nvme_core]
          process_one_work+0x1d8/0x4b0
          worker_thread+0x260/0x440
          kthread+0x114/0x120
          INFO: task nvme:28428 blocked for more than 120 seconds.
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
          message.
          task:nvme            state:D stack:    0 pid:28428 ppid: 27119
          flags:0x00000000
          Call trace:
          __switch_to+0xb4/0xfc
          __schedule+0x22c/0x670
          schedule+0x4c/0xd0
          schedule_timeout+0x160/0x194
          do_wait_for_common+0xac/0x1d0
          __wait_for_common+0x78/0x100
          wait_for_completion+0x24/0x30
          __flush_work.isra.0+0x74/0x90
          flush_work+0x14/0x20
          nvme_reset_ctrl_sync+0x50/0x74 [nvme_core]
          nvme_dev_ioctl+0x1b0/0x250 [nvme_core]
          __arm64_sys_ioctl+0xa8/0xf0
          el0_svc_common+0x88/0x234
          do_el0_svc+0x7c/0x90
          el0_svc+0x1c/0x30
          el0_sync_handler+0xa8/0xb0
          el0_sync+0x148/0x180
      
      The reason for the hang is that nvme_reset_work occurs while nvme_scan_work
      is still running. nvme_scan_work may add new ns into ctrl->namespaces
      list after nvme_reset_work frozen all ns->q in ctrl->namespaces list.
      The newly added ns is not frozen, so nvme_wait_freeze will wait forever.
      Unfortunately, ctrl->namespaces_rwsem is held by nvme_reset_work, so
      nvme_scan_work will also wait forever. Now we are deadlocked!
      
      PROCESS1                         PROCESS2
      ==============                   ==============
      nvme_scan_work
        ...                            nvme_reset_work
        nvme_validate_or_alloc_ns        nvme_dev_disable
          nvme_alloc_ns                    nvme_start_freeze
           down_write                      ...
           nvme_ns_add_to_ctrl_list        ...
           up_write                      nvme_wait_freeze
          ...                              down_read
          nvme_alloc_ns                    blk_mq_freeze_queue_wait
           down_write
      
      Fix by marking the ctrl with say NVME_CTRL_FROZEN flag set in
      nvme_start_freeze and cleared in nvme_unfreeze. Then the scan can check
      it before adding the new namespace (under the namespaces_rwsem).
      
      Signed-off-by: default avatarBitao Hu <yaoma@linux.alibaba.com>
      Reviewed-by: default avatarGuixin Liu <kanie@linux.alibaba.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      839a40d1
    • Nitesh Shetty's avatar
      nvme: prevent potential spectre v1 gadget · 20dc66f2
      Nitesh Shetty authored
      
      
      This patch fixes the smatch warning, "nvmet_ns_ana_grpid_store() warn:
      potential spectre issue 'nvmet_ana_group_enabled' [w] (local cap)"
      Prevent the contents of kernel memory from being leaked to  user space
      via speculative execution by using array_index_nospec.
      
      Signed-off-by: default avatarNitesh Shetty <nj.shetty@samsung.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      20dc66f2
    • Shin'ichiro Kawasaki's avatar
      nvme: improve NVME_HOST_AUTH and NVME_TARGET_AUTH config descriptions · 29ac4b2f
      Shin'ichiro Kawasaki authored
      
      
      Currently two similar config options NVME_HOST_AUTH and NVME_TARGET_AUTH
      have almost same descriptions. It is confusing to choose them in
      menuconfig. Improve the descriptions to distinguish them.
      
      Signed-off-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      29ac4b2f
    • Keith Busch's avatar
      nvme-ioctl: move capable() admin check to the end · 7be866b1
      Keith Busch authored
      
      
      This can be an expensive call on some kernel configs. Move it to the end
      after checking the cheaper ways to determine if the command is allowed.
      
      Reviewed-by: default avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      7be866b1
    • Keith Busch's avatar
      nvme: ensure reset state check ordering · e6e7f7ac
      Keith Busch authored
      
      
      A different CPU may be setting the ctrl->state value, so ensure proper
      barriers to prevent optimizing to a stale state. Normally it isn't a
      problem to observe the wrong state as it is merely advisory to take a
      quicker path during initialization and error recovery, but seeing an old
      state can report unexpected ENETRESET errors when a reset request was in
      fact successful.
      
      Reported-by: default avatarMinh Hoang <mh2022@meta.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarHannes Reinecke <hare@suse.de>
      e6e7f7ac
    • Keith Busch's avatar
      nvme: introduce helper function to get ctrl state · 5c687c28
      Keith Busch authored
      
      
      The controller state is typically written by another CPU, so reading it
      should ensure no optimizations are taken. This is a repeated pattern in
      the driver, so start with adding a convenience function that returns the
      controller state with READ_ONCE().
      
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      5c687c28
  3. Dec 02, 2023
  4. Dec 01, 2023
  5. Nov 30, 2023
  6. Nov 29, 2023
  7. Nov 28, 2023
    • Ewan D. Milne's avatar
      nvme: check for valid nvme_identify_ns() before using it · d8b90d60
      Ewan D. Milne authored
      When scanning namespaces, it is possible to get valid data from the first
      call to nvme_identify_ns() in nvme_alloc_ns(), but not from the second
      call in nvme_update_ns_info_block().  In particular, if the NSID becomes
      inactive between the two commands, a storage device may return a buffer
      filled with zero as per 4.1.5.1.  In this case, we can get a kernel crash
      due to a divide-by-zero in blk_stack_limits() because ns->lba_shift will
      be set to zero.
      
      PID: 326      TASK: ffff95fec3cd8000  CPU: 29   COMMAND: "kworker/u98:10"
       #0 [ffffad8f8702f9e0] machine_kexec at ffffffff91c76ec7
       #1 [ffffad8f8702fa38] __crash_kexec at ffffffff91dea4fa
       #2 [ffffad8f8702faf8] crash_kexec at ffffffff91deb788
       #3 [ffffad8f8702fb00] oops_end at ffffffff91c2e4bb
       #4 [ffffad8f8702fb20] do_trap at ffffffff91c2a4ce
       #5 [ffffad8f8702fb70] do_error_trap at ffffffff91c2a595
       #6 [ffffad8f8702fbb0] exc_divide_error at ffffffff928506e6
       #7 [ffffad8f8702fbd0] asm_exc_divide_error at ffffffff92a00926
          [exception RIP: blk_stack_limits+434]
          RIP: ffffffff92191872  RSP: ffffad8f8702fc80  RFLAGS: 00010246
          RAX: 0000000000000000  RBX: ffff95efa0c91800  RCX: 0000000000000001
          RDX: 0000000000000000  RSI: 0000000000000001  RDI: 0000000000000001
          RBP: 00000000ffffffff   R8: ffff95fec7df35a8   R9: 0000000000000000
          R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
          R13: 0000000000000000  R14: 0000000000000000  R15: ffff95fed33c09a8
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ffffad8f8702fce0] nvme_update_ns_info_block at ffffffffc06d3533 [nvme_core]
       #9 [ffffad8f8702fd18] nvme_scan_ns at ffffffffc06d6fa7 [nvme_core]
      
      This happened when the check for valid data was moved out of nvme_identify_ns()
      into one of the callers.  Fix this by checking in both callers.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=218186
      Fixes: 0dd6fff2
      
       ("nvme: bring back auto-removal of deleted namespaces during sequential scan")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEwan D. Milne <emilne@redhat.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      d8b90d60
    • Maurizio Lombardi's avatar
      nvme-core: fix a memory leak in nvme_ns_info_from_identify() · e3139cef
      Maurizio Lombardi authored
      
      
      In case of error, free the nvme_id_ns structure that was allocated
      by nvme_identify_ns().
      
      Signed-off-by: default avatarMaurizio Lombardi <mlombard@redhat.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarKanchan Joshi <joshi.k@samsung.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      e3139cef
    • Mark O'Donovan's avatar
      nvme: fine-tune sending of first keep-alive · 136cfcb8
      Mark O'Donovan authored
      
      
      Keep-alive commands are sent half-way through the kato period.
      This normally works well but fails when the keep-alive system is
      started when we are more than half way through the kato.
      This can happen on larger setups or due to host delays.
      With this change we now time the initial keep-alive command from
      the controller initialisation time, rather than the keep-alive
      mechanism activation time.
      
      Signed-off-by: default avatarMark O'Donovan <shiftee@posteo.net>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      136cfcb8
  8. Nov 25, 2023
    • Markus Weippert's avatar
      bcache: revert replacing IS_ERR_OR_NULL with IS_ERR · bb6cc253
      Markus Weippert authored
      Commit 028ddcac ("bcache: Remove unnecessary NULL point check in
      node allocations") replaced IS_ERR_OR_NULL by IS_ERR. This leads to a
      NULL pointer dereference.
      
      BUG: kernel NULL pointer dereference, address: 0000000000000080
      Call Trace:
       ? __die_body.cold+0x1a/0x1f
       ? page_fault_oops+0xd2/0x2b0
       ? exc_page_fault+0x70/0x170
       ? asm_exc_page_fault+0x22/0x30
       ? btree_node_free+0xf/0x160 [bcache]
       ? up_write+0x32/0x60
       btree_gc_coalesce+0x2aa/0x890 [bcache]
       ? bch_extent_bad+0x70/0x170 [bcache]
       btree_gc_recurse+0x130/0x390 [bcache]
       ? btree_gc_mark_node+0x72/0x230 [bcache]
       bch_btree_gc+0x5da/0x600 [bcache]
       ? cpuusage_read+0x10/0x10
       ? bch_btree_gc+0x600/0x600 [bcache]
       bch_gc_thread+0x135/0x180 [bcache]
      
      The relevant code starts with:
      
          new_nodes[0] = NULL;
      
          for (i = 0; i < nodes; i++) {
              if (__bch_keylist_realloc(&keylist, bkey_u64s(&r[i].b->key)))
                  goto out_nocoalesce;
          // ...
      out_nocoalesce:
          // ...
          for (i = 0; i < nodes; i++)
              if (!IS_ERR(new_nodes[i])) {  // IS_ERR_OR_NULL before
      028ddcac
                  btree_node_free(new_nodes[i]);  // new_nodes[0] is NULL
                  rw_unlock(true, new_nodes[i]);
              }
      
      This patch replaces IS_ERR() by IS_ERR_OR_NULL() to fix this.
      
      Fixes: 028ddcac ("bcache: Remove unnecessary NULL point check in node allocations")
      Link: https://lore.kernel.org/all/3DF4A87A-2AC1-4893-AE5F-E921478419A9@suse.de/
      
      
      Cc: stable@vger.kernel.org
      Cc: Zheng Wang <zyytlz.wz@163.com>
      Cc: Coly Li <colyli@suse.de>
      Signed-off-by: default avatarMarkus Weippert <markus@gekmihesg.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bb6cc253
  9. Nov 23, 2023
    • Arnd Bergmann's avatar
      nvme: tcp: fix compile-time checks for TLS mode · 0e6c4fe7
      Arnd Bergmann authored
      
      
      When CONFIG_NVME_KEYRING is enabled as a loadable module, but the TCP
      host code is built-in, it fails to link:
      
      arm-linux-gnueabi-ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl':
      tcp.c:(.text+0x1940): undefined reference to `nvme_tls_psk_default'
      
      The problem is that the compile-time conditionals are inconsistent here,
      using a mix of #ifdef CONFIG_NVME_TCP_TLS, IS_ENABLED(CONFIG_NVME_TCP_TLS)
      and IS_ENABLED(CONFIG_NVME_KEYRING) checks, with CONFIG_NVME_KEYRING
      controlling whether the implementation is actually built.
      
      Change it to use IS_ENABLED(CONFIG_NVME_KEYRING) checks consistently,
      which should help readability and make it less error-prone. Combining
      it with the check for the ctrl->opts->tls flag lets the compiler drop
      all the TLS code in configurations without this feature, which also
      helps runtime behavior in addition to avoiding the link failure.
      
      To make it possible for the compiler to build the dead code, both
      the tls_handshake_timeout variable and the TLS specific members
      of nvme_tcp_queue need to be moved out of the #ifdef block as well,
      but at least the former of these gets optimized out again.
      
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20231122224719.4042108-4-arnd@kernel.org
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0e6c4fe7
    • Arnd Bergmann's avatar
      nvme: target: fix Kconfig select statements · 65e2a74c
      Arnd Bergmann authored
      
      
      When the NVME target code is built-in but its TCP frontend is a loadable
      module, enabling keyring support causes a link failure:
      
      x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
      configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'
      
      The problem is that CONFIG_NVME_TARGET_TCP_TLS is a 'bool' symbol that
      depends on the tristate CONFIG_NVME_TARGET_TCP, so any 'select' from
      it inherits the state of the tristate symbol rather than the intended
      CONFIG_NVME_TARGET one that contains the actual call.
      
      The same thing is true for CONFIG_KEYS, which itself is required for
      NVME_KEYRING.
      
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20231122224719.4042108-3-arnd@kernel.org
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      65e2a74c
    • Arnd Bergmann's avatar
      nvme: target: fix nvme_keyring_id() references · d78abcba
      Arnd Bergmann authored
      
      
      In configurations without CONFIG_NVME_TARGET_TCP_TLS, the keyring
      code might not be available, or using it will result in a runtime
      failure:
      
      x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
      configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'
      
      Add a check to ensure we only check the keyring if there is a chance
      of it being used, which avoids both the runtime and link-time
      problems.
      
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20231122224719.4042108-2-arnd@kernel.org
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d78abcba
    • Jens Axboe's avatar
      Merge tag 'nvme-6.7-2023-11-22' of git://git.infradead.org/nvme into block-6.7 · 55072cd7
      Jens Axboe authored
      Pull NVMe fixes from Keith:
      
      "nvme fixes for Linux 6.7
      
       - TCP TLS fixes (Hannes)
       - Authentifaction fixes (Mark, Hannes)
       - Properly terminate target names (Christoph)"
      
      * tag 'nvme-6.7-2023-11-22' of git://git.infradead.org/nvme:
        nvme: move nvme_stop_keep_alive() back to original position
        nvmet-tcp: always initialize tls_handshake_tmo_work
        nvmet: nul-terminate the NQNs passed in the connect command
        nvme: blank out authentication fabrics options if not configured
        nvme: catch errors from nvme_configure_metadata()
        nvme-tcp: only evaluate 'tls' option if TLS is selected
        nvme-auth: set explanation code for failure2 msgs
        nvme-auth: unlock mutex in one place only
      55072cd7
    • Hannes Reinecke's avatar
      nvme: move nvme_stop_keep_alive() back to original position · 3af755a4
      Hannes Reinecke authored
      Stopping keep-alive not only stops the keep-alive workqueue,
      but also needs to be synchronized with I/O termination as we
      must not send a keep-alive command after all I/O had been
      terminated.
      So to avoid any regressions move the call to stop_keep_alive()
      back to its original position and ensure that keep-alive is
      correctly stopped failing to setup the admin queue.
      
      Fixes: 4733b65d
      
       ("nvme: start keep-alive after admin queue setup")
      Suggested-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarHannes Reinecke <hare@suse.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      3af755a4
  10. Nov 21, 2023