  1. Aug 28, 2021
  2. Aug 27, 2021
  3. Aug 25, 2021
  4. Aug 18, 2021
    • blk-mq: fix is_flush_rq · a9ed27a7
      Ming Lei authored
      is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the
      following check:
      
      	hctx->fq->flush_rq == req
      
      but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because:
      
      1) memory re-ordering in blk_mq_rq_ctx_init():
      
      	rq->mq_hctx = data->hctx;
      	...
      	refcount_set(&rq->ref, 1);
      
      OR
      
      2) tag re-use where ->rqs[] isn't updated with the new request.
      
      Fix the issue by re-writing is_flush_rq() as:
      
      	return rq->end_io == flush_end_io;
      
      which turns out simpler to follow and immune to the data race, since the
      WRITE of rq->end_io is ordered before refcount_set(&rq->ref, 1).
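
      For illustration, the rewritten helper is essentially the following (a
      minimal sketch; the exact placement in block/blk-flush.c may differ from
      the upstream diff):

      	/* Identify a flush request without dereferencing rq->mq_hctx,
      	 * which a concurrent tag iterator may not yet see initialized. */
      	bool is_flush_rq(struct request *rq)
      	{
      		return rq->end_io == flush_end_io;
      	}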
      
      Fixes: 2e315dc0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
      Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
      Cc: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. Aug 17, 2021
  6. Aug 13, 2021
  7. Aug 12, 2021
    • Revert "block/mq-deadline: Add cgroup support" · 0f783995
      Tejun Heo authored
      This reverts commit 08a9ad8b ("block/mq-deadline: Add cgroup support")
      and a follow-up commit c06bc5a3 ("block/mq-deadline: Remove a
      WARN_ON_ONCE() call"). The added cgroup support has the following issues:
      
      * It breaks the cgroup interface file format rule by adding custom elements
        to a nested key-value file.
      
      * It registers mq-deadline as a cgroup-aware policy even though all it's
        doing is collecting per-cgroup stats. Even if we need these stats, this
        isn't the right way to add them.
      
      * It hasn't been reviewed from cgroup side.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. Aug 10, 2021
  9. Aug 07, 2021
  10. Aug 06, 2021
  11. Aug 05, 2021
  12. Aug 04, 2021
  13. Jul 28, 2021
    • block: delay freeing the gendisk · 340e8457
      Christoph Hellwig authored
      blkdev_get_no_open() acquires a reference to the block_device through
      the block device inode and then tries to acquire a device model
      reference to the gendisk.  But at this point the disk might already
      be freed (although the race is rare).  Fix this by freeing the
      gendisk only from the whole-device bdev's ->free_inode callback as well.
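
      A rough sketch of the idea, assuming the 5.14-era bdev_free_inode() in
      fs/block_dev.c (the surrounding per-bdev teardown is elided and names are
      best-effort, not the exact upstream diff):

      	static void bdev_free_inode(struct inode *inode)
      	{
      		struct block_device *bdev = I_BDEV(inode);

      		/* ...existing freeing of per-bdev state... */

      		/* New: the gendisk is freed here for whole-device bdevs,
      		 * only after the last inode reference is gone, so
      		 * blkdev_get_no_open() can still take its device model
      		 * reference safely. */
      		if (!bdev_is_partition(bdev))
      			kfree(bdev->bd_disk);

      		kmem_cache_free(bdev_cachep, BDEV_I(inode));
      	}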
      
      Fixes: 22ae8ce8 ("block: simplify bdev/disk lookup in blkdev_get")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20210722075402.983367-2-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: fix operation ordering in iocg_wake_fn() · 5ab189cf
      Tejun Heo authored
      
      
      iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it
      wants the wq_entry to be always removed whether it ended up waking the
      task or not. finish_wait() tests whether wq_entry needs removal without
      grabbing the wait_queue lock and expects the waker to use
      list_del_init_careful() after all waking operations are complete, which
      iocg_wake_fn() didn't do. The operation order was wrong and the regular
      list_del_init() was used.
      
      The result is that if a waiter wakes up racing the waker, it can pop
      the wq_entry off the stack while the waker is still looking at it, which
      can lead to a backtrace like the following.
      
        [7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP
        ...
        [7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0
        ...
        [7312084.858314] Call Trace:
        [7312084.863548]  _raw_spin_lock_irqsave+0x22/0x30
        [7312084.872605]  try_to_wake_up+0x4c/0x4f0
        [7312084.880444]  iocg_wake_fn+0x71/0x80
        [7312084.887763]  __wake_up_common+0x71/0x140
        [7312084.895951]  iocg_kick_waitq+0xe8/0x2b0
        [7312084.903964]  ioc_rqos_throttle+0x275/0x650
        [7312084.922423]  __rq_qos_throttle+0x20/0x30
        [7312084.930608]  blk_mq_make_request+0x120/0x650
        [7312084.939490]  generic_make_request+0xca/0x310
        [7312084.957600]  submit_bio+0x173/0x200
        [7312084.981806]  swap_readpage+0x15c/0x240
        [7312084.989646]  read_swap_cache_async+0x58/0x60
        [7312084.998527]  swap_cluster_readahead+0x201/0x320
        [7312085.023432]  swapin_readahead+0x2df/0x450
        [7312085.040672]  do_swap_page+0x52f/0x820
        [7312085.058259]  handle_mm_fault+0xa16/0x1420
        [7312085.066620]  do_page_fault+0x2c6/0x5c0
        [7312085.074459]  page_fault+0x2f/0x40
      
      Fix it by switching to list_del_init_careful() and putting it at the end.
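
      Concretely, the tail of iocg_wake_fn() then looks roughly like this
      (sketch; see block/blk-iocost.c for the full function):

      	/* Wake the task first, then detach the entry with
      	 * list_del_init_careful() so finish_wait()'s lockless check only
      	 * sees the entry as removed once the waker is done touching it. */
      	default_wake_function(wq_entry, mode, 0, key);
      	list_del_init_careful(&wq_entry->entry);
      	return 0;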
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rik van Riel <riel@surriel.com>
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling · b93af305
      John Garry authored
      If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, then we
      call blk_mq_sched_free_tags() -> blk_mq_free_rqs().
      
      It is incorrect to do so, as any rqs would have already been freed in the
      blk_mq_alloc_rqs() call.
      
      Fix by directly calling only blk_mq_free_rq_map().
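
      The error path in blk_mq_sched_alloc_tags() then becomes roughly the
      following (sketch; the exact helper arguments in the 5.14 tree may
      differ):

      	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
      	if (ret) {
      		/* blk_mq_alloc_rqs() already freed whatever rqs it had
      		 * allocated, so only the tag map itself is torn down here. */
      		blk_mq_free_rq_map(hctx->sched_tags, set->flags);
      		hctx->sched_tags = NULL;
      	}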
      
      Fixes: 6917ff0b ("blk-mq-sched: refactor scheduler initialization")
      Signed-off-by: John Garry <john.garry@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/1627378373-148090-1-git-send-email-john.garry@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. Jul 24, 2021
  15. Jul 23, 2021
    • Merge tag 'nvme-5.14-2021-07-22' of git://git.infradead.org/nvme into block-5.14 · 7054133d
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 5.14:
      
       - tracing fix (Keith Busch)
       - fix multipath head refcounting (Hannes Reinecke)
       - Write Zeroes vs PI fix (me)
       - drop a bogus WARN_ON (Zhihao Cheng)"
      
      * tag 'nvme-5.14-2021-07-22' of git://git.infradead.org/nvme:
        nvme: set the PRACT bit when using Write Zeroes with T10 PI
        nvme: fix nvme_setup_command metadata trace event
        nvme: fix refcounting imbalance when all paths are down
        nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
  16. Jul 21, 2021
    • nvme: set the PRACT bit when using Write Zeroes with T10 PI · aaeb7bb0
      Christoph Hellwig authored
      When using Write Zeroes on a namespace that has protection
      information enabled, the behavior without the PRACT bit is
      counter-intuitive and will generally lead to validation failures
      when reading the written blocks.  Fix this by always setting the
      PRACT bit, which generates matching PI data on the fly.
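
      The resulting logic in nvme_setup_write_zeroes() is roughly the
      following (sketch, assuming the usual nvme_ns_has_pi() helper and the
      NVME_RW_PRINFO_PRACT flag):

      	/* Let the controller generate matching PI on the fly for Write
      	 * Zeroes on namespaces formatted with protection information. */
      	if (nvme_ns_has_pi(ns))
      		cmnd->write_zeroes.control = cpu_to_le16(NVME_RW_PRINFO_PRACT);
      	else
      		cmnd->write_zeroes.control = 0;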
      
      Fixes: 6e02318e ("nvme: add support for the Write Zeroes command")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • nvme: fix nvme_setup_command metadata trace event · 234211b8
      Keith Busch authored
      
      
      The metadata address is set after the trace event, so the trace is not
      capturing anything useful. Rather than logging the memory address, it's
      useful to know if the command carries a metadata payload, so change the
      trace event to log that true/false state instead.
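
      In the trace event this amounts to recording a flag rather than the
      address, along the lines of the following (hedged sketch; the actual
      field name in drivers/nvme/host/trace.h may differ):

      	/* Log whether the request carries a metadata payload at all. */
      	__entry->metadata = !!blk_integrity_rq(req);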
      
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix refcounting imbalance when all paths are down · 5396fdac
      Hannes Reinecke authored
      
      
      When the last path to a ns_head drops, the current code
      removes the ns_head from the subsystem list, but will only
      delete the disk itself once the last reference to the ns_head
      drops. This causes a refcounting imbalance, e.g. when
      applications hold a reference to the disk, as they will then
      never be notified that the disk is in fact dead.
      This patch moves the call to del_gendisk() into nvme_mpath_check_last_path(),
      ensuring that the disk can be properly removed and applications get the
      appropriate notifications.
      
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING · 7764656b
      Zhihao Cheng authored
      The following process:
      nvme_probe
        nvme_reset_ctrl
          nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)
          queue_work(nvme_reset_wq, &ctrl->reset_work)
      
      -------------->	nvme_remove
      		  nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING)
      worker_thread
        process_one_work
          nvme_reset_work
          WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)
      
      will trigger the WARN_ON in nvme_reset_work():
      [  127.534298] WARNING: CPU: 0 PID: 139 at drivers/nvme/host/pci.c:2594
      [  127.536161] CPU: 0 PID: 139 Comm: kworker/u8:7 Not tainted 5.13.0
      [  127.552518] Call Trace:
      [  127.552840]  ? kvm_sched_clock_read+0x25/0x40
      [  127.553936]  ? native_send_call_func_single_ipi+0x1c/0x30
      [  127.555117]  ? send_call_function_single_ipi+0x9b/0x130
      [  127.556263]  ? __smp_call_single_queue+0x48/0x60
      [  127.557278]  ? ttwu_queue_wakelist+0xfa/0x1c0
      [  127.558231]  ? try_to_wake_up+0x265/0x9d0
      [  127.559120]  ? ext4_end_io_rsv_work+0x160/0x290
      [  127.560118]  process_one_work+0x28c/0x640
      [  127.561002]  worker_thread+0x39a/0x700
      [  127.561833]  ? rescuer_thread+0x580/0x580
      [  127.562714]  kthread+0x18c/0x1e0
      [  127.563444]  ? set_kthread_struct+0x70/0x70
      [  127.564347]  ret_from_fork+0x1f/0x30
      
      The preceding problem can be easily reproduced by executing the following
      script (based on the blktests suite):
      test() {
        pdev="$(_get_pci_dev_from_blkdev)"
        sysfs="/sys/bus/pci/devices/${pdev}"
        for ((i = 0; i < 10; i++)); do
          echo 1 > "$sysfs/remove"
          echo 1 > /sys/bus/pci/rescan
        done
      }
      
      Since the device ctrl could be moved to a non-RESETTING state by
      repeated probe/remove from userspace (which is a normal situation),
      replace the stack-dumping WARN_ON with a warning message.
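
      The check at the top of nvme_reset_work() then becomes something like
      the following (sketch; the exact message text and error handling are
      illustrative):

      	if (dev->ctrl.state != NVME_CTRL_RESETTING) {
      		/* A racing probe/remove can legitimately move the state,
      		 * so warn without dumping a stack. */
      		dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n",
      			 dev->ctrl.state);
      		return;
      	}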
      
      Fixes: 82b057ca ("nvme-pci: fix multiple ctrl removal scheduling")
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
  17. Jul 18, 2021
  18. Jul 15, 2021
  19. Jul 13, 2021
    • nvme-pci: do not call nvme_dev_remove_admin from nvme_remove · 251ef6f7
      Casey Chen authored
      nvme_dev_remove_admin could free dev->admin_q and the admin_tagset
      while they are being accessed by nvme_dev_disable(), which can be called
      by nvme_reset_work via nvme_remove_dead_ctrl.
      
      Commit cb4bfda6 ("nvme-pci: fix hot removal during error handling")
      intended to avoid requests being stuck on a removed controller by killing
      the admin queue. But the later fix c8e9e9b7 ("nvme-pci: unquiesce
      admin queue on shutdown"), together with nvme_dev_disable(dev, true)
      right before nvme_dev_remove_admin(), could help dispatch requests and
      fail them early, so we don't need nvme_dev_remove_admin() any more.
      
      Fixes: cb4bfda6 ("nvme-pci: fix hot removal during error handling")
      Signed-off-by: Casey Chen <cachen@purestorage.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: fix multiple races in nvme_setup_io_queues · e4b9852a
      Casey Chen authored
      
      
      The two paths below could overlap each other if we power off a drive quickly
      after powering it on. There are multiple races in nvme_setup_io_queues()
      because shutdown_lock is not taken where needed and the NVMEQ_ENABLED bit
      is used improperly.
      
      nvme_reset_work()                                nvme_remove()
        nvme_setup_io_queues()                           nvme_dev_disable()
        ...                                              ...
      A1  clear NVMEQ_ENABLED bit for admin queue          lock
          retry:                                       B1  nvme_suspend_io_queues()
      A2    pci_free_irq() admin queue                 B2  nvme_suspend_queue() admin queue
      A3    pci_free_irq_vectors()                         nvme_pci_disable()
      A4    nvme_setup_irqs();                         B3    pci_free_irq_vectors()
            ...                                            unlock
      A5    queue_request_irq() for admin queue
            set NVMEQ_ENABLED bit
            ...
            nvme_create_io_queues()
      A6      result = queue_request_irq();
              set NVMEQ_ENABLED bit
            ...
            fail to allocate enough IO queues:
      A7      nvme_suspend_io_queues()
              goto retry
      
      If B3 runs in between A1 and A2, it will crash because the irqaction hasn't
      been freed by A2 yet. B2 is supposed to free the admin queue IRQ but it
      simply can't do the job, as A1 has already cleared the NVMEQ_ENABLED bit.
      
      Fix: combine A1 and A2 so the IRQ gets freed as soon as the NVMEQ_ENABLED
      bit gets cleared.
      
      After solving #1, A2 could race with B3 if A2 is freeing the IRQ while B3
      is checking the irqaction. A3 could also race with B2 if B2 is freeing the
      IRQ while A3 is checking the irqaction.
      
      Fix: A2 and A3 take lock for mutual exclusion.
      
      A3 could race with B3 since they could run free_msi_irqs() in parallel.
      
      Fix: A3 takes lock for mutual exclusion.
      
      A4 could fail to allocate all needed IRQ vectors if A3 and A4 are
      interrupted by B3.
      
      Fix: A4 takes lock for mutual exclusion.
      
      If A5/A6 happens after B2/B1, B3 will crash since the irqaction is not NULL;
      it has just been allocated by A5/A6.
      
      Fix: Lock queue_request_irq() and setting of NVMEQ_ENABLED bit.
      
      A7 could get a chance to pci_free_irq() a certain IO queue while B3 is
      checking the irqaction.
      
      Fix: A7 takes lock.
      
      nvme_dev->online_queues needs to be protected by shutdown_lock. Since it
      is not atomic, each path could otherwise modify it using its own stale copy.
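
      In code terms the resulting pattern in nvme_setup_io_queues() is roughly
      the following (hedged sketch; the exact lock placement and error handling
      in the upstream patch differ in detail):

      	mutex_lock(&dev->shutdown_lock);

      	/* A1+A2 combined: free the admin IRQ the moment NVMEQ_ENABLED is
      	 * cleared, so no window exists where the bit is clear but the
      	 * irqaction is still installed. */
      	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
      		pci_free_irq(pdev, 0, adminq);

      	/* A3/A4 under the same lock, so nvme_dev_disable() cannot free
      	 * the vectors or irqactions concurrently. */
      	pci_free_irq_vectors(pdev);
      	result = nvme_setup_irqs(dev, nr_io_queues);

      	mutex_unlock(&dev->shutdown_lock);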
      
      Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
      Signed-off-by: Casey Chen <cachen@purestorage.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE · 8b43ced6
      Prabhakar Kushwaha authored
      dev_get_by_name() finds a network device by name, but it also increases its
      reference count.
      
      If an nvme-tcp queue is present and the network device driver is removed
      before nvme_tcp, we will see the following message logged continuously:
      
        "kernel:unregister_netdevice: waiting for <eth> to become free. Usage count = 2"
      
      rmmod then hangs. A similar case arises during reboot/shutdown with an
      nvme-tcp queue present, and neither ever completes.
      
      To fix this, use __dev_get_by_name(), which finds a network device by
      name without taking a reference on it.
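
      The interface lookup then reduces to an existence check, roughly as
      follows (sketch; the surrounding error handling is paraphrased from
      nvme-tcp's option parsing):

      	/* Only validate that the interface exists; do not take a reference
      	 * that would pin the netdev until the nvme-tcp queue goes away. */
      	if ((opts->mask & NVMF_OPT_HOST_IFACE) &&
      	    !__dev_get_by_name(&init_net, opts->host_iface)) {
      		pr_err("invalid interface passed: %s\n", opts->host_iface);
      		ret = -ENODEV;
      		goto out_free_ctrl;
      	}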
      
      Fixes: 3ede8f72 ("nvme-tcp: allow selecting the network interface for connections")
      Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
      Signed-off-by: Shai Malin <smalin@marvell.com>
      Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      [hch: remove the ->ndev member entirely]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  20. Jul 07, 2021
  21. Jul 05, 2021
  22. Jul 02, 2021