  1. Feb 02, 2021
  2. Jan 29, 2021
    • null_blk: cleanup zoned mode initialization · cd92cdb9
      Damien Le Moal authored
      
      
      To avoid potential compilation problems, replace the badly written
      MB_TO_SECTS() macro (missing parentheses around the argument use) with
      the inline function mb_to_sects(). While at it, simplify the
      calculation of the total number of zones of the device using the
      round_up() macro.
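
      Why the parentheses matter can be shown with a small user-space sketch.
      The macro body below is an assumed reconstruction of the kind of bug
      described, not the driver's exact code; SECTOR_SHIFT, SZ_1M,
      MB_TO_SECTS_BAD() and nr_zones() are stand-ins defined here purely for
      illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9                  /* 512-byte sectors */
#define SZ_1M        (1024ULL * 1024ULL)

/* Assumed shape of the bug: "mb" is not parenthesized, so the cast and the
 * multiplication bind to only part of a compound argument such as "a + b". */
#define MB_TO_SECTS_BAD(mb) (((uint64_t)mb * SZ_1M) >> SECTOR_SHIFT)

/* An inline function evaluates its argument once, as a whole. */
static inline uint64_t mb_to_sects(unsigned long mb)
{
	return ((uint64_t)mb * SZ_1M) >> SECTOR_SHIFT;
}

/* Hypothetical helper: number of zones needed to cover "capacity" sectors,
 * counting a smaller last zone, by rounding the capacity up to a multiple
 * of the zone size first. */
static inline uint64_t nr_zones(uint64_t capacity, uint64_t zone_size)
{
	assert(zone_size != 0);
	uint64_t rounded = ((capacity + zone_size - 1) / zone_size) * zone_size;

	return rounded / zone_size;
}
```

      With the argument 1 + 1, the macro expands to (uint64_t)1 + 1 * SZ_1M
      and yields 2048 sectors, while mb_to_sects(1 + 1) yields 4096.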
      
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'nvme-5.11-2021-01-28' of git://git.infradead.org/nvme into block-5.11 · e2579c76
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for 5.11:
      
       - add another Write Zeroes quirk (Chaitanya Kulkarni)
       - handle a no path available corner case (Daniel Wagner)
       - use the proper RCU aware list_add helper (Chao Leng)"
      
      * tag 'nvme-5.11-2021-01-28' of git://git.infradead.org/nvme:
        nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head
        nvme-multipath: Early exit if no path is available
        nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device
    • nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head · 772ea326
      Chao Leng authored
      
      
      The "list" of nvme_ns_head is used as an RCU list, but nvme_init_ns_head()
      uses list_add_tail() to add ns->siblings to it, which is not safe.
      Use list_add_tail_rcu() instead of list_add_tail().
      
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-multipath: Early exit if no path is available · d1bcf006
      Daniel Wagner authored
      nvme_round_robin_path() should test if the returned ns pointer is valid.
      nvme_next_ns() will return a NULL pointer if there is no path left.
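
      The early exit can be sketched in user space as follows. This is only
      an illustration under assumptions: struct path, struct head, next_path()
      and round_robin_path() are hypothetical stand-ins, and an array replaces
      the kernel's sibling list.

```c
#include <stddef.h>

struct path {
	int usable;          /* non-zero if I/O may be sent down this path */
};

struct head {
	struct path *paths;  /* hypothetical: array of sibling paths */
	size_t nr_paths;     /* may drop to 0 when all paths are removed */
};

/* Hypothetical stand-in for nvme_next_ns(): returns the path after "cur",
 * wrapping around, or NULL if there is no path left. */
static struct path *next_path(struct head *h, struct path *cur)
{
	size_t idx;

	if (h->nr_paths == 0)
		return NULL;
	idx = (size_t)(cur - h->paths);
	return &h->paths[(idx + 1) % h->nr_paths];
}

/* Round-robin selection: start after "old" and stop on the first usable
 * path. The "p &&" test is the early exit: without it, a NULL returned by
 * next_path() would be dereferenced by p->usable. */
static struct path *round_robin_path(struct head *h, struct path *old)
{
	struct path *p;

	for (p = next_path(h, old); p && p != old; p = next_path(h, p)) {
		if (p->usable)
			return p;
	}
	/* Fall back to the old path only if paths remain and it is usable. */
	if (p && old->usable)
		return old;
	return NULL;
}
```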
      
      Fixes: 75c10e73 ("nvme-multipath: round-robin I/O policy")
      Signed-off-by: Daniel Wagner <dwagner@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device · 89919929
      Chaitanya Kulkarni authored
      
      
      This adds a quirk for an SPCC 256GB NVMe 1.3 drive, which fixes timeouts
      and I/O errors caused by the controller not properly handling the
      Write Zeroes command:
      
      [ 2745.659527] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G            E 5.10.6-BET #1
      [ 2745.659528] Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 3001 12/04/2020
      [ 2776.138874] nvme nvme1: I/O 414 QID 3 timeout, aborting
      [ 2776.138886] nvme nvme1: I/O 415 QID 3 timeout, aborting
      [ 2776.138891] nvme nvme1: I/O 416 QID 3 timeout, aborting
      [ 2776.138895] nvme nvme1: I/O 417 QID 3 timeout, aborting
      [ 2776.138912] nvme nvme1: Abort status: 0x0
      [ 2776.138921] nvme nvme1: I/O 428 QID 3 timeout, aborting
      [ 2776.138922] nvme nvme1: Abort status: 0x0
      [ 2776.138925] nvme nvme1: Abort status: 0x0
      [ 2776.138974] nvme nvme1: Abort status: 0x0
      [ 2776.138977] nvme nvme1: Abort status: 0x0
      [ 2806.346792] nvme nvme1: I/O 414 QID 3 timeout, reset controller
      [ 2806.363566] nvme nvme1: 15/0/0 default/read/poll queues
      [ 2836.554298] nvme nvme1: I/O 415 QID 3 timeout, disable controller
      [ 2836.672064] blk_update_request: I/O error, dev nvme1n1, sector 16350 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672072] blk_update_request: I/O error, dev nvme1n1, sector 16093 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672074] blk_update_request: I/O error, dev nvme1n1, sector 15836 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672076] blk_update_request: I/O error, dev nvme1n1, sector 15579 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672078] blk_update_request: I/O error, dev nvme1n1, sector 15322 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672080] blk_update_request: I/O error, dev nvme1n1, sector 15065 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672082] blk_update_request: I/O error, dev nvme1n1, sector 14808 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672083] blk_update_request: I/O error, dev nvme1n1, sector 14551 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672085] blk_update_request: I/O error, dev nvme1n1, sector 14294 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672087] blk_update_request: I/O error, dev nvme1n1, sector 14037 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
      [ 2836.672121] nvme nvme1: failed to mark controller live state
      [ 2836.672123] nvme nvme1: Removing after probe failure status: -19
      [ 2836.689016] Aborting journal on device dm-0-8.
      [ 2836.689024] Buffer I/O error on dev dm-0, logical block 25198592, lost sync page write
      [ 2836.689027] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
      
      Reported-by: Bradley Chapman <chapman6235@comcast.net>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Tested-by: Bradley Chapman <chapman6235@comcast.net>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. Jan 28, 2021
    • bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES · 0df28cad
      Coly Li authored
      For super block version < BCACHE_SB_VERSION_CDEV_WITH_FEATURES, it
      doesn't make sense to check the feature sets. This patch checks the
      super block version in the bch_has_feature_* routines; if the version
      doesn't have feature sets yet, they return 0 (false) to the caller.
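
      A minimal sketch of such a version-gated check, assuming a simplified
      super block. The struct layout, the constant values and the helper name
      sb_has_incompat_feature() are illustrative and are not taken from the
      bcache headers.

```c
#include <stdint.h>

/* Illustrative values; the real constants live in the bcache headers. */
#define SB_VERSION_CDEV_WITH_FEATURES 5
#define FEATURE_INCOMPAT_EXAMPLE      0x0001ULL

struct cache_sb {
	uint64_t version;
	uint64_t feature_incompat;  /* only meaningful for new enough versions */
};

/* Returns 1 if any bit of "mask" is set, 0 otherwise. Old super blocks
 * carry no feature sets, so whatever sits in the field is ignored. */
static inline int sb_has_incompat_feature(const struct cache_sb *sb,
					  uint64_t mask)
{
	if (sb->version < SB_VERSION_CDEV_WITH_FEATURES)
		return 0;
	return (sb->feature_incompat & mask) != 0;
}
```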
      
      Fixes: 5342fd42 ("bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET")
      Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
      Cc: stable@vger.kernel.org # 5.9+
      Reported-and-tested-by: Bockholdt Arne <a.bockholdt@precitec-optronik.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix bd_size_lock use · 0fe37724
      Damien Le Moal authored
      
      
      Some block device drivers, e.g. the skd driver, call set_capacity() with
      IRQs disabled. This results in lockdep complaining about inconsistent
      lock states ("inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage")
      because set_capacity() takes the block device bd_size_lock using the
      functions spin_lock() and spin_unlock(). Ensure a consistent locking
      state by replacing these calls with spin_lock_irqsave() and
      spin_unlock_irqrestore(). The same applies to bdev_set_nr_sectors().
      With this fix, all lockdep complaints are resolved.
      
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: Use cond_resched() when destroy blkgs · 6c635cae
      Baolin Wang authored
      
      
      On !PREEMPT kernels, we can get the soft lockup below when stress
      testing by repeatedly creating and destroying block cgroups. The
      reason is that it may take a long time to acquire the queue's lock in
      the loop of blkcg_destroy_blkgs(), or the system can accumulate a
      huge number of blkgs in pathological cases. To avoid this, add a
      need_resched() check on each loop iteration and, if it is true,
      release the locks and call cond_resched(). This is safe because
      blkcg_destroy_blkgs() is not called from atomic contexts.
      
      [ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
      [ 4757.010698] Call trace:
      [ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
      [ 4757.010701]  cgwb_release_workfn+0x104/0x158
      [ 4757.010702]  process_one_work+0x1bc/0x3f0
      [ 4757.010704]  worker_thread+0x164/0x468
      [ 4757.010705]  kthread+0x108/0x138
      
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "block: simplify set_init_blocksize" to regain lost performance · 8dc932d3
      Maxim Mikityanskiy authored
      The cited commit introduced a serious regression with SATA write speed,
      as found by bisecting. This patch reverts this commit, which restores
      write speed back to the values observed before this commit.
      
      The performance tests were done on a Helios4 NAS (2nd batch) with 4 HDDs
      (WD8003FFBX) using dd (bs=1M count=2000). "Direct" is a test with a
      single HDD, the rest are different RAID levels built over the first
      partitions of 4 HDDs. Test results are in MB/s, R is read, W is write.
      
                      | Direct | RAID0 | RAID10 f2 | RAID10 n2 | RAID6
      ----------------+--------+-------+-----------+-----------+--------
      9011495c        | R:256  | R:313 | R:276     | R:313     | R:323
      (before faulty) | W:254  | W:253 | W:195     | W:204     | W:117
      ----------------+--------+-------+-----------+-----------+--------
      5ff9f192        | R:257  | R:398 | R:312     | R:344     | R:391
      (faulty commit) | W:154  | W:122 | W:67.7    | W:66.6    | W:67.2
      ----------------+--------+-------+-----------+-----------+--------
      5.10.10         | R:256  | R:401 | R:312     | R:356     | R:375
      unpatched       | W:149  | W:123 | W:64      | W:64.1    | W:61.5
      ----------------+--------+-------+-----------+-----------+--------
      5.10.10         | R:255  | R:396 | R:312     | R:340     | R:393
      patched         | W:247  | W:274 | W:220     | W:225     | W:121
      
      Applying this patch doesn't hurt read performance, while improving the
      write speed by 1.5x - 3.5x (more impact on RAID tests). The write speed
      is restored to the state before the faulty commit, and is even a bit
      higher in RAID tests (which aren't HDD-bound on this device). That is
      likely related to other optimizations done between the faulty commit and
      5.10.10, which also improved the read speed.
      
      Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Fixes: 5ff9f192 ("block: simplify set_init_blocksize")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Jan 26, 2021
    • nbd: freeze the queue while we're adding connections · b98e762e
      Josef Bacik authored
      
      
      When setting up a device, we can krealloc the config->socks array to add
      new sockets to the configuration.  However, if we happen to get an IO
      request at this point, even though we aren't set up, we could hit a UAF,
      as we deref config->socks without any locking, assuming that the
      configuration was set up already and that ->socks is safe to access as
      we have a reference on the configuration.

      Nothing really prevents IO from occurring at this point of the device
      setup, but we don't want to incur the overhead of a lock to
      access ->socks when it will never change while the device is running.
      To fix this UAF scenario, simply freeze the queue if we are adding
      sockets.  This will protect us from this particular case without adding
      any additional overhead for the normal running case.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • s390/dasd: Fix inconsistent kobject removal · ac55ad2b
      Jan Höppner authored
      Our intention was to only remove path kobjects whenever a device is
      being set offline. However, one corner case was missing.
      
      If a device is disabled and enabled (using the IOCTLs BIODASDDISABLE and
      BIODASDENABLE respectively), the enabling process will call
      dasd_eckd_reload_device() which itself calls dasd_eckd_read_conf() in
      order to update path information. During that update,
      dasd_eckd_clear_conf_data() clears all old data and also removes all
      kobjects. This will leave us with an inconsistent state of path kobjects
      and a subsequent path verification leads to a failing kobject creation.
      
      Fix this by removing kobjects only in the context of offlining a device
      as initially intended.
      
      Fixes: 19508b20 ("s390/dasd: Display FC Endpoint Security information via sysfs")
      Reported-by: Stefan Haberland <sth@linux.ibm.com>
      Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
      Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. Jan 25, 2021
  6. Jan 21, 2021
    • lightnvm: fix memory leak when submit fails · 97784481
      Pan Bian authored
      The allocated page is not released if an error occurs in
      nvm_submit_io_sync_raw(). __free_page() is moved earlier to avoid
      a possible memory leak.
      
      Fixes: aff3fb18 ("lightnvm: move bad block and chunk state logic to core")
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'nvme-5.11-2020-01-21' of git://git.infradead.org/nvme into block-5.11 · 1df35bf0
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for 5.11:
      
       - fix a status code in nvmet (Chaitanya Kulkarni)
       - avoid double completions in nvme-rdma/nvme-tcp (Chao Leng)
       - fix the CMB support to cope with NVMe 1.4 controllers (Klaus Jensen)
       - fix PRINFO handling in the passthrough ioctl (Revanth Rajashekar)
       - fix a double DMA unmap in nvme-pci"
      
      * tag 'nvme-5.11-2020-01-21' of git://git.infradead.org/nvme:
        nvme-pci: fix error unwind in nvme_map_data
        nvme-pci: refactor nvme_unmap_data
        nvmet: set right status on error in id-ns handler
        nvme-pci: allow use of cmb on v1.4 controllers
        nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout
        nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout
        nvme: check the PRINFO bit before deciding the host buffer length
    • nvme-pci: fix error unwind in nvme_map_data · fa073216
      Christoph Hellwig authored
      Properly unwind step by step using refactored helpers from nvme_unmap_data
      to avoid a potential double dma_unmap on a mapping failure.
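
      The general shape of step-by-step unwinding can be sketched in user
      space as below. This is not the driver code: struct mapping,
      step_alloc(), map_data() and unmap_data() are hypothetical, and malloc()
      and free() stand in for the mapping and unmapping steps.

```c
#include <stdlib.h>

struct mapping {
	void *sg;      /* stands in for the scatterlist allocation */
	void *dma;     /* stands in for the DMA mapping */
	void *prps;    /* stands in for the descriptor setup */
};

/* Hypothetical step helper; "fail_at" lets a caller force step N to fail. */
static void *step_alloc(int step, int fail_at)
{
	return step == fail_at ? NULL : malloc(16);
}

/* Each label undoes exactly the steps that succeeded before the failure,
 * so nothing is released twice and nothing is leaked. */
static int map_data(struct mapping *m, int fail_at)
{
	m->sg = step_alloc(1, fail_at);
	if (!m->sg)
		goto out;
	m->dma = step_alloc(2, fail_at);
	if (!m->dma)
		goto out_free_sg;
	m->prps = step_alloc(3, fail_at);
	if (!m->prps)
		goto out_unmap_dma;
	return 0;

out_unmap_dma:
	free(m->dma);
	m->dma = NULL;
out_free_sg:
	free(m->sg);
	m->sg = NULL;
out:
	return -1;
}

/* Full teardown, used only after map_data() succeeded. */
static void unmap_data(struct mapping *m)
{
	free(m->prps);
	free(m->dma);
	free(m->sg);
	m->prps = m->dma = m->sg = NULL;
}
```

      The error path never calls the full teardown, which is what avoids
      releasing the same resource twice when a middle step fails.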
      
      Fixes: 7fe07d14 ("nvme-pci: merge nvme_free_iod into nvme_unmap_data")
      Reported-by: Marc Orr <marcorr@google.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Marc Orr <marcorr@google.com>
    • nvme-pci: refactor nvme_unmap_data · 9275c206
      Christoph Hellwig authored
      
      
      Split out three helpers from nvme_unmap_data that will allow finer grained
      unwinding from nvme_map_data.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Marc Orr <marcorr@google.com>
    • Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.11 · 8dfe1168
      Jens Axboe authored
      Pull MD fix from Song.
      
      * 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: Set prev_flush_start and flush_bio in an atomic way
    • md: Set prev_flush_start and flush_bio in an atomic way · dc5d17a3
      Xiao Ni authored
      
      
      One customer reported a crash caused by a flush request. It
      triggers a warning before the crash.
      
              /* new request after previous flush is completed */
              if (ktime_after(req_start, mddev->prev_flush_start)) {
                      WARN_ON(mddev->flush_bio);
                      mddev->flush_bio = bio;
                      bio = NULL;
              }
      
      The WARN_ON is triggered. We use a spin lock to protect prev_flush_start
      and flush_bio in md_flush_request(), but there is no lock protection in
      md_submit_flush_data(). It can set flush_bio to NULL first because the
      compiler may reorder the write instructions.

      For example, flush bio1 sets flush_bio to NULL first in
      md_submit_flush_data(). An interrupt or VMware causing an extended stall
      happens between updating flush_bio and prev_flush_start. Because flush_bio
      is NULL, flush bio2 can get the lock and be submitted to the underlying
      disks. Then flush bio1 updates prev_flush_start after the interrupt or
      extended stall.

      Then flush bio3 enters md_flush_request(). Its start time req_start is
      behind prev_flush_start, and flush_bio is not NULL (flush bio2 hasn't
      finished), so it triggers the WARN_ON. It then calls INIT_WORK()
      again. INIT_WORK() re-initializes the list pointers in the
      work_struct, which can result in a corrupted work list and the
      work_struct being queued a second time. With the work list corrupted,
      invalid work items can be used, causing a crash in
      process_one_work().

      We need to make sure only one flush bio is handled at a time.
      So take the spin lock in md_submit_flush_data() to update
      prev_flush_start and flush_bio atomically.
      
      Reviewed-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  7. Jan 19, 2021
    • nvmet: set right status on error in id-ns handler · bffcd507
      Chaitanya Kulkarni authored
      
      
      The function nvmet_execute_identify_ns() doesn't set the status if the
      call to nvmet_find_namespace() fails. In that case the status of the
      request is set to the value returned by nvmet_copy_sgl().

      Set the status to NVME_SC_INVALID_NS and adjust the code such that the
      request has the right status on nvmet_find_namespace() failure.
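
      The control flow being described can be sketched as follows. This is a
      user-space illustration under assumptions: struct ns, find_namespace()
      and identify_ns() are hypothetical, and the status values are local
      constants (0xb mirrors the INVALID_NS code shown in the output further
      down).

```c
#include <stddef.h>
#include <stdint.h>

#define SC_SUCCESS    0x0
#define SC_INVALID_NS 0xb

struct ns {
	uint32_t nsid;
	uint64_t size;
};

/* Hypothetical lookup: returns NULL when no namespace has this nsid. */
static struct ns *find_namespace(struct ns *table, size_t n, uint32_t nsid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i].nsid == nsid)
			return &table[i];
	}
	return NULL;
}

/* Returns the completion status. On lookup failure the status is set
 * explicitly instead of being overwritten by a later, successful copy
 * of an all-zero structure. */
static uint16_t identify_ns(struct ns *table, size_t n, uint32_t nsid,
			    struct ns *out)
{
	struct ns *found = find_namespace(table, n, nsid);
	uint16_t status = SC_SUCCESS;

	if (!found) {
		status = SC_INVALID_NS;
		goto done;
	}
	*out = *found;
done:
	return status;
}
```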
      
      Without this patch :-
      NVME Identify Namespace 3:
      nsze    : 0
      ncap    : 0
      nuse    : 0
      nsfeat  : 0
      nlbaf   : 0
      flbas   : 0
      mc      : 0
      dpc     : 0
      dps     : 0
      nmic    : 0
      rescap  : 0
      fpi     : 0
      dlfeat  : 0
      nawun   : 0
      nawupf  : 0
      nacwu   : 0
      nabsn   : 0
      nabo    : 0
      nabspf  : 0
      noiob   : 0
      nvmcap  : 0
      mssrl   : 0
      mcl     : 0
      msrc    : 0
      nsattr  : 0
      nvmsetid: 0
      anagrpid: 0
      endgid  : 0
      nguid   : 00000000000000000000000000000000
      eui64   : 0000000000000000
      lbaf  0 : ms:0   lbads:0  rp:0 (in use)
      
      With this patch-series :-
      feb3b88b501e (HEAD -> nvme-5.11) nvmet: remove extra variable in identify ns
      6302aa67210a nvmet: remove extra variable in id-desclist
      ed57951da453 nvmet: remove extra variable in smart log nsid
      be384b8c24dc nvmet: set right status on error in id-ns handler
      
      NVMe status: INVALID_NS: The namespace or the format of that namespace is invalid(0xb)
      
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: allow use of cmb on v1.4 controllers · 20d3bb92
      Klaus Jensen authored
      
      
      Since NVMe v1.4 the Controller Memory Buffer must be explicitly enabled
      by the host.
      
      Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
      [hch: avoid a local variable and add a comment]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout · 9ebbfe49
      Chao Leng authored
      
      
      Each namespace has a request queue. If requests take a long time to
      complete, multiple request queues may have timed-out requests at the
      same time, and nvme_tcp_timeout will execute concurrently. Requests in
      different request queues may be queued in the same tcp queue, so
      multiple nvme_tcp_timeout calls may call nvme_tcp_stop_queue at the
      same time. The first nvme_tcp_stop_queue will clear NVME_TCP_Q_LIVE and
      continue stopping the tcp queue (cancel io_work), but the others see
      that NVME_TCP_Q_LIVE is already cleared and directly complete the
      requests. Completing a request before the io work is completely
      canceled may lead to a use-after-free condition.

      Add a mutex lock to serialize nvme_tcp_stop_queue.
      
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout · 7674073b
      Chao Leng authored
      
      
      A crash happens when fault injection delays request completion for a
      long time (nearly 30s). Each namespace has a request queue. When
      completion is delayed this long, multiple request queues may have
      timed-out requests at the same time, and nvme_rdma_timeout will execute
      concurrently. Requests in different request queues may be queued in the
      same rdma queue, so multiple nvme_rdma_timeout calls may call
      nvme_rdma_stop_queue at the same time. The first nvme_rdma_timeout will
      clear NVME_RDMA_Q_LIVE and continue stopping the rdma queue (drain qp),
      but the others see that NVME_RDMA_Q_LIVE is already cleared and
      directly complete the requests. Completing a request before the qp is
      fully drained may lead to a use-after-free condition.

      Add a mutex lock to serialize nvme_rdma_stop_queue.
      
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Tested-by: Israel Rukshin <israelr@nvidia.com>
      Reviewed-by: Israel Rukshin <israelr@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: check the PRINFO bit before deciding the host buffer length · 4d6b1c95
      Revanth Rajashekar authored
      
      
      According to NVMe spec v1.4, section 8.3.1, the PRINFO bit and
      the metadata size play a vital role in determining the host buffer size.

      If the PRINFO bit is set and MS==8, the host doesn't add the metadata
      buffer; instead the controller adds it.
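
      The rule above can be sketched as a small length calculation. The names
      PRINFO_BIT, PI_TUPLE_SIZE, struct io_lens and host_buffer_lens() are
      hypothetical, and the bit position used for PRINFO_BIT is a placeholder
      rather than a value taken from the spec or the driver.

```c
#include <stdint.h>

#define PRINFO_BIT    (1u << 13)  /* placeholder bit position */
#define PI_TUPLE_SIZE 8           /* MS == 8 in the text above */

struct io_lens {
	uint32_t data_len;  /* bytes of data the host must provide */
	uint32_t meta_len;  /* bytes of metadata the host must provide */
};

/* nblocks is treated as zero-based here (an assumption of this sketch),
 * lba_shift is log2 of the block size, and ms is the per-block metadata
 * size in bytes. */
static struct io_lens host_buffer_lens(uint16_t control, uint16_t nblocks,
				       unsigned int lba_shift, uint16_t ms)
{
	struct io_lens l;
	uint32_t blocks = (uint32_t)nblocks + 1;

	l.data_len = blocks << lba_shift;
	if ((control & PRINFO_BIT) && ms == PI_TUPLE_SIZE)
		l.meta_len = 0;            /* the controller adds the metadata */
	else
		l.meta_len = blocks * ms;  /* the host supplies the metadata */
	return l;
}
```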
      
      Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  8. Jan 15, 2021
  9. Jan 10, 2021
    • bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET · 5342fd42
      Coly Li authored
      If BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET is set in the incompat feature
      set, the cache device was created with the obsolete layout using
      obso_bucket_size_hi. bcache no longer supports this feature bit; a new
      BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE incompat feature bit is added
      for a better layout to support large bucket sizes.

      For legacy compatibility, if a cache device was created with the
      obsolete BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit, all bcache
      devices attached to this cache set should be set to read-only. Then the
      dirty data can be written back to the backing device before re-creating
      the cache device with the BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE
      feature bit using the latest bcache-tools.

      This patch checks the BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit
      when running a cache set and when attaching a bcache device to the cache
      set. If this bit is set,
      - When running a cache set, print an error kernel message to indicate
        that all subsequently attached bcache devices will be read-only.
      - When attaching a bcache device, print an error kernel message to
        indicate that the attached bcache device will be read-only, and ask
        users to update to the latest bcache-tools.

      This change only affects cache devices whose bucket size is >= 32MB,
      which is meant for zoned SSDs; almost nobody uses such a large bucket
      size at this moment. If you don't explicitly set a large bucket size for
      a zoned SSD, this change is totally transparent to your bcache device.
      
      Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket · b16671e8
      Coly Li authored
      When the large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
      was introduced into the incompat feature set. It used bucket_size_hi
      (which was added at the tail of struct cache_sb_disk) to extend the
      current 16-bit bucket size to 32 bits together with the existing
      bucket_size in struct cache_sb_disk.

      This is not a good idea; there are two obvious problems:
      - Bucket size is always a power of 2, so if log2(bucket size) is stored
        in the existing bucket_size of struct cache_sb_disk, it is unnecessary
        to add bucket_size_hi.
      - Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in
        struct cache_sb_disk; bucket_size_hi was added after d[], which makes
        csum_set() calculate an unexpected super block checksum.

      To fix the above problems, this patch introduces a new incompat feature
      bit BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE. When this bit is set,
      bucket_size in struct cache_sb_disk stores the order of the power-of-2
      bucket size value. When the user specifies a bucket size larger than
      32768 sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE is set in the
      incompat feature set, and bucket_size stores log2(bucket size) rather
      than the real bucket size value.
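
      The encoding described above can be sketched as follows. Everything
      here is a simplified assumption: struct sb_disk is a two-field stand-in
      for the on-disk super block, the feature bit value is a placeholder, and
      the helper names are hypothetical.

```c
#include <stdint.h>

#define FEATURE_LOG_LARGE_BUCKET_SIZE 0x1ULL  /* placeholder bit value */
#define MAX_PLAIN_BUCKET_SIZE         32768u  /* sectors, from the text */

struct sb_disk {
	uint16_t bucket_size;       /* plain size, or log2(size) if flagged */
	uint64_t feature_incompat;
};

static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* "sectors" is expected to be a power of 2. */
static void sb_set_bucket_size(struct sb_disk *sb, uint32_t sectors)
{
	if (sectors > MAX_PLAIN_BUCKET_SIZE) {
		sb->feature_incompat |= FEATURE_LOG_LARGE_BUCKET_SIZE;
		sb->bucket_size = (uint16_t)ilog2_u32(sectors);
	} else {
		sb->feature_incompat &= ~FEATURE_LOG_LARGE_BUCKET_SIZE;
		sb->bucket_size = (uint16_t)sectors;
	}
}

static uint32_t sb_get_bucket_size(const struct sb_disk *sb)
{
	if (sb->feature_incompat & FEATURE_LOG_LARGE_BUCKET_SIZE)
		return 1u << sb->bucket_size;
	return sb->bucket_size;
}
```

      Storing the exponent keeps the existing 16-bit field sufficient, so no
      extra member has to be appended to the structure.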
      
      The obsolete BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore;
      it is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and is still
      recognized only by the kernel driver for legacy compatibility. The
      previous bucket_size_hi is renamed to obso_bucket_size_hi in struct
      cache_sb_disk and is not used in bcache-tools anymore.

      For cache devices created with the BCH_FEATURE_INCOMPAT_LARGE_BUCKET
      feature, bcache-tools and the kernel driver still recognize the feature
      string and display it as "obso_large_bucket".

      With this change, the unnecessary extension of the bcache on-disk
      super block can be avoided, and csum_set() generates the expected
      checksum as well.
      
      Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: check unsupported feature sets for bcache register · 1dfc0686
      Coly Li authored
      This patch adds a check for features that are incompatible with the
      currently supported feature sets.

      Now if the bcache device created by bcache-tools has features that the
      current kernel doesn't support, read_super() will fail with an error
      message. E.g. if an unsupported incompatible feature is detected,
      bcache register will fail with dmesg "bcache: register_bcache() error :
      Unsupported incompatible feature found".
      
      Fixes: d721a43f ("bcache: increase super block version for cache device and backing device")
      Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix typo from SUUP to SUPP in features.h · f7b4943d
      Coly Li authored
      This patch fixes the following typos,
      from BCH_FEATURE_COMPAT_SUUP to BCH_FEATURE_COMPAT_SUPP
      from BCH_FEATURE_INCOMPAT_SUUP to BCH_FEATURE_INCOMPAT_SUPP
      from BCH_FEATURE_RO_COMPAT_SUUP to BCH_FEATURE_RO_COMPAT_SUPP
      
      Fixes: d721a43f ("bcache: increase super block version for cache device and backing device")
      Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set pdev_set_uuid before scond loop iteration · e8092707
      Yi Li authored
      
      
      There is no need to reassign pdev_set_uuid in the second loop iteration,
      so move it to the place before second loop.
      
      Signed-off-by: Yi Li <yili@winhong.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. Jan 08, 2021
    • blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED · 02f938e9
      John Garry authored
      Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
      something like:
      
      root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
      alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3
      
      Add the decoding for that flag.
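
      How a missing table entry produces the bare "3" in the output above can
      be sketched with a small name-table decoder. This is an illustration,
      not the debugfs code: decode_flags(), append() and the flag_names table
      are hypothetical, and the bit positions are inferred from the sample
      output.

```c
#include <stdio.h>
#include <string.h>

/* Index = bit number. A NULL or missing entry means "no name known". */
static const char *const flag_names[] = {
	"SHOULD_MERGE",      /* bit 0 */
	"TAG_QUEUE_SHARED",  /* bit 1 */
	NULL,                /* bit 2: left unnamed in this sketch */
	"TAG_HCTX_SHARED",   /* bit 3: the newly decoded flag */
};

/* Appends "s" at offset "used" if it still fits; returns the new offset. */
static size_t append(char *buf, size_t len, size_t used, const char *s)
{
	if (used < len)
		snprintf(buf + used, len - used, "%s", s);
	return used + strlen(s);
}

/* Prints set bits separated by '|', by name when one is known and by bit
 * number otherwise. */
static void decode_flags(unsigned long flags, const char *const *names,
			 size_t nr_names, char *buf, size_t len)
{
	size_t used = 0;
	unsigned int bit;
	char num[16];

	if (len)
		buf[0] = '\0';
	for (bit = 0; bit < 8 * sizeof(flags); bit++) {
		if (!(flags & (1UL << bit)))
			continue;
		if (used)
			used = append(buf, len, used, "|");
		if (bit < nr_names && names[bit]) {
			used = append(buf, len, used, names[bit]);
		} else {
			snprintf(num, sizeof(num), "%u", bit);
			used = append(buf, len, used, num);
		}
	}
}
```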
      
      Fixes: 32bc15af
      
       ("blk-mq: Facilitate a shared sbitmap per tagset")
      Signed-off-by: default avatarJohn Garry <john.garry@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      02f938e9
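The decoding described in 02f938e9 can be illustrated with a small user-space sketch. This is not the kernel's debugfs code: the `demo_` names, the bit positions, and both name tables are assumptions made for illustration. It shows how a set bit with no entry in the name table falls back to its bit number (the trailing `3` in the output quoted above), and how adding a table entry turns that number into a name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name tables indexed by bit position (illustration only). */
static const char *const demo_names_before[] = {
    [0] = "SHOULD_MERGE",
    [1] = "TAG_QUEUE_SHARED",
};

static const char *const demo_names_after[] = {
    [0] = "SHOULD_MERGE",
    [1] = "TAG_QUEUE_SHARED",
    [3] = "TAG_HCTX_SHARED",   /* the newly added decode */
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Write every set bit of 'flags' into 'buf', separated by '|'.
 * A bit with a table entry is written by name; any other set bit
 * is written as its decimal bit number.
 */
static void demo_decode_flags(unsigned int flags, const char *const *names,
                              size_t nnames, char *buf, size_t len)
{
    size_t used = 0;
    int first = 1;

    assert(len > 0);
    buf[0] = '\0';
    for (unsigned int bit = 0; bit < 8 * sizeof(flags); bit++) {
        int n;

        if (!(flags & (1u << bit)))
            continue;
        if (bit < nnames && names[bit])
            n = snprintf(buf + used, len - used, "%s%s",
                         first ? "" : "|", names[bit]);
        else
            n = snprintf(buf + used, len - used, "%s%u",
                         first ? "" : "|", bit);
        if (n < 0 || (size_t)n >= len - used)
            break;              /* stop if the buffer is too small */
        used += (size_t)n;
        first = 0;
    }
}
```

With bits 0, 1 and 3 set, decoding against `demo_names_before` gives `SHOULD_MERGE|TAG_QUEUE_SHARED|3`, while `demo_names_after` gives `SHOULD_MERGE|TAG_QUEUE_SHARED|TAG_HCTX_SHARED`.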
    • block/rnbd-clt: avoid module unload race with close confirmation · 3a21777c
      Jack Wang authored
      We hit a kernel panic caused by a race between module unload and the
      last close confirmation.
      
      call trace:
      [1196029.743127]  free_sess+0x15/0x50 [rtrs_client]
      [1196029.743128]  rtrs_clt_close+0x4c/0x70 [rtrs_client]
      [1196029.743129]  ? rnbd_clt_unmap_device+0x1b0/0x1b0 [rnbd_client]
      [1196029.743130]  close_rtrs+0x25/0x50 [rnbd_client]
      [1196029.743131]  rnbd_client_exit+0x93/0xb99 [rnbd_client]
      [1196029.743132]  __x64_sys_delete_module+0x190/0x260
      
      The crash dump shows that the close confirmation kworker is running at the same time:
      PID: 6943   TASK: ffff9e2ac8098000  CPU: 4   COMMAND: "kworker/4:2"
       #0 [ffffb206cf337c30] __schedule at ffffffff9f93f891
       #1 [ffffb206cf337cc8] schedule at ffffffff9f93fe98
       #2 [ffffb206cf337cd0] schedule_timeout at ffffffff9f943938
       #3 [ffffb206cf337d50] wait_for_completion at ffffffff9f9410a7
       #4 [ffffb206cf337da0] __flush_work at ffffffff9f08ce0e
       #5 [ffffb206cf337e20] rtrs_clt_close_conns at ffffffffc0d5f668 [rtrs_client]
       #6 [ffffb206cf337e48] rtrs_clt_close at ffffffffc0d5f801 [rtrs_client]
       #7 [ffffb206cf337e68] close_rtrs at ffffffffc0d26255 [rnbd_client]
       #8 [ffffb206cf337e78] free_sess at ffffffffc0d262ad [rnbd_client]
       #9 [ffffb206cf337e88] rnbd_clt_put_dev at ffffffffc0d266a7 [rnbd_client]
      
      The problem is that both code paths try to close the same session,
      which leads to the panic.

      To fix it, skip the session if its refcount has already dropped to 0.
      
      Fixes: f7a7a5c2 ("block/rnbd: client: main functionality")
      Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a21777c
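The "skip the session if its refcount is already 0" rule from 3a21777c can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the rnbd code: every `demo_` name is hypothetical, C11 atomics stand in for the kernel's refcount helpers, and the close step is reduced to a counter. The point is that the unload path only touches a session after it has managed to take a reference on it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical session object (illustration only). */
struct demo_sess {
    atomic_int refcount;
    bool open;
    int close_calls;    /* how many times the transport was really closed */
};

/* Take a reference only while the count is still nonzero. */
static bool demo_sess_get_unless_zero(struct demo_sess *s)
{
    int old = atomic_load(&s->refcount);

    while (old != 0) {
        /* On failure 'old' is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Close the transport once; later calls are no-ops. */
static void demo_close(struct demo_sess *s)
{
    if (s->open) {
        s->open = false;
        s->close_calls++;
    }
}

/* Drop a reference; the last holder closes the session. */
static void demo_sess_put(struct demo_sess *s)
{
    int old = atomic_fetch_sub(&s->refcount, 1);

    assert(old > 0);
    if (old == 1)
        demo_close(s);
}

/* Module-unload path: skip sessions whose refcount already hit zero. */
static void demo_module_exit(struct demo_sess **sessions, int n)
{
    for (int i = 0; i < n; i++) {
        if (!demo_sess_get_unless_zero(sessions[i]))
            continue;   /* another path owns the teardown */
        demo_close(sessions[i]);
        demo_sess_put(sessions[i]);
    }
}
```

In the kernel, `refcount_inc_not_zero()` provides the same take-only-if-nonzero semantics as `demo_sess_get_unless_zero()` here.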
    • block/rnbd: Adding name to the Contributors List · ef8048dd
      Swapnil Ingle authored
      
      
      Adding name to the Contributors List
      
      Signed-off-by: Swapnil Ingle <ingleswapnil@gmail.com>
      Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Acked-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
      Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ef8048dd
    • block/rnbd-clt: Fix sg table use after free · 80f99093
      Guoqing Jiang authored
      Since a dynamically allocated sglist is now used for rnbd_iu, we can't
      free the sg table right after send_usr_msg, because the callback
      function (cqe.done) could still access the sglist.

      Otherwise KASAN reports a use-after-free:
      
      [ 4856.600257] BUG: KASAN: use-after-free in dma_direct_unmap_sg+0x53/0x290
      [ 4856.600772] Read of size 4 at addr ffff888206af3a98 by task swapper/1/0
      
      [ 4856.601729] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G        W         5.10.0-pserver #5.10.0-1+feature+linux+next+20201214.1025+0910d71
      [ 4856.601748] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 3.3 02/21/2020
      [ 4856.601766] Call Trace:
      [ 4856.601785]  <IRQ>
      [ 4856.601822]  dump_stack+0x99/0xcb
      [ 4856.601856]  ? dma_direct_unmap_sg+0x53/0x290
      [ 4856.601888]  print_address_description.constprop.7+0x1e/0x230
      [ 4856.601913]  ? freeze_kernel_threads+0x73/0x73
      [ 4856.601965]  ? mark_held_locks+0x29/0xa0
      [ 4856.602019]  ? dma_direct_unmap_sg+0x53/0x290
      [ 4856.602039]  ? dma_direct_unmap_sg+0x53/0x290
      [ 4856.602079]  kasan_report.cold.9+0x37/0x7c
      [ 4856.602188]  ? mlx5_ib_post_recv+0x430/0x520 [mlx5_ib]
      [ 4856.602209]  ? dma_direct_unmap_sg+0x53/0x290
      [ 4856.602256]  dma_direct_unmap_sg+0x53/0x290
      [ 4856.602366]  complete_rdma_req+0x188/0x4b0 [rtrs_client]
      [ 4856.602451]  ? rtrs_clt_close+0x80/0x80 [rtrs_client]
      [ 4856.602535]  ? mlx5_ib_poll_cq+0x48b/0x16e0 [mlx5_ib]
      [ 4856.602589]  ? radix_tree_insert+0x3a0/0x3a0
      [ 4856.602610]  ? do_raw_spin_lock+0x119/0x1d0
      [ 4856.602647]  ? rwlock_bug.part.1+0x60/0x60
      [ 4856.602740]  rtrs_clt_rdma_done+0x3f7/0x670 [rtrs_client]
      [ 4856.602804]  ? rtrs_clt_rdma_cm_handler+0xda0/0xda0 [rtrs_client]
      [ 4856.602857]  ? check_flags.part.31+0x6c/0x1f0
      [ 4856.602927]  ? rcu_read_lock_sched_held+0xaf/0xe0
      [ 4856.602963]  ? rcu_read_lock_bh_held+0xc0/0xc0
      [ 4856.603137]  __ib_process_cq+0x10a/0x350 [ib_core]
      [ 4856.603309]  ib_poll_handler+0x41/0x1c0 [ib_core]
      [ 4856.603358]  irq_poll_softirq+0xe6/0x280
      [ 4856.603392]  ? lockdep_hardirqs_on_prepare+0x111/0x210
      [ 4856.603446]  __do_softirq+0x10d/0x646
      [ 4856.603540]  asm_call_irq_on_stack+0x12/0x20
      [ 4856.603563]  </IRQ>
      
      [ 4856.605096] Allocated by task 8914:
      [ 4856.605510]  kasan_save_stack+0x19/0x40
      [ 4856.605532]  __kasan_kmalloc.constprop.7+0xc1/0xd0
      [ 4856.605552]  __kmalloc+0x155/0x320
      [ 4856.605574]  __sg_alloc_table+0x155/0x1c0
      [ 4856.605594]  sg_alloc_table+0x1f/0x50
      [ 4856.605620]  send_msg_sess_info+0x119/0x2e0 [rnbd_client]
      [ 4856.605646]  remap_devs+0x71/0x210 [rnbd_client]
      [ 4856.605676]  init_sess+0xad8/0xe10 [rtrs_client]
      [ 4856.605706]  rtrs_clt_reconnect_work+0xd6/0x170 [rtrs_client]
      [ 4856.605728]  process_one_work+0x521/0xa90
      [ 4856.605748]  worker_thread+0x65/0x5b0
      [ 4856.605769]  kthread+0x1f2/0x210
      [ 4856.605789]  ret_from_fork+0x22/0x30
      
      [ 4856.606159] Freed by task 8914:
      [ 4856.606559]  kasan_save_stack+0x19/0x40
      [ 4856.606580]  kasan_set_track+0x1c/0x30
      [ 4856.606601]  kasan_set_free_info+0x1b/0x30
      [ 4856.606622]  __kasan_slab_free+0x108/0x150
      [ 4856.606642]  slab_free_freelist_hook+0x64/0x190
      [ 4856.606661]  kfree+0xe2/0x650
      [ 4856.606681]  __sg_free_table+0xa4/0x100
      [ 4856.606707]  send_msg_sess_info+0x1d6/0x2e0 [rnbd_client]
      [ 4856.606733]  remap_devs+0x71/0x210 [rnbd_client]
      [ 4856.606763]  init_sess+0xad8/0xe10 [rtrs_client]
      [ 4856.606792]  rtrs_clt_reconnect_work+0xd6/0x170 [rtrs_client]
      [ 4856.606813]  process_one_work+0x521/0xa90
      [ 4856.606833]  worker_thread+0x65/0x5b0
      [ 4856.606853]  kthread+0x1f2/0x210
      [ 4856.606872]  ret_from_fork+0x22/0x30
      
      The solution is to free the iu's sg table only once the iu is no longer
      in use, and to move sg_alloc_table into rnbd_get_iu accordingly.
      
      Fixes: 5a1328d0 ("block/rnbd-clt: Dynamically allocate sglist for rnbd_iu")
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      80f99093
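The ownership rule in 80f99093 (allocate the table when the iu is obtained, free it only when the iu is no longer used) can be sketched as follows. This is a single-threaded user-space illustration, not the rnbd code: the `demo_` names are hypothetical, a plain `int` array stands in for the scatterlist table, and the initial count of two references (submitter plus completion callback) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts freed tables so the ordering can be observed from outside. */
static int demo_tables_freed;

/* Hypothetical I/O unit that owns its table (illustration only). */
struct demo_iu {
    int refcount;   /* plain int: this sketch is single-threaded */
    int *sgl;       /* stands in for the scatterlist table */
    size_t nents;
};

/* Allocate the iu together with its table. */
static struct demo_iu *demo_get_iu(size_t nents)
{
    struct demo_iu *iu = malloc(sizeof(*iu));

    if (!iu)
        return NULL;
    iu->sgl = calloc(nents, sizeof(*iu->sgl));
    if (!iu->sgl) {
        free(iu);
        return NULL;
    }
    iu->nents = nents;
    iu->refcount = 2;   /* assumed: submitter + completion callback */
    return iu;
}

/* Drop a reference; the table goes away with the last one. */
static void demo_put_iu(struct demo_iu *iu)
{
    assert(iu->refcount > 0);
    if (--iu->refcount > 0)
        return;
    free(iu->sgl);
    demo_tables_freed++;
    free(iu);
}

/* Completion callback: still reads the table, then drops its reference. */
static int demo_completion(struct demo_iu *iu)
{
    int sum = 0;

    for (size_t i = 0; i < iu->nents; i++)
        sum += iu->sgl[i];
    demo_put_iu(iu);
    return sum;
}
```

Because the submitter's `demo_put_iu()` no longer frees the table by itself, a completion that runs later still sees valid memory.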
    • block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close · 1a84e7c6
      Jack Wang authored
      KASAN detects the following bug:
      [  778.215311] ==================================================================
      [  778.216696] BUG: KASAN: use-after-free in rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
      [  778.219037] Read of size 8 at addr ffff88b1d6516c28 by task tee/8842
      
      [  778.220500] CPU: 37 PID: 8842 Comm: tee Kdump: loaded Not tainted 5.10.0-pserver #5.10.0-1+feature+linux+next+20201214.1025+0910d71
      [  778.220529] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 3.3 02/21/2020
      [  778.220555] Call Trace:
      [  778.220609]  dump_stack+0x99/0xcb
      [  778.220667]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
      [  778.220715]  print_address_description.constprop.7+0x1e/0x230
      [  778.220750]  ? freeze_kernel_threads+0x73/0x73
      [  778.220896]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
      [  778.220932]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
      [  778.220994]  kasan_report.cold.9+0x37/0x7c
      [  778.221066]  ? kobject_put+0x80/0x270
      [  778.221102]  ? rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
      [  778.221184]  rnbd_srv_sess_dev_force_close+0x38/0x60 [rnbd_server]
      [  778.221240]  rnbd_srv_dev_session_force_close_store+0x6a/0xc0 [rnbd_server]
      [  778.221304]  ? sysfs_file_ops+0x90/0x90
      [  778.221353]  kernfs_fop_write+0x141/0x240
      [  778.221451]  vfs_write+0x142/0x4d0
      [  778.221553]  ksys_write+0xc0/0x160
      [  778.221602]  ? __ia32_sys_read+0x50/0x50
      [  778.221684]  ? lockdep_hardirqs_on_prepare+0x13d/0x210
      [  778.221718]  ? syscall_enter_from_user_mode+0x1c/0x50
      [  778.221821]  do_syscall_64+0x33/0x40
      [  778.221862]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  778.221896] RIP: 0033:0x7f4affdd9504
      [  778.221928] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
      [  778.221956] RSP: 002b:00007fffebb36b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  778.222011] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f4affdd9504
      [  778.222038] RDX: 0000000000000002 RSI: 00007fffebb36c50 RDI: 0000000000000003
      [  778.222066] RBP: 00007fffebb36c50 R08: 0000556a151aa600 R09: 00007f4affeb1540
      [  778.222094] R10: fffffffffffffc19 R11: 0000000000000246 R12: 0000556a151aa520
      [  778.222121] R13: 0000000000000002 R14: 00007f4affea6760 R15: 0000000000000002
      
      [  778.222764] Allocated by task 3212:
      [  778.223285]  kasan_save_stack+0x19/0x40
      [  778.223316]  __kasan_kmalloc.constprop.7+0xc1/0xd0
      [  778.223347]  kmem_cache_alloc_trace+0x186/0x350
      [  778.223382]  rnbd_srv_rdma_ev+0xf16/0x1690 [rnbd_server]
      [  778.223422]  process_io_req+0x4d1/0x670 [rtrs_server]
      [  778.223573]  __ib_process_cq+0x10a/0x350 [ib_core]
      [  778.223709]  ib_cq_poll_work+0x31/0xb0 [ib_core]
      [  778.223743]  process_one_work+0x521/0xa90
      [  778.223773]  worker_thread+0x65/0x5b0
      [  778.223802]  kthread+0x1f2/0x210
      [  778.223833]  ret_from_fork+0x22/0x30
      
      [  778.224296] Freed by task 8842:
      [  778.224800]  kasan_save_stack+0x19/0x40
      [  778.224829]  kasan_set_track+0x1c/0x30
      [  778.224860]  kasan_set_free_info+0x1b/0x30
      [  778.224889]  __kasan_slab_free+0x108/0x150
      [  778.224919]  slab_free_freelist_hook+0x64/0x190
      [  778.224947]  kfree+0xe2/0x650
      [  778.224982]  rnbd_destroy_sess_dev+0x2fa/0x3b0 [rnbd_server]
      [  778.225011]  kobject_put+0xda/0x270
      [  778.225046]  rnbd_srv_sess_dev_force_close+0x30/0x60 [rnbd_server]
      [  778.225081]  rnbd_srv_dev_session_force_close_store+0x6a/0xc0 [rnbd_server]
      [  778.225111]  kernfs_fop_write+0x141/0x240
      [  778.225140]  vfs_write+0x142/0x4d0
      [  778.225169]  ksys_write+0xc0/0x160
      [  778.225198]  do_syscall_64+0x33/0x40
      [  778.225227]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [  778.226506] The buggy address belongs to the object at ffff88b1d6516c00
                      which belongs to the cache kmalloc-512 of size 512
      [  778.227464] The buggy address is located 40 bytes inside of
                      512-byte region [ffff88b1d6516c00, ffff88b1d6516e00)
      
      The problem is that the sess_dev release function calls
      rnbd_destroy_sess_dev, which may already have freed the sess_dev, yet
      rnbd_srv_sess_dev_force_close still sets keep_id afterwards, which
      leads to a use after free.

      To fix it, set keep_id before the sysfs removal, and cache the
      rnbd_srv_session pointer for access to the lock.
      
      Fixes: 78699805 ("block/rnbd-srv: close a mapped device from server side.")
      Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1a84e7c6
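The ordering change in 1a84e7c6 can be shown with a short user-space sketch. It is an illustration under stated assumptions, not the rnbd server code: the `demo_` names are hypothetical, a boolean stands in for the session mutex, and the release step frees the device immediately. Everything needed from the device is read or written before the call that may free it, and the lock is reached through a cached session pointer afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical session; 'locked' stands in for a mutex. */
struct demo_session {
    bool locked;
    int unlocks;
};

/* Hypothetical per-session device (illustration only). */
struct demo_sess_dev {
    struct demo_session *sess;
    bool keep_id;
};

/* Records what the release path observed, so the ordering can be checked. */
static bool demo_keep_id_seen_at_release;

/* Release path: reads keep_id, then frees the device. */
static void demo_release_sess_dev(struct demo_sess_dev *dev)
{
    demo_keep_id_seen_at_release = dev->keep_id;
    free(dev);
}

static void demo_force_close(struct demo_sess_dev *dev)
{
    /* Cache the session while 'dev' is still guaranteed to be valid. */
    struct demo_session *sess = dev->sess;

    dev->keep_id = true;            /* set before anything can free dev */

    sess->locked = true;
    demo_release_sess_dev(dev);     /* 'dev' must not be touched after this */
    assert(sess->locked);
    sess->locked = false;           /* unlock via the cached pointer */
    sess->unlocks++;
}
```

The buggy ordering would correspond to writing `dev->keep_id` or reading `dev->sess` after `demo_release_sess_dev()` has returned.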