  1. Oct 09, 2018
    • lightnvm: move bad block and chunk state logic to core · aff3fb18
      Matias Bjørling authored
      
      
      pblk implements two data paths for recovering line state: one for 1.2
      and another for 2.0. Instead of having pblk implement these, combine
      them in the core to reduce complexity and make them available to other
      targets.
      
      The new interface adheres to the 2.0 chunk definition, including
      managing open chunks with an active write pointer. To provide this
      interface, a 1.2 device recovers the state of its chunks by manually
      detecting whether a chunk is free, open, closed, or offline, and, if
      open, by scanning the flash pages sequentially to find the next
      writable page. This process takes on average ~10 seconds on a device
      with 64 dies, 1024 blocks, and 60us read access time. The process could
      be parallelized, but that is left out for maintenance simplicity, as
      the 1.2 specification is deprecated. For 2.0 devices, the logic is
      maintained internally in the drive and retrieved through the 2.0
      interface.
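      
      As a rough illustration of the sequential scan described above, the
      sketch below walks an open chunk page by page until it hits the first
      unwritten page. It is only a hedged, illustrative sketch: read_page(),
      page_state, and pages_per_chunk are hypothetical stand-ins, not the
      actual lightnvm core or pblk helpers.
      
        /* Hypothetical result of reading one flash page of a 1.2 chunk. */
        enum page_state { PAGE_WRITTEN, PAGE_EMPTY, PAGE_BAD };
        
        /* Hypothetical device accessor; ~60us per read on the device above. */
        extern enum page_state read_page(int chunk, int page);
        
        /*
         * Recover the write pointer of an open chunk by scanning its pages
         * in order; the first empty page is the next writable page.
         */
        static int recover_write_pointer(int chunk, int pages_per_chunk)
        {
                int page;
        
                for (page = 0; page < pages_per_chunk; page++)
                        if (read_page(chunk, page) == PAGE_EMPTY)
                                break;
        
                return page;    /* equals pages_per_chunk if the chunk is full */
        }
      
      Repeating this page-by-page scan for every open chunk is what accounts
      for the ~10 second recovery time quoted above.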
      
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • lightnvm: pblk: fix race condition on metadata I/O · d8adaa3b
      Javier González authored
      In pblk, when a new line is allocated, metadata for the previously
      written line is scheduled. This is done through a fixed memory region
      that is shared over time across different lines and contexts, and is
      therefore protected by a lock. Unfortunately, this lock does not cover
      all the metadata used for sharing this memory region, resulting in a
      race condition.
      
      This patch fixes this race condition by protecting this metadata
      properly.
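      
      As a rough sketch of the pattern the fix enforces, the snippet below
      keeps every field that describes the shared region under one lock. The
      struct and field names are illustrative, not pblk's actual identifiers;
      only the spinlock calls are the standard kernel API.
      
        #include <linux/spinlock.h>
        
        /* Illustrative shared metadata region reused across lines. */
        struct shared_meta {
                spinlock_t lock;        /* protects every field below */
                void *buf;              /* fixed region shared across lines */
                int owner_line;         /* line the buffer currently describes */
                int busy;               /* buffer holds data pending submission */
        };
        
        /* Schedule metadata for the previously written line. */
        static void schedule_prev_line_meta(struct shared_meta *m,
                                            int line, void *data)
        {
                spin_lock(&m->lock);
                /*
                 * All bookkeeping for the shared region is updated while
                 * holding the lock; updating owner_line or busy outside of
                 * it is the kind of race being closed here.
                 */
                m->buf = data;
                m->owner_line = line;
                m->busy = 1;
                spin_unlock(&m->lock);
        }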
      
      Fixes: dd2a4343 ("lightnvm: pblk: sched. metadata on write thread")
      Signed-off-by: Javier González <javier@cnexlabs.com>
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • lightnvm: move device L2P detection to core · 656e33ca
      Matias Bjørling authored
      
      
      A 1.2 device can manage the logical-to-physical mapping table
      internally or leave it to the host.
      
      A target only supports one of these approaches, and therefore must
      check at initialization. Move this check to core to avoid having each
      target implement the check.
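      
      A minimal sketch of the kind of initialization-time check being
      centralized follows; the capability flag and geometry fields are
      hypothetical, chosen only to mirror the commit text, not the actual
      lightnvm identifiers.
      
        #include <linux/errno.h>
        
        /* Hypothetical capability bit: device keeps the L2P table itself. */
        #define DEV_CAP_DEVICE_L2P      (1 << 0)
        
        struct dev_geometry {
                unsigned int caps;      /* hypothetical capability bitmap */
        };
        
        /*
         * Host-managed targets (such as pblk) cannot drive a device that
         * manages the L2P table internally, so reject such devices once,
         * in core, instead of in every target's init path.
         */
        static int check_host_managed_l2p(const struct dev_geometry *geo)
        {
                if (geo->caps & DEV_CAP_DEVICE_L2P)
                        return -EINVAL;
                return 0;
        }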
      
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • lightnvm: pblk: fix rqd.error return value in pblk_blk_erase_sync · 4b5d56ed
      Matias Bjørling authored
      
      
      rqd.error is masked by the return value of pblk_submit_io_sync.
      The rqd structure is then passed on to the end_io function, which
      assumes that any error should lead to a chunk being marked
      offline/bad. Since pblk_submit_io_sync can fail before the
      command is issued to the device, the error value may not correspond
      to a media failure, leading to chunks being prematurely retired.
      
      Also, the pblk_blk_erase_sync function prints an error message when
      the erase fails. Since the caller already prints an error message
      itself, remove the error message from this function.
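      
      A hedged sketch of the distinction being made: only report rqd.error
      once the command has actually reached the device. The request type and
      helper names below are illustrative stand-ins for struct nvm_rq and
      pblk_submit_io_sync from the commit text.
      
        /* Illustrative request; the error field is media status, valid
         * only after the command has completed on the device. */
        struct req {
                int error;
        };
        
        /* Hypothetical stand-in for pblk_submit_io_sync(). */
        extern int submit_io_sync(struct req *rqd);
        
        static int erase_sync(struct req *rqd)
        {
                int ret;
        
                ret = submit_io_sync(rqd);
                if (ret)
                        /*
                         * Submission failed before the command reached the
                         * device: return the submit error and do not let a
                         * stale rqd->error be mistaken for a media failure
                         * that would retire the chunk.
                         */
                        return ret;
        
                /* Only a completed command's error reflects the media state. */
                return rqd->error;
        }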
      
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Reviewed-by: Javier González <javier@cnexlabs.com>
      Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • lightnvm: combine 1.2 and 2.0 command flags · d7b68016
      Matias Bjørling authored
      
      
      Add an nvm_set_flags helper to enable the core to appropriately
      set the command flags for read/write/erase depending on which version
      a drive supports.
      
      The flags argument can be distilled into the access hint,
      scrambling, and program/erase suspend. Replace the access hint with
      an "is_seq" parameter. The rest of the flags depend on the
      command opcode, which is trivial to detect and set.
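      
      A minimal sketch of what such a helper might look like under those
      assumptions; the opcode and flag names (OP_READ, FLAG_SEQ_HINT,
      FLAG_SCRAMBLE, FLAG_SUSPEND) are illustrative, not the actual lightnvm
      constants.
      
        /* Illustrative opcodes and 1.2-style flag bits. */
        enum op { OP_READ, OP_WRITE, OP_ERASE };
        
        #define FLAG_SEQ_HINT   (1 << 0)        /* sequential access hint */
        #define FLAG_SCRAMBLE   (1 << 1)        /* data scrambling */
        #define FLAG_SUSPEND    (1 << 2)        /* program/erase suspend */
        
        /* Derive command flags from the opcode plus an "is_seq" hint. */
        static int set_flags(enum op opcode, int is_seq)
        {
                int flags = 0;
        
                if (is_seq)
                        flags |= FLAG_SEQ_HINT;
        
                switch (opcode) {
                case OP_READ:
                        /* e.g. let reads suspend an ongoing program/erase */
                        flags |= FLAG_SCRAMBLE | FLAG_SUSPEND;
                        break;
                case OP_WRITE:
                        flags |= FLAG_SCRAMBLE;
                        break;
                case OP_ERASE:
                        break;
                }
                return flags;
        }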
      
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Reviewed-by: Javier González <javier@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • lightnvm: remove dependencies on BLK_DEV_NVME and PCI · 73569e11
      Matias Bjørling authored
      
      
      There is no need to force the NVMe device driver to be compiled in if
      the lightnvm subsystem is selected. There is also no need for PCI to
      be selected, as it would be selected by the device driver that hooks
      into the subsystem.
      
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Reviewed-by: Javier González <javier@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: complete req in softirq context in case of single queue · 36e76539
      Ming Lei authored
      
      
      Many controllers have only one irq vector for completing IO
      requests, and usually the affinity of that irq vector spans all
      possible CPUs; however, on most architectures there may be only one
      specific CPU that handles the interrupt.
      
      So if all IOs are completed in hardirq context, IO performance
      inevitably degrades because of the increased irq latency.
      
      This patch addresses the issue by allowing requests to be completed
      in softirq context, as in the legacy IO path.
      
      IOPS improves by ~13% in the following randread test on raid0 over
      virtio-scsi.
      
      mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
      
      fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k
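      
      A rough sketch of the completion-path decision this enables; the queue
      structure and both helpers are hypothetical stand-ins used only to
      illustrate deferring completion work out of hardirq context, not the
      actual blk-mq internals.
      
        struct hw_queue_set {
                int nr_hw_queues;       /* illustrative */
        };
        
        extern void complete_in_hardirq(void *rq);      /* hypothetical */
        extern void defer_to_block_softirq(void *rq);   /* hypothetical */
        
        static void complete_request(struct hw_queue_set *set, void *rq)
        {
                if (set->nr_hw_queues == 1)
                        /*
                         * Single irq vector landing on one CPU: leave hardirq
                         * quickly and finish the request in softirq context,
                         * as the legacy IO path did.
                         */
                        defer_to_block_softirq(rq);
                else
                        complete_in_hardirq(rq);
        }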
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Zach Marano <zmarano@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Oct 08, 2018
  3. Oct 05, 2018
    • blk-mq-debugfs: Also show requests that have not yet been started · 6d8623a7
      Bart Van Assche authored
      
      
      When debugging e.g. the SCSI timeout handler, it is important that
      requests that have not yet been started or that have already
      completed are also reported through debugfs.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'nvme-4.20' of git://git.infradead.org/nvme into for-4.20/block · 4f5735f3
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "A relatively boring merge window:
      
       - better AEN tracing (Chaitanya)
       - NUMA aware PCIe multipathing (me)
       - RDMA workqueue fixes (Sagi)
       - better bio usage in the target (Sagi)
       - FC rework for target removal (James)
       - better multipath handling of ->queue_rq failures (James)
       - various cleanups (Milan)"
      
      * 'nvme-4.20' of git://git.infradead.org/nvme:
        nvmet-rdma: use a private workqueue for delete
        nvme: take node locality into account when selecting a path
        nvmet: don't split large I/Os unconditionally
        nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O
        nvme-core: add async event trace helper
        nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport device
        nvmet_fc: support target port removal with nvmet layer
        nvme-fc: fix for a minor typos
        nvmet: remove redundant module prefix
        nvme: fix typo in nvme_identify_ns_descs
    • nvmet-rdma: use a private workqueue for delete · 2acf70ad
      Sagi Grimberg authored
      Queue deletion is done asynchronously when the last reference on the
      queue is dropped. Thus, in order to make sure we don't over-allocate
      under a connect/disconnect storm, we let queue deletion complete
      before making forward progress.
      
      However, given that we flush the system_wq from rdma_cm context, which
      itself runs from a workqueue context, we can get a circular locking
      complaint [1]. Fix that by using a private workqueue for queue
      deletion.
      
      [1]:
      ======================================================
      WARNING: possible circular locking dependency detected
      4.19.0-rc4-dbg+ #3 Not tainted
      ------------------------------------------------------
      kworker/5:0/39 is trying to acquire lock:
      00000000a10b6db9 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x6f/0x440 [rdma_cm]
      
      but task is already holding lock:
      00000000331b4e2c ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3ed/0xa20
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #3 ((work_completion)(&queue->release_work)){+.+.}:
             process_one_work+0x474/0xa20
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
      
      -> #2 ((wq_completion)"events"){+.+.}:
             flush_workqueue+0xf3/0x970
             nvmet_rdma_cm_handler+0x133d/0x1734 [nvmet_rdma]
             cma_ib_req_handler+0x72f/0xf90 [rdma_cm]
             cm_process_work+0x2e/0x110 [ib_cm]
             cm_req_handler+0x135b/0x1c30 [ib_cm]
             cm_work_handler+0x2b7/0x38cd [ib_cm]
             process_one_work+0x4ae/0xa20
      nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 0000000040357082
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
      nvme nvme0: Reconnecting in 10 seconds...
      
      -> #1 (&id_priv->handler_mutex/1){+.+.}:
             __mutex_lock+0xfe/0xbe0
             mutex_lock_nested+0x1b/0x20
             cma_ib_req_handler+0x6aa/0xf90 [rdma_cm]
             cm_process_work+0x2e/0x110 [ib_cm]
             cm_req_handler+0x135b/0x1c30 [ib_cm]
             cm_work_handler+0x2b7/0x38cd [ib_cm]
             process_one_work+0x4ae/0xa20
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
      
      -> #0 (&id_priv->handler_mutex){+.+.}:
             lock_acquire+0xc5/0x200
             __mutex_lock+0xfe/0xbe0
             mutex_lock_nested+0x1b/0x20
             rdma_destroy_id+0x6f/0x440 [rdma_cm]
             nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma]
             process_one_work+0x4ae/0xa20
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
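      
      A hedged sketch of the private-workqueue approach follows. The queue
      structure and release function are illustrative; only alloc_workqueue,
      INIT_WORK, queue_work, and flush_workqueue are the standard kernel
      workqueue API.
      
        #include <linux/workqueue.h>
        
        /* Private workqueue so queue teardown never flushes system_wq. */
        static struct workqueue_struct *delete_wq;
        
        struct rdma_queue {                     /* illustrative queue object */
                struct work_struct release_work;
        };
        
        static void release_queue_work(struct work_struct *w)
        {
                /* tear the queue down here */
        }
        
        static int init_delete_wq(void)
        {
                delete_wq = alloc_workqueue("illustrative-delete-wq",
                                            WQ_UNBOUND, 0);
                return delete_wq ? 0 : -ENOMEM;
        }
        
        static void schedule_queue_delete(struct rdma_queue *q)
        {
                INIT_WORK(&q->release_work, release_queue_work);
                queue_work(delete_wq, &q->release_work);
        }
        
        /*
         * On teardown, wait only on the private workqueue rather than on
         * system_wq, avoiding the circular dependency shown in the splat.
         */
        static void drain_queue_deletes(void)
        {
                flush_workqueue(delete_wq);
        }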
      
      Fixes: 777dc823 ("nvmet-rdma: occasionally flush ongoing controller teardown")
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Bart Van Assche <bvanassche@acm.org>
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  4. Oct 04, 2018
  5. Oct 02, 2018
  6. Oct 01, 2018
    • Merge tag 'v4.19-rc6' into for-4.20/block · c0aac682
      Jens Axboe authored
      
      
      Merge -rc6 in, for two reasons:
      
      1) Resolve a trivial conflict in the blk-mq-tag.c documentation
      2) A few important regression fixes went into upstream directly, so
         they aren't in the 4.20 branch.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      
      * tag 'v4.19-rc6': (780 commits)
        Linux 4.19-rc6
        MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
        cpufreq: qcom-kryo: Fix section annotations
        perf/core: Add sanity check to deal with pinned event failure
        xen/blkfront: correct purging of persistent grants
        Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
        selftests/powerpc: Fix Makefiles for headers_install change
        blk-mq: I/O and timer unplugs are inverted in blktrace
        dax: Fix deadlock in dax_lock_mapping_entry()
        x86/boot: Fix kexec booting failure in the SEV bit detection code
        bcache: add separate workqueue for journal_write to avoid deadlock
        drm/amd/display: Fix Edid emulation for linux
        drm/amd/display: Fix Vega10 lightup on S3 resume
        drm/amdgpu: Fix vce work queue was not cancelled when suspend
        Revert "drm/panel: Add device_link from panel device to DRM device"
        xen/blkfront: When purging persistent grants, keep them in the buffer
        clocksource/drivers/timer-atmel-pit: Properly handle error cases
        block: fix deadline elevator drain for zoned block devices
        ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
        drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
        ...
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. Sep 30, 2018