  Oct 19, 2021
    • block: attempt direct issue of plug list · dc5fc361
      Jens Axboe authored
      
      
      If we have just one queue type in the plug list, then we can extend our
      direct issue to cover the full plug list as well. This allows sending a
      batch of requests for direct issue, which is more efficient than issuing
      them one at a time. A rough sketch of the resulting issue loop follows
      this entry.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dc5fc361
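      A minimal, self-contained sketch of the pattern; the struct and function
      names below are hypothetical stand-ins, not the real block-layer API:

      /*
       * Hypothetical stand-ins (not the kernel's types): once we know every
       * request in the plug targets the same queue type, we can walk the
       * singly linked plug list and hand each request straight to the driver,
       * instead of inserting and dispatching them one at a time.
       */
      struct rq_sketch {
              struct rq_sketch *next;
              /* ... request payload ... */
      };

      struct plug_sketch {
              struct rq_sketch *head;
              unsigned short count;
      };

      static void plug_issue_direct_sketch(struct plug_sketch *plug,
                                           void (*issue_one)(struct rq_sketch *))
      {
              struct rq_sketch *rq = plug->head;

              plug->head = NULL;
              plug->count = 0;

              while (rq) {
                      struct rq_sketch *next = rq->next;

                      issue_one(rq);          /* direct issue, no scheduler insert */
                      rq = next;
              }
      }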
    • block: change plugging to use a singly linked list · bc490f81
      Jens Axboe authored
      
      
      Use a singly linked list for the blk_plug. This saves 8 bytes in the
      blk_plug struct, and makes for faster list manipulations than doubly
      linked lists. As we don't use the doubly linked lists for anything,
      singly linked is just fine.
      
      This yields a bump in default (merging enabled) performance from 7.0
      to 7.1M IOPS, and ~7.5M IOPS with merging disabled.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bc490f81
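      To illustrate where the 8 bytes come from, a sketch (not the kernel's
      actual struct blk_plug / struct request definitions): a doubly linked
      list head carries two pointers, the singly linked variant only one, and
      adding a request is a single pointer update.

      struct list_head_sketch {
              struct list_head_sketch *next, *prev;   /* 16 bytes on 64-bit */
      };

      struct rq_sketch {
              struct rq_sketch *next;                 /* singly linked: one pointer */
      };

      struct plug_old_sketch {
              struct list_head_sketch mq_list;        /* doubly linked head: 16 bytes */
      };

      struct plug_new_sketch {
              struct rq_sketch *mq_list;              /* singly linked head: 8 bytes */
      };

      static void plug_add_rq_sketch(struct plug_new_sketch *plug,
                                     struct rq_sketch *rq)
      {
              rq->next = plug->mq_list;               /* push at head, no prev to fix up */
              plug->mq_list = rq;
      }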
    • blk-wbt: prevent NULL pointer dereference in wb_timer_fn · 480d42dc
      Andrea Righi authored
      The timer callback used to evaluate if the latency is exceeded can be
      executed after the corresponding disk has been released, causing the
      following NULL pointer dereference:
      
      [ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
      [ 119.987617] #PF: supervisor read access in kernel mode
      [ 119.987971] #PF: error_code(0x0000) - not-present page
      [ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
      [ 119.988697] Oops: 0000 [#1] SMP NOPTI
      [ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
      [ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
      [ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
      [ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 <4c> 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
      [ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
      [ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
      [ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
      [ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
      [ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
      [ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
      [ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
      [ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
      [ 119.995906] Call Trace:
      [ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
      [ 119.996505] blk_stat_timer_fn+0x138/0x140
      [ 119.996830] call_timer_fn+0x2b/0x100
      [ 119.997136] __run_timers.part.0+0x1d1/0x240
      [ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
      [ 119.997826] ? ktime_get+0x3e/0xa0
      [ 119.998110] ? native_apic_msr_write+0x2c/0x30
      [ 119.998456] ? lapic_next_event+0x20/0x30
      [ 119.998779] ? clockevents_program_event+0x94/0xf0
      [ 119.999150] run_timer_softirq+0x2a/0x50
      [ 119.999465] __do_softirq+0xcb/0x26f
      [ 119.999764] irq_exit_rcu+0x8c/0xb0
      [ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
      [ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
      [ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20
      
      In this case simply return from the timer callback (no action
      required) to prevent the NULL pointer dereference.
      
      BugLink: https://bugs.launchpad.net/bugs/1947557
      Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
      Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting")
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      480d42dc
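      A minimal sketch of the guard, assuming field names close to (but not
      necessarily identical to) the ones in block/blk-wbt.c; the point is the
      early return before any disk-derived state is dereferenced:

      static void wb_timer_fn_sketch(struct blk_stat_callback *cb)
      {
              struct rq_wb *rwb = cb->data;

              if (!rwb->rqos.q->disk)         /* disk already released: nothing to do */
                      return;

              /* ... evaluate latency and scale the writeback window as before ... */
      }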
    • block: align blkdev_dio inlined bio to a cacheline · 6155631a
      Jens Axboe authored
      
      
      We get all sorts of unreliable and funky results, since the bio is
      designed to align on a cacheline, which it does not when inlined like
      this. A sketch of the explicitly aligned layout follows this entry.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6155631a
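      A sketch of the layout fix, assuming a structure close to the blkdev_dio
      in block/fops.c (the exact field set may differ): the embedded bio is
      forced onto a cacheline boundary with the kernel's
      ____cacheline_aligned_in_smp annotation, since struct bio assumes it
      starts on one.

      struct blkdev_dio_sketch {
              union {
                      struct kiocb            *iocb;
                      struct task_struct      *waiter;
              };
              size_t                  size;
              atomic_t                ref;
              unsigned int            flags;

              /*
               * Without the annotation, the inlined bio lands wherever the
               * preceding fields happen to end, splitting hot bio fields
               * across cachelines.
               */
              struct bio              bio ____cacheline_aligned_in_smp;
      };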
    • block: move blk_mq_tag_to_rq() inline · e028f167
      Jens Axboe authored
      
      
      This is in the fast path of driver issue or completion, and it's a single
      array index operation. Move it inline to avoid a function call for it.
      
      This does mean making struct blk_mq_tags block layer public, but there's
      not really much in there.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e028f167
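      The moved helper is essentially a bounds check plus one array index; a
      sketch of what the inline version looks like (treat the details as an
      approximation of the code in include/linux/blk-mq.h):

      static inline struct request *blk_mq_tag_to_rq_sketch(struct blk_mq_tags *tags,
                                                            unsigned int tag)
      {
              if (tag < tags->nr_tags)
                      return tags->rqs[tag];          /* single array index */
              return NULL;
      }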
    • block: get rid of plug list sorting · df87eb0f
      Jens Axboe authored
      
      
      Even if we have multiple queues in the plug list, the chances that they
      are heavily interspersed are minimal. Don't bother spending CPU cycles
      sorting the list.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      df87eb0f
    • block: return whether or not to unplug through boolean · 87c037d1
      Jens Axboe authored
      
      
      Instead of returning the same queue request through a request pointer,
      use a boolean to accomplish the same.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      87c037d1
    • block: don't call blk_status_to_errno in blk_update_request · 8a7d267b
      Christoph Hellwig authored
      
      
      We only need to call it to resolve the blk_status_t -> errno mapping for
      tracing, so move the conversion into the tracepoints that are not called
      at all when tracing isn't enabled.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a7d267b
    • block: move bdev_read_only() into the header · db9a02ba
      Jens Axboe authored
      
      
      This is called for every write in the fast path; move it inline, next to
      get_disk_ro(), which it calls internally. A sketch of the inlined helper
      follows this entry.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      db9a02ba
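      A sketch of the inlined helper, assuming it keeps the same logic as
      before (a per-bdev read-only flag OR'd with the disk-wide policy from
      get_disk_ro()):

      static inline bool bdev_read_only_sketch(struct block_device *bdev)
      {
              /* per-partition flag, or the whole disk marked read-only */
              return bdev->bd_read_only || get_disk_ro(bdev->bd_disk);
      }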
    • block: fix too broad elevator check in blk_mq_free_request() · e0d78afe
      Jens Axboe authored
      We added RQF_ELV to tell whether there's an IO scheduler attached, and
      RQF_ELVPRIV tells us whether there's an IO scheduler with private data
      attached. Don't check RQF_ELV in blk_mq_free_request(); what we care
      about here is just whether we have scheduler private data attached.
      
      This fixes a boot crash. A before/after sketch of the check follows this
      entry.
      
      Fixes: 2ff0682d ("block: store elevator state in request")
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Reported-by: <syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e0d78afe
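      A minimal before/after sketch of the check (surrounding teardown code
      omitted); only the flag being tested changes:

      /* before: too broad -- true whenever any IO scheduler is attached */
      if (rq->rq_flags & RQF_ELV) {
              /* tears down scheduler-private data that may not exist -> crash */
      }

      /* after: only when the scheduler actually attached private data */
      if (rq->rq_flags & RQF_ELVPRIV) {
              /* tear down scheduler-private data */
      }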
    • nvme: wire up completion batching for the IRQ path · 4f502245
      Jens Axboe authored
      
      
      Trivial to do now, just need our own io_comp_batch on the stack and pass
      that in to the usual command completion handling.
      
      I pondered making this dependent on how many entries we had to process,
      but even for a single entry there's no discernable difference in
      performance or latency. Running a sync workload over io_uring:
      
      t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
      
      yields the below performance before the patch:
      
      IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
      IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
      IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
      
      and the following after:
      
      IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
      IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
      IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
      
      which definitely isn't slower, about the same if you factor in a bit of
      variance. For peak performance workloads, benchmarking shows a 2%
      improvement.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4f502245
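      A sketch of the wiring, assuming helper names from this series
      (DEFINE_IO_COMP_BATCH, rq_list_empty and an nvme_pci_complete_batch-style
      handler); error handling and the rest of the interrupt handler are
      omitted:

      static irqreturn_t nvme_irq_sketch(int irq, void *data)
      {
              struct nvme_queue *nvmeq = data;
              DEFINE_IO_COMP_BATCH(iob);              /* empty batch on the stack */

              if (nvme_poll_cq(nvmeq, &iob)) {        /* CQE processing fills the batch */
                      if (!rq_list_empty(iob.req_list))
                              nvme_pci_complete_batch(&iob);  /* complete them in one go */
                      return IRQ_HANDLED;
              }
              return IRQ_NONE;
      }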
    • io_uring: utilize the io batching infrastructure for more efficient polled IO · b688f11e
      Jens Axboe authored
      
      
      Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
      supports it, we can handle high rates of polled IO more efficiently.
      
      This raises the single core efficiency on my system from ~6.1M IOPS to
      ~6.6M IOPS running a random read workload at depth 128 on two gen2
      Optane drives.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b688f11e
    • nvme: add support for batched completion of polled IO · c234a653
      Jens Axboe authored
      
      
      Take advantage of struct io_comp_batch, if passed in to the nvme poll
      handler. If it's set, rather than complete each request individually
      inline, store them in the io_comp_batch list. We only do so for requests
      that will complete successfully; anything else will be completed inline
      as before.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c234a653
    • block: add support for blk_mq_end_request_batch() · f794f335
      Jens Axboe authored
      
      
      Instead of calling blk_mq_end_request() on a single request, add a helper
      that takes the new struct io_comp_batch and completes any requests stored
      in it. A usage sketch from the driver side follows this entry.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f794f335
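      A usage sketch from the driver's point of view, under the assumption that
      the driver was handed an io_comp_batch (iob may be NULL on paths that do
      not batch); timestamp and error handling are left out:

      /* per completed request: stash it in the batch if we have one */
      if (iob)
              rq_list_add(&iob->req_list, rq);        /* deferred completion */
      else
              blk_mq_end_request(rq, BLK_STS_OK);     /* old one-at-a-time path */

      /* ...once the completion queue has been drained... */
      if (iob)
              blk_mq_end_request_batch(iob);          /* ends every stored request */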
    • sbitmap: add helper to clear a batch of tags · 1aec5e4a
      Jens Axboe authored
      
      
      sbitmap currently only supports clearing tags one by one; add a helper
      that allows the caller to pass in an array of tags to clear. The rough
      shape of the helper follows this entry.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1aec5e4a
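      The new helper has roughly this shape (see lib/sbitmap.c; treat the exact
      signature as an assumption), with a hypothetical caller freeing a handful
      of tags at once:

      /* clear nr_tags tags (given as an array) in one pass, with one wakeup */
      void sbitmap_queue_clear_batch(struct sbitmap_queue *sbq, int offset,
                                     int *tags, int nr_tags);

      /* hypothetical caller: tags collected from a completion batch */
      int tags[3] = { 3, 7, 12 };

      sbitmap_queue_clear_batch(sbq, 0, tags, 3);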
    • block: add a struct io_comp_batch argument to fops->iopoll() · 5a72e899
      Jens Axboe authored
      
      
      struct io_comp_batch contains a list head and a completion handler, which
      will allow batches of IO to be completed more efficiently.
      
      For now there are no functional changes in this patch; we just define the
      io_comp_batch structure and add the argument to the file_operations iopoll
      handler. A sketch of the structure and the new argument follows this entry.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5a72e899
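      A sketch of the structure and the widened hook, close to what this series
      adds to include/linux/blkdev.h and include/linux/fs.h (treat the exact
      fields as approximate):

      struct io_comp_batch {
              struct request  *req_list;      /* singly linked list of completed requests */
              bool            need_ts;        /* any request in the batch wants a timestamp */
              void            (*complete)(struct io_comp_batch *);
      };

      /* file_operations gains the extra argument; NULL means "no batching" */
      int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *iob,
                    unsigned int flags);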