  1. Sep 30, 2022
    • block: allow end_io based requests in the completion batch handling · ab3e1d3b
      Jens Axboe authored
      
      
      With end_io handlers now able to pass ownership of the request back
      upon completion, we can allow requests with end_io handlers in the
      batch completion handling.
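
      In essence (a paraphrased sketch, not the literal
      blk_mq_end_request_batch() diff, and complete_one is an illustrative
      name), the batch path can now run the handler and only free the
      request when the handler hands ownership back:

          /* paraphrased sketch of the new batched-completion rule */
          static void complete_one(struct request *rq)
          {
                  if (rq->end_io &&
                      rq->end_io(rq, BLK_STS_OK) == RQ_END_IO_NONE)
                          return;         /* handler kept ownership */
                  blk_mq_free_request(rq);        /* sketch: batch frees it */
          }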
      
      Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Co-developed-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: change request end_io handler to pass back a return value · de671d61
      Jens Axboe authored
      
      
      In preparation for allowing the end_io handler to pass ownership back
      to the block layer, rather than retain ownership of the request, change
      the handler to return a value. Everything is simply converted to return
      RQ_END_IO_NONE, so there are no functional changes in this patch.
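
      Concretely, the handler type now returns an enum (per this patch's
      change to <linux/blk-mq.h>); a converted handler keeps its old body
      and simply returns RQ_END_IO_NONE (my_end_io below is a hypothetical
      name):

          enum rq_end_io_ret {
                  RQ_END_IO_NONE,         /* handler keeps request ownership */
                  RQ_END_IO_FREE,         /* block layer should free it */
          };

          typedef enum rq_end_io_ret (rq_end_io_fn)(struct request *,
                                                    blk_status_t);

          /* converted handler: behaviour unchanged by this patch */
          static enum rq_end_io_ret my_end_io(struct request *rq,
                                              blk_status_t error)
          {
                  /* driver-private completion work, as before */
                  return RQ_END_IO_NONE;
          }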
      
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: enable batched allocation for blk_mq_alloc_request() · 4b6a5d9c
      Jens Axboe authored
      
      
      The filesystem IO path can take advantage of allocating batches of
      requests if the underlying submitter tells the block layer about it
      through the blk_plug. For passthrough IO, the exported API is the
      blk_mq_alloc_request() helper, which does not support request caching.

      Wire up request caching for blk_mq_alloc_request(), which is generally
      called without a bio available upfront.
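
      A sketch of how a passthrough submitter might use this (q and nr are
      assumed to be the target request_queue and a batch count; error
      handling trimmed). The cache is keyed off the plug on the current
      task, so plugging the allocation loop is all that is needed:

          struct blk_plug plug;
          struct request *rq;
          int i;

          blk_start_plug(&plug);
          for (i = 0; i < nr; i++) {
                  /* with caching wired up, this can be satisfied from the
                   * plug's cached requests instead of the tag machinery */
                  rq = blk_mq_alloc_request(q, REQ_OP_DRV_IN, 0);
                  if (IS_ERR(rq))
                          break;
                  /* set up and submit the passthrough request ... */
          }
          blk_finish_plug(&plug);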
      
      Tested-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: kill deprecated BUG_ON() in the flush handling · e73a625b
      Jens Axboe authored
      
      
      We've never had any useful reports from this BUG_ON(), and in fact a
      number of the BUG_ON()s in the flush handling should be turned into
      more graceful error handling.

      In preparation for allowing batched completions in the end_io handling,
      where we can enter the flush completion with the queuelist having been
      reused for the batch, get rid of this BUG_ON().
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'for-6.1/io_uring' into for-6.1/passthrough · 5853a7b5
      Jens Axboe authored
      * for-6.1/io_uring: (56 commits)
        io_uring/net: fix notif cqe reordering
        io_uring/net: don't update msg_name if not provided
        io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
        io_uring/rw: defer fsnotify calls to task context
        io_uring/net: fix fast_iov assignment in io_setup_async_msg()
        io_uring/net: fix non-zc send with address
        io_uring/net: don't skip notifs for failed requests
        io_uring/rw: don't lose short results on io_setup_async_rw()
        io_uring/rw: fix unexpected link breakage
        io_uring/net: fix cleanup double free free_iov init
        io_uring: fix CQE reordering
        io_uring/net: fix UAF in io_sendrecv_fail()
        selftest/net: adjust io_uring sendzc notif handling
        io_uring: ensure local task_work marks task as running
        io_uring/net: zerocopy sendmsg
        io_uring/net: combine fail handlers
        io_uring/net: rename io_sendzc()
        io_uring/net: support non-zerocopy sendto
        io_uring/net: refactor io_setup_async_addr
        io_uring/net: don't lose partial send_zc on fail
        ...
    • Merge branch 'for-6.1/block' into for-6.1/passthrough · 736feaa3
      Jens Axboe authored
      
      
      * for-6.1/block: (162 commits)
        sbitmap: fix lockup while swapping
        block: add rationale for not using blk_mq_plug() when applicable
        block: adapt blk_mq_plug() to not plug for writes that require a zone lock
        s390/dasd: use blk_mq_alloc_disk
        blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
        nvmet: don't look at the request_queue in nvmet_bdev_set_limits
        nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
        blk-mq: use quiesced elevator switch when reinitializing queues
        block: replace blk_queue_nowait with bdev_nowait
        nvme: remove nvme_ctrl_init_connect_q
        nvme-loop: use the tagset alloc/free helpers
        nvme-loop: store the generic nvme_ctrl in set->driver_data
        nvme-loop: initialize sqsize later
        nvme-fc: use the tagset alloc/free helpers
        nvme-fc: store the generic nvme_ctrl in set->driver_data
        nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
        nvme-rdma: use the tagset alloc/free helpers
        nvme-rdma: store the generic nvme_ctrl in set->driver_data
        nvme-tcp: use the tagset alloc/free helpers
        nvme-tcp: store the generic nvme_ctrl in set->driver_data
        ...
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • sbitmap: fix lockup while swapping · 30514bd2
      Hugh Dickins authored
      Commit 4acb8341 ("sbitmap: fix batched wait_cnt accounting")
      is a big improvement: without it, I had to revert to before commit
      040b83fc ("sbitmap: fix possible io hung due to lost wakeup")
      to avoid the high system time and freezes which that had introduced.
      
      Now okay on the NVMe laptop, but 4acb8341 is a disaster for heavy
      swapping (kernel builds in low memory) on another: it soon locks up in
      sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
      around with waitqueue_active() but wait_cnt 0. Here is a backtrace,
      showing the common pattern of an outer sbitmap_queue_wake_up()
      interrupted before setting wait_cnt 0 back to wake_batch (in some cases
      other CPUs are idle, in other cases they're spinning for a lock in
      dd_bio_merge()):
      
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
      scsi_end_request < scsi_io_completion < scsi_finish_command <
      scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
      __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
      _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
      blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
      __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
      submit_bio < __swap_writepage < swap_writepage < pageout <
      shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
      shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
      __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
      do_anonymous_page < __handle_mm_fault < handle_mm_fault <
      do_user_addr_fault < exc_page_fault < asm_exc_page_fault
      
      See how the process-context sbitmap_queue_wake_up() has been
      interrupted, after bringing wait_cnt down to 0 (and in this example,
      after doing its wakeups), before advancing wake_index and refilling
      wait_cnt: an interrupt-context sbitmap_queue_wake_up() of the same sbq
      gets stuck.
      
      I have almost no grasp of all the possible sbitmap races and their
      consequences: but __sbq_wake_up() can do nothing useful while wait_cnt
      is 0, so it is better if sbq_wake_ptr() skips on to the next ws in that
      case: this fixes the lockup and shows no adverse consequence for me.
      
      The check for wait_cnt being 0 is obviously racy, and ultimately can lead
      to lost wakeups: for example, when there is only a single waitqueue with
      waiters.  However, lost wakeups are unlikely to matter in these cases,
      and a proper fix requires redesign (and benchmarking) of the batched
      wakeup code: so let's plug the hole with this bandaid for now.
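
      The shape of the fix in sbq_wake_ptr() (paraphrased from the patch):
      skip a waitqueue whose wait_cnt has already been brought to 0, rather
      than returning it and cycling on it until someone refills it:

          for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
                  struct sbq_wait_state *ws = &sbq->ws[wake_index];

                  /* nothing useful can be done while wait_cnt is 0 */
                  if (waitqueue_active(&ws->wait) &&
                      atomic_read(&ws->wait_cnt) > 0) {
                          if (wake_index != atomic_read(&sbq->wake_index))
                                  atomic_set(&sbq->wake_index, wake_index);
                          return ws;
                  }

                  wake_index = sbq_index_inc(wake_index);
          }
          return NULL;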
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: fix notif cqe reordering · 108893dd
      Pavel Begunkov authored
      Send zc is no longer restricted to !IO_URING_F_UNLOCKED, so we can't
      use the task-tw ordering trick to order notification CQEs with request
      completions. In this case, leave it alone and let io_send_zc_cleanup()
      flush it.
      
      Cc: stable@vger.kernel.org
      Fixes: 53bdc88a ("io_uring/notif: order notif vs send CQEs")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: don't update msg_name if not provided · 6f10ae8a
      Pavel Begunkov authored
      
      
      io_sendmsg_copy_hdr() may clear msg->msg_name if userspace didn't
      provide it; we should retain NULL in that case.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL · 46a525e1
      Jens Axboe authored
      
      
      This isn't a reliable mechanism for telling whether we have task_work
      pending; we really should be looking at whether any items are queued.
      This is problematic if forward progress is gated on running said
      task_work. One such example is reading from a pipe whose write side has
      been closed right before the read is started. The fput() of the file
      queues TWA_RESUME task_work, and we need that task_work to run before
      ->release() is called for the pipe. If ->release() isn't called, the
      read will sit forever waiting on data that will never arrive.

      Fix this by making io_run_task_work() check whether we have task_work
      pending rather than relying on TIF_NOTIFY_SIGNAL. The latter obviously
      doesn't work for task_work that is queued without TWA_SIGNAL.
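
      The gist of the change to io_run_task_work() (paraphrased):

          static inline bool io_run_task_work(void)
          {
                  /* look at the queued work itself, not TIF_NOTIFY_SIGNAL:
                   * work queued without TWA_SIGNAL never sets that flag */
                  if (task_work_pending(current)) {
                          __set_current_state(TASK_RUNNING);
                          clear_notify_signal();
                          task_work_run();
                          return true;
                  }

                  return false;
          }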
      
      Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org>
      Cc: stable@vger.kernel.org
      Link: https://github.com/axboe/liburing/issues/665
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/rw: defer fsnotify calls to task context · b000145e
      Jens Axboe authored
      We can't call these off the kiocb completion, as that may happen in
      soft/hard irq context. Defer the calls to when we process the task_work
      for this request; a sketch of the general deferral pattern follows the
      trace below. That avoids valid complaints like:
      
      stack backtrace:
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3694e #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_usage_bug kernel/locking/lockdep.c:3961 [inline]
       valid_state kernel/locking/lockdep.c:3973 [inline]
       mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
       mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
       mark_lock kernel/locking/lockdep.c:4596 [inline]
       mark_usage kernel/locking/lockdep.c:4527 [inline]
       __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
       lock_acquire kernel/locking/lockdep.c:5666 [inline]
       lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
       __fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
       fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
       might_alloc include/linux/sched/mm.h:271 [inline]
       slab_pre_alloc_hook mm/slab.h:700 [inline]
       slab_alloc mm/slab.c:3278 [inline]
       __kmem_cache_alloc_lru mm/slab.c:3471 [inline]
       kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
       fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
       fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
       fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
       send_to_group fs/notify/fsnotify.c:360 [inline]
       fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
       __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
       fsnotify_parent include/linux/fsnotify.h:77 [inline]
       fsnotify_file include/linux/fsnotify.h:99 [inline]
       fsnotify_access include/linux/fsnotify.h:309 [inline]
       __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
       io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
       iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
       iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
       bio_endio+0x5f9/0x780 block/bio.c:1564
       req_bio_endio block/blk-mq.c:695 [inline]
       blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
       scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
       scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
       scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
       blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
       __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
       invoke_softirq kernel/softirq.c:445 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
       common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240
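
      The general deferral pattern referred to above (names are hypothetical
      and this is not the io_uring diff): queue a callback with task_work so
      the fsnotify calls run in task context instead of irq completion:

          #include <linux/task_work.h>

          /* task context: safe to call fsnotify_access()/fsnotify_modify() */
          static void my_rw_complete_in_task(struct callback_head *cb)
          {
                  /* ... fsnotify and the rest of the completion ... */
          }

          /* called from the irq-side completion path */
          static void my_rw_defer_completion(struct task_struct *task,
                                             struct callback_head *cb)
          {
                  init_task_work(cb, my_rw_complete_in_task);
                  /* runs before the task returns to userspace */
                  if (task_work_add(task, cb, TWA_SIGNAL))
                          pr_warn("task exiting, dropping deferred work\n");
          }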
      
      Fixes: f63cf519 ("io_uring: ensure that fsnotify is always called")
      Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
      
      
      Reported-by: <syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com>
      Reported-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>