Skip to content
  1. Sep 04, 2021
    • Jens Axboe's avatar
      io_uring: io_uring_complete() trace should take an integer · 2fc2a7a6
      Jens Axboe authored
      
      
      It currently takes a long, and while that's normally OK, the io_uring
      limit is an int. Internally in io_uring it's an int, but sometimes it's
      passed as a long. That can yield confusing results where a completions
      seems to generate a huge result:
      
      ou-sqp-1297-1298    [001] ...1   788.056371: io_uring_complete: ring 000000000e98e046, user_data 0x0, result 4294967171, cflags 0
      
      which is due to -ECANCELED being stored in an unsigned, and then passed
      in as a long. Using the right int type, the trace looks correct:
      
      iou-sqp-338-339     [002] ...1    15.633098: io_uring_complete: ring 00000000e0ac60cf, user_data 0x0, result -125, cflags 0
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2fc2a7a6
  2. Sep 03, 2021
  3. Sep 02, 2021
    • Jens Axboe's avatar
      io-wq: only exit on fatal signals · 15e20db2
      Jens Axboe authored
      
      
      If the application uses io_uring and also relies heavily on signals
      for communication, that can cause io-wq workers to spuriously exit
      just because the parent has a signal pending. Just ignore signals
      unless they are fatal.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      15e20db2
    • Jens Axboe's avatar
      io-wq: split bounded and unbounded work into separate lists · f95dc207
      Jens Axboe authored
      
      
      We've got a few issues that all boil down to the fact that we have one
      list of pending work items, yet two different types of workers to
      serve them. This causes some oddities around workers switching type and
      even hashed work vs regular work on the same bounded list.
      
      Just separate them out cleanly, similarly to how we already do
      accounting of what is running. That provides a clean separation and
      removes some corner cases that can cause stalls when handling IO
      that is punted to io-wq.
      
      Fixes: ecc53c48 ("io-wq: check max_worker limits if a worker transitions bound state")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f95dc207
  4. Sep 01, 2021
    • Jens Axboe's avatar
      io-wq: fix queue stalling race · 0242f642
      Jens Axboe authored
      
      
      We need to set the stalled bit early, before we drop the lock for adding
      us to the stall hash queue. If not, then we can race with new work being
      queued between adding us to the stall hash and io_worker_handle_work()
      marking us stalled.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0242f642
    • Pavel Begunkov's avatar
      io_uring: don't submit half-prepared drain request · b8ce1b9d
      Pavel Begunkov authored
      
      
      [ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
      [ 3784.910926] Call Trace:
      [ 3784.910928]  ? io_read+0x17c/0x480
      [ 3784.910945]  io_issue_sqe+0xcb/0x1840
      [ 3784.910953]  __io_queue_sqe+0x44/0x300
      [ 3784.910959]  io_req_task_submit+0x27/0x70
      [ 3784.910962]  tctx_task_work+0xeb/0x1d0
      [ 3784.910966]  task_work_run+0x61/0xa0
      [ 3784.910968]  io_run_task_work_sig+0x53/0xa0
      [ 3784.910975]  __x64_sys_io_uring_enter+0x22/0x30
      [ 3784.910977]  do_syscall_64+0x3d/0x90
      [ 3784.910981]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_drain_req() goes before checks for REQ_F_FAIL, which protect us from
      submitting under-prepared request (e.g. failed in io_init_req(). Fail
      such drained requests as well.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b8ce1b9d
    • Pavel Begunkov's avatar
      io_uring: fix queueing half-created requests · c6d3d9cb
      Pavel Begunkov authored
      
      
      [   27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
      [   27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
      [   27.263730] RIP: 0010:sock_from_file+0x20/0x90
      [   27.272444] Call Trace:
      [   27.272736]  io_sendmsg+0x98/0x600
      [   27.279216]  io_issue_sqe+0x498/0x68d0
      [   27.281142]  __io_queue_sqe+0xab/0xb50
      [   27.285830]  io_req_task_submit+0xbf/0x1b0
      [   27.286306]  tctx_task_work+0x178/0xad0
      [   27.288211]  task_work_run+0xe2/0x190
      [   27.288571]  exit_to_user_mode_prepare+0x1a1/0x1b0
      [   27.289041]  syscall_exit_to_user_mode+0x19/0x50
      [   27.289521]  do_syscall_64+0x48/0x90
      [   27.289871]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_req_complete_failed() -> io_req_complete_post() ->
      io_req_task_queue() still would try to enqueue hard linked request,
      which can be half prepared (e.g. failed init), so we can't allow
      that to happen.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Reported-by: default avatar <syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c6d3d9cb
    • Jens Axboe's avatar
      io-wq: ensure that hash wait lock is IRQ disabling · 08bdbd39
      Jens Axboe authored
      
      
      A previous commit removed the IRQ safety of the worker and wqe locks,
      but that left one spot of the hash wait lock now being done without
      already having IRQs disabled.
      
      Ensure that we use the right locking variant for the hashed waitqueue
      lock.
      
      Fixes: a9a4aa9f ("io-wq: wqe and worker locks no longer need to be IRQ safe")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      08bdbd39
    • Ming Lei's avatar
      io_uring: retry in case of short read on block device · 7db30437
      Ming Lei authored
      
      
      In case of buffered reading from block device, when short read happens,
      we should retry to read more, otherwise the IO will be completed
      partially, for example, the following fio expects to read 2MB, but it
      can only read 1M or less bytes:
      
          fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
      	--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
      	--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
      	--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting
      
      Fix the issue by allowing short read retry for block device, which sets
      FMODE_BUF_RASYNC really.
      
      Fixes: 9a173346 ("io_uring: fix short read retries for non-reg files")
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7db30437
    • Jens Axboe's avatar
      io_uring: IORING_OP_WRITE needs hash_reg_file set · 7b3188e7
      Jens Axboe authored
      
      
      During some testing, it became evident that using IORING_OP_WRITE doesn't
      hash buffered writes like the other writes commands do. That's simply
      an oversight, and can cause performance regressions when doing buffered
      writes with this command.
      
      Correct that and add the flag, so that buffered writes are correctly
      hashed when using the non-iovec based write command.
      
      Cc: stable@vger.kernel.org
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7b3188e7
    • Jens Axboe's avatar
      io-wq: fix race between adding work and activating a free worker · 94ffb0a2
      Jens Axboe authored
      
      
      The attempt to find and activate a free worker for new work is currently
      combined with creating a new one if we don't find one, but that opens
      io-wq up to a race where the worker that is found and activated can
      put itself to sleep without knowing that it has been selected to perform
      this new work.
      
      Fix this by moving the activation into where we add the new work item,
      then we can retain it within the wqe->lock scope and elimiate the race
      with the worker itself checking inside the lock, but sleeping outside of
      it.
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarAndres Freund <andres@anarazel.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      94ffb0a2
  5. Aug 30, 2021
  6. Aug 29, 2021
    • Jens Axboe's avatar
      io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts · 50c1df2b
      Jens Axboe authored
      Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
      CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC.
      
      Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
      allows timeouts and linked timeouts to use the selected clock source.
      
      Only one clock source may be selected, and we -EINVAL the request if more
      than one is given. If neither BOOTIME nor REALTIME are selected, the
      previous default of MONOTONIC is used.
      
      Link: https://github.com/axboe/liburing/issues/369
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      50c1df2b
    • Jens Axboe's avatar
      io-wq: provide a way to limit max number of workers · 2e480058
      Jens Axboe authored
      
      
      io-wq divides work into two categories:
      
      1) Work that completes in a bounded time, like reading from a regular file
         or a block device. This type of work is limited based on the size of
         the SQ ring.
      
      2) Work that may never complete, we call this unbounded work. The amount
         of workers here is just limited by RLIMIT_NPROC.
      
      For various uses cases, it's handy to have the kernel limit the maximum
      amount of pending workers for both categories. Provide a way to do with
      with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
      
      IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
      the max worker count to what is being passed in for each category. The
      old values are returned into that same array. If 0 is being passed in for
      either category, it simply returns the current value.
      
      The value is capped at RLIMIT_NPROC. This actually isn't that important
      as it's more of a hint, if we're exceeding the value then our attempt
      to fork a new worker will fail. This happens naturally already if more
      than one node is in the system, as these values are per-node internally
      for io-wq.
      
      Reported-by: default avatarJohannes Lundberg <johalun0@gmail.com>
      Link: https://github.com/axboe/liburing/issues/420
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2e480058
  7. Aug 27, 2021
  8. Aug 26, 2021
  9. Aug 25, 2021
  10. Aug 24, 2021