  1. Sep 03, 2021
    • io-wq: make worker creation resilient against signals · 3146cba9
      Jens Axboe authored
      
      
      If a task is queueing async work and also handling signals, then we can
      run into the case where create_io_thread() is interrupted and returns
      failure because of that. If this happens for creating the first worker
      in a group, then that worker will never get created and we can hang the
      ring.
      
      If we do get a fork failure, retry from task_work. With signals we have
      to be a bit careful as we cannot simply queue as task_work, as we'll
      still have signals pending at that point. Punt over a normal workqueue
      first and then create from task_work after that.
      
      Lastly, ensure that we handle fatal worker creation failures. Worker
      creation failures are normally not fatal; only if we fail to create one
      in an empty worker group can we not make progress. Right now that is
      ignored; ensure that we handle it and run cancel on the work item.
      
      There are two paths that create new workers - one is the "existing worker
      going to sleep", and the other is "no workers found for this work, create
      one". The former is never fatal, as workers do exist in the group. Only
      the latter needs to be carefully handled.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3146cba9
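      To make the retry flow above concrete, here is a minimal, illustrative
      sketch of the "punt over a normal workqueue, then finish from task_work"
      pattern (kernel-style C; the struct and function names are hypothetical,
      not the actual io-wq symbols):

          #include <linux/sched.h>
          #include <linux/workqueue.h>
          #include <linux/task_work.h>
          #include <linux/slab.h>

          struct worker_retry {
                  struct work_struct work;        /* first hop: normal workqueue */
                  struct callback_head cb;        /* second hop: task_work */
                  struct task_struct *task;       /* task that queued the async work */
          };

          /* Runs in the task's own context after pending signals have been
           * dealt with, so a fork attempt is no longer interrupted by them. */
          static void retry_create_tw(struct callback_head *cb)
          {
                  struct worker_retry *r = container_of(cb, struct worker_retry, cb);

                  /* retry worker creation here; if it still fails for an empty
                   * worker group, cancel the work item instead of hanging */
                  kfree(r);
          }

          /* Runs from a normal workqueue, where it is safe to queue task_work
           * even though signals were pending at the original call site. */
          static void retry_create_wq(struct work_struct *work)
          {
                  struct worker_retry *r = container_of(work, struct worker_retry, work);

                  init_task_work(&r->cb, retry_create_tw);
                  if (task_work_add(r->task, &r->cb, TWA_SIGNAL))
                          kfree(r);       /* task is exiting: treat as fatal */
          }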
    • io-wq: get rid of FIXED worker flag · 05c5f4ee
      Jens Axboe authored
      
      
      It makes the logic easier to follow if we just get rid of the fixed worker
      flag, and simply ensure that we never exit the last worker in the group.
      This also means that no particular worker is special.
      
      Just track the last timeout state, and if we have hit it and no work
      is pending, check if there are other workers. If yes, then we can exit
      this one safely.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      05c5f4ee
  2. Sep 02, 2021
    • io-wq: only exit on fatal signals · 15e20db2
      Jens Axboe authored
      
      
      If the application uses io_uring and also relies heavily on signals
      for communication, that can cause io-wq workers to spuriously exit
      just because the parent has a signal pending. Just ignore signals
      unless they are fatal.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      15e20db2
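      As a rough illustration of that rule (a sketch, not the exact io-wq diff),
      the worker's exit check boils down to distinguishing fatal from non-fatal
      signals:

          /* A pending signal aimed at the parent (e.g. one used for IPC) no
           * longer makes the worker exit; only a fatal, process-killing signal
           * does. */
          if (signal_pending(current)) {
                  if (fatal_signal_pending(current))
                          break;          /* tear the worker down */
                  /* otherwise ignore it and keep processing work */
          }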
    • io-wq: split bounded and unbounded work into separate lists · f95dc207
      Jens Axboe authored
      We've got a few issues that all boil down to the fact that we have one
      list of pending work items, yet two different types of workers to
      serve them. This causes some oddities around workers switching type and
      even hashed work vs regular work on the same bounded list.
      
      Just separate them out cleanly, similarly to how we already do
      accounting of what is running. That provides a clean separation and
      removes some corner cases that can cause stalls when handling IO
      that is punted to io-wq.
      
      Fixes: ecc53c48 ("io-wq: check max_worker limits if a worker transitions bound state")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f95dc207
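      Structurally, the change amounts to giving each accounting class its own
      pending list. A simplified sketch of the resulting layout (field names
      approximate, not the verbatim kernel structs):

          enum {
                  IO_WQ_ACCT_BOUND,
                  IO_WQ_ACCT_UNBOUND,
                  IO_WQ_ACCT_NR,
          };

          struct io_wqe_acct {
                  unsigned nr_workers;
                  unsigned max_workers;
                  struct io_wq_work_list work_list;   /* pending work for this class only */
          };

          struct io_wqe {
                  raw_spinlock_t lock;
                  struct io_wqe_acct acct[IO_WQ_ACCT_NR];
          };

      A bounded worker now only ever scans its own class's list, which removes
      the corner cases around mixing the two work types on one list.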
  3. Sep 01, 2021
    • io-wq: fix queue stalling race · 0242f642
      Jens Axboe authored
      
      
      We need to set the stalled bit early, before we drop the lock for adding
      us to the stall hash queue. If not, then we can race with new work being
      queued between adding us to the stall hash and io_worker_handle_work()
      marking us stalled.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0242f642
    • io_uring: don't submit half-prepared drain request · b8ce1b9d
      Pavel Begunkov authored
      [ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
      [ 3784.910926] Call Trace:
      [ 3784.910928]  ? io_read+0x17c/0x480
      [ 3784.910945]  io_issue_sqe+0xcb/0x1840
      [ 3784.910953]  __io_queue_sqe+0x44/0x300
      [ 3784.910959]  io_req_task_submit+0x27/0x70
      [ 3784.910962]  tctx_task_work+0xeb/0x1d0
      [ 3784.910966]  task_work_run+0x61/0xa0
      [ 3784.910968]  io_run_task_work_sig+0x53/0xa0
      [ 3784.910975]  __x64_sys_io_uring_enter+0x22/0x30
      [ 3784.910977]  do_syscall_64+0x3d/0x90
      [ 3784.910981]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_drain_req() is called before the checks for REQ_F_FAIL, which protect
      us from submitting an under-prepared request (e.g. one that failed in
      io_init_req()). Fail such drained requests as well.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b8ce1b9d
    • io_uring: fix queueing half-created requests · c6d3d9cb
      Pavel Begunkov authored
      [   27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
      [   27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
      [   27.263730] RIP: 0010:sock_from_file+0x20/0x90
      [   27.272444] Call Trace:
      [   27.272736]  io_sendmsg+0x98/0x600
      [   27.279216]  io_issue_sqe+0x498/0x68d0
      [   27.281142]  __io_queue_sqe+0xab/0xb50
      [   27.285830]  io_req_task_submit+0xbf/0x1b0
      [   27.286306]  tctx_task_work+0x178/0xad0
      [   27.288211]  task_work_run+0xe2/0x190
      [   27.288571]  exit_to_user_mode_prepare+0x1a1/0x1b0
      [   27.289041]  syscall_exit_to_user_mode+0x19/0x50
      [   27.289521]  do_syscall_64+0x48/0x90
      [   27.289871]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_req_complete_failed() -> io_req_complete_post() ->
      io_req_task_queue() would still try to enqueue a hard linked request,
      which can be half-prepared (e.g. it failed init), so we can't allow
      that to happen.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Reported-by: <syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c6d3d9cb
    • io-wq: ensure that hash wait lock is IRQ disabling · 08bdbd39
      Jens Axboe authored
      A previous commit removed the IRQ safety of the worker and wqe locks,
      but that left one spot where the hash wait lock is now taken without
      IRQs already being disabled.
      
      Ensure that we use the right locking variant for the hashed waitqueue
      lock.
      
      Fixes: a9a4aa9f ("io-wq: wqe and worker locks no longer need to be IRQ safe")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08bdbd39
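      In code terms the fix is simply picking the IRQ-disabling lock variant for
      that one remaining spot. A hedged sketch (field names approximate):

          /* We no longer arrive here with IRQs already disabled, so the hashed
           * waitqueue lock must use the _irq variant rather than plain
           * spin_lock(). */
          spin_lock_irq(&wq->hash->wait.lock);
          if (list_empty(&wqe->wait.entry))
                  __add_wait_queue(&wq->hash->wait, &wqe->wait);
          spin_unlock_irq(&wq->hash->wait.lock);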
    • io_uring: retry in case of short read on block device · 7db30437
      Ming Lei authored
      In case of buffered reading from a block device, when a short read
      happens we should retry to read more; otherwise the IO will be completed
      only partially. For example, the following fio job expects to read 2MB,
      but it may read only 1MB or fewer bytes:
      
          fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
      	--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
      	--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
      	--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting
      
      Fix the issue by allowing short read retries for block devices, which do
      in fact set FMODE_BUF_RASYNC.
      
      Fixes: 9a173346 ("io_uring: fix short read retries for non-reg files")
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7db30437
    • io_uring: IORING_OP_WRITE needs hash_reg_file set · 7b3188e7
      Jens Axboe authored
      During some testing, it became evident that using IORING_OP_WRITE doesn't
      hash buffered writes like the other write commands do. That's simply
      an oversight, and it can cause performance regressions when doing
      buffered writes with this command.
      
      Correct that and add the flag, so that buffered writes are correctly
      hashed when using the non-iovec based write command.
      
      Cc: stable@vger.kernel.org
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7b3188e7
    • io-wq: fix race between adding work and activating a free worker · 94ffb0a2
      Jens Axboe authored
      
      
      The attempt to find and activate a free worker for new work is currently
      combined with creating a new one if we don't find one, but that opens
      io-wq up to a race where the worker that is found and activated can
      put itself to sleep without knowing that it has been selected to perform
      this new work.
      
      Fix this by moving the activation into where we add the new work item;
      then we can retain it within the wqe->lock scope and eliminate the race,
      with the worker itself checking inside the lock but sleeping outside of
      it.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      94ffb0a2
  4. Aug 30, 2021
  5. Aug 29, 2021
    • io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts · 50c1df2b
      Jens Axboe authored
      
      
      Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME instead
      of the default CLOCK_MONOTONIC.

      Add IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flags that allow
      timeouts and linked timeouts to use the selected clock source.

      Only one clock source may be selected, and we -EINVAL the request if more
      than one is given. If neither BOOTTIME nor REALTIME is selected, the
      previous default of MONOTONIC is used.
      
      Link: https://github.com/axboe/liburing/issues/369
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      50c1df2b
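      From userspace the new flags are simply passed in the timeout SQE. A
      hedged liburing example (assumes a liburing and kernel new enough to
      expose IORING_TIMEOUT_BOOTTIME):

          #include <liburing.h>

          /* Arm a 5 second timeout measured against CLOCK_BOOTTIME instead of
           * the default CLOCK_MONOTONIC. */
          static int queue_boottime_timeout(struct io_uring *ring)
          {
                  struct __kernel_timespec ts = { .tv_sec = 5, .tv_nsec = 0 };
                  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                  if (!sqe)
                          return -EAGAIN;
                  /* count = 0: plain timeout; the flags select the clock */
                  io_uring_prep_timeout(sqe, &ts, 0, IORING_TIMEOUT_BOOTTIME);
                  return io_uring_submit(ring);
          }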
    • io-wq: provide a way to limit max number of workers · 2e480058
      Jens Axboe authored
      
      
      io-wq divides work into two categories:
      
      1) Work that completes in a bounded time, like reading from a regular file
         or a block device. This type of work is limited based on the size of
         the SQ ring.
      
      2) Work that may never complete, we call this unbounded work. The amount
         of workers here is just limited by RLIMIT_NPROC.
      
      For various use cases, it's handy to have the kernel limit the maximum
      number of pending workers for both categories. Provide a way to do that
      with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
      
      IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
      the max worker count to what is being passed in for each category. The
      old values are returned into that same array. If 0 is being passed in for
      either category, it simply returns the current value.
      
      The value is capped at RLIMIT_NPROC. This actually isn't that important,
      as it's more of a hint; if we're exceeding the value, then our attempt
      to fork a new worker will fail. This happens naturally already if more
      than one node is in the system, as these values are per-node internally
      for io-wq.
      
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Link: https://github.com/axboe/liburing/issues/420
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2e480058
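      For reference, this is roughly how the new operation is driven from
      userspace (hedged example; assumes a liburing new enough, around 2.1+,
      to provide the io_uring_register_iowq_max_workers() wrapper):

          #include <liburing.h>

          static int cap_iowq_workers(struct io_uring *ring)
          {
                  /* [0] = bounded workers, [1] = unbounded workers; passing 0
                   * leaves a value unchanged and just reports the old one. */
                  unsigned int vals[2] = { 8, 4 };
                  int ret = io_uring_register_iowq_max_workers(ring, vals);

                  if (ret < 0)
                          return ret;
                  /* vals[] now holds the previous limits */
                  return 0;
          }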
  6. Aug 27, 2021
  7. Aug 26, 2021
  8. Aug 25, 2021
  9. Aug 24, 2021
    • io_uring: fix io_try_cancel_userdata race for iowq · dadebc35
      Pavel Begunkov authored
      
      
      WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
      CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
      Call Trace:
       io_async_cancel fs/io_uring.c:6014 [inline]
       io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
       io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
       io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
       io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
      
      io_try_cancel_userdata() can be called from io_async_cancel() executing
      in the io-wq context, so the warning fires; it is there to alert
      anyone accessing task->io_uring->io_wq in a racy way. However,
      io_wq_put_and_exit() always first waits for all threads to complete,
      so the only detail left is to zero tctx->io_wq after the context is
      removed.

      Note: one small assumption is that when IO_WQ_WORK_CANCEL is set, the
      executor won't touch ->io_wq, because io_wq_destroy() might cancel
      leftover pending requests in such a way.
      
      Cc: stable@vger.kernel.org
      Reported-by: <syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dadebc35
    • io_uring: IRQ rw completion batching · 126180b9
      Pavel Begunkov authored
      
      
      Employ inline completion logic for read/write completions done via
      io_req_task_complete(). If ->uring_lock is contended, just do normal
      request completion, but if not, make tctx_task_work() grab the lock
      and do batched inline completions in io_req_task_complete().
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      126180b9
    • io_uring: batch task work locking · f237c30a
      Pavel Begunkov authored
      
      
      Many task_work handlers either grab ->uring_lock, or may benefit from
      having it. Move locking logic out of individual handlers to a lazy
      approach controlled by tctx_task_work(), so we don't keep doing
      tons of mutex lock/unlock.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f237c30a
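      The shape of that lazy locking, as a simplified sketch (hypothetical
      helper names, not the exact io_uring ones):

          /* Each handler asks for ->uring_lock through this helper; the
           * task_work loop remembers whether the mutex is already held and
           * only drops it once, after the whole batch has run. */
          static void tw_lock(struct mutex *uring_lock, bool *locked)
          {
                  if (!*locked) {
                          mutex_lock(uring_lock);
                          *locked = true;
                  }
          }

          static void tw_unlock_final(struct mutex *uring_lock, bool *locked)
          {
                  if (*locked) {
                          mutex_unlock(uring_lock);
                          *locked = false;
                  }
          }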
    • io_uring: flush completions for fallbacks · 5636c00d
      Pavel Begunkov authored
      
      
      io_fallback_req_func() doesn't expect anyone creating inline
      completions, and no one currently does that. Teach the function to flush
      completions preparing for further changes.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5636c00d
    • io_uring: add ->splice_fd_in checks · 26578cda
      Pavel Begunkov authored
      
      
      ->splice_fd_in is used only by splice/tee, but no other request type
      checks it for validity. Add the check for most request types, excluding
      reads/writes/sends/recvs; we don't want overhead for those and can leave
      them as-is until the field is actually used.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      26578cda
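      The check itself is tiny; in each affected prep handler it boils down to
      something like this sketch (not quoting the exact patch):

          /* splice/tee are the only users of ->splice_fd_in; every other
           * opcode must leave it zero, otherwise reject the SQE. */
          if (unlikely(READ_ONCE(sqe->splice_fd_in)))
                  return -EINVAL;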
    • io_uring: add clarifying comment for io_cqring_ev_posted() · 2c5d763c
      Jens Axboe authored
      We've previously had an issue where overflow flush unconditionally calls
      io_cqring_ev_posted() even if it didn't flush any events to the ring,
      causing wake and eventfd increment where no new events are available.
      Some applications don't like that, see commit b18032bb for details.
      
      This came up in discussion for another patch recently, hence add a
      comment detailing what the relationship between calling the events
      posted helper and CQ ring entries is.
      
      Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2c5d763c
    • io_uring: place fixed tables under memcg limits · 0bea96f5
      Pavel Begunkov authored
      
      
      Fixed tables may be quite large; place all of them, together with the
      allocated tags, under memcg limits.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0bea96f5
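      Mechanically this means the large table allocations use the
      memcg-accounted GFP flag. A minimal sketch of the idea (illustrative,
      not the exact call sites):

          /* GFP_KERNEL_ACCOUNT charges the allocation to the task's memory
           * cgroup, so oversized fixed tables count against its limit. */
          table = kvcalloc(nr_entries, sizeof(*table), GFP_KERNEL_ACCOUNT);
          if (!table)
                  return -ENOMEM;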
    • io_uring: limit fixed table size by RLIMIT_NOFILE · 3a1b8a4e
      Pavel Begunkov authored
      
      
      Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE;
      that's the first and simplest restriction that we should impose.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a1b8a4e
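      The restriction is a single bound check at registration time, roughly
      (a sketch, not the verbatim patch):

          /* Don't let a fixed-file table grow past the per-process open-file
           * limit. */
          if (nr_args > rlimit(RLIMIT_NOFILE))
                  return -EMFILE;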
    • io_uring: fix lack of protection for compl_nr · 99c8bc52
      Hao Xu authored
      compl_nr in ctx_flush_and_put() is not protected by uring_lock, and this
      may cause problems when accessed in parallel:

      say compl_nr > 0

        ctx_flush_and_put                  other context
         if (compl_nr)                      get mutex
                                            compl_nr > 0
                                            do flush
                                                compl_nr = 0
                                            release mutex
              get mutex
                 do flush (*)
              release mutex

      At the place marked (*), we call io_cqring_ev_posted() and users likely
      get no events there. To avoid spurious events, re-check the value while
      holding the lock.
      
      Fixes: 2c32395d ("io_uring: fix __tctx_task_work() ctx race")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20210820221954.61815-1-haoxu@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      99c8bc52
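      The fix is the classic "re-check under the lock" pattern; roughly (field
      and function names approximate):

          if (ctx->submit_state.compl_nr) {
                  mutex_lock(&ctx->uring_lock);
                  /* another context may have flushed while we waited for the
                   * mutex; only flush (and post events) if work remains */
                  if (ctx->submit_state.compl_nr)
                          io_submit_flush_completions(ctx);
                  mutex_unlock(&ctx->uring_lock);
          }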
    • io_uring: Add register support for non-4k PAGE_SIZE · 187f08c1
      wangyangbo authored
      
      
      The allocated rsrc table currently uses PAGE_SIZE as the size of the
      2nd level, and accessing this table relies on each level's index being
      derived from a fixed TABLE_SHIFT (12 - 3), which assumes 4k pages. In
      order to work correctly with a non-4k PAGE_SIZE, define TABLE_SHIFT as
      non-fixed (PAGE_SHIFT - shift of data) for the 2nd-level table entry
      number.
      
      Signed-off-by: wangyangbo <wangyangbo@uniontech.com>
      Link: https://lore.kernel.org/r/20210819055657.27327-1-wangyangbo@uniontech.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      187f08c1
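      In other words, the 2nd-level shift becomes a function of PAGE_SHIFT
      rather than a hard-coded 12 - 3. A sketch of the idea (macro names
      approximate):

          /* One 2nd-level table still occupies exactly one page; with 8-byte
           * entries that is PAGE_SHIFT - 3 index bits, whatever PAGE_SIZE is. */
          #define TABLE_SHIFT     (PAGE_SHIFT - ilog2(sizeof(void *)))
          #define TABLE_ENTRIES   (1U << TABLE_SHIFT)
          #define TABLE_MASK      (TABLE_ENTRIES - 1)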
    • io_uring: extend task put optimisations · e98e49b2
      Pavel Begunkov authored
      
      
      Now with IRQ rw completions done via task_work, almost all request
      freeing is done from the context of the submitter task, so it makes
      sense to extend the task_put optimisation from io_req_free_batch_finish()
      to cover all the cases, including task_work, by moving it into
      io_put_task().
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e98e49b2
    • io_uring: add comments on why PF_EXITING checking is safe · 316319e8
      Jens Axboe authored
      
      
      We have two checks of task->flags & PF_EXITING left:
      
      1) In io_req_task_submit(), which is called in task_work and hence always
         in the context of the original task. That means that
         req->task == current, and hence checking ->flags is totally fine.
      
      2) In io_poll_rewait(), where we need to stop re-arming poll to prevent
         it interfering with cancelation. This is only run from task_work as
         well, and hence for this case too req->task == current.
      
      Add a comment to both spots detailing that.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      316319e8
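      For context, the pattern those comments describe looks roughly like this
      (a sketch; the real code and comment wording differ):

          /* Running from task_work, so req->task == current: reading
           * current's ->flags needs no extra synchronisation. */
          if (unlikely(req->task->flags & PF_EXITING)) {
                  io_req_complete_failed(req, -EFAULT);
                  return;
          }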