  1. Dec 10, 2020
    • io_uring: always let io_iopoll_complete() complete polled io · dad1b124
      Xiaoguang Wang authored
      
      
      Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring():
      [   95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0
      [   95.505921]
      [   95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G    B
      W         5.10.0-rc5+ #1
      [   95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   95.508248] Call Trace:
      [   95.508683]  dump_stack+0x107/0x163
      [   95.509323]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.509982]  print_address_description.constprop.0+0x3e/0x60
      [   95.510814]  ? vprintk_func+0x98/0x140
      [   95.511399]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.512036]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.512733]  kasan_report_invalid_free+0x51/0x80
      [   95.513431]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.514047]  __kasan_slab_free+0x141/0x160
      [   95.514699]  kfree+0xd1/0x390
      [   95.515182]  io_commit_cqring+0x3ec/0x8e0
      [   95.515799]  __io_req_complete.part.0+0x64/0x90
      [   95.516483]  io_wq_submit_work+0x1fa/0x260
      [   95.517117]  io_worker_handle_work+0xeac/0x1c00
      [   95.517828]  io_wqe_worker+0xc94/0x11a0
      [   95.518438]  ? io_worker_handle_work+0x1c00/0x1c00
      [   95.519151]  ? __kthread_parkme+0x11d/0x1d0
      [   95.519806]  ? io_worker_handle_work+0x1c00/0x1c00
      [   95.520512]  ? io_worker_handle_work+0x1c00/0x1c00
      [   95.521211]  kthread+0x396/0x470
      [   95.521727]  ? _raw_spin_unlock_irq+0x24/0x30
      [   95.522380]  ? kthread_mod_delayed_work+0x180/0x180
      [   95.523108]  ret_from_fork+0x22/0x30
      [   95.523684]
      [   95.523985] Allocated by task 4035:
      [   95.524543]  kasan_save_stack+0x1b/0x40
      [   95.525136]  __kasan_kmalloc.constprop.0+0xc2/0xd0
      [   95.525882]  kmem_cache_alloc_trace+0x17b/0x310
      [   95.533930]  io_queue_sqe+0x225/0xcb0
      [   95.534505]  io_submit_sqes+0x1768/0x25f0
      [   95.535164]  __x64_sys_io_uring_enter+0x89e/0xd10
      [   95.535900]  do_syscall_64+0x33/0x40
      [   95.536465]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   95.537199]
      [   95.537505] Freed by task 4035:
      [   95.538003]  kasan_save_stack+0x1b/0x40
      [   95.538599]  kasan_set_track+0x1c/0x30
      [   95.539177]  kasan_set_free_info+0x1b/0x30
      [   95.539798]  __kasan_slab_free+0x112/0x160
      [   95.540427]  kfree+0xd1/0x390
      [   95.540910]  io_commit_cqring+0x3ec/0x8e0
      [   95.541516]  io_iopoll_complete+0x914/0x1390
      [   95.542150]  io_do_iopoll+0x580/0x700
      [   95.542724]  io_iopoll_try_reap_events.part.0+0x108/0x200
      [   95.543512]  io_ring_ctx_wait_and_kill+0x118/0x340
      [   95.544206]  io_uring_release+0x43/0x50
      [   95.544791]  __fput+0x28d/0x940
      [   95.545291]  task_work_run+0xea/0x1b0
      [   95.545873]  do_exit+0xb6a/0x2c60
      [   95.546400]  do_group_exit+0x12a/0x320
      [   95.546967]  __x64_sys_exit_group+0x3f/0x50
      [   95.547605]  do_syscall_64+0x33/0x40
      [   95.548155]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that once we get a non-EAGAIN error in io_wq_submit_work(),
      we complete the req by calling io_req_complete(), which holds completion_lock
      while calling io_commit_cqring(). But for polled io, io_iopoll_complete()
      does not hold completion_lock when it calls io_commit_cqring(), so the two
      paths can access ctx->defer_list concurrently and a double free may happen.
      
      To fix this bug, always let io_iopoll_complete() complete polled io.
      
      Cc: <stable@vger.kernel.org> # 5.5+
      Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add timeout update · 9c8e11b3
      Pavel Begunkov authored
      
      
      Support timeout updates through IORING_OP_TIMEOUT_REMOVE with
      IORING_TIMEOUT_UPDATE passed in. Updates don't support the offset timeout
      mode; the original timeout.off will be ignored as well.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: remove now unused 'ret' variable]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: restructure io_timeout_cancel() · fbd15848
      Pavel Begunkov authored
      
      
      Add io_timeout_extract() helper, which searches and disarms timeouts,
      but doesn't complete them. No functional changes.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix files cancellation · bee749b1
      Pavel Begunkov authored
      
      
      io_uring_cancel_files()'s task check condition mistakenly got flipped.
      
      1. There can't be a request in the inflight list without
      IO_WQ_WORK_FILES, so kill this check to keep the whole condition simpler.
      2. Also, don't call the function with files==NULL to avoid such a check;
      all that case is already handled well by its counterpart,
      __io_uring_cancel_task_requests().
      
      With that, just flip the task check.
      
      Also, since it io-wq-cancels all requests of the current task there, don't
      forget to set the right ->files in struct io_task_cancel.
      
      Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task")
      Reported-by: <syzbot+c0d52d0b3c0c3ffb9525@syzkaller.appspotmail.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use bottom half safe lock for fixed file data · ac0648a5
      Jens Axboe authored
      
      
      io_file_data_ref_zero() can be invoked from soft-irq from the RCU core,
      hence we need to ensure that the file_data lock is bottom half safe. Use
      the _bh() variants when grabbing this lock.
      
      Reported-by: <syzbot+1f4ba1e5520762c523c6@syzkaller.appspotmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix miscounting ios_left · bd5bbda7
      Pavel Begunkov authored
      
      
      io_req_init() doesn't decrement state->ios_left if a request doesn't
      need ->file; it just returns early on if (!needs_file). That's not really
      a problem, but it may cause overhead from an additional fput().
      Also inline and kill io_req_set_file() as it's of no use anymore.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: change submit file state invariant · 6e1271e6
      Pavel Begunkov authored
      
      
      Keep the submit state invariant of whether there are file refs left based
      on state->nr_refs instead of (state->file == NULL), and always check
      against the first one. It's easier to track and allows removing one
      if-check. It also automatically leaves struct submit_state in a consistent
      state after io_submit_state_end(); that's not used yet, but nice.
      
      Btw, rename has_refs to file_refs for more clarity.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: check kthread stopped flag when sq thread is unparked · 65b2b213
      Xiaoguang Wang authored
      
      
      syzbot reports following issue:
      INFO: task syz-executor.2:12399 can't die for more than 143 seconds.
      task:syz-executor.2  state:D stack:28744 pid:12399 ppid:  8504 flags:0x00004004
      Call Trace:
       context_switch kernel/sched/core.c:3773 [inline]
       __schedule+0x893/0x2170 kernel/sched/core.c:4522
       schedule+0xcf/0x270 kernel/sched/core.c:4600
       schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847
       do_wait_for_common kernel/sched/completion.c:85 [inline]
       __wait_for_common kernel/sched/completion.c:106 [inline]
       wait_for_common kernel/sched/completion.c:117 [inline]
       wait_for_completion+0x163/0x260 kernel/sched/completion.c:138
       kthread_stop+0x17a/0x720 kernel/kthread.c:596
       io_put_sq_data fs/io_uring.c:7193 [inline]
       io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290
       io_finish_async fs/io_uring.c:7297 [inline]
       io_sq_offload_create fs/io_uring.c:8015 [inline]
       io_uring_create fs/io_uring.c:9433 [inline]
       io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45deb9
      Code: Unable to access opcode bytes at RIP 0x45de8f.
      RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9
      RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5
      RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
      R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c
      INFO: task syz-executor.2:12399 blocked for more than 143 seconds.
            Not tainted 5.10.0-rc3-next-20201110-syzkaller #0
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      
      Currently we don't have a reproducer yet, but there seems to be a
      race in the current code:
      => io_put_sq_data
            ctx_list is empty now.       |
      ==> kthread_park(sqd->thread);     |
                                         | T1: sq thread is parked now.
      ==> kthread_stop(sqd->thread);     |
          KTHREAD_SHOULD_STOP is set now.|
      ===> kthread_unpark(k);            |
                                         | T2: sq thread is now unparked, runs again.
                                         |
                                         | T3: sq thread is now preempted out.
                                         |
      ===> wake_up_process(k);           |
                                         |
                                         | T4: Since sqd ctx_list is empty, needs_sched will be true,
                                         | then sq thread sets its task state to TASK_INTERRUPTIBLE
                                         | and schedules; now the sq thread will never be woken up.
      ===> wait_for_completion           |
      
      I have artificially used mdelay() to simulate the above race and got the
      same stack as in this syzbot report, but to be honest, I'm not sure this
      race is what triggers the syzbot report.
      
      To fix this possible race, when the sq thread is unparked, it needs to
      check whether it has been stopped.
      
      Reported-by: <syzbot+03beeb595f074db9cfd1@syzkaller.appspotmail.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: share fixed_file_refs b/w multiple rsrcs · 36f72fe2
      Pavel Begunkov authored
      
      
      Taking two fixed files for splice/tee is done in a nasty way: it takes 2
      ref_node refs, and the second time it blindly overrides
      req->fixed_file_refs, hoping that it hasn't changed. That works because
      it's all done under uring_lock in a single go, but it is error-prone.
      
      Bind everything explicitly to a single ref_node and take only one ref;
      with the current ref_node ordering it's guaranteed to keep all files valid
      while the request is in flight.
      
      That's mainly a cleanup + preparation for generic resource handling,
      but it also saves percpu_ref get/put for splice/tee with 2 fixed files.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: replace inflight_wait with tctx->wait · c98de08c
      Pavel Begunkov authored
      
      
      As tasks now cancel only their own requests, and inflight_wait is awaited
      only in io_uring_cancel_files(), which should be called with ->in_idle
      set, use tctx->wait instead of keeping a separate inflight_wait.
      
      That adds some spurious wakeups but is actually safer from the point of
      view of not hanging the task.
      
      e.g.
      task1                   | IRQ
                              | *start* io_complete_rw_common(link)
                              |        link: req1 -> req2 -> req3(with files)
      *cancel_files()         |
      io_wq_cancel(), etc.    |
                              | put_req(link), adds req2 to io-wq
      schedule()              |
      
      So, task1 will never try to cancel req2 or req3. If req2 is
      long-standing (e.g. read(empty_pipe)), this may hang.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't take fs for recvmsg/sendmsg · 10cad2c4
      Pavel Begunkov authored
      
      
      We don't even allow non-plain-data msg_control, which is disallowed in
      __sys_{send,recv}msg_sock(). So there is no need for ->fs in
      IORING_OP_SENDMSG and IORING_OP_RECVMSG. fs->lock is not as contended as
      before, but there are still cases where it can be, e.g. IOSQE_ASYNC.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: only wake up sq thread while current task is in io worker context · 2e9dbe90
      Xiaoguang Wang authored
      
      
      If IORING_SETUP_SQPOLL is enabled, sqes are handled either in sq thread
      task context or in io worker task context. If the current task context is
      the sq thread, we don't need to check whether we should wake up the sq
      thread.
      
      io_iopoll_req_issued() calls wq_has_sleeper(), which has smp_mb() memory
      barrier, before this patch, perf shows obvious overhead:
        Samples: 481K of event 'cycles', Event count (approx.): 299807382878
        Overhead  Comma  Shared Object     Symbol
           3.69%  :9630  [kernel.vmlinux]  [k] io_issue_sqe
      
      With this patch, perf shows:
        Samples: 482K of event 'cycles', Event count (approx.): 299929547283
        Overhead  Comma  Shared Object     Symbol
           0.70%  :4015  [kernel.vmlinux]  [k] io_issue_sqe
      
      It shows some obvious improvements.
      
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't acquire uring_lock twice · 906a3c6f
      Xiaoguang Wang authored
      
      
      Both IOPOLL and sqes handling need to acquire uring_lock; combine them
      together so we only need to acquire uring_lock once.
      
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: initialize 'timeout' properly in io_sq_thread() · a0d9205f
      Xiaoguang Wang authored
      
      
      Some static checker reports below warning:
          fs/io_uring.c:6939 io_sq_thread()
          error: uninitialized symbol 'timeout'.
      
      This is a false positive, but let's just initialize 'timeout' to make
      sure we don't trip over this.
      
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: refactor io_sq_thread() handling · 08369246
      Xiaoguang Wang authored
      
      
      There are some issues with the current io_sq_thread() implementation:
        1. The prepare_to_wait() usage in __io_sq_thread() is weird. If
      multiple ctxs share one poll thread, one ctx will put the poll thread
      in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't
      need to change the task's state at all. Only if none of the ctxs have
      work to do should we do it.
        2. We use a round-robin strategy to make multiple ctxs share one
      poll thread, but there are various conditions in __io_sq_thread(), which
      seems complicated and may affect the round-robin strategy.
      
      To address the above issues, take the following actions:
        1. If multiple ctxs share one poll thread, only call prepare_to_wait()
      and schedule() to put the poll thread to sleep when none of the ctxs
      have work to do.
        2. To make the round-robin strategy more straightforward, simplify
      __io_sq_thread() a bit: it just does io poll and sqe submission work
      once, and doesn't check various conditions.
        3. When multiple ctxs share one poll thread, choose the biggest
      sq_thread_idle among those ctxs as the timeout condition, and update
      it when a ctx is attached or detached.
        4. There is no need to check EBUSY specially; if io_submit_sqes()
      returns EBUSY, IORING_SQ_CQ_OVERFLOW should be set, and the helper in
      liburing should notice the cq overflow and enter the kernel to flush work.
      
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: always batch cancel in *cancel_files() · f6edbabb
      Pavel Begunkov authored
      
      
      Instead of iterating over each request and cancelling it individually in
      io_uring_cancel_files(), try to cancel all matching requests and use
      ->inflight_list only to check if there is anything left.
      
      In many cases it should be faster, and we can reuse a lot of code from
      task cancellation.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: pass files into kill timeouts/poll · 6b81928d
      Pavel Begunkov authored
      
      
      Make io_poll_remove_all() and io_kill_timeouts() match against files
      as well. A preparation patch, effectively unused for now.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't iterate io_uring_cancel_files() · b52fda00
      Pavel Begunkov authored
      
      
      io_uring_cancel_files() guarantees to cancel all matching requests, so
      there is no need to call it in a loop. Move it up the callchain into
      io_uring_cancel_task_requests().
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: cancel only requests of current task · df9923f9
      Pavel Begunkov authored
      
      
      io_uring_cancel_files() cancels all requests that match files, regardless
      of task. There is no real need for that; cancel only requests of the
      specified task. That also handles the SQPOLL case, as it already switches
      the task to it.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add a {task,files} pair matching helper · 08d23634
      Pavel Begunkov authored
      
      
      Add io_match_task() that matches both task and files.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: simplify io_task_match() · 06de5f59
      Pavel Begunkov authored
      
      
      If IORING_SETUP_SQPOLL is set all requests belong to the corresponding
      SQPOLL task, so skip task checking in that case and always match.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: inline io_import_iovec() · 2846c481
      Pavel Begunkov authored
      
      
      Inline io_import_iovec() and leave only its former __io_import_iovec(),
      renamed to the original name. That makes it more obvious what is reused
      in io_read/write().
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove duplicated io_size from rw · 632546c4
      Pavel Begunkov authored
      
      
      io_size and iov_count in io_read() and io_write() hold the same value;
      kill the latter.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs/io_uring: Don't use the return value from import_iovec() · 10fc72e4
      David Laight authored
      
      
      This is the only code that relies on import_iovec() returning
      iter.count on success. Removing that reliance allows a better interface
      to import_iovec().
      
      Signed-off-by: David Laight <david.laight@aculab.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: NULL files dereference by SQPOLL · 1a38ffc9
      Pavel Begunkov authored
      
      
      The SQPOLL task may find sqo_task->files == NULL, and
      __io_sq_thread_acquire_files() would leave it unset, so the following
      fget_many() and others try to dereference NULL and fault. Propagate
      an error when files are missing.
      
      [  118.962785] BUG: kernel NULL pointer dereference, address:
      	0000000000000020
      [  118.963812] #PF: supervisor read access in kernel mode
      [  118.964534] #PF: error_code(0x0000) - not-present page
      [  118.969029] RIP: 0010:__fget_files+0xb/0x80
      [  119.005409] Call Trace:
      [  119.005651]  fget_many+0x2b/0x30
      [  119.005964]  io_file_get+0xcf/0x180
      [  119.006315]  io_submit_sqes+0x3a4/0x950
      [  119.007481]  io_sq_thread+0x1de/0x6a0
      [  119.007828]  kthread+0x114/0x150
      [  119.008963]  ret_from_fork+0x22/0x30
      
      Reported-by: Josef Grieb <josef.grieb@gmail.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add timeout support for io_uring_enter() · c73ebb68
      Hao Xu authored
      
      
      Currently, users who want to get woken when waiting for events have to
      submit a timeout command first. That is not safe for applications that
      split SQ and CQ handling between two threads, such as mysql: users have
      to synchronize the two threads explicitly to protect the SQ, and that
      impacts performance.
      
      This patch adds timeout support to the existing io_uring_enter(). To
      avoid overloading the arguments, it introduces a new parameter structure
      which contains the sigmask and the timeout.
      
      I have tested workloads with one thread submitting nop requests
      while the other reaps the cqes with a timeout. It shows a 1.8~2x speedup
      when the iodepth is 16.
      
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      [axboe: various cleanups/fixes, and name change to SIG_IS_DATA]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: only plug when appropriate · 27926b68
      Jens Axboe authored
      
      
      We unconditionally call blk_start_plug() when starting IO submission,
      but we should really only do that if we have more than one request to
      submit AND we're potentially dealing with block-based storage
      underneath. For any other type of request, it's just a waste of time.
      
      Add a ->plug bit to io_op_def and set it for read/write requests. We
      could make this more precise and check the file itself as well, but it
      doesn't matter that much and would quickly become more expensive.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: rearrange io_kiocb fields for better caching · 0415767e
      Pavel Begunkov authored
      
      
      We've got 8 spare bytes in the 2nd cacheline; put ->fixed_file_refs
      there, so the inline execution path mostly doesn't touch the 3rd
      cacheline for fixed-file requests either.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: link requests with singly linked list · f2f87370
      Pavel Begunkov authored
      
      
      A singly linked list is enough for keeping linked requests, because we
      almost always operate on the head and traverse forward, with the
      exception of linked timeouts going one hop backwards.
      
      Replace ->link_list with a handmade singly linked list. Also kill
      REQ_F_LINK_HEAD in favour of checking a newly added ->list for NULL
      directly.
      
      That saves 8B in io_kiocb, avoids the heavy list fixups, makes better
      use of cache by not touching the previous request (i.e. the last request
      of the link) on each list modification, optimises cache use further in
      the following patch, and actually makes traversal easier, removing some
      lines in the end. Also, keeping the invariant in ->list instead of in
      REQ_F_LINK_HEAD is less error-prone.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: track link timeout's master explicitly · 90cd7e42
      Pavel Begunkov authored
      
      
      In preparation for converting to singly linked lists for chaining
      requests, make linked timeouts save the requests they're responsible
      for, rather than counting on the doubly linked list for back-referencing.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: track link's head and tail during submit · 863e0560
      Pavel Begunkov authored
      
      
      Explicitly save not only a link's head in io_submit_sqe[s]() but the
      tail as well. That's in preparation for keeping linked requests in a
      singly linked list.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: split poll and poll_remove structs · 018043be
      Pavel Begunkov authored
      
      
      Don't use a single struct for polls and poll remove requests, they have
      totally different layouts.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_UNLINKAT · 14a1143b
      Jens Axboe authored
      
      
      IORING_OP_UNLINKAT behaves like unlinkat(2) and takes the same flags
      and arguments.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_RENAMEAT · 80a261fd
      Jens Axboe authored
      
      
      IORING_OP_RENAMEAT behaves like renameat2(), and takes the same flags
      etc.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: make do_renameat2() take struct filename · e886663c
      Jens Axboe authored
      
      
      Pass in the struct filename pointers instead of the user string, and
      update the three callers to do the same.
      
      This behaves like do_unlinkat(), which also takes a filename struct and
      puts it when it is done. Converting callers is then trivial.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable file table usage for SQPOLL rings · 14587a46
      Jens Axboe authored
      
      
      Now that SQPOLL supports non-registered files and grabs the file table,
      we can relax the restriction on open/close/accept/connect and allow
      them on a ring that is setup with IORING_SETUP_SQPOLL.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow non-fixed files with SQPOLL · 28cea78a
      Jens Axboe authored
      
      
      The restriction of needing fixed files for SQPOLL is problematic, and
      prevents or inhibits several valid use cases. With the referenced
      files_struct that we have now, it's trivially supportable.
      
      Treat ->files like we do the mm for the SQPOLL thread - grab a reference
      to it (and assign it), and drop it when we're done.
      
      This feature is exposed as IORING_FEAT_SQPOLL_NONFIXED.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Nov 24, 2020