  May 01, 2022
    • selftests/bpf: Add test for reg2btf_ids out of bounds access · f59e6886
      Kumar Kartikeya Dwivedi authored
      
      
      commit 13c6a37d upstream.
      
      This test tries to pass a PTR_TO_BTF_ID_OR_NULL to the release function,
      which would trigger an out-of-bounds access without the fix in commit
      45ce4b4f ("bpf: Fix crash due to out of bounds access into reg2btf_ids.").
      After the fix, the lookup only indexes using base_type(reg->type), which
      should be less than __BPF_REG_TYPE_MAX, and no type flags are permitted
      to be set on reg->type.
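
      As a rough illustration of the hardened indexing (a minimal sketch; the
      mask layout here is an assumption for the sketch, not copied from the
      kernel headers):

        /*
         * Illustrative sketch only: reg2btf_ids must only ever be indexed
         * by the base type, with all flag bits stripped and rejected.
         */
        #define BPF_BASE_TYPE_BITS  8
        #define BPF_BASE_TYPE_MASK  ((1U << BPF_BASE_TYPE_BITS) - 1)

        static inline unsigned int base_type(unsigned int type)
        {
                return type & BPF_BASE_TYPE_MASK;  /* in-range array index */
        }

        static inline unsigned int type_flag(unsigned int type)
        {
                return type & ~BPF_BASE_TYPE_MASK; /* e.g. PTR_MAYBE_NULL */
        }

        /* ... in the kfunc argument check ... */
        if (reg2btf_ids[base_type(reg->type)] && !type_flag(reg->type)) {
                /* safe: flag-free, in-bounds base type */
        }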
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220220023138.2224652-1-memxor@gmail.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: gup: make fault_in_safe_writeable() use fixup_user_fault() · dcecd95a
      Linus Torvalds authored
      
      
      commit fe673d3f upstream
      
      Instead of using GUP, make fault_in_safe_writeable() actually force a
      'handle_mm_fault()' using the same fixup_user_fault() machinery that
      futexes already use.
      
      Using the GUP machinery meant that fault_in_safe_writeable() did not do
      everything that a real fault would do, ranging from not auto-expanding
      the stack segment, to not updating accessed or dirty flags in the page
      tables (GUP sets those flags on the pages themselves).
      
      The latter caused problems on architectures (like s390) that do accessed
      bit handling in software: fault_in_safe_writeable() didn't actually do
      all the fault handling it needed to, and trying to access the user
      address afterwards would still fault.
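
      As a sketch of the new shape of the helper (simplified from this
      description; the real patch also handles the size == 0 and
      address-wraparound corner cases):

        size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
        {
                unsigned long start = (unsigned long)uaddr;
                unsigned long end = PAGE_ALIGN(start + size);
                struct mm_struct *mm = current->mm;
                bool unlocked = false;

                mmap_read_lock(mm);
                do {
                        /* A real fault: expands the stack, sets accessed/dirty. */
                        if (fixup_user_fault(mm, start, FAULT_FLAG_WRITE,
                                             &unlocked))
                                break;
                        start = (start + PAGE_SIZE) & PAGE_MASK;
                } while (start != end);
                mmap_read_unlock(mm);

                /* Return the number of trailing bytes not faulted in. */
                if (start < (unsigned long)uaddr + size)
                        return (unsigned long)uaddr + size - start;
                return 0;
        }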
      
      Reported-and-tested-by: Andreas Gruenbacher <agruenba@redhat.com>
      Fixes: cdd591fc ("iov_iter: Introduce fault_in_iov_iter_writeable")
      Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/
      
      
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: fallback to blocking mode when doing async dio over multiple extents · 4a0123bd
      Filipe Manana authored
      commit ca93e44b upstream
      
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
         the first 4K extent, and sets up an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2; The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
      6) iomap calls btrfs_dio_iomap_begin() again, this time with a file
         offset of X + 4K. There we set up an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete(), with a second argument value of 4K (total io
          done) and the iocb with the adjusted ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if it asks
          to read N bytes, it expects to get N bytes read, as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
      boundary. This is typical in many applications, though it's arguably
      questionable whether they should react to such short reads by issuing
      further read calls to get the remaining data. Nevertheless, the short
      read happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't keep the previous behaviour of returning all the data
      that was requested by the application.
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although that reproducer always triggers
      the bug in my test VMs, it's possible that it will not be so reliable
      on other environments, as that can happen if the bio for the first
      extent completes and decrements the reference on the struct iomap_dio
      object before we do the atomic_dec_and_test() on the reference at
      __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non-blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO, which is not optimal for writes and is actually incorrect
      for reads, as the caller doesn't expect to get fewer bytes read than it
      requested (unless EOF is crossed), as previously mentioned. This is also
      the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
      even though it doesn't use IOMAP_DIO_PARTIAL.
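
      A sketch of the check this adds in btrfs_dio_iomap_begin() (variable
      names assumed from this description, not the verbatim patch):

        /* 'len' is how much of the requested 'length' the found extent
         * covers; for nowait IO, refuse to do just a part of the range. */
        if ((flags & IOMAP_NOWAIT) && len < length) {
                free_extent_map(em);
                ret = -EAGAIN;
                goto unlock_err;
        }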
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      
      
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: fix deadlock due to page faults during direct IO reads and writes · c81c4f56
      Filipe Manana authored
      commit 51bd9563 upstream
      
      If we do a direct IO read or write when the buffer given by the user is
      memory mapped to the file range we are going to do IO, we end up ending
      in a deadlock. This is triggered by the new test case generic/647 from
      fstests.
      
      For a direct IO read we get a trace like this:
      
        [967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
        [967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
        [967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
        [967.875992] Call Trace:
        [967.875999]  __schedule+0x3ca/0xe10
        [967.876015]  schedule+0x43/0xe0
        [967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
        [967.876109]  ? do_wait_intr_irq+0xb0/0xb0
        [967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
        [967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
        [967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
        [967.876214]  extent_readahead+0x32d/0x530 [btrfs]
        [967.876253]  ? lru_cache_add+0x104/0x220
        [967.876255]  ? kvm_sched_clock_read+0x14/0x40
        [967.876258]  ? sched_clock_cpu+0xd/0x110
        [967.876263]  ? lock_release+0x155/0x4a0
        [967.876271]  read_pages+0x86/0x270
        [967.876274]  ? lru_cache_add+0x125/0x220
        [967.876281]  page_cache_ra_unbounded+0x1a3/0x220
        [967.876291]  filemap_fault+0x626/0xa20
        [967.876303]  __do_fault+0x36/0xf0
        [967.876308]  __handle_mm_fault+0x83f/0x15f0
        [967.876322]  handle_mm_fault+0x9e/0x260
        [967.876327]  __get_user_pages+0x204/0x620
        [967.876332]  ? get_user_pages_unlocked+0x69/0x340
        [967.876340]  get_user_pages_unlocked+0xd3/0x340
        [967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
        [967.876366]  iov_iter_get_pages+0x8d/0x3a0
        [967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
        [967.876379]  ? lock_release+0x155/0x4a0
        [967.876387]  iomap_dio_bio_actor+0x232/0x410
        [967.876396]  iomap_apply+0x12a/0x4a0
        [967.876398]  ? iomap_dio_rw+0x30/0x30
        [967.876414]  __iomap_dio_rw+0x29f/0x5e0
        [967.876415]  ? iomap_dio_rw+0x30/0x30
        [967.876420]  ? lock_acquired+0xf3/0x420
        [967.876429]  iomap_dio_rw+0xa/0x30
        [967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
        [967.876460]  new_sync_read+0x118/0x1a0
        [967.876472]  vfs_read+0x128/0x1b0
        [967.876477]  __x64_sys_pread64+0x90/0xc0
        [967.876483]  do_syscall_64+0x3b/0xc0
        [967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [967.876490] RIP: 0033:0x7fb6f2c038d6
        [967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
        [967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
        [967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
        [967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
        [967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
        [967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
      
      This happens because at btrfs_dio_iomap_begin() we lock the extent range
      and return with it locked - we only unlock in the endio callback, at
      end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
      iomap has called the btrfs_dio_iomap_begin() callback, it triggers the
      page faults that result in reading the pages, through the readahead
      callback btrfs_readahead(), and there we end up attempting to lock the
      same extent range again (or a subrange of what we locked before),
      resulting in the deadlock.
      
      For a direct IO write, the scenario is a bit different, and it results
      in a trace like this:
      
        [1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
        [1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
        [1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
        [1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
        [1330.351906] Call Trace:
        [1330.351913]  __schedule+0x3ca/0xe10
        [1330.351930]  schedule+0x43/0xe0
        [1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
        [1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
        [1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
        [1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
        [1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
        [1330.352133]  ? lru_cache_add+0x104/0x220
        [1330.352135]  ? kvm_sched_clock_read+0x14/0x40
        [1330.352138]  ? sched_clock_cpu+0xd/0x110
        [1330.352143]  ? lock_release+0x155/0x4a0
        [1330.352151]  read_pages+0x86/0x270
        [1330.352155]  ? lru_cache_add+0x125/0x220
        [1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
        [1330.352172]  filemap_fault+0x626/0xa20
        [1330.352176]  ? filemap_map_pages+0x18b/0x660
        [1330.352184]  __do_fault+0x36/0xf0
        [1330.352189]  __handle_mm_fault+0x1253/0x15f0
        [1330.352203]  handle_mm_fault+0x9e/0x260
        [1330.352208]  __get_user_pages+0x204/0x620
        [1330.352212]  ? get_user_pages_unlocked+0x69/0x340
        [1330.352220]  get_user_pages_unlocked+0xd3/0x340
        [1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
        [1330.352246]  iov_iter_get_pages+0x8d/0x3a0
        [1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
        [1330.352259]  ? lock_release+0x155/0x4a0
        [1330.352266]  iomap_dio_bio_actor+0x232/0x410
        [1330.352275]  iomap_apply+0x12a/0x4a0
        [1330.352278]  ? iomap_dio_rw+0x30/0x30
        [1330.352292]  __iomap_dio_rw+0x29f/0x5e0
        [1330.352294]  ? iomap_dio_rw+0x30/0x30
        [1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
        [1330.352339]  new_sync_write+0x11f/0x1b0
        [1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
        [1330.352354]  vfs_write+0x292/0x3c0
        [1330.352359]  __x64_sys_pwrite64+0x90/0xc0
        [1330.352365]  do_syscall_64+0x3b/0xc0
        [1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [1330.352372] RIP: 0033:0x7f4b0a580986
        [1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
        [1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
        [1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
        [1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
        [1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
        [1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      
      Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
      range unlocked, but later when the page faults are triggered and we try
      to read the extents, we end up at btrfs_lock_and_flush_ordered_range()
      where we find the ordered extent for our write, created by the iomap
      callback btrfs_dio_iomap_begin(), and we wait for it to complete, which
      makes us deadlock since we can't complete the ordered extent without
      reading the pages (the iomap code only submits the bio after the pages
      are faulted in).
      
      Fix this by setting the nofault attribute of the given iov_iter and
      retrying the direct IO read/write if we get an -EFAULT error returned
      from iomap. For reads, also disable page faults completely, because when
      we read from a hole or a prealloc extent, we can still trigger page
      faults due to the call to iov_iter_zero() done by iomap - at the moment,
      it is oblivious to the value of the ->nofault attribute of an iov_iter.
      We also need to keep track of the number of bytes written or read, and
      pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
      The read side looks roughly as sketched below.
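
      A condensed sketch of the read-side retry loop (the write side is
      analogous; names follow this description, and the real patch also bails
      out when a retry makes no progress):

        again:
                pagefault_disable();
                to->nofault = true;
                ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops,
                                   &btrfs_dio_ops, IOMAP_DIO_PARTIAL, read);
                to->nofault = false;
                pagefault_enable();

                if (ret > 0)
                        read = ret;  /* iomap returns a cumulative value */

                if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
                        /* Fault in the missing pages and resume the IO. */
                        fault_in_iov_iter_writeable(to, iov_iter_count(to));
                        goto again;
                }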
      
      This depends on the iov_iter and iomap changes introduced in commit
      c03098d4 ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
      git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Fix mmap + page fault deadlocks for direct I/O · 640a6be8
      Andreas Gruenbacher authored
      
      
      commit b01b2d72 upstream
      
      Also disable page faults during direct I/O requests and implement
      retry logic similar to that of the buffered I/O case.
      
      The retry logic in the direct I/O case differs from the buffered I/O
      case in the following way: direct I/O doesn't provide the kinds of
      consistency guarantees between concurrent reads and writes that buffered
      I/O provides, so once we lose the inode glock while faulting in user
      pages, we always resume the operation.  We never need to return a
      partial read or write.
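
      Conceptually, the direct read side becomes something like the sketch
      below (glock plumbing heavily simplified: the real patch lets the glock
      be demoted while faulting rather than dropping it unconditionally):

        retry:
                pagefault_disable();
                to->nofault = true;
                ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
                                   IOMAP_DIO_PARTIAL, read);
                to->nofault = false;
                pagefault_enable();
                if (ret > 0)
                        read = ret;

                if (ret == -EFAULT) {
                        /* Fault in the pages and always resume: direct I/O
                         * makes no consistency promises we would break. */
                        fault_in_iov_iter_writeable(to, iov_iter_count(to));
                        goto retry;
                }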
      
      This locking problem was originally reported by Jan Kara.  Linus came up
      with the idea of disabling page faults.  Many thanks to Al Viro and
      Matthew Wilcox for their feedback.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iov_iter: Introduce nofault flag to disable page faults · f86f8d27
      Andreas Gruenbacher authored
      
      
      commit 3337ab08 upstream
      
      Introduce a new nofault flag to indicate to iov_iter_get_pages not to
      fault in user pages.
      
      This is implemented by passing the FOLL_NOFAULT flag to get_user_pages,
      which causes get_user_pages to fail when it would otherwise fault in a
      page. We'll use the ->nofault flag to prevent iomap_dio_rw from faulting
      in pages when page faults are not allowed.
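
      The gist of the plumbing, as a sketch of the iov_iter_get_pages() side
      (simplified from this description):

        unsigned int gup_flags = 0;

        if (iov_iter_rw(i) != WRITE)
                gup_flags |= FOLL_WRITE;
        if (i->nofault)
                gup_flags |= FOLL_NOFAULT;  /* fail instead of faulting */

        res = get_user_pages_fast(addr, n, gup_flags, pages);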
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gup: Introduce FOLL_NOFAULT flag to disable page faults · 6e213bc6
      Andreas Gruenbacher authored
      
      
      commit 55b8fe70 upstream
      
      Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
      -EFAULT when it would otherwise trigger a page fault.  This is roughly
      similar to FOLL_FAST_ONLY but available on all architectures, and less
      fragile.
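
      A sketch of where the flag takes effect, early in gup's faultin_page()
      (simplified; the surrounding signature is an assumption for the sketch):

        static int faultin_page(struct vm_area_struct *vma,
                                unsigned long address, unsigned int *flags,
                                int *locked)
        {
                unsigned int fault_flags = 0;

                if (*flags & FOLL_NOFAULT)
                        return -EFAULT;  /* caller forbids faulting */

                /* ... otherwise fall through to handle_mm_fault() ... */
        }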
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: Add done_before argument to iomap_dio_rw · d3b74479
      Andreas Gruenbacher authored
      
      
      commit 4fdccaa0 upstream
      
      Add a done_before argument to iomap_dio_rw that indicates how much of
      the request has already been transferred.  When the request succeeds, we
      report that done_before additional bytes were transferred.  This is
      useful for finishing a request asynchronously when part of the request
      has already been completed synchronously.
      
      We'll use that to allow iomap_dio_rw to be used with page faults
      disabled: when a page fault occurs while submitting a request, we
      synchronously complete the part of the request that has already been
      submitted.  The caller can then take care of the page fault and call
      iomap_dio_rw again for the rest of the request, passing in the number of
      bytes already transferred.
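
      The resulting caller pattern, roughly ('written' being the bytes a
      previous attempt already transferred; names assumed for the sketch):

        /* iomap adds done_before to a successful return value, so the
         * caller can report the full transfer size in one completion. */
        ret = iomap_dio_rw(iocb, iter, ops, dops, dio_flags, written);

        /* ... and inside iomap_dio_complete(), per this description: */
        if (ret > 0)
                ret += dio->done_before;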
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: Support partial direct I/O on user copy failures · ea7a5785
      Andreas Gruenbacher authored
      
      
      commit 97308f8b upstream
      
      In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
      IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
      return a partial result.  This allows the caller to deal with the page
      fault and retry the remainder of the request.
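
      A sketch of the __iomap_dio_rw() behaviour this describes:

        if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
                /* Progress was made: complete the already-submitted part
                 * synchronously and report a short transfer, not an error. */
                if (!(iocb->ki_flags & IOCB_NOWAIT))
                        wait_for_completion = true;
                ret = 0;
        }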
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: Fix iomap_dio_rw return value for user copies · a00cc46f
      Andreas Gruenbacher authored
      
      
      commit 42c498c1 upstream
      
      When a user copy fails in one of the helpers of iomap_dio_rw, fail with
      -EFAULT instead of returning 0.  This matches what iomap_dio_bio_actor
      returns when it gets an -EFAULT from bio_iov_iter_get_pages.  With these
      changes, iomap_dio_actor now consistently fails with -EFAULT when a user
      page cannot be faulted in.
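
      The pattern applied in those helpers, roughly (buffer name assumed):

        copied = copy_to_iter(inline_data, length, iter);
        if (!copied)
                return -EFAULT;  /* previously bubbled up as 0 */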
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Fix mmap + page fault deadlocks for buffered I/O · 81a7fc39
      Andreas Gruenbacher authored
      
      
      commit 00bfe02f upstream
      
      In the .read_iter and .write_iter file operations, we're accessing
      user-space memory while holding the inode glock.  There is a possibility
      that the memory is mapped to the same file, in which case we'd recurse
      on the same glock.
      
      We could detect and work around this simple case of recursive locking,
      but more complex scenarios exist that involve multiple glocks,
      processes, and cluster nodes, and working around all of those cases
      isn't practical or even possible.
      
      Avoid these kinds of problems by disabling page faults while holding the
      inode glock.  If a page fault would occur, we either end up with a
      partial read or write or with -EFAULT if nothing could be read or
      written.  In either case, we know that we're not done with the
      operation, so we indicate that we're willing to give up the inode glock
      and then we fault in the missing pages.  If that made us lose the inode
      glock, we return a partial read or write.  Otherwise, we resume the
      operation.
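
      Conceptually, the buffered-read side becomes something like the sketch
      below (holder-demotion helper names follow the gfs2 series; heavily
      simplified):

        retry:
                pagefault_disable();
                ret = generic_file_read_iter(iocb, to);
                pagefault_enable();
                if (ret > 0)
                        read += ret;

                if (ret == -EFAULT || (ret > 0 && iov_iter_count(to) > 0)) {
                        gfs2_holder_allow_demote(gh);   /* may lose the glock */
                        fault_in_iov_iter_writeable(to, iov_iter_count(to));
                        gfs2_holder_disallow_demote(gh);
                        if (gfs2_holder_queued(gh))
                                goto retry;             /* still held: resume */
                        /* glock lost: return the partial read */
                }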
      
      This locking problem was originally reported by Jan Kara.  Linus came up
      with the idea of disabling page faults.  Many thanks to Al Viro and
      Matthew Wilcox for their feedback.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>