  1. May 01, 2022
    • selftests/bpf: Add test for reg2btf_ids out of bounds access · f59e6886
      Kumar Kartikeya Dwivedi authored
      commit 13c6a37d upstream.
      
      This test tries to pass a PTR_TO_BTF_ID_OR_NULL to the release function,
      which would trigger an out of bounds access without the fix in commit
      45ce4b4f ("bpf: Fix crash due to out of bounds access into reg2btf_ids."),
      but after the fix, it should only index using base_type(reg->type),
      which should be less than __BPF_REG_TYPE_MAX, and also not permit any
      type flags to be set for the reg->type.
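
      As a sketch of why that is safe (assuming the 8-bit base-type layout
      introduced by the composable-types series, not quoting the kernel
      verbatim): the flags live above the low bits, so masking them off
      always yields an index below __BPF_REG_TYPE_MAX:

        #define BPF_BASE_TYPE_BITS 8
        #define BPF_BASE_TYPE_MASK ((1u << BPF_BASE_TYPE_BITS) - 1)

        static inline unsigned int base_type(unsigned int reg_type)
        {
            /* Drop flag bits such as PTR_MAYBE_NULL; keep the base type. */
            return reg_type & BPF_BASE_TYPE_MASK;
        }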
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220220023138.2224652-1-memxor@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: gup: make fault_in_safe_writeable() use fixup_user_fault() · dcecd95a
      Linus Torvalds authored
      commit fe673d3f upstream
      
      Instead of using GUP, make fault_in_safe_writeable() actually force a
      'handle_mm_fault()' using the same fixup_user_fault() machinery that
      futexes already use.
      
      Using the GUP machinery meant that fault_in_safe_writeable() did not do
      everything that a real fault would do, ranging from not auto-expanding
      the stack segment, to not updating accessed or dirty flags in the page
      tables (GUP sets those flags on the pages themselves).
      
      The latter causes problems on architectures (like s390) that do accessed
      bit handling in software, which meant that fault_in_safe_writeable()
      didn't actually do all the fault handling it needed to, and trying to
      access the user address afterwards would still cause faults.
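
      A simplified sketch of the fixed helper (close to, but not verbatim,
      the upstream code): walk the range page by page through
      fixup_user_fault(), so that a real write fault is handled for each
      page:

        size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
        {
            unsigned long start = (unsigned long)uaddr;
            unsigned long end = PAGE_ALIGN(start + size);
            struct mm_struct *mm = current->mm;
            bool unlocked = false;

            if (unlikely(size == 0))
                return 0;

            mmap_read_lock(mm);
            do {
                /* Take a real fault, like the futex code does. */
                if (fixup_user_fault(mm, start, FAULT_FLAG_WRITE, &unlocked))
                    break;
                start = (start + PAGE_SIZE) & PAGE_MASK;
            } while (start != end);
            mmap_read_unlock(mm);

            /* Report how many bytes could not be faulted in. */
            if (size > start - (unsigned long)uaddr)
                return size - (start - (unsigned long)uaddr);
            return 0;
        }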
      
      Reported-and-tested-by: Andreas Gruenbacher <agruenba@redhat.com>
      Fixes: cdd591fc ("iov_iter: Introduce fault_in_iov_iter_writeable")
      Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: fallback to blocking mode when doing async dio over multiple extents · 4a0123bd
      Filipe Manana authored
      commit ca93e44b upstream
      
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
         the first 4K extent, and sets up an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2; The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
      6) iomap calls btrfs_dio_iomap_begin() again, this time with a file
         offset of X + 4K. There we set up an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete() with a second argument value of 4K (total io
          done) and the iocb with the adjusted ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if we ask
          to read N bytes, it expects to get N bytes read as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
      boundary. This is typical in many applications, but it may also be
      questionable whether they should react to such short reads by issuing
      more read calls to get the remaining data. Nevertheless, the short read
      happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't have the previous behaviour of returning the whole data
      that was requested by the application.
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although that reproducer always triggers
      the bug in my test VMs, it's possible that it will not be so reliable
      on other environments, as that can happen if the bio for the first
      extent completes and decrements the reference on the struct iomap_dio
      object before we do the atomic_dec_and_test() on the reference at
      __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO, which is not optimal for writes and is actually incorrect
      for reads, as the caller doesn't expect to get fewer bytes read than it
      requested (unless EOF is crossed), as previously mentioned. This is also
      the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
      even though it doesn't use IOMAP_DIO_PARTIAL.
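
      The shape of that check, as a sketch of the idea rather than the exact
      btrfs diff ('len' being the length of the extent found at 'start' and
      'length' the length iomap asked for):

        if ((flags & IOMAP_NOWAIT) && len < length) {
            free_extent_map(em);
            ret = -EAGAIN;    /* make the caller retry in blocking mode */
            goto unlock_err;
        }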
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: fix deadlock due to page faults during direct IO reads and writes · c81c4f56
      Filipe Manana authored
      commit 51bd9563 upstream
      
      If we do a direct IO read or write when the buffer given by the user is
      memory mapped to the file range we are going to do IO to, we end up in
      a deadlock. This is triggered by the new test case generic/647 from
      fstests.
      
      For a direct IO read we get a trace like this:
      
        [967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
        [967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
        [967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
        [967.875992] Call Trace:
        [967.875999]  __schedule+0x3ca/0xe10
        [967.876015]  schedule+0x43/0xe0
        [967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
        [967.876109]  ? do_wait_intr_irq+0xb0/0xb0
        [967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
        [967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
        [967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
        [967.876214]  extent_readahead+0x32d/0x530 [btrfs]
        [967.876253]  ? lru_cache_add+0x104/0x220
        [967.876255]  ? kvm_sched_clock_read+0x14/0x40
        [967.876258]  ? sched_clock_cpu+0xd/0x110
        [967.876263]  ? lock_release+0x155/0x4a0
        [967.876271]  read_pages+0x86/0x270
        [967.876274]  ? lru_cache_add+0x125/0x220
        [967.876281]  page_cache_ra_unbounded+0x1a3/0x220
        [967.876291]  filemap_fault+0x626/0xa20
        [967.876303]  __do_fault+0x36/0xf0
        [967.876308]  __handle_mm_fault+0x83f/0x15f0
        [967.876322]  handle_mm_fault+0x9e/0x260
        [967.876327]  __get_user_pages+0x204/0x620
        [967.876332]  ? get_user_pages_unlocked+0x69/0x340
        [967.876340]  get_user_pages_unlocked+0xd3/0x340
        [967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
        [967.876366]  iov_iter_get_pages+0x8d/0x3a0
        [967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
        [967.876379]  ? lock_release+0x155/0x4a0
        [967.876387]  iomap_dio_bio_actor+0x232/0x410
        [967.876396]  iomap_apply+0x12a/0x4a0
        [967.876398]  ? iomap_dio_rw+0x30/0x30
        [967.876414]  __iomap_dio_rw+0x29f/0x5e0
        [967.876415]  ? iomap_dio_rw+0x30/0x30
        [967.876420]  ? lock_acquired+0xf3/0x420
        [967.876429]  iomap_dio_rw+0xa/0x30
        [967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
        [967.876460]  new_sync_read+0x118/0x1a0
        [967.876472]  vfs_read+0x128/0x1b0
        [967.876477]  __x64_sys_pread64+0x90/0xc0
        [967.876483]  do_syscall_64+0x3b/0xc0
        [967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [967.876490] RIP: 0033:0x7fb6f2c038d6
        [967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
        [967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
        [967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
        [967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
        [967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
        [967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
      
      This happens because at btrfs_dio_iomap_begin() we lock the extent range
      and return with it locked - we only unlock in the endio callback, at
      end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
      iomap calls the btrfs_dio_iomap_begin() callback, it triggers the page
      faults that result in reading the pages, through the readahead callback
      btrfs_readahead(), and through there we end up attempting to lock the
      same extent range again (or a subrange of what we locked before),
      resulting in the deadlock.
      
      For a direct IO write, the scenario is a bit different, and it results
      in a trace like this:
      
        [1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
        [1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
        [1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
        [1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
        [1330.351906] Call Trace:
        [1330.351913]  __schedule+0x3ca/0xe10
        [1330.351930]  schedule+0x43/0xe0
        [1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
        [1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
        [1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
        [1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
        [1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
        [1330.352133]  ? lru_cache_add+0x104/0x220
        [1330.352135]  ? kvm_sched_clock_read+0x14/0x40
        [1330.352138]  ? sched_clock_cpu+0xd/0x110
        [1330.352143]  ? lock_release+0x155/0x4a0
        [1330.352151]  read_pages+0x86/0x270
        [1330.352155]  ? lru_cache_add+0x125/0x220
        [1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
        [1330.352172]  filemap_fault+0x626/0xa20
        [1330.352176]  ? filemap_map_pages+0x18b/0x660
        [1330.352184]  __do_fault+0x36/0xf0
        [1330.352189]  __handle_mm_fault+0x1253/0x15f0
        [1330.352203]  handle_mm_fault+0x9e/0x260
        [1330.352208]  __get_user_pages+0x204/0x620
        [1330.352212]  ? get_user_pages_unlocked+0x69/0x340
        [1330.352220]  get_user_pages_unlocked+0xd3/0x340
        [1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
        [1330.352246]  iov_iter_get_pages+0x8d/0x3a0
        [1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
        [1330.352259]  ? lock_release+0x155/0x4a0
        [1330.352266]  iomap_dio_bio_actor+0x232/0x410
        [1330.352275]  iomap_apply+0x12a/0x4a0
        [1330.352278]  ? iomap_dio_rw+0x30/0x30
        [1330.352292]  __iomap_dio_rw+0x29f/0x5e0
        [1330.352294]  ? iomap_dio_rw+0x30/0x30
        [1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
        [1330.352339]  new_sync_write+0x11f/0x1b0
        [1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
        [1330.352354]  vfs_write+0x292/0x3c0
        [1330.352359]  __x64_sys_pwrite64+0x90/0xc0
        [1330.352365]  do_syscall_64+0x3b/0xc0
        [1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [1330.352372] RIP: 0033:0x7f4b0a580986
        [1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
        [1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
        [1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
        [1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
        [1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
        [1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      
      Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
      range unlocked, but later when the page faults are triggered and we try
      to read the extents, we end up at btrfs_lock_and_flush_ordered_range(),
      where we find the ordered extent for our write, created by the iomap
      callback btrfs_dio_iomap_begin(), and we wait for it to complete, which
      makes us deadlock since we can't complete the ordered extent without
      reading the pages (the iomap code only submits the bio after the pages
      are faulted in).
      
      Fix this by setting the nofault attribute of the given iov_iter and
      retrying the direct IO read/write if we get an -EFAULT error returned
      from iomap. For reads, also disable page faults completely; this is
      because when we read from a hole or a prealloc extent, we can still
      trigger page faults due to the call to iov_iter_zero() done by iomap -
      at the moment, it is oblivious to the value of the ->nofault attribute
      of an iov_iter. We also need to keep track of the number of bytes
      written or read, and pass it to iomap_dio_rw(), as well as use the new
      flag IOMAP_DIO_PARTIAL.
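
      Put together, the read side follows roughly this pattern (a simplified
      sketch of the patch, with locking and most error handling trimmed):

        static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
        {
            ssize_t read = 0;    /* bytes completed in earlier rounds */
            ssize_t ret;

        again:
            pagefault_disable();
            to->nofault = true;
            ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
                               IOMAP_DIO_PARTIAL, read);
            to->nofault = false;
            pagefault_enable();

            if (ret > 0)
                read = ret;    /* iomap folded 'read' back into the total */
            if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
                /* Fault the missing pages in, then retry the remainder. */
                if (fault_in_iov_iter_writeable(to, iov_iter_count(to)) !=
                    iov_iter_count(to))
                    goto again;
            }
            return read ? read : ret;
        }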
      
      This depends on the iov_iter and iomap changes introduced in commit
      c03098d4 ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
      git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Fix mmap + page fault deadlocks for direct I/O · 640a6be8
      Andreas Gruenbacher authored
      commit b01b2d72 upstream
      
      Also disable page faults during direct I/O requests and implement a
      similar kind of retry logic as in the buffered I/O case.
      
      The retry logic in the direct I/O case differs from the buffered I/O
      case in the following way: direct I/O doesn't provide the kinds of
      consistency guarantees between concurrent reads and writes that buffered
      I/O provides, so once we lose the inode glock while faulting in user
      pages, we always resume the operation.  We never need to return a
      partial read or write.
      
      This locking problem was originally reported by Jan Kara.  Linus came up
      with the idea of disabling page faults.  Many thanks to Al Viro and
      Matthew Wilcox for their feedback.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iov_iter: Introduce nofault flag to disable page faults · f86f8d27
      Andreas Gruenbacher authored
      commit 3337ab08 upstream
      
      Introduce a new nofault flag to indicate to iov_iter_get_pages not to
      fault in user pages.
      
      This is implemented by passing the FOLL_NOFAULT flag to get_user_pages,
      which causes get_user_pages to fail when it would otherwise fault in a
      page. We'll use the ->nofault flag to prevent iomap_dio_rw from faulting
      in pages when page faults are not allowed.
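
      Inside iov_iter_get_pages this boils down to something like the
      following condensed sketch (not the full function):

        unsigned int gup_flags = 0;

        if (i->nofault)
            gup_flags |= FOLL_NOFAULT;    /* fail instead of faulting */
        if (iov_iter_rw(i) != WRITE)
            gup_flags |= FOLL_WRITE;      /* a read writes into the pages */

        res = get_user_pages_fast(addr, n, gup_flags, pages);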
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gup: Introduce FOLL_NOFAULT flag to disable page faults · 6e213bc6
      Andreas Gruenbacher authored
      commit 55b8fe70 upstream
      
      Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
      -EFAULT when it would otherwise trigger a page fault.  This is roughly
      similar to FOLL_FAST_ONLY but available on all architectures, and less
      fragile.
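
      The core of the change is an early exit in the fault path of
      get_user_pages, roughly like this sketch (the real function carries
      more state):

        static int faultin_page(struct vm_area_struct *vma, unsigned long address,
                                unsigned int *flags, bool *unlocked)
        {
            unsigned int fault_flags = 0;

            if (*flags & FOLL_NOFAULT)
                return -EFAULT;    /* the caller asked us never to fault */
            if (*flags & FOLL_WRITE)
                fault_flags |= FAULT_FLAG_WRITE;

            /* ... handle_mm_fault() and its error handling elided ... */
            return 0;
        }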
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: Add done_before argument to iomap_dio_rw · d3b74479
      Andreas Gruenbacher authored
      commit 4fdccaa0 upstream
      
      Add a done_before argument to iomap_dio_rw that indicates how much of
      the request has already been transferred.  When the request succeeds, we
      report that done_before additional bytes were transferred.  This is
      useful for finishing a request asynchronously when part of the request
      has already been completed synchronously.
      
      We'll use that to allow iomap_dio_rw to be used with page faults
      disabled: when a page fault occurs while submitting a request, we
      synchronously complete the part of the request that has already been
      submitted.  The caller can then take care of the page fault and call
      iomap_dio_rw again for the rest of the request, passing in the number of
      bytes already transferred.
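
      The resulting prototype and its effect, as a sketch:

        ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                             const struct iomap_ops *ops,
                             const struct iomap_dio_ops *dops,
                             unsigned int dio_flags, size_t done_before);

        /* Roughly, on completion inside iomap:
         *     if (ret > 0)
         *         ret += dio->done_before;
         * so the caller sees the whole request, not just this round. */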
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: Support partial direct I/O on user copy failures · ea7a5785
      Andreas Gruenbacher authored
      commit 97308f8b upstream
      
      In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
      IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
      return a partial result.  This allows the caller to deal with the page
      fault and retry the remainder of the request.
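
      In __iomap_dio_rw the -EFAULT is swallowed when partial progress was
      made, roughly like this sketch:

        if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
            if (!(iocb->ki_flags & IOCB_NOWAIT))
                wait_for_completion = true;    /* finish synchronously */
            ret = 0;                           /* report the partial result */
        }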
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: Fix iomap_dio_rw return value for user copies · a00cc46f
      Andreas Gruenbacher authored
      commit 42c498c1 upstream
      
      When a user copy fails in one of the helpers of iomap_dio_rw, fail with
      -EFAULT instead of returning 0.  This matches what iomap_dio_bio_actor
      returns when it gets an -EFAULT from bio_iov_iter_get_pages.  With these
      changes, iomap_dio_actor now consistently fails with -EFAULT when a user
      page cannot be faulted in.
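
      For instance, the helper that zero-fills a hole through the iterator
      now treats a zero-byte copy as a fault, along these lines (sketch):

        length = iov_iter_zero(length, dio->submit.iter);
        dio->size += length;
        if (!length)
            return -EFAULT;    /* nothing copied: user page not resident */
        return length;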
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Fix mmap + page fault deadlocks for buffered I/O · 81a7fc39
      Andreas Gruenbacher authored
      commit 00bfe02f upstream
      
      In the .read_iter and .write_iter file operations, we're accessing
      user-space memory while holding the inode glock.  There is a possibility
      that the memory is mapped to the same file, in which case we'd recurse
      on the same glock.
      
      We could detect and work around this simple case of recursive locking,
      but more complex scenarios exist that involve multiple glocks,
      processes, and cluster nodes, and working around all of those cases
      isn't practical or even possible.
      
      Avoid these kinds of problems by disabling page faults while holding the
      inode glock.  If a page fault would occur, we either end up with a
      partial read or write or with -EFAULT if nothing could be read or
      written.  In either case, we know that we're not done with the
      operation, so we indicate that we're willing to give up the inode glock
      and then we fault in the missing pages.  If that made us lose the inode
      glock, we return a partial read or write.  Otherwise, we resume the
      operation.
      
      This locking problem was originally reported by Jan Kara.  Linus came up
      with the idea of disabling page faults.  Many thanks to Al Viro and
      Matthew Wilcox for their feedback.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Eliminate ip->i_gh · 38b58498
      Andreas Gruenbacher authored
      commit 1b223f70 upstream
      
      Now that gfs2_file_buffered_write is the only remaining user of
      ip->i_gh, we can move the glock holder to the stack (or rather, use the
      one we already have on the stack); there is no need for keeping the
      holder in the inode anymore.
      
      This is slightly complicated by the fact that we're using ip->i_gh for
      the statfs inode in gfs2_file_buffered_write as well.  Writing to the
      statfs inode isn't very common, so allocate the statfs holder
      dynamically when needed.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Move the inode glock locking to gfs2_file_buffered_write · 8d363d81
      Andreas Gruenbacher authored
      commit b924bdab upstream
      
      So far, for buffered writes, we were taking the inode glock in
      gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
      not holding the inode glock while iomap_write_actor faults in user
      pages.  It turns out that iomap_write_actor is called inside iomap_begin
      ... iomap_end, so the user pages were still faulted in while holding the
      inode glock and the locking code in iomap_begin / iomap_end was
      completely pointless.
      
      Move the locking into gfs2_file_buffered_write instead.  We'll take care
      of the potential deadlocks due to faulting in user pages while holding a
      glock in a subsequent patch.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Introduce flag for glock holder auto-demotion · 416a7053
      Bob Peterson authored
      commit dc732906 upstream
      
      This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
      will allow glocks to be demoted automatically on locking conflicts.
      When a locking request comes in that isn't compatible with the locking
      state of an active holder and that holder has the HIF_MAY_DEMOTE flag
      set, the holder will be demoted before the incoming locking request is
      granted.
      
      Note that this mechanism demotes active holders (with the HIF_HOLDER
      flag set), while before we were only demoting glocks without any active
      holders.  This allows processes to keep hold of locks that may form a
      cyclic locking dependency; the core glock logic will then break those
      dependencies in case a conflicting locking request occurs.  We'll use
      this to avoid giving up the inode glock proactively before faulting in
      pages.
      
      Processes that allow a glock holder to be taken away indicate this by
      calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
      Later, they call gfs2_holder_disallow_demote() to clear the flag again,
      and then they check if their holder is still queued: if it is, they are
      still holding the glock; if it isn't, they can re-acquire the glock (or
      abort).
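
      The intended calling pattern, as a sketch (the holder and iterator
      names are illustrative):

        gfs2_holder_allow_demote(&gh);      /* glock may now be taken away */
        left = fault_in_iov_iter_readable(from, len);    /* may sleep/fault */
        gfs2_holder_disallow_demote(&gh);   /* stop automatic demotion */

        if (!gfs2_holder_queued(&gh)) {
            /* We lost the glock while faulting; re-acquire or bail out. */
            ret = gfs2_glock_nq(&gh);
            if (ret)
                goto out;
        }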
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Clean up function may_grant · b25cfbc0
      Andreas Gruenbacher authored
      commit 61444649 upstream
      
      Pass the first current glock holder into function may_grant and
      deobfuscate the logic there.
      
      While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant.  To make
      that build cleanly, de-constify the may_grant arguments.
      
      We're now using function find_first_holder in do_promote, so move the
      function's definition above do_promote.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gfs2: Add wrapper for iomap_file_buffered_write · b88b9985
      Andreas Gruenbacher authored
      commit 2eb7509a upstream
      
      Add a wrapper around iomap_file_buffered_write.  We'll add code for when
      the operation needs to be retried here later.
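
      Initially the wrapper is trivial, along these lines (a sketch; the
      retry logic lands here in a later patch):

        static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
                                                struct iov_iter *from)
        {
            struct inode *inode = file_inode(iocb->ki_filp);
            ssize_t ret;

            current->backing_dev_info = inode_to_bdi(inode);
            ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
            current->backing_dev_info = NULL;
            return ret;
        }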
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iov_iter: Introduce fault_in_iov_iter_writeable · 1d91c912
      Andreas Gruenbacher authored
      commit cdd591fc upstream
      
      Introduce a new fault_in_iov_iter_writeable helper for safely faulting
      in an iterator for writing.  It uses get_user_pages() to fault in the
      pages without actually writing to them, which would be destructive.
      
      We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
      the iterator passed to .read_iter isn't in memory.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable · 30e66b1d
      Andreas Gruenbacher authored
      commit a6294593 upstream
      
      Turn iov_iter_fault_in_readable into a function that returns the number
      of bytes not faulted in, similar to copy_to_user, instead of returning a
      non-zero value when any of the requested pages couldn't be faulted in.
      This supports the existing users that require all pages to be faulted in
      as well as new users that are happy if any pages can be faulted in.
      
      Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
      sure this change doesn't silently break things.
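
      Callers adapt along these lines (a sketch of the two conventions):

        /* Before: non-zero meant at least one page couldn't be faulted in. */
        if (iov_iter_fault_in_readable(i, bytes))
            return -EFAULT;

        /* After: the return value is the number of bytes not faulted in.
         * "Require all pages" callers check for a non-zero remainder: */
        if (fault_in_iov_iter_readable(i, bytes) != 0)
            return -EFAULT;

        /* "Any progress is fine" callers compare against the full count: */
        if (fault_in_iov_iter_readable(i, bytes) == bytes)
            return -EFAULT;    /* nothing at all could be faulted in */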
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} · 923f05a6
      Andreas Gruenbacher authored
      commit bb523b40 upstream
      
      Turn fault_in_pages_{readable,writeable} into versions that return the
      number of bytes not faulted in, similar to copy_to_user, instead of
      returning a non-zero value when any of the requested pages couldn't be
      faulted in.  This supports the existing users that require all pages to
      be faulted in as well as new users that are happy if any pages can be
      faulted in.
      
      Rename the functions to fault_in_{readable,writeable} to make sure
      this change doesn't silently break things.
      
      Neither of these functions is entirely trivial and it doesn't seem
      useful to inline them, so move them to mm/gup.c.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: kfence: fix objcgs vector allocation · 19cbd78f
      Muchun Song authored
      commit 8f0b3649 upstream.
      
      If a kfence object is allocated to be used for an objects vector, then
      this slot of the pool is eventually occupied permanently, since the
      vector is never freed.  The solutions could be (1) freeing the vector
      when the kfence object is freed or (2) allocating all vectors
      statically.

      Since the memory consumption of object vectors is low, it is better to
      choose (2) to fix the issue, and it can also reduce the overhead of
      vector allocation in the future.
      
      Link: https://lkml.kernel.org/r/20220328132843.16624-1-songmuchun@bytedance.com
      Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM: dts: socfpga: change qspi to "intel,socfpga-qspi" · 10033fa7
      Dinh Nguyen authored
      commit 36de991e upstream.
      
      Commit 9cb2ff11 ("spi: cadence-quadspi: Disable Auto-HW polling")
      introduced a write to the CQSPI_REG_WR_COMPLETION_CTRL register
      regardless of any condition. However, the Cadence QuadSPI controller on
      Intel's SoCFPGA platforms does not implement the
      CQSPI_REG_WR_COMPLETION_CTRL register, thus a write to this register
      results in a crash!

      So starting with v5.16, I introduced commit 98d948eb ("spi:
      cadence-quadspi: fix write completion support"), which adds the dts
      compatible "intel,socfpga-qspi" that is specific to versions that don't
      have the CQSPI_REG_WR_COMPLETION_CTRL register implemented.
      
      Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
      [IA: submitted for linux-5.15.y]
      Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • spi: cadence-quadspi: fix write completion support · e8749d60
      Dinh Nguyen authored
      commit 98d948eb upstream.
      
      Some versions of the Cadence QSPI controller do not have the write
      completion register (CQSPI_REG_WR_COMPLETION_CTRL) implemented. On the
      Intel SoCFPGA platform the CQSPI_REG_WR_COMPLETION_CTRL register is
      not configured.
      
      Add a quirk to not write to the CQSPI_REG_WR_COMPLETION_CTRL register.
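
      The guard ends up looking roughly like this (a sketch; the quirk bit
      and field names follow the patch):

        /* cqspi->wr_completion is cleared when the platform data carries
         * the CQSPI_NO_SUPPORT_WR_COMPLETION quirk. */
        if (cqspi->wr_completion) {
            reg = readl(reg_base + CQSPI_REG_WR_COMPLETION_CTRL);
            reg |= CQSPI_REG_WR_DISABLE_AUTO_POLL;
            writel(reg, reg_base + CQSPI_REG_WR_COMPLETION_CTRL);
        }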
      
      Fixes: 9cb2ff11 ("spi: cadence-quadspi: Disable Auto-HW polling")
      Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
      Reviewed-by: Pratyush Yadav <p.yadav@ti.com>
      Link: https://lore.kernel.org/r/20211108200854.3616121-1-dinguyen@kernel.org
      Signed-off-by: Mark Brown <broonie@kernel.org>
      [IA: backported for linux-5.15.y]
      Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Fix crash due to out of bounds access into reg2btf_ids. · 8c39925e
      Kumar Kartikeya Dwivedi authored
      commit 45ce4b4f upstream
      
      When commit e6ac2450 ("bpf: Support bpf program calling kernel function")
      added kfunc support, it defined reg2btf_ids as a cheap way to translate
      the verifier reg type to the appropriate btf_vmlinux BTF ID. However,
      commit c25b2ae1 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX |
      PTR_MAYBE_NULL") moved __BPF_REG_TYPE_MAX from the last member of the
      bpf_reg_type enum to after the base register types, and defined other
      variants using type flag composition. Now, the direct usage of reg->type
      to index into reg2btf_ids may no longer fall into the __BPF_REG_TYPE_MAX
      range, and hence leads to an out of bounds access and a kernel crash on
      dereference of a bad pointer.
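
      The fix therefore has to both mask and reject flags before indexing,
      along these lines (a sketch assuming the base_type()/type_flag()
      helpers from the composable-types series, not the verbatim diff):

        if (reg2btf_ids[base_type(reg->type)]) {
            /* Only a plain base type may be used for the lookup... */
            if (type_flag(reg->type))    /* e.g. PTR_MAYBE_NULL is set */
                return -EINVAL;
            /* ...and the index is guaranteed to be < __BPF_REG_TYPE_MAX. */
            reg_btf = btf_vmlinux;
            reg_ref_id = *reg2btf_ids[base_type(reg->type)];
        }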
      
      [backport note: commit 3363bd0c ("bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM
       argument support") was introduced after 5.15 and contains an out of bound
       reg2btf_ids access. Since that commit hasn't been backported, this patch
       doesn't include fix to that access. If we backport that commit in future,
       we need to fix its faulting access as well.]
      
      Fixes: c25b2ae1 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220216201943.624869-1-memxor@gmail.com
      Cc: stable@vger.kernel.org # v5.15+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf/selftests: Test PTR_TO_RDONLY_MEM · 379382b3
      Hao Luo authored
      commit 9497c458 upstream.
      
      This test verifies that a ksym of non-struct type cannot be directly
      updated.
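
      The gist of the rejected program looks like this sketch (the symbol
      name is illustrative, not necessarily the one the selftest uses):

        #include <bpf/bpf_helpers.h>    /* SEC(), __ksym */

        extern const int bpf_prog_active __ksym;    /* a non-struct ksym */

        SEC("raw_tp/sys_enter")
        int test_ksym_write(const void *ctx)
        {
            /* The verifier must reject this store: the ksym resolves to
             * read-only (MEM_RDONLY) memory. */
            *(int *)&bpf_prog_active = 1;
            return 0;
        }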
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-10-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem. · 2a77c587
      Hao Luo authored
      commit 216e3cd2 upstream.
      
      Some helper functions may modify their arguments, for example,
      bpf_d_path, bpf_get_stack etc. Previously, their argument types
      were marked as ARG_PTR_TO_MEM, which is compatible with read-only
      mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
      but technically incorrect, to modify read-only memory by passing
      it into one of such helper functions.
      
      This patch tags the bpf_args compatible with immutable memory with
      the MEM_RDONLY flag. The arguments that don't have this flag will
      only be compatible with mutable memory types, preventing the helper
      from modifying read-only memory. The bpf_args that have
      MEM_RDONLY are compatible with both mutable memory and immutable
      memory.
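
      In a helper proto this is a one-line change per argument, e.g. (a
      sketch; the proto name is made up):

        static const struct bpf_func_proto bpf_example_proto = {
            /* ... */
            .arg1_type = ARG_PTR_TO_MEM | MEM_RDONLY,    /* only read from */
            .arg2_type = ARG_PTR_TO_MEM,                 /* may be written */
        };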
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM. · 15166bb3
      Hao Luo authored
      commit 34d3a78c upstream.
      
      Tag the return type of {per, this}_cpu_ptr with RDONLY_MEM. The
      returned value of this pair of helpers is a kernel object, which
      cannot be updated by bpf programs. Previously these two helpers
      returned PTR_TO_MEM for kernel objects of scalar type, which allowed
      one to directly modify the memory. Now with RDONLY_MEM tagging,
      the verifier will reject programs that write into RDONLY_MEM.
      
      Fixes: 63d9b80d ("bpf: Introducte bpf_this_cpu_ptr()")
      Fixes: eaa6bcb7 ("bpf: Introduce bpf_per_cpu_ptr()")
      Fixes: 4976b718 ("bpf: Introduce pseudo_btf_id")
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-8-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Convert PTR_TO_MEM_OR_NULL to composable types. · b710f737
      Hao Luo authored
      commit cf9f2f8d upstream.
      
      Remove PTR_TO_MEM_OR_NULL and replace it with PTR_TO_MEM combined with
      flag PTR_MAYBE_NULL.
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-7-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Introduce MEM_RDONLY flag · b4533613
      Hao Luo authored
      commit 20b2aff4 upstream.
      
      This patch introduces a flag MEM_RDONLY to tag a reg value
      pointing to read-only memory. It makes the following changes:
      
      1. PTR_TO_RDWR_BUF -> PTR_TO_BUF
      2. PTR_TO_RDONLY_BUF -> PTR_TO_BUF | MEM_RDONLY
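
      The flag sits in the high bits next to PTR_MAYBE_NULL (a sketch of the
      layout, matching the series' base-type/flag split):

        enum bpf_type_flag {
            /* PTR may be NULL */
            PTR_MAYBE_NULL = BIT(0 + BPF_BASE_TYPE_BITS),
            /* MEM is read-only */
            MEM_RDONLY     = BIT(1 + BPF_BASE_TYPE_BITS),
            __BPF_TYPE_LAST_FLAG = MEM_RDONLY,
        };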
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-6-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL · 8d38cde4
      Hao Luo authored
      commit c25b2ae1 upstream.
      
      We have introduced a new type to make bpf_reg composable, by
      allocating bits in the type to represent flags.
      
      One of the flags is PTR_MAYBE_NULL which indicates a pointer
      may be NULL. This patch switches the qualified reg_types to
      use this flag. The reg_types changed in this patch include:
      
      1. PTR_TO_MAP_VALUE_OR_NULL
      2. PTR_TO_SOCKET_OR_NULL
      3. PTR_TO_SOCK_COMMON_OR_NULL
      4. PTR_TO_TCP_SOCK_OR_NULL
      5. PTR_TO_BTF_ID_OR_NULL
      6. PTR_TO_MEM_OR_NULL
      7. PTR_TO_RDONLY_BUF_OR_NULL
      8. PTR_TO_RDWR_BUF_OR_NULL
      
      [haoluo: backport notes
       There was a reg_type_may_be_null() in adjust_ptr_min_max_vals() in
       5.15.x, but didn't exist in the upstream commit. This backport
       converted that reg_type_may_be_null() to type_may_be_null() as well.]
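
      With the flag in place, the old per-type enumeration collapses into a
      single bit test (sketch):

        static inline bool type_may_be_null(u32 type)
        {
            return type & PTR_MAYBE_NULL;
        }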
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/r/20211217003152.48334-5-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULL · 3c141c82
      Hao Luo authored
      commit 3c480732 upstream.
      
      We have introduced a new type to make bpf_ret composable, by
      reserving high bits to represent flags.

      One of the flags is PTR_MAYBE_NULL, which indicates a pointer
      may be NULL. When applying this flag to ret_types, it means
      the returned value could be a NULL pointer. This patch
      switches the qualified ret_types to use this flag.
      The ret_types changed in this patch include:
      
      1. RET_PTR_TO_MAP_VALUE_OR_NULL
      2. RET_PTR_TO_SOCKET_OR_NULL
      3. RET_PTR_TO_TCP_SOCK_OR_NULL
      4. RET_PTR_TO_SOCK_COMMON_OR_NULL
      5. RET_PTR_TO_ALLOC_MEM_OR_NULL
      6. RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL
      7. RET_PTR_TO_BTF_ID_OR_NULL
      
      This patch doesn't eliminate the use of these names, instead
      it makes them aliases to 'RET_PTR_TO_XXX | PTR_MAYBE_NULL'.
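
      Each alias is defined by flag composition, e.g. (sketch):

        enum bpf_return_type {
            /* ... base return types ... */
            RET_PTR_TO_MAP_VALUE_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_MAP_VALUE,
            RET_PTR_TO_SOCKET_OR_NULL    = PTR_MAYBE_NULL | RET_PTR_TO_SOCKET,
        };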
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-4-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Replace ARG_XXX_OR_NULL with ARG_XXX | PTR_MAYBE_NULL · d58a396f
      Hao Luo authored
      commit 48946bd6 upstream.
      
      We have introduced a new type to make bpf_arg composable, by
      reserving high bits of bpf_arg to represent flags of a type.
      
      One of the flags is PTR_MAYBE_NULL which indicates a pointer
      may be NULL. When applying this flag to an arg_type, it means
      the arg can take a NULL pointer. This patch switches the
      qualified arg_types to use this flag. The arg_types changed
      in this patch include:
      
      1. ARG_PTR_TO_MAP_VALUE_OR_NULL
      2. ARG_PTR_TO_MEM_OR_NULL
      3. ARG_PTR_TO_CTX_OR_NULL
      4. ARG_PTR_TO_SOCKET_OR_NULL
      5. ARG_PTR_TO_ALLOC_MEM_OR_NULL
      6. ARG_PTR_TO_STACK_OR_NULL
      
      This patch does not eliminate the use of these arg_types; instead
      it makes them aliases to 'ARG_XXX | PTR_MAYBE_NULL'.
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-3-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: Introduce composable reg, ret and arg types. · a7602098
      Hao Luo authored
      commit d639b9d1 upstream.
      
      There are some common properties shared between bpf reg, ret and arg
      values. For instance, a value may be a NULL pointer, or a pointer to
      read-only memory. Previously, to express these properties, enumeration
      was used. For example, in order to test whether a reg value can be NULL,
      reg_type_may_be_null() simply enumerates all types that are possibly
      NULL. The problem with this approach is that it's not scalable and
      causes a lot of duplication. These properties can be combined; for
      example, a type could be either MAYBE_NULL or RDONLY, or both.
      
      This patch series rewrites the layout of reg_type, arg_type and
      ret_type, so that common properties can be extracted and represented as
      composable flags. For example, one can write
      
       ARG_PTR_TO_MEM | PTR_MAYBE_NULL
      
      which is equivalent to the previous
      
       ARG_PTR_TO_MEM_OR_NULL
      
      Types such as ARG_PTR_TO_MEM are called "base types" in this patch.
      Base types can be extended with flags. A flag occupies the higher bits
      while base types sit in the lower bits.

      This patch in particular sets up a set of macros for this purpose. The
      following patches will rewrite arg_types, ret_types and reg_types
      respectively.
      
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211217003152.48334-2-haoluo@google.com
      Cc: stable@vger.kernel.org # 5.15.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • floppy: disable FDRAWCMD by default · e52da8e4
      Willy Tarreau authored
      commit 233087ca upstream.
      
      Minh Yuan reported a concurrency use-after-free issue in the floppy code
      between raw_cmd_ioctl and seek_interrupt.
      
      [ It turns out this has been around, and that others have reported the
        KASAN splats over the years, but Minh Yuan had a reproducer for it and
        so gets primary credit for reporting it for this fix   - Linus ]
      
      The problem is, this driver tends to break very easily and nowadays,
      nobody is expected to use FDRAWCMD anyway since it was used to
      manipulate non-standard formats.  The risk of breaking the driver is
      higher than the risk presented by this race, and accessing the device
      requires privileges anyway.
      
      Let's just add a config option to completely disable this ioctl and
      leave it disabled by default.  Distros shouldn't use it, and only those
      running on antique hardware might need to enable it.
      
      Link: https://lore.kernel.org/all/000000000000b71cdd05d703f6bf@google.com/
      Link: https://lore.kernel.org/lkml/CAKcFiNC=MfYVW-Jt9A3=FPJpTwCD2PL_ULNCpsCVE5s8ZeBQgQ@mail.gmail.com
      Link: https://lore.kernel.org/all/CAEAjamu1FRhz6StCe_55XY5s389ZP_xmCF69k987En+1z53=eg@mail.gmail.com
      Reported-by: Minh Yuan <yuanmingbuaa@gmail.com>
      Reported-by: <syzbot+8e8958586909d62b6840@syzkaller.appspotmail.com>
      Reported-by: cruise k <cruise4k@gmail.com>
      Reported-by: Kyungtae Kim <kt0755@gmail.com>
      Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
      Tested-by: Denis Efremov <efremov@linux.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. Apr 27, 2022