Skip to content
  1. Mar 04, 2022
    • Filipe Manana's avatar
      btrfs: fallback to blocking mode when doing async dio over multiple extents · ca93e44b
      Filipe Manana authored
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
         the first 4K extent, and setups an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2; The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
      6) iomap calls btrfs_iomap_begin() again, this time with a file
         offset of X + 4K. There we setup an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete() with a second argument value of 4K (total io
          done) and the iocb with the adjust ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if we ask
          to read N bytes, it expects to get N bytes read as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
      boundary. This is typical in many applications, but it may also be
      questionable if they should react to such short reads by issuing more
      read calls to get the remaining data. Nevertheless, the short read
      happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't have the previous behaviour of returning the whole data
      that was requested by the application.
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although that reproducer always triggers
      the bug in my test vms, it's possible that it will not be so reliable
      on other environments, as that can happen if the bio for the first
      extent completes and decrements the reference on the struct iomap_dio
      object before we do the atomic_dec_and_test() on the reference at
      __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO, which is not optimal for writes and for reads it's actually
      incorrect, as the caller doesn't expect to get less bytes read than it has
      requested (unless EOF is crossed), as previously mentioned. This is also
      the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
      even though it doesn't use IOMAP_DIO_PARTIAL.
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      
      
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca93e44b
  2. Mar 02, 2022
    • Filipe Manana's avatar
      btrfs: add missing run of delayed items after unlink during log replay · 4751dc99
      Filipe Manana authored
      
      
      During log replay, whenever we need to check if a name (dentry) exists in
      a directory we do searches on the subvolume tree for inode references or
      or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
      keys as well, before kernel 5.17). However when during log replay we
      unlink a name, through btrfs_unlink_inode(), we may not delete inode
      references and dir index keys from a subvolume tree and instead just add
      the deletions to the delayed inode's delayed items, which will only be
      run when we commit the transaction used for log replay. This means that
      after an unlink operation during log replay, if we attempt to search for
      the same name during log replay, we will not see that the name was already
      deleted, since the deletion is recorded only on the delayed items.
      
      We run delayed items after every unlink operation during log replay,
      except at unlink_old_inode_refs() and at add_inode_ref(). This was due
      to an overlook, as delayed items should be run after evert unlink, for
      the reasons stated above.
      
      So fix those two cases.
      
      Fixes: 0d836392 ("Btrfs: fix mount failure after fsync due to hard link recreation")
      Fixes: 1f250e92 ("Btrfs: fix log replay failure after unlink and link combination")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4751dc99
    • Sidong Yang's avatar
      btrfs: qgroup: fix deadlock between rescan worker and remove qgroup · d4aef1e1
      Sidong Yang authored
      
      
      The commit e804861b ("btrfs: fix deadlock between quota disable and
      qgroup rescan worker") by Kawasaki resolves deadlock between quota
      disable and qgroup rescan worker. But also there is a deadlock case like
      it. It's about enabling or disabling quota and creating or removing
      qgroup. It can be reproduced in simple script below.
      
      for i in {1..100}
      do
          btrfs quota enable /mnt &
          btrfs qgroup create 1/0 /mnt &
          btrfs qgroup destroy 1/0 /mnt &
          btrfs quota disable /mnt &
      done
      
      Here's why the deadlock happens:
      
      1) The quota rescan task is running.
      
      2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
         mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
         the quota rescan task to complete.
      
      3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
         the qgroup_ioctl_lock mutex, because it's being held by task A. At that
         point task B is holding a transaction handle for the current transaction.
      
      4) The quota rescan task calls btrfs_commit_transaction(). This results
         in it waiting for all other tasks to release their handles on the
         transaction, but task B is blocked on the qgroup_ioctl_lock mutex
         while holding a handle on the transaction, and that mutex is being held
         by task A, which is waiting for the quota rescan task to complete,
         resulting in a deadlock between these 3 tasks.
      
      To resolve this issue, the thread disabling quota should unlock
      qgroup_ioctl_lock before waiting rescan completion. Move
      btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock.
      
      Fixes: e804861b ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: default avatarSidong Yang <realwakka@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4aef1e1
    • Omar Sandoval's avatar
      btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() · 5fd76bf3
      Omar Sandoval authored
      
      
      We are seeing crashes similar to the following trace:
      
      [38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
      [38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54
      [38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      [38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
      [38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206
      [38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14
      [38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360
      [38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000
      [38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800
      [38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360
      [38.987146] FS:  00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000
      [38.988662] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0
      [38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [38.992528] Call Trace:
      [38.992854]  <TASK>
      [38.993148]  btrfs_relocate_chunk+0x27/0xe0 [btrfs]
      [38.993941]  btrfs_balance+0x78e/0xea0 [btrfs]
      [38.994801]  ? vsnprintf+0x33c/0x520
      [38.995368]  ? __kmalloc_track_caller+0x351/0x440
      [38.996198]  btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs]
      [38.997084]  btrfs_ioctl+0x11b0/0x2da0 [btrfs]
      [38.997867]  ? mod_objcg_state+0xee/0x340
      [38.998552]  ? seq_release+0x24/0x30
      [38.999184]  ? proc_nr_files+0x30/0x30
      [38.999654]  ? call_rcu+0xc8/0x2f0
      [39.000228]  ? __x64_sys_ioctl+0x84/0xc0
      [39.000872]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [39.001973]  __x64_sys_ioctl+0x84/0xc0
      [39.002566]  do_syscall_64+0x3a/0x80
      [39.003011]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [39.003735] RIP: 0033:0x7f11c166959b
      [39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b
      [39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003
      [39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0
      [39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3
      [39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001
      [39.015040]  </TASK>
      [39.015418] ---[ end trace 0000000000000000 ]---
      [43.131559] ------------[ cut here ]------------
      [43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717!
      [43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G        W         5.17.0-rc4 #54
      [43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      [43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
      [43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
      [43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
      [43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
      [43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
      [43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
      [43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
      [43.145686] FS:  00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000
      [43.146808] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0
      [43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [43.150559] Call Trace:
      [43.150904]  <TASK>
      [43.151253]  btrfs_finish_extent_commit+0x88/0x290 [btrfs]
      [43.152127]  btrfs_commit_transaction+0x74f/0xaa0 [btrfs]
      [43.152932]  ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
      [43.153786]  btrfs_ioctl+0x1edc/0x2da0 [btrfs]
      [43.154475]  ? __check_object_size+0x150/0x170
      [43.155170]  ? preempt_count_add+0x49/0xa0
      [43.155753]  ? __x64_sys_ioctl+0x84/0xc0
      [43.156437]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [43.157456]  __x64_sys_ioctl+0x84/0xc0
      [43.157980]  do_syscall_64+0x3a/0x80
      [43.158543]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [43.159231] RIP: 0033:0x7f7657f1e59b
      [43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b
      [43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003
      [43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
      [43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800
      [43.166863]  </TASK>
      [43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common
      [43.169552] ---[ end trace 0000000000000000 ]---
      [43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
      [43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
      [43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
      [43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
      [43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
      [43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
      [43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
      [43.180071] FS:  00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000
      [43.181073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0
      [43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      We first hit the WARN_ON(rc->block_group->pinned > 0) in
      btrfs_relocate_block_group() and then the BUG_ON(!cache) in
      unpin_extent_range(). This tells us that we are exiting relocation and
      removing the block group with bytes still pinned for that block group.
      This is supposed to be impossible: the last thing relocate_block_group()
      does is commit the transaction to get rid of pinned extents.
      
      Commit d0c2f4fa ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") introduced an optimization so that
      commits from fsync don't have to wait for the previous commit to unpin
      extents. This was only intended to affect fsync, but it inadvertently
      made it possible for any commit to skip waiting for the previous commit
      to unpin. This is because if a call to btrfs_commit_transaction() finds
      that another thread is already committing the transaction, it waits for
      the other thread to complete the commit and then returns. If that other
      thread was in fsync, then it completes the commit without completing the
      previous commit. This makes the following sequence of events possible:
      
      Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________
      btrfs_commit_transaction(N) |                                     |
        btrfs_run_delayed_refs    |                                     |
          pin extents             |                                     |
        ...                       |                                     |
        state = UNBLOCKED         |btrfs_sync_file                      |
                                  |  btrfs_start_transaction(N + 1)     |relocate_block_group
                                  |                                     |  btrfs_join_transaction(N + 1)
                                  |  btrfs_commit_transaction(N + 1)    |
        ...                       |  trans->state = COMMIT_START        |
                                  |                                     |  btrfs_commit_transaction(N + 1)
                                  |                                     |    wait_for_commit(N + 1, COMPLETED)
                                  |  wait_for_commit(N, SUPER_COMMITTED)|
        state = SUPER_COMMITTED   |  ...                                |
        btrfs_finish_extent_commit|                                     |
          unpin_extent_range()    |  trans->state = COMPLETED           |
                                  |                                     |    return
                                  |                                     |
          ...                     |                                     |Thread 1 isn't done, so pinned > 0
                                  |                                     |and we WARN
                                  |                                     |
                                  |                                     |btrfs_remove_block_group
          unpin_extent_range()    |                                     |
            Thread 3 removed the  |                                     |
            block group, so we BUG|                                     |
      
      There are other sequences involving SUPER_COMMITTED transactions that
      can cause a similar outcome.
      
      We could fix this by making relocation explicitly wait for unpinning,
      but there may be other cases that need it. Josef mentioned ENOSPC
      flushing and the free space cache inode as other potential victims.
      Rather than playing whack-a-mole, this fix is conservative and makes all
      commits not in fsync wait for all previous transactions, which is what
      the optimization intended.
      
      Fixes: d0c2f4fa ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5fd76bf3
    • Josef Bacik's avatar
      btrfs: do not start relocation until in progress drops are done · b4be6aef
      Josef Bacik authored
      
      
      We hit a bug with a recovering relocation on mount for one of our file
      systems in production.  I reproduced this locally by injecting errors
      into snapshot delete with balance running at the same time.  This
      presented as an error while looking up an extent item
      
        WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
        CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
        RIP: 0010:lookup_inline_extent_backref+0x647/0x680
        RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
        RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
        R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
        R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
        FS:  0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
        Call Trace:
         <TASK>
         insert_inline_extent_backref+0x46/0xd0
         __btrfs_inc_extent_ref.isra.0+0x5f/0x200
         ? btrfs_merge_delayed_refs+0x164/0x190
         __btrfs_run_delayed_refs+0x561/0xfa0
         ? btrfs_search_slot+0x7b4/0xb30
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_run_delayed_refs+0x73/0x1f0
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_commit_transaction+0x50/0xa50
         ? btrfs_update_reloc_root+0x122/0x220
         prepare_to_merge+0x29f/0x320
         relocate_block_group+0x2b8/0x550
         btrfs_relocate_block_group+0x1a6/0x350
         btrfs_relocate_chunk+0x27/0xe0
         btrfs_balance+0x777/0xe60
         balance_kthread+0x35/0x50
         ? btrfs_balance+0xe60/0xe60
         kthread+0x16b/0x190
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x22/0x30
         </TASK>
      
      Normally snapshot deletion and relocation are excluded from running at
      the same time by the fs_info->cleaner_mutex.  However if we had a
      pending balance waiting to get the ->cleaner_mutex, and a snapshot
      deletion was running, and then the box crashed, we would come up in a
      state where we have a half deleted snapshot.
      
      Again, in the normal case the snapshot deletion needs to complete before
      relocation can start, but in this case relocation could very well start
      before the snapshot deletion completes, as we simply add the root to the
      dead roots list and wait for the next time the cleaner runs to clean up
      the snapshot.
      
      Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
      had a pending drop_progress key.  If they do then we know we were in the
      middle of the drop operation and set a flag on the fs_info.  Then
      balance can wait until this flag is cleared to start up again.
      
      If there are DEAD_ROOT's that don't have a drop_progress set then we're
      safe to start balance right away as we'll be properly protected by the
      cleaner_mutex.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4be6aef
    • Su Yue's avatar
      btrfs: tree-checker: use u64 for item data end to avoid overflow · a6ab66eb
      Su Yue authored
      User reported there is an array-index-out-of-bounds access while
      mounting the crafted image:
      
        [350.411942 ] loop0: detected capacity change from 0 to 262144
        [350.427058 ] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (1044)
        [350.428564 ] BTRFS info (device loop0): disk space caching is enabled
        [350.428568 ] BTRFS info (device loop0): has skinny extents
        [350.429589 ]
        [350.429619 ] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
        [350.429636 ] index 1048096 is out of range for type 'page *[16]'
        [350.429650 ] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4
        [350.429652 ] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
        [350.429653 ] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
        [350.429772 ] Call Trace:
        [350.429774 ]  <TASK>
        [350.429776 ]  dump_stack_lvl+0x47/0x5c
        [350.429780 ]  ubsan_epilogue+0x5/0x50
        [350.429786 ]  __ubsan_handle_out_of_bounds+0x66/0x70
        [350.429791 ]  btrfs_get_16+0xfd/0x120 [btrfs]
        [350.429832 ]  check_leaf+0x754/0x1a40 [btrfs]
        [350.429874 ]  ? filemap_read+0x34a/0x390
        [350.429878 ]  ? load_balance+0x175/0xfc0
        [350.429881 ]  validate_extent_buffer+0x244/0x310 [btrfs]
        [350.429911 ]  btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
        [350.429935 ]  end_bio_extent_readpage+0x3af/0x850 [btrfs]
        [350.429969 ]  ? newidle_balance+0x259/0x480
        [350.429972 ]  end_workqueue_fn+0x29/0x40 [btrfs]
        [350.429995 ]  btrfs_work_helper+0x71/0x330 [btrfs]
        [350.430030 ]  ? __schedule+0x2fb/0xa40
        [350.430033 ]  process_one_work+0x1f6/0x400
        [350.430035 ]  ? process_one_work+0x400/0x400
        [350.430036 ]  worker_thread+0x2d/0x3d0
        [350.430037 ]  ? process_one_work+0x400/0x400
        [350.430038 ]  kthread+0x165/0x190
        [350.430041 ]  ? set_kthread_struct+0x40/0x40
        [350.430043 ]  ret_from_fork+0x1f/0x30
        [350.430047 ]  </TASK>
        [350.430047 ]
        [350.430077 ] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
      
      btrfs check reports:
        corrupt leaf: root=3 block=20975616 physical=20975616 slot=1, unexpected
        item end, have 4294971193 expect 3897
      
      The first slot item offset is 4293005033 and the size is 1966160.
      In check_leaf, we use btrfs_item_end() to check item boundary versus
      extent_buffer data size. However, return type of btrfs_item_end() is u32.
      (u32)(4293005033 + 1966160) == 3897, overflow happens and the result 3897
      equals to leaf data size reasonably.
      
      Fix it by use u64 variable to store item data end in check_leaf() to
      avoid u32 overflow.
      
      This commit does solve the invalid memory access showed by the stack
      trace.  However, its metadata profile is DUP and another copy of the
      leaf is fine.  So the image can be mounted successfully. But when umount
      is called, the ASSERT btrfs_mark_buffer_dirty() will be triggered
      because the only node in extent tree has 0 item and invalid owner. It's
      solved by another commit
      "btrfs: check extent buffer owner against the owner rootid".
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
      
      
      Reported-by: default avatarWenqing Liu <wenqingliu0120@gmail.com>
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a6ab66eb
    • Josef Bacik's avatar
      btrfs: do not WARN_ON() if we have PageError set · a50e1fcb
      Josef Bacik authored
      
      
      Whenever we do any extent buffer operations we call
      assert_eb_page_uptodate() to complain loudly if we're operating on an
      non-uptodate page.  Our overnight tests caught this warning earlier this
      week
      
        WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
        CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G        W         5.17.0-rc3+ #564
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Workqueue: btrfs-cache btrfs_work_helper
        RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
        RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
        RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
        RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
        RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
        R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
        R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
        FS:  0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
        Call Trace:
      
         extent_buffer_test_bit+0x3f/0x70
         free_space_test_bit+0xa6/0xc0
         load_free_space_tree+0x1f6/0x470
         caching_thread+0x454/0x630
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? lock_release+0x1f0/0x2d0
         btrfs_work_helper+0xf2/0x3e0
         ? lock_release+0x1f0/0x2d0
         ? finish_task_switch.isra.0+0xf9/0x3a0
         process_one_work+0x26d/0x580
         ? process_one_work+0x580/0x580
         worker_thread+0x55/0x3b0
         ? process_one_work+0x580/0x580
         kthread+0xf0/0x120
         ? kthread_complete_and_exit+0x20/0x20
         ret_from_fork+0x1f/0x30
      
      This was partially fixed by c2e39305 ("btrfs: clear extent buffer
      uptodate when we fail to write it"), however all that fix did was keep
      us from finding extent buffers after a failed writeout.  It didn't keep
      us from continuing to use a buffer that we already had found.
      
      In this case we're searching the commit root to cache the block group,
      so we can start committing the transaction and switch the commit root
      and then start writing.  After the switch we can look up an extent
      buffer that hasn't been written yet and start processing that block
      group.  Then we fail to write that block out and clear Uptodate on the
      page, and then we start spewing these errors.
      
      Normally we're protected by the tree lock to a certain degree here.  If
      we read a block we have that block read locked, and we block the writer
      from locking the block before we submit it for the write.  However this
      isn't necessarily fool proof because the read could happen before we do
      the submit_bio and after we locked and unlocked the extent buffer.
      
      Also in this particular case we have path->skip_locking set, so that
      won't save us here.  We'll simply get a block that was valid when we
      read it, but became invalid while we were using it.
      
      What we really want is to catch the case where we've "read" a block but
      it's not marked Uptodate.  On read we ClearPageError(), so if we're
      !Uptodate and !Error we know we didn't do the right thing for reading
      the page.
      
      Fix this by checking !Uptodate && !Error, this way we will not complain
      if our buffer gets invalidated while we're using it, and we'll maintain
      the spirit of the check which is to make sure we have a fully in-cache
      block while we're messing with it.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a50e1fcb
    • Filipe Manana's avatar
      btrfs: fix lost prealloc extents beyond eof after full fsync · d9947887
      Filipe Manana authored
      
      
      When doing a full fsync, if we have prealloc extents beyond (or at) eof,
      and the leaves that contain them were not modified in the current
      transaction, we end up not logging them. This results in losing those
      extents when we replay the log after a power failure, since the inode is
      truncated to the current value of the logged i_size.
      
      Just like for the fast fsync path, we need to always log all prealloc
      extents starting at or beyond i_size. The fast fsync case was fixed in
      commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay") but it missed the full fsync path. The problem
      exists since the very early days, when the log tree was added by
      commit e02119d5 ("Btrfs: Add a write ahead tree log to optimize
      synchronous operations").
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        # Create our test file with many file extent items, so that they span
        # several leaves of metadata, even if the node/page size is 64K. Use
        # direct IO and not fsync/O_SYNC because it's both faster and it avoids
        # clearing the full sync flag from the inode - we want the fsync below
        # to trigger the slow full sync code path.
        $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
      
        # Now add two preallocated extents to our file without extending the
        # file's size. One right at i_size, and another further beyond, leaving
        # a gap between the two prealloc extents.
        $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
        $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
      
        # Make sure everything is durably persisted and the transaction is
        # committed. This makes all created extents to have a generation lower
        # than the generation of the transaction used by the next write and
        # fsync.
        sync
      
        # Now overwrite only the first extent, which will result in modifying
        # only the first leaf of metadata for our inode. Then fsync it. This
        # fsync will use the slow code path (inode full sync bit is set) because
        # it's the first fsync since the inode was created/loaded.
        $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
      
        # Extent list before power failure.
        $ xfs_io -c "fiemap -v" /mnt/foo
        /mnt/foo:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..7]:          2178048..2178055     8   0x0
           1: [8..16383]:      26632..43007     16376   0x0
           2: [16384..32767]:  2156544..2172927 16384   0x0
           3: [32768..34815]:  2172928..2174975  2048 0x800
           4: [34816..40959]:  hole              6144
           5: [40960..43007]:  2174976..2177023  2048 0x801
      
        <power fail>
      
        # Mount fs again, trigger log replay.
        $ mount /dev/sdc /mnt
      
        # Extent list after power failure and log replay.
        $ xfs_io -c "fiemap -v" /mnt/foo
        /mnt/foo:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..7]:          2178048..2178055     8   0x0
           1: [8..16383]:      26632..43007     16376   0x0
           2: [16384..32767]:  2156544..2172927 16384   0x1
      
        # The prealloc extents at file offsets 16M and 20M are missing.
      
      So fix this by calling btrfs_log_prealloc_extents() when we are doing a
      full fsync, so that we always log all prealloc extents beyond eof.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d9947887
    • Qu Wenruo's avatar
      btrfs: subpage: fix a wrong check on subpage->writers · c992fa1f
      Qu Wenruo authored
      
      
      [BUG]
      When looping btrfs/074 with 64K page size and 4K sectorsize, there is a
      low chance (1/50~1/100) to crash with the following ASSERT() triggered
      in btrfs_subpage_start_writer():
      
      	ret = atomic_add_return(nbits, &subpage->writers);
      	ASSERT(ret == nbits); <<< This one <<<
      
      [CAUSE]
      With more debugging output on the parameters of
      btrfs_subpage_start_writer(), it shows a very concerning error:
      
        ret=29 nbits=13 start=393216 len=53248
      
      For @nbits it's correct, but @ret which is the returned value from
      atomic_add_return(), it's not only larger than nbits, but also larger
      than max sectors per page value (for 64K page size and 4K sector size,
      it's 16).
      
      This indicates that some call sites are not properly decreasing the value.
      
      And that's exactly the case, in btrfs_page_unlock_writer(), due to the
      fact that we can have page locked either by lock_page() or
      process_one_page(), we have to check if the subpage has any writer.
      
      If no writers, it's locked by lock_page() and we only need to unlock it.
      
      But unfortunately the check for the writers are completely opposite:
      
      	if (atomic_read(&subpage->writers))
      		/* No writers, locked by plain lock_page() */
      		return unlock_page(page);
      
      We directly unlock the page if it has writers, which is the completely
      opposite what we want.
      
      Thankfully the affected call site is only limited to
      extent_write_locked_range(), so it's mostly affecting compressed write.
      
      [FIX]
      Just fix the wrong check condition to fix the bug.
      
      Fixes: e55a0de1 ("btrfs: rework page locking in __extent_writepage()")
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c992fa1f
  3. Feb 24, 2022
    • Qu Wenruo's avatar
      btrfs: reduce extent threshold for autodefrag · 558732df
      Qu Wenruo authored
      
      
      There is a big gap between inode_should_defrag() and autodefrag extent
      size threshold.  For inode_should_defrag() it has a flexible
      @small_write value. For compressed extent is 16K, and for non-compressed
      extent it's 64K.
      
      However for autodefrag extent size threshold, it's always fixed to the
      default value (256K).
      
      This means, the following write sequence will trigger autodefrag to
      defrag ranges which didn't trigger autodefrag:
      
        pwrite 0 8k
        sync
        pwrite 8k 128K
        sync
      
      The latter 128K write will also be considered as a defrag target (if
      other conditions are met). While only that 8K write is really
      triggering autodefrag.
      
      Such behavior can cause extra IO for autodefrag.
      
      Close the gap, by copying the @small_write value into inode_defrag, so
      that later autodefrag can use the same @small_write value which
      triggered autodefrag.
      
      With the existing transid value, this allows autodefrag really to scan
      the ranges which triggered autodefrag.
      
      Although this behavior change is mostly reducing the extent_thresh value
      for autodefrag, I believe in the future we should allow users to specify
      the autodefrag extent threshold through mount options, but that's an
      other problem to consider in the future.
      
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      558732df
    • Qu Wenruo's avatar
      btrfs: autodefrag: only scan one inode once · 26fbac25
      Qu Wenruo authored
      
      
      Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
      still just exhausting all inode_defrag items in the tree.
      
      This means, it doesn't make much difference to requeue an inode_defrag,
      other than scan the inode from the beginning till its end.
      
      Change the behaviour to always scan from offset 0 of an inode, and till
      the end.
      
      By this we get the following benefit:
      
      - Straight-forward code
      
      - No more re-queue related check
      
      - Fewer members in inode_defrag
      
      We still keep the same btrfs_get_fs_root() and btrfs_iget() check for
      each loop, and added extra should_auto_defrag() check per-loop.
      
      Note: the patch needs to be backported and is intentionally written
      to minimize the diff size, code will be cleaned up later.
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      26fbac25
    • Qu Wenruo's avatar
      btrfs: defrag: don't use merged extent map for their generation check · 199257a7
      Qu Wenruo authored
      
      
      For extent maps, if they are not compressed extents and are adjacent by
      logical addresses and file offsets, they can be merged into one larger
      extent map.
      
      Such merged extent map will have the higher generation of all the
      original ones.
      
      But this brings a problem for autodefrag, as it relies on accurate
      extent_map::generation to determine if one extent should be defragged.
      
      For merged extent maps, their higher generation can mark some older
      extents to be defragged while the original extent map doesn't meet the
      minimal generation threshold.
      
      Thus this will cause extra IO.
      
      So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED,
      to indicate if the extent map is merged from one or more ems.
      
      And for autodefrag, if we find a merged extent map, and its generation
      meets the generation requirement, we just don't use this one, and go
      back to defrag_get_extent() to read extent maps from subvolume trees.
      
      This could cause more read IO, but should result less defrag data write,
      so in the long run it should be a win for autodefrag.
      
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      199257a7
    • Qu Wenruo's avatar
      btrfs: defrag: bring back the old file extent search behavior · d5633b0d
      Qu Wenruo authored
      
      
      For defrag, we don't really want to use btrfs_get_extent() to iterate
      all extent maps of an inode.
      
      The reasons are:
      
      - btrfs_get_extent() can merge extent maps
        And the result em has the higher generation of the two, causing defrag
        to mark unnecessary part of such merged large extent map.
      
        This in fact can result extra IO for autodefrag in v5.16+ kernels.
      
        However this patch is not going to completely solve the problem, as
        one can still using read() to trigger extent map reading, and got
        them merged.
      
        The completely solution for the extent map merging generation problem
        will come as an standalone fix.
      
      - btrfs_get_extent() caches the extent map result
        Normally it's fine, but for defrag the target range may not get
        another read/write for a long long time.
        Such cache would only increase the memory usage.
      
      - btrfs_get_extent() doesn't skip older extent map
        Unlike the old find_new_extent() which uses btrfs_search_forward() to
        skip the older subtree, thus it will pick up unnecessary extent maps.
      
      This patch will fix the regression by introducing defrag_get_extent() to
      replace the btrfs_get_extent() call.
      
      This helper will:
      
      - Not cache the file extent we found
        It will search the file extent and manually convert it to em.
      
      - Use btrfs_search_forward() to skip entire ranges which is modified in
        the past
      
      This should reduce the IO for autodefrag.
      
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d5633b0d
    • Qu Wenruo's avatar
      btrfs: defrag: remove an ambiguous condition for rejection · 550f133f
      Qu Wenruo authored
      
      
      From the very beginning of btrfs defrag, there is a check to reject
      extents which meet both conditions:
      
      - Physically adjacent
      
        We may want to defrag physically adjacent extents to reduce the number
        of extents or the size of subvolume tree.
      
      - Larger than 128K
      
        This may be there for compressed extents, but unfortunately 128K is
        exactly the max capacity for compressed extents.
        And the check is > 128K, thus it never rejects compressed extents.
      
        Furthermore, the compressed extent capacity bug is fixed by previous
        patch, there is no reason for that check anymore.
      
      The original check has a very small ranges to reject (the target extent
      size is > 128K, and default extent threshold is 256K), and for
      compressed extent it doesn't work at all.
      
      So it's better just to remove the rejection, and allow us to defrag
      physically adjacent extents.
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      550f133f
    • Qu Wenruo's avatar
      btrfs: defrag: don't defrag extents which are already at max capacity · 979b25c3
      Qu Wenruo authored
      
      
      [BUG]
      For compressed extents, defrag ioctl will always try to defrag any
      compressed extents, wasting not only IO but also CPU time to
      compress/decompress:
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
         xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
         sync
         xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
         sync
         echo "=== before ==="
         xfs_io -c "fiemap -v" $MNT/foobar
         btrfs filesystem defrag $MNT/foobar
         sync
         echo "=== after ==="
         xfs_io -c "fiemap -v" $MNT/foobar
      
      Then it shows the 2 128K extents just get COW for no extra benefit, with
      extra IO/CPU spent:
      
          === before ===
          /mnt/btrfs/file1:
           EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
             0: [0..255]:        26624..26879       256   0x8
             1: [256..511]:      26632..26887       256   0x9
          === after ===
          /mnt/btrfs/file1:
           EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
             0: [0..255]:        26640..26895       256   0x8
             1: [256..511]:      26648..26903       256   0x9
      
      This affects not only v5.16 (after the defrag rework), but also v5.15
      (before the defrag rework).
      
      [CAUSE]
      From the very beginning, btrfs defrag never checks if one extent is
      already at its max capacity (128K for compressed extents, 128M
      otherwise).
      
      And the default extent size threshold is 256K, which is already beyond
      the compressed extent max size.
      
      This means, by default btrfs defrag ioctl will mark all compressed
      extent which is not adjacent to a hole/preallocated range for defrag.
      
      [FIX]
      Introduce a helper to grab the maximum extent size, and then in
      defrag_collect_targets() and defrag_check_next_extent(), reject extents
      which are already at their max capacity.
      
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      979b25c3
    • Qu Wenruo's avatar
      btrfs: defrag: don't try to merge regular extents with preallocated extents · 7093f152
      Qu Wenruo authored
      
      
      [BUG]
      With older kernels (before v5.16), btrfs will defrag preallocated extents.
      While with newer kernels (v5.16 and newer) btrfs will not defrag
      preallocated extents, but it will defrag the extent just before the
      preallocated extent, even it's just a single sector.
      
      This can be exposed by the following small script:
      
      	mkfs.btrfs -f $dev > /dev/null
      
      	mount $dev $mnt
      	xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
      	xfs_io -c "fiemap -v" $mnt/file
      	btrfs fi defrag $mnt/file
      	sync
      	xfs_io -c "fiemap -v" $mnt/file
      
      The output looks like this on older kernels:
      
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26624..26631         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..39]:         26664..26703        40   0x1
      
      Which defrags the single sector along with the preallocated extent, and
      replace them with an regular extent into a new location (caused by data
      COW).
      This wastes most of the data IO just for the preallocated range.
      
      On the other hand, v5.16 is slightly better:
      
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26624..26631         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26664..26671         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      
      The preallocated range is not defragged, but the sector before it still
      gets defragged, which has no need for it.
      
      [CAUSE]
      One of the function reused by the old and new behavior is
      defrag_check_next_extent(), it will determine if we should defrag
      current extent by checking the next one.
      
      It only checks if the next extent is a hole or inlined, but it doesn't
      check if it's preallocated.
      
      On the other hand, out of the function, both old and new kernel will
      reject preallocated extents.
      
      Such inconsistent behavior causes above behavior.
      
      [FIX]
      - Also check if next extent is preallocated
        If so, don't defrag current extent.
      
      - Add comments for each branch why we reject the extent
      
      This will reduce the IO caused by defrag ioctl and autodefrag.
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7093f152
  4. Feb 16, 2022
    • Qu Wenruo's avatar
      btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target · 966d879b
      Qu Wenruo authored
      
      
      In the rework of btrfs_defrag_file(), we always call
      defrag_one_cluster() and increase the offset by cluster size, which is
      only 256K.
      
      But there are cases where we have a large extent (e.g. 128M) which
      doesn't need to be defragged at all.
      
      Before the refactor, we can directly skip the range, but now we have to
      scan that extent map again and again until the cluster moves after the
      non-target extent.
      
      Fix the problem by allow defrag_one_cluster() to increase
      btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
      the last extent of the cluster is not a target.
      
      The test script looks like this:
      
      	mkfs.btrfs -f $dev > /dev/null
      
      	mount $dev $mnt
      
      	# As btrfs ioctl uses 32M as extent_threshold
      	xfs_io -f -c "pwrite 0 64M" $mnt/file1
      	sync
      	# Some fragemented range to defrag
      	xfs_io -s -c "pwrite 65548k 4k" \
      		  -c "pwrite 65544k 4k" \
      		  -c "pwrite 65540k 4k" \
      		  -c "pwrite 65536k 4k" \
      		  $mnt/file1
      	sync
      
      	echo "=== before ==="
      	xfs_io -c "fiemap -v" $mnt/file1
      	echo "=== after ==="
      	btrfs fi defrag $mnt/file1
      	sync
      	xfs_io -c "fiemap -v" $mnt/file1
      	umount $mnt
      
      With extra ftrace put into defrag_one_cluster(), before the patch it
      would result tons of loops:
      
      (As defrag_one_cluster() is inlined, the function name is its caller)
      
        btrfs-126062  [005] .....  4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144
        btrfs-126062  [005] .....  4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144
        ...
        btrfs-126062  [005] .....  4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144
      
      But with this patch there will be just one loop, then directly to the
      end of the extent:
      
        btrfs-130471  [014] .....  5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144
        btrfs-130471  [014] .....  5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      966d879b
    • Dāvis Mosāns's avatar
      btrfs: prevent copying too big compressed lzo segment · 741b23a9
      Dāvis Mosāns authored
      
      
      Compressed length can be corrupted to be a lot larger than memory
      we have allocated for buffer.
      This will cause memcpy in copy_compressed_segment to write outside
      of allocated memory.
      
      This mostly results in stuck read syscall but sometimes when using
      btrfs send can get #GP
      
        kernel: general protection fault, probably for non-canonical address 0x841551d5c1000: 0000 [#1] PREEMPT SMP NOPTI
        kernel: CPU: 17 PID: 264 Comm: kworker/u256:7 Tainted: P           OE     5.17.0-rc2-1 #12
        kernel: Workqueue: btrfs-endio btrfs_work_helper [btrfs]
        kernel: RIP: 0010:lzo_decompress_bio (./include/linux/fortify-string.h:225 fs/btrfs/lzo.c:322 fs/btrfs/lzo.c:394) btrfs
        Code starting with the faulting instruction
        ===========================================
           0:*  48 8b 06                mov    (%rsi),%rax              <-- trapping instruction
           3:   48 8d 79 08             lea    0x8(%rcx),%rdi
           7:   48 83 e7 f8             and    $0xfffffffffffffff8,%rdi
           b:   48 89 01                mov    %rax,(%rcx)
           e:   44 89 f0                mov    %r14d,%eax
          11:   48 8b 54 06 f8          mov    -0x8(%rsi,%rax,1),%rdx
        kernel: RSP: 0018:ffffb110812efd50 EFLAGS: 00010212
        kernel: RAX: 0000000000001000 RBX: 000000009ca264c8 RCX: ffff98996e6d8ff8
        kernel: RDX: 0000000000000064 RSI: 000841551d5c1000 RDI: ffffffff9500435d
        kernel: RBP: ffff989a3be856c0 R08: 0000000000000000 R09: 0000000000000000
        kernel: R10: 0000000000000000 R11: 0000000000001000 R12: ffff98996e6d8000
        kernel: R13: 0000000000000008 R14: 0000000000001000 R15: 000841551d5c1000
        kernel: FS:  0000000000000000(0000) GS:ffff98a09d640000(0000) knlGS:0000000000000000
        kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        kernel: CR2: 00001e9f984d9ea8 CR3: 000000014971a000 CR4: 00000000003506e0
        kernel: Call Trace:
        kernel:  <TASK>
        kernel: end_compressed_bio_read (fs/btrfs/compression.c:104 fs/btrfs/compression.c:1363 fs/btrfs/compression.c:323) btrfs
        kernel: end_workqueue_fn (fs/btrfs/disk-io.c:1923) btrfs
        kernel: btrfs_work_helper (fs/btrfs/async-thread.c:326) btrfs
        kernel: process_one_work (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:212 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2312)
        kernel: worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2455)
        kernel: ? process_one_work (kernel/workqueue.c:2397)
        kernel: kthread (kernel/kthread.c:377)
        kernel: ? kthread_complete_and_exit (kernel/kthread.c:332)
        kernel: ret_from_fork (arch/x86/entry/entry_64.S:301)
        kernel:  </TASK>
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarDāvis Mosāns <davispuh@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      741b23a9
  5. Feb 10, 2022
    • Dāvis Mosāns's avatar
      btrfs: send: in case of IO error log it · 2e7be9db
      Dāvis Mosāns authored
      
      
      Currently if we get IO error while doing send then we abort without
      logging information about which file caused issue.  So log it to help
      with debugging.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarDāvis Mosāns <davispuh@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e7be9db
    • Filipe Manana's avatar
      btrfs: get rid of warning on transaction commit when using flushoncommit · a0f0cf83
      Filipe Manana authored
      When using the flushoncommit mount option, during almost every transaction
      commit we trigger a warning from __writeback_inodes_sb_nr():
      
        $ cat fs/fs-writeback.c:
        (...)
        static void __writeback_inodes_sb_nr(struct super_block *sb, ...
        {
              (...)
              WARN_ON(!rwsem_is_locked(&sb->s_umount));
              (...)
        }
        (...)
      
      The trace produced in dmesg looks like the following:
      
        [947.473890] WARNING: CPU: 5 PID: 930 at fs/fs-writeback.c:2610 __writeback_inodes_sb_nr+0x7e/0xb3
        [947.481623] Modules linked in: nfsd nls_cp437 cifs asn1_decoder cifs_arc4 fscache cifs_md4 ipmi_ssif
        [947.489571] CPU: 5 PID: 930 Comm: btrfs-transacti Not tainted 95.16.3-srb-asrock-00001-g36437ad63879 #186
        [947.497969] RIP: 0010:__writeback_inodes_sb_nr+0x7e/0xb3
        [947.502097] Code: 24 10 4c 89 44 24 18 c6 (...)
        [947.519760] RSP: 0018:ffffc90000777e10 EFLAGS: 00010246
        [947.523818] RAX: 0000000000000000 RBX: 0000000000963300 RCX: 0000000000000000
        [947.529765] RDX: 0000000000000000 RSI: 000000000000fa51 RDI: ffffc90000777e50
        [947.535740] RBP: ffff888101628a90 R08: ffff888100955800 R09: ffff888100956000
        [947.541701] R10: 0000000000000002 R11: 0000000000000001 R12: ffff888100963488
        [947.547645] R13: ffff888100963000 R14: ffff888112fb7200 R15: ffff888100963460
        [947.553621] FS:  0000000000000000(0000) GS:ffff88841fd40000(0000) knlGS:0000000000000000
        [947.560537] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [947.565122] CR2: 0000000008be50c4 CR3: 000000000220c000 CR4: 00000000001006e0
        [947.571072] Call Trace:
        [947.572354]  <TASK>
        [947.573266]  btrfs_commit_transaction+0x1f1/0x998
        [947.576785]  ? start_transaction+0x3ab/0x44e
        [947.579867]  ? schedule_timeout+0x8a/0xdd
        [947.582716]  transaction_kthread+0xe9/0x156
        [947.585721]  ? btrfs_cleanup_transaction.isra.0+0x407/0x407
        [947.590104]  kthread+0x131/0x139
        [947.592168]  ? set_kthread_struct+0x32/0x32
        [947.595174]  ret_from_fork+0x22/0x30
        [947.597561]  </TASK>
        [947.598553] ---[ end trace 644721052755541c ]---
      
      This is because we started using writeback_inodes_sb() to flush delalloc
      when committing a transaction (when using -o flushoncommit), in order to
      avoid deadlocks with filesystem freeze operations. This change was made
      by commit ce8ea7cc ("btrfs: don't call btrfs_start_delalloc_roots
      in flushoncommit"). After that change we started producing that warning,
      and every now and then a user reports this since the warning happens too
      often, it spams dmesg/syslog, and a user is unsure if this reflects any
      problem that might compromise the filesystem's reliability.
      
      We can not just lock the sb->s_umount semaphore before calling
      writeback_inodes_sb(), because that would at least deadlock with
      filesystem freezing, since at fs/super.c:freeze_super() sync_filesystem()
      is called while we are holding that semaphore in write mode, and that can
      trigger a transaction commit, resulting in a deadlock. It would also
      trigger the same type of deadlock in the unmount path. Possibly, it could
      also introduce some other locking dependencies that lockdep would report.
      
      To fix this call try_to_writeback_inodes_sb() instead of
      writeback_inodes_sb(), because that will try to read lock sb->s_umount
      and then will only call writeback_inodes_sb() if it was able to lock it.
      This is fine because the cases where it can't read lock sb->s_umount
      are during a filesystem unmount or during a filesystem freeze - in those
      cases sb->s_umount is write locked and sync_filesystem() is called, which
      calls writeback_inodes_sb(). In other words, in all cases where we can't
      take a read lock on sb->s_umount, writeback is already being triggered
      elsewhere.
      
      An alternative would be to call btrfs_start_delalloc_roots() with a
      number of pages different from LONG_MAX, for example matching the number
      of delalloc bytes we currently have, in which case we would end up
      starting all delalloc with filemap_fdatawrite_wbc() and not with an
      async flush via filemap_flush() - that is only possible after the rather
      recent commit e076ab2a ("btrfs: shrink delalloc pages instead of
      full inodes"). However that creates a whole new can of worms due to new
      lock dependencies, which lockdep complains, like for example:
      
      [ 8948.247280] ======================================================
      [ 8948.247823] WARNING: possible circular locking dependency detected
      [ 8948.248353] 5.17.0-rc1-btrfs-next-111 #1 Not tainted
      [ 8948.248786] ------------------------------------------------------
      [ 8948.249320] kworker/u16:18/933570 is trying to acquire lock:
      [ 8948.249812] ffff9b3de1591690 (sb_internal#2){.+.+}-{0:0}, at: find_free_extent+0x141e/0x1590 [btrfs]
      [ 8948.250638]
                     but task is already holding lock:
      [ 8948.251140] ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
      [ 8948.252018]
                     which lock already depends on the new lock.
      
      [ 8948.252710]
                     the existing dependency chain (in reverse order) is:
      [ 8948.253343]
                     -> #2 (&root->delalloc_mutex){+.+.}-{3:3}:
      [ 8948.253950]        __mutex_lock+0x90/0x900
      [ 8948.254354]        start_delalloc_inodes+0x78/0x400 [btrfs]
      [ 8948.254859]        btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
      [ 8948.255408]        btrfs_commit_transaction+0x32f/0xc00 [btrfs]
      [ 8948.255942]        btrfs_mksubvol+0x380/0x570 [btrfs]
      [ 8948.256406]        btrfs_mksnapshot+0x81/0xb0 [btrfs]
      [ 8948.256870]        __btrfs_ioctl_snap_create+0x17f/0x190 [btrfs]
      [ 8948.257413]        btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [ 8948.257961]        btrfs_ioctl+0x1196/0x3630 [btrfs]
      [ 8948.258418]        __x64_sys_ioctl+0x83/0xb0
      [ 8948.258793]        do_syscall_64+0x3b/0xc0
      [ 8948.259146]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 8948.259709]
                     -> #1 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}:
      [ 8948.260330]        __mutex_lock+0x90/0x900
      [ 8948.260692]        btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
      [ 8948.261234]        btrfs_commit_transaction+0x32f/0xc00 [btrfs]
      [ 8948.261766]        btrfs_set_free_space_cache_v1_active+0x38/0x60 [btrfs]
      [ 8948.262379]        btrfs_start_pre_rw_mount+0x119/0x180 [btrfs]
      [ 8948.262909]        open_ctree+0x1511/0x171e [btrfs]
      [ 8948.263359]        btrfs_mount_root.cold+0x12/0xde [btrfs]
      [ 8948.263863]        legacy_get_tree+0x30/0x50
      [ 8948.264242]        vfs_get_tree+0x28/0xc0
      [ 8948.264594]        vfs_kern_mount.part.0+0x71/0xb0
      [ 8948.265017]        btrfs_mount+0x11d/0x3a0 [btrfs]
      [ 8948.265462]        legacy_get_tree+0x30/0x50
      [ 8948.265851]        vfs_get_tree+0x28/0xc0
      [ 8948.266203]        path_mount+0x2d4/0xbe0
      [ 8948.266554]        __x64_sys_mount+0x103/0x140
      [ 8948.266940]        do_syscall_64+0x3b/0xc0
      [ 8948.267300]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 8948.267790]
                     -> #0 (sb_internal#2){.+.+}-{0:0}:
      [ 8948.268322]        __lock_acquire+0x12e8/0x2260
      [ 8948.268733]        lock_acquire+0xd7/0x310
      [ 8948.269092]        start_transaction+0x44c/0x6e0 [btrfs]
      [ 8948.269591]        find_free_extent+0x141e/0x1590 [btrfs]
      [ 8948.270087]        btrfs_reserve_extent+0x14b/0x280 [btrfs]
      [ 8948.270588]        cow_file_range+0x17e/0x490 [btrfs]
      [ 8948.271051]        btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
      [ 8948.271586]        writepage_delalloc+0xb5/0x170 [btrfs]
      [ 8948.272071]        __extent_writepage+0x156/0x3c0 [btrfs]
      [ 8948.272579]        extent_write_cache_pages+0x263/0x460 [btrfs]
      [ 8948.273113]        extent_writepages+0x76/0x130 [btrfs]
      [ 8948.273573]        do_writepages+0xd2/0x1c0
      [ 8948.273942]        filemap_fdatawrite_wbc+0x68/0x90
      [ 8948.274371]        start_delalloc_inodes+0x17f/0x400 [btrfs]
      [ 8948.274876]        btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
      [ 8948.275417]        flush_space+0x1f2/0x630 [btrfs]
      [ 8948.275863]        btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
      [ 8948.276438]        process_one_work+0x252/0x5a0
      [ 8948.276829]        worker_thread+0x55/0x3b0
      [ 8948.277189]        kthread+0xf2/0x120
      [ 8948.277506]        ret_from_fork+0x22/0x30
      [ 8948.277868]
                     other info that might help us debug this:
      
      [ 8948.278548] Chain exists of:
                       sb_internal#2 --> &fs_info->delalloc_root_mutex --> &root->delalloc_mutex
      
      [ 8948.279601]  Possible unsafe locking scenario:
      
      [ 8948.280102]        CPU0                    CPU1
      [ 8948.280508]        ----                    ----
      [ 8948.280915]   lock(&root->delalloc_mutex);
      [ 8948.281271]                                lock(&fs_info->delalloc_root_mutex);
      [ 8948.281915]                                lock(&root->delalloc_mutex);
      [ 8948.282487]   lock(sb_internal#2);
      [ 8948.282800]
                      *** DEADLOCK ***
      
      [ 8948.283333] 4 locks held by kworker/u16:18/933570:
      [ 8948.283750]  #0: ffff9b3dc00a9d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
      [ 8948.284609]  #1: ffffa90349dafe70 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
      [ 8948.285637]  #2: ffff9b3e14db5040 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}, at: btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
      [ 8948.286674]  #3: ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
      [ 8948.287596]
                    stack backtrace:
      [ 8948.287975] CPU: 3 PID: 933570 Comm: kworker/u16:18 Not tainted 5.17.0-rc1-btrfs-next-111 #1
      [ 8948.288677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 8948.289649] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
      [ 8948.290298] Call Trace:
      [ 8948.290517]  <TASK>
      [ 8948.290700]  dump_stack_lvl+0x59/0x73
      [ 8948.291026]  check_noncircular+0xf3/0x110
      [ 8948.291375]  ? start_transaction+0x228/0x6e0 [btrfs]
      [ 8948.291826]  __lock_acquire+0x12e8/0x2260
      [ 8948.292241]  lock_acquire+0xd7/0x310
      [ 8948.292714]  ? find_free_extent+0x141e/0x1590 [btrfs]
      [ 8948.293241]  ? lock_is_held_type+0xea/0x140
      [ 8948.293601]  start_transaction+0x44c/0x6e0 [btrfs]
      [ 8948.294055]  ? find_free_extent+0x141e/0x1590 [btrfs]
      [ 8948.294518]  find_free_extent+0x141e/0x1590 [btrfs]
      [ 8948.294957]  ? _raw_spin_unlock+0x29/0x40
      [ 8948.295312]  ? btrfs_get_alloc_profile+0x124/0x290 [btrfs]
      [ 8948.295813]  btrfs_reserve_extent+0x14b/0x280 [btrfs]
      [ 8948.296270]  cow_file_range+0x17e/0x490 [btrfs]
      [ 8948.296691]  btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
      [ 8948.297175]  ? find_lock_delalloc_range+0x247/0x270 [btrfs]
      [ 8948.297678]  writepage_delalloc+0xb5/0x170 [btrfs]
      [ 8948.298123]  __extent_writepage+0x156/0x3c0 [btrfs]
      [ 8948.298570]  extent_write_cache_pages+0x263/0x460 [btrfs]
      [ 8948.299061]  extent_writepages+0x76/0x130 [btrfs]
      [ 8948.299495]  do_writepages+0xd2/0x1c0
      [ 8948.299817]  ? sched_clock_cpu+0xd/0x110
      [ 8948.300160]  ? lock_release+0x155/0x4a0
      [ 8948.300494]  filemap_fdatawrite_wbc+0x68/0x90
      [ 8948.300874]  ? do_raw_spin_unlock+0x4b/0xa0
      [ 8948.301243]  start_delalloc_inodes+0x17f/0x400 [btrfs]
      [ 8948.301706]  ? lock_release+0x155/0x4a0
      [ 8948.302055]  btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
      [ 8948.302564]  flush_space+0x1f2/0x630 [btrfs]
      [ 8948.302970]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
      [ 8948.303510]  process_one_work+0x252/0x5a0
      [ 8948.303860]  ? process_one_work+0x5a0/0x5a0
      [ 8948.304221]  worker_thread+0x55/0x3b0
      [ 8948.304543]  ? process_one_work+0x5a0/0x5a0
      [ 8948.304904]  kthread+0xf2/0x120
      [ 8948.305184]  ? kthread_complete_and_exit+0x20/0x20
      [ 8948.305598]  ret_from_fork+0x22/0x30
      [ 8948.305921]  </TASK>
      
      It all comes from the fact that btrfs_start_delalloc_roots() takes the
      delalloc_root_mutex, in the transaction commit path we are holding a
      read lock on one of the superblock's freeze semaphores (via
      sb_start_intwrite()), the async reclaim task can also do a call to
      btrfs_start_delalloc_roots(), which ends up triggering writeback with
      calls to filemap_fdatawrite_wbc(), resulting in extent allocation which
      in turn can call btrfs_start_transaction(), which will result in taking
      the freeze semaphore via sb_start_intwrite(), forming a nasty dependency
      on all those locks which can be taken in different orders by different
      code paths.
      
      So just adopt the simple approach of calling try_to_writeback_inodes_sb()
      at btrfs_start_delalloc_flush().
      
      Link: https://lore.kernel.org/linux-btrfs/20220130005258.GA7465@cuci.nl/
      Link: https://lore.kernel.org/linux-btrfs/43acc426-d683-d1b6-729d-c6bc4a2fff4d@gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/6833930a-08d7-6fbc-0141-eb9cdfd6bb4d@gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/20190322041731.GF16651@hungrycats.org/
      
      
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      [ add more link reports ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a0f0cf83
    • Qu Wenruo's avatar
      btrfs: defrag: don't try to defrag extents which are under writeback · 0d1ffa22
      Qu Wenruo authored
      
      
      Once we start writeback (have called btrfs_run_delalloc_range()), we
      allocate an extent, create an extent map point to that extent, with a
      generation of (u64)-1, created the ordered extent and then clear the
      DELALLOC bit from the range in the inode's io tree.
      
      Such extent map can pass the first call of defrag_collect_targets(), as
      its generation is (u64)-1, meets any possible minimal generation check.
      And the range will not have DELALLOC bit, also passing the DELALLOC bit
      check.
      
      It will only be re-checked in the second call of
      defrag_collect_targets(), which will wait for writeback.
      
      But at that stage we have already spent our time waiting for some IO we
      may or may not want to defrag.
      
      Let's reject such extents early so we won't waste our time.
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d1ffa22
    • Qu Wenruo's avatar
      btrfs: don't hold CPU for too long when defragging a file · ea0eba69
      Qu Wenruo authored
      There is a user report about "btrfs filesystem defrag" causing 120s
      timeout problem.
      
      For btrfs_defrag_file() it will iterate all file extents if called from
      defrag ioctl, thus it can take a long time.
      
      There is no reason not to release the CPU during such a long operation.
      
      Add cond_resched() after defragged one cluster.
      
      CC: stable@vger.kernel.org # 5.16
      Link: https://lore.kernel.org/linux-btrfs/10e51417-2203-f0a4-2021-86c8511cc367@gmx.com
      
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea0eba69
  6. Jan 31, 2022
    • Filipe Manana's avatar
      btrfs: skip reserved bytes warning on unmount after log cleanup failure · 40cdc509
      Filipe Manana authored
      After the recent changes made by commit c2e39305 ("btrfs: clear
      extent buffer uptodate when we fail to write it") and its followup fix,
      commit 651740a5 ("btrfs: check WRITE_ERR when trying to read an
      extent buffer"), we can now end up not cleaning up space reservations of
      log tree extent buffers after a transaction abort happens, as well as not
      cleaning up still dirty extent buffers.
      
      This happens because if writeback for a log tree extent buffer failed,
      then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
      and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
      when trying to free the log tree with free_log_tree(), which iterates
      over the tree, we can end up getting an -EIO error when trying to read
      a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
      extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
      EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
      immediately as we can not iterate over the entire tree.
      
      In that case we never update the reserved space for an extent buffer in
      the respective block group and space_info object.
      
      When this happens we get the following traces when unmounting the fs:
      
      [174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
      [174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
      [174957.399379] ------------[ cut here ]------------
      [174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.407523] Modules linked in: btrfs overlay dm_zero (...)
      [174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.429717] Code: 21 48 8b bd (...)
      [174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
      [174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
      [174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
      [174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
      [174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
      [174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
      [174957.439317] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.440402] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.443948] Call Trace:
      [174957.444264]  <TASK>
      [174957.444538]  btrfs_free_block_groups+0x255/0x3c0 [btrfs]
      [174957.445238]  close_ctree+0x301/0x357 [btrfs]
      [174957.445803]  ? call_rcu+0x16c/0x290
      [174957.446250]  generic_shutdown_super+0x74/0x120
      [174957.446832]  kill_anon_super+0x14/0x30
      [174957.447305]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.447890]  deactivate_locked_super+0x31/0xa0
      [174957.448440]  cleanup_mnt+0x147/0x1c0
      [174957.448888]  task_work_run+0x5c/0xa0
      [174957.449336]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.449934]  syscall_exit_to_user_mode+0x16/0x40
      [174957.450512]  do_syscall_64+0x48/0xc0
      [174957.450980]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.451605] RIP: 0033:0x7f328fdc4a97
      [174957.452059] Code: 03 0c 00 f7 (...)
      [174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.460193]  </TASK>
      [174957.460534] irq event stamp: 0
      [174957.461003] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.463147] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.466323] ---[ end trace bc7ee0c490bce3af ]---
      [174957.467282] ------------[ cut here ]------------
      [174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.470066] Modules linked in: btrfs overlay dm_zero (...)
      [174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.488050] Code: 00 00 00 ad de (...)
      [174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
      [174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
      [174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
      [174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
      [174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
      [174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
      [174957.499438] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.500990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.506167] Call Trace:
      [174957.506654]  <TASK>
      [174957.507047]  close_ctree+0x301/0x357 [btrfs]
      [174957.507867]  ? call_rcu+0x16c/0x290
      [174957.508567]  generic_shutdown_super+0x74/0x120
      [174957.509447]  kill_anon_super+0x14/0x30
      [174957.510194]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.511123]  deactivate_locked_super+0x31/0xa0
      [174957.511976]  cleanup_mnt+0x147/0x1c0
      [174957.512610]  task_work_run+0x5c/0xa0
      [174957.513309]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.514231]  syscall_exit_to_user_mode+0x16/0x40
      [174957.515069]  do_syscall_64+0x48/0xc0
      [174957.515718]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.516688] RIP: 0033:0x7f328fdc4a97
      [174957.517413] Code: 03 0c 00 f7 d8 (...)
      [174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.529404]  </TASK>
      [174957.529843] irq event stamp: 0
      [174957.530256] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.532075] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
      [174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
      [174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
      [174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
      [174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
      [174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
      [174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
      [174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
      
      This also means that in case we have log tree extent buffers that are
      still dirty, we can end up not cleaning them up in case we find an
      extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
      we have no way for iterating over the rest of the tree.
      
      This issue is very often triggered with test cases generic/475 and
      generic/648 from fstests.
      
      The issue could almost be fixed by iterating over the io tree attached to
      each log root which keeps tracks of the range of allocated extent buffers,
      log_root->dirty_log_pages, however that does not work and has some
      inconveniences:
      
      1) After we sync the log, we clear the range of the extent buffers from
         the io tree, so we can't find them after writeback. We could keep the
         ranges in the io tree, with a separate bit to signal they represent
         extent buffers already written, but that means we need to hold into
         more memory until the transaction commits.
      
         How much more memory is used depends a lot on whether we are able to
         allocate contiguous extent buffers on disk (and how often) for a log
         tree - if we are able to, then a single extent state record can
         represent multiple extent buffers, otherwise we need multiple extent
         state record structures to track each extent buffer.
         In fact, my earlier approach did that:
      
         https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
      
         However that can cause a very significant negative impact on
         performance, not only due to the extra memory usage but also because
         we get a larger and deeper dirty_log_pages io tree.
         We got a report that, on beefy machines at least, we can get such
         performance drop with fsmark for example:
      
         https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
      
      
      
      2) We would be doing it only to deal with an unexpected and exceptional
         case, which is basically failure to read an extent buffer from disk
         due to IO failures. On a healthy system we don't expect transaction
         aborts to happen after all;
      
      3) Instead of relying on iterating the log tree or tracking the ranges
         of extent buffers in the dirty_log_pages io tree, using the radix
         tree that tracks extent buffers (fs_info->buffer_radix) to find all
         log tree extent buffers is not reliable either, because after writeback
         of an extent buffer it can be evicted from memory by the release page
         callback of the btree inode (btree_releasepage()).
      
      Since there's no way to be able to properly cleanup a log tree without
      being able to read its extent buffers from disk and without using more
      memory to track the logical ranges of the allocated extent buffers do
      the following:
      
      1) When we fail to cleanup a log tree, setup a flag that indicates that
         failure;
      
      2) Trigger writeback of all log tree extent buffers that are still dirty,
         and wait for the writeback to complete. This is just to cleanup their
         state, page states, page leaks, etc;
      
      3) When unmounting the fs, ignore if the number of bytes reserved in a
         block group and in a space_info is not 0 if, and only if, we failed to
         cleanup a log tree. Also ignore only for metadata block groups and the
         metadata space_info object.
      
      This is far from a perfect solution, but it serves to silence test
      failures such as those from generic/475 and generic/648. However having
      a non-zero value for the reserved bytes counters on unmount after a
      transaction abort, is not such a terrible thing and it's completely
      harmless, it does not affect the filesystem integrity in any way.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40cdc509
    • Tom Rix's avatar
      btrfs: fix use of uninitialized variable at rm device ioctl · 37b45995
      Tom Rix authored
      
      
      Clang static analysis reports this problem
      ioctl.c:3333:8: warning: 3rd function call argument is an
        uninitialized value
          ret = exclop_start_or_cancel_reloc(fs_info,
      
      cancel is only set in one branch of an if-check and is always used.  So
      initialize to false.
      
      Fixes: 1a15eb72 ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarTom Rix <trix@redhat.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      37b45995
    • Filipe Manana's avatar
      btrfs: fix use-after-free after failure to create a snapshot · 28b21c55
      Filipe Manana authored
      
      
      At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
      then attach it to the transaction's list of pending snapshots. After that
      we call btrfs_commit_transaction(), and if that returns an error we jump
      to 'fail' label, where we kfree() the pending snapshot structure. This can
      result in a later use-after-free of the pending snapshot:
      
      1) We allocated the pending snapshot and added it to the transaction's
         list of pending snapshots;
      
      2) We call btrfs_commit_transaction(), and it fails either at the first
         call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
         In both cases, we don't abort the transaction and we release our
         transaction handle. We jump to the 'fail' label and free the pending
         snapshot structure. We return with the pending snapshot still in the
         transaction's list;
      
      3) Another task commits the transaction. This time there's no error at
         all, and then during the transaction commit it accesses a pointer
         to the pending snapshot structure that the snapshot creation task
         has already freed, resulting in a user-after-free.
      
      This issue could actually be detected by smatch, which produced the
      following warning:
      
        fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list
      
      So fix this by not having the snapshot creation ioctl directly add the
      pending snapshot to the transaction's list. Instead add the pending
      snapshot to the transaction handle, and then at btrfs_commit_transaction()
      we add the snapshot to the list only when we can guarantee that any error
      returned after that point will result in a transaction abort, in which
      case the ioctl code can safely free the pending snapshot and no one can
      access it anymore.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28b21c55
    • Su Yue's avatar
      btrfs: tree-checker: check item_size for dev_item · ea1d1ca4
      Su Yue authored
      
      
      Check item size before accessing the device item to avoid out of bound
      access, similar to inode_item check.
      
      Signed-off-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea1d1ca4
    • Su Yue's avatar
      btrfs: tree-checker: check item_size for inode_item · 0c982944
      Su Yue authored
      while mounting the crafted image, out-of-bounds access happens:
      
        [350.429619] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
        [350.429636] index 1048096 is out of range for type 'page *[16]'
        [350.429650] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4 #1
        [350.429652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
        [350.429653] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
        [350.429772] Call Trace:
        [350.429774]  <TASK>
        [350.429776]  dump_stack_lvl+0x47/0x5c
        [350.429780]  ubsan_epilogue+0x5/0x50
        [350.429786]  __ubsan_handle_out_of_bounds+0x66/0x70
        [350.429791]  btrfs_get_16+0xfd/0x120 [btrfs]
        [350.429832]  check_leaf+0x754/0x1a40 [btrfs]
        [350.429874]  ? filemap_read+0x34a/0x390
        [350.429878]  ? load_balance+0x175/0xfc0
        [350.429881]  validate_extent_buffer+0x244/0x310 [btrfs]
        [350.429911]  btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
        [350.429935]  end_bio_extent_readpage+0x3af/0x850 [btrfs]
        [350.429969]  ? newidle_balance+0x259/0x480
        [350.429972]  end_workqueue_fn+0x29/0x40 [btrfs]
        [350.429995]  btrfs_work_helper+0x71/0x330 [btrfs]
        [350.430030]  ? __schedule+0x2fb/0xa40
        [350.430033]  process_one_work+0x1f6/0x400
        [350.430035]  ? process_one_work+0x400/0x400
        [350.430036]  worker_thread+0x2d/0x3d0
        [350.430037]  ? process_one_work+0x400/0x400
        [350.430038]  kthread+0x165/0x190
        [350.430041]  ? set_kthread_struct+0x40/0x40
        [350.430043]  ret_from_fork+0x1f/0x30
        [350.430047]  </TASK>
        [350.430077] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
      
      check_leaf() is checking the leaf:
      
        corrupt leaf: root=4 block=29396992 slot=1, bad key order, prev (16140901064495857664 1 0) current (1 204 12582912)
        leaf 29396992 items 6 free space 3565 generation 6 owner DEV_TREE
        leaf 29396992 flags 0x1(WRITTEN) backref revision 1
        fs uuid a62e00e8-e94e-4200-8217-12444de93c2e
        chunk uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
      	  item 0 key (16140901064495857664 INODE_ITEM 0) itemoff 3955 itemsize 40
      		  generation 0 transid 0 size 0 nbytes 17592186044416
      		  block group 0 mode 52667 links 33 uid 0 gid 2104132511 rdev 94223634821136
      		  sequence 100305 flags 0x2409000(none)
      		  atime 0.0 (1970-01-01 08:00:00)
      		  ctime 2973280098083405823.4294967295 (-269783007-01-01 21:37:03)
      		  mtime 18446744071572723616.4026825121 (1902-04-16 12:40:00)
      		  otime 9249929404488876031.4294967295 (622322949-04-16 04:25:58)
      	  item 1 key (1 DEV_EXTENT 12582912) itemoff 3907 itemsize 48
      		  dev extent chunk_tree 3
      		  chunk_objectid 256 chunk_offset 12582912 length 8388608
      		  chunk_tree_uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
      
      The corrupted leaf of device tree has an inode item. The leaf passed
      checksum and others checks in validate_extent_buffer until check_leaf_item().
      Because of the key type BTRFS_INODE_ITEM, check_inode_item() is called even we
      are in the device tree. Since the
      item offset + sizeof(struct btrfs_inode_item) > eb->len, out-of-bounds access
      is triggered.
      
      The item end vs leaf boundary check has been done before
      check_leaf_item(), so fix it by checking item size in check_inode_item()
      before access of the inode item in extent buffer.
      
      Other check functions except check_dev_item() in check_leaf_item()
      have their item size checks.
      The commit for check_dev_item() is followed.
      
      No regression observed during running fstests.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
      
      
      CC: stable@vger.kernel.org # 5.10+
      CC: Wenqing Liu <wenqingliu0120@gmail.com>
      Signed-off-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c982944
    • Shin'ichiro Kawasaki's avatar
      btrfs: fix deadlock between quota disable and qgroup rescan worker · e804861b
      Shin'ichiro Kawasaki authored
      
      
      Quota disable ioctl starts a transaction before waiting for the qgroup
      rescan worker completes. However, this wait can be infinite and results
      in deadlock because of circular dependency among the quota disable
      ioctl, the qgroup rescan worker and the other task with transaction such
      as block group relocation task.
      
      The deadlock happens with the steps following:
      
      1) Task A calls ioctl to disable quota. It starts a transaction and
         waits for qgroup rescan worker completes.
      2) Task B such as block group relocation task starts a transaction and
         joins to the transaction that task A started. Then task B commits to
         the transaction. In this commit, task B waits for a commit by task A.
      3) Task C as the qgroup rescan worker starts its job and starts a
         transaction. In this transaction start, task C waits for completion
         of the transaction that task A started and task B committed.
      
      This deadlock was found with fstests test case btrfs/115 and a zoned
      null_blk device. The test case enables and disables quota, and the
      block group reclaim was triggered during the quota disable by chance.
      The deadlock was also observed by running quota enable and disable in
      parallel with 'btrfs balance' command on regular null_blk devices.
      
      An example report of the deadlock:
      
        [372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds.
        [372.479944]       Not tainted 5.16.0-rc8 #7
        [372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [372.493898] task:kworker/u16:6   state:D stack:    0 pid:  103 ppid:     2 flags:0x00004000
        [372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs]
        [372.510782] Call Trace:
        [372.514092]  <TASK>
        [372.521684]  __schedule+0xb56/0x4850
        [372.530104]  ? io_schedule_timeout+0x190/0x190
        [372.538842]  ? lockdep_hardirqs_on+0x7e/0x100
        [372.547092]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [372.555591]  schedule+0xe0/0x270
        [372.561894]  btrfs_commit_transaction+0x18bb/0x2610 [btrfs]
        [372.570506]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
        [372.578875]  ? free_unref_page+0x3f2/0x650
        [372.585484]  ? finish_wait+0x270/0x270
        [372.591594]  ? release_extent_buffer+0x224/0x420 [btrfs]
        [372.599264]  btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs]
        [372.607157]  ? lock_release+0x3a9/0x6d0
        [372.613054]  ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs]
        [372.620960]  ? do_raw_spin_lock+0x11e/0x250
        [372.627137]  ? rwlock_bug.part.0+0x90/0x90
        [372.633215]  ? lock_is_held_type+0xe4/0x140
        [372.639404]  btrfs_work_helper+0x1ae/0xa90 [btrfs]
        [372.646268]  process_one_work+0x7e9/0x1320
        [372.652321]  ? lock_release+0x6d0/0x6d0
        [372.658081]  ? pwq_dec_nr_in_flight+0x230/0x230
        [372.664513]  ? rwlock_bug.part.0+0x90/0x90
        [372.670529]  worker_thread+0x59e/0xf90
        [372.676172]  ? process_one_work+0x1320/0x1320
        [372.682440]  kthread+0x3b9/0x490
        [372.687550]  ? _raw_spin_unlock_irq+0x24/0x50
        [372.693811]  ? set_kthread_struct+0x100/0x100
        [372.700052]  ret_from_fork+0x22/0x30
        [372.705517]  </TASK>
        [372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds.
        [372.729827]       Not tainted 5.16.0-rc8 #7
        [372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [372.767106] task:btrfs-transacti state:D stack:    0 pid: 2347 ppid:     2 flags:0x00004000
        [372.787776] Call Trace:
        [372.801652]  <TASK>
        [372.812961]  __schedule+0xb56/0x4850
        [372.830011]  ? io_schedule_timeout+0x190/0x190
        [372.852547]  ? lockdep_hardirqs_on+0x7e/0x100
        [372.871761]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [372.886792]  schedule+0xe0/0x270
        [372.901685]  wait_current_trans+0x22c/0x310 [btrfs]
        [372.919743]  ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs]
        [372.938923]  ? finish_wait+0x270/0x270
        [372.959085]  ? join_transaction+0xc75/0xe30 [btrfs]
        [372.977706]  start_transaction+0x938/0x10a0 [btrfs]
        [372.997168]  transaction_kthread+0x19d/0x3c0 [btrfs]
        [373.013021]  ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs]
        [373.031678]  kthread+0x3b9/0x490
        [373.047420]  ? _raw_spin_unlock_irq+0x24/0x50
        [373.064645]  ? set_kthread_struct+0x100/0x100
        [373.078571]  ret_from_fork+0x22/0x30
        [373.091197]  </TASK>
        [373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds.
        [373.114147]       Not tainted 5.16.0-rc8 #7
        [373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [373.130393] task:btrfs           state:D stack:    0 pid: 3145 ppid:  3141 flags:0x00004000
        [373.140998] Call Trace:
        [373.145501]  <TASK>
        [373.149654]  __schedule+0xb56/0x4850
        [373.155306]  ? io_schedule_timeout+0x190/0x190
        [373.161965]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.168469]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [373.175468]  schedule+0xe0/0x270
        [373.180814]  wait_for_commit+0x104/0x150 [btrfs]
        [373.187643]  ? test_and_set_bit+0x20/0x20 [btrfs]
        [373.194772]  ? kmem_cache_free+0x124/0x550
        [373.201191]  ? btrfs_put_transaction+0x69/0x3d0 [btrfs]
        [373.208738]  ? finish_wait+0x270/0x270
        [373.214704]  ? __btrfs_end_transaction+0x347/0x7b0 [btrfs]
        [373.222342]  btrfs_commit_transaction+0x44d/0x2610 [btrfs]
        [373.230233]  ? join_transaction+0x255/0xe30 [btrfs]
        [373.237334]  ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs]
        [373.245251]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
        [373.253296]  relocate_block_group+0x105/0xc20 [btrfs]
        [373.260533]  ? mutex_lock_io_nested+0x1270/0x1270
        [373.267516]  ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs]
        [373.275155]  ? merge_reloc_roots+0x710/0x710 [btrfs]
        [373.283602]  ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs]
        [373.291934]  ? kmem_cache_free+0x124/0x550
        [373.298180]  btrfs_relocate_block_group+0x35c/0x930 [btrfs]
        [373.306047]  btrfs_relocate_chunk+0x85/0x210 [btrfs]
        [373.313229]  btrfs_balance+0x12f4/0x2d20 [btrfs]
        [373.320227]  ? lock_release+0x3a9/0x6d0
        [373.326206]  ? btrfs_relocate_chunk+0x210/0x210 [btrfs]
        [373.333591]  ? lock_is_held_type+0xe4/0x140
        [373.340031]  ? rcu_read_lock_sched_held+0x3f/0x70
        [373.346910]  btrfs_ioctl_balance+0x548/0x700 [btrfs]
        [373.354207]  btrfs_ioctl+0x7f2/0x71b0 [btrfs]
        [373.360774]  ? lockdep_hardirqs_on_prepare+0x410/0x410
        [373.367957]  ? lockdep_hardirqs_on_prepare+0x410/0x410
        [373.375327]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
        [373.383841]  ? find_held_lock+0x2c/0x110
        [373.389993]  ? lock_release+0x3a9/0x6d0
        [373.395828]  ? mntput_no_expire+0xf7/0xad0
        [373.402083]  ? lock_is_held_type+0xe4/0x140
        [373.408249]  ? vfs_fileattr_set+0x9f0/0x9f0
        [373.414486]  ? selinux_file_ioctl+0x349/0x4e0
        [373.420938]  ? trace_raw_output_lock+0xb4/0xe0
        [373.427442]  ? selinux_inode_getsecctx+0x80/0x80
        [373.434224]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.440660]  ? force_qs_rnp+0x2a0/0x6b0
        [373.446534]  ? lock_is_held_type+0x9b/0x140
        [373.452763]  ? __blkcg_punt_bio_submit+0x1b0/0x1b0
        [373.459732]  ? security_file_ioctl+0x50/0x90
        [373.466089]  __x64_sys_ioctl+0x127/0x190
        [373.472022]  do_syscall_64+0x3b/0x90
        [373.477513]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [373.484823] RIP: 0033:0x7f8f4af7e2bb
        [373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb
        [373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003
        [373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0
        [373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001
        [373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002
        [373.546506]  </TASK>
        [373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds.
        [373.559383]       Not tainted 5.16.0-rc8 #7
        [373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [373.575748] task:btrfs           state:D stack:    0 pid: 3146 ppid:  2168 flags:0x00000000
        [373.586314] Call Trace:
        [373.590846]  <TASK>
        [373.595121]  __schedule+0xb56/0x4850
        [373.600901]  ? __lock_acquire+0x23db/0x5030
        [373.607176]  ? io_schedule_timeout+0x190/0x190
        [373.613954]  schedule+0xe0/0x270
        [373.619157]  schedule_timeout+0x168/0x220
        [373.625170]  ? usleep_range_state+0x150/0x150
        [373.631653]  ? mark_held_locks+0x9e/0xe0
        [373.637767]  ? do_raw_spin_lock+0x11e/0x250
        [373.643993]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
        [373.651267]  ? _raw_spin_unlock_irq+0x24/0x50
        [373.657677]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.664103]  wait_for_completion+0x163/0x250
        [373.670437]  ? bit_wait_timeout+0x160/0x160
        [373.676585]  btrfs_quota_disable+0x176/0x9a0 [btrfs]
        [373.683979]  ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs]
        [373.691340]  ? down_write+0xd0/0x130
        [373.696880]  ? down_write_killable+0x150/0x150
        [373.703352]  btrfs_ioctl+0x3945/0x71b0 [btrfs]
        [373.710061]  ? find_held_lock+0x2c/0x110
        [373.716192]  ? lock_release+0x3a9/0x6d0
        [373.722047]  ? __handle_mm_fault+0x23cd/0x3050
        [373.728486]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
        [373.737032]  ? set_pte+0x6a/0x90
        [373.742271]  ? do_raw_spin_unlock+0x55/0x1f0
        [373.748506]  ? lock_is_held_type+0xe4/0x140
        [373.754792]  ? vfs_fileattr_set+0x9f0/0x9f0
        [373.761083]  ? selinux_file_ioctl+0x349/0x4e0
        [373.767521]  ? selinux_inode_getsecctx+0x80/0x80
        [373.774247]  ? __up_read+0x182/0x6e0
        [373.780026]  ? count_memcg_events.constprop.0+0x46/0x60
        [373.787281]  ? up_write+0x460/0x460
        [373.792932]  ? security_file_ioctl+0x50/0x90
        [373.799232]  __x64_sys_ioctl+0x127/0x190
        [373.805237]  do_syscall_64+0x3b/0x90
        [373.810947]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [373.818102] RIP: 0033:0x7f1383ea02bb
        [373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb
        [373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003
        [373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
        [373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a
        [373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000
        [373.879838]  </TASK>
        [373.884018]
                     Showing all locks held in the system:
        [373.894250] 3 locks held by kworker/4:1/58:
        [373.900356] 1 lock held by khungtaskd/63:
        [373.906333]  #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260
        [373.917307] 3 locks held by kworker/u16:6/103:
        [373.923938]  #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320
        [373.936555]  #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320
        [373.951109]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs]
        [373.964027] 2 locks held by less/1803:
        [373.969982]  #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80
        [373.981295]  #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060
        [373.992969] 1 lock held by btrfs-transacti/2347:
        [373.999893]  #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs]
        [374.015872] 3 locks held by btrfs/3145:
        [374.022298]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs]
        [374.034456]  #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs]
        [374.047646]  #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs]
        [374.063295] 4 locks held by btrfs/3146:
        [374.069647]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs]
        [374.081601]  #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs]
        [374.094283]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs]
        [374.106885]  #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs]
      
        [374.126780] =============================================
      
      To avoid the deadlock, wait for the qgroup rescan worker to complete
      before starting the transaction for the quota disable ioctl. Clear
      BTRFS_FS_QUOTA_ENABLE flag before the wait and the transaction to
      request the worker to complete. On transaction start failure, set the
      BTRFS_FS_QUOTA_ENABLE flag again. These BTRFS_FS_QUOTA_ENABLE flag
      changes can be done safely since the function btrfs_quota_disable is not
      called concurrently because of fs_info->subvol_sem.
      
      Also check the BTRFS_FS_QUOTA_ENABLE flag in qgroup_rescan_init to avoid
      another qgroup rescan worker to start after the previous qgroup worker
      completed.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e804861b
    • Qu Wenruo's avatar
      btrfs: don't start transaction for scrub if the fs is mounted read-only · 2d192fc4
      Qu Wenruo authored
      
      
      [BUG]
      The following super simple script would crash btrfs at unmount time, if
      CONFIG_BTRFS_ASSERT() is set.
      
       mkfs.btrfs -f $dev
       mount $dev $mnt
       xfs_io -f -c "pwrite 0 4k" $mnt/file
       umount $mnt
       mount -r ro $dev $mnt
       btrfs scrub start -Br $mnt
       umount $mnt
      
      This will trigger the following ASSERT() introduced by commit
      0a31daa4 ("btrfs: add assertion for empty list of transactions at
      late stage of umount").
      
      That patch is definitely not the cause, it just makes enough noise for
      developers.
      
      [CAUSE]
      We will start transaction for the following call chain during scrub:
      
        scrub_enumerate_chunks()
        |- btrfs_inc_block_group_ro()
           |- btrfs_join_transaction()
      
      However for RO mount, there is no running transaction at all, thus
      btrfs_join_transaction() will start a new transaction.
      
      Furthermore, since it's read-only mount, btrfs_sync_fs() will not call
      btrfs_commit_super() to commit the new but empty transaction.
      
      And leads to the ASSERT().
      
      The bug has been there for a long time. Only the new ASSERT() makes it
      noisy enough to be noticed.
      
      [FIX]
      For read-only scrub on read-only mount, there is no need to start a
      transaction nor to allocate new chunks in btrfs_inc_block_group_ro().
      
      Just do extra read-only mount check in btrfs_inc_block_group_ro(), and
      if it's read-only, skip all chunk allocation and go inc_block_group_ro()
      directly.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d192fc4
  7. Jan 25, 2022
    • Filipe Manana's avatar
      btrfs: update writeback index when starting defrag · 27cdfde1
      Filipe Manana authored
      
      
      When starting a defrag, we should update the writeback index of the
      inode's mapping in case it currently has a value beyond the start of the
      range we are defragging. This can help performance and often result in
      getting less extents after writeback - for e.g., if the current value
      of the writeback index sits somewhere in the middle of a range that
      gets dirty by the defrag, then after writeback we can get two smaller
      extents instead of a single, larger extent.
      
      We used to have this before the refactoring in 5.16, but it was removed
      without any reason to do so. Originally it was added in kernel 3.1, by
      commit 2a0f7f57 ("Btrfs: fix recursive auto-defrag"), in order to
      fix a loop with autodefrag resulting in dirtying and writing pages over
      and over, but some testing on current code did not show that happening,
      at least with the test described in that commit.
      
      So add back the behaviour, as at the very least it is a nice to have
      optimization.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27cdfde1
    • Filipe Manana's avatar
      btrfs: add back missing dirty page rate limiting to defrag · 3c9d31c7
      Filipe Manana authored
      
      
      A defrag operation can dirty a lot of pages, specially if operating on
      the entire file or a large file range. Any task dirtying pages should
      periodically call balance_dirty_pages_ratelimited(), as stated in that
      function's comments, otherwise they can leave too many dirty pages in
      the system. This is what we did before the refactoring in 5.16, and
      it should have remained, just like in the buffered write path and
      relocation. So restore that behaviour.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c9d31c7
    • Filipe Manana's avatar
      btrfs: fix deadlock when reserving space during defrag · 0cb5950f
      Filipe Manana authored
      
      
      When defragging we can end up collecting a range for defrag that has
      already pages under delalloc (dirty), as long as the respective extent
      map for their range is not mapped to a hole, a prealloc extent or
      the extent map is from an old generation.
      
      Most of the time that is harmless from a functional perspective at
      least, however it can result in a deadlock:
      
      1) At defrag_collect_targets() we find an extent map that meets all
         requirements but there's delalloc for the range it covers, and we add
         its range to list of ranges to defrag;
      
      2) The defrag_collect_targets() function is called at defrag_one_range(),
         after it locked a range that overlaps the range of the extent map;
      
      3) At defrag_one_range(), while the range is still locked, we call
         defrag_one_locked_target() for the range associated to the extent
         map we collected at step 1);
      
      4) Then finally at defrag_one_locked_target() we do a call to
         btrfs_delalloc_reserve_space(), which will reserve data and metadata
         space. If the space reservations can not be satisfied right away, the
         flusher might be kicked in and start flushing delalloc and wait for
         the respective ordered extents to complete. If this happens we will
         deadlock, because both flushing delalloc and finishing an ordered
         extent, requires locking the range in the inode's io tree, which was
         already locked at defrag_collect_targets().
      
      So fix this by skipping extent maps for which there's already delalloc.
      
      Fixes: eb793cf8 ("btrfs: defrag: introduce helper to collect target file extents")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0cb5950f
  8. Jan 20, 2022
    • Qu Wenruo's avatar
      btrfs: defrag: properly update range->start for autodefrag · c080b414
      Qu Wenruo authored
      
      
      [BUG]
      After commit 7b508037 ("btrfs: defrag: use defrag_one_cluster() to
      implement btrfs_defrag_file()") autodefrag no longer properly re-defrag
      the file from previously finished location.
      
      [CAUSE]
      The recent refactoring of defrag only focuses on defrag ioctl subpage
      support, doesn't take autodefrag into consideration.
      
      There are two problems involved which prevents autodefrag to restart its
      scan:
      
      - No range.start update
        Previously when one defrag target is found, range->start will be
        updated to indicate where next search should start from.
      
        But now btrfs_defrag_file() doesn't update it anymore, making all
        autodefrag to rescan from file offset 0.
      
        This would also make autodefrag to mark the same range dirty again and
        again, causing extra IO.
      
      - No proper quick exit for defrag_one_cluster()
        Currently if we reached or exceed @max_sectors limit, we just exit
        defrag_one_cluster(), and let next defrag_one_cluster() call to do a
        quick exit.
        This makes @cur increase, thus no way to properly know which range is
        defragged and which range is skipped.
      
      [FIX]
      The fix involves two modifications:
      
      - Update range->start to next cluster start
        This is a little different from the old behavior.
        Previously range->start is updated to the next defrag target.
      
        But in the end, the behavior should still be pretty much the same,
        as now we skip to next defrag target inside btrfs_defrag_file().
      
        Thus if auto-defrag determines to re-scan, then we still do the skip,
        just at a different timing.
      
      - Make defrag_one_cluster() to return >0 to indicate a quick exit
        So that btrfs_defrag_file() can also do a quick exit, without
        increasing @cur to the range end, and re-use @cur to update
        @range->start.
      
      - Add comment for btrfs_defrag_file() to mention the range->start update
        Currently only autodefrag utilize this behavior, as defrag ioctl won't
        set @max_to_defrag parameter, thus unless interrupted it will always
        try to defrag the whole range.
      
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c080b414
    • Qu Wenruo's avatar
      btrfs: defrag: fix wrong number of defragged sectors · 484167da
      Qu Wenruo authored
      
      
      [BUG]
      There are users using autodefrag mount option reporting obvious increase
      in IO:
      
      > If I compare the write average (in total, I don't have it per process)
      > when taking idle periods on the same machine:
      >     Linux 5.16:
      >         without autodefrag: ~ 10KiB/s
      >         with autodefrag: between 1 and 2MiB/s.
      >
      >     Linux 5.15:
      >         with autodefrag:~ 10KiB/s (around the same as without
      > autodefrag on 5.16)
      
      [CAUSE]
      When autodefrag mount option is enabled, btrfs_defrag_file() will be
      called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many
      sectors we can defrag in one try.
      
      And then use the number of sectors defragged to determine if we need to
      re-defrag.
      
      But commit b18c3ab2 ("btrfs: defrag: introduce helper to defrag one
      cluster") uses wrong unit to increase @sectors_defragged, which should
      be in unit of sector, not byte.
      
      This means, if we have defragged any sector, then @sectors_defragged
      will be >= sectorsize (normally 4096), which is larger than
      BTRFS_DEFRAG_BATCH.
      
      This makes the @max_sectors check in defrag_one_cluster() to underflow,
      rendering the whole @max_sectors check useless.
      
      Thus causing way more IO for autodefrag mount options, as now there is
      no limit on how many sectors can really be defragged.
      
      [FIX]
      Fix the problems by:
      
      - Use sector as unit when increasing @sectors_defragged
      
      - Include @sectors_defragged > @max_sectors case to break the loop
      
      - Add extra comment on the return value of btrfs_defrag_file()
      
      Reported-by: default avatarAnthony Ruhier <aruhier@mailbox.org>
      Fixes: b18c3ab2 ("btrfs: defrag: introduce helper to defrag one cluster")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      484167da
    • Filipe Manana's avatar
      btrfs: allow defrag to be interruptible · b767c2fc
      Filipe Manana authored
      
      
      During defrag, at btrfs_defrag_file(), we have this loop that iterates
      over a file range in steps no larger than 256K subranges. If the range
      is too long, there's no way to interrupt it. So make the loop check in
      each iteration if there's signal pending, and if there is, break and
      return -AGAIN to userspace.
      
      Before kernel 5.16, we used to allow defrag to be cancelled through a
      signal, but that was lost with commit 7b508037 ("btrfs: defrag:
      use defrag_one_cluster() to implement btrfs_defrag_file()").
      
      This change adds back the possibility to cancel a defrag with a signal
      and keeps the same semantics, returning -EAGAIN to user space (and not
      the usually more expected -EINTR).
      
      This is also motivated by a recent bug on 5.16 where defragging a 1 byte
      file resulted in iterating from file range 0 to (u64)-1, as hitting the
      bug triggered a too long loop, basically requiring one to reboot the
      machine, as it was not possible to cancel defrag.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b767c2fc
    • Filipe Manana's avatar
      btrfs: fix too long loop when defragging a 1 byte file · 6b34cd8e
      Filipe Manana authored
      
      
      When attempting to defrag a file with a single byte, we can end up in a
      too long loop, which is nearly infinite because at btrfs_defrag_file()
      we end up with the variable last_byte assigned with a value of
      18446744073709551615 (which is (u64)-1). The problem comes from the fact
      we end up doing:
      
          last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
      
      So if last_byte was assigned 0, which is i_size - 1, we underflow and
      end up with the value 18446744073709551615.
      
      This is trivial to reproduce and the following script triggers it:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        echo -n "X" > $MNT/foobar
      
        btrfs filesystem defragment $MNT/foobar
      
        umount $MNT
      
      So fix this by not decrementing last_byte by 1 before doing the sector
      size round up. Also, to make it easier to follow, make the round up right
      after computing last_byte.
      
      Reported-by: default avatarAnthony Ruhier <aruhier@mailbox.org>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b34cd8e
  9. Jan 07, 2022