Skip to content
  1. Oct 29, 2021
    • David Sterba's avatar
      btrfs: send: prepare for v2 protocol · e77fbf99
      David Sterba authored
      
      
      This is preparatory work for send protocol update to version 2 and
      higher.
      
      We have many pending protocol update requests but still don't have the
      basic protocol rev in place, the first thing that must happen is to do
      the actual versioning support.
      
      The protocol version is u32 and is a new member in the send ioctl
      struct. Validity of the version field is backed by a new flag bit. Old
      kernels would fail when a higher version is requested. Version protocol
      0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION,
        that's also exported in sysfs.
      
      The version is still unchanged and will be increased once we have new
      incompatible commands or stream updates.
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e77fbf99
  2. Oct 27, 2021
    • Anand Jain's avatar
      btrfs: fix comment about sector sizes supported in 64K systems · 50780d9b
      Anand Jain authored
      Commit 95ea0486
      
       ("btrfs: allow read-write for 4K sectorsize on 64K
      page size systems") added write support for 4K sectorsize on a 64K
      systems. Fix the now stale comments.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      50780d9b
    • Josef Bacik's avatar
      btrfs: update device path inode time instead of bd_inode · 54fde91f
      Josef Bacik authored
      Christoph pointed out that I'm updating bdev->bd_inode for the device
      time when we remove block devices from a btrfs file system, however this
      isn't actually exposed to anything.  The inode we want to update is the
      one that's associated with the path to the device, usually on devtmpfs,
      so that blkid notices the difference.
      
      We still don't want to do the blkdev_open, so use kern_path() to get the
      path to the given device and do the update time on that inode.
      
      Fixes: 8f96a5bf
      
       ("btrfs: update the bdev time directly when closing")
      Reported-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54fde91f
    • Josef Bacik's avatar
      fs: export an inode_update_time helper · e60feb44
      Josef Bacik authored
      
      
      If you already have an inode and need to update the time on the inode
      there is no way to do this properly.  Export this helper to allow file
      systems to update time on the inode so the appropriate handler is
      called, either ->update_time or generic_update_time.
      
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e60feb44
    • Omar Sandoval's avatar
      btrfs: fix deadlock when defragging transparent huge pages · 24bcb454
      Omar Sandoval authored
      
      
      Attempting to defragment a Btrfs file containing a transparent huge page
      immediately deadlocks with the following stack trace:
      
        #0  context_switch (kernel/sched/core.c:4940:2)
        #1  __schedule (kernel/sched/core.c:6287:8)
        #2  schedule (kernel/sched/core.c:6366:3)
        #3  io_schedule (kernel/sched/core.c:8389:2)
        #4  wait_on_page_bit_common (mm/filemap.c:1356:4)
        #5  __lock_page (mm/filemap.c:1648:2)
        #6  lock_page (./include/linux/pagemap.h:625:3)
        #7  pagecache_get_page (mm/filemap.c:1910:4)
        #8  find_or_create_page (./include/linux/pagemap.h:420:9)
        #9  defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
        #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
        #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
        #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
        #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
        #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
        #15 vfs_ioctl (fs/ioctl.c:51:10)
        #16 __do_sys_ioctl (fs/ioctl.c:874:11)
        #17 __se_sys_ioctl (fs/ioctl.c:860:1)
        #18 __x64_sys_ioctl (fs/ioctl.c:860:1)
        #19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
        #20 do_syscall_64 (arch/x86/entry/common.c:80:7)
        #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)
      
      A huge page is represented by a compound page, which consists of a
      struct page for each PAGE_SIZE page within the huge page. The first
      struct page is the "head page", and the remaining are "tail pages".
      
      Defragmentation attempts to lock each page in the range. However,
      lock_page() on a tail page actually locks the corresponding head page.
      So, if defragmentation tries to lock more than one struct page in a
      compound page, it tries to lock the same head page twice and deadlocks
      with itself.
      
      Ideally, we should be able to defragment transparent huge pages.
      However, THP for filesystems is currently read-only, so a lot of code is
      not ready to use huge pages for I/O. For now, let's just return
      ETXTBUSY.
      
      This can be reproduced with the following on a kernel with
      CONFIG_READ_ONLY_THP_FOR_FS=y:
      
        $ cat create_thp_file.c
        #include <fcntl.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/mman.h>
      
        static const char zeroes[1024 * 1024];
        static const size_t FILE_SIZE = 2 * 1024 * 1024;
      
        int main(int argc, char **argv)
        {
                if (argc != 2) {
                        fprintf(stderr, "usage: %s PATH\n", argv[0]);
                        return EXIT_FAILURE;
                }
                int fd = creat(argv[1], 0777);
                if (fd == -1) {
                        perror("creat");
                        return EXIT_FAILURE;
                }
                size_t written = 0;
                while (written < FILE_SIZE) {
                        ssize_t ret = write(fd, zeroes,
                                            sizeof(zeroes) < FILE_SIZE - written ?
                                            sizeof(zeroes) : FILE_SIZE - written);
                        if (ret < 0) {
                                perror("write");
                                return EXIT_FAILURE;
                        }
                        written += ret;
                }
                close(fd);
                fd = open(argv[1], O_RDONLY);
                if (fd == -1) {
                        perror("open");
                        return EXIT_FAILURE;
                }
      
                /*
                 * Reserve some address space so that we can align the file mapping to
                 * the huge page size.
                 */
                void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
                                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (placeholder_map == MAP_FAILED) {
                        perror("mmap (placeholder)");
                        return EXIT_FAILURE;
                }
      
                void *aligned_address =
                        (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));
      
                void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
                                 MAP_SHARED | MAP_FIXED, fd, 0);
                if (map == MAP_FAILED) {
                        perror("mmap");
                        return EXIT_FAILURE;
                }
                if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
                        perror("madvise");
                        return EXIT_FAILURE;
                }
      
                char *line = NULL;
                size_t line_capacity = 0;
                FILE *smaps_file = fopen("/proc/self/smaps", "r");
                if (!smaps_file) {
                        perror("fopen");
                        return EXIT_FAILURE;
                }
                for (;;) {
                        for (size_t off = 0; off < FILE_SIZE; off += 4096)
                                ((volatile char *)map)[off];
      
                        ssize_t ret;
                        bool this_mapping = false;
                        while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
                                unsigned long start, end, huge;
                                if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
                                        this_mapping = (start <= (uintptr_t)map &&
                                                        (uintptr_t)map < end);
                                } else if (this_mapping &&
                                           sscanf(line, "FilePmdMapped: %ld", &huge) == 1 &&
                                           huge > 0) {
                                        return EXIT_SUCCESS;
                                }
                        }
      
                        sleep(6);
                        rewind(smaps_file);
                        fflush(smaps_file);
                }
        }
        $ ./create_thp_file huge
        $ btrfs fi defrag -czstd ./huge
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24bcb454
    • Anand Jain's avatar
      btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit · 020e5277
      Anand Jain authored
      Commit 2efc459d
      
       ("sysfs: Add sysfs_emit and sysfs_emit_at to format
      sysfs out") merged in 5.10 introduced two new functions sysfs_emit() and
      sysfs_emit_at() which are aware of the PAGE_SIZE limit of the output
      buffer.
      
      Use the above two new functions instead of scnprintf() and snprintf()
      in various sysfs show().
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      020e5277
    • Qu Wenruo's avatar
      btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE · 38732474
      Qu Wenruo authored
      
      
      It's a common practice to avoid use sizeof(struct btrfs_super_block)
      (3531), but to use BTRFS_SUPER_INFO_SIZE (4096).
      
      The problem is that, sizeof(struct btrfs_super_block) doesn't match
      BTRFS_SUPER_INFO_SIZE from the very beginning.
      
      Furthermore, for all call sites except selftests, we always allocate
      BTRFS_SUPER_INFO_SIZE space for super block, there isn't any real reason
      to use the smaller value, and it doesn't really save any space.
      
      So let's get rid of such confusing behavior, and unify those two values.
      
      This modification also adds a new static_assert() to verify the size,
      and moves the BTRFS_SUPER_INFO_* macros to the definition of
      btrfs_super_block for the static_assert().
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      38732474
    • Filipe Manana's avatar
      btrfs: update comments for chunk allocation -ENOSPC cases · ecd84d54
      Filipe Manana authored
      
      
      Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
      describe which cases can lead to a failure to allocate metadata and system
      space despite having previously reserved space. This adds one more reason
      that I previously forgot to mention.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ecd84d54
    • Filipe Manana's avatar
      btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
      Filipe Manana authored
      
      
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      So fix this by making sure that whenever we try to modify the chunk btree
      and we are neither in a chunk allocation context nor in a chunk remove
      context, we reserve system space before modifying the chunk btree.
      
      Reported-by: default avatarHao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712
      
       ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2bb2e00e
    • Johannes Thumshirn's avatar
      btrfs: zoned: use greedy gc for auto reclaim · 2ca0ec77
      Johannes Thumshirn authored
      
      
      Currently auto reclaim of unusable zones reclaims the block-groups in
      the order they have been added to the reclaim list.
      
      Change this to a greedy algorithm by sorting the list so we have the
      block-groups with the least amount of valid bytes reclaimed first.
      
      Note: we can't splice the block groups from reclaim_bgs to let the sort
      happen outside of the lock. The block groups can be still in use by
      other parts eg. via bg_list and we must hold unused_bgs_lock while
      processing them.
      
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ write note and comment why we can't splice the list ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ca0ec77
    • Christoph Hellwig's avatar
      btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state · 813ebc16
      Christoph Hellwig authored
      
      
      Just use the %pg format specifier in all the debug printks previously
      using it.  Note that both bdevname and the %pg specifier never print
      a pathname, so the kbasename call wasn't needed to start with.
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ adjust messages and indentation ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      813ebc16
    • Josef Bacik's avatar
      btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls · 1a15eb72
      Josef Bacik authored
      
      
      For device removal and replace we call btrfs_find_device_by_devspec,
      which if we give it a device path and nothing else will call
      btrfs_get_dev_args_from_path, which opens the block device and reads the
      super block and then looks up our device based on that.
      
      However at this point we're holding the sb write "lock", so reading the
      block device pulls in the dependency of ->open_mutex, which produces the
      following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #405 Not tainted
      ------------------------------------------------------
      losetup/11576 is trying to acquire lock:
      ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_get_by_path+0x98/0xa0
             btrfs_get_bdev_and_sb+0x1b/0xb0
             btrfs_find_device_by_devspec+0x12b/0x1c0
             btrfs_rm_device+0x127/0x610
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11576:
       #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f31b02404cb
      
      Instead what we want to do is populate our device lookup args before we
      grab any locks, and then pass these args into btrfs_rm_device().  From
      there we can find the device and do the appropriate removal.
      
      Suggested-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a15eb72
    • Josef Bacik's avatar
      btrfs: add a btrfs_get_dev_args_from_path helper · faa775c4
      Josef Bacik authored
      
      
      We are going to want to populate our device lookup args outside of any
      locks and then do the actual device lookup later, so add a helper to do
      this work and make btrfs_find_device_by_devspec() use this helper for
      now.
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      faa775c4
    • Josef Bacik's avatar
      btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
      Josef Bacik authored
      
      
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      562d7b15
    • Josef Bacik's avatar
      btrfs: do not call close_fs_devices in btrfs_rm_device · 8b41393f
      Josef Bacik authored
      
      
      There's a subtle case where if we're removing the seed device from a
      file system we need to free its private copy of the fs_devices.  However
      we do not need to call close_fs_devices(), because at this point there
      are no devices left to close as we've closed the last one.  The only
      thing that close_fs_devices() does is decrement ->opened, which should
      be 1.  We want to avoid calling close_fs_devices() here because it has a
      lockdep_assert_held(&uuid_mutex), and we are going to stop holding the
      uuid_mutex in this path.
      
      So simply decrement the  ->opened counter like we should, and then clean
      up like normal.  Also add a comment explaining what we're doing here as
      I initially removed this code erroneously.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8b41393f
    • Anand Jain's avatar
      btrfs: add comments for device counts in struct btrfs_fs_devices · add9745a
      Anand Jain authored
      
      
      A bug was was checking a wrong device count before we delete the struct
      btrfs_fs_devices in btrfs_rm_device(). To avoid future confusion and
      easy reference add a comment about the various device counts that we have
      in the struct btrfs_fs_devices.
      
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      add9745a
    • Anand Jain's avatar
      btrfs: use num_device to check for the last surviving seed device · 8e906945
      Anand Jain authored
      
      
      For both sprout and seed fsids,
       btrfs_fs_devices::num_devices provides device count including missing
       btrfs_fs_devices::open_devices provides device count excluding missing
      
      We create a dummy struct btrfs_device for the missing device, so
      num_devices != open_devices when there is a missing device.
      
      In btrfs_rm_devices() we wrongly check for %cur_devices->open_devices
      before freeing the seed fs_devices. Instead we should check for
      %cur_devices->num_devices.
      
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e906945
    • Filipe Manana's avatar
      btrfs: fix lost error handling when replaying directory deletes · 10adb115
      Filipe Manana authored
      
      
      At replay_dir_deletes(), if find_dir_range() returns an error we break out
      of the main while loop and then assign a value of 0 (success) to the 'ret'
      variable, resulting in completely ignoring that an error happened. Fix
      that by jumping to the 'out' label when find_dir_range() returns an error
      (negative value).
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      10adb115
    • Qu Wenruo's avatar
      btrfs: remove btrfs_bio::logical member · f4f39fc5
      Qu Wenruo authored
      
      
      The member btrfs_bio::logical is only initialized by two call sites:
      
      - btrfs_repair_one_sector()
        No corresponding site to utilize it.
      
      - btrfs_submit_direct()
        The corresponding site to utilize it is btrfs_check_read_dio_bio().
      
      However for btrfs_check_read_dio_bio(), we can grab the file_offset from
      btrfs_dio_private::file_offset directly.
      
      Thus it turns out we don't really need that btrfs_bio::logical member at
      all.
      
      For btrfs_bio, the logical bytenr can be fetched from its
      bio->bi_iter.bi_sector directly.
      
      So let's just remove the member to save 8 bytes for structure btrfs_bio.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f4f39fc5
    • Qu Wenruo's avatar
      btrfs: rename btrfs_dio_private::logical_offset to file_offset · 47926ab5
      Qu Wenruo authored
      
      
      The naming of "logical_offset" can be confused with logical bytenr of
      the dio range.
      
      In fact it's file offset, and the naming "file_offset" is already widely
      used in all other sites.
      
      Just do the rename to avoid confusion.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47926ab5
    • Christoph Hellwig's avatar
      btrfs: use bvec_kmap_local in btrfs_csum_one_bio · 3dcfbcce
      Christoph Hellwig authored
      
      
      Using local kmaps slightly reduces the chances to stray writes, and
      the bvec interface cleans up the code a little bit.
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3dcfbcce
    • Anand Jain's avatar
      btrfs: reduce btrfs_update_block_group alloc argument to bool · 11b66fa6
      Anand Jain authored
      
      
      btrfs_update_block_group() accounts for the number of bytes allocated or
      freed. Argument @alloc specifies whether the call is for alloc or free.
      Convert the argument @alloc type from int to bool.
      
      Reviewed-by: default avatarSu Yue <l@damenly.su>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11b66fa6
    • Nikolay Borisov's avatar
      btrfs: make btrfs_ref::real_root optional · eed2037f
      Nikolay Borisov authored
      
      
      Now that real_root is only used in ref-verify core gate it behind
      CONFIG_BTRFS_FS_REF_VERIFY ifdef. This shrinks the size of pending
      delayed refs by 8 bytes per ref, of which we can have many at any one
      time depending on intensity of the workload. Also change the comment
      about the member as it no longer deals with qgroups.
      
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eed2037f
    • Nikolay Borisov's avatar
      btrfs: pull up qgroup checks from delayed-ref core to init time · 681145d4
      Nikolay Borisov authored
      
      
      Instead of checking whether qgroup processing for a dealyed ref has to
      happen in the core of delayed ref, simply pull the check at init time of
      respective delayed ref structures. This eliminates the final use of
      real_root in delayed-ref core paving the way to making this member
      optional.
      
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      681145d4
    • Nikolay Borisov's avatar
      btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref · f42c5da6
      Nikolay Borisov authored
      
      
      In order to make 'real_root' used only in ref-verify it's required to
      have the necessary context to perform the same checks that this member
      is used for. So add 'mod_root' which will contain the root on behalf of
      which a delayed ref was created and a 'skip_group' parameter which
      will contain callsite-specific override of skip_qgroup.
      
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f42c5da6
    • Nikolay Borisov's avatar
      btrfs: rely on owning_root field in btrfs_add_delayed_tree_ref to detect CHUNK_ROOT · d55b9e68
      Nikolay Borisov authored
      
      
      The real_root field is going to be used only by ref-verify tool so limit
      its use outside of it. Blocks belonging to the chunk root will always
      have it as an owner so the check is equivalent.
      
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d55b9e68
    • Nikolay Borisov's avatar
      btrfs: rename root fields in delayed refs structs · 113479d5
      Nikolay Borisov authored
      
      
      Both data and metadata delayed ref structures have fields named
      root/ref_root respectively. Those are somewhat cryptic and don't really
      convey the real meaning. In fact those roots are really the original
      owners of the respective block (i.e in case of a snapshot a data delayed
      ref will contain the original root that owns the given block). Rename
      those fields accordingly and adjust comments.
      
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      113479d5
    • Josef Bacik's avatar
      btrfs: do not infinite loop in data reclaim if we aborted · 0e24f6d8
      Josef Bacik authored
      
      
      Error injection stressing uncovered a busy loop in our data reclaim
      loop.  There are two cases here, one where we loop creating block groups
      until space_info->full is set, or in the main loop we will skip erroring
      out any tickets if space_info->full == 0.  Unfortunately if we aborted
      the transaction then we will never allocate chunks or reclaim any space
      and thus never get ->full, and you'll see stack traces like this:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139]
        CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G        W         5.13.0-rc1+ #328
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_data_space
        RIP: 0010:btrfs_join_transaction+0x12/0x20
        RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246
        RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000
        RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0
        R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff
        R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008
        FS:  0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0
        Call Trace:
         flush_space+0x4a8/0x660
         btrfs_async_reclaim_data_space+0x55/0x130
         process_one_work+0x1e9/0x380
         worker_thread+0x53/0x3e0
         ? process_one_work+0x380/0x380
         kthread+0x118/0x140
         ? __kthread_bind_mask+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      Fix this by checking to see if we have a btrfs fs error in either of the
      reclaim loops, and if so fail the tickets and bail.  In addition to
      this, fix maybe_fail_all_tickets() to not try to grant tickets if we've
      aborted, simply fail everything.
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e24f6d8
    • Josef Bacik's avatar
      btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Josef Bacik authored
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733
      
       ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      84961539
    • Josef Bacik's avatar
      btrfs: change error handling for btrfs_delete_*_in_log · 9a35fc95
      Josef Bacik authored
      
      
      Currently we will abort the transaction if we get a random error (like
      -EIO) while trying to remove the directory entries from the root log
      during rename.
      
      However since these are simply log tree related errors, we can mark the
      trans as needing a full commit.  Then if the error was truly
      catastrophic we'll hit it during the normal commit and abort as
      appropriate.
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a35fc95
    • Josef Bacik's avatar
      btrfs: change handle_fs_error in recover_log_trees to aborts · ba51e2a1
      Josef Bacik authored
      
      
      During inspection of the return path for replay I noticed that we don't
      actually abort the transaction if we get a failure during replay.  This
      isn't a problem necessarily, as we properly return the error and will
      fail to mount.  However we still leave this dangling transaction that
      could conceivably be committed without thinking there was an error.
      
      We were using btrfs_handle_fs_error() here, but that pre-dates the
      transaction abort code.  Simply replace the btrfs_handle_fs_error()
      calls with transaction aborts, so we still know where exactly things
      went wrong, and add a few in some other un-handled error cases.
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba51e2a1
    • Kai Song's avatar
      btrfs: zoned: use kmemdup() to replace kmalloc + memcpy · 64259baa
      Kai Song authored
      
      
      Fix memdup.cocci warning:
      fs/btrfs/zoned.c:1198:23-30: WARNING opportunity for kmemdup
      
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarKai Song <songkai01@inspur.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64259baa
    • Qu Wenruo's avatar
      btrfs: subpage: only allow compression if the range is fully page aligned · 0cf9b244
      Qu Wenruo authored
      
      
      For compressed write, we use a mechanism called async COW, which unlike
      regular run_delalloc_cow() or cow_file_range() will also unlock the
      first page.
      
      This mechanism allows us to continue handling next ranges, without
      waiting for the time consuming compression.
      
      But this has a problem for subpage case, as we could have the following
      delalloc range for a page:
      
      0		32K		64K
      |	|///////|	|///////|
      		\- A		\- B
      
      In the above case, if we pass both ranges to cow_file_range_async(),
      both range A and range B will try to unlock the full page [0, 64K).
      
      And which one finishes later than the other one will try to do other
      page operations like end_page_writeback() on a unlocked page, triggering
      VM layer BUG_ON().
      
      To make subpage compression work at least partially, here we add another
      restriction for it, only allow compression if the delalloc range is
      fully page aligned.
      
      By that, async extent is always ensured to unlock the first page
      exclusively, just like it used to be for regular sectorsize.
      
      In theory, we only need to make sure the delalloc range fully covers its
      first page, but the tail page will be locked anyway, blocking later
      writeback until the compression finishes.
      
      Thus here we choose to make sure the range is fully page aligned before
      doing the compression.
      
      In the future, we could optimize the situation by properly increasing
      subpage::writers number for the locked page, but that also means we need
      to change how we run delalloc range of page.
      (Instead of running each delalloc range we hit, we need to find and lock
      all delalloc ranges covering the page, then run each of them).
      
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0cf9b244
    • Qu Wenruo's avatar
      btrfs: subpage: avoid potential deadlock with compression and delalloc · 2749f7ef
      Qu Wenruo authored
      
      
      [BUG]
      With experimental subpage compression enabled, a simple fsstress can
      lead to self deadlock on page 720896:
      
              mkfs.btrfs -f -s 4k $dev > /dev/null
              mount $dev -o compress $mnt
              $fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
      
      [CAUSE]
      If we have a file layout looks like below:
      
      	0	32K	64K	96K	128K
      	|//|		|///////////////|
      	   4K
      
      Then we run delalloc range for the inode, it will:
      
      - Call find_lock_delalloc_range() with @delalloc_start = 0
        Then we got a delalloc range [0, 4K).
      
        This range will be COWed.
      
      - Call find_lock_delalloc_range() again with @delalloc_start = 4K
        Since find_lock_delalloc_range() never cares whether the range
        is still inside page range [0, 64K), it will return range [64K, 128K).
      
        This range meets the condition for subpage compression, will go
        through async COW path.
      
        And async COW path will return @page_started.
      
        But that @page_started is now for range [64K, 128K), not for range
        [0, 64K).
      
      - writepage_dellloc() returned 1 for page [0, 64K)
        Thus page [0, 64K) will not be unlocked, nor its page dirty status
        will be cleared.
      
      Next time when we try to lock page [0, 64K) we will deadlock, as there
      is no one to release page [0, 64K).
      
      This problem will never happen for regular page size as one page only
      contains one sector.  After the first find_lock_delalloc_range() call,
      the @delalloc_end will go beyond @page_end no matter if we found a
      delalloc range or not
      
      Thus this bug only happens for subpage, as now we need multiple runs to
      exhaust the delalloc range of a page.
      
      [FIX]
      Fix the problem by ensuring the delalloc range we ran at least started
      inside @locked_page.
      
      So that we will never get incorrect @page_started.
      
      And to prevent such problem from happening again:
      
      - Make find_lock_delalloc_range() return false if the found range is
        beyond @end value passed in.
      
        Since @end will be utilized now, add an ASSERT() to ensure we pass
        correct @end into find_lock_delalloc_range().
      
        This also means, for selftests we needs to populate @end before calling
        find_lock_delalloc_range().
      
      - New ASSERT() in find_lock_delalloc_range()
        Now we will make sure the @start/@end passed in at least covers part
        of the page.
      
      - New ASSERT() in run_delalloc_range()
        To make sure the range at least starts inside @locked page.
      
      - Use @delalloc_start as proper cursor, while @delalloc_end is always
        reset to @page_end.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2749f7ef
    • Qu Wenruo's avatar
      btrfs: handle page locking in btrfs_page_end_writer_lock with no writers · 164674a7
      Qu Wenruo authored
      
      
      There are several call sites of extent_clear_unlock_delalloc() which get
      @locked_page = NULL.
      So that extent_clear_unlock_delalloc() will try to call
      process_one_page() to unlock every page even the first page is not
      locked by btrfs_page_start_writer_lock().
      
      This will trigger an ASSERT() in btrfs_subpage_end_and_test_writer() as
      previously we require every page passed to
      btrfs_subpage_end_and_test_writer() to be locked by
      btrfs_page_start_writer_lock().
      
      But compression path doesn't go that way.
      
      Thankfully it's not hard to distinguish page locked by lock_page() and
      btrfs_page_start_writer_lock().
      
      So do the check in btrfs_subpage_end_and_test_writer() so now it can
      handle both cases well.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      164674a7
    • Qu Wenruo's avatar
      btrfs: rework page locking in __extent_writepage() · e55a0de1
      Qu Wenruo authored
      
      
      Pages passed to __extent_writepage() are always locked, but they may be
      locked by different functions.
      
      There are two types of locked page for __extent_writepage():
      
      - Page locked by plain lock_page()
        It should not have any subpage::writers count.
        Can be unlocked by unlock_page().
        This is the most common locked page for __extent_writepage() called
        inside extent_write_cache_pages() or extent_write_full_page().
        Rarer cases include the @locked_page from extent_write_locked_range().
      
      - Page locked by lock_delalloc_pages()
        There is only one caller, all pages except @locked_page for
        extent_write_locked_range().
        In this case, we have to call subpage helper to handle the case.
      
      So here we introduce a helper, btrfs_page_unlock_writer(), to allow
      __extent_writepage() to unlock different locked pages.
      
      And since for all other callers of __extent_writepage() their pages are
      ensured to be locked by lock_page(), also add an extra check for
      epd::extent_locked to unlock such pages directly.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e55a0de1
    • Qu Wenruo's avatar
      btrfs: subpage: make lzo_compress_pages() compatible · d4088803
      Qu Wenruo authored
      
      
      There are several problems in lzo_compress_pages() preventing it from
      being subpage compatible:
      
      - No page offset is calculated when reading from inode pages
        For subpage case, we could have @start which is not aligned to
        PAGE_SIZE.
      
        Thus the destination where we read data from must take offset in page
        into consideration.
      
      - The padding for segment header is bound to PAGE_SIZE
        This means, for subpage case we can skip several corners where on x86
        machines we need to add padding zeros.
      
      The rework will:
      
      - Update the comment to replace "page" with "sector"
      
      - Introduce a new helper, copy_compressed_data_to_page(), to do the copy
        So that we don't need to bother page switching for both input and
        output.
      
        Now in lzo_compress_pages() we only care about page switching for
        input, while in copy_compressed_data_to_page() we only care about the
        page switching for output.
      
      - Only one main cursor
        For lzo_compress_pages() we use @cur_in as main cursor.
        It will be the file offset we are currently at.
      
        All other helper variables will be only declared inside the loop.
      
        For copy_compressed_data_to_page() it's similar, we will have
        @cur_out at the main cursor, which records how many bytes are in the
        output.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4088803
    • Qu Wenruo's avatar
      btrfs: factor uncompressed async extent submission code into a new helper · 2b83a0ee
      Qu Wenruo authored
      
      
      Introduce a new helper, submit_uncompressed_range(), for async cow cases
      where we fallback to COW.
      
      There are some new updates introduced to the helper:
      
      - Proper locked_page detection
        It's possible that the async_extent range doesn't cover the locked
        page.  In that case we shouldn't unlock the locked page.
      
        In the new helper, we will ensure that we only unlock the locked page
        when:
      
        * The locked page covers part of the async_extent range
        * The locked page is not unlocked by cow_file_range() nor
          extent_write_locked_range()
      
        This also means extra comments are added focusing on the page locking.
      
      - Add extra comment on some rare parameter used.
        We use @unlock_page = 0 for cow_file_range(), where only two call
        sites doing the same thing, including the new helper.
      
        It's definitely worth some comments.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b83a0ee
    • Qu Wenruo's avatar
      btrfs: subpage: make extent_write_locked_range() compatible · 66448b9d
      Qu Wenruo authored
      
      
      There are two sites are not subpage compatible yet for
      extent_write_locked_range():
      
      - How @nr_pages are calculated
        For subpage we can have the following range with 64K page size:
      
        0   32K  64K   96K 128K
        |   |////|/////|   |
      
        In that case, although 96K - 32K == 64K, thus it looks like one page
        is enough, but the range spans two pages, not one.
      
        Fix it by doing proper round_up() and round_down() to calculate
        @nr_pages.
      
        Also add some extra ASSERT()s to ensure the range passed in is already
        aligned.
      
      - How the page end is calculated
        Currently we just use cur + PAGE_SIZE - 1 to calculate the page end.
      
        Which can't handle the above range layout, and will trigger ASSERT()
        in btrfs_writepage_endio_finish_ordered(), as the range is no longer
        covered by the page range.
      
        Fix it by taking page end into consideration.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66448b9d
    • Qu Wenruo's avatar
      btrfs: subpage: make end_compressed_bio_writeback() compatible · 741ec653
      Qu Wenruo authored
      
      
      In end_compressed_writeback() we just clear the full page writeback.
      For subpage case, if there are two delalloc ranges in the same page, the
      2nd range will trigger a BUG_ON() as the page writeback is already
      cleared by previous range.
      
      Fix it by using btrfs_page_clamp_clear_writeback() helper.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      741ec653