Skip to content
  1. Mar 17, 2021
    • Filipe Manana's avatar
      btrfs: fix race when cloning extent buffer during rewind of an old root · dbcc7d57
      Filipe Manana authored
      
      
      While resolving backreferences, as part of a logical ino ioctl call or
      fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
      operations of a root, triggering a stack trace like the following:
      
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:1210!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G        W         5.11.0-2d11c0084b02-misc-next+ #89
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
        Code: 05 48 8d 74 10 (...)
        RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
        RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
        RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
        RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
        R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
        R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
        FS:  00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
        Call Trace:
         btrfs_search_old_slot+0x265/0x10d0
         ? lock_acquired+0xbb/0x600
         ? btrfs_search_slot+0x1090/0x1090
         ? free_extent_buffer.part.61+0xd7/0x140
         ? free_extent_buffer+0x13/0x20
         resolve_indirect_refs+0x3e9/0xfc0
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? add_prelim_ref.part.11+0x150/0x150
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? do_raw_spin_unlock+0xa8/0x140
         ? rb_insert_color+0x30/0x360
         ? prelim_ref_insert+0x12d/0x430
         find_parent_nodes+0x5c3/0x1830
         ? resolve_indirect_refs+0xfc0/0xfc0
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x160/0x210
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? poison_range+0x38/0x40
         ? unpoison_range+0x14/0x40
         ? trace_hardirqs_on+0x55/0x120
         btrfs_find_all_roots_safe+0x142/0x1e0
         ? find_parent_nodes+0x1830/0x1830
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         iterate_extent_inodes+0x20e/0x580
         ? tree_backref_for_extent+0x230/0x230
         ? lock_downgrade+0x3d0/0x3d0
         ? read_extent_buffer+0xdd/0x110
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? _raw_spin_unlock+0x22/0x30
         ? __kasan_check_write+0x14/0x20
         iterate_inodes_from_logical+0x129/0x170
         ? iterate_inodes_from_logical+0x129/0x170
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         ? iterate_extent_inodes+0x580/0x580
         ? __vmalloc_node+0x92/0xb0
         ? init_data_container+0x34/0xb0
         ? init_data_container+0x34/0xb0
         ? kvmalloc_node+0x60/0x80
         btrfs_ioctl_logical_to_ino+0x158/0x230
         btrfs_ioctl+0x205e/0x4040
         ? __might_sleep+0x71/0xe0
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? getrusage+0x4b6/0x9c0
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __might_fault+0x64/0xd0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __task_pid_nr_ns+0xd3/0x250
         ? lock_acquire+0xc7/0x510
         ? __fget_files+0x160/0x230
         ? __fget_light+0xf2/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fd1976e2427
        Code: 00 00 90 48 8b 05 (...)
        RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
        RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
        RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
        R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
        R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
        Modules linked in:
        ---[ end trace ec8931a1c36e57be ]---
      
        (gdb) l *(__tree_mod_log_rewind+0x3b1)
        0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
        1205                     * the modification. as we're going backwards, we do the
        1206                     * opposite of each operation here.
        1207                     */
        1208                    switch (tm->op) {
        1209                    case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        1210                            BUG_ON(tm->slot < n);
        1211                            fallthrough;
        1212                    case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        1213                    case MOD_LOG_KEY_REMOVE:
        1214                            btrfs_set_node_key(eb, &tm->key, tm->slot);
      
      Here's what happens to hit that BUG_ON():
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            tree_mod_log_insert_root(eb X, eb Y, 1);
      
      3) At tree_mod_log_insert_root() we create tree mod log elements for each
         slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
         with a ->logical pointing to ebX->start. These are placed in an array
         named tm_list.
         Lets assume there are N elements (N pointers in eb X);
      
      4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
         element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
         ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
         to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         tm_list as argument. After that, tree_mod_log_free_eb() calls
         __tree_mod_log_insert() for each member of tm_list in reverse order,
         from highest slot in eb X, slot N - 1, to slot 0 of eb X;
      
      6) __tree_mod_log_insert() sets the sequence number of each given tree mod
         log operation - it increments fs_info->tree_mod_seq and sets
         fs_info->tree_mod_seq as the sequence number of the given tree mod log
         operation.
      
         This means that for the tm_list created at tree_mod_log_insert_root(),
         the element corresponding to slot 0 of eb X has the highest sequence
         number (1 + N), and the element corresponding to the last slot has the
         lowest sequence number (2);
      
      7) Then, after inserting tm_list's elements into the tree mod log rbtree,
         the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
         sequence number, which is N + 2;
      
      8) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      9) Later some other task T allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      10) The tree mod log user task calls btrfs_search_old_slot(), which calls
          get_old_root(), and finally that calls __tree_mod_log_oldest_root()
          with time_seq == 1 and eb_root == eb Y;
      
      11) First iteration of the while loop finds the tree mod log element with
          sequence number N + 2, for the logical address of eb Y and of type
          MOD_LOG_ROOT_REPLACE;
      
      12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
          of the loop, and set root_logical to point to tm->old_root.logical
          which corresponds to the logical address of eb X;
      
      13) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
          the old slot N - 1 of eb X (eb X had N items in it before being freed);
      
      14) We then break out of the while loop and return the tree mod log operation
          of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
          eb X, to get_old_root();
      
      15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
          and set "logical" to the logical address of eb X, which was the old
          root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
          which results in adding a tree mod log operation of type
          MOD_LOG_KEY_ADD to the tree mod log - this is done at
          ctree.c:insert_ptr() - but after adding the tree mod log operation
          and before updating the number of items in eb X from 0 to 1...
      
      17) The task at get_old_root() calls tree_mod_log_search() and gets the
          tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
          Then it enters the following if branch:
      
          if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
             (...)
          } (...)
      
          Calls read_tree_block() for eb X, which gets a reference on eb X but
          does not lock it - task T has it locked.
          Then it clones eb X while it has nritems set to 0 in its header, before
          task T sets nritems to 1 in eb X's header. From hereupon we use the
          clone of eb X which no other task has access to;
      
      18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
          mod log operation we just got from tree_mod_log_search() in the
          previous step and the cloned version of eb X;
      
      19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
          of items set in eb X's clone, which is 0. Then we enter the while loop,
          and in its first iteration we process the MOD_LOG_KEY_ADD operation,
          which just decrements "n" from 0 to (u32)-1, since "n" is declared with
          a type of u32. At the end of this iteration we call rb_next() to find the
          next tree mod log operation for eb X, that gives us the mod log operation
          of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
          number of N + 1 (steps 3 to 6);
      
      20) Then we go back to the top of the while loop and trigger the following
          BUG_ON():
      
              (...)
              switch (tm->op) {
              case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                       BUG_ON(tm->slot < n);
                       fallthrough;
              (...)
      
          Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
      
      Fix this by taking a read lock on the extent buffer before cloning it at
      ctree.c:get_old_root(). This should be done regardless of the extent
      buffer having been freed and reused, as a concurrent task might be
      modifying it (while holding a write lock on it).
      
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
      Fixes: 834328a8
      
       ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dbcc7d57
    • David Sterba's avatar
      btrfs: fix slab cache flags for free space tree bitmap · 34e49994
      David Sterba authored
      
      
      The free space tree bitmap slab cache is created with SLAB_RED_ZONE but
      that's a debugging flag and not always enabled. Also the other slabs are
      created with at least SLAB_MEM_SPREAD that we want as well to average
      the memory placement cost.
      
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Fixes: 3acd4850
      
       ("btrfs: fix allocation of free space cache v1 bitmap pages")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      34e49994
  2. Mar 16, 2021
    • Qu Wenruo's avatar
      btrfs: subpage: make readahead work properly · 60484cd9
      Qu Wenruo authored
      
      
      In readahead infrastructure, we are using a lot of hard coded PAGE_SHIFT
      while we're not doing anything specific to PAGE_SIZE.
      
      One of the most affected part is the radix tree operation of
      btrfs_fs_info::reada_tree.
      
      If using PAGE_SHIFT, subpage metadata readahead is broken and does no
      help reading metadata ahead.
      
      Fix the problem by using btrfs_fs_info::sectorsize_bits so that
      readahead could work for subpage.
      
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      60484cd9
    • Qu Wenruo's avatar
      btrfs: subpage: fix wild pointer access during metadata read failure · d9bb77d5
      Qu Wenruo authored
      
      
      [BUG]
      When running fstests for btrfs subpage read-write test, it has a very
      high chance to crash at generic/475 with the following stack:
      
       BTRFS warning (device dm-8): direct IO failed ino 510 rw 1,34817 sector 0xcdf0 len 94208 err no 10
       Unable to handle kernel paging request at virtual address ffff80001157e7c0
       CPU: 2 PID: 687125 Comm: kworker/u12:4 Tainted: G        WC        5.12.0-rc2-custom+ #5
       Hardware name: Khadas VIM3 (DT)
       Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
       pc : queued_spin_lock_slowpath+0x1a0/0x390
       lr : do_raw_spin_lock+0xc4/0x11c
       Call trace:
        queued_spin_lock_slowpath+0x1a0/0x390
        _raw_spin_lock+0x68/0x84
        btree_readahead_hook+0x38/0xc0 [btrfs]
        end_bio_extent_readpage+0x504/0x5f4 [btrfs]
        bio_endio+0x170/0x1a4
        end_workqueue_fn+0x3c/0x60 [btrfs]
        btrfs_work_helper+0x1b0/0x1b4 [btrfs]
        process_one_work+0x22c/0x430
        worker_thread+0x70/0x3a0
        kthread+0x13c/0x140
        ret_from_fork+0x10/0x30
       Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827)
      
      [CAUSE]
      In end_bio_extent_readpage(), if we hit an error during read, we will
      handle the error differently for data and metadata.
      For data we queue a repair, while for metadata, we record the error and
      let the caller choose what to do.
      
      But the code is still using page->private to grab extent buffer, which
      no longer points to extent buffer for subpage metadata pages.
      
      Thus this wild pointer access leads to above crash.
      
      [FIX]
      Introduce a helper, find_extent_buffer_readpage(), to grab extent
      buffer.
      
      The difference against find_extent_buffer_nospinlock() is:
      
      - Also handles regular sectorsize == PAGE_SIZE case
      - No extent buffer refs increase/decrease
        As extent buffer under IO must have non-zero refs, so this is safe
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d9bb77d5
  3. Mar 15, 2021
    • Filipe Manana's avatar
      btrfs: zoned: fix linked list corruption after log root tree allocation failure · e3d3b415
      Filipe Manana authored
      When using a zoned filesystem, while syncing the log, if we fail to
      allocate the root node for the log root tree, we are not removing the
      log context we allocated on stack from the list of log contexts of the
      log root tree. This means after the return from btrfs_sync_log() we get
      a corrupted linked list.
      
      Fix this by allocating the node before adding our stack allocated context
      to the list of log contexts of the log root tree.
      
      Fixes: 3ddebf27
      
       ("btrfs: zoned: reorder log node allocation on zoned filesystem")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3d3b415
    • Qu Wenruo's avatar
      btrfs: fix qgroup data rsv leak caused by falloc failure · a3ee79bd
      Qu Wenruo authored
      
      
      [BUG]
      When running fsstress with only falloc workload, and a very low qgroup
      limit set, we can get qgroup data rsv leak at unmount time.
      
       BTRFS warning (device dm-0): qgroup 0/5 has unreleased space, type 0 rsv 20480
       BTRFS error (device dm-0): qgroup reserved space leaked
      
      The minimal reproducer looks like:
      
        #!/bin/bash
        dev=/dev/test/test
        mnt="/mnt/btrfs"
        fsstress=~/xfstests-dev/ltp/fsstress
        runtime=8
      
        workload()
        {
                umount $dev &> /dev/null
                umount $mnt &> /dev/null
                mkfs.btrfs -f $dev > /dev/null
                mount $dev $mnt
      
                btrfs quota en $mnt
                btrfs quota rescan -w $mnt
                btrfs qgroup limit 16m 0/5 $mnt
      
                $fsstress -w -z -f creat=10 -f fallocate=10 -p 2 -n 100 \
        		-d $mnt -v > /tmp/fsstress
      
                umount $mnt
                if dmesg | grep leak ; then
      		echo "!!! FAILED !!!"
        		exit 1
                fi
        }
      
        for (( i=0; i < $runtime; i++)); do
                echo "=== $i/$runtime==="
                workload
        done
      
      Normally it would fail before round 4.
      
      [CAUSE]
      In function insert_prealloc_file_extent(), we first call
      btrfs_qgroup_release_data() to know how many bytes are reserved for
      qgroup data rsv.
      
      Then use that @qgroup_released number to continue our work.
      
      But after we call btrfs_qgroup_release_data(), we should either queue
      @qgroup_released to delayed ref or free them manually in error path.
      
      Unfortunately, we lack the error handling to free the released bytes,
      leaking qgroup data rsv.
      
      All the error handling function outside won't help at all, as we have
      released the range, meaning in inode io tree, the EXTENT_QGROUP_RESERVED
      bit is already cleared, thus all btrfs_qgroup_free_data() call won't
      free any data rsv.
      
      [FIX]
      Add free_qgroup tag to manually free the released qgroup data rsv.
      
      Reported-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reported-by: default avatarDavid Sterba <dsterba@suse.cz>
      Fixes: 9729f10a
      
       ("btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent()")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a3ee79bd
    • Qu Wenruo's avatar
      btrfs: track qgroup released data in own variable in insert_prealloc_file_extent · fbf48bb0
      Qu Wenruo authored
      
      
      There is a piece of weird code in insert_prealloc_file_extent(), which
      looks like:
      
      	ret = btrfs_qgroup_release_data(inode, file_offset, len);
      	if (ret < 0)
      		return ERR_PTR(ret);
      	if (trans) {
      		ret = insert_reserved_file_extent(trans, inode,
      						  file_offset, &stack_fi,
      						  true, ret);
      	...
      	}
      	extent_info.is_new_extent = true;
      	extent_info.qgroup_reserved = ret;
      	...
      
      Note how the variable @ret is abused here, and if anyone is adding code
      just after btrfs_qgroup_release_data() call, it's super easy to
      overwrite the @ret and cause tons of qgroup related bugs.
      
      Fix such abuse by introducing new variable @qgroup_released, so that we
      won't reuse the existing variable @ret.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbf48bb0
    • Qu Wenruo's avatar
      btrfs: fix wrong offset to zero out range beyond i_size · d2dcc8ed
      Qu Wenruo authored
      [BUG]
      The test generic/091 fails , with the following output:
      
        fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
        mapped writes DISABLED
        Seed set to 1
        main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
        main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
        skipping zero size read
        truncating to largest ever: 0xe400
        copying to largest ever: 0x1f400
        cloning to largest ever: 0x70000
        cloning to largest ever: 0x77000
        fallocating to largest ever: 0x7a120
        Mapped Read: non-zero data past EOF (0x3a7ff) page offset 0x800 is 0xf2e1 <<<
        ...
      
      [CAUSE]
      In commit c28ea613
      
       ("btrfs: subpage: fix the false data csum mismatch error")
      end_bio_extent_readpage() changes to only zero the range inside the bvec
      for incoming subpage support.
      
      But that commit is using incorrect offset to calculate the start.
      
      For subpage, we can have a case that the whole bvec is beyond isize,
      thus we need to calculate the correct offset.
      
      But the offending commit is using @end (bvec end), other than @start
      (bvec start) to calculate the start offset.
      
      This means, we only zero the last byte of the bvec, not from the isize.
      This stupid bug makes the range beyond isize is not properly zeroed, and
      failed above test.
      
      [FIX]
      Use correct @start to calculate the range start.
      
      Reported-by: default avatarkernel test robot <oliver.sang@intel.com>
      Fixes: c28ea613
      
       ("btrfs: subpage: fix the false data csum mismatch error")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2dcc8ed
  4. Mar 04, 2021
  5. Mar 03, 2021
    • Qu Wenruo's avatar
      btrfs: subpage: fix the false data csum mismatch error · c28ea613
      Qu Wenruo authored
      
      
      [BUG]
      When running fstresss, we can hit strange data csum mismatch where the
      on-disk data is in fact correct (passes scrub).
      
      With some extra debug info added, we have the following traces:
      
        0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
        0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
        0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
        0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
        0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
        0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
        0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
        0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
        1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
        1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
        1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
        7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0
      
      [CAUSE]
      Normally we expect all submitted bio reads to only touch the range we
      specified, and under subpage context, it means we should only touch the
      range specified in each bvec.
      
      But in data read path, inside end_bio_extent_readpage(), we have page
      zeroing which only takes regular page size into consideration.
      
      This means for subpage if we have an inode whose content looks like below:
      
        0       16K     32K     48K     64K
        |///////|       |///////|       |
      
        |//| = data needs to be read from disk
        |  | = hole
      
      And i_size is 64K initially.
      
      Then the following race can happen:
      
      		T1		|		T2
      --------------------------------+--------------------------------
      btrfs_do_readpage()		|
      |- isize = 64K;			|
      |  At this time, the isize is 	|
      |  64K				|
      |				|
      |- submit_extent_page()		|
      |  submit previous assembled bio|
      |  assemble bio for [0, 16K)	|
      |				|
      |- submit_extent_page()		|
         submit read bio for [0, 16K) |
         assemble read bio for	|
         [32K, 48K)			|
       				|
      				| btrfs_setsize()
      				| |- i_size_write(, 16K);
      				|    Now i_size is only 16K
      end_io() for [0K, 16K)		|
      |- end_bio_extent_readpage()	|
         |- btrfs_verify_data_csum()  |
         |  No csum error		|
         |- i_size = 16K;		|
         |- zero_user_segment(16K,	|
            PAGE_SIZE);		|
            !!! We zeroed range	|
            !!! [32K, 48K)		|
      				| end_io for [32K, 48K)
      				| |- end_bio_extent_readpage()
      				|    |- btrfs_verify_data_csum()
      				|       ! CSUM MISMATCH !
      				|       ! As the range is zeroed now !
      
      [FIX]
      To fix the problem, make end_bio_extent_readpage() to only zero the
      range of bvec.
      
      The bug only affects subpage read-write support, as for full read-only
      mount we can't change i_size thus won't hit the race condition.
      
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c28ea613
    • Filipe Manana's avatar
      btrfs: fix warning when creating a directory with smack enabled · fd57a98d
      Filipe Manana authored
      
      
      When we have smack enabled, during the creation of a directory smack may
      attempt to add a "smack transmute" xattr on the inode, which results in
      the following warning and trace:
      
        WARNING: CPU: 3 PID: 2548 at fs/btrfs/transaction.c:537 start_transaction+0x489/0x4f0
        Modules linked in: nft_objref nf_conntrack_netbios_ns (...)
        CPU: 3 PID: 2548 Comm: mkdir Not tainted 5.9.0-rc2smack+ #81
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:start_transaction+0x489/0x4f0
        Code: e9 be fc ff ff (...)
        RSP: 0018:ffffc90001887d10 EFLAGS: 00010202
        RAX: ffff88816f1e0000 RBX: 0000000000000201 RCX: 0000000000000003
        RDX: 0000000000000201 RSI: 0000000000000002 RDI: ffff888177849000
        RBP: ffff888177849000 R08: 0000000000000001 R09: 0000000000000004
        R10: ffffffff825e8f7a R11: 0000000000000003 R12: ffffffffffffffe2
        R13: 0000000000000000 R14: ffff88803d884270 R15: ffff8881680d8000
        FS:  00007f67317b8440(0000) GS:ffff88817bcc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f67247a22a8 CR3: 000000004bfbc002 CR4: 0000000000370ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? slab_free_freelist_hook+0xea/0x1b0
         ? trace_hardirqs_on+0x1c/0xe0
         btrfs_setxattr_trans+0x3c/0xf0
         __vfs_setxattr+0x63/0x80
         smack_d_instantiate+0x2d3/0x360
         security_d_instantiate+0x29/0x40
         d_instantiate_new+0x38/0x90
         btrfs_mkdir+0x1cf/0x1e0
         vfs_mkdir+0x14f/0x200
         do_mkdirat+0x6d/0x110
         do_syscall_64+0x2d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f673196ae6b
        Code: 8b 05 11 (...)
        RSP: 002b:00007ffc3c679b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
        RAX: ffffffffffffffda RBX: 00000000000001ff RCX: 00007f673196ae6b
        RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc3c67a30d
        RBP: 00007ffc3c67a30d R08: 00000000000001ff R09: 0000000000000000
        R10: 000055d3e39fe930 R11: 0000000000000246 R12: 0000000000000000
        R13: 00007ffc3c679cd8 R14: 00007ffc3c67a30d R15: 00007ffc3c679ce0
        irq event stamp: 11029
        hardirqs last  enabled at (11037): [<ffffffff81153fe6>] console_unlock+0x486/0x670
        hardirqs last disabled at (11044): [<ffffffff81153c01>] console_unlock+0xa1/0x670
        softirqs last  enabled at (8864): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20
        softirqs last disabled at (8851): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20
      
      This happens because at btrfs_mkdir() we call d_instantiate_new() while
      holding a transaction handle, which results in the following call chain:
      
        btrfs_mkdir()
           trans = btrfs_start_transaction(root, 5);
      
           d_instantiate_new()
              smack_d_instantiate()
                  __vfs_setxattr()
                      btrfs_setxattr_trans()
                         btrfs_start_transaction()
                            start_transaction()
                               WARN_ON()
                                 --> a tansaction start has TRANS_EXTWRITERS
                                     set in its type
                               h->orig_rsv = h->block_rsv
                               h->block_rsv = NULL
      
           btrfs_end_transaction(trans)
      
      Besides the warning triggered at start_transaction, we set the handle's
      block_rsv to NULL which may cause some surprises later on.
      
      So fix this by making btrfs_setxattr_trans() not start a transaction when
      we already have a handle on one, stored in current->journal_info, and use
      that handle. We are good to use the handle because at btrfs_mkdir() we did
      reserve space for the xattr and the inode item.
      
      Reported-by: default avatarCasey Schaufler <casey@schaufler-ca.com>
      CC: stable@vger.kernel.org # 5.4+
      Acked-by: default avatarCasey Schaufler <casey@schaufler-ca.com>
      Tested-by: default avatarCasey Schaufler <casey@schaufler-ca.com>
      Link: https://lore.kernel.org/linux-btrfs/434d856f-bd7b-4889-a6ec-e81aaebfa735@schaufler-ca.com/
      
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd57a98d
    • Nikolay Borisov's avatar
      btrfs: don't flush from btrfs_delayed_inode_reserve_metadata · 4d14c5cd
      Nikolay Borisov authored
      Calling btrfs_qgroup_reserve_meta_prealloc from
      btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
      while holding a transaction and delayed node locks. This is deadlock
      prone. In the past multiple commits:
      
       * ae5e070e ("btrfs: qgroup: don't try to wait flushing if we're
      already holding a transaction")
      
       * 6f23277a ("btrfs: qgroup: don't commit transaction when we already
       hold the handle")
      
      Tried to solve various aspects of this but this was always a
      whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
      scenario involving btrfs_delayed_node::mutex. Namely, one thread
      can call btrfs_dirty_inode as a result of reading a file and modifying
      its atime:
      
        PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
        #0  __schedule at ffffffffa529e07d
        #1  schedule at ffffffffa529e4ff
        #2  schedule_timeout at ffffffffa52a1bdd
        #3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
        #4  start_delalloc_inodes at ffffffffc0380db5
        #5  btrfs_start_delalloc_snapshot at ffffffffc0393836
        #6  try_flush_qgroup at ffffffffc03f04b2
        #7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
        #8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
        #9  btrfs_update_inode at ffffffffc0385ba8
       #10  btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTIION OPENED
       #11  touch_atime at ffffffffa4cf0000
       #12  generic_file_read_iter at ffffffffa4c1f123
       #13  new_sync_read at ffffffffa4ccdc8a
       #14  vfs_read at ffffffffa4cd0849
       #15  ksys_read at ffffffffa4cd0bd1
       #16  do_syscall_64 at ffffffffa4a052eb
       #17  entry_SYSCALL_64_after_hwframe at ffffffffa540008c
      
      This will cause an asynchronous work to flush the delalloc inodes to
      happen which can try to acquire the same delayed_node mutex:
      
        PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
        #0  __schedule at ffffffffa529e07d
        #1  schedule at ffffffffa529e4ff
        #2  schedule_preempt_disabled at ffffffffa529e80a
        #3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
        #4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
        #5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
        #6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
        #7  cow_file_range at ffffffffc03879c1
        #8  btrfs_run_delalloc_range at ffffffffc038894c
        #9  writepage_delalloc at ffffffffc03a3c8f
       #10  __extent_writepage at ffffffffc03a4c01
       #11  extent_write_cache_pages at ffffffffc03a500b
       #12  extent_writepages at ffffffffc03a6de2
       #13  do_writepages at ffffffffa4c277eb
       #14  __filemap_fdatawrite_range at ffffffffa4c1e5bb
       #15  btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
       #16  normal_work_helper at ffffffffc03b706c
       #17  process_one_work at ffffffffa4aba4e4
       #18  worker_thread at ffffffffa4aba6fd
       #19  kthread at ffffffffa4ac0a3d
       #20  ret_from_fork at ffffffffa54001ff
      
      To fully address those cases the complete fix is to never issue any
      flushing while holding the transaction or the delayed node lock. This
      patch achieves it by calling qgroup_reserve_meta directly which will
      either succeed without flushing or will fail and return -EDQUOT. In the
      latter case that return value is going to be propagated to
      btrfs_dirty_inode which will fallback to start a new transaction. That's
      fine as the majority of time we expect the inode will have
      BTRFS_DELAYED_NODE_INODE_DIRTY flag set which will result in directly
      copying the in-memory state.
      
      Fixes: c53e9653
      
       ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d14c5cd
  6. Mar 02, 2021
  7. Feb 23, 2021
    • Johannes Thumshirn's avatar
      btrfs: zoned: fix deadlock on log sync · 6e37d245
      Johannes Thumshirn authored
      Lockdep with fstests test case btrfs/041 detected a unsafe locking
      scenario when we allocate the log node on a zoned filesystem.
      
      btrfs/041
       ============================================
       WARNING: possible recursive locking detected
       5.11.0-rc7+ #939 Not tainted
       --------------------------------------------
       xfs_io/698 is trying to acquire lock:
       ffff88810cd673a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x3d1/0xee0 [btrfs]
      
       but task is already holding lock:
       ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&root->log_mutex);
         lock(&root->log_mutex);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       2 locks held by xfs_io/698:
        #0: ffff88810cd66620 (sb_internal){.+.+}-{0:0}, at: btrfs_sync_file+0x2c3/0x570 [btrfs]
        #1: ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]
      
       stack backtrace:
       CPU: 0 PID: 698 Comm: xfs_io Not tainted 5.11.0-rc7+ #939
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
       Call Trace:
        dump_stack+0x77/0x97
        __lock_acquire.cold+0xb9/0x32a
        lock_acquire+0xb5/0x400
        ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
        __mutex_lock+0x7b/0x8d0
        ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
        ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
        ? find_first_extent_bit+0x9f/0x100 [btrfs]
        ? __mutex_unlock_slowpath+0x35/0x270
        btrfs_sync_log+0x3d1/0xee0 [btrfs]
        btrfs_sync_file+0x3a8/0x570 [btrfs]
        __x64_sys_fsync+0x34/0x60
        do_syscall_64+0x33/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens, because we are taking the ->log_mutex albeit it has already
      been locked.
      
      Also while at it, fix the bogus unlock of the tree_log_mutex in the error
      handling.
      
      Fixes: 3ddebf27
      
       ("btrfs: zoned: reorder log node allocation on zoned filesystem")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6e37d245
    • Josef Bacik's avatar
      btrfs: avoid double put of block group when emptying cluster · 95c85fba
      Josef Bacik authored
      It's wrong calling btrfs_put_block_group in
      __btrfs_return_cluster_to_free_space if the block group passed is
      different than the block group the cluster represents. As this means the
      cluster doesn't have a reference to the passed block group. This results
      in double put and a use-after-free bug.
      
      Fix this by simply bailing if the block group we passed in does not
      match the block group on the cluster.
      
      Fixes: fa9c0d79
      
       ("Btrfs: rework allocation clustering")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      95c85fba
    • Filipe Manana's avatar
      btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled · 3660d0bc
      Filipe Manana authored
      
      
      When using the NO_HOLES feature, if we clone a file range that spans only
      a hole into a range that is at or beyond the current i_size of the
      destination file, we end up not setting the full sync runtime flag on the
      inode. As a result, if we then fsync the destination file and have a power
      failure, after log replay we can end up exposing stale data instead of
      having a hole for that range.
      
      The conditions for this to happen are the following:
      
      1) We have a file with a size of, for example, 1280K;
      
      2) There is a written (non-prealloc) extent for the file range from 1024K
         to 1280K with a length of 256K;
      
      3) This particular file extent layout is durably persisted, so that the
         existing superblock persisted on disk points to a subvolume root where
         the file has that exact file extent layout and state;
      
      4) The file is truncated to a smaller size, to an offset lower than the
         start offset of its last extent, for example to 800K. The truncate sets
         the full sync runtime flag on the inode;
      
      6) Fsync the file to log it and clear the full sync runtime flag;
      
      7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
         into the file with a destination offset that starts at or beyond the
         256K file extent item we had - for example to offset 1024K;
      
      8) Since the clone operation does not find extents in the source range,
         we end up in the if branch at the bottom of btrfs_clone() where we
         punch a hole for the file range starting at offset 1024K by calling
         btrfs_replace_file_extents(). There we end up not setting the full
         sync flag on the inode, because we don't know we are being called in
         a clone context (and not fallocate's punch hole operation), and
         neither do we create an extent map to represent a hole because the
         requested range is beyond eof;
      
      9) A further fsync to the file will be a fast fsync, since the clone
         operation did not set the full sync flag, and therefore it relies on
         modified extent maps to correctly log the file layout. But since
         it does not find any extent map marking the range from 1024K (the
         previous eof) to the new eof, it does not log a file extent item
         for that range representing the hole;
      
      10) After a power failure no hole for the range starting at 1024K is
         punched and we end up exposing stale data from the old 256K extent.
      
      Turning this into exact steps:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdi
        $ mount /dev/sdi /mnt
      
        # Create our test file with 3 extents of 256K and a 256K hole at offset
        # 256K. The file has a size of 1280K.
        $ xfs_io -f -s \
                    -c "pwrite -S 0xab -b 256K 0 256K" \
                    -c "pwrite -S 0xcd -b 256K 512K 256K" \
                    -c "pwrite -S 0xef -b 256K 768K 256K" \
                    -c "pwrite -S 0x73 -b 256K 1024K 256K" \
                    /mnt/sdi/foobar
      
        # Make sure it's durably persisted. We want the last committed super
        # block to point to this particular file extent layout.
        sync
      
        # Now truncate our file to a smaller size, falling within a position of
        # the second extent. This sets the full sync runtime flag on the inode.
        # Then fsync the file to log it and clear the full sync flag from the
        # inode. The third extent is no longer part of the file and therefore
        # it is not logged.
        $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
      
        # Now do a clone operation that only clones the hole and sets back the
        # file size to match the size it had before the truncate operation
        # (1280K).
        $ xfs_io \
              -c "reflink /mnt/foobar 256K 1024K 256K" \
              -c "fsync" \
              /mnt/foobar
      
        # File data before power failure:
        $ od -A d -t x1 /mnt/foobar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
        *
        0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
        *
        0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        1310720
      
        <power fail>
      
        # Mount the fs again to replay the log tree.
        $ mount /dev/sdi /mnt
      
        # File data after power failure:
        $ od -A d -t x1 /mnt/foobar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
        *
        0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
        *
        0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
        *
        1310720
      
      The range from 1024K to 1280K should correspond to a hole but instead it
      points to stale data, to the 256K extent that should not exist after the
      truncate operation.
      
      The issue does not exists when not using NO_HOLES, because for that case
      we use file extent items to represent holes, these are found and copied
      during the loop that iterates over extents at btrfs_clone(), and that
      causes btrfs_replace_file_extents() to be called with a non-NULL
      extent_info argument and therefore set the full sync runtime flag on the
      inode.
      
      So fix this by making the code that deals with a trailing hole during
      cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
      range starts at or beyond the current i_size.
      
      A test case for fstests will follow soon.
      
      Backporting notes: for kernel 5.4 the change goes to ioctl.c into
      btrfs_clone before the last call to btrfs_punch_hole_range.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3660d0bc
    • Josef Bacik's avatar
      btrfs: tree-checker: do not error out if extent ref hash doesn't match · 1119a72e
      Josef Bacik authored
      
      
      The tree checker checks the extent ref hash at read and write time to
      make sure we do not corrupt the file system.  Generally extent
      references go inline, but if we have enough of them we need to make an
      item, which looks like
      
      key.objectid	= <bytenr>
      key.type	= <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
      key.offset	= hash(tree, owner, offset)
      
      However if key.offset collide with an unrelated extent reference we'll
      simply key.offset++ until we get something that doesn't collide.
      Obviously this doesn't match at tree checker time, and thus we error
      while writing out the transaction.  This is relatively easy to
      reproduce, simply do something like the following
      
        xfs_io -f -c "pwrite 0 1M" file
        offset=2
      
        for i in {0..10000}
        do
      	  xfs_io -c "reflink file 0 ${offset}M 1M" file
      	  offset=$(( offset + 2 ))
        done
      
        xfs_io -c "reflink file 0 17999258914816 1M" file
        xfs_io -c "reflink file 0 35998517829632 1M" file
        xfs_io -c "reflink file 0 53752752058368 1M" file
      
        btrfs filesystem sync
      
      And the sync will error out because we'll abort the transaction.  The
      magic values above are used because they generate hash collisions with
      the first file in the main subvol.
      
      The fix for this is to remove the hash value check from tree checker, as
      we have no idea which offset ours should belong to.
      
      Reported-by: default avatarTuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
      Fixes: 0785a9aa
      
       ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1119a72e
    • Filipe Manana's avatar
      btrfs: fix race between swap file activation and snapshot creation · dd0734f2
      Filipe Manana authored
      When creating a snapshot we check if the current number of swap files, in
      the root, is non-zero, and if it is, we error out and warn that we can not
      create the snapshot because there are active swap files.
      
      However this is racy because when a task started activation of a swap
      file, another task might have started already snapshot creation and might
      have seen the counter for the number of swap files as zero. This means
      that after the swap file is activated we may end up with a snapshot of the
      same root successfully created, and therefore when the first write to the
      swap file happens it has to fall back into COW mode, which should never
      happen for active swap files.
      
      Basically what can happen is:
      
      1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
         There it sees that root->nr_swapfiles has a value of 0 so it continues;
      
      2) Task B enters btrfs_swap_activate(). It is not aware that another task
         started snapshot creation but it did not finish yet. It increments
         root->nr_swapfiles from 0 to 1;
      
      3) Task B checks that the file meets all requirements to be an active
         swap file - it has NOCOW set, there are no snapshots for the inode's
         root at the moment, no file holes, no reflinked extents, etc;
      
      4) Task B returns success and now the file is an active swap file;
      
      5) Task A commits the transaction to create the snapshot and finishes.
         The swap file's extents are now shared between the original root and
         the snapshot;
      
      6) A write into an extent of the swap file is attempted - there is a
         snapshot of the file's root, so we fall back to COW mode and therefore
         the physical location of the extent changes on disk.
      
      So fix this by taking the snapshot lock during swap file activation before
      locking the extent range, as that is the order in which we lock these
      during buffered writes.
      
      Fixes: ed46ff3d
      
       ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd0734f2
    • Filipe Manana's avatar
      btrfs: fix race between writes to swap files and scrub · 195a49ea
      Filipe Manana authored
      When we active a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents to be changed by operations such as balance and device
      replace/resize/remove. We also call there can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents to be
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group into RO before it processes it
      and then back again to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group, will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents that are used
      by active swap files can not be turned into RO mode, therefore making it
      not possible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a warning
      message that mentions the block group was skipped due to active swap
      files - this is the same approach we currently use for balance.
      
      Fixes: ed46ff3d
      
       ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      195a49ea
    • Filipe Manana's avatar
      btrfs: avoid checking for RO block group twice during nocow writeback · 20903032
      Filipe Manana authored
      
      
      During the nocow writeback path, we currently iterate the rbtree of block
      groups twice: once for checking if the target block group is RO with the
      call to btrfs_extent_readonly()), and once again for getting a nocow
      reference on the block group with a call to btrfs_inc_nocow_writers().
      
      Since btrfs_inc_nocow_writers() already returns false when the target
      block group is RO, remove the call to btrfs_extent_readonly(). Not only
      we avoid searching the blocks group rbtree twice, it also helps reduce
      contention on the lock that protects it (specially since it is a spin
      lock and not a read-write lock). That may make a noticeable difference
      on very large filesystems, with thousands of allocated block groups.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      20903032
    • Nikolay Borisov's avatar
      btrfs: fix race between extent freeing/allocation when using bitmaps · 3c179165
      Nikolay Borisov authored
      
      
      During allocation the allocator will try to allocate an extent using
      cluster policy. Once the current cluster is exhausted it will remove the
      entry under btrfs_free_cluster::lock and subsequently acquire
      btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
      and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
      because there exists a race condition between removing the entry under
      one lock and doing the necessary accounting holding a different lock
      since extent freeing only uses the 2nd lock. This can result in the
      following situation:
      
      T1:                                    T2:
      btrfs_alloc_from_cluster               insert_into_bitmap <holds tree_lock>
       if (entry->bytes == 0)                   if (block_group && !list_empty(&block_group->cluster_list)) {
          rb_erase(entry)
      
       spin_unlock(&cluster->lock);
         (total_bitmaps is still 4)           spin_lock(&cluster->lock);
                                               <doesn't find entry in cluster->root>
       spin_lock(&ctl->tree_lock);             <goes to new_bitmap label, adds
      <blocked since T2 holds tree_lock>       <a new entry and calls add_new_bitmap>
      					    recalculate_thresholds  <crashes,
                                                    due to total_bitmaps
      					      becoming 5 and triggering
      					      an ASSERT>
      
      To fix this ensure that once depleted, the cluster entry is deleted when
      both cluster lock and tree locks are held in the allocator (T1), this
      ensures that even if there is a race with a concurrent
      insert_into_bitmap call it will correctly find the entry in the cluster
      and add the new space to it.
      
      CC: <stable@vger.kernel.org> # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c179165
    • Qu Wenruo's avatar
      btrfs: make check_compressed_csum() to be subpage compatible · 04d4ba4c
      Qu Wenruo authored
      
      
      Currently check_compressed_csum() completely relies on sectorsize ==
      PAGE_SIZE to do checksum verification for compressed extents.
      
      To make it subpage compatible, this patch will:
      - Do extra calculation for the csum range
        Since we have multiple sectors inside a page, we need to only hash
        the range we want, not the full page anymore.
      
      - Do sector-by-sector hash inside the page
      
      With this patch and previous conversion on
      btrfs_submit_compressed_read(), now we can read subpage compressed
      extents properly, and do proper csum verification.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      04d4ba4c
    • Qu Wenruo's avatar
      btrfs: make btrfs_submit_compressed_read() subpage compatible · be6a1361
      Qu Wenruo authored
      
      
      For compressed read, we always submit page read using page size.  This
      doesn't work well with subpage, as for subpage one page can contain
      several sectors.  Such submission will read range out of what we want,
      and cause problems.
      
      Thankfully to make it subpage compatible, we only need to change how the
      last page of the compressed extent is read.
      
      Instead of always adding a full page to the compressed read bio, if we're
      at the last page, calculate the size using compressed length, so that we
      only add part of the range into the compressed read bio.
      
      Since we are here, also change the PAGE_SIZE used in
      lookup_extent_mapping() to sectorsize.
      This modification won't cause any functional change, as
      lookup_extent_mapping() can handle the case where the search range is
      larger than found extent range.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be6a1361
    • Ira Weiny's avatar
      btrfs: fix raid6 qstripe kmap · d70cef0d
      Ira Weiny authored
      When a qstripe is required an extra page is allocated and mapped.  There
      were 3 problems:
      
      1) There is no corresponding call of kunmap() for the qstripe page.
      2) There is no reason to map the qstripe page more than once if the
         number of bits set in rbio->dbitmap is greater than one.
      3) There is no reason to map the parity page and unmap it each time
         through the loop.
      
      The page memory can continue to be reused with a single mapping on each
      iteration by raid6_call.gen_syndrome() without remapping.  So map the
      page for the duration of the loop.
      
      Similarly, improve the algorithm by mapping the parity page just 1 time.
      
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4.x: c17af965
      
      : btrfs: raid56: simplify tracking of Q stripe presence
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d70cef0d
  8. Feb 09, 2021
    • Naohiro Aota's avatar
      btrfs: zoned: enable to mount ZONED incompat flag · 9d294a68
      Naohiro Aota authored
      
      
      This final patch adds the ZONED incompat flag to the supported flags
      and enables to mount ZONED flagged file system.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9d294a68
    • Naohiro Aota's avatar
      btrfs: zoned: deal with holes writing out tree-log pages · b528f467
      Naohiro Aota authored
      
      
      Since the zoned filesystem requires sequential write out of metadata, we
      cannot proceed with a hole in tree-log pages. When such a hole exists,
      btree_write_cache_pages() will return -EAGAIN. This happens when someone,
      e.g., a concurrent transaction commit, writes a dirty extent in this
      tree-log commit.
      
      If we are not going to wait for the extents, we can hope the concurrent
      writing fills the hole for us. So, we can ignore the error in this case and
      hope the next write will succeed.
      
      If we want to wait for them and got the error, we cannot wait for them
      because it will cause a deadlock. So, let's bail out to a full commit in
      this case.
      
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b528f467
    • Naohiro Aota's avatar
      btrfs: zoned: reorder log node allocation on zoned filesystem · 3ddebf27
      Naohiro Aota authored
      
      
      This is the 3/3 patch to enable tree-log on zoned filesystems.
      
      The allocation order of nodes of "fs_info->log_root_tree" and nodes of
      "root->log_root" is not the same as the writing order of them. So, the
      writing causes unaligned write errors.
      
      Reorder the allocation of them by delaying allocation of the root node of
      "fs_info->log_root_tree," so that the node buffers can go out sequentially
      to devices.
      
      Cc: Filipe Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ddebf27
    • Naohiro Aota's avatar
      btrfs: zoned: serialize log transaction on zoned filesystems · fa1a0f42
      Naohiro Aota authored
      
      
      This is the 2/3 patch to enable tree-log on zoned filesystems.
      
      Since we can start more than one log transactions per subvolume
      simultaneously, nodes from multiple transactions can be allocated
      interleaved. Such mixed allocation results in non-sequential writes at
      the time of a log transaction commit. The nodes of the global log root
      tree (fs_info->log_root_tree), also have the same problem with mixed
      allocation.
      
      Serializes log transactions by waiting for a committing transaction when
      someone tries to start a new transaction, to avoid the mixed allocation
      problem. We must also wait for running log transactions from another
      subvolume, but there is no easy way to detect which subvolume root is
      running a log transaction. So, this patch forbids starting a new log
      transaction when other subvolumes already allocated the global log root
      tree.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa1a0f42
    • Naohiro Aota's avatar
      btrfs: zoned: extend zoned allocator to use dedicated tree-log block group · 40ab3be1
      Naohiro Aota authored
      
      
      This is the 1/3 patch to enable tree log on zoned filesystems.
      
      The tree-log feature does not work on a zoned filesystem as is. Blocks for
      a tree-log tree are allocated mixed with other metadata blocks and btrfs
      writes and syncs the tree-log blocks to devices at the time of fsync(),
      which has a different timing than a global transaction commit. As a
      result, both writing tree-log blocks and writing other metadata blocks
      become non-sequential writes that zoned filesystems must avoid.
      
      Introduce a dedicated block group for tree-log blocks, so that tree-log
      blocks and other metadata blocks can be separate write streams.  As a
      result, each write stream can now be written to devices separately.
      "fs_info->treelog_bg" tracks the dedicated block group and assigns
      "treelog_bg" on-demand on tree-log block allocation time.
      
      This commit extends the zoned block allocator to use the block group.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40ab3be1
    • Naohiro Aota's avatar
      btrfs: split alloc_log_tree() · 6ab6ebb7
      Naohiro Aota authored
      
      
      This is a preparation patch for the next patch. Split alloc_log_tree()
      into two parts. The first one allocating the tree structure, remains in
      alloc_log_tree() and the second part allocating the tree node, which is
      moved into btrfs_alloc_log_tree_node().
      
      Also export the latter part is to be used in the next patch.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ab6ebb7
    • Naohiro Aota's avatar
      btrfs: zoned: relocate block group to repair IO failure in zoned filesystems · f7ef5287
      Naohiro Aota authored
      
      
      When a bad checksum is found and if the filesystem has a mirror of the
      damaged data, we read the correct data from the mirror and writes it to
      damaged blocks. This however, violates the sequential write constraints
      of a zoned block device.
      
      We can consider three methods to repair an IO failure in zoned filesystems:
      
      (1) Reset and rewrite the damaged zone
      (2) Allocate new device extent and replace the damaged device extent to
          the new extent
      (3) Relocate the corresponding block group
      
      Method (1) is most similar to a behavior done with regular devices.
      However, it also wipes non-damaged data in the same device extent, and
      so it unnecessary degrades non-damaged data.
      
      Method (2) is much like device replacing but done in the same device. It
      is safe because it keeps the device extent until the replacing finish.
      However, extending device replacing is non-trivial. It assumes
      "src_dev->physical == dst_dev->physical". Also, the extent mapping
      replacing function should be extended to support replacing device extent
      position in one device.
      
      Method (3) invokes relocation of the damaged block group and is
      straightforward to implement. It relocates all the mirrored device
      extents, so it potentially is a more costly operation than method (1) or
      (2). But it relocates only used extents which reduce the total IO size.
      
      Let's apply method (3) for now. In the future, we can extend device-replace
      and apply method (2).
      
      For protecting a block group gets relocated multiple time with multiple
      IO errors, this commit introduces "relocating_repair" bit to show it's
      now relocating to repair IO failures. Also it uses a new kthread
      "btrfs-relocating-repair", not to block IO path with relocating process.
      
      This commit also supports repairing in the scrub process.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7ef5287
    • Naohiro Aota's avatar
      btrfs: zoned: enable relocation on a zoned filesystem · 32430c61
      Naohiro Aota authored
      
      
      Currently fallocate() is disabled on a zoned filesystem. Since current
      relocation process relies on preallocation to move file data extents, it
      must be handled differently.
      
      On a zoned filesystem, we just truncate the inode to the size that we
      wanted to pre-allocate. Then, we flush dirty pages on the file before
      finishing the relocation process. run_delalloc_zoned() will handle all
      the allocations and submit IOs to the underlying layers.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      32430c61
    • Naohiro Aota's avatar
      btrfs: zoned: support dev-replace in zoned filesystems · 7db1c5d1
      Naohiro Aota authored
      
      
      This is 4/4 patch to implement device-replace on zoned filesystems.
      
      Even after the copying is done, the write pointers of the source device
      and the destination device may not be synchronized. For example, when
      the last allocated extent is freed before device-replace process, the
      extent is not copied, leaving a hole there.
      
      Synchronize the write pointers by writing zeroes to the destination
      device.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7db1c5d1
    • Naohiro Aota's avatar
      btrfs: zoned: implement copying for zoned device-replace · de17addc
      Naohiro Aota authored
      This is 3/4 patch to implement device-replace on zoned filesystems.
      
      This commit implements copying. To do this, it tracks the write pointer
      during the device replace process. As device-replace's copy process is
      smart enough to only copy used extents on the source device, we have to
      fill the gap to honor the sequential write requirement in the target
      device.
      
      The device-replace process on zoned filesystems must copy or clone all
      the extents in the source device exactly once. So, we need to ensure
      allocations started just before the dev-replace process to have their
      corresponding extent information in the B-trees.
      finish_extent_writes_for_zoned() implements that functionality, which
      basically is the removed code in the commit 042528f8
      
       ("Btrfs: fix
      block group remaining RO forever after error during device replace").
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de17addc