  1. Jun 03, 2021
  2. May 28, 2021
    • btrfs: fix deadlock when cloning inline extents and low on available space · 76a6d5cd
      Filipe Manana authored
      There are a few cases where cloning an inline extent requires copying data
      into a page of the destination inode. For these cases we are allocating
      the required data and metadata space while holding a leaf locked. This can
      result in a deadlock when we are low on available space because allocating
      the space may flush delalloc and two deadlock scenarios can happen:
      
      1) When starting writeback for an inode with a very small dirty range that
         fits in an inline extent, we deadlock during the writeback when trying
         to insert the inline extent, at cow_file_range_inline(), if the extent
         is going to be located in the leaf for which we are already holding a
         read lock;
      
      2) After successfully starting writeback, for non-inline extent cases,
         the async reclaim thread will hang waiting for an ordered extent to
         complete if the ordered extent completion needs to modify the leaf
         for which the clone task is holding a read lock (for adding or
         replacing file extent items). So the cloning task will wait forever
         on the async reclaim thread to make progress, which in turn is
         waiting for the ordered extent completion which in turn is waiting
         to acquire a write lock on the same leaf.
      
      So fix this by making sure we release the path (and therefore the leaf)
      every time we need to copy the inline extent's data into a page of the
      destination inode, as by that time we do not need to have the leaf locked.
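
The ordering described above (release the path first, then reserve space, then copy) can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_path`, `toy_release_path`, `toy_reserve_space` and `toy_copy_inline_to_page` are hypothetical names, not btrfs functions, the deadlock hazard is modelled as an `-EDEADLK` return, and the inline data is assumed to already sit in a private buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a btrfs path that holds a locked leaf. */
struct toy_path {
	bool leaf_locked;
};

/* Drop the leaf lock, as releasing the path would. */
static void toy_release_path(struct toy_path *path)
{
	path->leaf_locked = false;
}

/*
 * Toy space reservation. In the scenario from the commit message the
 * reservation may flush delalloc and wait on work that needs the leaf;
 * this model reports that hazard as -EDEADLK when the leaf is still held.
 */
static int toy_reserve_space(const struct toy_path *path)
{
	return path->leaf_locked ? -EDEADLK : 0;
}

/*
 * Copy inline extent data into a destination page buffer and zero the
 * remainder of the page. The path is released before reserving space,
 * because the leaf is no longer needed at that point.
 */
static int toy_copy_inline_to_page(struct toy_path *path,
				   const char *inline_data, size_t len,
				   char *page, size_t page_size)
{
	int ret;

	assert(path && inline_data && page);
	if (len > page_size)
		return -EINVAL;

	toy_release_path(path);
	ret = toy_reserve_space(path);
	if (ret)
		return ret;

	memcpy(page, inline_data, len);
	memset(page + len, 0, page_size - len);
	return 0;
}
```

Calling `toy_reserve_space()` while `leaf_locked` is still true returns the modelled deadlock error, whereas going through `toy_copy_inline_to_page()` succeeds because the path is dropped first.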
      
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix fsync failure and transaction abort after writes to prealloc extents · ea7036de
      Filipe Manana authored
      When doing a series of partial writes to different ranges of preallocated
      extents with transaction commits and fsyncs in between, we can end up with
      checksum items that have overlapping ranges in a log tree. This causes an
      fsync to fail with -EIO and abort the transaction, turning the filesystem
      to RO mode, when syncing the log.
      
      For this to happen, we need to have a full fsync of a file following one
      or more fast fsyncs.
      
      The following example reproduces the problem and explains how it happens:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        # Create our test file with 2 preallocated extents. Leave a 1M hole
        # between them to ensure that we get two file extent items that will
        # never be merged into a single one. The extents are contiguous on disk,
        # which will later result in the checksums for their data to be merged
        # into a single checksum item in the csums btree.
        #
        $ xfs_io -f \
                 -c "falloc 0 1M" \
                 -c "falloc 3M 3M" \
                 /mnt/foobar
      
        # Now write to the second extent and leave only 1M of it as unwritten,
        # which corresponds to the file range [4M, 5M[.
        #
        # Then fsync the file to flush delalloc and to clear full sync flag from
        # the inode, so that a future fsync will use the fast code path.
        #
        # After the writeback triggered by the fsync we have 3 file extent items
        # that point to the second extent we previously allocated:
        #
        # 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
        #    file range [3M, 4M[
        #
        # 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
        #    the file range [4M, 5M[
        #
        # 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
        #    file range [5M, 6M[
        #
        # All these file extent items have a generation of 6, which is the ID of
        # the transaction where they were created. The split of the original file
        # extent item is done at btrfs_mark_extent_written() when ordered extents
        # complete for the file ranges [3M, 4M[ and [5M, 6M[.
        #
        $ xfs_io -c "pwrite -S 0xab 3M 1M" \
                 -c "pwrite -S 0xef 5M 1M" \
                 -c "fsync" \
                 /mnt/foobar
      
        # Commit the current transaction. This wipes out the log tree created by
        # the previous fsync.
        $ sync
      
        # Now write to the unwritten range of the second extent we allocated,
        # corresponding to the file range [4M, 5M[, and fsync the file, which
        # triggers the fast fsync code path.
        #
        # The fast fsync code path sees that there is a new extent map covering
        # the file range [4M, 5M[ and therefore it will log a checksum item
        # covering the range [1M, 2M[ of the second extent we allocated.
        #
        # Also, after the fsync finishes we no longer have the 3 file extent
        # items that pointed to 3 sections of the second extent we allocated.
        # Instead we end up with a single file extent item pointing to the whole
        # extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
        # current transaction ID). This is due to the file extent item merging we
        # do when completing ordered extents into ranges that point to unwritten
        # (preallocated) extents. This merging is done at
        # btrfs_mark_extent_written().
        #
        $ xfs_io -c "pwrite -S 0xcd 4M 1M" \
                 -c "fsync" \
                 /mnt/foobar
      
        # Now do some write to our file outside the range of the second extent
        # that we allocated with fallocate() and truncate the file size from 6M
        # down to 5M.
        #
        # The truncate operation sets the full sync runtime flag on the inode,
        # forcing the next fsync to use the slow code path. It also changes the
        # length of the second file extent item so that it represents the file
        # range [3M, 5M[ and not the range [3M, 6M[ anymore.
        #
        # Finally fsync the file. Since this is a fsync that triggers the slow
        # code path, it will remove all items associated to the inode from the
        # log tree and then it will scan for file extent items in the
        # fs/subvolume tree that have a generation matching the current
        # transaction ID, which is 7. This means it will log 2 file extent
        # items:
        #
        # 1) One for the first extent we allocated, covering the file range
        #    [0, 1M[
        #
        # 2) Another for the first 2M of the second extent we allocated,
        #    covering the file range [3M, 5M[
        #
        # When logging the first file extent item we log a single checksum item
        # that has all the checksums for the entire extent.
        #
        # When logging the second file extent item, we also look up the
        # checksums that are associated with the range [0, 2M[ of the second
        # extent we allocated (file range [3M, 5M[), and then we log them with
        # btrfs_csum_file_blocks(). However that results in ending up with a log
        # that has two checksum items with ranges that overlap:
        #
        # 1) One for the range [1M, 2M[ of the second extent we allocated,
        #    corresponding to the file range [4M, 5M[, which we logged in the
        #    previous fsync that used the fast code path;
        #
        # 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
        #    extents, respectively, corresponding to the file ranges [0, 1M[
        #    and [3M, 5M[. This one was added during this last fsync that uses
        #    the slow code path and overlaps with the previous one logged by
        #    the previous fast fsync.
        #
        # This happens because when logging the checksums for the second
        # extent, we notice they start at an offset that matches the end of the
        # checksums item that we logged for the first extent, and because both
        # extents are contiguous on disk, btrfs_csum_file_blocks() decides to
        # extend that existing checksums item and append the checksums for the
        # second extent to this item. The end result is we end up with two
        # checksum items in the log tree that have overlapping ranges, as
        # listed before, causing the fsync to fail with -EIO and abort the
        # transaction, turning the filesystem into RO mode.
        #
        $ xfs_io -c "pwrite -S 0xff 0 1M" \
                 -c "truncate 5M" \
                 -c "fsync" \
                 /mnt/foobar
        fsync: Input/output error
      
      After running the example, dmesg/syslog shows the tree checker complained
      about the checksum items with overlapping ranges and we aborted the
      transaction:
      
        $ dmesg
        (...)
        [756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
        [756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
        [756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
        [756289.563654] 	item 0 key (257 1 0) itemoff 16123 itemsize 160
        [756289.564649] 		inode generation 6 size 5242880 mode 100600
        [756289.565636] 	item 1 key (257 12 256) itemoff 16107 itemsize 16
        [756289.566694] 	item 2 key (257 108 0) itemoff 16054 itemsize 53
        [756289.567725] 		extent data disk bytenr 13631488 nr 1048576
        [756289.568697] 		extent data offset 0 nr 1048576 ram 1048576
        [756289.569689] 	item 3 key (257 108 1048576) itemoff 16001 itemsize 53
        [756289.570682] 		extent data disk bytenr 0 nr 0
        [756289.571363] 		extent data offset 0 nr 2097152 ram 2097152
        [756289.572213] 	item 4 key (257 108 3145728) itemoff 15948 itemsize 53
        [756289.573246] 		extent data disk bytenr 14680064 nr 3145728
        [756289.574121] 		extent data offset 0 nr 2097152 ram 3145728
        [756289.574993] 	item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
        [756289.576113] 	item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
        [756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
        [756289.578644] ------------[ cut here ]------------
        [756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
        [756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G        W         5.12.0-rc8-btrfs-next-87 #1
        [756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
        [756289.595122] Code: 5d c3 e8 76 60 (...)
        [756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
        [756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
        [756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
        [756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
        [756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
        [756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
        [756289.602278] FS:  00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
        [756289.603217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
        [756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [756289.606400] Call Trace:
        [756289.606704]  btree_csum_one_bio+0x244/0x2b0 [btrfs]
        [756289.607313]  btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
        [756289.608040]  submit_one_bio+0x61/0x70 [btrfs]
        [756289.608587]  btree_write_cache_pages+0x587/0x610 [btrfs]
        [756289.609258]  ? free_debug_processing+0x1d5/0x240
        [756289.609812]  ? __module_address+0x28/0xf0
        [756289.610298]  ? lock_acquire+0x1a0/0x3e0
        [756289.610754]  ? lock_acquired+0x19f/0x430
        [756289.611220]  ? lock_acquire+0x1a0/0x3e0
        [756289.611675]  do_writepages+0x43/0xf0
        [756289.612101]  ? __filemap_fdatawrite_range+0xa4/0x100
        [756289.612800]  __filemap_fdatawrite_range+0xc5/0x100
        [756289.613393]  btrfs_write_marked_extents+0x68/0x160 [btrfs]
        [756289.614085]  btrfs_sync_log+0x21c/0xf20 [btrfs]
        [756289.614661]  ? finish_wait+0x90/0x90
        [756289.615096]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [756289.615661]  ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
        [756289.616338]  ? lock_acquire+0x1a0/0x3e0
        [756289.616801]  ? lock_acquired+0x19f/0x430
        [756289.617284]  ? lock_acquire+0x1a0/0x3e0
        [756289.617750]  ? lock_release+0x214/0x470
        [756289.618221]  ? lock_acquired+0x19f/0x430
        [756289.618704]  ? dput+0x20/0x4a0
        [756289.619079]  ? dput+0x20/0x4a0
        [756289.619452]  ? lockref_put_or_lock+0x9/0x30
        [756289.619969]  ? lock_release+0x214/0x470
        [756289.620445]  ? lock_release+0x214/0x470
        [756289.620924]  ? lock_release+0x214/0x470
        [756289.621415]  btrfs_sync_file+0x46a/0x5b0 [btrfs]
        [756289.621982]  do_fsync+0x38/0x70
        [756289.622395]  __x64_sys_fsync+0x10/0x20
        [756289.622907]  do_syscall_64+0x33/0x80
        [756289.623438]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [756289.624063] RIP: 0033:0x7f08b27fbb7b
        [756289.624588] Code: 0f 05 48 3d 00 (...)
        [756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
        [756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
        [756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
        [756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
        [756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
        [756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
        [756289.631819] irq event stamp: 0
        [756289.632188] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
        [756289.633893] softirqs last  enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
        [756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
        [756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
        [756289.637082] BTRFS info (device sdc): forced readonly
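
The numbers in the tree checker message can be reproduced by hand from the two checksum items in the leaf dump (items 5 and 6). The following is a minimal sketch, assuming 4-byte checksums and a 4096-byte sector size; neither value is stated in the log, but both are consistent with the reported end range. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * End (exclusive) of the disk byte range covered by a checksum item whose
 * key offset is `start` and whose payload is `item_size` bytes. Each
 * checksum of `csum_size` bytes covers one sector of `sector_size` bytes.
 */
static uint64_t toy_csum_item_end(uint64_t start, uint32_t item_size,
				  uint32_t csum_size, uint32_t sector_size)
{
	assert(csum_size != 0);
	return start + (uint64_t)(item_size / csum_size) * sector_size;
}

/* Two adjacent checksum items overlap when the first ends past the second's start. */
static bool toy_csum_items_overlap(uint64_t prev_end, uint64_t next_start)
{
	return prev_end > next_start;
}
```

For item 5 (key offset 13631488, itemsize 3072) this gives 13631488 + (3072 / 4) * 4096 = 16777216, which is past 15728640, the key offset of item 6. That matches the "csum end range (16777216) goes beyond the start range (15728640)" complaint.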
      
      Having checksum items covering ranges that overlap is dangerous as in some
      cases it can lead to having extent ranges for which we miss checksums
      after log replay or getting the wrong checksum item. There were some fixes
      in the past for bugs that resulted in this problem, and were explained and
      fixed by the following commits:
      
        27b9a812 ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
        b84b8390 ("Btrfs: fix file read corruption after extent cloning and fsync")
        40e046ac ("Btrfs: fix missing data checksums after replaying a log tree")
        e289f03e ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
      
      Fix the issue by making btrfs_csum_file_blocks() take into account the
      start offset of the next checksum item when it decides to extend an
      existing checksum item, so that it never extends the checksum item to end
      at a range that goes beyond the start range of the next checksum item.

      When we cannot access the next checksum item without releasing the path,
      simply drop the optimization of extending the previous checksum item and
      fall back to inserting a new checksum item - this happens rarely and the
      optimization is not significant enough for a log tree in order to justify
      the extra complexity, as it would only save a few bytes (the size of a
      struct btrfs_item) of leaf space.
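
The clamping rule described above can be sketched as a small helper. This is an illustration under assumptions, not the code in btrfs_csum_file_blocks(): the function name and its parameters are hypothetical, and a 4096-byte sector size is assumed in the usage note.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many sectors' worth of checksums may be
 * appended to an item whose covered range currently ends at `item_end`,
 * when the next checksum item starts at `next_start`.
 * Pass UINT64_MAX as `next_start` when there is no next item.
 */
static uint64_t toy_csum_extend_limit(uint64_t item_end, uint64_t want_sectors,
				      uint64_t next_start, uint32_t sector_size)
{
	uint64_t room;

	assert(sector_size != 0);
	if (next_start <= item_end)
		return 0;	/* no gap before the next item: nothing can be appended */

	room = (next_start - item_end) / sector_size;
	return want_sectors < room ? want_sectors : room;
}
```

With the values from the reproducer, the item for the first extent ends at 14680064, the slow fsync wants to append 2M (512 sectors), and the item logged by the fast fsync starts at 15728640. The helper returns 256, so the extended item would stop exactly where the next one begins instead of overlapping it.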
      
      This behaviour is only needed when inserting into a log tree because
      for the regular checksums tree we never have a case where we try to
      insert a range of checksums that overlap with a range that was previously
      inserted.
      
      A test case for fstests will follow soon.
      
      Reported-by: Philipp Fent <fent@in.tum.de>
      Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
      CC: stable@vger.kernel.org # 5.4+
      Tested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: abort in rename_exchange if we fail to insert the second ref · dc09ef35
      Josef Bacik authored
      
      
      Error injection stress uncovered a problem where we'd leave a dangling
      inode ref if we failed during a rename_exchange.  This happens because
      we insert the inode ref for one side of the rename, and then for the
      other side.  If this second inode ref insert fails we'll leave the first
      one dangling and leave a corrupt file system behind.  Fix this by
      aborting if we did the insert for the first inode ref.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check error value from btrfs_update_inode in tree log · f96d4474
      Josef Bacik authored
      
      
      Error injection testing uncovered a case where we ended up with invalid
      link counts on an inode.  This happened because we failed to notice an
      error when updating the inode while replaying the tree log, and
      committed the transaction with an invalid file system.
      
      Fix this by checking the return value of btrfs_update_inode.  This
      resolved the link count errors I was seeing, and we already properly
      handle passing up the error values in these paths.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fixup error handling in fixup_inode_link_counts · 011b28ac
      Josef Bacik authored
      
      
      This function has the following pattern
      
      	while (1) {
      		ret = whatever();
      		if (ret)
      			goto out;
      	}
      	ret = 0;
      out:
      	return ret;
      
      However, in several places in this while loop we simply break when
      there's a problem, thus clearing the return value, and in one case we do
      a return -EIO and leak the memory for the path.
      
      Fix this by re-arranging the loop to deal with ret == 1 coming from
      btrfs_search_slot, and then simply delete the
      
      	ret = 0;
      out:
      
      bit so everybody can break if there is an error, which will allow for
      proper error handling to occur.
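
The rearranged loop shape can be sketched in userspace as follows. This is a toy, not the kernel function: the search callback and both stub searches are hypothetical, "not found" is modelled as a return value of 1, and the slot stepping of the real function is left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a tree search: <0 error, 0 found, 1 not found. */
typedef int (*toy_search_fn)(int key);

/* Stub: keys above 0 are found, key 0 is reported as not found. */
static int toy_search_ok(int key)
{
	return key > 0 ? 0 : 1;
}

/* Stub: same as above, but key 2 fails with -EIO. */
static int toy_search_eio(int key)
{
	if (key == 2)
		return -EIO;
	return key > 0 ? 0 : 1;
}

/*
 * Walk keys downwards from `first_key`. Every exit from the loop is a
 * plain break, and `ret` already holds the value to return: a negative
 * error is kept as is, and "not found" is turned into 0 inside the loop.
 */
static int toy_walk(toy_search_fn search, int first_key, int *visited)
{
	int ret = 0;
	int key = first_key;

	assert(search && visited);
	while (1) {
		ret = search(key);
		if (ret < 0)
			break;		/* real error: keep it in ret */
		if (ret > 0) {
			ret = 0;	/* "not found" just ends the walk */
			break;
		}
		(*visited)++;
		key--;
	}
	return ret;
}
```

Because there is no trailing `ret = 0;` after the loop, `toy_walk(toy_search_eio, 3, &n)` returns -EIO after visiting one key, while `toy_walk(toy_search_ok, 3, &n)` visits three keys and returns 0.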
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: mark ordered extent and inode with error if we fail to finish · d61bec08
      Josef Bacik authored
      
      
      While doing error injection testing I saw that sometimes we'd get an
      abort that wouldn't stop the current transaction commit from completing.
      This abort was coming from finish ordered IO, but at this point in the
      transaction commit we should have gotten an error and stopped.
      
      It turns out the abort came from finish ordered io while trying to write
      out the free space cache.  It occurred to me that any failure inside of
      finish_ordered_io isn't actually raised to the person doing the writing,
      so we could have any number of failures in this path and think the
      ordered extent completed successfully and the inode was fine.
      
      Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
      marking the mapping of the inode with mapping_set_error, so any callers
      that simply call fdatawait will also get the error.
      
      With this we're seeing the IO error on the free space inode when we fail
      to do the finish_ordered_io.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: return errors from btrfs_del_csums in cleanup_ref_head · 856bd270
      Josef Bacik authored
      
      
      We are unconditionally returning 0 in cleanup_ref_head, despite the fact
      that btrfs_del_csums could fail.  We need to return the error so the
      transaction gets aborted properly.  Fix this by returning ret from
      btrfs_del_csums in cleanup_ref_head.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix error handling in btrfs_del_csums · b86652be
      Josef Bacik authored
      
      
      Error injection stress would sometimes fail with checksums on disk that
      did not have a corresponding extent.  This occurred because the pattern
      in btrfs_del_csums was
      
      	while (1) {
      		ret = btrfs_search_slot();
      		if (ret < 0)
      			break;
      	}
      	ret = 0;
      out:
      	btrfs_free_path(path);
      	return ret;
      
      If we got an error from btrfs_search_slot we'd clear the error because
      we were breaking instead of goto out.  Instead of using goto out, simply
      handle the cases where we may leave a random value in ret, and get rid
      of the
      
      	ret = 0;
      out:
      
      pattern and simply allow break to have the proper error reporting.  With
      this fix we properly abort the transaction and do not commit thinking we
      successfully deleted the csum.
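
To make the difference visible, here is a side-by-side toy of the two loop shapes. It is a sketch only: `toy_search_slot` is a hypothetical stub that fails on its third call, and the two functions mimic the control flow described above rather than the real btrfs_del_csums.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical search stub: fails with -EIO on the third call. */
static int toy_search_slot(int call)
{
	return call == 3 ? -EIO : 0;
}

/* Old shape: break on error, then fall through to "ret = 0". */
static int toy_del_csums_buggy(int max_calls)
{
	int ret;
	int call = 0;

	assert(max_calls >= 0);
	while (call < max_calls) {
		ret = toy_search_slot(++call);
		if (ret < 0)
			break;
	}
	ret = 0;	/* clobbers the error picked up inside the loop */
	return ret;
}

/* New shape: no trailing "ret = 0", so break reports the error. */
static int toy_del_csums_fixed(int max_calls)
{
	int ret = 0;
	int call = 0;

	assert(max_calls >= 0);
	while (call < max_calls) {
		ret = toy_search_slot(++call);
		if (ret < 0)
			break;
	}
	return ret;
}
```

With five iterations allowed, the old shape returns 0 even though the third search failed, while the new shape returns -EIO, which is what lets the caller abort the transaction.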
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix compressed writes that cross stripe boundary · 4c80a97d
      Qu Wenruo authored
      [BUG]
      When running btrfs/027 with "-o compress" mount option, it always
      crashes with the following call trace:
      
        BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/volumes.c:6651!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G           OE     5.13.0-rc2-custom+ #26
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
        RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs]
        Call Trace:
         btrfs_submit_compressed_write+0x2d7/0x470 [btrfs]
         submit_compressed_extents+0x3b0/0x470 [btrfs]
         ? mark_held_locks+0x49/0x70
         btrfs_work_helper+0x131/0x3e0 [btrfs]
         process_one_work+0x28f/0x5d0
         worker_thread+0x55/0x3c0
         ? process_one_work+0x5d0/0x5d0
         kthread+0x141/0x160
         ? __kthread_bind_mask+0x60/0x60
         ret_from_fork+0x22/0x30
        ---[ end trace 63113a3a91f34e68 ]---
      
      [CAUSE]
      The critical message before the crash means we have a bio at logical
      bytenr 298901504 length 12288, but only 8192 bytes can fit into one
      stripe, the remaining 4096 bytes go to another stripe.
      
      In btrfs, all bios are properly split to avoid crossing stripe boundaries,
      but commit 764c7c9a ("btrfs: zoned: fix parallel compressed writes")
      changed the behavior for compressed writes.

      Previously, if we found that our new page couldn't fit into the current
      stripe, i.e. the "submit == 1" case, we submitted the current bio without
      adding the current page.
      
      	submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

      	page->mapping = NULL;
      	if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
      	    PAGE_SIZE) {
      
      But after the modification, we add the page regardless of whether it
      crosses the stripe boundary, leading to the above crash.
      
      	submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

      	if (pg_index == 0 && use_append)
      		len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
      	else
      		len = bio_add_page(bio, page, PAGE_SIZE, 0);

      	page->mapping = NULL;
      	if (submit || len < PAGE_SIZE) {
      
      [FIX]
      It's no longer possible to revert to the original code style as we have
      two different bio_add_*_page() calls now.
      
      The new fix is to skip the bio_add_*_page() call if @submit is true.
      
      Also, to avoid leaving @len uninitialized, always initialize it to zero.
      
      If @submit is true, @len will not be checked.
      If @submit is not true, @len will be the return value of
      bio_add_*_page() call.
      Either way, the behavior is still the same as the old code.
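
The arithmetic behind the crash message and the shape of the fix can be sketched in userspace. This is a toy model under assumptions: a 64 KiB stripe length and a 4096-byte page size are assumed (they are consistent with the "bio len 12288 len 8192" message but are not stated in it), and `toy_bio` and the `toy_*` helpers are hypothetical stand-ins, not the kernel's bio API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_STRIPE_LEN	(64 * 1024)	/* assumed stripe length */
#define TOY_PAGE_SIZE	4096		/* assumed page size */

/* Bytes from `logical` up to the end of the stripe that contains it. */
static uint64_t toy_bytes_left_in_stripe(uint64_t logical)
{
	return TOY_STRIPE_LEN - (logical % TOY_STRIPE_LEN);
}

/* Minimal model of a bio: where it starts and how much is queued. */
struct toy_bio {
	uint64_t logical;
	uint32_t len;
};

/* True when queueing one more page would cross the stripe boundary. */
static bool toy_must_submit(const struct toy_bio *bio)
{
	return bio->len + TOY_PAGE_SIZE > toy_bytes_left_in_stripe(bio->logical);
}

/*
 * Try to queue one page. Returns true when the caller has to submit the
 * bio and start a new one for this page. The page is only added when it
 * fits, and `len` starts at zero so it is never read uninitialized.
 */
static bool toy_add_page(struct toy_bio *bio)
{
	bool submit;
	uint32_t len = 0;

	assert(bio);
	submit = toy_must_submit(bio);
	if (!submit) {
		bio->len += TOY_PAGE_SIZE;	/* stand-in for bio_add_*_page() */
		len = TOY_PAGE_SIZE;
	}
	return submit || len < TOY_PAGE_SIZE;
}
```

For the logical address in the crash message, 298901504 modulo 65536 is 57344, which leaves 8192 bytes in the stripe. In this model the first two pages are queued, and the third call returns true without growing the bio, so it stays at 8192 bytes instead of reaching the 12288 bytes that triggered the BUG.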
      
      Reported-by: Josef Bacik <josef@toxicpanda.com>
      Fixes: 764c7c9a ("btrfs: zoned: fix parallel compressed writes")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. May 20, 2021
  4. May 17, 2021
    • btrfs: do not BUG_ON in link_to_fixup_dir · 91df99a6
      Josef Bacik authored
      
      
      While doing error injection testing I got the following panic
      
        kernel BUG at fs/btrfs/tree-log.c:1862!
        invalid opcode: 0000 [#1] SMP NOPTI
        CPU: 1 PID: 7836 Comm: mount Not tainted 5.13.0-rc1+ #305
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:link_to_fixup_dir+0xd5/0xe0
        RSP: 0018:ffffb5800180fa30 EFLAGS: 00010216
        RAX: fffffffffffffffb RBX: 00000000fffffffb RCX: ffff8f595287faf0
        RDX: ffffb5800180fa37 RSI: ffff8f5954978800 RDI: 0000000000000000
        RBP: ffff8f5953af9450 R08: 0000000000000019 R09: 0000000000000001
        R10: 000151f408682970 R11: 0000000120021001 R12: ffff8f5954978800
        R13: ffff8f595287faf0 R14: ffff8f5953c77dd0 R15: 0000000000000065
        FS:  00007fc5284c8c40(0000) GS:ffff8f59bbd00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc5287f47c0 CR3: 000000011275e002 CR4: 0000000000370ee0
        Call Trace:
         replay_one_buffer+0x409/0x470
         ? btree_read_extent_buffer_pages+0xd0/0x110
         walk_up_log_tree+0x157/0x1e0
         walk_log_tree+0xa6/0x1d0
         btrfs_recover_log_trees+0x1da/0x360
         ? replay_one_extent+0x7b0/0x7b0
         open_ctree+0x1486/0x1720
         btrfs_mount_root.cold+0x12/0xea
         ? __kmalloc_track_caller+0x12f/0x240
         legacy_get_tree+0x24/0x40
         vfs_get_tree+0x22/0xb0
         vfs_kern_mount.part.0+0x71/0xb0
         btrfs_mount+0x10d/0x380
         ? vfs_parse_fs_string+0x4d/0x90
         legacy_get_tree+0x24/0x40
         vfs_get_tree+0x22/0xb0
         path_mount+0x433/0xa10
         __x64_sys_mount+0xe3/0x120
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      We can get -EIO or any number of legitimate errors from
      btrfs_search_slot(), so panicking here is not the appropriate response.
      The error path for this code handles errors properly, so simply return
      the error.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: release path before starting transaction when cloning inline extent · 6416954c
      Filipe Manana authored
      
      
      When cloning an inline extent there are a few cases, such as when we have
      an implicit hole at file offset 0, where we start a transaction while
      holding a read lock on a leaf. Starting the transaction results in a call
      to sb_start_intwrite(), which results in doing a read lock on a percpu
      semaphore. Lockdep doesn't like this and complains about it:
      
        [46.580704] ======================================================
        [46.580752] WARNING: possible circular locking dependency detected
        [46.580799] 5.13.0-rc1 #28 Not tainted
        [46.580832] ------------------------------------------------------
        [46.580877] cloner/3835 is trying to acquire lock:
        [46.580918] c00000001301d638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
        [46.581167]
        [46.581167] but task is already holding lock:
        [46.581217] c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
        [46.581293]
        [46.581293] which lock already depends on the new lock.
        [46.581293]
        [46.581351]
        [46.581351] the existing dependency chain (in reverse order) is:
        [46.581410]
        [46.581410] -> #1 (btrfs-tree-00){++++}-{3:3}:
        [46.581464]        down_read_nested+0x68/0x200
        [46.581536]        __btrfs_tree_read_lock+0x70/0x1d0
        [46.581577]        btrfs_read_lock_root_node+0x88/0x200
        [46.581623]        btrfs_search_slot+0x298/0xb70
        [46.581665]        btrfs_set_inode_index+0xfc/0x260
        [46.581708]        btrfs_new_inode+0x26c/0x950
        [46.581749]        btrfs_create+0xf4/0x2b0
        [46.581782]        lookup_open.isra.57+0x55c/0x6a0
        [46.581855]        path_openat+0x418/0xd20
        [46.581888]        do_filp_open+0x9c/0x130
        [46.581920]        do_sys_openat2+0x2ec/0x430
        [46.581961]        do_sys_open+0x90/0xc0
        [46.581993]        system_call_exception+0x3d4/0x410
        [46.582037]        system_call_common+0xec/0x278
        [46.582078]
        [46.582078] -> #0 (sb_internal#2){.+.+}-{0:0}:
        [46.582135]        __lock_acquire+0x1e90/0x2c50
        [46.582176]        lock_acquire+0x2b4/0x5b0
        [46.582263]        start_transaction+0x3cc/0x950
        [46.582308]        clone_copy_inline_extent+0xe4/0x5a0
        [46.582353]        btrfs_clone+0x5fc/0x880
        [46.582388]        btrfs_clone_files+0xd8/0x1c0
        [46.582434]        btrfs_remap_file_range+0x3d8/0x590
        [46.582481]        do_clone_file_range+0x10c/0x270
        [46.582558]        vfs_clone_file_range+0x1b0/0x310
        [46.582605]        ioctl_file_clone+0x90/0x130
        [46.582651]        do_vfs_ioctl+0x874/0x1ac0
        [46.582697]        sys_ioctl+0x6c/0x120
        [46.582733]        system_call_exception+0x3d4/0x410
        [46.582777]        system_call_common+0xec/0x278
        [46.582822]
        [46.582822] other info that might help us debug this:
        [46.582822]
        [46.582888]  Possible unsafe locking scenario:
        [46.582888]
        [46.582942]        CPU0                    CPU1
        [46.582984]        ----                    ----
        [46.583028]   lock(btrfs-tree-00);
        [46.583062]                                lock(sb_internal#2);
        [46.583119]                                lock(btrfs-tree-00);
        [46.583174]   lock(sb_internal#2);
        [46.583212]
        [46.583212]  *** DEADLOCK ***
        [46.583212]
        [46.583266] 6 locks held by cloner/3835:
        [46.583299]  #0: c00000001301d448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
        [46.583382]  #1: c00000000f6d3768 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
        [46.583477]  #2: c00000000f6d72a8 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
        [46.583574]  #3: c00000000f6d7138 (&ei->i_mmap_lock){+.+.}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
        [46.583657]  #4: c00000000f6d35f8 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
        [46.583743]  #5: c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
        [46.583828]
        [46.583828] stack backtrace:
        [46.583872] CPU: 1 PID: 3835 Comm: cloner Not tainted 5.13.0-rc1 #28
        [46.583931] Call Trace:
        [46.583955] [c0000000167c7200] [c000000000c1ee78] dump_stack+0xec/0x144 (unreliable)
        [46.584052] [c0000000167c7240] [c000000000274058] print_circular_bug.isra.32+0x3a8/0x400
        [46.584123] [c0000000167c72e0] [c0000000002741f4] check_noncircular+0x144/0x190
        [46.584191] [c0000000167c73b0] [c000000000278fc0] __lock_acquire+0x1e90/0x2c50
        [46.584259] [c0000000167c74f0] [c00000000027aa94] lock_acquire+0x2b4/0x5b0
        [46.584317] [c0000000167c75e0] [c000000000a0d6cc] start_transaction+0x3cc/0x950
        [46.584388] [c0000000167c7690] [c000000000af47a4] clone_copy_inline_extent+0xe4/0x5a0
        [46.584457] [c0000000167c77c0] [c000000000af525c] btrfs_clone+0x5fc/0x880
        [46.584514] [c0000000167c7990] [c000000000af5698] btrfs_clone_files+0xd8/0x1c0
        [46.584583] [c0000000167c7a00] [c000000000af5b58] btrfs_remap_file_range+0x3d8/0x590
        [46.584652] [c0000000167c7ae0] [c0000000005d81dc] do_clone_file_range+0x10c/0x270
        [46.584722] [c0000000167c7b40] [c0000000005d84f0] vfs_clone_file_range+0x1b0/0x310
        [46.584793] [c0000000167c7bb0] [c00000000058bf80] ioctl_file_clone+0x90/0x130
        [46.584861] [c0000000167c7c10] [c00000000058c894] do_vfs_ioctl+0x874/0x1ac0
        [46.584922] [c0000000167c7d10] [c00000000058db4c] sys_ioctl+0x6c/0x120
        [46.584978] [c0000000167c7d60] [c0000000000364a4] system_call_exception+0x3d4/0x410
        [46.585046] [c0000000167c7e10] [c00000000000d45c] system_call_common+0xec/0x278
        [46.585114] --- interrupt: c00 at 0x7ffff7e22990
        [46.585160] NIP:  00007ffff7e22990 LR: 00000001000010ec CTR: 0000000000000000
        [46.585224] REGS: c0000000167c7e80 TRAP: 0c00   Not tainted  (5.13.0-rc1)
        [46.585280] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000244  XER: 00000000
        [46.585374] IRQMASK: 0
        [46.585374] GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f17100 0000000000000004
        [46.585374] GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
        [46.585374] GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
        [46.585374] GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
        [46.585374] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        [46.585374] GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
        [46.585374] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
        [46.585374] GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
        [46.585919] NIP [00007ffff7e22990] 0x7ffff7e22990
        [46.585964] LR [00000001000010ec] 0x1000010ec
        [46.586010] --- interrupt: c00
      
      This should be a false positive, as both locks are acquired in read mode.
      Nevertheless, we don't need to hold a leaf locked when we start the
      transaction, so just release the leaf (path) before starting it.
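The report above can be read as a cycle in a lock-order graph. What follows is a minimal Python sketch of that idea, not lockdep's actual algorithm. It records "held, then acquired" edges and checks whether a new acquisition would close a cycle. The function names are hypothetical and the lock names are shortened from the trace. The sketch ignores read versus write mode, so, like the report, it flags a case that the commit message considers a false positive.

```python
from collections import defaultdict


def record(graph, held, new_lock):
    """Record that new_lock was acquired while every lock in `held` was held."""
    for h in held:
        graph[h].add(new_lock)


def reachable(graph, src, dst):
    """Return True if dst can be reached from src by following recorded edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def would_cycle(graph, held, new_lock):
    """A new acquisition closes a cycle if a held lock is reachable from new_lock."""
    return any(reachable(graph, new_lock, h) for h in held)


graph = defaultdict(set)
# CPU1 column of the report: sb_internal first, then btrfs-tree-00.
record(graph, ["sb_internal"], "btrfs-tree-00")

# The clone task holds the leaf lock and now wants sb_internal.
print(would_cycle(graph, ["btrfs-tree-00"], "sb_internal"))  # True

# With the path released first, no tree lock is held at that point.
print(would_cycle(graph, [], "sb_internal"))  # False
```

Releasing the path before starting the transaction corresponds to the second call: with no leaf lock held, there is no edge for the new acquisition to close.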
      
      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/linux-btrfs/20210513214404.xks77p566fglzgum@riteshh-domain/
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6416954c
  5. May 14, 2021
    • Filipe Manana's avatar
      btrfs: fix removed dentries still existing after log is synced · 54a40fc3
      Filipe Manana authored
      When we move one inode from one directory to another and both the inode
      and its previous parent directory were logged before, we are not supposed
      to have the dentry for the old parent if we have a power failure after the
      log is synced. Only the new dentry is supposed to exist.
      
      Generally this works correctly, however there is a scenario where this is
      not currently working, because the old parent of the file/directory that
      was moved is not authoritative for a range that includes the dir index and
      dir item keys of the old dentry. This case is better explained with the
      following example and reproducer:
      
        # The test requires a very specific layout of keys and items in the
        # fs/subvolume btree to trigger the bug. So we want to make sure that
        # on whatever platform we are, we have the same leaf/node size.
        #
        # Currently in btrfs the node/leaf size can not be smaller than the page
        # size (but it can be greater than the page size). So use the largest
        # supported node/leaf size (64K).
      
        $ mkfs.btrfs -f -n 65536 /dev/sdc
        $ mount /dev/sdc /mnt
      
        # "testdir" is inode 257.
        $ mkdir /mnt/testdir
        $ chmod 755 /mnt/testdir
      
        # Create several empty files to have the directory "testdir" with its
        # items spread over several leaves (7 in this case).
        $ for ((i = 1; i <= 1200; i++)); do
             echo -n > /mnt/testdir/file$i
          done
      
        # Create our test directory "dira", inode number 1458, which gets all
        # its items in leaf 7.
        #
        # The BTRFS_DIR_ITEM_KEY item for inode 257 ("testdir") that points to
        # the entry named "dira" is in leaf 2, while the BTRFS_DIR_INDEX_KEY
        # item that points to that entry is in leaf 3.
        #
        # For this particular filesystem node size (64K), file count and file
        # names, we end up with the directory entry items from inode 257 in
        # leaves 2 and 3, as previously mentioned - what matters for triggering
        # the bug exercised by this test case is that those items are not placed
        # in leaf 1, they must be placed in a leaf different from the one
        # containing the inode item for inode 257.
        #
        # The corresponding BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items for
        # the parent inode (257) are the following:
        #
        #    item 460 key (257 DIR_ITEM 3724298081) itemoff 48344 itemsize 34
        #         location key (1458 INODE_ITEM 0) type DIR
        #         transid 6 data_len 0 name_len 4
        #         name: dira
        #
        # and:
        #
        #    item 771 key (257 DIR_INDEX 1202) itemoff 36673 itemsize 34
        #         location key (1458 INODE_ITEM 0) type DIR
        #         transid 6 data_len 0 name_len 4
        #         name: dira
      
        $ mkdir /mnt/testdir/dira
      
        # Make sure everything done so far is durably persisted.
        $ sync
      
        # Now do a change to inode 257 ("testdir") that does not result in
        # COWing leaves 2 and 3 - the leaves that contain the directory items
        # pointing to inode 1458 (directory "dira").
        #
        # Changing permissions, the owner/group, updating or adding a xattr,
        # etc, will not change (COW) leaves 2 and 3. So for the sake of
        # simplicity change the permissions of inode 257, which results in
        # updating its inode item and therefore change (COW) only leaf 1.
      
        $ chmod 700 /mnt/testdir
      
        # Now fsync directory inode 257.
        #
        # Since only the first leaf was changed/COWed, we log the inode item of
        # inode 257 and only the dentries found in the first leaf, all have a
        # key type of BTRFS_DIR_ITEM_KEY, and no keys of type
        # BTRFS_DIR_INDEX_KEY, because they sort after the former type and none
        # exist in the first leaf.
        #
        # We also log 3 items that represent ranges for dir items and dir
        # indexes for which the log is authoritative:
        #
        # 1) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
        #    authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
        #    in the range [0, 2285968570] (the offset here is the crc32c of the
        #    dentry's name). The value 2285968570 corresponds to the offset of
        #    the first key of leaf 2 (which is of type BTRFS_DIR_ITEM_KEY);
        #
        # 2) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
        #    authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
        #    in the range [4293818216, (u64)-1] (the offset here is the crc32c
        #    of the dentry's name). The value 4293818216 corresponds to the
        #    offset of the highest key of type BTRFS_DIR_ITEM_KEY plus 1
        #    (4293818215 + 1), which is located in leaf 2;
        #
        # 3) a key of type BTRFS_DIR_LOG_INDEX_KEY, with an offset of 1203,
        #    which indicates the log is authoritative for all keys of type
        #    BTRFS_DIR_INDEX_KEY that have an offset in the range
        #    [1203, (u64)-1]. The value 1203 corresponds to the offset of the
        #    last key of type BTRFS_DIR_INDEX_KEY plus 1 (1202 + 1), which is
        #    located in leaf 3;
        #
        # Also, because "testdir" is a directory and inode 1458 ("dira") is a
        # child directory, we log inode 1458 too.
      
        $ xfs_io -c "fsync" /mnt/testdir
      
        # Now move "dira", inode 1458, to be a child of the root directory
        # (inode 256).
        #
        # Because this inode was previously logged, when "testdir" was fsynced,
        # the log is updated so that the old inode reference, referring to inode
        # 257 as the parent, is deleted and the new inode reference, referring
        # to inode 256 as the parent, is added to the log.
      
        $ mv /mnt/testdir/dira /mnt
      
        # Now change some file and fsync it. This guarantees the log changes
        # made by the previous move/rename operation are persisted. We do not
        # need to do any special modification to the file, just any change to
        # any file and sync the log.
      
        $ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/testdir/file1
      
        # Simulate a power failure and then mount again the filesystem to
        # replay the log tree. We want to verify that we are able to mount the
        # filesystem, meaning log replay was successful, and that directory
        # inode 1458 ("dira") only has inode 256 (the filesystem's root) as
        # its parent (and no longer a child of inode 257).
        #
        # It used to happen that during log replay we would end up having
        # inode 1458 (directory "dira") with 2 hard links, being a child of
        # inode 257 ("testdir") and inode 256 (the filesystem's root). This
        # resulted in the tree checker detecting the issue and causing the
        # mount operation to fail (with -EIO).
        #
        # This happened because in the log we have the new name/parent for
        # inode 1458, which results in adding the new dentry with inode 256
        # as the parent, but the previous dentry, under inode 257 was never
        # removed - this is because the ranges for dir items and dir indexes
        # of inode 257 for which the log is authoritative do not include the
        # old dir item and dir index for the dentry of inode 257 referring to
        # inode 1458:
        #
        # - for dir items, the log is authoritative for the ranges
        #   [0, 2285968570] and [4293818216, (u64)-1]. The dir item at inode 257
        #   pointing to inode 1458 has a key of (257 DIR_ITEM 3724298081), as
        #   previously mentioned, so the dir item is not deleted when the log
        #   replay procedure processes the authoritative ranges, as 3724298081
        #   is outside both ranges;
        #
        # - for dir indexes, the log is authoritative for the range
        #   [1203, (u64)-1], and the dir index item of inode 257 pointing to
        #   inode 1458 has a key of (257 DIR_INDEX 1202), as previously
        #   mentioned, so the dir index item is not deleted when the log
        #   replay procedure processes the authoritative range.
      
        <power failure>
      
        $ mount /dev/sdc /mnt
        mount: /mnt: can't read superblock on /dev/sdc.
      
        $ dmesg
        (...)
        [87849.840509] BTRFS info (device sdc): start tree-log replay
        [87849.875719] BTRFS critical (device sdc): corrupt leaf: root=5 block=30539776 slot=554 ino=1458, invalid nlink: has 2 expect no more than 1 for dir
        [87849.878084] BTRFS info (device sdc): leaf 30539776 gen 7 total ptrs 557 free space 2092 owner 5
        [87849.879516] BTRFS info (device sdc): refs 1 lock_owner 0 current 2099108
        [87849.880613] 	item 0 key (1181 1 0) itemoff 65275 itemsize 160
        [87849.881544] 		inode generation 6 size 0 mode 100644
        [87849.882692] 	item 1 key (1181 12 257) itemoff 65258 itemsize 17
        (...)
        [87850.562549] 	item 556 key (1458 12 257) itemoff 16017 itemsize 14
        [87850.563349] BTRFS error (device dm-0): block=30539776 write time tree block corruption detected
        [87850.564386] ------------[ cut here ]------------
        [87850.564920] WARNING: CPU: 3 PID: 2099108 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [87850.566129] Modules linked in: btrfs dm_zero dm_snapshot (...)
        [87850.573789] CPU: 3 PID: 2099108 Comm: mount Not tainted 5.12.0-rc8-btrfs-next-86 #1
        (...)
        [87850.587481] Call Trace:
        [87850.587768]  btree_csum_one_bio+0x244/0x2b0 [btrfs]
        [87850.588354]  ? btrfs_bio_fits_in_stripe+0xd8/0x110 [btrfs]
        [87850.589003]  btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
        [87850.589654]  submit_one_bio+0x61/0x70 [btrfs]
        [87850.590248]  submit_extent_page+0x91/0x2f0 [btrfs]
        [87850.590842]  write_one_eb+0x175/0x440 [btrfs]
        [87850.591370]  ? find_extent_buffer_nolock+0x1c0/0x1c0 [btrfs]
        [87850.592036]  btree_write_cache_pages+0x1e6/0x610 [btrfs]
        [87850.592665]  ? free_debug_processing+0x1d5/0x240
        [87850.593209]  do_writepages+0x43/0xf0
        [87850.593798]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87850.594391]  __filemap_fdatawrite_range+0xc5/0x100
        [87850.595196]  btrfs_write_marked_extents+0x68/0x160 [btrfs]
        [87850.596202]  btrfs_write_and_wait_transaction.isra.0+0x4d/0xd0 [btrfs]
        [87850.597377]  btrfs_commit_transaction+0x794/0xca0 [btrfs]
        [87850.598455]  ? _raw_spin_unlock_irqrestore+0x32/0x60
        [87850.599305]  ? kmem_cache_free+0x15a/0x3d0
        [87850.600029]  btrfs_recover_log_trees+0x346/0x380 [btrfs]
        [87850.601021]  ? replay_one_extent+0x7d0/0x7d0 [btrfs]
        [87850.601988]  open_ctree+0x13c9/0x1698 [btrfs]
        [87850.602846]  btrfs_mount_root.cold+0x13/0xed [btrfs]
        [87850.603771]  ? kmem_cache_alloc_trace+0x7c9/0x930
        [87850.604576]  ? vfs_parse_fs_string+0x5d/0xb0
        [87850.605293]  ? kfree+0x276/0x3f0
        [87850.605857]  legacy_get_tree+0x30/0x50
        [87850.606540]  vfs_get_tree+0x28/0xc0
        [87850.607163]  fc_mount+0xe/0x40
        [87850.607695]  vfs_kern_mount.part.0+0x71/0x90
        [87850.608440]  btrfs_mount+0x13b/0x3e0 [btrfs]
        (...)
        [87850.629477] ---[ end trace 68802022b99a1ea0 ]---
        [87850.630849] BTRFS: error (device sdc) in btrfs_commit_transaction:2381: errno=-5 IO failure (Error while writing out transaction)
        [87850.632422] BTRFS warning (device sdc): Skipping commit of aborted transaction.
        [87850.633416] BTRFS: error (device sdc) in cleanup_transaction:1978: errno=-5 IO failure
        [87850.634553] BTRFS: error (device sdc) in btrfs_replay_log:2431: errno=-5 IO failure (Failed to recover log tree)
        [87850.637529] BTRFS error (device sdc): open_ctree failed
      
      In this example the inode we moved was a directory, so it was easy to
      detect the problem because directories can only have one hard link and
      the tree checker immediately detects that. If the moved inode was a file,
      then the log replay would succeed and we would end up having both the
      new hard link (/mnt/foo) and the old hard link (/mnt/testdir/foo) present,
      but only the new one should be present.
      
      Fix this by forcing re-logging of the old parent directory when logging
      the new name during a rename operation. This ensures we end up with a log
      that is authoritative for a range covering the keys for the old dentry,
      therefore causing the old dentry to be deleted when replaying the log.
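The role of the authoritative ranges can be checked with the numbers from the reproducer. This is a simplified Python model, not the tree-log code: it assumes that a directory key present in the subvolume tree but absent from the log is deleted during replay only when its offset falls inside a range for which the log is authoritative. The helper name `survives_replay` is hypothetical, and the "after the fix" range is an illustrative assumption, since the commit message does not list the ranges logged after the change.

```python
U64_MAX = 2**64 - 1


def survives_replay(offset, authoritative_ranges):
    """Return True if a key missing from the log is left in place by replay.

    Ranges are inclusive (lo, hi) pairs, matching the [lo, hi] notation above.
    """
    return not any(lo <= offset <= hi for lo, hi in authoritative_ranges)


# Ranges logged by the fsync of "testdir" in the reproducer.
dir_item_ranges = [(0, 2285968570), (4293818216, U64_MAX)]
dir_index_ranges = [(1203, U64_MAX)]

# Old dentry keys for "dira" under inode 257.
print(survives_replay(3724298081, dir_item_ranges))  # True: stale DIR_ITEM stays
print(survives_replay(1202, dir_index_ranges))       # True: stale DIR_INDEX stays

# Assumption for illustration: re-logging the old parent yields ranges
# that cover the old dentry keys.
relogged = [(0, U64_MAX)]
print(survives_replay(3724298081, relogged))  # False: old dentry is removed
print(survives_replay(1202, relogged))        # False
```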
      
      A test case for fstests will follow up soon.
      
      Fixes: 64d6b281 ("btrfs: remove unnecessary check_parent_dirs_for_sync()")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      54a40fc3
    • Boris Burkov's avatar
      btrfs: return whole extents in fiemap · 15c7745c
      Boris Burkov authored
      
      
        `xfs_io -c 'fiemap <off> <len>' <file>`
      
      can give surprising results on btrfs that differ from xfs.
      
      btrfs prints out extents trimmed to fit the user input. If the user's
      fiemap request has an offset, then rather than returning each whole
      extent which intersects that range, we also trim the start extent to not
      have start < off.
      
      Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
      that returning the whole extent is expected.
      
      Some cases which all yield the same fiemap in xfs, but not btrfs:
        dd if=/dev/zero of=$f bs=4k count=1
        sudo xfs_io -c 'fiemap 0 1024' $f
          0: [0..7]: 26624..26631
        sudo xfs_io -c 'fiemap 2048 1024' $f
          0: [4..7]: 26628..26631
        sudo xfs_io -c 'fiemap 2048 4096' $f
          0: [4..7]: 26628..26631
        sudo xfs_io -c 'fiemap 3584 512' $f
          0: [7..7]: 26631..26631
        sudo xfs_io -c 'fiemap 4091 5' $f
          0: [7..6]: 26631..26630
      
      I believe this is a consequence of the logic for merging contiguous
      extents represented by separate extent items. That logic needs to track
      the last offset as it loops through the extent items, which happens to
      pick up the start offset on the first iteration, and trim off the
      beginning of the full extent. To fix it, start `off` at 0 rather than
      `start` so that we keep the iteration/merging intact without cutting off
      the start of the extent.
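The trimming can be reproduced with a small model. This Python sketch is not the kernel's fiemap code: it assumes a single 4 KiB extent at logical offset 0 and physical block 26624, and it assumes that the ranges are printed in 512-byte blocks with the length rounded down, which is chosen to match the output shown above. The function name and the `trim_to_request` flag are hypothetical. The request length is not modelled because every request above intersects the extent.

```python
SECTOR = 512


def fiemap_line(req_start, ext_start, ext_len, ext_phys, trim_to_request):
    """Format one extent as '[first..last]: first..last' in 512-byte blocks.

    trim_to_request=True models the old behaviour (reported start clamped to
    the request offset); False models reporting the whole extent.
    """
    start = max(req_start, ext_start) if trim_to_request else ext_start
    length = ext_start + ext_len - start
    phys = ext_phys + (start - ext_start)
    lo, n = start // SECTOR, length // SECTOR
    plo = phys // SECTOR
    return f"0: [{lo}..{lo + n - 1}]: {plo}..{plo + n - 1}"


EXTENT = dict(ext_start=0, ext_len=4096, ext_phys=26624 * SECTOR)

for req in (0, 2048, 3584, 4091):
    old = fiemap_line(req, trim_to_request=True, **EXTENT)
    new = fiemap_line(req, trim_to_request=False, **EXTENT)
    print(f"off={req:<5} trimmed: {old:<28} whole: {new}")
```

In this model the offset 4091 leaves only 5 bytes of the extent, which rounds down to zero blocks and produces the backwards range `[7..6]`.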
      
      After the fix, all the above commands give:
      
        0: [0..7]: 26624..26631
      
      The merging logic is exercised by fstest generic/483, and I have written
      a new fstest for checking we don't have backwards or zero-length fiemaps
      for cases like those above.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      15c7745c
    • Josef Bacik's avatar
      btrfs: avoid RCU stalls while running delayed iputs · 71795ee5
      Josef Bacik authored
      
      
      Generally a delayed iput is added when we might do the final iput, so
      usually we'll end up sleeping while processing the delayed iputs
      naturally.  However there's no guarantee of this, especially for small
      files.  In production we noticed 5 instances of RCU stalls while testing
      a kernel release overnight across 1000 machines, so this is relatively
      common:
      
        host count: 5
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: ....: (20998 ticks this GP) idle=59e/1/0x4000000000000002 softirq=12333372/12333372 fqs=3208
         	(t=21031 jiffies g=27810193 q=41075) NMI backtrace for cpu 1
        CPU: 1 PID: 1713 Comm: btrfs-cleaner Kdump: loaded Not tainted 5.6.13-0_fbk12_rc1_5520_gec92bffc1ec9 #1
        Call Trace:
          <IRQ> dump_stack+0x50/0x70
          nmi_cpu_backtrace.cold.6+0x30/0x65
          ? lapic_can_unplug_cpu.cold.30+0x40/0x40
          nmi_trigger_cpumask_backtrace+0xba/0xca
          rcu_dump_cpu_stacks+0x99/0xc7
          rcu_sched_clock_irq.cold.90+0x1b2/0x3a3
          ? trigger_load_balance+0x5c/0x200
          ? tick_sched_do_timer+0x60/0x60
          ? tick_sched_do_timer+0x60/0x60
          update_process_times+0x24/0x50
          tick_sched_timer+0x37/0x70
          __hrtimer_run_queues+0xfe/0x270
          hrtimer_interrupt+0xf4/0x210
          smp_apic_timer_interrupt+0x5e/0x120
          apic_timer_interrupt+0xf/0x20 </IRQ>
         RIP: 0010:queued_spin_lock_slowpath+0x17d/0x1b0
         RSP: 0018:ffffc9000da5fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
         RAX: 0000000000000000 RBX: ffff889fa81d0cd8 RCX: 0000000000000029
         RDX: ffff889fff86c0c0 RSI: 0000000000080000 RDI: ffff88bfc2da7200
         RBP: ffff888f2dcdd768 R08: 0000000001040000 R09: 0000000000000000
         R10: 0000000000000001 R11: ffffffff82a55560 R12: ffff88bfc2da7200
         R13: 0000000000000000 R14: ffff88bff6c2a360 R15: ffffffff814bd870
         ? kzalloc.constprop.57+0x30/0x30
         list_lru_add+0x5a/0x100
         inode_lru_list_add+0x20/0x40
         iput+0x1c1/0x1f0
         run_delayed_iput_locked+0x46/0x90
         btrfs_run_delayed_iputs+0x3f/0x60
         cleaner_kthread+0xf2/0x120
         kthread+0x10b/0x130
      
      Fix this by adding a cond_resched_lock() to the loop processing delayed
      iputs so we can avoid this sort of stall.
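The shape of such a loop can be illustrated outside the kernel. The Python sketch below is only an analogy, not the btrfs code: `need_resched` is a hypothetical callback standing in for the scheduler's decision, and `threading.Lock` stands in for the spinlock. After each item the loop offers to drop the lock briefly so that other work can make progress, then retakes it and continues.

```python
import threading


def run_delayed_items(pending, lock, need_resched, process):
    """Drain `pending` under `lock`, offering to yield after every item.

    Returns how many times the lock was dropped and retaken.
    """
    yields = 0
    lock.acquire()
    try:
        while pending:
            item = pending.pop(0)
            process(item)
            if need_resched():
                # Analogue of a conditional reschedule: drop the lock,
                # let others run, then take it again before continuing.
                lock.release()
                yields += 1
                lock.acquire()
    finally:
        lock.release()
    return yields
```

Without the yield point, a long list of items that never sleep would keep the loop running with the lock held for its entire duration.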
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      71795ee5
    • Johannes Thumshirn's avatar
      btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error · d6f67afb
      Johannes Thumshirn authored
      Commit 7000babd ("btrfs: assign proper values to a bool variable in
      dev_extent_hole_check_zoned") assigned false to the hole_start parameter
      of dev_extent_hole_check_zoned().
      
      The hole_start parameter is not a boolean: it returns the start location
      of the found hole. Assign 0 to it in the error case instead of false.
      
      Fixes: 7000babd ("btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned")
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d6f67afb
  6. May 05, 2021
  7. May 04, 2021
    • Naohiro Aota's avatar
      btrfs: zoned: sanity check zone type · 784daf2b
      Naohiro Aota authored
      
      
      The fstests test case generic/475 creates a dm-linear device that gets
      changed to a dm-error device. This leads to errors in loading the block
      group's zone information when running on a zoned file system, ultimately
      resulting in a list corruption. When running on a kernel with list
      debugging enabled this leads to the following crash.
      
       BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure
       kernel BUG at lib/list_debug.c:54!
       invalid opcode: 0000 [#1] SMP PTI
       CPU: 1 PID: 2433 Comm: umount Tainted: G        W         5.12.0+ #1018
       RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
       RSP: 0018:ffffc90001473df0 EFLAGS: 00010296
       RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90
       RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003
       RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001
       R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000
       R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100
       FS:  00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0xc9/0x310 [btrfs]
        close_ctree+0x2ee/0x31a [btrfs]
        ? call_rcu+0x8f/0x270
        ? mutex_lock+0x1c/0x40
        generic_shutdown_super+0x67/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x90
        cleanup_mnt+0x13e/0x1b0
        task_work_run+0x63/0xb0
        exit_to_user_mode_loop+0xd9/0xe0
        exit_to_user_mode_prepare+0x3e/0x60
        syscall_exit_to_user_mode+0x1d/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      As dm-error has no support for zones, btrfs will run its zone emulation
      mode on this device. The zone emulation mode emulates conventional zones,
      so bail out if the zone bitmap that gets populated on mount sees the zone
      as sequential while we're thinking it's a conventional zone when creating
      a block group.
      
      Note: this scenario is unlikely in a real world application and can only
      happen by this (ab)use of device-mapper targets.
      
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      784daf2b
    • Anand Jain's avatar
      btrfs: fix unmountable seed device after fstrim · 5e753a81
      Anand Jain authored
      
      
      The following test case reproduces an issue where in-use blocks on the
      readonly seed device are wrongly freed when fstrim is called on the rw
      sprout device.
      
      Create a seed device and add a sprout device to it:
      
        $ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
        $ btrfstune -S 1 /dev/loop0
        $ mount /dev/loop0 /btrfs
        $ btrfs dev add -f /dev/loop1 /btrfs
        BTRFS info (device loop0): relocating block group 290455552 flags system
        BTRFS info (device loop0): relocating block group 1048576 flags system
        BTRFS info (device loop0): disk added /dev/loop1
        $ umount /btrfs
      
      Mount the sprout device and run fstrim:
      
        $ mount /dev/loop1 /btrfs
        $ fstrim /btrfs
        $ umount /btrfs
      
      Now try to mount the seed device, and it fails:
      
        $ mount /dev/loop0 /btrfs
        mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
      
      Block 5292032 is missing on the readonly seed device:
      
       $ dmesg -kt | tail
       <snip>
       BTRFS error (device loop0): bad tree block start, want 5292032 have 0
       BTRFS warning (device loop0): couldn't read-tree root
       BTRFS error (device loop0): open_ctree failed
      
      The dump-tree of the seed device (taken before the fstrim) shows that
      block 5292032 belonged to the block group starting at 5242880:
      
        $ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
        <snip>
        item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
        	block group used 114688 chunk_objectid 256 flags METADATA
        <snip>
      
      The dump-tree of the sprout device (taken before the fstrim) shows that
      fstrim used block group 5242880 to find the related free space to free:
      
        $ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
        <snip>
        item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
        	block group used 32768 chunk_objectid 256 flags METADATA
        <snip>
      
      Tracing the fstrim command with a BPF kprobe shows that the missing block
      5292032 lies within the range of the discarded blocks:
      
        kprobe:btrfs_discard_extent {
        	printf("freeing start %llu end %llu num_bytes %llu:\n",
        		arg1, arg1+arg2, arg2);
        }
      
        freeing start 5259264 end 5406720 num_bytes 147456
        <snip>
      
      Fix this by not sending discard commands to the readonly seed device.
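The numbers quoted above can be cross-checked with simple interval arithmetic. The Python sketch below uses only values from the dump-tree and trace output. The `devices_to_discard` helper and its dictionary layout are hypothetical and only illustrate the idea of skipping devices that are not writeable. They do not reflect the kernel's data structures.

```python
def contains(start, length, addr):
    """Return True if addr lies in the half-open range [start, start + length)."""
    return start <= addr < start + length


BLOCK_GROUP_START, BLOCK_GROUP_LEN = 5242880, 8388608
DISCARD_START, DISCARD_LEN = 5259264, 147456
MISSING_BLOCK = 5292032

# The missing tree block sits inside the metadata block group...
print(contains(BLOCK_GROUP_START, BLOCK_GROUP_LEN, MISSING_BLOCK))  # True
# ...and inside the range that the trace shows being discarded.
print(contains(DISCARD_START, DISCARD_LEN, MISSING_BLOCK))          # True
# End of the discarded range, as printed by the trace.
print(DISCARD_START + DISCARD_LEN)                                  # 5406720


def devices_to_discard(devices):
    """Hypothetical guard: only writeable devices receive discards."""
    return [d["name"] for d in devices if d["writeable"]]


devices = [
    {"name": "/dev/loop0", "writeable": False},  # readonly seed
    {"name": "/dev/loop1", "writeable": True},   # rw sprout
]
print(devices_to_discard(devices))  # ['/dev/loop1']
```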
      
      Reported-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5e753a81
  8. Apr 29, 2021
    • Filipe Manana's avatar
      btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      Filipe Manana authored
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by skipping the flush of delalloc for an inode that is
      flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
      such a special case of cloning an inline extent, when flushing delalloc
      during qgroup metadata reservation.
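
      The decision described above can be sketched as a single flag test. This
      is a simplified userspace model, not the kernel code: struct inode_model,
      MODEL_INODE_NO_DELALLOC_FLUSH and may_flush_delalloc_for_qgroup() are
      hypothetical names, and the bit value is arbitrary.

```c
#include <stdbool.h>

/* Hypothetical bit standing in for BTRFS_INODE_NO_DELALLOC_FLUSH. */
#define MODEL_INODE_NO_DELALLOC_FLUSH (1UL << 0)

/* Hypothetical inode model: only the runtime flags matter here. */
struct inode_model {
	unsigned long runtime_flags;
};

/*
 * May a qgroup metadata reservation that is low on space flush delalloc
 * for this inode? An inode in the middle of cloning an inline extent
 * already has its file range locked, so flushing it would deadlock.
 */
static bool may_flush_delalloc_for_qgroup(const struct inode_model *inode)
{
	return !(inode->runtime_flags & MODEL_INODE_NO_DELALLOC_FLUSH);
}
```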
      
      The special cases for cloning inline extents were added in kernel 5.7 by
      commit 05a5a762 ("Btrfs: implement full reflink support for inline
      extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit c53e9653
      ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use
      a "Fixes:" tag for the later commit to ease stable kernel backports.
      
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f9baa501
    • Filipe Manana's avatar
      btrfs: fix race leading to unpersisted data and metadata on fsync · 626e9f41
      Filipe Manana authored
      When doing a fast fsync on a file, there is a race which can result in the
      fsync returning success to user space without logging the inode and without
      durably persisting new data.
      
      The following example shows one possible scenario for this:
      
         $ mkfs.btrfs -f /dev/sdc
         $ mount /dev/sdc /mnt
      
         $ touch /mnt/bar
         $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz
      
         # Now we have:
         # file bar == inode 257
         # file baz == inode 258
      
         $ mv /mnt/baz /mnt/foo
      
         # Now we have:
         # file bar == inode 257
         # file foo == inode 258
      
         $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
         # fsync bar before foo, it is important to trigger the race.
         $ xfs_io -c "fsync" /mnt/bar
         $ xfs_io -c "fsync" /mnt/foo
      
         # After this:
         # inode 257, file bar, is empty
         # inode 258, file foo, has 1M filled with 0xcd
      
         <power failure>
      
         # Replay the log:
         $ mount /dev/sdc /mnt
      
         # After this point file foo should have 1M filled with 0xcd and not 0xab
      
      The following steps explain how the race happens:
      
      1) Before the first fsync of inode 258, when it has the "baz" name, its
         ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
         The inode also has the full sync flag set;
      
      2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
         the generation of the current transaction, and set ->last_log_commit
         to 0, which is the current value of ->last_sub_trans (done at
         btrfs_log_inode()).
      
         The full sync flag is cleared from the inode during the fsync.
      
         The log sub transaction that was committed had an ID of 0 and when we
         synced the log, at btrfs_sync_log(), we incremented root->log_transid
         from 0 to 1;
      
      3) During the rename:
      
         We update inode 258, through btrfs_update_inode(), and that causes its
         ->last_sub_trans to be set to 1 (the current log transaction ID), and
         ->last_log_commit remains with a value of 0.
      
         After updating inode 258, because we have previously logged the inode
         in the previous fsync, we log again the inode through the call to
         btrfs_log_new_name(). This results in updating the inode's
         ->last_log_commit from 0 to 1 (the current value of its
         ->last_sub_trans).
      
         The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
         the next log transaction;
      
      4) Then a buffered write against inode 258 is made. This leaves the value
         of ->last_sub_trans as 1 (the ID of the current log transaction, stored
         at root->log_transid);
      
      5) Then an fsync against inode 257 (or any other inode other than 258),
         happens. This results in committing the log transaction with ID 1,
         which results in updating root->last_log_commit to 1 and bumping
         root->log_transid from 1 to 2;
      
      6) Then an fsync against inode 258 starts. We flush delalloc and wait only
         for writeback to complete, since the full sync flag is not set in the
         inode's runtime flags - we do not wait for ordered extents to complete.
      
         Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
         ordered extent completes. The call returns true:
      
           static inline bool btrfs_inode_in_log(...)
           {
               bool ret = false;
      
               spin_lock(&inode->lock);
               if (inode->logged_trans == generation &&
                   inode->last_sub_trans <= inode->last_log_commit &&
                   inode->last_sub_trans <= inode->root->last_log_commit)
                       ret = true;
               spin_unlock(&inode->lock);
               return ret;
           }
      
         generation has a value of 6 (fs_info->generation), ->logged_trans also
         has a value of 6 (set when we logged the inode during the first fsync
         and when logging it during the rename), ->last_sub_trans has a value
         of 1, set during the rename (step 3), ->last_log_commit also has a
         value of 1 (set in step 3) and root->last_log_commit has a value of 1,
         which was set in step 5 when fsyncing inode 257.
      
         As a consequence we don't log the inode, any new extents and do not
         sync the log, resulting in a data loss if a power failure happens
         after the fsync and before the current transaction commits.
         Also, because we do not log the inode, after a power failure the mtime
         and ctime of the inode do not match those we had before.
      
         When the ordered extent completes before we call btrfs_inode_in_log(),
         then the call returns false and we log the inode and sync the log,
         since at the end of ordered extent completion we update the inode and
         set ->last_sub_trans to 2 (the value of root->log_transid) and
         ->last_log_commit to 1.
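
      The values from step 6 can be plugged into a userspace model of the check
      to see both outcomes. This is a sketch only: struct root_model, struct
      inode_model and inode_in_log_model() are hypothetical stand-ins that keep
      just the fields named above and leave out the spinlock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical models holding only the fields used by the check. */
struct root_model {
	int last_log_commit;
};

struct inode_model {
	uint64_t logged_trans;
	int last_sub_trans;
	int last_log_commit;
	const struct root_model *root;
};

/* Same three conditions as the btrfs_inode_in_log() snippet quoted above. */
static bool inode_in_log_model(const struct inode_model *inode,
			       uint64_t generation)
{
	return inode->logged_trans == generation &&
	       inode->last_sub_trans <= inode->last_log_commit &&
	       inode->last_sub_trans <= inode->root->last_log_commit;
}
```

      With generation 6, ->logged_trans 6, ->last_sub_trans 1,
      ->last_log_commit 1 and root->last_log_commit 1 the model returns true,
      which is the racy case. Once ordered extent completion has bumped
      ->last_sub_trans to 2, it returns false and the inode gets logged.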
      
      This problem was found after removing the check for the emptiness of the
      inode's list of modified extents in the recent commit 209ecbb8
      ("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
      added in the 5.13 merge window. However checking the emptiness of the
      list is not really the way to solve this problem, and was never intended
      to, because while that solves the problem for COW writes, the problem
      persists for NOCOW writes because in that case the list is always empty.
      
      In the case of NOCOW writes, even though we wait for the writeback to
      complete before returning from btrfs_sync_file(), we end up not logging
      the inode, which has a new mtime/ctime, and because we don't sync the log,
      we never issue disk barriers (send REQ_PREFLUSH to the device) since that
      only happens when we sync the log (when we write super blocks at
      btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
      btrfs_sync_file() to user space, we are not guaranteeing that the data is
      durably persisted on disk.
      
      Also, while the example above uses a rename to show how the
      problem happens, it is not the only way to trigger it. An alternative
      could be adding a new hard link to inode 258, since that also results
      in calling btrfs_log_new_name() and updating the inode in the log.
      An example reproducer using the addition of a hard link instead of a
      rename operation:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ touch /mnt/bar
        $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo
      
        $ ln /mnt/foo /mnt/foo_link
        $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
        $ xfs_io -c "fsync" /mnt/bar
        $ xfs_io -c "fsync" /mnt/foo
      
        <power failure>
      
        # Replay the log:
        $ mount /dev/sdc /mnt
      
        # After this point file foo often has 1M filled with 0xab and not 0xcd
      
      The reasons leading to the final fsync of file foo, inode 258, not
      persisting the new data are the same as for the previous example with
      a rename operation.
      
      So fix by never skipping logging and log syncing when there are still any
      ordered extents in flight. To avoid making the conditional if statement
      that checks if logging an inode is needed harder to read, place all the
      logic into a helper function with separate if statements to make it more
      manageable and easier to read.
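
      A minimal sketch of what such a helper could look like, assuming a
      simplified state with two booleans. struct fsync_state_model and
      can_skip_logging_model() are hypothetical names; the actual helper in the
      patch may check more conditions than shown here.

```c
#include <stdbool.h>

/* Hypothetical summary of the state examined during a fast fsync. */
struct fsync_state_model {
	/* Result of a btrfs_inode_in_log()-style check. */
	bool inode_already_logged;
	/* Ordered extents for the fsync range are still in flight. */
	bool ordered_extents_pending;
};

/* One condition per if statement, instead of one long conditional. */
static bool can_skip_logging_model(const struct fsync_state_model *state)
{
	/* In-flight ordered extents will still update the inode: never skip. */
	if (state->ordered_extents_pending)
		return false;

	/* Nothing in flight and the log already covers the inode. */
	if (state->inode_already_logged)
		return true;

	return false;
}
```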
      
      A test case for fstests will follow soon.
      
      For NOCOW writes, the problem existed before commit b5e6c3e1
      ("btrfs: always wait on ordered extents at fsync time"), introduced in
      kernel 4.19, then it went away with that commit since we started to always
      wait for ordered extent completion before logging.
      
      The problem came back again once the fast fsync path was changed again to
      avoid waiting for ordered extent completion, in commit 48778179
      ("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.
      
      However, for COW writes, the race only happens after the recent
      commit 209ecbb8 ("btrfs: remove stale comment and logic from
      btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
      writes, the bug existed before that commit. So tag 5.10+ as the release
      for stable backports.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      626e9f41
    • Filipe Manana's avatar
      btrfs: do not consider send context as valid when trying to flush qgroups · ffb7c2e9
      Filipe Manana authored
      
      
      At qgroup.c:try_flush_qgroup() we are asserting that current->journal_info
      is either NULL or has the value BTRFS_SEND_TRANS_STUB.
      
      However allowing for BTRFS_SEND_TRANS_STUB makes no sense because:
      
      1) It is misleading, because send operations are read-only and do not
         ever need to reserve qgroup space;
      
      2) We already assert that current->journal_info != BTRFS_SEND_TRANS_STUB
         at transaction.c:start_transaction();
      
      3) On a kernel without CONFIG_BTRFS_ASSERT=y set, it would result in
         a crash if try_flush_qgroup() is ever called in a send context, because
         at transaction.c:start_transaction we cast current->journal_info into
         a struct btrfs_trans_handle pointer and then dereference it.
      
      So just do not allow a send context at try_flush_qgroup() and update the
      comment about it.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ffb7c2e9
    • Filipe Manana's avatar
      btrfs: zoned: fix silent data loss after failure splitting ordered extent · adbd914d
      Filipe Manana authored
      On a zoned filesystem, sometimes we need to split an ordered extent into 3
      different ordered extents. The original ordered extent is shortened, at
      the front and at the rear, and we create two other new ordered extents to
      represent the trimmed parts of the original ordered extent.
      
      After adjusting the original ordered extent, we create an ordered extent
      to represent the pre-range, and that may fail with ENOMEM for example.
      After that we always try to create the ordered extent for the post-range,
      and if that happens to succeed we end up returning success to the caller
      as we overwrite the 'ret' variable which contained the previous error.
      
      This means we end up with a file range for which there is no ordered
      extent, which results in the range never getting a new file extent item
      pointing to the new data location. And since the split operation did
      not return an error, writeback does not fail and the inode's mapping is
      not flagged with an error, resulting in a subsequent fsync not reporting
      an error either.
      
      It's probably very unlikely to have the creation of the post-range ordered
      extent succeed after the creation of the pre-range ordered extent failed,
      but it's not impossible.
      
      So fix this by making sure we only create the post-range ordered extent
      if there was no error creating the ordered extent for the pre-range.
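
      The error handling problem and its fix can be illustrated with a small
      userspace sketch. create_split_model(), split_buggy() and split_fixed()
      are hypothetical names, and the model always creates both ranges, which
      is a simplification of the real split logic.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for creating one of the new ordered extents. */
static int create_split_model(bool fail)
{
	return fail ? -ENOMEM : 0;
}

/* Buggy pattern: the post-range result overwrites the pre-range error. */
static int split_buggy(bool pre_fails, bool post_fails)
{
	int ret = create_split_model(pre_fails);

	ret = create_split_model(post_fails);
	return ret;
}

/* Fixed pattern: only create the post-range if the pre-range succeeded. */
static int split_fixed(bool pre_fails, bool post_fails)
{
	int ret = create_split_model(pre_fails);

	if (ret == 0)
		ret = create_split_model(post_fails);
	return ret;
}
```

      When the pre-range fails and the post-range succeeds, split_buggy()
      returns 0 while split_fixed() returns -ENOMEM.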
      
      Fixes: d22002fd ("btrfs: zoned: split ordered extent when bio is sent")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      adbd914d
  9. Apr 21, 2021
    • Johannes Thumshirn's avatar
      btrfs: zoned: automatically reclaim zones · 18bb8bbf
      Johannes Thumshirn authored
      
      
      When a file gets deleted on a zoned file system, the space freed is not
      returned back into the block group's free space, but is migrated to
      zone_unusable.
      
      As this zone_unusable space is behind the current write pointer it is not
      possible to use it for new allocations. In the current implementation a
      zone is reset once all of the block group's space is accounted as zone
      unusable.
      
      This behaviour can lead to premature ENOSPC errors on a busy file system.
      
      Instead of only reclaiming the zone once it is completely unusable,
      kick off a reclaim job once the amount of unusable bytes exceeds a user
      configurable threshold between 51% and 100%. It can be set per mounted
      filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
      by default.
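
      The threshold test can be sketched with integer arithmetic only. This is
      a userspace model: should_reclaim_model() is a hypothetical name, and
      treating a block group that sits exactly at the threshold as not yet
      exceeding it is an assumption taken from the wording above, not from the
      patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True once the unusable bytes exceed threshold_pct percent of the block
 * group length. Cross-multiplying avoids floating point and division.
 */
static bool should_reclaim_model(uint64_t zone_unusable, uint64_t bg_length,
				 unsigned int threshold_pct)
{
	if (bg_length == 0)
		return false;

	return zone_unusable * 100 > bg_length * threshold_pct;
}
```

      For a 256M block group and the default 75% threshold, 192M of unusable
      space is exactly at the threshold, and anything above it triggers a
      reclaim in this model.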
      
      Similar to reclaiming unused block groups, these dirty block groups are
      added to a to_reclaim list and then on a transaction commit, the reclaim
      process is triggered but after we deleted unused block groups, which will
      free space for the relocation process.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18bb8bbf
    • Johannes Thumshirn's avatar
      btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock · f3372065
      Johannes Thumshirn authored
      
      
      As a preparation for extending the block group deletion use case, rename
      delete_unused_bgs_mutex to reclaim_bgs_lock.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f3372065
    • Johannes Thumshirn's avatar
      btrfs: zoned: reset zones of relocated block groups · 01e86008
      Johannes Thumshirn authored
      When relocating a block group the freed up space is not discarded in one
      big block, but each extent is discarded on its own with -o discard=sync.
      
      For a zoned filesystem we need to discard the whole block group at once,
      so btrfs_discard_extent() will translate the discard into a
      REQ_OP_ZONE_RESET operation, which then resets the device's zone.
      Failure to reset the zone is not a fatal error.
      
      Discussion about the approach and regarding transaction blocking:
      https://lore.kernel.org/linux-btrfs/CAL3q7H4SjS_d5rBepfTMhU8Th3bJzdmyYd0g4Z60yUgC_rC_ZA@mail.gmail.com/

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      01e86008
    • Qu Wenruo's avatar
      btrfs: more graceful errors/warnings on 32bit systems when reaching limits · e9306ad4
      Qu Wenruo authored
      Btrfs uses internally mapped u64 address space for all its metadata.
      Due to the page cache limit on 32bit systems, btrfs can't access
      metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See
      how MAX_LFS_FILESIZE and page::index are defined.  This is 16T for 4K
      page size while 256T for 64K page size.
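
      The two figures follow directly from the formula. A small userspace
      sketch, assuming a 32bit unsigned long (so ULONG_MAX is modelled with
      UINT32_MAX); metadata_limit_32bit() is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * (ULONG_MAX + 1) << PAGE_SHIFT for a 32bit unsigned long, computed in
 * 64 bits so that the result does not overflow.
 */
static uint64_t metadata_limit_32bit(unsigned int page_shift)
{
	const uint64_t ulong_max_32 = UINT32_MAX;

	return (ulong_max_32 + 1) << page_shift;
}
```

      For 4K pages (page_shift 12) this gives 2^44 bytes, which is 16T, and for
      64K pages (page_shift 16) it gives 2^48 bytes, which is 256T.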
      
      Users can have a filesystem which doesn't have metadata beyond the
      boundary at mount time, but later balance can cause it to create
      metadata beyond the boundary.
      
      Modifying the MM layer just for such a minor use case is unrealistic.
      The best we can do is refuse to mount such a filesystem, or warn early
      while the numbers are still within the limits.
      
      To address such problem, this patch will introduce the following checks:
      
      - Mount time rejection
        This will reject any fs which has metadata chunk at or beyond the
        boundary.
      
      - Mount time early warning
        If there is any metadata chunk beyond 5/8th of the boundary, we do an
        early warning and hope the end user will see it.
      
      - Runtime extent buffer rejection
        If we're going to allocate an extent buffer at or beyond the boundary,
        reject such request with EOVERFLOW.
        This is definitely going to cause problems like transaction abort, but
        we have no better ways.
      
      - Runtime extent buffer early warning
        If an extent buffer beyond 5/8th of the max file size is allocated, do
        an early warning.
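
      The checks above share one classification, sketched here as a userspace
      model. enum limit_verdict and check_metadata_addr() are hypothetical
      names, and using a single byte address to stand for a chunk or extent
      buffer is a simplification.

```c
#include <stdint.h>

enum limit_verdict { LIMIT_OK, LIMIT_WARN, LIMIT_REJECT };

/*
 * Classify a metadata address against the 32bit boundary: at or beyond
 * the boundary is rejected, beyond 5/8th of it triggers an early warning.
 */
static enum limit_verdict check_metadata_addr(uint64_t addr, uint64_t boundary)
{
	if (addr >= boundary)
		return LIMIT_REJECT;
	if (addr > boundary / 8 * 5)
		return LIMIT_WARN;
	return LIMIT_OK;
}
```

      With the 16T boundary of a 4K page size system, the warning starts just
      above 10T in this model.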
      
      Above error/warning message will only be printed once for each fs to
      reduce dmesg flood.
      
      If the mount is rejected, the filesystem will be mountable only on a
      64bit host.
      
      Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/
      Reported-by: Erik Jensen <erikjensen@rkjnsn.net>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e9306ad4
    • Filipe Manana's avatar
      btrfs: zoned: fix unpaired block group unfreeze during device replace · 0dc16ef4
      Filipe Manana authored
      When doing a device replace on a zoned filesystem, if we find a block
      group with ->to_copy == 0, we jump to the label 'done', which will result
      in later calling btrfs_unfreeze_block_group(), even though at this point
      we never called btrfs_freeze_block_group().
      
      Since at this point we have neither turned the block group to RO mode nor
      made any progress, we don't need to jump to the label 'done'. So fix this
      by jumping instead to the label 'skip' and dropping our reference on the
      block group before the jump.
      
      Fixes: 78ce9fc2 ("btrfs: zoned: mark block groups to copy for device-replace")
      CC: stable@vger.kernel.org # 5.12
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0dc16ef4
    • Filipe Manana's avatar
      btrfs: fix race when picking most recent mod log operation for an old root · f9690f42
      Filipe Manana authored
      Commit dbcc7d57 ("btrfs: fix race when cloning extent buffer during
      rewind of an old root") fixed a race when we need to rewind the extent
      buffer of an old root. It was caused by picking a new mod log operation
      for the extent buffer while getting a cloned extent buffer with an outdated
      number of items (off by -1), because we cloned the extent buffer without
      locking it first.
      
      However there is still another similar race, but in the opposite direction.
      The cloned extent buffer has a number of items that does not match the
      number of tree mod log operations that are going to be replayed. This is
      because right after we got the last (most recent) tree mod log operation to
      replay and before locking and cloning the extent buffer, another task adds
      a new pointer to the extent buffer, which results in adding a new tree mod
      log operation and incrementing the number of items in the extent buffer.
      So after cloning we have a mismatch between the number of items in the extent
      buffer and the number of mod log operations we are going to apply to it.
      This results in hitting a BUG_ON() that produces the following stack trace:
      
         ------------[ cut here ]------------
         kernel BUG at fs/btrfs/tree-mod-log.c:675!
         invalid opcode: 0000 [#1] SMP KASAN PTI
         CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G        W         5.12.0-7d1efdf501f8-misc-next+ #99
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
         RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
         Code: 05 48 8d 74 10 (...)
         RSP: 0018:ffffc90001027090 EFLAGS: 00010293
         RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
         RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
         RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
         R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
         R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
         FS:  00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
         Call Trace:
          btrfs_get_old_root+0x16a/0x5c0
          ? lock_downgrade+0x400/0x400
          btrfs_search_old_slot+0x192/0x520
          ? btrfs_search_slot+0x1090/0x1090
          ? free_extent_buffer.part.61+0xd7/0x140
          ? free_extent_buffer+0x13/0x20
          resolve_indirect_refs+0x3e9/0xfc0
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? add_prelim_ref.part.11+0x150/0x150
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? lock_acquired+0xbb/0x620
          ? __kasan_check_write+0x14/0x20
          ? do_raw_spin_unlock+0xa8/0x140
          ? rb_insert_color+0x340/0x360
          ? prelim_ref_insert+0x12d/0x430
          find_parent_nodes+0x5c3/0x1830
          ? stack_trace_save+0x87/0xb0
          ? resolve_indirect_refs+0xfc0/0xfc0
          ? fs_reclaim_acquire+0x67/0xf0
          ? __kasan_check_read+0x11/0x20
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? fs_reclaim_acquire+0x67/0xf0
          ? __kasan_check_read+0x11/0x20
          ? ___might_sleep+0x10f/0x1e0
          ? __kasan_kmalloc+0x9d/0xd0
          ? trace_hardirqs_on+0x55/0x120
          btrfs_find_all_roots_safe+0x142/0x1e0
          ? find_parent_nodes+0x1830/0x1830
          ? trace_hardirqs_on+0x55/0x120
          ? ulist_free+0x1f/0x30
          ? btrfs_inode_flags_to_xflags+0x50/0x50
          iterate_extent_inodes+0x20e/0x580
          ? tree_backref_for_extent+0x230/0x230
          ? release_extent_buffer+0x225/0x280
          ? read_extent_buffer+0xdd/0x110
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? lock_acquired+0xbb/0x620
          ? __kasan_check_write+0x14/0x20
          ? do_raw_spin_unlock+0xa8/0x140
          ? _raw_spin_unlock+0x22/0x30
          ? release_extent_buffer+0x225/0x280
          iterate_inodes_from_logical+0x129/0x170
          ? iterate_inodes_from_logical+0x129/0x170
          ? btrfs_inode_flags_to_xflags+0x50/0x50
          ? iterate_extent_inodes+0x580/0x580
          ? __vmalloc_node+0x92/0xb0
          ? init_data_container+0x34/0xb0
          ? init_data_container+0x34/0xb0
          ? kvmalloc_node+0x60/0x80
          btrfs_ioctl_logical_to_ino+0x158/0x230
          btrfs_ioctl+0x2038/0x4360
          ? __kasan_check_write+0x14/0x20
          ? mmput+0x3b/0x220
          ? btrfs_ioctl_get_supported_features+0x30/0x30
          ? __kasan_check_read+0x11/0x20
          ? __kasan_check_read+0x11/0x20
          ? lock_release+0xc8/0x650
          ? __might_fault+0x64/0xd0
          ? __kasan_check_read+0x11/0x20
          ? lock_downgrade+0x400/0x400
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? lockdep_hardirqs_on_prepare+0x13/0x210
          ? _raw_spin_unlock_irqrestore+0x51/0x63
          ? __kasan_check_read+0x11/0x20
          ? do_vfs_ioctl+0xfc/0x9d0
          ? ioctl_file_clone+0xe0/0xe0
          ? lock_downgrade+0x400/0x400
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? __kasan_check_read+0x11/0x20
          ? lock_release+0xc8/0x650
          ? __task_pid_nr_ns+0xd3/0x250
          ? __kasan_check_read+0x11/0x20
          ? __fget_files+0x160/0x230
          ? __fget_light+0xf2/0x110
          __x64_sys_ioctl+0xc3/0x100
          do_syscall_64+0x37/0x80
          entry_SYSCALL_64_after_hwframe+0x44/0xae
         RIP: 0033:0x7f29ae85b427
         Code: 00 00 90 48 8b (...)
         RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
         RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
         RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
         RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
         R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
         R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
         Modules linked in:
         ---[ end trace 85e5fce078dfbe04 ]---
      
        (gdb) l *(tree_mod_log_rewind+0x3b1)
        0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
        670                      * the modification. As we're going backwards, we do the
        671                      * opposite of each operation here.
        672                      */
        673                     switch (tm->op) {
        674                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        675                             BUG_ON(tm->slot < n);
        676                             fallthrough;
        677                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        678                     case BTRFS_MOD_LOG_KEY_REMOVE:
        679                             btrfs_set_node_key(eb, &tm->key, tm->slot);
        (gdb) quit
      
      The following steps explain in more detail how it happens:
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
         This is task A;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
      
      3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
         of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
         pointing to ebX->start. We only have one item in eb X, so we create
         only one tree mod log operation, and store in the "tm_list" array;
      
      4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
         log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
         to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
         set to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         "tm_list" as argument. After that, tree_mod_log_free_eb() calls
         tree_mod_log_insert(). This inserts the mod log operation of type
         BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
         with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
      
      6) Then, after inserting the "tm_list" single element into the tree mod
         log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
         gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
      
      7) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      8) Later some other task B allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      9) The tree mod log user task calls btrfs_search_old_slot(), which calls
         btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
         with time_seq == 1 and eb_root == eb Y;
      
      10) The first iteration of the while loop finds the tree mod log element
          with sequence number 3, for the logical address of eb Y and of type
          BTRFS_MOD_LOG_ROOT_REPLACE;
      
      11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
          break out of the loop, and set root_logical to point to
          tm->old_root.logical, which corresponds to the logical address of
          eb X;
      
      12) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
          corresponds to the old slot 0 of eb X (eb X had only 1 item in it
          before being freed at step 7);
      
      13) We then break out of the while loop and return the tree mod log
          operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
          for slot 0 of eb X, to btrfs_get_old_root();
      
      14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
          operation and set "logical" to the logical address of eb X, which was
          the old root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      15) But before calling tree_mod_log_search(), task B locks eb X, adds a
          key to eb X, which results in adding a tree mod log operation of type
          BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
          log, and increments the number of items in eb X from 0 to 1.
          Now fs_info->tree_mod_seq has a value of 4;
      
      16) Task A then calls tree_mod_log_search(), which returns the most recent
          tree mod log operation for eb X, which is the one just added by task B
          at the previous step, with a sequence number of 4, a type of
          BTRFS_MOD_LOG_KEY_ADD and for slot 0;
      
      17) Before task A locks and clones eb X, task B adds another key to eb X,
          which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
          with a sequence number of 5, for slot 1 of eb X, increments the
          number of items in eb X from 1 to 2, and unlocks eb X.
          Now fs_info->tree_mod_seq has a value of 5;
      
      18) Task A then locks eb X and clones it. The clone has a value of 2 for
          the number of items and the pointer "tm" points to the tree mod log
          operation with sequence number 4, not the most recent one with a
          sequence number of 5, so there is a mismatch between the number of
          mod log operations that are going to be applied to the cloned version
          of eb X and the number of items in the clone;
      
      19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
          tree mod log operation with sequence number 4 and a type of
          BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
      
      20) At tree_mod_log_rewind(), we set the local variable "n" with a value
          of 2, which is the number of items in the clone of eb X.
      
          Then in the first iteration of the while loop, we process the mod log
          operation with sequence number 4, which is targeted at slot 0 and has
          a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
          2 to 1.
      
          Then we pick the next tree mod log operation for eb X, which is the
          tree mod log operation with a sequence number of 2, a type of
          BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
          added in step 5 to the tree mod log tree.
      
          We go back to the top of the loop to process this mod log operation,
          and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
      
              (...)
              switch (tm->op) {
              case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                      BUG_ON(tm->slot < n);
                      fallthrough;
              (...)
      
      Fix this by checking for a more recent tree mod log operation after locking
      and cloning the extent buffer of the old root node, and use it as the first
      operation to apply to the cloned extent buffer when rewinding it.
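
      The mismatch described in steps 18 to 20 can be reproduced with a small
      userspace model. This is only a sketch, not the kernel code: the names
      mod_elem, mod_op and rewind_is_consistent() are hypothetical, and only
      the item counting done by tree_mod_log_rewind() is modelled. It shows
      that starting the rewind at the operation with sequence number 4 trips
      the check, while starting at the most recent operation does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the tree mod log types. */
enum mod_op { OP_KEY_ADD, OP_KEY_REMOVE_WHILE_FREEING };

struct mod_elem {
	unsigned long seq;	/* tree mod log sequence number */
	enum mod_op op;
	int slot;
};

/*
 * Model of the item counting done while rewinding a cloned extent buffer.
 * "ops" is sorted from newest to oldest, "first" is the index of the first
 * operation to undo and "nritems" is the item count of the clone.
 * Returns false where the kernel would hit BUG_ON(tm->slot < n).
 */
static bool rewind_is_consistent(const struct mod_elem *ops, size_t nr_ops,
				 size_t first, int nritems,
				 unsigned long time_seq)
{
	int n = nritems;

	for (size_t i = first; i < nr_ops && ops[i].seq >= time_seq; i++) {
		switch (ops[i].op) {
		case OP_KEY_ADD:
			n--;	/* undoing an add removes one item */
			break;
		case OP_KEY_REMOVE_WHILE_FREEING:
			if (ops[i].slot < n)
				return false;
			n++;	/* undoing a removal restores one item */
			break;
		}
	}
	return true;
}

static void rewind_example(void)
{
	/* Newest first: the adds from steps 15 and 17, then the removal from step 5. */
	const struct mod_elem ops[] = {
		{ 5, OP_KEY_ADD, 1 },
		{ 4, OP_KEY_ADD, 0 },
		{ 2, OP_KEY_REMOVE_WHILE_FREEING, 0 },
	};

	/* Clone has 2 items but the rewind starts at sequence number 4. */
	assert(!rewind_is_consistent(ops, 3, 1, 2, 1));
	/* Rewind starts at the most recent operation, sequence number 5. */
	assert(rewind_is_consistent(ops, 3, 0, 2, 1));
}
```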
      
      Stable backport notes: due to moved code and renames, in kernels <= 5.11
      the change should be applied to ctree.c:get_old_root.
      
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
      Fixes: 834328a8 ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f9690f42
    • Filipe Manana's avatar
      btrfs: fix metadata extent leak after failure to create subvolume · 67addf29
      Filipe Manana authored
      
      
      When creating a subvolume we allocate an extent buffer for its root node
      after starting a transaction. We setup a root item for the subvolume that
      points to that extent buffer and then attempt to insert the root item into
      the root tree - however if that fails, due to ENOMEM for example, we do
      not free the extent buffer previously allocated and we do not abort the
      transaction (as at that point we did nothing that can not be undone).
      
      This means that we effectively do not return the metadata extent back to
      the free space cache/tree and we leave a delayed reference for it which
      causes a metadata extent item to be added to the extent tree, in the next
      transaction commit, without having backreferences. When this happens
      'btrfs check' reports the following:
      
        $ btrfs check /dev/sdi
        Opening filesystem to check...
        Checking filesystem on /dev/sdi
        UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb
        [1/7] checking root items
        [2/7] checking extents
        ref mismatch on [30425088 16384] extent item 1, found 0
        backref 30425088 root 256 not referenced back 0x564a91c23d70
        incorrect global backref count on 30425088 found 1 wanted 0
        backpointer mismatch on [30425088 16384]
        owner ref check failed [30425088 16384]
        ERROR: errors found in extent allocation tree or chunk allocation
        [3/7] checking free space cache
        [4/7] checking fs roots
        [5/7] checking only csums items (without verifying data)
        [6/7] checking root refs
        [7/7] checking quota groups skipped (not enabled on this FS)
        found 212992 bytes used, error(s) found
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 124669
        file data blocks allocated: 65536
         referenced 65536
      
      So fix this by freeing the metadata extent if btrfs_insert_root() returns
      an error.
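
      The error handling pattern can be illustrated with a small userspace
      model. This is a sketch under stated assumptions, not the kernel
      change itself: extent_allocator, alloc_root_node(), free_root_node(),
      insert_root_item() and create_subvolume_model() are hypothetical
      stand-ins, and the allocator is reduced to a counter of extents in use.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical allocator model: counts metadata extents that are in use. */
struct extent_allocator {
	int extents_in_use;
};

static void alloc_root_node(struct extent_allocator *a)
{
	a->extents_in_use++;
}

static void free_root_node(struct extent_allocator *a)
{
	a->extents_in_use--;
}

/* Stand-in for inserting the root item into the root tree; fails on demand. */
static int insert_root_item(bool fail)
{
	return fail ? -ENOMEM : 0;
}

static int create_subvolume_model(struct extent_allocator *a, bool insert_fails)
{
	int ret;

	alloc_root_node(a);
	ret = insert_root_item(insert_fails);
	if (ret) {
		/* Give the extent back instead of leaking it. */
		free_root_node(a);
		return ret;
	}
	return 0;
}

static void subvolume_example(void)
{
	struct extent_allocator a = { 0 };

	/* A failed insertion must leave no extent behind. */
	assert(create_subvolume_model(&a, true) == -ENOMEM);
	assert(a.extents_in_use == 0);
}
```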
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      67addf29
  10. Apr 20, 2021
    • Qu Wenruo's avatar
      btrfs: handle remount to no compress during compression · 1d8ba9e7
      Qu Wenruo authored
      [BUG]
      When running btrfs/071 with inode_need_compress() removed from
      compress_file_range(), we got the following crash:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
        RIP: 0010:compress_file_range+0x476/0x7b0 [btrfs]
        Call Trace:
         ? submit_compressed_extents+0x450/0x450 [btrfs]
         async_cow_start+0x16/0x40 [btrfs]
         btrfs_work_helper+0xf2/0x3e0 [btrfs]
         process_one_work+0x278/0x5e0
         worker_thread+0x55/0x400
         ? process_one_work+0x5e0/0x5e0
         kthread+0x168/0x190
         ? kthread_create_worker_on_cpu+0x70/0x70
         ret_from_fork+0x22/0x30
        ---[ end trace 65faf4eae941fa7d ]---
      
      This is already after the patch "btrfs: inode: fix NULL pointer
      dereference if inode doesn't need compression."
      
      [CAUSE]
      @pages is first allocated by kcalloc() in compress_file_range():
                      pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
      
      Then passed to btrfs_compress_pages() to be utilized there:
      
                      ret = btrfs_compress_pages(...
                                                 pages,
                                                 &nr_pages,
                                                 ...);
      
      btrfs_compress_pages() initializes each output page. For example, in
      zlib_compress_pages() we have:
      
                              pages[nr_pages] = out_page;
                              nr_pages++;
      
      Normally this is completely fine, but there is a special case which
      is in btrfs_compress_pages() itself:
      
              switch (type) {
              default:
                      return -E2BIG;
              }
      
      In this case, we modify neither @pages nor @out_pages, leaving them
      untouched. Then, when we clean up the pages, we can hit a NULL pointer
      dereference again:
      
              if (pages) {
                      for (i = 0; i < nr_pages; i++) {
                              WARN_ON(pages[i]->mapping);
                              put_page(pages[i]);
                      }
              ...
              }
      
      Since all pages[i] entries are initialized to zero, and
      btrfs_compress_pages() doesn't change them at all, accessing
      pages[i]->mapping leads to a NULL pointer dereference.
      
      This is not possible in the current kernel, as we check
      inode_need_compress() before allocating the pages.
      But if we're going to remove that inode_need_compress() check in
      compress_file_range(), then it's going to be a problem.
      
      [FIX]
      When btrfs_compress_pages() hits its default case, set @out_pages to
      0 to prevent this problem from happening.
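
      The effect of the fix can be shown with a small userspace model. This
      is a sketch, not the kernel code: compress_pages_model(),
      cleanup_pages_model() and the MODEL_COMPRESS_* constants are
      hypothetical, and pages are plain pointers. The point is that once the
      default case reports zero output pages, the cleanup loop of the caller
      never touches a NULL entry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_COMPRESS_NONE 0
#define MODEL_COMPRESS_ZLIB 1

static int dummy_page;	/* stands in for an allocated output page */

static int compress_pages_model(int type, void **pages,
				unsigned long *out_pages)
{
	switch (type) {
	case MODEL_COMPRESS_ZLIB:
		pages[0] = &dummy_page;
		*out_pages = 1;
		return 0;
	default:
		/* Report that no output page was set up. */
		*out_pages = 0;
		return -E2BIG;
	}
}

/* Cleanup loop of the caller: every page it visits must be non-NULL. */
static unsigned long cleanup_pages_model(void **pages, unsigned long nr_pages)
{
	unsigned long touched = 0;

	for (unsigned long i = 0; i < nr_pages; i++) {
		assert(pages[i] != NULL);
		touched++;
	}
	return touched;
}

static void compress_example(void)
{
	void *pages[4] = { NULL };
	unsigned long nr_pages = 4;

	/* Compression was turned off: nothing was written to pages[]. */
	assert(compress_pages_model(MODEL_COMPRESS_NONE, pages, &nr_pages) == -E2BIG);
	assert(cleanup_pages_model(pages, nr_pages) == 0);
}
```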
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212331
      
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1d8ba9e7
  11. Apr 19, 2021
    • Johannes Thumshirn's avatar
      btrfs: zoned: fail mount if the device does not support zone append · 1d68128c
      Johannes Thumshirn authored
      
      
      For zoned btrfs, zone append is mandatory to write to a sequential write
      only zone, otherwise parallel writes to the same zone could result in
      unaligned write errors.
      
      If a zoned block device does not support zone append (e.g. a dm-crypt
      zoned device using a non-NULL IV cipher), fail the mount.
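
      The mount-time check can be sketched as follows. This is a userspace
      model under stated assumptions: zoned_dev_model and its fields are
      hypothetical, a zone append limit of 0 is assumed to mean that the
      device does not support zone append, and the -EINVAL error code is an
      assumption as well.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of a block device as seen at mount time. */
struct zoned_dev_model {
	bool is_zoned;
	unsigned int max_zone_append_bytes;	/* 0: zone append unsupported */
};

/* Returns 0 if the device can be used, a negative errno otherwise. */
static int check_zoned_dev_model(const struct zoned_dev_model *dev)
{
	if (!dev->is_zoned)
		return 0;	/* regular devices do not need zone append */
	if (dev->max_zone_append_bytes == 0)
		return -EINVAL;	/* refuse to mount */
	return 0;
}

static void zoned_example(void)
{
	const struct zoned_dev_model no_append = { true, 0 };

	assert(check_zoned_dev_model(&no_append) == -EINVAL);
}
```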
      
      CC: stable@vger.kernel.org # 5.12
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1d68128c
    • Filipe Manana's avatar
      btrfs: fix race between transaction aborts and fsyncs leading to use-after-free · 061dde82
      Filipe Manana authored
      There is a race between a task aborting a transaction during a commit,
      a task doing an fsync and the transaction kthread, which leads to a
      use-after-free of the log root tree. When this happens, it results in a
      stack trace like the following:
      
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure
        BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5)
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10
        BTRFS error (device dm-0): error writing primary super block to device 1
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10
        BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers)
        BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure
        general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        RIP: 0010:__mutex_lock+0x139/0xa40
        Code: c0 74 19 (...)
        RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202
        RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002
        RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040
        R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358
        FS:  00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
         ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
         ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
         btrfs_sync_log+0x7c1/0xf20 [btrfs]
         btrfs_sync_file+0x40c/0x580 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fa9142a55c3
        Code: 8b 15 09 (...)
        RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3
        RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005
        RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340
        R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0
        Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
        ---[ end trace ee2f1b19327d791d ]---
      
      The steps that lead to this crash are the following:
      
      1) We are at transaction N;
      
      2) We have two tasks with a transaction handle attached to transaction N,
         task A and task B. Task B is doing an fsync;
      
      3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree
         into a local variable named 'log_root_tree' at the top of
         btrfs_sync_log(). Task B is about to call write_all_supers(), but
         before that...
      
      4) Task A calls btrfs_commit_transaction(), and after it sets the
         transaction state to TRANS_STATE_COMMIT_START, an error happens before
         it waits for the transaction's 'num_writers' counter to reach a value
         of 1 (no one else attached to the transaction), so it jumps to the
         label "cleanup_transaction";
      
      5) Task A then calls cleanup_transaction(), where it aborts the
         transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state,
         setting the ->aborted field of the transaction and the handle to an
         errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state.
      
         After that, at cleanup_transaction(), it deletes the transaction from
         the list of transactions (fs_info->trans_list), sets the transaction
         to the state TRANS_STATE_COMMIT_DOING and then waits for the number
         of writers to go down to 1, as it's currently 2 (1 for task A and 1
         for task B);
      
      6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR
         is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction().
      
         There it sees the list fs_info->trans_list is empty, and then proceeds
         into calling btrfs_drop_all_logs(), which frees the log root tree with
         a call to btrfs_free_log_root_tree();
      
      7) Task B calls write_all_supers() and, shortly after, under the label
         'out_wake_log_root', it dereferences the pointer stored in
         'log_root_tree', which was already freed in the previous step by the
         transaction kthread. This results in a use-after-free leading to a
         crash.
      
      Fix this by deleting the transaction from the list of transactions at
      cleanup_transaction() only after setting the transaction state to
      TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are
      attached to the transaction to release their transaction handles.
      This makes the transaction kthread wait for all the tasks attached to
      the transaction to be done with the transaction before dropping the
      log roots and doing other cleanups.
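
      The change in ordering can be illustrated with a single-threaded model
      of the state involved. This is a sketch, not the kernel code:
      trans_model, kthread_may_drop_logs(), cleanup_step_old() and
      cleanup_step_fixed() are hypothetical names, and the waiting is
      modelled as returning early while other writers are still attached.

```c
#include <assert.h>
#include <stdbool.h>

struct trans_model {
	int num_writers;	/* tasks holding a handle on the transaction */
	bool on_trans_list;	/* still linked in the list of transactions */
};

/* The transaction kthread drops the log roots once the list is empty. */
static bool kthread_may_drop_logs(const struct trans_model *t)
{
	return !t->on_trans_list;
}

/* Old ordering: unlink first, wait for the other writers afterwards. */
static void cleanup_step_old(struct trans_model *t)
{
	t->on_trans_list = false;
}

/* Fixed ordering: unlink only once the aborting task is the last writer. */
static void cleanup_step_fixed(struct trans_model *t)
{
	if (t->num_writers > 1)
		return;	/* still waiting for the other handles */
	t->on_trans_list = false;
}

static void abort_example(void)
{
	struct trans_model old = { 2, true };
	struct trans_model fixed = { 2, true };

	/* Old: logs can be dropped while task B is still attached. */
	cleanup_step_old(&old);
	assert(kthread_may_drop_logs(&old));

	/* Fixed: logs stay until task B releases its handle. */
	cleanup_step_fixed(&fixed);
	assert(!kthread_may_drop_logs(&fixed));
}
```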
      
      Fixes: ef67963d ("btrfs: drop logs when we've aborted a transaction")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      061dde82
    • Qu Wenruo's avatar
      btrfs: introduce submit_eb_subpage() to submit a subpage metadata page · c4aec299
      Qu Wenruo authored
      
      
      The new function, submit_eb_subpage(), will submit all the dirty extent
      buffers in the page.
      
      The major difference between submit_eb_page() and submit_eb_subpage()
      is:
      - How to grab extent buffer
        Now we use find_extent_buffer_nolock() rather than using
        page::private.
      
      All other differences in handling are already taken care of in functions
      like lock_extent_buffer_for_io() and write_one_eb().
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c4aec299
    • Qu Wenruo's avatar
      btrfs: make lock_extent_buffer_for_io() to be subpage compatible · f3156df9
      Qu Wenruo authored
      
      
      For subpage metadata, we don't use page locking at all.  So just skip
      the page locking part for subpage.  The rest of the function can be
      reused.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f3156df9
    • Qu Wenruo's avatar
      btrfs: introduce write_one_subpage_eb() function · 35b6ddfa
      Qu Wenruo authored
      
      
      The new function, write_one_subpage_eb(), as a subroutine for subpage
      metadata write, will handle the extent buffer bio submission.
      
      The major differences between the new write_one_subpage_eb() and
      write_one_eb() are:
      
      - No page locking
        When entering write_one_subpage_eb() the page is no longer locked.
        We only lock the page for its status update, and unlock immediately.
        Now we completely rely on extent io tree locking.
      
      - Extra bitmap update along with page status update
        Now page dirty and writeback status is controlled by
        btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
        They both follow the rule that if any sector is dirty/under writeback,
        then the full page is marked dirty/writeback.
      
      - When to update the nr_written number
        Now we take a shortcut: if we have cleared the last dirty bit of the
        page, we update nr_written.
        This is not perfect, but should emulate the old behavior well enough.
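
      The bitmap rule and the nr_written shortcut can be sketched in
      userspace. This is a model under stated assumptions, not the kernel
      implementation: subpage_model, range_mask(), page_is_dirty(),
      set_dirty() and clear_dirty() are hypothetical names, and a page is
      assumed to hold 16 sectors tracked in a single 16-bit bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per sector, 16 sectors per page. */
struct subpage_model {
	uint16_t dirty_bitmap;
};

/* Mask covering nbits sectors starting at sector "start" (start + nbits <= 16). */
static uint16_t range_mask(unsigned int start, unsigned int nbits)
{
	return (uint16_t)(((1u << nbits) - 1u) << start);
}

/* The page counts as dirty as soon as any sector is dirty. */
static bool page_is_dirty(const struct subpage_model *s)
{
	return s->dirty_bitmap != 0;
}

static void set_dirty(struct subpage_model *s, unsigned int start,
		      unsigned int nbits)
{
	s->dirty_bitmap |= range_mask(start, nbits);
}

/*
 * Clear a range and return true if this cleared the last dirty bit of the
 * page, which is the point where nr_written would be updated.
 */
static bool clear_dirty(struct subpage_model *s, unsigned int start,
			unsigned int nbits)
{
	bool was_dirty = page_is_dirty(s);

	s->dirty_bitmap &= (uint16_t)~range_mask(start, nbits);
	return was_dirty && !page_is_dirty(s);
}

static void subpage_example(void)
{
	struct subpage_model s = { 0 };

	/* Two dirty extent buffers of 4 sectors each in one page. */
	set_dirty(&s, 0, 4);
	set_dirty(&s, 8, 4);
	assert(!clear_dirty(&s, 0, 4));	/* page is still dirty */
	assert(clear_dirty(&s, 8, 4));	/* last dirty bit cleared */
}
```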
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      35b6ddfa
    • Qu Wenruo's avatar
      btrfs: introduce end_bio_subpage_eb_writepage() function · 2f3186d8
      Qu Wenruo authored
      
      
      The new function, end_bio_subpage_eb_writepage(), will handle the
      metadata writeback endio.
      
      The major differences involved are:
      
      - How to grab extent buffer
        Now page::private is a pointer to btrfs_subpage, we can no longer grab
        extent buffer directly.
        Thus we need to use the bv_offset to locate the extent buffer manually
        and iterate through the whole range.
      
      - Use the btrfs_subpage_end_writeback() helper
        This helper will handle the subpage writeback for us.
      
      Since this function is executed under endio context, when grabbing
      extent buffers it can't grab eb->refs_lock as that lock is not designed
      to be grabbed under hardirq context.
      
      So introduce a helper, find_extent_buffer_nolock(), for such
      situations, and convert find_extent_buffer() to use that helper.
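
      Locating the extent buffers from bv_offset and iterating through the
      range can be sketched in userspace. This is a model under stated
      assumptions, not the kernel code: eb_starts_in_bvec() is a
      hypothetical name, every extent buffer is assumed to be exactly
      nodesize bytes long, and the bio segment is assumed to start on an
      extent buffer boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the start addresses of the extent buffers covered by one bio
 * segment, given the file offset of its page, the offset and length of the
 * segment inside that page, and the node size. Returns the number of start
 * addresses stored in "starts".
 */
static size_t eb_starts_in_bvec(uint64_t page_start, uint32_t bv_offset,
				uint32_t bv_len, uint32_t nodesize,
				uint64_t *starts, size_t max_starts)
{
	uint64_t cur = page_start + bv_offset;
	uint64_t end = cur + bv_len;	/* exclusive */
	size_t nr = 0;

	if (nodesize == 0)
		return 0;

	while (cur < end && nr < max_starts) {
		starts[nr++] = cur;	/* an extent buffer lookup would go here */
		cur += nodesize;
	}
	return nr;
}

static void bvec_example(void)
{
	uint64_t starts[4];

	/* 64K page at 65536; the segment covers two 16K extent buffers. */
	assert(eb_starts_in_bvec(65536, 16384, 32768, 16384, starts, 4) == 2);
	assert(starts[0] == 81920);
	assert(starts[1] == 98304);
}
```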
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2f3186d8
    • Josef Bacik's avatar
      btrfs: check return value of btrfs_commit_transaction in relocation · fb686c68
      Josef Bacik authored
      
      
      There are a few places where we don't check the return value of
      btrfs_commit_transaction in relocation.c.  Thankfully all these places
      have straightforward error handling, so simply change all of the sites
      at once.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb686c68