Skip to content
  1. Aug 17, 2023
    • xiaoshoukui's avatar
      btrfs: fix BUG_ON condition in btrfs_cancel_balance · 29eefa6d
      xiaoshoukui authored
      
      
      Pausing and canceling balance can race to interrupt balance lead to BUG_ON
      panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance
      does not take this race scenario into account.
      
      However, the race condition has no other side effects. We can fix that.
      
      Reproducing it with panic trace like this:
      
        kernel BUG at fs/btrfs/volumes.c:4618!
        RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0
        Call Trace:
         <TASK>
         ? do_nanosleep+0x60/0x120
         ? hrtimer_nanosleep+0xb7/0x1a0
         ? sched_core_clone_cookie+0x70/0x70
         btrfs_ioctl_balance_ctl+0x55/0x70
         btrfs_ioctl+0xa46/0xd20
         __x64_sys_ioctl+0x7d/0xa0
         do_syscall_64+0x38/0x80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        Race scenario as follows:
        > mutex_unlock(&fs_info->balance_mutex);
        > --------------------
        > .......issue pause and cancel req in another thread
        > --------------------
        > ret = __btrfs_balance(fs_info);
        >
        > mutex_lock(&fs_info->balance_mutex);
        > if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) {
        >         btrfs_info(fs_info, "balance: paused");
        >         btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED);
        > }
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarxiaoshoukui <xiaoshoukui@ruijie.com.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      29eefa6d
    • Chris Mason's avatar
      btrfs: only subtract from len_to_oe_boundary when it is tracking an extent · 09c3717c
      Chris Mason authored
      
      
      bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
      as we submit bios for writes.  Every time we add a page to the bio, we
      decrement those bytes from len_to_oe_boundary, and then we submit the
      bio if we happen to hit zero.
      
      Most of the time, len_to_oe_boundary gets set to U32_MAX.
      submit_extent_page() adds pages into our bio, and the size of the bio
      ends up limited by:
      
      - Are we contiguous on disk?
      - Does bio_add_page() allow us to stuff more in?
      - is len_to_oe_boundary > 0?
      
      The len_to_oe_boundary math starts with U32_MAX, which isn't page or
      sector aligned, and subtracts from it until it hits zero.  In the
      non-zoned case, the last IO we submit before we hit zero is going to be
      unaligned, triggering BUGs.
      
      This is hard to trigger because bio_add_page() isn't going to make a bio
      of U32_MAX size unless you give it a perfect set of pages and fully
      contiguous extents on disk.  We can hit it pretty reliably while making
      large swapfiles during provisioning because the machine is freshly
      booted, mostly idle, and the disk is freshly formatted.  It's also
      possible to trigger with reads when read_ahead_kb is set to 4GB.
      
      The code has been clean up and shifted around a few times, but this flaw
      has been lurking since the counter was added.  I think the commit
      24e6c808 ("btrfs: simplify main loop in submit_extent_page") ended
      up exposing the bug.
      
      The fix used here is to skip doing math on len_to_oe_boundary unless
      we've changed it from the default U32_MAX value.  bio_add_page() is the
      real limit we want, and there's no reason to do extra math when block
      layer is doing it for us.
      
      Sample reproducer, note you'll need to change the path to the bdi and
      device:
      
        SUBVOL=/btrfs/swapvol
        SWAPFILE=$SUBVOL/swapfile
        SZMB=8192
      
        mkfs.btrfs -f /dev/vdb
        mount /dev/vdb /btrfs
      
        btrfs subvol create $SUBVOL
        chattr +C $SUBVOL
        dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
        sync
      
        echo 4 > /proc/sys/vm/drop_caches
      
        echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
      
        while true; do
      	  echo 1 > /proc/sys/vm/drop_caches
      	  echo 1 > /proc/sys/vm/drop_caches
      	  dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
        done
      
      Fixes: 24e6c808 ("btrfs: simplify main loop in submit_extent_page")
      CC: stable@vger.kernel.org # 6.4+
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      09c3717c
    • Anand Jain's avatar
      btrfs: fix replace/scrub failure with metadata_uuid · b471965f
      Anand Jain authored
      
      
      Fstests with POST_MKFS_CMD="btrfstune -m" (as in the mailing list)
      reported a few of the test cases failing.
      
      The failure scenario can be summarized and simplified as follows:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb1 /dev/sdb2 :0
        $ btrfstune -m /dev/sdb1 :0
        $ wipefs -a /dev/sdb1 :0
        $ mount -o degraded /dev/sdb2 /btrfs :0
        $ btrfs replace start -B -f -r 1 /dev/sdb1 /btrfs :1
          STDERR:
          ERROR: ioctl(DEV_REPLACE_START) failed on "/btrfs": Input/output error
      
        [11290.583502] BTRFS warning (device sdb2): tree block 22036480 mirror 2 has bad fsid, has 99835c32-49f0-4668-9e66-dc277a96b4a6 want da40350c-33ac-4872-92a8-4948ed8c04d0
        [11290.586580] BTRFS error (device sdb2): unable to fix up (regular) error at logical 22020096 on dev /dev/sdb8 physical 1048576
      
      As above, the replace is failing because we are verifying the header with
      fs_devices::fsid instead of fs_devices::metadata_uuid, despite the
      metadata_uuid actually being present.
      
      To fix this, use fs_devices::metadata_uuid. We copy fsid into
      fs_devices::metadata_uuid if there is no metadata_uuid, so its fine.
      
      Fixes: a3ddbaeb ("btrfs: scrub: introduce a helper to verify one metadata block")
      CC: stable@vger.kernel.org # 6.4+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b471965f
  2. Aug 14, 2023
    • Filipe Manana's avatar
      btrfs: fix infinite directory reads · 9b378f6a
      Filipe Manana authored
      
      
      The readdir implementation currently processes always up to the last index
      it finds. This however can result in an infinite loop if the directory has
      a large number of entries such that they won't all fit in the given buffer
      passed to the readdir callback, that is, dir_emit() returns a non-zero
      value. Because in that case readdir() will be called again and if in the
      meanwhile new directory entries were added and we still can't put all the
      remaining entries in the buffer, we keep repeating this over and over.
      
      The following C program and test script reproduce the problem:
      
        $ cat /mnt/readdir_prog.c
        #include <sys/types.h>
        #include <dirent.h>
        #include <stdio.h>
      
        int main(int argc, char *argv[])
        {
          DIR *dir = opendir(".");
          struct dirent *dd;
      
          while ((dd = readdir(dir))) {
            printf("%s\n", dd->d_name);
            rename(dd->d_name, "TEMPFILE");
            rename("TEMPFILE", dd->d_name);
          }
          closedir(dir);
        }
      
        $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV &> /dev/null
        #mkfs.xfs -f $DEV &> /dev/null
        #mkfs.ext4 -F $DEV &> /dev/null
      
        mount $DEV $MNT
      
        mkdir $MNT/testdir
        for ((i = 1; i <= 2000; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        cd $MNT/testdir
        /mnt/readdir_prog
      
        cd /mnt
      
        umount $MNT
      
      This behaviour is surprising to applications and it's unlike ext4, xfs,
      tmpfs, vfat and other filesystems, which always finish. In this case where
      new entries were added due to renames, some file names may be reported
      more than once, but this varies according to each filesystem - for example
      ext4 never reported the same file more than once while xfs reports the
      first 13 file names twice.
      
      So change our readdir implementation to track the last index number when
      opendir() is called and then make readdir() never process beyond that
      index number. This gives the same behaviour as ext4.
      
      Reported-by: default avatarRob Landley <rob@landley.net>
      Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
      
      
      CC: stable@vger.kernel.org # 6.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b378f6a
  3. Aug 10, 2023
    • Josef Bacik's avatar
      btrfs: set cache_block_group_error if we find an error · 92fb94b6
      Josef Bacik authored
      
      
      We set cache_block_group_error if btrfs_cache_block_group() returns an
      error, this is because we could end up not finding space to allocate and
      mistakenly return -ENOSPC, and which could then abort the transaction
      with the incorrect errno, and in the case of ENOSPC result in a
      WARN_ON() that will trip up tests like generic/475.
      
      However there's the case where multiple threads can be racing, one
      thread gets the proper error, and the other thread doesn't actually call
      btrfs_cache_block_group(), it instead sees ->cached ==
      BTRFS_CACHE_ERROR.  Again the result is the same, we fail to allocate
      our space and return -ENOSPC.  Instead we need to set
      cache_block_group_error to -EIO in this case to make sure that if we do
      not make our allocation we get the appropriate error returned back to
      the caller.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92fb94b6
    • Qu Wenruo's avatar
      btrfs: reject invalid reloc tree root keys with stack dump · 6ebcd021
      Qu Wenruo authored
      
      
      [BUG]
      Syzbot reported a crash that an ASSERT() got triggered inside
      prepare_to_merge().
      
      That ASSERT() makes sure the reloc tree is properly pointed back by its
      subvolume tree.
      
      [CAUSE]
      After more debugging output, it turns out we had an invalid reloc tree:
      
        BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17
      
      Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM,
      QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree.
      
      But reloc trees can only exist for subvolumes, as for non-subvolume
      trees, we just COW the involved tree block, no need to create a reloc
      tree since those tree blocks won't be shared with other trees.
      
      Only subvolumes tree can share tree blocks with other trees (thus they
      have BTRFS_ROOT_SHAREABLE flag).
      
      Thus this new debug output proves my previous assumption that corrupted
      on-disk data can trigger that ASSERT().
      
      [FIX]
      Besides the dedicated fix and the graceful exit, also let tree-checker to
      check such root keys, to make sure reloc trees can only exist for subvolumes.
      
      CC: stable@vger.kernel.org # 5.15+
      Reported-by: default avatar <syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ebcd021
    • Qu Wenruo's avatar
      btrfs: exit gracefully if reloc roots don't match · 05d7ce50
      Qu Wenruo authored
      
      
      [BUG]
      Syzbot reported a crash that an ASSERT() got triggered inside
      prepare_to_merge().
      
      [CAUSE]
      The root cause of the triggered ASSERT() is we can have a race between
      quota tree creation and relocation.
      
      This leads us to create a duplicated quota tree in the
      btrfs_read_fs_root() path, and since it's treated as fs tree, it would
      have ROOT_SHAREABLE flag, causing us to create a reloc tree for it.
      
      The bug itself is fixed by a dedicated patch for it, but this already
      taught us the ASSERT() is not something straightforward for
      developers.
      
      [ENHANCEMENT]
      Instead of using an ASSERT(), let's handle it gracefully and output
      extra info about the mismatch reloc roots to help debug.
      
      Also with the above ASSERT() removed, we can trigger ASSERT(0)s inside
      merge_reloc_roots() later.
      Also replace those ASSERT(0)s with WARN_ON()s.
      
      CC: stable@vger.kernel.org # 5.15+
      Reported-by: default avatar <syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      05d7ce50
    • Qu Wenruo's avatar
      btrfs: avoid race between qgroup tree creation and relocation · 773e722a
      Qu Wenruo authored
      
      
      [BUG]
      Syzbot reported a weird ASSERT() triggered inside prepare_to_merge().
      
        assertion failed: root->reloc_root == reloc_root, in fs/btrfs/relocation.c:1919
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/relocation.c:1919!
        invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        CPU: 0 PID: 9904 Comm: syz-executor.3 Not tainted
        6.4.0-syzkaller-08881-g533925cb7604 #0
        Hardware name: Google Google Compute Engine/Google Compute Engine,
        BIOS Google 05/27/2023
        RIP: 0010:prepare_to_merge+0xbb2/0xc40 fs/btrfs/relocation.c:1919
        Code: fe e9 f5 (...)
        RSP: 0018:ffffc9000325f760 EFLAGS: 00010246
        RAX: 000000000000004f RBX: ffff888075644030 RCX: 1481ccc522da5800
        RDX: ffffc90005c09000 RSI: 00000000000364ca RDI: 00000000000364cb
        RBP: ffffc9000325f870 R08: ffffffff816f33ac R09: 1ffff9200064bea0
        R10: dffffc0000000000 R11: fffff5200064bea1 R12: ffff888075644000
        R13: ffff88803b166000 R14: ffff88803b166560 R15: ffff88803b166558
        FS:  00007f4e305fd700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056080679c000 CR3: 00000000193ad000 CR4: 00000000003506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <TASK>
         relocate_block_group+0xa5d/0xcd0 fs/btrfs/relocation.c:3749
         btrfs_relocate_block_group+0x7ab/0xd70 fs/btrfs/relocation.c:4087
         btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3283
         __btrfs_balance+0x1b06/0x2690 fs/btrfs/volumes.c:4018
         btrfs_balance+0xbdb/0x1120 fs/btrfs/volumes.c:4402
         btrfs_ioctl_balance+0x496/0x7c0 fs/btrfs/ioctl.c:3604
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:870 [inline]
         __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
        RIP: 0033:0x7f4e2f88c389
      
      [CAUSE]
      With extra debugging, the offending reloc_root is for quota tree (rootid 8).
      
      Normally we should not use the reloc tree for quota root at all, as reloc
      trees are only for subvolume trees.
      
      But there is a race between quota enabling and relocation, this happens
      after commit 85724171 ("btrfs: fix the btrfs_get_global_root return value").
      
      Before that commit, for quota and free space tree, we exit immediately
      if we cannot grab it from fs_info.
      
      But now we would try to read it from disk, just as if they are fs trees,
      this sets ROOT_SHAREABLE flags in such race:
      
                   Thread A             |           Thread B
       ---------------------------------+------------------------------
       btrfs_quota_enable()             |
       |                                | btrfs_get_root_ref()
       |                                | |- btrfs_get_global_root()
       |                                | |  Returned NULL
       |                                | |- btrfs_lookup_fs_root()
       |                                | |  Returned NULL
       |- btrfs_create_tree()           | |
       |  Now quota root item is        | |
       |  inserted                      | |- btrfs_read_tree_root()
       |                                | |  Got the newly inserted quota root
       |                                | |- btrfs_init_fs_root()
       |                                | |  Set ROOT_SHAREABLE flag
      
      [FIX]
      Get back to the old behavior by returning PTR_ERR(-ENOENT) if the target
      objectid is not a subvolume tree or data reloc tree.
      
      Reported-and-tested-by: default avatar <syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com>
      Fixes: 85724171 ("btrfs: fix the btrfs_get_global_root return value")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      773e722a
    • Christoph Hellwig's avatar
      btrfs: properly clear end of the unreserved range in cow_file_range · 12b2d64e
      Christoph Hellwig authored
      
      
      When the call to btrfs_reloc_clone_csums in cow_file_range returns an
      error, we jump to the out_unlock label with the extent_reserved variable
      set to false.   The cleanup at the label will then call
      extent_clear_unlock_delalloc on the range from start to end.  But we've
      already added cur_alloc_size to start before the jump, so there might no
      range be left from the newly incremented start to end.  Move the check for
      'start < end' so that it is reached by also for the !extent_reserved case.
      
      CC: stable@vger.kernel.org # 6.1+
      Fixes: a315e68f ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12b2d64e
    • Christoph Hellwig's avatar
      btrfs: don't wait for writeback on clean pages in extent_write_cache_pages · 5c256998
      Christoph Hellwig authored
      
      
      __extent_writepage could have started on more pages than the one it was
      called for.  This happens regularly for zoned file systems, and in theory
      could happen for compressed I/O if the worker thread was executed very
      quickly. For such pages extent_write_cache_pages waits for writeback
      to complete before moving on to the next page, which is highly inefficient
      as it blocks the flusher thread.
      
      Port over the PageDirty check that was added to write_cache_pages in
      commit 515f4a03 ("mm: write_cache_pages optimise page cleaning") to
      fix this.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c256998
    • Christoph Hellwig's avatar
      btrfs: don't stop integrity writeback too early · effa24f6
      Christoph Hellwig authored
      
      
      extent_write_cache_pages stops writing pages as soon as nr_to_write hits
      zero.  That is the right thing for opportunistic writeback, but incorrect
      for data integrity writeback, which needs to ensure that no dirty pages
      are left in the range.  Thus only stop the writeback for WB_SYNC_NONE
      if nr_to_write hits 0.
      
      This is a port of write_cache_pages changes in commit 05fe478d
      ("mm: write_cache_pages integrity fix").
      
      Note that I've only trigger the problem with other changes to the btrfs
      writeback code, but this condition seems worthwhile fixing anyway.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ updated comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      effa24f6
    • Josef Bacik's avatar
      btrfs: wait for actual caching progress during allocation · fc1f91b9
      Josef Bacik authored
      
      
      Recently we've been having mysterious hangs while running generic/475 on
      the CI system.  This turned out to be something like this:
      
        Task 1
        dmsetup suspend --nolockfs
        -> __dm_suspend
         -> dm_wait_for_completion
          -> dm_wait_for_bios_completion
           -> Unable to complete because of IO's on a plug in Task 2
      
        Task 2
        wb_workfn
        -> wb_writeback
         -> blk_start_plug
          -> writeback_sb_inodes
           -> Infinite loop unable to make an allocation
      
        Task 3
        cache_block_group
        ->read_extent_buffer_pages
         ->Waiting for IO to complete that can't be submitted because Task 1
           suspended the DM device
      
      The problem here is that we need Task 2 to be scheduled completely for
      the blk plug to flush.  Normally this would happen, we normally wait for
      the block group caching to finish (Task 3), and this schedule would
      result in the block plug flushing.
      
      However if there's enough free space available from the current caching
      to satisfy the allocation we won't actually wait for the caching to
      complete.  This check however just checks that we have enough space, not
      that we can make the allocation.  In this particular case we were trying
      to allocate 9MiB, and we had 10MiB of free space, but we didn't have
      9MiB of contiguous space to allocate, and thus the allocation failed and
      we looped.
      
      We specifically don't cycle through the FFE loop until we stop finding
      cached block groups because we don't want to allocate new block groups
      just because we're caching, so we short circuit the normal loop once we
      hit LOOP_CACHING_WAIT and we found a caching block group.
      
      This is normally fine, except in this particular case where the caching
      thread can't make progress because the DM device has been suspended.
      
      Fix this by not only waiting for free space to >= the amount of space we
      want to allocate, but also that we make some progress in caching from
      the time we start waiting.  This will keep us from busy looping when the
      caching is taking a while but still theoretically has enough space for
      us to allocate from, and fixes this particular case by forcing us to
      actually sleep and wait for forward progress, which will flush the plug.
      
      With this fix we're no longer hanging with generic/475.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc1f91b9
  4. Jul 26, 2023
    • Filipe Manana's avatar
      btrfs: check for commit error at btrfs_attach_transaction_barrier() · b28ff3a7
      Filipe Manana authored
      
      
      btrfs_attach_transaction_barrier() is used to get a handle pointing to the
      current running transaction if the transaction has not started its commit
      yet (its state is < TRANS_STATE_COMMIT_START). If the transaction commit
      has started, then we wait for the transaction to commit and finish before
      returning - however we completely ignore if the transaction was aborted
      due to some error during its commit, we simply return ERR_PT(-ENOENT),
      which makes the caller assume everything is fine and no errors happened.
      
      This could make an fsync return success (0) to user space when in fact we
      had a transaction abort and the target inode changes were therefore not
      persisted.
      
      Fix this by checking for the return value from btrfs_wait_for_commit(),
      and if it returned an error, return it back to the caller.
      
      Fixes: d4edf39b ("Btrfs: fix uncompleted transaction")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b28ff3a7
  5. Jul 25, 2023
  6. Jul 21, 2023
  7. Jul 18, 2023
    • Filipe Manana's avatar
      btrfs: fix warning when putting transaction with qgroups enabled after abort · aa84ce8a
      Filipe Manana authored
      
      
      If we have a transaction abort with qgroups enabled we get a warning
      triggered when doing the final put on the transaction, like this:
      
        [552.6789] ------------[ cut here ]------------
        [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6817] Modules linked in: btrfs blake2b_generic xor (...)
        [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
        [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6821] Code: bd a0 01 00 (...)
        [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286
        [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000
        [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010
        [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20
        [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70
        [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028
        [552.6821] FS:  0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000
        [552.6821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0
        [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [552.6822] Call Trace:
        [552.6822]  <TASK>
        [552.6822]  ? __warn+0x80/0x130
        [552.6822]  ? btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6824]  ? report_bug+0x1f4/0x200
        [552.6824]  ? handle_bug+0x42/0x70
        [552.6824]  ? exc_invalid_op+0x14/0x70
        [552.6824]  ? asm_exc_invalid_op+0x16/0x20
        [552.6824]  ? btrfs_put_transaction+0x123/0x130 [btrfs]
        [552.6826]  btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs]
        [552.6828]  ? _raw_spin_unlock_irqrestore+0x23/0x40
        [552.6828]  ? try_to_wake_up+0x94/0x5e0
        [552.6828]  ? __pfx_process_timeout+0x10/0x10
        [552.6828]  transaction_kthread+0x103/0x1d0 [btrfs]
        [552.6830]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
        [552.6832]  kthread+0xee/0x120
        [552.6832]  ? __pfx_kthread+0x10/0x10
        [552.6832]  ret_from_fork+0x29/0x50
        [552.6832]  </TASK>
        [552.6832] ---[ end trace 0000000000000000 ]---
      
      This corresponds to this line of code:
      
        void btrfs_put_transaction(struct btrfs_transaction *transaction)
        {
            (...)
                WARN_ON(!RB_EMPTY_ROOT(
                                &transaction->delayed_refs.dirty_extent_root));
            (...)
        }
      
      The warning happens because btrfs_qgroup_destroy_extent_records(), called
      in the transaction abort path, we free all entries from the rbtree
      "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we
      don't actually empty the rbtree - it's still pointing to nodes that were
      freed.
      
      So set the rbtree's root node to NULL to avoid this warning (assign
      RB_ROOT).
      
      Fixes: 81f7eb00 ("btrfs: destroy qgroup extent records on transaction abort")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aa84ce8a
    • Christoph Hellwig's avatar
      btrfs: fix ordered extent split error handling in btrfs_dio_submit_io · 7cad645e
      Christoph Hellwig authored
      
      
      When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io
      fails to allocate memory for a new ordered_extent, it calls into the
      btrfs_dio_end_io for error handling.  btrfs_dio_end_io then assumes that
      bbio->ordered is set because it is supposed to be at this point, except
      for this error handling corner case.  Try to not overload the
      btrfs_dio_end_io with error handling of a bio in a non-canonical state,
      and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io
      directly for this error case.
      
      Reported-by: default avatarsyzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
      Fixes: b41b6f69 ("btrfs: use btrfs_finish_ordered_extent to complete direct writes")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Tested-by: default avatarsyzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7cad645e
    • Josef Bacik's avatar
      btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand · 17b17fcd
      Josef Bacik authored
      
      
      While trying to get the subpage blocksize tests running, I hit the
      following panic on generic/476
      
        assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
        kernel BUG at fs/btrfs/subpage.c:229!
        Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
        CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12
        Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023
        pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
        pc : btrfs_subpage_assert+0xbc/0xf0
        lr : btrfs_subpage_assert+0xbc/0xf0
        Call trace:
         btrfs_subpage_assert+0xbc/0xf0
         btrfs_subpage_clear_checked+0x38/0xc0
         btrfs_page_clear_checked+0x48/0x98
         btrfs_truncate_block+0x5d0/0x6a8
         btrfs_cont_expand+0x5c/0x528
         btrfs_write_check.isra.0+0xf8/0x150
         btrfs_buffered_write+0xb4/0x760
         btrfs_do_write_iter+0x2f8/0x4b0
         btrfs_file_write_iter+0x1c/0x30
         do_iter_readv_writev+0xc8/0x158
         do_iter_write+0x9c/0x210
         vfs_iter_write+0x24/0x40
         iter_file_splice_write+0x224/0x390
         direct_splice_actor+0x38/0x68
         splice_direct_to_actor+0x12c/0x260
         do_splice_direct+0x90/0xe8
         generic_copy_file_range+0x50/0x90
         vfs_copy_file_range+0x29c/0x470
         __arm64_sys_copy_file_range+0xcc/0x498
         invoke_syscall.constprop.0+0x80/0xd8
         do_el0_svc+0x6c/0x168
         el0_svc+0x50/0x1b0
         el0t_64_sync_handler+0x114/0x120
         el0t_64_sync+0x194/0x198
      
      This happens because during btrfs_cont_expand we'll get a page, set it
      as mapped, and if it's not Uptodate we'll read it.  However between the
      read and re-locking the page we could have called release_folio() on the
      page, but left the page in the file mapping.  release_folio() can clear
      the page private, and thus further down we blow up when we go to modify
      the subpage bits.
      
      Fix this by putting the set_page_extent_mapped() after the read.  This
      is safe because read_folio() will call set_page_extent_mapped() before
      it does the read, and then if we clear page private but leave it on the
      mapping we're completely safe re-setting set_page_extent_mapped().  With
      this patch I can now run generic/476 without panicing.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      17b17fcd
    • Qu Wenruo's avatar
      btrfs: raid56: always verify the P/Q contents for scrub · 486c737f
      Qu Wenruo authored
      
      
      [REGRESSION]
      Commit 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery
      path to use error_bitmap") changed the behavior of scrub_rbio().
      
      Initially if we have no error reading the raid bio, we will assign
      @need_check to true, then finish_parity_scrub() would later verify the
      content of P/Q stripes before writeback.
      
      But after that commit we never verify the content of P/Q stripes and
      just writeback them.
      
      This can lead to unrepaired P/Q stripes during scrub, or already
      corrupted P/Q copied to the dev-replace target.
      
      [FIX]
      The situation is more complex than the regression, in fact the initial
      behavior is not 100% correct either.
      
      If we have the following rare case, it can still lead to the same
      problem using the old behavior:
      
      		0	16K	32K	48K	64K
      	Data 1:	|IIIIIII|                       |
      	Data 2:	|				|
      	Parity:	|	|CCCCCCC|		|
      
      Where "I" means IO error, "C" means corruption.
      
      In the above case, we're scrubbing the parity stripe, then read out all
      the contents of Data 1, Data 2, Parity stripes.
      
      But found IO error in Data 1, which leads to rebuild using Data 2 and
      Parity and got the correct data.
      
      In that case, we would not verify if the Parity is correct for range
      [16K, 32K).
      
      So here we have to always verify the content of Parity no matter if we
      did recovery or not.
      
      This patch would remove the @need_check parameter of
      finish_parity_scrub() completely, and would always do the P/Q
      verification before writeback.
      
      Fixes: 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap")
      CC: stable@vger.kernel.org # 6.2+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      486c737f
    • Filipe Manana's avatar
      btrfs: use irq safe locking when running and adding delayed iputs · 866e98a4
      Filipe Manana authored
      
      
      Running delayed iputs, which never happens in an irq context, needs to
      lock the spinlock fs_info->delayed_iput_lock. When finishing bios for
      data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which
      needs to add a delayed iput and for that it needs to acquire the spinlock
      fs_info->delayed_iput_lock. Without disabling irqs when running delayed
      iputs we can therefore deadlock on that spinlock. The same deadlock can
      also happen when adding an inode to the delayed iputs list, since this
      can be done outside an irq context as well.
      
      Syzbot recently reported this, which results in the following trace:
      
        ================================
        WARNING: inconsistent lock state
        6.4.0-syzkaller-09904-ga507db1d8fdc #0 Not tainted
        --------------------------------
        inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes:
        ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline]
        ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
        {IN-SOFTIRQ-W} state was registered at:
          lock_acquire kernel/locking/lockdep.c:5761 [inline]
          lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
          __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
          _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
          spin_lock include/linux/spinlock.h:350 [inline]
          btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490
          btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline]
          btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547
          __btrfs_bio_end_io fs/btrfs/bio.c:118 [inline]
          __btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112
          btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163
          btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378
          bio_endio+0x589/0x690 block/bio.c:1617
          req_bio_endio block/blk-mq.c:766 [inline]
          blk_update_request+0x5c5/0x1620 block/blk-mq.c:911
          blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032
          lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
          blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110
          __do_softirq+0x1d4/0x905 kernel/softirq.c:553
          run_ksoftirqd kernel/softirq.c:921 [inline]
          run_ksoftirqd+0x31/0x60 kernel/softirq.c:913
          smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
          kthread+0x344/0x440 kernel/kthread.c:389
          ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
        irq event stamp: 39
        hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline]
        hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline]
        hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575
        hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline]
        hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline]
        hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575
        softirqs last  enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448
        softirqs last disabled at (0): [<0000000000000000>] 0x0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&fs_info->delayed_iput_lock);
          <Interrupt>
            lock(&fs_info->delayed_iput_lock);
      
         *** DEADLOCK ***
      
        1 lock held by btrfs-cleaner/16079:
         #0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463
      
        stack backtrace:
        CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d8fdc #0
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
         print_usage_bug kernel/locking/lockdep.c:3978 [inline]
         valid_state kernel/locking/lockdep.c:4020 [inline]
         mark_lock_irq kernel/locking/lockdep.c:4223 [inline]
         mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685
         mark_lock kernel/locking/lockdep.c:4649 [inline]
         mark_usage kernel/locking/lockdep.c:4598 [inline]
         __lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098
         lock_acquire kernel/locking/lockdep.c:5761 [inline]
         lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
         __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
         _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
         spin_lock include/linux/spinlock.h:350 [inline]
         btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
         cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478
         kthread+0x344/0x440 kernel/kthread.c:389
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
         </TASK>
      
      So fix this by using spin_lock_irq() and spin_unlock_irq() when running
      delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore()
      when adding a delayed iput().
      
      Reported-by: default avatar <syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com>
      Fixes: ec63b84d ("btrfs: add an ordered_extent pointer to struct btrfs_bio")
      Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/
      
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      866e98a4
    • Filipe Manana's avatar
      btrfs: fix iput() on error pointer after error during orphan cleanup · cbaee87f
      Filipe Manana authored
      
      
      At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns
      an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the
      inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body
      of the following if statement:
      
          if (ret == -ENOENT || inode->i_nlink) {
              (...)
              trans = btrfs_start_transaction(root, 1);
              if (IS_ERR(trans)) {
                  ret = PTR_ERR(trans);
                  iput(inode);
                  goto out;
              }
              (...)
              ret = btrfs_del_orphan_item(trans, root,
                                          found_key.objectid);
              btrfs_end_transaction(trans);
              if (ret) {
                  iput(inode);
                  goto out;
              }
              continue;
          }
      
      If we get an error from btrfs_start_transaction() or from the call to
      btrfs_del_orphan_item() we end calling iput() against an inode pointer
      that has a value of ERR_PTR(-ENOENT), resulting in a crash with the
      following trace:
      
        [876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096
        [876.667] #PF: supervisor read access in kernel mode
        [876.667] #PF: error_code(0x0000) - not-present page
        [876.667] PGD 0 P4D 0
        [876.668] Oops: 0000 [#1] PREEMPT SMP PTI
        [876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
        [876.668] RIP: 0010:iput+0xa/0x20
        [876.668] Code: ff ff ff 66 (...)
        [876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282
        [876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000
        [876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe
        [876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4
        [876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0
        [876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4
        [876.670] FS:  00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000
        [876.670] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0
        [876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [876.672] Call Trace:
        [876.744]  <TASK>
        [876.744]  ? __die_body+0x1b/0x60
        [876.744]  ? page_fault_oops+0x15d/0x450
        [876.745]  ? __kmem_cache_alloc_node+0x47/0x410
        [876.745]  ? do_user_addr_fault+0x65/0x8a0
        [876.745]  ? exc_page_fault+0x74/0x170
        [876.746]  ? asm_exc_page_fault+0x22/0x30
        [876.746]  ? iput+0xa/0x20
        [876.746]  btrfs_orphan_cleanup+0x221/0x330 [btrfs]
        [876.746]  btrfs_lookup_dentry+0x58f/0x5f0 [btrfs]
        [876.747]  btrfs_lookup+0xe/0x30 [btrfs]
        [876.747]  __lookup_slow+0x82/0x130
        [876.785]  walk_component+0xe5/0x160
        [876.786]  path_lookupat.isra.0+0x6e/0x150
        [876.786]  filename_lookup+0xcf/0x1a0
        [876.786]  ? mod_objcg_state+0xd2/0x360
        [876.786]  ? obj_cgroup_charge+0xf5/0x110
        [876.787]  ? should_failslab+0xa/0x20
        [876.787]  ? kmem_cache_alloc+0x47/0x450
        [876.787]  vfs_path_lookup+0x51/0x90
        [876.788]  mount_subtree+0x8d/0x130
        [876.788]  btrfs_mount+0x149/0x410 [btrfs]
        [876.788]  ? __kmem_cache_alloc_node+0x47/0x410
        [876.788]  ? vfs_parse_fs_param+0xc0/0x110
        [876.789]  legacy_get_tree+0x24/0x50
        [876.834]  vfs_get_tree+0x22/0xd0
        [876.852]  path_mount+0x2d8/0x9c0
        [876.852]  do_mount+0x79/0x90
        [876.852]  __x64_sys_mount+0x8e/0xd0
        [876.853]  do_syscall_64+0x38/0x90
        [876.899]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [876.958] RIP: 0033:0x7f5fbe50b76a
        [876.959] Code: 48 8b 0d a9 (...)
        [876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        [876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a
        [876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0
        [876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001
        [876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        [876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70
        [876.960]  </TASK>
      
      So fix this by setting 'inode' to NULL whenever we get an error from
      btrfs_iget(), and to make the code simpler, stop testing for 'ret' being
      -ENOENT to check if we have an inode - instead test for 'inode' being NULL
      or not. Having a NULL 'inode' prevents any iput() call from crashing, as
      iput() ignores NULL inode pointers. Also, stop testing for a NULL return
      value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never
      returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT),
      and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM).
      We also don't need the extra iput() calls on the error branches for the
      btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have
      already called iput() before, so remove them.
      
      Fixes: a13bb2c0 ("btrfs: add missing iputs on orphan cleanup failure")
      CC: stable@vger.kernel.org # 6.4
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbaee87f
    • Filipe Manana's avatar
      btrfs: fix double iput() on inode after an error during orphan cleanup · b777d279
      Filipe Manana authored
      
      
      At btrfs_orphan_cleanup(), if we were able to find the inode, we do an
      iput() on the inode, then if btrfs_drop_verity_items() succeeds and then
      either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do
      another iput() in the respective error paths, resulting in an extra iput()
      on the inode.
      
      Fix this by setting inode to NULL after the first iput(), as iput()
      ignores a NULL inode pointer argument.
      
      Fixes: a13bb2c0 ("btrfs: add missing iputs on orphan cleanup failure")
      CC: stable@vger.kernel.org # 6.4
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b777d279
    • Filipe Manana's avatar
      btrfs: zoned: fix memory leak after finding block group with super blocks · f1a07c2b
      Filipe Manana authored
      
      
      At exclude_super_stripes(), if we happen to find a block group that has
      super blocks mapped to it and we are on a zoned filesystem, we error out
      as this is not supposed to happen, indicating either a bug or maybe some
      memory corruption for example. However we are exiting the function without
      freeing the memory allocated for the logical address of the super blocks.
      Fix this by freeing the logical address.
      
      Fixes: 12659251 ("btrfs: implement log-structured superblock for ZONED mode")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f1a07c2b
  8. Jul 11, 2023
    • Filipe Manana's avatar
      btrfs: fix use-after-free of new block group that became unused · 0657b20c
      Filipe Manana authored
      
      
      If a task creates a new block group and that block group becomes unused
      before we finish its creation, at btrfs_create_pending_block_groups(),
      then when btrfs_mark_bg_unused() is called against the block group, we
      assume that the block group is currently in the list of block groups to
      reclaim, and we move it out of the list of new block groups and into the
      list of unused block groups. This has two consequences:
      
      1) We move it out of the list of new block groups associated to the
         current transaction. So the block group creation is not finished and
         if we attempt to delete the bg because it's unused, we will not find
         the block group item in the extent tree (or the new block group tree),
         its device extent items in the device tree etc, resulting in the
         deletion to fail due to the missing items;
      
      2) We don't increment the reference count on the block group when we
         move it to the list of unused block groups, because we assumed the
         block group was on the list of block groups to reclaim, and in that
         case it already has the correct reference count. However the block
         group was on the list of new block groups, in which case no extra
         reference was taken because it's local to the current task. This
         later results in doing an extra reference count decrement when
         removing the block group from the unused list, eventually leading the
         reference count to 0.
      
      This second case was caught when running generic/297 from fstests, which
      produced the following assertion failure and stack trace:
      
        [589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
        [589.559] ------------[ cut here ]------------
        [589.559] kernel BUG at fs/btrfs/block-group.c:4299!
        [589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
        [589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
        [589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.561] Code: 68 62 da c0 (...)
        [589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
        [589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
        [589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
        [589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
        [589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
        [589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
        [589.563] FS:  00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
        [589.563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
        [589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [589.564] Call Trace:
        [589.564]  <TASK>
        [589.565]  ? __die_body+0x1b/0x60
        [589.565]  ? die+0x39/0x60
        [589.565]  ? do_trap+0xeb/0x110
        [589.565]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.566]  ? do_error_trap+0x6a/0x90
        [589.566]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.566]  ? exc_invalid_op+0x4e/0x70
        [589.566]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.567]  ? asm_exc_invalid_op+0x16/0x20
        [589.567]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.567]  ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
        [589.567]  close_ctree+0x35d/0x560 [btrfs]
        [589.568]  ? fsnotify_sb_delete+0x13e/0x1d0
        [589.568]  ? dispose_list+0x3a/0x50
        [589.568]  ? evict_inodes+0x151/0x1a0
        [589.568]  generic_shutdown_super+0x73/0x1a0
        [589.569]  kill_anon_super+0x14/0x30
        [589.569]  btrfs_kill_super+0x12/0x20 [btrfs]
        [589.569]  deactivate_locked_super+0x2e/0x70
        [589.569]  cleanup_mnt+0x104/0x160
        [589.570]  task_work_run+0x56/0x90
        [589.570]  exit_to_user_mode_prepare+0x160/0x170
        [589.570]  syscall_exit_to_user_mode+0x22/0x50
        [589.570]  ? __x64_sys_umount+0x12/0x20
        [589.571]  do_syscall_64+0x48/0x90
        [589.571]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [589.571] RIP: 0033:0x7f497ff0a567
        [589.571] Code: af 98 0e (...)
        [589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
        [589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
        [589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
        [589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        [589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
        [589.573]  </TASK>
        [589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
        [589.576] ---[ end trace 0000000000000000 ]---
      
      Fix this by adding a runtime flag to the block group to tell that the
      block group is still in the list of new block groups, and therefore it
      should not be moved to the list of unused block groups, at
      btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
      creation of the block group at btrfs_create_pending_block_groups().
      
      Fixes: a9f18971 ("btrfs: move out now unused BG from the reclaim list")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0657b20c
    • Christoph Hellwig's avatar
      btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block · 4e7de35e
      Christoph Hellwig authored
      
      
      The mirror_num_ret is allowed to be NULL, although it has to be set when
      smap is set.  Unfortunately that is not a well enough specifiable
      invariant for static type checkers, so add a NULL check to make sure they
      are fine.
      
      Fixes: 03793cbb ("btrfs: add fast path for single device io in __btrfs_map_block")
      Reported-by: default avatarDan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4e7de35e
    • Josef Bacik's avatar
      btrfs: fix race between balance and cancel/pause · b19c98f2
      Josef Bacik authored
      
      
      Syzbot reported a panic that looks like this:
      
        assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED, in fs/btrfs/ioctl.c:465
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/messages.c:259!
        RIP: 0010:btrfs_assertfail+0x2c/0x30 fs/btrfs/messages.c:259
        Call Trace:
         <TASK>
         btrfs_exclop_balance fs/btrfs/ioctl.c:465 [inline]
         btrfs_ioctl_balance fs/btrfs/ioctl.c:3564 [inline]
         btrfs_ioctl+0x531e/0x5b30 fs/btrfs/ioctl.c:4632
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:870 [inline]
         __se_sys_ioctl fs/ioctl.c:856 [inline]
         __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The reproducer is running a balance and a cancel or pause in parallel.
      The way balance finishes is a bit wonky, if we were paused we need to
      save the balance_ctl in the fs_info, but clear it otherwise and cleanup.
      However we rely on the return values being specific errors, or having a
      cancel request or no pause request.  If balance completes and returns 0,
      but we have a pause or cancel request we won't do the appropriate
      cleanup, and then the next time we try to start a balance we'll trip
      this ASSERT.
      
      The error handling is just wrong here, we always want to clean up,
      unless we got -ECANCELLED and we set the appropriate pause flag in the
      exclusive op.  With this patch the reproducer ran for an hour without
      tripping, previously it would trip in less than a few minutes.
      
      Reported-by: default avatar <syzbot+c0f3acf145cb465426d5@syzkaller.appspotmail.com>
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b19c98f2
  9. Jun 20, 2023
    • Filipe Manana's avatar
      btrfs: fix race between quota disable and relocation · 8a4a0b2a
      Filipe Manana authored
      
      
      If we disable quotas while we have a relocation of a metadata block group
      that has extents belonging to the quota root, we can cause the relocation
      to fail with -ENOENT. This is because relocation builds backref nodes for
      extents of the quota root and later needs to walk the backrefs and access
      the quota root - however if in between a task disables quotas, it results
      in deleting the quota root from the root tree (with btrfs_del_root(),
      called from btrfs_quota_disable().
      
      This can be sporadically triggered by test case btrfs/255 from fstests:
      
        $ ./check btrfs/255
        FSTYP         -- btrfs
        PLATFORM      -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023
        MKFS_OPTIONS  -- /dev/sdc
        MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
      
        btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg)
        - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad)
            --- tests/btrfs/255.out	2023-03-02 21:47:53.876609426 +0000
            +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad	2023-06-16 10:20:39.267563212 +0100
            @@ -1,2 +1,4 @@
             QA output created by 255
            +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory
            +There may be more info in syslog - try dmesg | tail
             Silence is golden
            ...
            (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad'  to see the entire diff)
        Ran: btrfs/255
        Failures: btrfs/255
        Failed 1 of 1 tests
      
      To fix this make the quota disable operation take the cleaner mutex, as
      relocation of a block group also takes this mutex. This is also what we
      do when deleting a subvolume/snapshot, we take the cleaner mutex in the
      cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root()
      at btrfs_drop_snapshot() while under the protection of the cleaner mutex.
      
      Fixes: bed92eae ("Btrfs: qgroup implementation and prototypes")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a4a0b2a
    • Filipe Manana's avatar
      btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots · 08eb2ad9
      Filipe Manana authored
      
      
      Add a comment to struct btrfs_fs_info::dirty_cowonly_roots to mention
      that struct btrfs_fs_info::trans_lock is the lock that protects that
      list.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08eb2ad9
    • Filipe Manana's avatar
      btrfs: fix race when deleting free space root from the dirty cow roots list · babebf02
      Filipe Manana authored
      
      
      When deleting the free space tree we are deleting the free space root
      from the list fs_info->dirty_cowonly_roots without taking the lock that
      protects it, which is struct btrfs_fs_info::trans_lock.
      This unsynchronized list manipulation may cause chaos if there's another
      concurrent manipulation of this list, such as when adding a root to it
      with ctree.c:add_root_to_dirty_list().
      
      This can result in all sorts of weird failures caused by a race, such as
      the following crash:
      
        [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
        [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.279928] Code: 85 38 06 00 (...)
        [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
        [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
        [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
        [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
        [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
        [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
        [337571.281723] FS:  00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
        [337571.281950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
        [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [337571.282874] Call Trace:
        [337571.283101]  <TASK>
        [337571.283327]  ? __die_body+0x1b/0x60
        [337571.283570]  ? die_addr+0x39/0x60
        [337571.283796]  ? exc_general_protection+0x22e/0x430
        [337571.284022]  ? asm_exc_general_protection+0x22/0x30
        [337571.284251]  ? commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.284531]  btrfs_commit_transaction+0x42e/0xf90 [btrfs]
        [337571.284803]  ? _raw_spin_unlock+0x15/0x30
        [337571.285031]  ? release_extent_buffer+0x103/0x130 [btrfs]
        [337571.285305]  reset_balance_state+0x152/0x1b0 [btrfs]
        [337571.285578]  btrfs_balance+0xa50/0x11e0 [btrfs]
        [337571.285864]  ? __kmem_cache_alloc_node+0x14a/0x410
        [337571.286086]  btrfs_ioctl+0x249a/0x3320 [btrfs]
        [337571.286358]  ? mod_objcg_state+0xd2/0x360
        [337571.286577]  ? refill_obj_stock+0xb0/0x160
        [337571.286798]  ? seq_release+0x25/0x30
        [337571.287016]  ? __rseq_handle_notify_resume+0x3ba/0x4b0
        [337571.287235]  ? percpu_counter_add_batch+0x2e/0xa0
        [337571.287455]  ? __x64_sys_ioctl+0x88/0xc0
        [337571.287675]  __x64_sys_ioctl+0x88/0xc0
        [337571.287901]  do_syscall_64+0x38/0x90
        [337571.288126]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [337571.288352] RIP: 0033:0x7f478aaffe9b
      
      So fix this by locking struct btrfs_fs_info::trans_lock before deleting
      the free space root from that list.
      
      Fixes: a5ed9182 ("Btrfs: implement the free space B-tree")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      babebf02
    • Filipe Manana's avatar
      btrfs: fix race when deleting quota root from the dirty cow roots list · b31cb5a6
      Filipe Manana authored
      
      
      When disabling quotas we are deleting the quota root from the list
      fs_info->dirty_cowonly_roots without taking the lock that protects it,
      which is struct btrfs_fs_info::trans_lock. This unsynchronized list
      manipulation may cause chaos if there's another concurrent manipulation
      of this list, such as when adding a root to it with
      ctree.c:add_root_to_dirty_list().
      
      This can result in all sorts of weird failures caused by a race, such as
      the following crash:
      
        [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
        [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.279928] Code: 85 38 06 00 (...)
        [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
        [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
        [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
        [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
        [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
        [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
        [337571.281723] FS:  00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
        [337571.281950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
        [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [337571.282874] Call Trace:
        [337571.283101]  <TASK>
        [337571.283327]  ? __die_body+0x1b/0x60
        [337571.283570]  ? die_addr+0x39/0x60
        [337571.283796]  ? exc_general_protection+0x22e/0x430
        [337571.284022]  ? asm_exc_general_protection+0x22/0x30
        [337571.284251]  ? commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.284531]  btrfs_commit_transaction+0x42e/0xf90 [btrfs]
        [337571.284803]  ? _raw_spin_unlock+0x15/0x30
        [337571.285031]  ? release_extent_buffer+0x103/0x130 [btrfs]
        [337571.285305]  reset_balance_state+0x152/0x1b0 [btrfs]
        [337571.285578]  btrfs_balance+0xa50/0x11e0 [btrfs]
        [337571.285864]  ? __kmem_cache_alloc_node+0x14a/0x410
        [337571.286086]  btrfs_ioctl+0x249a/0x3320 [btrfs]
        [337571.286358]  ? mod_objcg_state+0xd2/0x360
        [337571.286577]  ? refill_obj_stock+0xb0/0x160
        [337571.286798]  ? seq_release+0x25/0x30
        [337571.287016]  ? __rseq_handle_notify_resume+0x3ba/0x4b0
        [337571.287235]  ? percpu_counter_add_batch+0x2e/0xa0
        [337571.287455]  ? __x64_sys_ioctl+0x88/0xc0
        [337571.287675]  __x64_sys_ioctl+0x88/0xc0
        [337571.287901]  do_syscall_64+0x38/0x90
        [337571.288126]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [337571.288352] RIP: 0033:0x7f478aaffe9b
      
      So fix this by locking struct btrfs_fs_info::trans_lock before deleting
      the quota root from that list.
      
      Fixes: bed92eae ("Btrfs: qgroup implementation and prototypes")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b31cb5a6
    • Naohiro Aota's avatar
      btrfs: tracepoints: also show actual number of the outstanding extents · 64425500
      Naohiro Aota authored
      
      
      The btrfs_inode_mod_outstanding_extents trace event only shows the modified
      number to the number of outstanding extents. It would be helpful if we can
      see the resulting extent number as well.
      
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64425500
  10. Jun 19, 2023
    • Jeff Layton's avatar
      btrfs: update i_version in update_dev_time · c9e561c4
      Jeff Layton authored
      
      
      When updating the ctime, we also want to update i_version.
      
      This is just something I noticed by inspection. There is probably no way
      to test this today unless you can somehow get to this inode via nfsd.
      Still, I think it's the right thing to do for consistency's sake.
      
      David Sterba's comment: I don't see anything wrong with setting the
      iversion bit, however I also don't see where this would be useful.
      Agreed with the consistency, otherwise the time is updated when device
      super block is wiped or a device initialized, both are big events so
      missing that due to lack of iversion update seems unlikely. I'll add it
      to the queue, thanks.
      
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      [ add comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9e561c4
    • Ben Dooks's avatar
      btrfs: make btrfs_compressed_bioset static · e794203e
      Ben Dooks authored
      
      
      The 'btrfs_compressed_bioset' struct isn't exported outside of the
      fs/btrfs/compression.c file, so make it static to fix the following
      sparse warning:
      
      fs/btrfs/compression.c:40:16: warning: symbol 'btrfs_compressed_bioset' was not declared. Should it be static?
      
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarBen Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e794203e
    • Matt Corallo's avatar
      btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile · 160fe8f6
      Matt Corallo authored
      
      
      Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
      one allocation profile flag, and failing to do so may ultimately
      result in a WARN_ON and remount-ro when allocating new blocks, like
      the below transaction abort on 6.1.
      
      `btrfs_reduce_alloc_profile` has two ways of determining the profile,
      first it checks if a conversion balance is currently running and
      uses the profile we're converting to. If no balance is currently
      running, it returns the max-redundancy profile which at least one
      block in the selected block group has.
      
      This works by simply checking each known allocation profile bit in
      redundancy order. However, `btrfs_reduce_alloc_profile` has not been
      updated as new flags have been added - first with the `DUP` profile
      and later with the RAID1C34 profiles.
      
      Because of the way it checks, if we have blocks with different
      profiles and at least one is known, that profile will be selected.
      However, if none are known we may return a flag set with multiple
      allocation profiles set.
      
      This is currently only possible when a balance from one of the three
      unhandled profiles to another of the unhandled profiles is canceled
      after allocating at least one block using the new profile.
      
      In that case, a transaction abort like the below will occur and the
      filesystem will need to be mounted with -o skip_balance to get it
      mounted rw again (but the balance cannot be resumed without a
      similar abort).
      
        [770.648] ------------[ cut here ]------------
        [770.648] BTRFS: Transaction aborted (error -22)
        [770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
        [770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G        W 6.1.0-0.deb11.7-powerpc64le #1  Debian 6.1.20-2~bpo11+1a~test
        [770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
        [770.648] NIP:  c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
        [770.648] REGS: c000200089afe9a0 TRAP: 0700   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
        [770.648] MSR:  9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 28848282  XER: 20040000
        [770.648] CFAR: c000000000135110 IRQMASK: 0
      	    GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
      	    GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
      	    GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
      	    GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
      	    GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
      	    GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
      	    GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
      	    GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
        [770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
        [770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
        [770.648] Call Trace:
        [770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
        [770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
        [770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
        [770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
        [770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
        [770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
        [770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
        [770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
        [770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
        [770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
        [770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
        [770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
        [770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
        [770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
        [770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
        [770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
        [770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
        [770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
        [770.648] --- interrupt: c00 at 0x7fff94126800
        [770.648] NIP:  00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
        [770.648] REGS: c000200089affe80 TRAP: 0c00   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
        [770.648] MSR:  900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE>  CR: 24002848  XER: 00000000
        [770.648] IRQMASK: 0
      	    GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
      	    GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
      	    GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
      	    GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
      	    GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
      	    GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
      	    GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
      	    GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
        [770.648] NIP [00007fff94126800] 0x7fff94126800
        [770.648] LR [0000000107e0b594] 0x107e0b594
        [770.648] --- interrupt: c00
        [770.648] Instruction dump:
        [770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
        [770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
        [770.648] ---[ end trace 0000000000000000 ]---
        [770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
        [770.648] BTRFS info (device dm-2: state EA): forced readonly
        [770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
        [770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
        [770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
        [770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
      
      Fixes: 47e6f742 ("btrfs: add support for 3-copy replication (raid1c3)")
      Fixes: 8d6fac00 ("btrfs: add support for 4-copy replication (raid1c4)")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: default avatarMatt Corallo <blnxfsl@bluematt.me>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      160fe8f6
    • Qu Wenruo's avatar
      btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers · 81db6ae8
      Qu Wenruo authored
      
      
      Since the scrub rework introduced by commit 2af2aaf9 ("btrfs:
      scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
      and later commits, scrub only needs one single workqueue,
      fs_info::scrub_worker.
      
      That scrub_wr_completion_workers is initially to handle the delay work
      after write bios finished.  But the new scrub code goes submit-and-wait
      for write bios, thus all the work are done inside the scrub_worker.
      
      The last user of fs_info::scrub_wr_completion_workers is removed in
      commit 16f93993 ("btrfs: scrub: remove the old writeback
      infrastructure"), so we can safely remove the workqueue.
      
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      81db6ae8
    • Qu Wenruo's avatar
      btrfs: scrub: remove scrub_ctx::csum_list member · c2bbc0ba
      Qu Wenruo authored
      
      
      Since the rework of scrub introduced by commit 2af2aaf9 ("btrfs:
      scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
      and later commits, scrub no longer keeps its data checksum inside sctx.
      
      Instead we have scrub_stripe::csums for the checksum of the stripe.
      Thus we can remove the unused scrub_ctx::csum_list member.
      
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2bbc0ba
    • Filipe Manana's avatar
      btrfs: do not BUG_ON after failure to migrate space during truncation · 6822b3f7
      Filipe Manana authored
      
      
      During truncation we reserve 2 metadata units when starting a transaction
      (reserved space goes to fs_info->trans_block_rsv) and then attempt to
      migrate 1 unit (min_size bytes) from fs_info->trans_block_rsv into the
      local block reserve. If we ever fail we trigger a BUG_ON(), which should
      never happen, because we reserved 2 units. However if we happen to fail
      for some reason, we don't need to be so dire and hit a BUG_ON(), we can
      just error out the truncation and, since this is highly unexpected,
      surround the error check with WARN_ON(), to get an informative stack
      trace and tag the branh as 'unlikely'. Also make the 'min_size' variable
      const, since it's not supposed to ever change and any accidental change
      could possibly make the space migration not so unlikely to fail.
      
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6822b3f7
    • Filipe Manana's avatar
      btrfs: do not BUG_ON on failure to get dir index for new snapshot · df9f2782
      Filipe Manana authored
      
      
      During the transaction commit path, at create_pending_snapshot(), there
      is no need to BUG_ON() in case we fail to get a dir index for the snapshot
      in the parent directory. This should fail very rarely because the parent
      inode should be loaded in memory already, with the respective delayed
      inode created and the parent inode's index_cnt field already initialized.
      
      However if it fails, it may be -ENOMEM like the comment at
      create_pending_snapshot() says or any error returned by
      btrfs_search_slot() through btrfs_set_inode_index_count(), which can be
      pretty much anything such as -EIO or -EUCLEAN for example. So the comment
      is not correct when it says it can only be -ENOMEM.
      
      However doing a BUG_ON() here is overkill, since we can instead abort
      the transaction and return the error. Note that any error returned by
      create_pending_snapshot() will eventually result in a transaction
      abort at cleanup_transaction(), called from btrfs_commit_transaction(),
      but we can explicitly abort the transaction at this point instead so that
      we get a stack trace to tell us that the call to btrfs_set_inode_index()
      failed.
      
      So just abort the transaction and return in case btrfs_set_inode_index()
      returned an error at create_pending_snapshot().
      
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      df9f2782