  1. Jan 31, 2022
    • btrfs: tree-checker: check item_size for inode_item · 0c982944
      Su Yue authored
      While mounting the crafted image, an out-of-bounds access happens:
      
        [350.429619] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
        [350.429636] index 1048096 is out of range for type 'page *[16]'
        [350.429650] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4 #1
        [350.429652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
        [350.429653] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
        [350.429772] Call Trace:
        [350.429774]  <TASK>
        [350.429776]  dump_stack_lvl+0x47/0x5c
        [350.429780]  ubsan_epilogue+0x5/0x50
        [350.429786]  __ubsan_handle_out_of_bounds+0x66/0x70
        [350.429791]  btrfs_get_16+0xfd/0x120 [btrfs]
        [350.429832]  check_leaf+0x754/0x1a40 [btrfs]
        [350.429874]  ? filemap_read+0x34a/0x390
        [350.429878]  ? load_balance+0x175/0xfc0
        [350.429881]  validate_extent_buffer+0x244/0x310 [btrfs]
        [350.429911]  btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
        [350.429935]  end_bio_extent_readpage+0x3af/0x850 [btrfs]
        [350.429969]  ? newidle_balance+0x259/0x480
        [350.429972]  end_workqueue_fn+0x29/0x40 [btrfs]
        [350.429995]  btrfs_work_helper+0x71/0x330 [btrfs]
        [350.430030]  ? __schedule+0x2fb/0xa40
        [350.430033]  process_one_work+0x1f6/0x400
        [350.430035]  ? process_one_work+0x400/0x400
        [350.430036]  worker_thread+0x2d/0x3d0
        [350.430037]  ? process_one_work+0x400/0x400
        [350.430038]  kthread+0x165/0x190
        [350.430041]  ? set_kthread_struct+0x40/0x40
        [350.430043]  ret_from_fork+0x1f/0x30
        [350.430047]  </TASK>
        [350.430077] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
      
      check_leaf() is checking the leaf:
      
        corrupt leaf: root=4 block=29396992 slot=1, bad key order, prev (16140901064495857664 1 0) current (1 204 12582912)
        leaf 29396992 items 6 free space 3565 generation 6 owner DEV_TREE
        leaf 29396992 flags 0x1(WRITTEN) backref revision 1
        fs uuid a62e00e8-e94e-4200-8217-12444de93c2e
        chunk uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
      	  item 0 key (16140901064495857664 INODE_ITEM 0) itemoff 3955 itemsize 40
      		  generation 0 transid 0 size 0 nbytes 17592186044416
      		  block group 0 mode 52667 links 33 uid 0 gid 2104132511 rdev 94223634821136
      		  sequence 100305 flags 0x2409000(none)
      		  atime 0.0 (1970-01-01 08:00:00)
      		  ctime 2973280098083405823.4294967295 (-269783007-01-01 21:37:03)
      		  mtime 18446744071572723616.4026825121 (1902-04-16 12:40:00)
      		  otime 9249929404488876031.4294967295 (622322949-04-16 04:25:58)
      	  item 1 key (1 DEV_EXTENT 12582912) itemoff 3907 itemsize 48
      		  dev extent chunk_tree 3
      		  chunk_objectid 256 chunk_offset 12582912 length 8388608
      		  chunk_tree_uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
      
      The corrupted leaf of the device tree contains an inode item. The leaf
      passed the checksum and the other checks in validate_extent_buffer()
      until it reached check_leaf_item(). Because the key type is
      BTRFS_INODE_ITEM, check_inode_item() is called even though we are in the
      device tree. Since item offset + sizeof(struct btrfs_inode_item) >
      eb->len, the out-of-bounds access is triggered.
      
      The item end vs leaf boundary check has already been done before
      check_leaf_item(), so fix this by checking the item size in
      check_inode_item() before accessing the inode item in the extent buffer.
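
      As an illustration of that kind of guard (a standalone model, not the
      actual btrfs patch; the struct and helper names below are invented for
      the demo), the idea is to reject the item before reading its payload as
      the expected structure:

        /* Standalone model of an item size guard; not kernel code. */
        #include <errno.h>
        #include <stdint.h>
        #include <stdio.h>

        struct demo_inode_item {          /* stand-in for struct btrfs_inode_item */
                uint64_t generation, transid, size, nbytes;
                uint32_t mode, uid, gid;
        };

        struct demo_item {                /* stand-in for a leaf item header */
                uint32_t offset;          /* start of the item data in the leaf */
                uint32_t size;            /* recorded size of the item data */
        };

        /* Refuse to read the payload as an inode item if it is too small. */
        static int check_inode_item_size(const struct demo_item *item)
        {
                if (item->size < sizeof(struct demo_inode_item)) {
                        fprintf(stderr, "invalid item size: has %u expect %zu\n",
                                (unsigned)item->size, sizeof(struct demo_inode_item));
                        return -EUCLEAN;  /* the error code the tree-checker uses */
                }
                return 0;
        }

        int main(void)
        {
                struct demo_item bad = { .offset = 3955, .size = 40 };
                struct demo_item ok = { .offset = 0, .size = sizeof(struct demo_inode_item) };

                printf("bad item:  %d\n", check_inode_item_size(&bad));   /* rejected */
                printf("good item: %d\n", check_inode_item_size(&ok));    /* accepted */
                return 0;
        }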
      
      The other check functions called from check_leaf_item(), except
      check_dev_item(), already have their own item size checks. A follow-up
      commit adds the corresponding check for check_dev_item().
      
      No regressions were observed while running fstests.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
      
      
      CC: stable@vger.kernel.org # 5.10+
      CC: Wenqing Liu <wenqingliu0120@gmail.com>
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock between quota disable and qgroup rescan worker · e804861b
      Shin'ichiro Kawasaki authored
      The quota disable ioctl starts a transaction before waiting for the
      qgroup rescan worker to complete. However, this wait can be infinite and
      result in a deadlock because of a circular dependency among the quota
      disable ioctl, the qgroup rescan worker and another task holding a
      transaction, such as the block group relocation task.
      
      The deadlock happens with the following steps (a simplified standalone
      model of the circular wait is sketched after the list):
      
      1) Task A calls the ioctl to disable quota. It starts a transaction and
         waits for the qgroup rescan worker to complete.
      2) Task B, such as the block group relocation task, starts a transaction
         and joins the transaction that task A started. Then task B commits
         that transaction. In this commit, task B waits for a commit by task A.
      3) Task C, the qgroup rescan worker, starts its job and starts a
         transaction. In this transaction start, task C waits for the
         completion of the transaction that task A started and task B
         committed.
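
      For illustration only, here is a minimal standalone model of such a
      circular wait; it has nothing btrfs-specific (plain mutexes stand in for
      the transaction commit/completion waits) and it intentionally hangs when
      run:

        /* Three-way circular wait demo, not btrfs code. Build: cc demo.c -lpthread */
        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        static pthread_mutex_t res[3] = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
                PTHREAD_MUTEX_INITIALIZER
        };

        /* Task i grabs resource i, then blocks on resource (i + 1) % 3, which
         * is already held by another task -- the same shape as the waits among
         * tasks A, B and C above. */
        static void *task(void *arg)
        {
                int i = *(int *)arg;

                pthread_mutex_lock(&res[i]);
                fprintf(stderr, "task %c holds %d, waits for %d\n",
                        'A' + i, i, (i + 1) % 3);
                sleep(1);                               /* let every task grab its own lock */
                pthread_mutex_lock(&res[(i + 1) % 3]);  /* blocks forever: deadlock */
                return NULL;
        }

        int main(void)
        {
                pthread_t t[3];
                int id[3] = { 0, 1, 2 };

                for (int i = 0; i < 3; i++)
                        pthread_create(&t[i], NULL, task, &id[i]);
                for (int i = 0; i < 3; i++)
                        pthread_join(t[i], NULL);       /* never returns */
                return 0;
        }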
      
      This deadlock was found with fstests test case btrfs/115 a...
    • btrfs: don't start transaction for scrub if the fs is mounted read-only · 2d192fc4
      Qu Wenruo authored
      [BUG]
      The following super simple script would crash btrfs at unmount time, if
      CONFIG_BTRFS_ASSERT is set.
      
       mkfs.btrfs -f $dev
       mount $dev $mnt
       xfs_io -f -c "pwrite 0 4k" $mnt/file
       umount $mnt
       mount -o ro $dev $mnt
       btrfs scrub start -Br $mnt
       umount $mnt
      
      This will trigger the ASSERT() introduced by commit 0a31daa4 ("btrfs:
      add assertion for empty list of transactions at late stage of umount").
      
      That patch is definitely not the cause, it just makes the problem noisy
      enough for developers to notice.
      
      [CAUSE]
      We will start a transaction in the following call chain during scrub:
      
        scrub_enumerate_chunks()
        |- btrfs_inc_block_group_ro()
           |- btrfs_join_transaction()
      
      However, for a read-only mount there is no running transaction at all,
      thus btrfs_join_transaction() will start a new transaction.

      Furthermore, since it's a read-only mount, btrfs_sync_fs() will not call
      btrfs_commit_super() to commit the new but empty transaction.

      This leads to the ASSERT().
      
      The bug has been there for a long time. Only the new ASSERT() makes it
      noisy enough to be noticed.
      
      [FIX]
      For read-only scrub on read-only mount, there is no need to start a
      transaction nor to allocate new chunks in btrfs_inc_block_group_ro().
      
      Just do an extra read-only mount check in btrfs_inc_block_group_ro(),
      and if the filesystem is read-only, skip all chunk allocation and call
      inc_block_group_ro() directly.
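
      As a rough control-flow sketch only (standalone model; the function and
      field names are placeholders, not the btrfs API):

        /* Standalone sketch of "skip chunk allocation on a read-only mount". */
        #include <stdbool.h>
        #include <stdio.h>

        struct demo_fs { bool readonly; };
        struct demo_block_group { bool ro; };

        static int mark_ro(struct demo_block_group *bg)
        {
                bg->ro = true;          /* stands in for inc_block_group_ro() */
                return 0;
        }

        static int demo_inc_block_group_ro(struct demo_fs *fs, struct demo_block_group *bg)
        {
                if (fs->readonly)
                        /* RO mount: nothing new can be written, so do not start
                         * a transaction to allocate chunks; just mark the block
                         * group read-only. */
                        return mark_ro(bg);

                printf("allocating chunk (this path needs a transaction)\n");
                return mark_ro(bg);
        }

        int main(void)
        {
                struct demo_fs ro = { .readonly = true }, rw = { .readonly = false };
                struct demo_block_group bg = { 0 };

                demo_inc_block_group_ro(&ro, &bg);  /* no output: no chunk allocation */
                demo_inc_block_group_ro(&rw, &bg);  /* prints the allocation message */
                return 0;
        }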
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. Jan 25, 2022
    • btrfs: update writeback index when starting defrag · 27cdfde1
      Filipe Manana authored
      When starting a defrag, we should update the writeback index of the
      inode's mapping in case it currently has a value beyond the start of the
      range we are defragging. This can help performance and often results in
      getting fewer extents after writeback. For example, if the current value
      of the writeback index sits somewhere in the middle of a range that gets
      dirtied by the defrag, then after writeback we can get two smaller
      extents instead of a single, larger extent.
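
      As a tiny standalone sketch of that clamp (illustration only; the real
      code operates on the inode's mapping, not a bare integer):

        /* Sketch: pull the writeback index back to the start of the defrag range. */
        #include <stdio.h>

        int main(void)
        {
                unsigned long writeback_index = 1000;   /* page index where writeback would resume */
                unsigned long defrag_start_index = 100; /* first page of the range being defragged */

                /* Only move it backwards; never push writeback past where it already is. */
                if (writeback_index > defrag_start_index)
                        writeback_index = defrag_start_index;

                printf("writeback resumes at page %lu\n", writeback_index);
                return 0;
        }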
      
      We used to have this before the refactoring in 5.16, but it was removed
      without any reason to do so. Originally it was added in kernel 3.1, by
      commit 2a0f7f57 ("Btrfs: fix recursive auto-defrag"), in order to
      fix a loop with autodefrag resulting in dirtying and writing pages over
      and over. However, some testing on the current code did not show that
      happening, at least with the test described in that commit.
      
      So add back the behaviour, as at the very least it is a nice-to-have
      optimization.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add back missing dirty page rate limiting to defrag · 3c9d31c7
      Filipe Manana authored
      A defrag operation can dirty a lot of pages, especially if operating on
      the entire file or a large file range. Any task dirtying pages should
      periodically call balance_dirty_pages_ratelimited(), as stated in that
      function's comments, otherwise they can leave too many dirty pages in
      the system. This is what we did before the refactoring in 5.16, and
      it should have remained, just like in the buffered write path and
      relocation. So restore that behaviour.
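
      A minimal standalone sketch of the pattern (illustration only;
      throttle_dirty_pages() below is just a placeholder for
      balance_dirty_pages_ratelimited()):

        /* Sketch: a loop that keeps dirtying pages should throttle periodically. */
        #include <stdio.h>

        static void throttle_dirty_pages(void)
        {
                /* Placeholder for balance_dirty_pages_ratelimited(): it may
                 * block the caller when too many pages are dirty system-wide. */
        }

        int main(void)
        {
                const unsigned long clusters = 16;      /* pretend 16 x 256K clusters */

                for (unsigned long i = 0; i < clusters; i++) {
                        /* ... dirty the pages of cluster i here ... */
                        throttle_dirty_pages();         /* once per batch, as in buffered writes */
                        printf("cluster %lu dirtied and throttled\n", i);
                }
                return 0;
        }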
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock when reserving space during defrag · 0cb5950f
      Filipe Manana authored
      When defragging we can end up collecting a range for defrag that already
      has pages under delalloc (dirty), as long as the respective extent map
      for the range is not mapped to a hole or a prealloc extent and the
      extent map is not from an old generation.
      
      Most of the time that is harmless, at least from a functional
      perspective; however, it can result in a deadlock:
      
      1) At defrag_collect_targets() we find an extent map that meets all
         requirements, but there's delalloc for the range it covers, and we
         add its range to the list of ranges to defrag;
      
      2) The defrag_collect_targets() function is called at defrag_one_range(),
         after it has locked a range that overlaps the range of the extent map;

      3) At defrag_one_range(), while the range is still locked, we call
         defrag_one_locked_target() for the range associated with the extent
         map we collected at step 1);
      
      4) Then finally at defrag_one_locked_target() we do a call to
         btrfs_delalloc_reserve_space(), which will reserve data and metadata
         space. If the space reservations cannot be satisfied right away, the
         flusher might kick in and start flushing delalloc, waiting for the
         respective ordered extents to complete. If this happens we will
         deadlock, because both flushing delalloc and finishing an ordered
         extent require locking the range in the inode's io tree, which was
         already locked at defrag_collect_targets().
      
      So fix this by skipping extent maps for which there's already delalloc.
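
      As a simplified standalone model of that idea (not the btrfs code; a
      plain array stands in for the io tree's delalloc state), candidate
      ranges that overlap an already-dirty (delalloc) range are simply
      skipped:

        /* Model: skip defrag candidates that overlap a delalloc (dirty) range. */
        #include <stdbool.h>
        #include <stdio.h>

        struct range { unsigned long start, len; };

        /* Delalloc ranges already present for the inode (stand-in data). */
        static const struct range delalloc[] = { { 4096, 8192 }, { 65536, 4096 } };

        static bool range_has_delalloc(const struct range *r)
        {
                for (unsigned i = 0; i < sizeof(delalloc) / sizeof(delalloc[0]); i++) {
                        unsigned long ds = delalloc[i].start;
                        unsigned long de = ds + delalloc[i].len;

                        if (r->start < de && ds < r->start + r->len)
                                return true;    /* overlap: reserving space here may deadlock */
                }
                return false;
        }

        int main(void)
        {
                const struct range cand[] = { { 0, 4096 }, { 4096, 4096 }, { 131072, 4096 } };

                for (unsigned i = 0; i < sizeof(cand) / sizeof(cand[0]); i++) {
                        if (range_has_delalloc(&cand[i])) {
                                printf("skip   [%lu, +%lu): already under delalloc\n",
                                       cand[i].start, cand[i].len);
                                continue;
                        }
                        printf("defrag [%lu, +%lu)\n", cand[i].start, cand[i].len);
                }
                return 0;
        }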
      
      Fixes: eb793cf8 ("btrfs: defrag: introduce helper to collect target file extents")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. Jan 20, 2022
    • btrfs: defrag: properly update range->start for autodefrag · c080b414
      Qu Wenruo authored
      [BUG]
      After commit 7b508037 ("btrfs: defrag: use defrag_one_cluster() to
      implement btrfs_defrag_file()"), autodefrag no longer properly
      re-defrags the file from the previously finished location.
      
      [CAUSE]
      The recent refactoring of defrag only focused on subpage support for the
      defrag ioctl and didn't take autodefrag into consideration.
      
      There are two problems involved which prevent autodefrag from restarting
      its scan:
      
      - No range.start update
        Previously when one defrag target was found, range->start would be
        updated to indicate where the next search should start from.

        But now btrfs_defrag_file() doesn't update it anymore, making
        autodefrag always rescan from file offset 0.

        This would also make autodefrag mark the same range dirty again and
        again, causing extra IO.
      
      - No proper quick exit for defrag_one_cluster()
        Currently if we reach or exceed the @max_sectors limit, we just exit
        defrag_one_cluster() and let the next defrag_one_cluster() call do a
        quick exit.
        This makes @cur increase regardless, thus there is no way to properly
        know which range was defragged and which range was skipped.
      
      [FIX]
      The fix involves the following modifications; a simplified standalone
      model is sketched after this list:
      
      - Update range->start to next cluster start
        This is a little different from the old behavior.
        Previously range->start was updated to the next defrag target.

        But in the end, the behavior should still be pretty much the same,
        as now we skip to the next defrag target inside btrfs_defrag_file().

        Thus if autodefrag decides to re-scan, we still do the skip, just at
        a different time.
      
      - Make defrag_one_cluster() return >0 to indicate a quick exit
        So that btrfs_defrag_file() can also do a quick exit, without
        increasing @cur to the range end, and can re-use @cur to update
        @range->start.
      
      - Add comment for btrfs_defrag_file() to mention the range->start update
        Currently only autodefrag utilizes this behavior, as the defrag ioctl
        doesn't set the @max_to_defrag parameter, thus unless interrupted it
        will always try to defrag the whole range.
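
      For illustration, a standalone model of the two behaviours (resuming
      from range_start across calls, and a >0 "quick exit" return once the
      sector budget is used up); all names are invented, this is not the
      btrfs code:

        /* Model of autodefrag resume: each call continues from range_start and
         * stops early (returning 1) once max_sectors have been processed. */
        #include <stdio.h>

        #define SECTORSIZE      4096UL
        #define CLUSTER         (256UL * 1024)

        /* Returns 1 for a quick exit (budget used up), 0 when the end was reached. */
        static int demo_defrag_file(unsigned long *range_start, unsigned long file_end,
                                    unsigned long max_sectors)
        {
                unsigned long cur = *range_start;
                unsigned long sectors_defragged = 0;

                while (cur < file_end) {
                        if (sectors_defragged >= max_sectors) {
                                *range_start = cur;     /* next call resumes here */
                                return 1;               /* quick exit */
                        }
                        /* pretend the whole cluster got defragged */
                        sectors_defragged += CLUSTER / SECTORSIZE;
                        cur += CLUSTER;
                }
                *range_start = file_end;
                return 0;
        }

        int main(void)
        {
                unsigned long start = 0, end = 8 * CLUSTER;
                int ret;

                do {
                        ret = demo_defrag_file(&start, end, 128 /* sector budget per call */);
                        printf("pass done, next range_start=%lu ret=%d\n", start, ret);
                } while (ret > 0);
                return 0;
        }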
      
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: fix wrong number of defragged sectors · 484167da
      Qu Wenruo authored
      [BUG]
      There are users of the autodefrag mount option reporting an obvious
      increase in IO:
      
      > If I compare the write average (in total, I don't have it per process)
      > when taking idle periods on the same machine:
      >     Linux 5.16:
      >         without autodefrag: ~ 10KiB/s
      >         with autodefrag: between 1 and 2MiB/s.
      >
      >     Linux 5.15:
      >         with autodefrag:~ 10KiB/s (around the same as without
      > autodefrag on 5.16)
      
      [CAUSE]
      When the autodefrag mount option is enabled, btrfs_defrag_file() will be
      called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many
      sectors we can defrag in one try.

      The number of sectors defragged is then used to determine if we need to
      re-defrag.
      
      But commit b18c3ab2 ("btrfs: defrag: introduce helper to defrag one
      cluster") uses the wrong unit to increase @sectors_defragged, which
      should be counted in sectors, not bytes.
      
      This means that if we have defragged any sector, then @sectors_defragged
      will be >= sectorsize (normally 4096), which is larger than
      BTRFS_DEFRAG_BATCH.

      This makes the @max_sectors check in defrag_one_cluster() underflow,
      rendering the whole @max_sectors check useless.

      Thus autodefrag causes way more IO, as there is now effectively no limit
      on how many sectors can really be defragged.
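
      To make the unit mix-up concrete, here is a tiny standalone
      demonstration (not the btrfs code) of what happens when a byte count is
      fed into a sector budget:

        /* Demo: subtracting a byte count from a sector budget underflows u64. */
        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        #define BTRFS_DEFRAG_BATCH      1024ULL         /* sector budget per autodefrag run */
        #define SECTORSIZE              4096ULL

        int main(void)
        {
                uint64_t max_sectors = BTRFS_DEFRAG_BATCH;
                uint64_t defragged_bytes = 16 * SECTORSIZE;     /* 16 sectors defragged */

                /* Buggy accounting: bytes treated as sectors. */
                uint64_t wrong = max_sectors - defragged_bytes;             /* wraps around */
                /* Correct accounting: convert to sectors first. */
                uint64_t right = max_sectors - defragged_bytes / SECTORSIZE;

                printf("remaining budget, buggy:   %" PRIu64 "\n", wrong);  /* huge value */
                printf("remaining budget, correct: %" PRIu64 "\n", right);  /* 1008 */
                return 0;
        }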
      
      [FIX]
      Fix the problems by:
      
      - Use sectors as the unit when increasing @sectors_defragged

      - Include the @sectors_defragged > @max_sectors case to break the loop

      - Add an extra comment on the return value of btrfs_defrag_file()
      
      Reported-by: Anthony Ruhier <aruhier@mailbox.org>
      Fixes: b18c3ab2 ("btrfs: defrag: introduce helper to defrag one cluster")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow defrag to be interruptible · b767c2fc
      Filipe Manana authored
      During defrag, at btrfs_defrag_file(), we have a loop that iterates
      over a file range in steps no larger than 256K subranges. If the range
      is too long, there's no way to interrupt it. So make the loop check in
      each iteration if there's a signal pending, and if there is, break and
      return -EAGAIN to userspace.
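
      A standalone sketch of the pattern (userspace signal handling instead of
      the kernel's pending-signal check; names invented): the loop polls for a
      pending signal each iteration and bails out with EAGAIN-like semantics.

        /* Sketch: make a long loop interruptible by checking for a signal per step. */
        #include <errno.h>
        #include <signal.h>
        #include <stdio.h>
        #include <unistd.h>

        static volatile sig_atomic_t got_signal;

        static void on_sigint(int sig)
        {
                (void)sig;
                got_signal = 1;         /* just note it; the loop checks the flag */
        }

        int main(void)
        {
                struct sigaction sa;

                sa.sa_handler = on_sigint;
                sigemptyset(&sa.sa_mask);
                sa.sa_flags = 0;
                sigaction(SIGINT, &sa, NULL);

                for (unsigned long cluster = 0; cluster < 1000000; cluster++) {
                        if (got_signal) {
                                fprintf(stderr, "interrupted at cluster %lu\n", cluster);
                                return EAGAIN;  /* mirrors returning -EAGAIN to user space */
                        }
                        usleep(1000);           /* stand-in for defragging one cluster */
                }
                return 0;
        }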
      
      Before kernel 5.16, we used to allow defrag to be cancelled through a
      signal, but that was lost with commit 7b508037 ("btrfs: defrag:
      use defrag_one_cluster() to implement btrfs_defrag_file()").
      
      This change adds back the possibility to cancel a defrag with a signal
      and keeps the same semantics, returning -EAGAIN to user space (and not
      the usually more expected -EINTR).
      
      This is also motivated by a recent bug on 5.16 where defragging a 1 byte
      file resulted in iterating from file range 0 to (u64)-1, and hitting the
      bug triggered a loop so long that it basically required rebooting the
      machine, as it was not possible to cancel the defrag.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix too long loop when defragging a 1 byte file · 6b34cd8e
      Filipe Manana authored
      
      
      When attempting to defrag a file with a single byte, we can end up in a
      too long loop, which is nearly infinite because at btrfs_defrag_file()
      we end up with the variable last_byte assigned with a value of
      18446744073709551615 (which is (u64)-1). The problem comes from the fact
      that we end up doing:
      
          last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
      
      So if last_byte was assigned 0, which is i_size - 1, we underflow and
      end up with the value 18446744073709551615.
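
      This can be verified with a few lines of standalone C using the same
      round_up() arithmetic (the macro below mirrors the kernel's
      power-of-two round_up, written here for the demo):

        /* Demo of the underflow: round_up(0, 4096) - 1 wraps to (u64)-1. */
        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Power-of-two round up, same arithmetic as the kernel macro. */
        #define round_up(x, y)  ((((x) - 1) | ((y) - 1)) + 1)

        int main(void)
        {
                uint64_t sectorsize = 4096;
                uint64_t i_size = 1;            /* 1 byte file */

                uint64_t buggy = round_up(i_size - 1, sectorsize) - 1;  /* round_up(0) - 1 wraps */
                uint64_t fixed = round_up(i_size, sectorsize) - 1;      /* round up i_size itself */

                printf("buggy last_byte: %" PRIu64 "\n", buggy);   /* 18446744073709551615 */
                printf("fixed last_byte: %" PRIu64 "\n", fixed);   /* 4095 */
                return 0;
        }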
      
      This is trivial to reproduce and the following script triggers it:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        echo -n "X" > $MNT/foobar
      
        btrfs filesystem defragment $MNT/foobar
      
        umount $MNT
      
      So fix this by not decrementing last_byte by 1 before doing the sector
      size round up. Also, to make it easier to follow, make the round up right
      after computing last_byte.
      
      Reported-by: Anthony Ruhier <aruhier@mailbox.org>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. Jan 07, 2022