  1. Dec 16, 2021
    • btrfs: check WRITE_ERR when trying to read an extent buffer · 651740a5
      Josef Bacik authored
      Filipe reported a hang when we have errors on btrfs.  This turned out to
      be a side-effect of my fix c2e39305 ("btrfs: clear extent buffer
      uptodate when we fail to write it") which made it so we clear
      EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out.
      
      Below is a paste of Filipe's analysis, which he got from using drgn to
      debug the hang:
      
      """
      btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to
      a value while writeback of all pages has not yet completed:
         --> writeback for the first 3 pages finishes, we clear
             EXTENT_BUFFER_UPTODATE from eb on the first page when we get an
             error.
         --> at this point eb->io_pages is 1 and we cleared Uptodate bit from the
             first 3 pages
         --> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so
             it continues, it's able to lock the pages since we obviously don't
             hold the pages locked during writeback
         --> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets
             eb->io_pages to 3, since only the first page does not have Uptodate
             bit set at this point
         --> writeback for the remaining page completes, we ended decrementing
             eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore
             never calling end_extent_buffer_writeback(), so
             EXTENT_BUFFER_WRITEBACK remains in the eb's flags
         --> of course, when the read bio completes, it doesn't and shouldn't
             call end_extent_buffer_writeback()
         --> we should clear EXTENT_BUFFER_UPTODATE only after all pages of
             the eb finished writeback?  or maybe make the read pages code
             wait for writeback of all pages of the eb to complete before
             checking which pages need to be read, touch ->io_pages, submit
             read bio, etc
      
      writeback bit never cleared means we can hang when aborting a
      transaction, at:
      
          btrfs_cleanup_one_transaction()
             btrfs_destroy_marked_extents()
               wait_on_extent_buffer_writeback()
      """
      
      This is a problem because our writes are not synchronized with reads in
      any way.  We clear the UPTODATE flag and then we can easily come in and
      try to read the EB while we're still waiting on other bio's to
      complete.
      
      We have two options here, we could lock all the pages, and then check to
      see if eb->io_pages != 0 to know if we've already got an outstanding
      write on the eb.
      
      Or we can simply check to see if we have WRITE_ERR set on this extent
      buffer.  We set this bit _before_ we clear UPTODATE, so if the read gets
      triggered because we aren't UPTODATE because of a write error we're
      guaranteed to have WRITE_ERR set, and in this case we can simply return
      -EIO.  This will fix the reported hang.
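
      The ordering argument above can be sketched in a few lines of userspace
      C.  This is only a single-threaded model under stated assumptions: the
      names toy_eb, EB_UPTODATE, EB_WRITE_ERR, toy_write_failed() and
      toy_read_eb() are hypothetical stand-ins, not the kernel's definitions,
      and the real code uses atomic bit operations on the eb's flags.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the extent buffer flag bits. */
enum {
	EB_UPTODATE  = 1u << 0,
	EB_WRITE_ERR = 1u << 1,
};

struct toy_eb {
	unsigned int flags;
};

/* Write failure path: WRITE_ERR is set before UPTODATE is cleared. */
static void toy_write_failed(struct toy_eb *eb)
{
	eb->flags |= EB_WRITE_ERR;
	eb->flags &= ~EB_UPTODATE;
}

/*
 * Read path: returns 0 if the buffer is already valid, -EIO if a write
 * error was recorded, and 1 if a read would have to be submitted.
 */
static int toy_read_eb(const struct toy_eb *eb)
{
	if (eb->flags & EB_UPTODATE)
		return 0;
	if (eb->flags & EB_WRITE_ERR)
		return -EIO;
	return 1;
}
```

      Because the error bit is set first, a reader that sees the buffer as
      not uptodate due to a failed write also sees the error bit, so it can
      bail out before touching any page or I/O counters.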
      
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: c2e39305 ("btrfs: clear extent buffer uptodate when we fail to write it")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. Dec 14, 2021
    • btrfs: fix missing last dir item offset update when logging directory · 1b2e5e5c
      Filipe Manana authored
      
      
      When logging a directory, once we finish processing a leaf that is full
      of dir items, if we find the next leaf was not modified in the current
      transaction, we grab the first key of that next leaf and log it to
      mark the end of a key range boundary.
      
      However we did not update the value of ctx->last_dir_item_offset, which
      tracks the offset of the last logged key. As a result, a subsequent
      logging of the same directory in the current transaction may not realize
      that the key was already logged, and then add it to the middle of a batch
      that starts with a lower key, resulting later in a leaf with one key
      that is duplicated and at non-consecutive slots. When that happens we get
      an error later when writing out the leaf, reporting that there is a pair
      of keys in wrong order. The report is something like the following:
      
      Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
      root=18446744073709551610 block=118444032 slot=21, bad key order, prev
      (704687 84 4146773349) current (704687 84 1063561078)
      Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
      91449 total ptrs 39 free space 546 owner 18446744073709551610
      Dec 13 21:44:50 kernel:         item 0 key (704687 1 0) itemoff 3835
      itemsize 160
      Dec 13 21:44:50 kernel:                 inode generation 35532 size
      1026 mode 40755
      Dec 13 21:44:50 kernel:         item 1 key (704687 12 704685) itemoff
      3822 itemsize 13
      Dec 13 21:44:50 kernel:         item 2 key (704687 24 3817753667)
      itemoff 3736 itemsize 86
      Dec 13 21:44:50 kernel:         item 3 key (704687 60 0) itemoff 3728 itemsize 8
      Dec 13 21:44:50 kernel:         item 4 key (704687 72 0) itemoff 3720 itemsize 8
      Dec 13 21:44:50 kernel:         item 5 key (704687 84 140445108)
      itemoff 3666 itemsize 54
      Dec 13 21:44:50 kernel:                 dir oid 704793 type 1
      Dec 13 21:44:50 kernel:         item 6 key (704687 84 298800632)
      itemoff 3599 itemsize 67
      Dec 13 21:44:50 kernel:                 dir oid 707849 type 2
      Dec 13 21:44:50 kernel:         item 7 key (704687 84 476147658)
      itemoff 3532 itemsize 67
      Dec 13 21:44:50 kernel:                 dir oid 707901 type 2
      Dec 13 21:44:50 kernel:         item 8 key (704687 84 633818382)
      itemoff 3471 itemsize 61
      Dec 13 21:44:50 kernel:                 dir oid 704694 type 2
      Dec 13 21:44:50 kernel:         item 9 key (704687 84 654256665)
      itemoff 3403 itemsize 68
      Dec 13 21:44:50 kernel:                 dir oid 707841 type 1
      Dec 13 21:44:50 kernel:         item 10 key (704687 84 995843418)
      itemoff 3331 itemsize 72
      Dec 13 21:44:50 kernel:                 dir oid 2167736 type 1
      Dec 13 21:44:50 kernel:         item 11 key (704687 84 1063561078)
      itemoff 3278 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
      Dec 13 21:44:50 kernel:         item 12 key (704687 84 1101156010)
      itemoff 3225 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704696 type 1
      Dec 13 21:44:50 kernel:         item 13 key (704687 84 2521936574)
      itemoff 3173 itemsize 52
      Dec 13 21:44:50 kernel:                 dir oid 704704 type 2
      Dec 13 21:44:50 kernel:         item 14 key (704687 84 2618368432)
      itemoff 3112 itemsize 61
      Dec 13 21:44:50 kernel:                 dir oid 704738 type 1
      Dec 13 21:44:50 kernel:         item 15 key (704687 84 2676316190)
      itemoff 3046 itemsize 66
      Dec 13 21:44:50 kernel:                 dir oid 2167729 type 1
      Dec 13 21:44:50 kernel:         item 16 key (704687 84 3319104192)
      itemoff 2986 itemsize 60
      Dec 13 21:44:50 kernel:                 dir oid 704745 type 2
      Dec 13 21:44:50 kernel:         item 17 key (704687 84 3908046265)
      itemoff 2929 itemsize 57
      Dec 13 21:44:50 kernel:                 dir oid 2167734 type 1
      Dec 13 21:44:50 kernel:         item 18 key (704687 84 3945713089)
      itemoff 2857 itemsize 72
      Dec 13 21:44:50 kernel:                 dir oid 2167730 type 1
      Dec 13 21:44:50 kernel:         item 19 key (704687 84 4077169308)
      itemoff 2795 itemsize 62
      Dec 13 21:44:50 kernel:                 dir oid 704688 type 1
      Dec 13 21:44:50 kernel:         item 20 key (704687 84 4146773349)
      itemoff 2727 itemsize 68
      Dec 13 21:44:50 kernel:                 dir oid 707892 type 1
      Dec 13 21:44:50 kernel:         item 21 key (704687 84 1063561078)
      itemoff 2674 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
      Dec 13 21:44:50 kernel:         item 22 key (704687 96 2) itemoff 2612
      itemsize 62
      Dec 13 21:44:50 kernel:         item 23 key (704687 96 6) itemoff 2551
      itemsize 61
      Dec 13 21:44:50 kernel:         item 24 key (704687 96 7) itemoff 2498
      itemsize 53
      Dec 13 21:44:50 kernel:         item 25 key (704687 96 12) itemoff
      2446 itemsize 52
      Dec 13 21:44:50 kernel:         item 26 key (704687 96 14) itemoff
      2385 itemsize 61
      Dec 13 21:44:50 kernel:         item 27 key (704687 96 18) itemoff
      2325 itemsize 60
      Dec 13 21:44:50 kernel:         item 28 key (704687 96 24) itemoff
      2271 itemsize 54
      Dec 13 21:44:50 kernel:         item 29 key (704687 96 28) itemoff
      2218 itemsize 53
      Dec 13 21:44:50 kernel:         item 30 key (704687 96 62) itemoff
      2150 itemsize 68
      Dec 13 21:44:50 kernel:         item 31 key (704687 96 66) itemoff
      2083 itemsize 67
      Dec 13 21:44:50 kernel:         item 32 key (704687 96 75) itemoff
      2015 itemsize 68
      Dec 13 21:44:50 kernel:         item 33 key (704687 96 79) itemoff
      1948 itemsize 67
      Dec 13 21:44:50 kernel:         item 34 key (704687 96 82) itemoff
      1882 itemsize 66
      Dec 13 21:44:50 kernel:         item 35 key (704687 96 83) itemoff
      1810 itemsize 72
      Dec 13 21:44:50 kernel:         item 36 key (704687 96 85) itemoff
      1753 itemsize 57
      Dec 13 21:44:50 kernel:         item 37 key (704687 96 87) itemoff
      1681 itemsize 72
      Dec 13 21:44:50 kernel:         item 38 key (704694 1 0) itemoff 1521
      itemsize 160
      Dec 13 21:44:50 kernel:                 inode generation 35534 size 30
      mode 40755
      Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
      write time tree block corruption detected
      
      So fix that by adding the missing update of ctx->last_dir_item_offset with
      the offset of the boundary key.
      
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
      Fixes: dc287224 ("btrfs: keep track of the last logged keys when logging a directory")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix double free of anon_dev after failure to create subvolume · 33fab972
      Filipe Manana authored
      When creating a subvolume, at create_subvol(), we allocate an anonymous
      device and later call btrfs_get_new_fs_root(), which in turn just calls
      btrfs_get_root_ref(). There we call btrfs_init_fs_root() which assigns
      the anonymous device to the root, but if after that call there's an error,
      when we jump to 'fail' label, we call btrfs_put_root(), which frees the
      anonymous device and then returns an error that is propagated back to
      create_subvol(). Then create_subvol() frees the anonymous device again.
      
      When this happens, if the anonymous device was not reallocated after
      the first time it was freed with btrfs_put_root(), we get a kernel
      message like the following:
      
        (...)
        [13950.282466] BTRFS: error (device dm-0) in create_subvol:663: errno=-5 IO failure
        [13950.283027] ida_free called for id=65 which is not allocated.
        [13950.285974] BTRFS info (device dm-0): forced readonly
        (...)
      
      If the anonymous device gets reallocated by another btrfs filesystem
      or any other kernel subsystem, then bad things can happen.
      
      So fix this by setting the root's anonymous device to 0 at
      btrfs_get_root_ref(), before we call btrfs_put_root(), if an error
      happened.
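
      The ownership problem can be modelled in userspace roughly as follows.
      This is a sketch under stated assumptions, not the kernel code: the toy
      ID allocator and the names toy_alloc_id(), toy_free_id(), toy_put_root()
      and toy_get_root_fails() are hypothetical, and toy_bad_frees merely
      counts what the kernel reports as "ida_free called for id=... which is
      not allocated".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_IDS 8

/* Toy ID allocator standing in for the anonymous device allocator. */
static bool toy_id_used[TOY_MAX_IDS];
static int toy_bad_frees;	/* frees of IDs that are not allocated */

static unsigned int toy_alloc_id(void)
{
	for (unsigned int id = 1; id < TOY_MAX_IDS; id++) {
		if (!toy_id_used[id]) {
			toy_id_used[id] = true;
			return id;
		}
	}
	return 0;	/* 0 means "no ID" */
}

static void toy_free_id(unsigned int id)
{
	if (id == 0)
		return;
	if (!toy_id_used[id]) {
		toy_bad_frees++;
		return;
	}
	toy_id_used[id] = false;
}

struct toy_root {
	unsigned int anon_dev;
};

/* Dropping the last reference frees the root's ID, if it owns one. */
static void toy_put_root(struct toy_root *root)
{
	toy_free_id(root->anon_dev);
	root->anon_dev = 0;
}

/*
 * Models the failing lookup: the caller passes in a preallocated ID and,
 * on error, remains responsible for freeing it.
 */
static int toy_get_root_fails(unsigned int anon_dev, bool apply_fix)
{
	struct toy_root root = { .anon_dev = anon_dev };

	if (apply_fix)
		root.anon_dev = 0;	/* the caller still owns the ID */
	toy_put_root(&root);
	return -EIO;
}
```

      With apply_fix set to false, the caller's own free of the ID after the
      error is the second free; with it set to true, only the caller's free
      happens.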
      
      Fixes: 2dfb1e43 ("btrfs: preallocate anon block device at first phase of snapshot creation")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix memory leak in __add_inode_ref() · f35838a6
      Jianglei Nie authored
      Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
      but when the function returns in line 1184 (#4), the victim_name
      allocated by line 1169 (#3) is not freed, which leads to a memory leak.
      A similar snippet earlier in this function allocates a memory chunk
      for victim_name in line 1104 (#1) and correctly releases it in
      line 1116 (#2).
      
      We should kfree() victim_name when the return value of backref_in_log()
      is less than zero and before the function returns in line 1184 (#4).
      
      1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
      1058 				  struct btrfs_root *root,
      1059 				  struct btrfs_path *path,
      1060 				  struct btrfs_root *log_root,
      1061 				  struct btrfs_inode *dir,
      1062 				  struct btrfs_inode *inode,
      1063 				  u64 inode_objectid, u64 parent_objectid,
      1064 				  u64 ref_index, char *name, int namelen,
      1065 				  int *search_done)
      1066 {
      
      1104 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
      	// #1: kmalloc (victim_name-1)
      1105 	if (!victim_name)
      1106 		return -ENOMEM;
      
      1112	ret = backref_in_log(log_root, &search_key,
      1113			parent_objectid, victim_name,
      1114			victim_name_len);
      1115	if (ret < 0) {
      1116		kfree(victim_name); // #2: kfree (victim_name-1)
      1117		return ret;
      1118	} else if (!ret) {
      
      1169 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
      	// #3: kmalloc (victim_name-2)
      1170 	if (!victim_name)
      1171 		return -ENOMEM;
      
      1180 	ret = backref_in_log(log_root, &search_key,
      1181 			parent_objectid, victim_name,
      1182 			victim_name_len);
      1183 	if (ret < 0) {
      1184 		return ret; // #4: missing kfree (victim_name-2)
      1185 	} else if (!ret) {
      
      1241 	return 0;
      1242 }
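
      A userspace sketch of the corrected error path at #4 might look like
      the following.  It assumes malloc()/free() in place of kmalloc()/kfree(),
      and the helpers toy_alloc(), toy_free(), toy_backref_in_log() and
      toy_add_ref() are hypothetical; toy_live_allocs only exists to make the
      leak observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int toy_live_allocs;	/* allocations not yet freed */

static void *toy_alloc(size_t len)
{
	void *p = malloc(len);

	if (p)
		toy_live_allocs++;
	return p;
}

static void toy_free(void *p)
{
	if (p)
		toy_live_allocs--;
	free(p);
}

/* Hypothetical lookup that simply returns the result it is told to. */
static int toy_backref_in_log(const char *name, int forced_ret)
{
	(void)name;
	return forced_ret;
}

static int toy_add_ref(const char *src, size_t len, int forced_ret)
{
	char *victim_name = toy_alloc(len);
	int ret;

	if (!victim_name)
		return -ENOMEM;
	memcpy(victim_name, src, len);

	ret = toy_backref_in_log(victim_name, forced_ret);
	if (ret < 0) {
		toy_free(victim_name);	/* the free that was missing at #4 */
		return ret;
	}

	toy_free(victim_name);
	return 0;
}
```

      Calling toy_add_ref() with a negative forced_ret exercises the error
      path; toy_live_allocs returns to zero afterwards only because of the
      added free.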
      
      Fixes: d3316c82 ("btrfs: Properly handle backref_in_log retval")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. Dec 08, 2021
    • btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling · 8289ed9f
      Qu Wenruo authored
      
      
      I hit the BUG_ON() with generic/475 test case, and to my surprise, all
      callers of btrfs_del_root_ref() are already aborting transaction, thus
      there is no need for such BUG_ON(), just go to @out label and caller
      will properly handle the error.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: clear data relocation bg on zone finish · 5911f538
      Johannes Thumshirn authored
      When finishing a zone that is used by a dedicated data relocation
      block group, also remove its reference from fs_info, so we're not trying
      to use a full block group for allocations during data relocation, which
      will always fail.
      
      The result is we're not making any forward progress and end up in a
      deadlock situation.
      
      Fixes: c2707a25 ("btrfs: zoned: add a dedicated data relocation block group")
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: free exchange changeset on failures · da5e817d
      Johannes Thumshirn authored
      
      
      Fstests runs on my VMs have shown several kmemleak reports like the following.
      
        unreferenced object 0xffff88811ae59080 (size 64):
          comm "xfs_io", pid 12124, jiffies 4294987392 (age 6.368s)
          hex dump (first 32 bytes):
            00 c0 1c 00 00 00 00 00 ff cf 1c 00 00 00 00 00  ................
            90 97 e5 1a 81 88 ff ff 90 97 e5 1a 81 88 ff ff  ................
          backtrace:
            [<00000000ac0176d2>] ulist_add_merge+0x60/0x150 [btrfs]
            [<0000000076e9f312>] set_state_bits+0x86/0xc0 [btrfs]
            [<0000000014fe73d6>] set_extent_bit+0x270/0x690 [btrfs]
            [<000000004f675208>] set_record_extent_bits+0x19/0x20 [btrfs]
            [<00000000b96137b1>] qgroup_reserve_data+0x274/0x310 [btrfs]
            [<0000000057e9dcbb>] btrfs_check_data_free_space+0x5c/0xa0 [btrfs]
            [<0000000019c4511d>] btrfs_delalloc_reserve_space+0x1b/0xa0 [btrfs]
            [<000000006d37e007>] btrfs_dio_iomap_begin+0x415/0x970 [btrfs]
            [<00000000fb8a74b8>] iomap_iter+0x161/0x1e0
            [<0000000071dff6ff>] __iomap_dio_rw+0x1df/0x700
            [<000000002567ba53>] iomap_dio_rw+0x5/0x20
            [<0000000072e555f8>] btrfs_file_write_iter+0x290/0x530 [btrfs]
            [<000000005eb3d845>] new_sync_write+0x106/0x180
            [<000000003fb505bf>] vfs_write+0x24d/0x2f0
            [<000000009bb57d37>] __x64_sys_pwrite64+0x69/0xa0
            [<000000003eba3fdf>] do_syscall_64+0x43/0x90
      
      In case btrfs_qgroup_reserve_data() or btrfs_delalloc_reserve_metadata()
      fail, the allocated extent_changeset will not be freed.
      
      So in btrfs_check_data_free_space() and btrfs_delalloc_reserve_space()
      free the allocated extent_changeset to get rid of the allocated memory.
      
      The issue currently only happens in the direct IO write path, but only
      after 65b3c08606e5 ("btrfs: fix ENOSPC failure when attempting direct IO
      write into NOCOW range"), and also at defrag_one_locked_target(). Every
      other place is always calling extent_changeset_free() even if its call
      to btrfs_delalloc_reserve_space() or btrfs_check_data_free_space() has
      failed.
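
      The pattern of freeing the changeset inside the reserving function on
      failure can be sketched in userspace as below.  The struct and the
      functions toy_qgroup_reserve() and toy_check_data_free_space() are
      hypothetical simplifications, and -ENOSPC is just an example error
      code; the point is only that the callee releases what it allocated, so
      a caller that skips its own free on error no longer leaks.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_changeset {
	unsigned long bytes_changed;
};

/* Hypothetical reservation step that allocates *reserved on first use. */
static int toy_qgroup_reserve(struct toy_changeset **reserved, int fail)
{
	if (!*reserved) {
		*reserved = calloc(1, sizeof(**reserved));
		if (!*reserved)
			return -ENOMEM;
	}
	return fail ? -ENOSPC : 0;
}

static int toy_check_data_free_space(struct toy_changeset **reserved, int fail)
{
	int ret = toy_qgroup_reserve(reserved, fail);

	if (ret < 0) {
		/* Free on failure so the caller never sees a stale pointer. */
		free(*reserved);
		*reserved = NULL;
	}
	return ret;
}
```

      Resetting the pointer to NULL keeps a later unconditional free in the
      caller harmless, which matches how the other call sites behave.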
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix re-dirty process of tree-log nodes · 84c25448
      Naohiro Aota authored
      There is a report of a transaction abort of -EAGAIN with the following
      script.
      
        #!/bin/sh
      
        for d in sda sdb; do
                mkfs.btrfs -d single -m single -f /dev/${d}
        done

        mount /dev/sda /mnt/test
        mount /dev/sdb /mnt/scratch

        for dir in test scratch; do
                echo 3 >/proc/sys/vm/drop_caches
                fio --directory=/mnt/${dir} --name=fio.${dir} --rw=read --size=50G --bs=64m \
                        --numjobs=$(nproc) --time_based --ramp_time=5 --runtime=480 \
                        --group_reporting 2>&1 | tee /dev/shm/fio.${dir}
                echo 3 >/proc/sys/vm/drop_caches
        done

        for d in sda sdb; do
                umount /dev/${d}
        done
      
      The stack trace is shown below.
      
        [3310.967991] BTRFS: error (device sda) in btrfs_commit_transaction:2341: errno=-11 unknown (Error while writing out transaction)
        [3310.968060] BTRFS info (device sda): forced readonly
        [3310.968064] BTRFS warning (device sda): Skipping commit of aborted transaction.
        [3310.968065] ------------[ cut here ]------------
        [3310.968066] BTRFS: Transaction aborted (error -11)
        [3310.968074] WARNING: CPU: 14 PID: 1684 at fs/btrfs/transaction.c:1946 btrfs_commit_transaction.cold+0x209/0x2c8
        [3310.968131] CPU: 14 PID: 1684 Comm: fio Not tainted 5.14.10-300.fc35.x86_64 #1
        [3310.968135] Hardware name: DIAWAY Tartu/Tartu, BIOS V2.01.B10 04/08/2021
        [3310.968137] RIP: 0010:btrfs_commit_transaction.cold+0x209/0x2c8
        [3310.968144] RSP: 0018:ffffb284ce393e10 EFLAGS: 00010282
        [3310.968147] RAX: 0000000000000026 RBX: ffff973f147b0f60 RCX: 0000000000000027
        [3310.968149] RDX: ffff974ecf098a08 RSI: 0000000000000001 RDI: ffff974ecf098a00
        [3310.968150] RBP: ffff973f147b0f08 R08: 0000000000000000 R09: ffffb284ce393c48
        [3310.968151] R10: ffffb284ce393c40 R11: ffffffff84f47468 R12: ffff973f101bfc00
        [3310.968153] R13: ffff971f20cf2000 R14: 00000000fffffff5 R15: ffff973f147b0e58
        [3310.968154] FS:  00007efe65468740(0000) GS:ffff974ecf080000(0000) knlGS:0000000000000000
        [3310.968157] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [3310.968158] CR2: 000055691bcbe260 CR3: 000000105cfa4001 CR4: 0000000000770ee0
        [3310.968160] PKRU: 55555554
        [3310.968161] Call Trace:
        [3310.968167]  ? dput+0xd4/0x300
        [3310.968174]  btrfs_sync_file+0x3f1/0x490
        [3310.968180]  __x64_sys_fsync+0x33/0x60
        [3310.968185]  do_syscall_64+0x3b/0x90
        [3310.968190]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [3310.968194] RIP: 0033:0x7efe6557329b
        [3310.968200] RSP: 002b:00007ffe0236ebc0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
        [3310.968203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efe6557329b
        [3310.968204] RDX: 0000000000000000 RSI: 00007efe58d77010 RDI: 0000000000000006
        [3310.968205] RBP: 0000000004000000 R08: 0000000000000000 R09: 00007efe58d77010
        [3310.968207] R10: 0000000016cacc0c R11: 0000000000000293 R12: 00007efe5ce95980
        [3310.968208] R13: 0000000000000000 R14: 00007efe6447c790 R15: 0000000c80000000
        [3310.968212] ---[ end trace 1a346f4d3c0d96ba ]---
        [3310.968214] BTRFS: error (device sda) in cleanup_transaction:1946: errno=-11 unknown
      
      The abort occurs because of a write hole while writing out freed tree
      nodes of a tree-log tree. For zoned btrfs, we re-dirty a freed tree
      node to ensure btrfs can write the region and does not leave a hole on
      write on a zoned device. The current code fails to re-dirty a node
      when the tree-log tree's depth is greater than or equal to 2. That leads
      to a transaction abort with -EAGAIN.
      
      Fix the issue by properly re-dirtying a node on walking up the tree.
      
      Fixes: d3575156 ("btrfs: zoned: redirty released extent buffers")
      CC: stable@vger.kernel.org # 5.12+
      Link: https://github.com/kdave/btrfs-progs/issues/415
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: call mapping_set_error() on btree inode with a write error · 68b85589
      Josef Bacik authored
      
      
      generic/484 fails sometimes with compression on because the write ends
      up small enough that it goes into the btree.  This means that we never
      call mapping_set_error() on the inode itself, because the page gets
      marked as fine when we inline it into the metadata.  When the metadata
      writeback happens we see it and abort the transaction properly and mark
      the fs as readonly, however we don't do the mapping_set_error() on
      anything.  In syncfs() we will simply return 0 if the sb is marked
      read-only, so we can't check for this in our syncfs callback.  The only
      way the error gets returned is if we called mapping_set_error() on
      something.  Fix this by calling mapping_set_error() on the btree inode
      mapping.  This allows us to properly return an error on syncfs and pass
      generic/484 with compression on.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: clear extent buffer uptodate when we fail to write it · c2e39305
      Josef Bacik authored
      
      
      I got dmesg errors on generic/281 on our overnight fstests.  Looking at
      the history this happens occasionally, with errors like this
      
        WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50
        CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G        W         5.16.0-rc2+ #469
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Workqueue: btrfs-cache btrfs_work_helper
        RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
        RSP: 0018:ffffae598230bc60 EFLAGS: 00010246
        RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000
        RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340
        RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000
        R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0
        R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000
        FS:  0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0
        Call Trace:
      
         extent_buffer_test_bit+0x3f/0x70
         free_space_test_bit+0xa6/0xc0
         load_free_space_tree+0x1d6/0x430
         caching_thread+0x454/0x630
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? rcu_read_lock_sched_held+0x12/0x60
         ? lock_release+0x1f0/0x2d0
         btrfs_work_helper+0xf2/0x3e0
         ? lock_release+0x1f0/0x2d0
         ? finish_task_switch.isra.0+0xf9/0x3a0
         process_one_work+0x270/0x5a0
         worker_thread+0x55/0x3c0
         ? process_one_work+0x5a0/0x5a0
         kthread+0x174/0x1a0
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we're trying to read from an extent buffer page that
      is !PageUptodate.  This happens because we will clear the page uptodate
      when we have an IO error, but we don't clear the extent buffer uptodate.
      If we do a read later and find this extent buffer we'll think it's valid
      and not return an error, and then trip over this warning.
      
      Fix this by also clearing uptodate on the extent buffer when this
      happens, so that we get an error when we do a btrfs_search_slot() and
      find this block later.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fail if fstrim_range->start == U64_MAX · f981fec1
      Josef Bacik authored
      
      
      We've always been failing generic/260 because it's testing things we
      actually don't care about and thus won't fail for.  However we probably
      should fail for fstrim_range->start == U64_MAX since we clearly can't
      trim anything past that.  This in combination with an update to
      generic/260 will allow us to pass this test properly.
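
      The kind of check described here could look roughly like this in
      userspace.  Both the toy_fstrim_range struct and toy_trim_check() are
      hypothetical, and returning -EINVAL is an assumption made for the
      sketch; the commit message does not state which error code is used.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified stand-in for the trim request passed in from userspace. */
struct toy_fstrim_range {
	uint64_t start;
	uint64_t len;
	uint64_t minlen;
};

/* Reject a start offset that leaves nothing after it to trim. */
static int toy_trim_check(const struct toy_fstrim_range *range)
{
	if (range->start == UINT64_MAX)
		return -EINVAL;	/* assumed error code */
	return 0;
}
```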
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2() · d815b3f2
      Dan Carpenter authored
      If memdup_user() fails the error handling will crash when it tries
      to kfree() an error pointer.  Just return directly because there is
      no cleanup required.
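
      For readers unfamiliar with error pointers, here is a userspace sketch
      of the idea.  The toy_err_ptr(), toy_ptr_err() and toy_is_err() helpers
      only imitate the kernel's error pointer convention, toy_memdup() is a
      hypothetical stand-in for memdup_user(), and the integer/pointer casts
      assume a conventional flat address space.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
static void *toy_err_ptr(long err) { return (void *)(intptr_t)err; }
static long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Copies the argument, or returns an error pointer instead of NULL. */
static void *toy_memdup(const void *src, size_t len, int fail)
{
	void *p;

	if (fail)
		return toy_err_ptr(-EFAULT);
	p = malloc(len);
	if (!p)
		return toy_err_ptr(-ENOMEM);
	memcpy(p, src, len);
	return p;
}

static long toy_ioctl(const void *arg, size_t len, int fail)
{
	void *copy = toy_memdup(arg, len, fail);

	/* An error pointer is not a heap address and must not be freed. */
	if (toy_is_err(copy))
		return toy_ptr_err(copy);

	/* ... the copy would be used here ... */
	free(copy);
	return 0;
}
```

      Jumping to a shared cleanup label that frees the pointer would pass the
      encoded error value to free(), which is the crash the commit avoids by
      returning directly.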
      
      Fixes: 1a15eb72 ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. Nov 26, 2021
    • btrfs: fix the memory leak caused in lzo_compress_pages() · daf87e95
      Qu Wenruo authored
      [BUG]
      Fstests generic/027 can easily trigger a slow but steady memory leak
      if run with the "-o compress=lzo" mount option.

      Normally one single run of generic/027 is enough to eat up at least 4G of RAM.
      
      [CAUSE]
      In commit d4088803 ("btrfs: subpage: make lzo_compress_pages()
      compatible") we changed how @page_in is released.

      But that refactoring makes @page_in only get released after all pages
      have been compressed.

      This leaves the error path not releasing @page_in. And "error path"
      here includes things like incompressible data, which is also treated
      as an error (-E2BIG).

      Thus it can cause a memory leak even if nothing wrong happened.
      
      [FIX]
      Add check under @out label to release @page_in when needed, so when we
      hit any error, the input page is properly released.
      
      Reported-by: Josef Bacik <josef@toxicpanda.com>
      Fixes: d4088803 ("btrfs: subpage: make lzo_compress_pages() compatible")
      Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. Nov 16, 2021
    • btrfs: deprecate BTRFS_IOC_BALANCE ioctl · 6c405b24
      Nikolay Borisov authored
      
      
      The v2 balance ioctl was introduced more than 9 years ago. Users of
      the old v1 ioctl should have long since migrated to it. It's time we
      deprecate v1 and eventually remove it.
      
      The only known user is in btrfs-progs that tries v1 as a fallback in
      case v2 is not supported. This is not necessary anymore.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make 1-bit bit-fields of scrub_page unsigned int · d08e38b6
      Colin Ian King authored
      The bit-fields have_csum and io_error are currently signed, which is not
      recommended because the representation of a signed 1-bit bit-field is
      implementation-defined. Fix this by making the bit-fields unsigned ints.
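      
      A short sketch of why the unsigned form is the safer one. A signed 1-bit
      field on a two's complement target can only hold 0 and -1, so a
      comparison against 1 may never match, while an unsigned 1-bit field holds
      0 and 1. The struct below is a hypothetical reduction for illustration,
      not the real scrub_page layout.

```c
#include <assert.h>

/* Hypothetical reduced struct; the real scrub_page has more members. */
struct demo_flags_unsigned {
	unsigned int have_csum:1;
	unsigned int io_error:1;
};

/* Returns the value read back after storing 1 in an unsigned 1-bit field. */
static unsigned int demo_store_one(void)
{
	struct demo_flags_unsigned f = { .have_csum = 0, .io_error = 0 };

	f.io_error = 1;
	return f.io_error;	/* an unsigned 1-bit field reads back as 1 */
}
```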
      
      Fixes: 2c363954 ("btrfs: scrub: remove the anonymous structure from scrub_page")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d08e38b6
    • Wang Yugui's avatar
      btrfs: check-integrity: fix a warning on write caching disabled disk · a91cf0ff
      Wang Yugui authored
      
      
      When a disk has write caching disabled, we skip submission of a bio with
      flush and sync requests before writing the superblock, since it's not
      needed. However when the integrity checker is enabled, this results in
      reports that there are metadata blocks referred by a superblock that
      were not properly flushed. So, for the sake of simplicity, do not skip
      the bio submission when the integrity checker is enabled, since this
      is a debug tool and not meant for use in non-debug builds.
      
      fstests/btrfs/220 triggers a check-integrity warning like the following
      when CONFIG_BTRFS_FS_CHECK_INTEGRITY=y and the disk has WCE=0.
      
        btrfs: attempt to write superblock which references block M @5242880 (sdb2/5242880/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
        ------------[ cut here ]------------
        WARNING: CPU: 28 PID: 843680 at fs/btrfs/check-integrity.c:2196 btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
        CPU: 28 PID: 843680 Comm: umount Not tainted 5.15.0-0.rc5.39.el8.x86_64 #1
        Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019
        RIP: 0010:btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
        RSP: 0018:ffffb642afb47940 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
        RDX: 00000000ffffffff RSI: ffff8b722fc97d00 RDI: ffff8b722fc97d00
        RBP: ffff8b5601c00000 R08: 0000000000000000 R09: c0000000ffff7fff
        R10: 0000000000000001 R11: ffffb642afb476f8 R12: ffffffffffffffff
        R13: ffffb642afb47974 R14: ffff8b5499254c00 R15: 0000000000000003
        FS:  00007f00a06d4080(0000) GS:ffff8b722fc80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fff5cff5ff0 CR3: 00000001c0c2a006 CR4: 00000000001706e0
        Call Trace:
         btrfsic_process_written_block+0x2f7/0x850 [btrfs]
         __btrfsic_submit_bio.part.19+0x310/0x330 [btrfs]
         ? bio_associate_blkg_from_css+0xa4/0x2c0
         btrfsic_submit_bio+0x18/0x30 [btrfs]
         write_dev_supers+0x81/0x2a0 [btrfs]
         ? find_get_pages_range_tag+0x219/0x280
         ? pagevec_lookup_range_tag+0x24/0x30
         ? __filemap_fdatawait_range+0x6d/0xf0
         ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
         ? find_first_extent_bit+0x9b/0x160 [btrfs]
         ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
         write_all_supers+0x1b3/0xa70 [btrfs]
         ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
         btrfs_commit_transaction+0x59d/0xac0 [btrfs]
         close_ctree+0x11d/0x339 [btrfs]
         generic_shutdown_super+0x71/0x110
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0xb8/0x140
         task_work_run+0x6d/0xb0
         exit_to_user_mode_prepare+0x1f0/0x200
         syscall_exit_to_user_mode+0x12/0x30
         do_syscall_64+0x46/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f009f711dfb
        RSP: 002b:00007fff5cff7928 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 000055b68c6c9970 RCX: 00007f009f711dfb
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b68c6c9b50
        RBP: 0000000000000000 R08: 000055b68c6ca900 R09: 00007f009f795580
        R10: 0000000000000000 R11: 0000000000000246 R12: 000055b68c6c9b50
        R13: 00007f00a04bf184 R14: 0000000000000000 R15: 00000000ffffffff
        ---[ end trace 2c4b82abcef9eec4 ]---
        S-65536(sdb2/65536/1)
         -->
        M-1064960(sdb2/1064960/1)
      
      Reviewed-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a91cf0ff
    • Filipe Manana's avatar
      btrfs: silence lockdep when reading chunk tree during mount · 4d9380e0
      Filipe Manana authored
      
      
      Often some test cases like btrfs/161 trigger lockdep splats that complain
      about possible unsafe lock scenario due to the fact that during mount,
      when reading the chunk tree we end up calling blkdev_get_by_path() while
      holding a read lock on a leaf of the chunk tree. That produces a lockdep
      splat like the following:
      
      [ 3653.683975] ======================================================
      [ 3653.685148] WARNING: possible circular locking dependency detected
      [ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted
      [ 3653.687239] ------------------------------------------------------
      [ 3653.688400] mount/447465 is trying to acquire lock:
      [ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.691054]
                     but task is already holding lock:
      [ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.693978]
                     which lock already depends on the new lock.
      
      [ 3653.695510]
                     the existing dependency chain (in reverse order) is:
      [ 3653.696915]
                     -> #3 (btrfs-chunk-00){++++}-{3:3}:
      [ 3653.698053]        down_read_nested+0x4b/0x140
      [ 3653.698893]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.699988]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 3653.701205]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 3653.702234]        btrfs_insert_empty_items+0x32/0x70 [btrfs]
      [ 3653.703332]        btrfs_init_new_device+0x563/0x15b0 [btrfs]
      [ 3653.704439]        btrfs_ioctl+0x2110/0x3530 [btrfs]
      [ 3653.705405]        __x64_sys_ioctl+0x83/0xb0
      [ 3653.706215]        do_syscall_64+0x3b/0xc0
      [ 3653.706990]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.708040]
                     -> #2 (sb_internal#2){.+.+}-{0:0}:
      [ 3653.708994]        lock_release+0x13d/0x4a0
      [ 3653.709533]        up_write+0x18/0x160
      [ 3653.710017]        btrfs_sync_file+0x3f3/0x5b0 [btrfs]
      [ 3653.710699]        __loop_update_dio+0xbd/0x170 [loop]
      [ 3653.711360]        lo_ioctl+0x3b1/0x8a0 [loop]
      [ 3653.711929]        block_ioctl+0x48/0x50
      [ 3653.712442]        __x64_sys_ioctl+0x83/0xb0
      [ 3653.712991]        do_syscall_64+0x3b/0xc0
      [ 3653.713519]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.714233]
                     -> #1 (&lo->lo_mutex){+.+.}-{3:3}:
      [ 3653.715026]        __mutex_lock+0x92/0x900
      [ 3653.715648]        lo_open+0x28/0x60 [loop]
      [ 3653.716275]        blkdev_get_whole+0x28/0x90
      [ 3653.716867]        blkdev_get_by_dev.part.0+0x142/0x320
      [ 3653.717537]        blkdev_open+0x5e/0xa0
      [ 3653.718043]        do_dentry_open+0x163/0x390
      [ 3653.718604]        path_openat+0x3f0/0xa80
      [ 3653.719128]        do_filp_open+0xa9/0x150
      [ 3653.719652]        do_sys_openat2+0x97/0x160
      [ 3653.720197]        __x64_sys_openat+0x54/0x90
      [ 3653.720766]        do_syscall_64+0x3b/0xc0
      [ 3653.721285]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.721986]
                     -> #0 (&disk->open_mutex){+.+.}-{3:3}:
      [ 3653.722775]        __lock_acquire+0x130e/0x2210
      [ 3653.723348]        lock_acquire+0xd7/0x310
      [ 3653.723867]        __mutex_lock+0x92/0x900
      [ 3653.724394]        blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.725041]        blkdev_get_by_path+0xb8/0xd0
      [ 3653.725614]        btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
      [ 3653.726332]        open_fs_devices+0xd7/0x2c0 [btrfs]
      [ 3653.726999]        btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
      [ 3653.727739]        open_ctree+0xb8e/0x17bf [btrfs]
      [ 3653.728384]        btrfs_mount_root.cold+0x12/0xde [btrfs]
      [ 3653.729130]        legacy_get_tree+0x30/0x50
      [ 3653.729676]        vfs_get_tree+0x28/0xc0
      [ 3653.730192]        vfs_kern_mount.part.0+0x71/0xb0
      [ 3653.730800]        btrfs_mount+0x11d/0x3a0 [btrfs]
      [ 3653.731427]        legacy_get_tree+0x30/0x50
      [ 3653.731970]        vfs_get_tree+0x28/0xc0
      [ 3653.732486]        path_mount+0x2d4/0xbe0
      [ 3653.732997]        __x64_sys_mount+0x103/0x140
      [ 3653.733560]        do_syscall_64+0x3b/0xc0
      [ 3653.734080]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.734782]
                     other info that might help us debug this:
      
      [ 3653.735784] Chain exists of:
                       &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00
      
      [ 3653.737123]  Possible unsafe locking scenario:
      
      [ 3653.737865]        CPU0                    CPU1
      [ 3653.738435]        ----                    ----
      [ 3653.739007]   lock(btrfs-chunk-00);
      [ 3653.739449]                                lock(sb_internal#2);
      [ 3653.740193]                                lock(btrfs-chunk-00);
      [ 3653.740955]   lock(&disk->open_mutex);
      [ 3653.741431]
                      *** DEADLOCK ***
      
      [ 3653.742176] 3 locks held by mount/447465:
      [ 3653.742739]  #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0
      [ 3653.744114]  #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs]
      [ 3653.745563]  #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.747066]
                     stack backtrace:
      [ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1
      [ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 3653.750592] Call Trace:
      [ 3653.750967]  dump_stack_lvl+0x57/0x72
      [ 3653.751526]  check_noncircular+0xf3/0x110
      [ 3653.752136]  ? stack_trace_save+0x4b/0x70
      [ 3653.752748]  __lock_acquire+0x130e/0x2210
      [ 3653.753356]  lock_acquire+0xd7/0x310
      [ 3653.753898]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.754596]  ? lock_is_held_type+0xe8/0x140
      [ 3653.755125]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.755729]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.756338]  __mutex_lock+0x92/0x900
      [ 3653.756794]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.757400]  ? do_raw_spin_unlock+0x4b/0xa0
      [ 3653.757930]  ? _raw_spin_unlock+0x29/0x40
      [ 3653.758437]  ? bd_prepare_to_claim+0x129/0x150
      [ 3653.758999]  ? trace_module_get+0x2b/0xd0
      [ 3653.759508]  ? try_module_get.part.0+0x50/0x80
      [ 3653.760072]  blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.760661]  ? devcgroup_check_permission+0xc1/0x1f0
      [ 3653.761288]  blkdev_get_by_path+0xb8/0xd0
      [ 3653.761797]  btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
      [ 3653.762454]  open_fs_devices+0xd7/0x2c0 [btrfs]
      [ 3653.763055]  ? clone_fs_devices+0x8f/0x170 [btrfs]
      [ 3653.763689]  btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
      [ 3653.764370]  ? kvm_sched_clock_read+0x14/0x40
      [ 3653.764922]  open_ctree+0xb8e/0x17bf [btrfs]
      [ 3653.765493]  ? super_setup_bdi_name+0x79/0xd0
      [ 3653.766043]  btrfs_mount_root.cold+0x12/0xde [btrfs]
      [ 3653.766780]  ? rcu_read_lock_sched_held+0x3f/0x80
      [ 3653.767488]  ? kfree+0x1f2/0x3c0
      [ 3653.767979]  legacy_get_tree+0x30/0x50
      [ 3653.768548]  vfs_get_tree+0x28/0xc0
      [ 3653.769076]  vfs_kern_mount.part.0+0x71/0xb0
      [ 3653.769718]  btrfs_mount+0x11d/0x3a0 [btrfs]
      [ 3653.770381]  ? rcu_read_lock_sched_held+0x3f/0x80
      [ 3653.771086]  ? kfree+0x1f2/0x3c0
      [ 3653.771574]  legacy_get_tree+0x30/0x50
      [ 3653.772136]  vfs_get_tree+0x28/0xc0
      [ 3653.772673]  path_mount+0x2d4/0xbe0
      [ 3653.773201]  __x64_sys_mount+0x103/0x140
      [ 3653.773793]  do_syscall_64+0x3b/0xc0
      [ 3653.774333]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.775094] RIP: 0033:0x7f648bc45aaa
      
      This happens because btrfs_read_chunk_tree(), which is called only
      during mount, ends up acquiring the open_mutex of a block device while
      holding a read lock on a leaf of the chunk tree, whereas other paths
      need to acquire other locks before locking extent buffers of the chunk
      tree.
      
      Since at mount time when we call btrfs_read_chunk_tree() we know that
      we don't have other tasks running in parallel and modifying the chunk
      tree, we can simply skip locking of chunk tree extent buffers. So do
      that and move the assertion that checks the fs is not yet mounted to the
      top block of btrfs_read_chunk_tree(), with a comment before doing it.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d9380e0
    • Nikolay Borisov's avatar
      btrfs: fix memory ordering between normal and ordered work functions · 45da9c17
      Nikolay Borisov authored
      
      
      Ordered work functions aren't guaranteed to be handled by the same thread
      which executed the normal work functions. The only way execution between
      normal/ordered functions is synchronized is via the WORK_DONE_BIT.
      Unfortunately, the bitops used don't guarantee any ordering whatsoever.
      
      This manifested as seemingly inexplicable crashes on ARM64, where
      async_chunk::inode is seen as non-NULL in async_cow_submit, which causes
      submit_compressed_extents to be called, and a crash occurs because
      async_chunk::inode suddenly became NULL. The call trace was similar to:
      
          pc : submit_compressed_extents+0x38/0x3d0
          lr : async_cow_submit+0x50/0xd0
          sp : ffff800015d4bc20
      
          <registers omitted for brevity>
      
          Call trace:
           submit_compressed_extents+0x38/0x3d0
           async_cow_submit+0x50/0xd0
           run_ordered_work+0xc8/0x280
           btrfs_work_helper+0x98/0x250
           process_one_work+0x1f0/0x4ac
           worker_thread+0x188/0x504
           kthread+0x110/0x114
           ret_from_fork+0x10/0x18
      
      Fix this by adding the respective barrier calls, which ensure that all
      accesses preceding the setting of WORK_DONE_BIT are strictly ordered
      before setting the flag. At the same time, add a read barrier after
      reading WORK_DONE_BIT in run_ordered_work, which ensures all subsequent
      loads are strictly ordered after reading the bit. This in turn ensures
      that all accesses before WORK_DONE_BIT are strictly ordered before any
      access that can occur in ordered_func.
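      
      The ordering requirement can be illustrated with a userspace analogy
      built on C11 atomics. This is not the kernel patch: the kernel uses its
      own barrier primitives, and work_item, demo_normal_done() and
      demo_ordered_peek() are hypothetical names. A release store on the done
      flag pairs with an acquire load, so a reader that observes the flag also
      observes the payload written before it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical work item: payload is published via the done flag. */
struct work_item {
	void *payload;
	atomic_int done;
};

static void demo_work_init(struct work_item *w)
{
	w->payload = NULL;
	atomic_init(&w->done, 0);
}

/* "Normal" function side: write the payload, then publish with release. */
static void demo_normal_done(struct work_item *w, void *result)
{
	w->payload = result;
	atomic_store_explicit(&w->done, 1, memory_order_release);
}

/*
 * "Ordered" function side: the acquire load orders the payload read after
 * the flag read. Returns NULL while the work is not yet marked done.
 */
static void *demo_ordered_peek(struct work_item *w)
{
	if (!atomic_load_explicit(&w->done, memory_order_acquire))
		return NULL;
	return w->payload;
}
```

      With relaxed accesses instead, a weakly ordered CPU such as ARM64 could
      let the reader see the flag set while still reading a stale payload.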
      
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Fixes: 08a9ff32 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue")
      CC: stable@vger.kernel.org # 4.4+
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928
      
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Tested-by: Chris Murphy <chris@colorremedies.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      45da9c17
    • Qu Wenruo's avatar
      btrfs: fix a out-of-bound access in copy_compressed_data_to_page() · 6f019c0e
      Qu Wenruo authored
      [BUG]
      The following script can cause btrfs to crash:
      
        $ mount -o compress-force=lzo $DEV /mnt
        $ dd if=/dev/urandom of=/mnt/foo bs=4k count=1
        $ sync
      
      The call trace looks like this:
      
        general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4
        Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
        RIP: 0010:__memcpy+0x12/0x20
        Call Trace:
         lzo_compress_pages+0x236/0x540 [btrfs]
         btrfs_compress_pages+0xaa/0xf0 [btrfs]
         compress_file_range+0x431/0x8e0 [btrfs]
         async_cow_start+0x12/0x30 [btrfs]
         btrfs_work_helper+0xf6/0x3e0 [btrfs]
         process_one_work+0x294/0x5d0
         worker_thread+0x55/0x3c0
         kthread+0x140/0x170
         ret_from_fork+0x22/0x30
        ---[ end trace 63c3c0f131e61982 ]---
      
      [CAUSE]
      In lzo_compress_pages(), parameter @out_pages is not only an output
      parameter (for the number of compressed pages), but also an input
      parameter, as the upper limit of compressed pages we can utilize.
      
      In commit d4088803 ("btrfs: subpage: make lzo_compress_pages()
      compatible"), the refactoring doesn't take @out_pages as an input, thus
      completely ignoring the limit.
      
      And for the compress-force case, we could hit incompressible data whose
      compressed size goes beyond the page limit, causing the above crash.
      
      [FIX]
      Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(),
      and check if we're beyond the limit before accessing the pages.
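      
      A minimal sketch of the limit check, assuming a tiny page size and
      hypothetical names (DEMO_PAGE_SIZE, demo_copy_to_pages()). It is a
      userspace model of checking the page index against the limit before
      touching the page, not the actual btrfs helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny "page" so the example is easy to follow */

/*
 * Copy len bytes of compressed data into out_pages[] starting at byte
 * offset *out_pos, refusing to touch any page index >= max_nr_page.
 */
static int demo_copy_to_pages(const char *data, size_t len,
			      char out_pages[][DEMO_PAGE_SIZE],
			      size_t max_nr_page, size_t *out_pos)
{
	while (len > 0) {
		size_t index = *out_pos / DEMO_PAGE_SIZE;
		size_t offset = *out_pos % DEMO_PAGE_SIZE;
		size_t chunk = DEMO_PAGE_SIZE - offset;

		/* Check the limit before accessing the page. */
		if (index >= max_nr_page)
			return -E2BIG;
		if (chunk > len)
			chunk = len;
		memcpy(&out_pages[index][offset], data, chunk);
		data += chunk;
		len -= chunk;
		*out_pos += chunk;
	}
	return 0;
}
```

      The caller passes the original upper bound as max_nr_page, so
      incompressible data ends with -E2BIG instead of a write past the array.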
      
      Note: this also fixes a crash on 32bit architectures that was suspected
      to be caused by the merge of btrfs patches to 5.16-rc1. Reported in
      https://lore.kernel.org/all/20211104115001.GU20319@twin.jikos.cz/
      
      Reported-by: Omar Sandoval <osandov@fb.com>
      Fixes: d4088803 ("btrfs: subpage: make lzo_compress_pages() compatible")
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      6f019c0e
  6. Oct 29, 2021
    • Filipe Manana's avatar
      btrfs: remove root argument from check_item_in_log() · d1ed82f3
      Filipe Manana authored
      
      
      The root argument passed to check_item_in_log() always matches the root
      of the given directory, so it can be eliminated.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1ed82f3
    • Filipe Manana's avatar
      btrfs: remove root argument from add_link() · 6d9cc072
      Filipe Manana authored
      
      
      The root argument for tree-log.c:add_link() always matches the root of the
      given directory and the given inode, so it can be eliminated.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d9cc072
    • Filipe Manana's avatar
      btrfs: remove root argument from btrfs_unlink_inode() · 4467af88
      Filipe Manana authored
      
      
      The root argument passed to btrfs_unlink_inode() and its callee,
      __btrfs_unlink_inode(), always matches the root of the given directory and
      the given inode. So remove the argument and make __btrfs_unlink_inode()
      use the root of the directory.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4467af88
    • Filipe Manana's avatar
      btrfs: remove root argument from drop_one_dir_item() · 9798ba24
      Filipe Manana authored
      
      
      The root argument for drop_one_dir_item() always matches the root of the
      given directory inode, since each log tree is associated to one and only
      one subvolume/root, so remove the argument.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9798ba24
    • Li Zhang's avatar
      btrfs: clear MISSING device status bit in btrfs_close_one_device · 5d03dbeb
      Li Zhang authored
      Reported bug: https://github.com/kdave/btrfs-progs/issues/389
      
      
      
      There's a problem with scrub reporting an aborted status but returning
      error code 0, on a filesystem with a missing and re-added device.
      
      Roughly these steps:
      
      - mkfs -d raid1 dev1 dev2
      - fill with data
      - unmount
      - make dev1 disappear
      - mount -o degraded
      - copy more data
      - make dev1 appear again
      
      Running scrub afterwards reports that the command was aborted, but the
      system log message says the exit code was 0.
      
      It seems that the cause of the error is decrementing
      fs_devices->missing_devices without clearing device->dev_state. Every
      time we unmount the filesystem, close_ctree() is called, which
      eventually calls btrfs_close_one_device() to close the device, but that
      only decrements fs_devices->missing_devices and does not clear the
      device's BTRFS_DEV_STATE_MISSING bit. Worse, this bug causes an integer
      wraparound, because fs_devices->missing_devices decreases on every
      unmount. Once fs_devices->missing_devices has reached 0, the next
      decrement wraps around.
      
      With added debugging:
      
         loop1: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 0
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
      
      If fs_devices->missing_devices is 0, next time it would be 18446744073709551615
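      
      The counter sequence 1, 0, 18446744073709551615 seen in the log can be
      reproduced with a simplified userspace model. Everything here is
      hypothetical (demo_device, demo_mount(), demo_close()); in particular,
      the model assumes that mount only counts a device as missing when its
      missing flag is not already set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_device {
	bool present;	/* false: the backing block device is gone */
	bool missing;	/* models BTRFS_DEV_STATE_MISSING */
};

struct demo_fs_devices {
	uint64_t missing_devices;
};

/* Mount: count a device as missing only if it is not already flagged. */
static void demo_mount(struct demo_fs_devices *fs, struct demo_device *dev)
{
	if (!dev->present && !dev->missing) {
		dev->missing = true;
		fs->missing_devices++;
	}
}

/* Unmount; clear_bit selects the fixed (true) or buggy (false) behavior. */
static void demo_close(struct demo_fs_devices *fs, struct demo_device *dev,
		       bool clear_bit)
{
	if (dev->missing) {
		if (clear_bit)
			dev->missing = false;
		fs->missing_devices--;	/* wraps around if already 0 */
	}
}

/* Run a number of mount/unmount cycles and return the final counter. */
static uint64_t demo_cycles(int cycles, bool clear_bit)
{
	struct demo_fs_devices fs = { .missing_devices = 0 };
	struct demo_device dev = { .present = false, .missing = false };

	for (int i = 0; i < cycles; i++) {
		demo_mount(&fs, &dev);
		demo_close(&fs, &dev, clear_bit);
	}
	return fs.missing_devices;
}
```

      In this model the stale flag makes the second mount skip the increment,
      so the second unmount decrements a counter that is already 0.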
      
      After applying this patch, fs_devices->missing_devices appears to be
      correct:
      
        $ truncate -s 10g test1
        $ truncate -s 10g test2
        $ losetup /dev/loop1 test1
        $ losetup /dev/loop2 test2
        $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
        $ losetup -d /dev/loop2
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ dmesg
      
         loop1: detected capacity change from 0 to 20971520
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): checking UUID tree
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5d03dbeb
    • Anand Jain's avatar
      btrfs: call btrfs_check_rw_degradable only if there is a missing device · 5c78a5e7
      Anand Jain authored
      
      
      In open_ctree(), btrfs_check_rw_degradable() [1] checks each block
      group individually to see whether at least the minimum number of devices
      is available for that profile. If all the devices are available, then we
      don't have to check whether the filesystem is degradable.
      
      [1]
      open_ctree()
      ::
      3559 if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
      
      Also, before calling btrfs_check_rw_degradable() in open_ctree(), at the
      line number shown below [2], we call btrfs_read_chunk_tree() and go down
      to add_missing_dev() to record the number of missing devices.
      
      [2]
      open_ctree()
      ::
      3454         ret = btrfs_read_chunk_tree(fs_info);
      
      btrfs_read_chunk_tree()
        read_one_chunk() / read_one_dev()
          add_missing_dev()
      
      So, check if there is any missing device before btrfs_check_rw_degradable()
      in open_ctree().
      
      Also, with this the mount command could save ~16ms [3] in the most
      common case, that is, when no device is missing.
      
      [3]
       1) * 16934.96 us | btrfs_check_rw_degradable [btrfs]();
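      
      The idea of skipping the expensive walk can be sketched as a
      short-circuit on the missing device counter. This is a userspace model
      with hypothetical names (demo_fs, demo_check_rw_degradable(),
      demo_mount_allowed()), and the tolerance used by the stand-in check is
      arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_fs {
	uint64_t missing_devices;
	bool read_only;
	int degradable_checks;	/* how often the expensive check ran */
};

/* Stand-in for the expensive per-block-group walk. */
static bool demo_check_rw_degradable(struct demo_fs *fs)
{
	fs->degradable_checks++;
	return fs->missing_devices <= 1;	/* arbitrary demo tolerance */
}

/* Returns false only if a read-write mount must be refused. */
static bool demo_mount_allowed(struct demo_fs *fs)
{
	/* The walk only runs when at least one device is missing. */
	if (!fs->read_only && fs->missing_devices &&
	    !demo_check_rw_degradable(fs))
		return false;
	return true;
}
```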
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5c78a5e7
    • David Sterba's avatar
      btrfs: send: prepare for v2 protocol · e77fbf99
      David Sterba authored
      
      
      This is preparatory work for send protocol update to version 2 and
      higher.
      
      We have many pending protocol update requests but still don't have the
      basic protocol revision support in place. The first thing that must
      happen is the actual versioning support.
      
      The protocol version is u32 and is a new member in the send ioctl
      struct. Validity of the version field is backed by a new flag bit. Old
      kernels would fail when a higher version is requested. Protocol version
      0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION,
      that's also exported in sysfs.
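      
      The version selection rules above can be sketched as follows. This is a
      model, not the kernel code: demo_pick_version() and
      DEMO_SEND_STREAM_VERSION are hypothetical names, and both the fallback
      to version 1 when the flag is unset and the -EPROTO error code are
      assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SEND_STREAM_VERSION 1u	/* highest version this demo supports */

/*
 * Pick the protocol version for a send request. The version field is only
 * looked at when the caller set the flag bit that marks it as valid.
 */
static int demo_pick_version(bool version_flag_set, uint32_t requested,
			     uint32_t *proto)
{
	if (!version_flag_set) {
		*proto = 1;	/* assumption: old callers get version 1 */
		return 0;
	}
	if (requested > DEMO_SEND_STREAM_VERSION)
		return -EPROTO;	/* assumption: error code is illustrative */
	/* Version 0 means "use the highest supported version". */
	*proto = requested ? requested : DEMO_SEND_STREAM_VERSION;
	return 0;
}
```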
      
      The version is still unchanged and will be increased once we have new
      incompatible commands or stream updates.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e77fbf99
  7. Oct 27, 2021
    • Anand Jain's avatar
      btrfs: fix comment about sector sizes supported in 64K systems · 50780d9b
      Anand Jain authored
      Commit 95ea0486 ("btrfs: allow read-write for 4K sectorsize on 64K
      page size systems") added write support for 4K sectorsize on 64K page
      size systems. Fix the now stale comments.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      50780d9b
    • Josef Bacik's avatar
      btrfs: update device path inode time instead of bd_inode · 54fde91f
      Josef Bacik authored
      Christoph pointed out that I'm updating bdev->bd_inode for the device
      time when we remove block devices from a btrfs file system, however this
      isn't actually exposed to anything.  The inode we want to update is the
      one that's associated with the path to the device, usually on devtmpfs,
      so that blkid notices the difference.
      
      We still don't want to do the blkdev_open, so use kern_path() to get the
      path to the given device and update the time on that inode.
      
      Fixes: 8f96a5bf ("btrfs: update the bdev time directly when closing")
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      54fde91f
    • Josef Bacik's avatar
      fs: export an inode_update_time helper · e60feb44
      Josef Bacik authored
      
      
      If you already have an inode and need to update the time on the inode
      there is no way to do this properly.  Export this helper to allow file
      systems to update time on the inode so the appropriate handler is
      called, either ->update_time or generic_update_time.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e60feb44
    • Omar Sandoval's avatar
      btrfs: fix deadlock when defragging transparent huge pages · 24bcb454
      Omar Sandoval authored
      
      
      Attempting to defragment a Btrfs file containing a transparent huge page
      immediately deadlocks with the following stack trace:
      
        #0  context_switch (kernel/sched/core.c:4940:2)
        #1  __schedule (kernel/sched/core.c:6287:8)
        #2  schedule (kernel/sched/core.c:6366:3)
        #3  io_schedule (kernel/sched/core.c:8389:2)
        #4  wait_on_page_bit_common (mm/filemap.c:1356:4)
        #5  __lock_page (mm/filemap.c:1648:2)
        #6  lock_page (./include/linux/pagemap.h:625:3)
        #7  pagecache_get_page (mm/filemap.c:1910:4)
        #8  find_or_create_page (./include/linux/pagemap.h:420:9)
        #9  defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
        #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
        #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
        #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
        #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
        #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
        #15 vfs_ioctl (fs/ioctl.c:51:10)
        #16 __do_sys_ioctl (fs/ioctl.c:874:11)
        #17 __se_sys_ioctl (fs/ioctl.c:860:1)
        #18 __x64_sys_ioctl (fs/ioctl.c:860:1)
        #19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
        #20 do_syscall_64 (arch/x86/entry/common.c:80:7)
        #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)
      
      A huge page is represented by a compound page, which consists of a
      struct page for each PAGE_SIZE page within the huge page. The first
      struct page is the "head page", and the remaining are "tail pages".
      
      Defragmentation attempts to lock each page in the range. However,
      lock_page() on a tail page actually locks the corresponding head page.
      So, if defragmentation tries to lock more than one struct page in a
      compound page, it tries to lock the same head page twice and deadlocks
      with itself.
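
      The self-deadlock described above can be modelled in userspace. The
      sketch below is only an illustration: struct demo_page and the demo_*
      helpers are hypothetical names, not kernel APIs, and a non-blocking
      lock attempt stands in for lock_page() so that the point where a real
      lock would sleep forever can be reported instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; this is not the kernel's struct page. */
struct demo_page {
        struct demo_page *head; /* points to itself for the head page */
        bool locked;            /* only meaningful on the head page */
};

/* Set up a compound page: pages[0] is the head, the rest are tails. */
static void demo_init_compound(struct demo_page *pages, size_t nr)
{
        assert(nr > 0);
        for (size_t i = 0; i < nr; i++) {
                pages[i].head = &pages[0];
                pages[i].locked = false;
        }
}

/*
 * Non-blocking lock attempt. Locking any page of the compound page really
 * locks its head page. Returns false where a blocking lock would sleep.
 */
static bool demo_trylock_page(struct demo_page *page)
{
        struct demo_page *head = page->head;

        if (head->locked)
                return false;
        head->locked = true;
        return true;
}

/*
 * Try to lock every page in the range the way a per-page loop would.
 * Returns the index of the first page whose lock attempt would block,
 * or nr if every attempt succeeded.
 */
static size_t demo_lock_range(struct demo_page *pages, size_t nr)
{
        for (size_t i = 0; i < nr; i++) {
                if (!demo_trylock_page(&pages[i]))
                        return i;
        }
        return nr;
}
```

      In this model, locking a range of four pages that all belong to one
      compound page stops at index 1: the first tail page resolves to the
      head page that the same caller locked one iteration earlier.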
      
      Ideally, we should be able to defragment transparent huge pages.
      However, THP for filesystems is currently read-only, so a lot of code is
      not ready to use huge pages for I/O. For now, let's just return
      ETXTBSY.
      
      This can be reproduced with the following on a kernel with
      CONFIG_READ_ONLY_THP_FOR_FS=y:
      
        $ cat create_thp_file.c
        #include <fcntl.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/mman.h>
      
        static const char zeroes[1024 * 1024];
        static const size_t FILE_SIZE = 2 * 1024 * 1024;
      
        int main(int argc, char **argv)
        {
                if (argc != 2) {
                        fprintf(stderr, "usage: %s PATH\n", argv[0]);
                        return EXIT_FAILURE;
                }
                int fd = creat(argv[1], 0777);
                if (fd == -1) {
                        perror("creat");
                        return EXIT_FAILURE;
                }
                size_t written = 0;
                while (written < FILE_SIZE) {
                        ssize_t ret = write(fd, zeroes,
                                            sizeof(zeroes) < FILE_SIZE - written ?
                                            sizeof(zeroes) : FILE_SIZE - written);
                        if (ret < 0) {
                                perror("write");
                                return EXIT_FAILURE;
                        }
                        written += ret;
                }
                close(fd);
                fd = open(argv[1], O_RDONLY);
                if (fd == -1) {
                        perror("open");
                        return EXIT_FAILURE;
                }
      
                /*
                 * Reserve some address space so that we can align the file mapping to
                 * the huge page size.
                 */
                void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
                                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (placeholder_map == MAP_FAILED) {
                        perror("mmap (placeholder)");
                        return EXIT_FAILURE;
                }
      
                void *aligned_address =
                        (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));
      
                void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
                                 MAP_SHARED | MAP_FIXED, fd, 0);
                if (map == MAP_FAILED) {
                        perror("mmap");
                        return EXIT_FAILURE;
                }
                if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
                        perror("madvise");
                        return EXIT_FAILURE;
                }
      
                char *line = NULL;
                size_t line_capacity = 0;
                FILE *smaps_file = fopen("/proc/self/smaps", "r");
                if (!smaps_file) {
                        perror("fopen");
                        return EXIT_FAILURE;
                }
                for (;;) {
                        for (size_t off = 0; off < FILE_SIZE; off += 4096)
                                ((volatile char *)map)[off];
      
                        ssize_t ret;
                        bool this_mapping = false;
                        while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
                                unsigned long start, end, huge;
                                if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
                                        this_mapping = (start <= (uintptr_t)map &&
                                                        (uintptr_t)map < end);
                                } else if (this_mapping &&
                                           sscanf(line, "FilePmdMapped: %ld", &huge) == 1 &&
                                           huge > 0) {
                                        return EXIT_SUCCESS;
                                }
                        }
      
                        sleep(6);
                        rewind(smaps_file);
                        fflush(smaps_file);
                }
        }
        $ ./create_thp_file huge
        $ btrfs fi defrag -czstd ./huge
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      24bcb454
    • Anand Jain's avatar
      btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit · 020e5277
      Anand Jain authored
      Commit 2efc459d ("sysfs: Add sysfs_emit and sysfs_emit_at to format
      sysfs out") merged in 5.10 introduced two new functions, sysfs_emit()
      and sysfs_emit_at(), which are aware of the PAGE_SIZE limit of the
      output buffer.
      
      Use the above two new functions instead of scnprintf() and snprintf()
      in various sysfs show().
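
      To illustrate what "aware of the PAGE_SIZE limit" means, here is a
      minimal userspace sketch of a bounded formatter. demo_emit_at() and
      DEMO_PAGE_SIZE are hypothetical names, and the 4K page size is an
      assumption; the real kernel helpers do additional sanity checks that
      this sketch omits.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define DEMO_PAGE_SIZE 4096 /* assumption: 4K pages */

/*
 * Format into buf starting at offset "at", never writing past
 * DEMO_PAGE_SIZE bytes. Returns the number of characters actually
 * stored, excluding the terminating NUL.
 */
static int demo_emit_at(char *buf, int at, const char *fmt, ...)
{
        va_list args;
        int room, len;

        assert(buf != NULL && fmt != NULL);
        if (at < 0 || at >= DEMO_PAGE_SIZE)
                return 0;
        room = DEMO_PAGE_SIZE - at;

        va_start(args, fmt);
        len = vsnprintf(buf + at, (size_t)room, fmt, args);
        va_end(args);

        if (len < 0)
                return 0;
        /* vsnprintf reports the untruncated length, so clamp it. */
        return len >= room ? room - 1 : len;
}
```

      A show() style caller can then accumulate output with
      len += demo_emit_at(buf, len, ...) without tracking the remaining
      space by hand, which is the convenience the conversion is after.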
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      020e5277
    • Qu Wenruo's avatar
      btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE · 38732474
      Qu Wenruo authored
      
      
      It's common practice to avoid using sizeof(struct btrfs_super_block)
      (3531) and to use BTRFS_SUPER_INFO_SIZE (4096) instead.

      The problem is that sizeof(struct btrfs_super_block) has not matched
      BTRFS_SUPER_INFO_SIZE from the very beginning.
      
      Furthermore, for all call sites except selftests, we always allocate
      BTRFS_SUPER_INFO_SIZE space for super block, there isn't any real reason
      to use the smaller value, and it doesn't really save any space.
      
      So let's get rid of such confusing behavior, and unify those two values.
      
      This modification also adds a new static_assert() to verify the size,
      and moves the BTRFS_SUPER_INFO_* macros to the definition of
      btrfs_super_block for the static_assert().
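
      The general technique, padding a structure up to its reserved on-disk
      size and letting the compiler verify it, can be sketched as follows.
      struct demo_super_block and its fields are hypothetical stand-ins and
      not the real btrfs layout; the packed attribute is a GCC/Clang
      extension. With the sizes quoted above, the gap to close would be
      4096 - 3531 = 565 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SUPER_INFO_SIZE 4096

/* Hypothetical stand-in for an on-disk super block with a few fields. */
struct demo_super_block {
        uint8_t  csum[32];
        uint8_t  fsid[16];
        uint64_t bytenr;
        uint64_t flags;
        /* Pad the structure so its size equals the on-disk reservation. */
        uint8_t  padding[DEMO_SUPER_INFO_SIZE - 32 - 16 - 8 - 8];
} __attribute__((packed));

/* Fails the build if a later field change breaks the size invariant. */
static_assert(sizeof(struct demo_super_block) == DEMO_SUPER_INFO_SIZE,
              "super block struct must fill the reserved space");
```

      Once the two values agree, callers no longer have to remember which of
      sizeof() or the size macro is the safe one to use.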
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38732474
    • Filipe Manana's avatar
      btrfs: update comments for chunk allocation -ENOSPC cases · ecd84d54
      Filipe Manana authored
      
      
      Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
      describe which cases can lead to a failure to allocate metadata and system
      space despite having previously reserved space. This adds one more reason
      that I previously forgot to mention.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ecd84d54
    • Filipe Manana's avatar
      btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
      Filipe Manana authored
      
      
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
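
      The four steps above form a cycle in the wait-for graph, which a small
      userspace model can replay. This is only a sketch: the demo_* names
      are hypothetical, and the two locks are plain table entries rather
      than the kernel's chunk mutex and extent buffer locks.

```c
#include <assert.h>
#include <stdbool.h>

enum { DEMO_NONE = -1 };
enum { DEMO_TASK_A, DEMO_TASK_B, DEMO_NR_TASKS };
enum { DEMO_CHUNK_MUTEX, DEMO_EXTENT_BUFFER, DEMO_NR_LOCKS };

struct demo_state {
        int owner[DEMO_NR_LOCKS];      /* task holding each lock */
        int waiting_on[DEMO_NR_TASKS]; /* lock each task is blocked on */
};

static void demo_init(struct demo_state *s)
{
        for (int i = 0; i < DEMO_NR_LOCKS; i++)
                s->owner[i] = DEMO_NONE;
        for (int i = 0; i < DEMO_NR_TASKS; i++)
                s->waiting_on[i] = DEMO_NONE;
}

/* Take the lock if it is free, otherwise record that the task blocks. */
static bool demo_lock(struct demo_state *s, int task, int lock)
{
        assert(task >= 0 && task < DEMO_NR_TASKS);
        assert(lock >= 0 && lock < DEMO_NR_LOCKS);
        if (s->owner[lock] == DEMO_NONE) {
                s->owner[lock] = task;
                return true;
        }
        s->waiting_on[task] = lock;
        return false;
}

/* Follow task -> awaited lock -> owner; returning to start is a deadlock. */
static bool demo_deadlocked(const struct demo_state *s, int start)
{
        int task = start;

        for (int hops = 0; hops < DEMO_NR_TASKS; hops++) {
                int lock = s->waiting_on[task];

                if (lock == DEMO_NONE)
                        return false;
                task = s->owner[lock];
                if (task == start)
                        return true;
        }
        return false;
}
```

      Replaying steps 1 to 3 leaves only task B blocked, so no cycle exists
      yet. Step 4 makes task A wait on the extent buffer held by task B,
      and demo_deadlocked() then reports a cycle for both tasks.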
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      So fix this by making sure that whenever we try to modify the chunk btree
      and we are neither in a chunk allocation context nor in a chunk remove
      context, we reserve system space before modifying the chunk btree.
      
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2bb2e00e
    • Johannes Thumshirn's avatar
      btrfs: zoned: use greedy gc for auto reclaim · 2ca0ec77
      Johannes Thumshirn authored
      
      
      Currently auto reclaim of unusable zones reclaims the block-groups in
      the order they have been added to the reclaim list.
      
      Change this to a greedy algorithm by sorting the list so we have the
      block-groups with the least amount of valid bytes reclaimed first.
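
      A minimal sketch of the greedy ordering, under simplifying
      assumptions: the reclaim list is modelled as an array sorted with
      qsort() rather than a kernel linked list, and struct demo_block_group
      with its "used" field is a hypothetical stand-in for the real block
      group structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block group on the reclaim list. */
struct demo_block_group {
        uint64_t start; /* logical start, used here only as an identifier */
        uint64_t used;  /* bytes that are still valid and must be moved */
};

/* Order block groups by ascending amount of valid bytes. */
static int demo_cmp_used(const void *a, const void *b)
{
        const struct demo_block_group *bg1 = a;
        const struct demo_block_group *bg2 = b;

        /* Compare instead of subtracting to avoid truncating 64-bit values. */
        return (bg1->used > bg2->used) - (bg1->used < bg2->used);
}

/* Sort so that the cheapest block groups are reclaimed first. */
static void demo_sort_reclaim_list(struct demo_block_group *bgs, size_t nr)
{
        if (nr == 0)
                return;
        assert(bgs != NULL);
        qsort(bgs, nr, sizeof(*bgs), demo_cmp_used);
}
```

      Reclaiming the emptiest block groups first frees each zone with the
      least amount of data to relocate, which is what makes the approach
      greedy.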
      
      Note: we can't splice the block groups from reclaim_bgs to let the sort
      happen outside of the lock. The block groups can still be in use by
      other parts, e.g. via bg_list, and we must hold unused_bgs_lock while
      processing them.
      
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ write note and comment why we can't splice the list ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      2ca0ec77
    • Christoph Hellwig's avatar
      btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state · 813ebc16
      Christoph Hellwig authored
      
      
      Just use the %pg format specifier in all the debug printks that
      previously used the stored name.  Note that both bdevname and the %pg
      specifier never print a pathname, so the kbasename call wasn't needed
      to start with.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ adjust messages and indentation ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      813ebc16
    • Josef Bacik's avatar
      btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls · 1a15eb72
      Josef Bacik authored
      
      
      For device removal and replace we call btrfs_find_device_by_devspec,
      which if we give it a device path and nothing else will call
      btrfs_get_dev_args_from_path, which opens the block device and reads the
      super block and then looks up our device based on that.
      
      However, at this point we're holding the sb write "lock", so reading the
      block device pulls in the dependency of ->open_mutex, which produces the
      following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #405 Not tainted
      ------------------------------------------------------
      losetup/11576 is trying to acquire lock:
      ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_get_by_path+0x98/0xa0
             btrfs_get_bdev_and_sb+0x1b/0xb0
             btrfs_find_device_by_devspec+0x12b/0x1c0
             btrfs_rm_device+0x127/0x610
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11576:
       #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f31b02404cb
      
      Instead what we want to do is populate our device lookup args before we
      grab any locks, and then pass these args into btrfs_rm_device().  From
      there we can find the device and do the appropriate removal.
      
      Suggested-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a15eb72
    • Josef Bacik's avatar
      btrfs: add a btrfs_get_dev_args_from_path helper · faa775c4
      Josef Bacik authored
      
      
      We are going to want to populate our device lookup args outside of any
      locks and then do the actual device lookup later, so add a helper to do
      this work and make btrfs_find_device_by_devspec() use this helper for
      now.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      faa775c4
    • Josef Bacik's avatar
      btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
      Josef Bacik authored
      
      
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
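
      The idea of one lookup function driven by a criteria struct can be
      sketched in userspace as follows. All names here (demo_lookup_args,
      demo_device, demo_find_device) and the choice of fields are
      hypothetical illustrations and do not claim to mirror the actual
      btrfs_dev_lookup_args layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_UUID_SIZE 16

/* Hypothetical device record. */
struct demo_device {
        uint64_t devid;
        uint8_t uuid[DEMO_UUID_SIZE];
        bool missing;
};

/* Hypothetical lookup criteria; criteria left unset are ignored. */
struct demo_lookup_args {
        bool has_devid;      /* compare devid only when true */
        uint64_t devid;
        const uint8_t *uuid; /* NULL means "any uuid" */
        bool missing;        /* true means "only devices marked missing" */
};

/* A device matches when it satisfies every criterion that is set. */
static bool demo_match(const struct demo_lookup_args *args,
                       const struct demo_device *dev)
{
        if (args->has_devid && dev->devid != args->devid)
                return false;
        if (args->uuid && memcmp(dev->uuid, args->uuid, DEMO_UUID_SIZE) != 0)
                return false;
        if (args->missing && !dev->missing)
                return false;
        return true;
}

/* Single lookup entry point: return the first matching device or NULL. */
static const struct demo_device *
demo_find_device(const struct demo_device *devs, size_t nr,
                 const struct demo_lookup_args *args)
{
        assert(args != NULL);
        for (size_t i = 0; i < nr; i++) {
                if (demo_match(args, &devs[i]))
                        return &devs[i];
        }
        return NULL;
}
```

      Callers fill in only the fields they care about, for example
      { .has_devid = true, .devid = 2 }, so a new criterion becomes one more
      field and one more check instead of another lookup function.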
      
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      562d7b15
    • Josef Bacik's avatar
      btrfs: do not call close_fs_devices in btrfs_rm_device · 8b41393f
      Josef Bacik authored
      
      
      There's a subtle case where if we're removing the seed device from a
      file system we need to free its private copy of the fs_devices.  However
      we do not need to call close_fs_devices(), because at this point there
      are no devices left to close as we've closed the last one.  The only
      thing that close_fs_devices() does is decrement ->opened, which should
      be 1.  We want to avoid calling close_fs_devices() here because it has a
      lockdep_assert_held(&uuid_mutex), and we are going to stop holding the
      uuid_mutex in this path.
      
      So simply decrement the ->opened counter like we should, and then clean
      up like normal.  Also add a comment explaining what we're doing here, as
      I initially removed this code erroneously.
      
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8b41393f