  1. Dec 04, 2018
    • btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable · 10950929
      Qu Wenruo authored
      [BUG]
      A completely valid btrfs will refuse to mount, with error message like:
        BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
          bg_start=12018974720 bg_len=10888413184, invalid block group size, \
          have 10888413184 expect (0, 10737418240]
      
      This has been reported several times as the 4.19 kernel is now being
      used. The filesystem refuses to mount, but is otherwise ok and booting
      4.18 is a workaround.
      
      Btrfs check returns no error, and all kernels used on this fs are later
      than 2011, which should all have the 10G size limit commit.
      
      [CAUSE]
      For a btrfs with 12 devices, we could allocate a chunk larger than 10G
      due to the stripe size bump up.
      
      __btrfs_alloc_chunk()
      |- max_stripe_size = 1G
      |- max_chunk_size = 10G
      |- data_stripe = 11
      |- if (1G * 11 > 10G) {
             stripe_size = 976128930;
             stripe_size = round_up(976128930, SZ_16M) = 989855744
      
      However the final stripe_size (989855744) * 11 = 10888413184, which is
      still larger than 10G.
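The arithmetic above can be reproduced in a few lines. This is a minimal Python sketch, not the kernel code: the variable names mirror the changelog, and the integer division by the number of data stripes is an assumption about how the value 976128930 was obtained.

```python
SZ_1G = 1 << 30
SZ_16M = 16 << 20


def round_up(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


max_stripe_size = SZ_1G
max_chunk_size = 10 * SZ_1G
data_stripes = 11  # 12 devices, one of them holding parity

stripe_size = max_stripe_size
if stripe_size * data_stripes > max_chunk_size:
    # Shrink the stripe so the chunk fits, then align it to 16M.
    stripe_size = max_chunk_size // data_stripes
    stripe_size = round_up(stripe_size, SZ_16M)

chunk_size = stripe_size * data_stripes
print(stripe_size, chunk_size, chunk_size > max_chunk_size)
```

The rounding step is what pushes the final chunk size back over the 10G limit that the tree-checker was enforcing.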
      
      [FIX]
      For the comprehensive check, we need to do the full check at chunk read
      time, and rely on bg <-> chunk mapping to do the check.
      
      We could just skip the length check for now.
      
      Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
      Cc: stable@vger.kernel.org # v4.19+
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. Nov 23, 2018
    • btrfs: relocation: set trans to be NULL after ending transaction · 42a657f5
      Pan Bian authored
      The function relocate_block_group calls btrfs_end_transaction to release
      trans when update_backref_cache returns 1, and then continues the loop
      body. If btrfs_block_rsv_refill fails this time, it will jump out of the
      loop and the freed trans will be accessed. This may result in a
      use-after-free bug. The patch assigns NULL to trans after trans is
      released so that it will not be accessed.
      
      Fixes: 0647bf56 ("Btrfs: improve forever loop when doing balance relocation")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between enabling quotas and subvolume creation · 552f0329
      Filipe Manana authored
      
      
      We have a race between enabling quotas and subvolume creation that causes
      subvolume creation to fail with -EINVAL, and the following diagram shows
      how it happens:
      
                    CPU 0                                          CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
      
                                                        btrfs_ioctl()
                                                         create_subvol()
                                                          btrfs_qgroup_inherit()
                                                           -> save fs_info->quota_root
                                                              into quota_root
                                                           -> stores a NULL value
                                                           -> tries to lock the mutex
                                                              qgroup_ioctl_lock
                                                              -> blocks waiting for
                                                                 the task at CPU0
      
         -> sets BTRFS_FS_QUOTA_ENABLED in fs_info
         -> sets quota_root in fs_info->quota_root
            (non-NULL value)
      
         mutex_unlock(fs_info->qgroup_ioctl_lock)
      
                                                           -> checks quota enabled
                                                              flag is set
                                                           -> returns -EINVAL because
                                                              fs_info->quota_root was
                                                              NULL before it acquired
                                                              the mutex
                                                              qgroup_ioctl_lock
                                                         -> ioctl returns -EINVAL
      
      Returning -EINVAL to user space will be confusing if all the arguments
      passed to the subvolume creation ioctl were valid.
      
      Fix it by grabbing the value from fs_info->quota_root after acquiring
      the mutex.
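The interleaving in the diagram can be modelled in user space. The following Python sketch is only an illustration under stated assumptions: `FsInfo`, `qgroup_inherit_buggy`, `qgroup_inherit_fixed` and `run` are hypothetical names, and a `threading.Event` is used to force the ordering shown above. It is not the kernel implementation.

```python
import threading

EINVAL = 22


class FsInfo:
    """Hypothetical stand-in for the relevant fs_info fields."""

    def __init__(self):
        self.qgroup_ioctl_lock = threading.Lock()
        self.quota_enabled = False
        self.quota_root = None


def qgroup_inherit_buggy(fs, sampled):
    quota_root = fs.quota_root  # read BEFORE taking the mutex
    sampled.set()
    with fs.qgroup_ioctl_lock:
        if not fs.quota_enabled:
            return 0
        return -EINVAL if quota_root is None else 0


def qgroup_inherit_fixed(fs, sampled):
    sampled.set()
    with fs.qgroup_ioctl_lock:
        if not fs.quota_enabled:
            return 0
        quota_root = fs.quota_root  # read AFTER taking the mutex
        return -EINVAL if quota_root is None else 0


def run(inherit):
    fs = FsInfo()
    sampled = threading.Event()
    result = []
    fs.qgroup_ioctl_lock.acquire()  # "CPU 0": quota enable holds the mutex
    worker = threading.Thread(
        target=lambda: result.append(inherit(fs, sampled)))
    worker.start()                  # "CPU 1": subvolume creation
    sampled.wait()                  # CPU 1 has passed its early read
    fs.quota_enabled = True
    fs.quota_root = object()
    fs.qgroup_ioctl_lock.release()
    worker.join()
    return result[0]


print(run(qgroup_inherit_buggy), run(qgroup_inherit_fixed))
```

In this model the early read returns -EINVAL although quotas were fully enabled by the time the mutex was acquired, while the late read succeeds.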
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. Nov 22, 2018
    • Btrfs: send, fix infinite loop due to directory rename dependencies · a4390aee
      Robbie Ko authored
      
      
      When doing an incremental send, due to the need of delaying directory move
      (rename) operations we can end up in an infinite loop at
      apply_children_dir_moves().
      
      An example scenario that triggers this problem is described below, where
      directory names correspond to the numbers of their respective inodes.
      
      Parent snapshot:
      
       .
       |--- 261/
             |--- 271/
                   |--- 266/
                         |--- 259/
                         |--- 260/
                         |     |--- 267
                         |
                         |--- 264/
                         |     |--- 258/
                         |           |--- 257/
                         |
                         |--- 265/
                         |--- 268/
                         |--- 269/
                         |     |--- 262/
                         |
                         |--- 270/
                         |--- 272/
                         |     |--- 263/
                         |     |--- 275/
                         |
                         |--- 274/
                               |--- 273/
      
      Send snapshot:
      
       .
       |-- 275/
            |-- 274/
                 |-- 273/
                      |-- 262/
                           |-- 269/
                                |-- 258/
                                     |-- 271/
                                          |-- 268/
                                               |-- 267/
                                                    |-- 270/
                                                         |-- 259/
                                                         |    |-- 265/
                                                         |
                                                         |-- 272/
                                                              |-- 257/
                                                                   |-- 260/
                                                                   |-- 264/
                                                                        |-- 263/
                                                                             |-- 261/
                                                                                  |-- 266/
      
      When processing inode 257 we delay its move (rename) operation because its
      new parent in the send snapshot, inode 272, was not yet processed. Then
      when processing inode 272, we delay the move operation for that inode
      because inode 274 is its ancestor in the send snapshot. Finally we delay
      the move operation for inode 274 when processing it because inode 275 is
      its new parent in the send snapshot and was not yet moved.
      
      When finishing processing inode 275, we start to do the move operations
      that were previously delayed (at apply_children_dir_moves()), resulting in
      the following iterations:
      
      1) We issue the move operation for inode 274;
      
      2) Because inode 262 depended on the move operation of inode 274 (it was
         delayed because 274 is its ancestor in the send snapshot), we issue the
         move operation for inode 262;
      
      3) We issue the move operation for inode 272, because it was delayed by
         inode 274 too (ancestor of 272 in the send snapshot);
      
      4) We issue the move operation for inode 269 (it was delayed by 262);
      
      5) We issue the move operation for inode 257 (it was delayed by 272);
      
      6) We issue the move operation for inode 260 (it was delayed by 272);
      
      7) We issue the move operation for inode 258 (it was delayed by 269);
      
      8) We issue the move operation for inode 264 (it was delayed by 257);
      
      9) We issue the move operation for inode 271 (it was delayed by 258);
      
      10) We issue the move operation for inode 263 (it was delayed by 264);
      
      11) We issue the move operation for inode 268 (it was delayed by 271);
      
      12) We verify if we can issue the move operation for inode 270 (it was
          delayed by 271). We detect a path loop in the current state, because
          inode 267 needs to be moved first before we can issue the move
          operation for inode 270. So we delay again the move operation for
          inode 270, this time we will attempt to do it after inode 267 is
          moved;
      
      13) We issue the move operation for inode 261 (it was delayed by 263);
      
      14) We verify if we can issue the move operation for inode 266 (it was
          delayed by 263). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12);
      
      15) We issue the move operation for inode 267 (it was delayed by 268);
      
      16) We verify if we can issue the move operation for inode 266 (it was
          delayed by 270). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12). So here we added
          again the same delayed move operation that we added in step 14;
      
      17) We attempt again to see if we can issue the move operation for inode
          266, and as in step 16, we realize we can not due to a path loop in
          the current state due to a dependency on inode 270. Again we delay
          inode 266's rename to happen after inode 270's move operation, adding
          the same dependency to the empty stack that we did in steps 14 and 16.
          The next iteration will pick the same move dependency on the stack
          (the only entry), realize again there is still a path loop and then
          add again the same dependency to the stack, over and over, resulting
          in an infinite loop.
      
      So fix this by preventing adding the same move dependency entries to the
      stack by removing each pending move record from the red black tree of
      pending moves. This way the next call to get_pending_dir_moves() will
      not return anything for the current parent inode.
      
      A test case for fstests, with this reproducer, follows soon.
      
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [Wrote changelog with example and more clear explanation]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. Nov 15, 2018
  5. Nov 13, 2018
    • Btrfs: fix rare chances for data loss when doing a fast fsync · aab15e8e
      Filipe Manana authored
      After the simplification of the fast fsync patch done recently by commit
      b5e6c3e1 ("btrfs: always wait on ordered extents at fsync time") and
      commit e7175a69 ("btrfs: remove the wait ordered logic in the
      log_one_extent path"), we got a very short time window where we can get
      extents logged without writeback completing first or extents logged
      without logging the respective data checksums. Both issues can only happen
      when doing a non-full (fast) fsync.
      
      As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
      inode and then wait for the writeback to complete before starting to log
      the inode. However before we acquire the inode's lock and after we started
      writeback, it's possible that more writes happened and dirtied more pages.
      If that happened and those pages get writeback triggered while we are
      logging the inode (for example, the VM subsystem triggering it due to
      memory pressure, or another concurrent fsync), we end up seeing the
      respective extent maps in the inode's list of modified extents and will
      log matching file extent items without waiting for the respective
      ordered extents to complete, meaning that either of the following will
      happen:
      
      1) We log an extent after its writeback finishes but before its checksums
         are added to the csum tree, leading to -EIO errors when attempting to
         read the extent after a log replay.
      
      2) We log an extent before its writeback finishes.
         Therefore after the log replay we will have a file extent item pointing
         to an unwritten extent (and without the respective data checksums as
         well).
      
      This could not happen before the fast fsync patch simplification, because
      for any extent we found in the list of modified extents, we would wait for
      its respective ordered extent to finish writeback or collect its checksums
      for logging if it did not complete yet.
      
      Fix this by triggering writeback again after acquiring the inode's lock
      and before waiting for ordered extents to complete.
      
      Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
      Fixes: b5e6c3e1 ("btrfs: always wait on ordered extents at fsync time")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Always try all copies when reading extent buffers · f8397d69
      Nikolay Borisov authored
      When a metadata read is served the endio routine btree_readpage_end_io_hook
      is called which eventually runs the tree-checker. If tree-checker fails
      to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
      leads to btree_read_extent_buffer_pages wrongly assuming that all
      available copies of this extent buffer are wrong and failing prematurely.
      Fix this by modifying btree_read_extent_buffer_pages to read all copies
      of the data.
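The difference between giving up after the first corrupt copy and trying every mirror can be sketched as follows. This is a simplified Python model with hypothetical names (`read_stop_on_corrupt`, `read_all_copies`, `is_valid`); it does not reproduce the control flow of btree_read_extent_buffer_pages.

```python
def read_stop_on_corrupt(copies, is_valid):
    """Buggy model: one failed validation is treated as 'all copies bad'."""
    for mirror_num, data in enumerate(copies, start=1):
        if is_valid(data):
            return mirror_num
        break  # give up prematurely
    return None


def read_all_copies(copies, is_valid):
    """Fixed model: keep going until a copy validates or none are left."""
    for mirror_num, data in enumerate(copies, start=1):
        if is_valid(data):
            return mirror_num
    return None


# Mirror 1 is the stale disk, mirror 2 holds good data.
copies = ["stale", "good"]
is_valid = lambda data: data == "good"

print(read_stop_on_corrupt(copies, is_valid), read_all_copies(copies, is_valid))
```

With a stale first mirror, only the second variant reaches the good copy.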
      
      This failure was exhibited in xfstests btrfs/124 which would
      spuriously fail its balance operations. The reason was that when balance
      was run following re-introduction of the missing raid1 disk
      __btrfs_map_block would map the read request to stripe 0, which
      corresponded to devid 2 (the disk which is being removed in the test):
      
          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
      	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
      	io_align 65536 io_width 65536 sector_size 4096
      	num_stripes 2 sub_stripes 1
      		stripe 0 devid 2 offset 2156920832
      		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
      		stripe 1 devid 1 offset 3553624064
      		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
      
      This caused read requests for a checksum item to be routed to the
      stale disk which triggered the aforementioned logic involving
      EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
      the balance operation.
      
      Fixes: a826d6dc ("Btrfs: check items for correctness as we search")
      CC: stable@vger.kernel.org # 4.4+
      Suggested-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. Nov 08, 2018
    • Btrfs: fix missing delayed iputs on unmount · d6fd0ae2
      Omar Sandoval authored
      There's a race between close_ctree() and cleaner_kthread().
      close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
      sees it set, but this is racy; the cleaner might have already checked
      the bit and could be cleaning stuff. In particular, if it deletes unused
      block groups, it will create delayed iputs for the free space cache
      inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
      longer running delayed iputs after a commit. Therefore, if the cleaner
      creates more delayed iputs after delayed iputs are run in
      btrfs_commit_super(), we will leak inodes on unmount and get a busy
      inode crash from the VFS.
      
      Fix it by parking the cleaner before we actually close anything. Then,
      any remaining delayed iputs will always be handled in
      btrfs_commit_super(). This also ensures that the commit in close_ctree()
      is really the last commit, so we can get rid of the commit in
      cleaner_kthread().
      
      The fstest/generic/475 followed by 476 can trigger a crash that
      manifests as a slab corruption caused by accessing the freed kthread
      structure by a wake up function. Sample trace:
      
      [ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
      [ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
      [ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
      [ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
      [ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
      [ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
      [ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
      [ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
      [ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
      [ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
      [ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
      [ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
      [ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
      [ 5657.103159] Call Trace:
      [ 5657.103776]  shrink_inactive_list+0x194/0x410
      [ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
      [ 5657.105750]  shrink_node+0x62/0x1c0
      [ 5657.106529]  try_to_free_pages+0x1a4/0x500
      [ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
      [ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
      [ 5657.109348]  kmalloc_large_node+0x37/0x90
      [ 5657.110205]  __kmalloc_node+0x236/0x310
      [ 5657.111014]  kvmalloc_node+0x3e/0x70
      
      Fixes: 30928e9b ("btrfs: don't run delayed_iputs in commit")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add trace ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. Nov 06, 2018
    • Btrfs: fix data corruption due to cloning of eof block · ac765f83
      Filipe Manana authored
      We currently allow cloning a range from a file which includes the last
      block of the file even if the file's size is not aligned to the block
      size. This is fine and useful when the destination file has the same size,
      but when it does not and the range ends somewhere in the middle of the
      destination file, it leads to corruption because the bytes between the EOF
      and the end of the block have undefined data (when there is support for
      discard/trimming they have a value of 0x00).
      
      Example:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ export foo_size=$((256 * 1024 + 100))
       $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
      
       $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
      
       $ od -A d -t x1 /mnt/bar
       0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
       *
       0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
       0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       *
       0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       1048576
      
      The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
      (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
      of 0xb5.
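The boundaries of the corrupted range follow directly from the sizes used in the example. A minimal Python sketch of that arithmetic, assuming the 4K block size mentioned above:

```python
BLOCK_SIZE = 4096  # assumed block size, as in the example above

foo_size = 256 * 1024 + 100   # source file size, not block aligned
dst_offset = 512 * 1024       # destination offset passed to reflink

# First byte in /mnt/bar past the cloned data.
clone_end = dst_offset + foo_size

# End of the block that contains the cloned EOF, rounded up.
block_end = (clone_end + BLOCK_SIZE - 1) // BLOCK_SIZE * BLOCK_SIZE

corrupted = (clone_end, block_end - 1)
print(corrupted)
```

The range printed matches the offsets where the od output switches from 0x3c to 0x00 and then back to 0xb5.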
      
      This is similar to the problem we had for deduplication that got recently
      fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files").
      
      Fix this by not allowing such operations to be performed and return the
      errno -EINVAL to user space. This is what XFS is doing as well at the VFS
      level. This change however now makes us return -EINVAL instead of
      -EOPNOTSUPP for cases where the source range maps to an inline extent and
      the destination range's end is smaller than the destination file's size,
      since the detection of inline extents is done during the actual process of
      dropping file extent items (at __btrfs_drop_extents()). Returning the
      -EINVAL error is done early on and solely based on the input parameters
      (offsets and length) and destination file's size. This makes us consistent
      with XFS and anyone else supporting cloning since this case is now checked
      at a higher level in the VFS and is where the -EINVAL will be returned
      from starting with kernel 4.20 (the VFS change was introduced in 4.20-rc1
      by commit 07d19dc9 ("vfs: avoid problematic remapping requests into
      partial EOF block")). So this change is more geared towards stable kernels,
      as it's unlikely the new VFS checks get removed intentionally.
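A check based only on the offsets, the length and the file sizes could look like the sketch below. It is written in Python for illustration; `check_clone_range` and its parameters are hypothetical names, and the exact condition used by the kernel may differ.

```python
EINVAL = 22


def check_clone_range(src_off, length, src_size, dst_off, dst_size,
                      block_size=4096):
    """Reject a clone that would place a partial EOF block mid-file."""
    src_end = src_off + length
    dst_end = dst_off + length
    ends_at_unaligned_eof = (src_end == src_size
                             and src_size % block_size != 0)
    if ends_at_unaligned_eof and dst_end < dst_size:
        return -EINVAL
    return 0


foo_size = 256 * 1024 + 100

# The example from the changelog: clone foo into the middle of a 1M file.
print(check_clone_range(0, foo_size, foo_size, 512 * 1024, 1024 * 1024))

# Same clone, but the destination ends exactly where the clone ends.
print(check_clone_range(0, foo_size, foo_size, 512 * 1024,
                        512 * 1024 + foo_size))
```

Because nothing here depends on the extent layout, the decision can be made before any file extent items are touched.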
      
      A test case for fstests follows soon, as well as an update to filter
      existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
      
      CC: <stable@vger.kernel.org> # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix infinite loop on inode eviction after deduplication of eof block · 11023d3f
      Filipe Manana authored
      If we attempt to deduplicate the last block of a file A into the middle of
      a file B, and file A's size is not a multiple of the block size, we end up
      rounding the deduplication length to 0 bytes, to avoid the data corruption
      issue fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files"). However a length of zero will
      cause the insertion of an extent state with a start value greater (by 1)
      than the end value, leading to a corrupt extent state that will trigger a
      warning and cause chaos such as an infinite loop during inode eviction.
      Example trace:
      
       [96049.833585] ------------[ cut here ]------------
       [96049.833714] WARNING: CPU: 0 PID: 24448 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.833767] CPU: 0 PID: 24448 Comm: xfs_io Not tainted 4.19.0-rc7-btrfs-next-39 #1
       [96049.833768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.833780] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.833783] RSP: 0018:ffffafd2c3707af0 EFLAGS: 00010282
       [96049.833785] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.833786] RDX: 0000000000000007 RSI: ffff99045c143230 RDI: ffff99047b2168a0
       [96049.833787] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.833787] R10: ffffafd2c3707ab8 R11: 0000000000000000 R12: ffff9903b93b12c8
       [96049.833788] R13: 000000000004e000 R14: ffffafd2c3707b80 R15: ffffafd2c3707b78
       [96049.833790] FS:  00007f5c14e7d700(0000) GS:ffff99047b200000(0000) knlGS:0000000000000000
       [96049.833791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.833792] CR2: 00007f5c146abff8 CR3: 0000000115f4c004 CR4: 00000000003606f0
       [96049.833795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.833796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.833796] Call Trace:
       [96049.833809]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.833823]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.833831]  ? _raw_spin_unlock+0x24/0x30
       [96049.833841]  ? test_range_bit+0xdf/0x130 [btrfs]
       [96049.833853]  lock_extent_range+0x8e/0x150 [btrfs]
       [96049.833864]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
       [96049.833875]  btrfs_extent_same_range+0x14e/0x550 [btrfs]
       [96049.833885]  ? rcu_read_lock_sched_held+0x3f/0x70
       [96049.833890]  ? __kmalloc_node+0x2b0/0x2f0
       [96049.833899]  ? btrfs_dedupe_file_range+0x19a/0x280 [btrfs]
       [96049.833909]  btrfs_dedupe_file_range+0x270/0x280 [btrfs]
       [96049.833916]  vfs_dedupe_file_range_one+0xd9/0xe0
       [96049.833919]  vfs_dedupe_file_range+0x131/0x1b0
       [96049.833924]  do_vfs_ioctl+0x272/0x6e0
       [96049.833927]  ? __fget+0x113/0x200
       [96049.833931]  ksys_ioctl+0x70/0x80
       [96049.833933]  __x64_sys_ioctl+0x16/0x20
       [96049.833937]  do_syscall_64+0x60/0x1b0
       [96049.833939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.833941] RIP: 0033:0x7f5c1478ddd7
       [96049.833943] RSP: 002b:00007ffe15b196a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [96049.833945] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5c1478ddd7
       [96049.833946] RDX: 00005625ece322d0 RSI: 00000000c0189436 RDI: 0000000000000004
       [96049.833947] RBP: 0000000000000000 R08: 00007f5c14a46f48 R09: 0000000000000040
       [96049.833948] R10: 0000000000000541 R11: 0000000000000202 R12: 0000000000000000
       [96049.833949] R13: 0000000000000000 R14: 0000000000000004 R15: 00005625ece322d0
       [96049.833954] irq event stamp: 6196
       [96049.833956] hardirqs last  enabled at (6195): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.833958] hardirqs last disabled at (6196): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.833959] softirqs last  enabled at (6114): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.833964] softirqs last disabled at (6095): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.833965] ---[ end trace db7b05f01b7fa10c ]---
       [96049.935816] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.935822] irq event stamp: 6584
       [96049.935823] hardirqs last  enabled at (6583): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.935825] hardirqs last disabled at (6584): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.935827] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.935828] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.935829] ---[ end trace db7b05f01b7fa123 ]---
       [96049.935840] ------------[ cut here ]------------
       [96049.936065] WARNING: CPU: 1 PID: 24463 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.936107] CPU: 1 PID: 24463 Comm: umount Tainted: G        W         4.19.0-rc7-btrfs-next-39 #1
       [96049.936108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.936117] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.936119] RSP: 0018:ffffafd2c3637bc0 EFLAGS: 00010282
       [96049.936120] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.936121] RDX: 0000000000000007 RSI: ffff990445cf88e0 RDI: ffff99047b2968a0
       [96049.936122] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.936123] R10: ffffafd2c3637b88 R11: 0000000000000000 R12: ffff9904574301e8
       [96049.936124] R13: 000000000004e000 R14: ffffafd2c3637c50 R15: ffffafd2c3637c48
       [96049.936125] FS:  00007fe4b87e72c0(0000) GS:ffff99047b280000(0000) knlGS:0000000000000000
       [96049.936126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.936128] CR2: 00005562e52618d8 CR3: 00000001151c8005 CR4: 00000000003606e0
       [96049.936129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.936131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.936131] Call Trace:
       [96049.936141]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.936154]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.936167]  btrfs_evict_inode+0x1e1/0x5a0 [btrfs]
       [96049.936172]  evict+0xbf/0x1c0
       [96049.936174]  dispose_list+0x51/0x80
       [96049.936176]  evict_inodes+0x193/0x1c0
       [96049.936180]  generic_shutdown_super+0x3f/0x110
       [96049.936182]  kill_anon_super+0xe/0x30
       [96049.936189]  btrfs_kill_super+0x13/0x100 [btrfs]
       [96049.936191]  deactivate_locked_super+0x3a/0x70
       [96049.936193]  cleanup_mnt+0x3b/0x80
       [96049.936195]  task_work_run+0x93/0xc0
       [96049.936198]  exit_to_usermode_loop+0xfa/0x100
       [96049.936201]  do_syscall_64+0x17f/0x1b0
       [96049.936202]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.936204] RIP: 0033:0x7fe4b80cfb37
       [96049.936206] RSP: 002b:00007ffff092b688 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       [96049.936207] RAX: 0000000000000000 RBX: 00005562e5259060 RCX: 00007fe4b80cfb37
       [96049.936208] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005562e525faa0
       [96049.936209] RBP: 00005562e525faa0 R08: 00005562e525f770 R09: 0000000000000015
       [96049.936210] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fe4b85d1e64
       [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.936216] irq event stamp: 6616
       [96049.936219] hardirqs last  enabled at (6615): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.936219] hardirqs last disabled at (6616): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.936222] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.936222] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.936223] ---[ end trace db7b05f01b7fa124 ]---
      
      The second stack trace, from inode eviction, is repeated forever due to
      the infinite loop during eviction.
      
      This is the same type of problem fixed way back in 2015 by commit
      113e8283 ("Btrfs: fix inode eviction infinite loop after extent_same
      ioctl") and commit ccccf3d6 ("Btrfs: fix inode eviction infinite loop
      after cloning into it").
      
      So fix this by returning immediately if the deduplication range length
      gets rounded down to 0 bytes, as there is nothing that needs to be done in
      such case.
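
      The early return can be sketched as follows. This is a minimal Python
      model, not the kernel code: the 4096-byte block size, the function name
      dedupe_range and its string return values are assumptions made purely
      for illustration.

```python
BLOCK_SIZE = 4096  # assumed filesystem block (sector) size


def round_down(value: int, align: int) -> int:
    """Round value down to the nearest multiple of align."""
    return value - (value % align)


def dedupe_range(length: int, block_size: int = BLOCK_SIZE) -> str:
    """Hypothetical model of the length handling in a dedupe request."""
    aligned = round_down(length, block_size)
    if aligned == 0:
        # Nothing left to deduplicate: return before locking any range.
        return "nothing to do"
    return f"dedupe {aligned} bytes"


print(dedupe_range(100))    # the 100-byte request from the reproducer
print(dedupe_range(10000))  # rounds down to 8192 bytes
```

      With the 100-byte request used in the reproducer, the model takes the
      early return instead of going on to lock an empty range.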
      
      Example reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ xfs_io -f -c "pwrite -S 0xe6 0 100" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xe6 0 1M" /mnt/bar
      
       # Unmount the filesystem and mount it again so that we start without any
       # extent state records when we ask for the deduplication.
       $ umount /mnt
       $ mount /dev/sdb /mnt
      
       $ xfs_io -c "dedupe /mnt/foo 0 500K 100" /mnt/bar
      
       # This unmount triggers the infinite loop.
       $ umount /mnt
      
      A test case for fstests will follow soon.
      
      Fixes: de02b9f6 ("Btrfs: fix data corruption when deduplicating between different files")
      CC: <stable@vger.kernel.org> # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      11023d3f
    • Filipe Manana's avatar
      Btrfs: fix deadlock on tree root leaf when finding free extent · 4222ea71
      Filipe Manana authored
      
      
      When we are writing out a free space cache, during the transaction commit
      phase, we can end up in a deadlock which results in a stack trace like the
      following:
      
       schedule+0x28/0x80
       btrfs_tree_read_lock+0x8e/0x120 [btrfs]
       ? finish_wait+0x80/0x80
       btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
       btrfs_search_slot+0xf6/0x9f0 [btrfs]
       ? evict_refill_and_join+0xd0/0xd0 [btrfs]
       ? inode_insert5+0x119/0x190
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_iget+0x113/0x690 [btrfs]
       __lookup_free_space_inode+0xd8/0x150 [btrfs]
       lookup_free_space_inode+0x5b/0xb0 [btrfs]
       load_free_space_cache+0x7c/0x170 [btrfs]
       ? cache_block_group+0x72/0x3b0 [btrfs]
       cache_block_group+0x1b3/0x3b0 [btrfs]
       ? finish_wait+0x80/0x80
       find_free_extent+0x799/0x1010 [btrfs]
       btrfs_reserve_extent+0x9b/0x180 [btrfs]
       btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
       __btrfs_cow_block+0x11d/0x500 [btrfs]
       btrfs_cow_block+0xdc/0x180 [btrfs]
       btrfs_search_slot+0x3bd/0x9f0 [btrfs]
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_update_inode_item+0x46/0x100 [btrfs]
       cache_save_setup+0xe4/0x3a0 [btrfs]
       btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
       btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
      
      At cache_save_setup() we need to update the inode item of a block group's
      cache which is located in the tree root (fs_info->tree_root), which means
      that it may result in COWing a leaf from that tree. If that happens we
      need to find a free metadata extent and while looking for one, if we find
      a block group which was not cached yet we attempt to load its cache by
      calling cache_block_group(). However this function will try to load the
      inode of the free space cache, which requires finding the matching inode
      item in the tree root - if that inode item is located in the same leaf as
      the inode item of the space cache we are updating at cache_save_setup(),
      we end up in a deadlock, since we try to obtain a read lock on the same
      extent buffer that we previously write locked.
      
      So fix this by using the tree root's commit root when searching for a
      block group's free space cache inode item when we are attempting to load
      a free space cache. This is safe since block groups once loaded stay in
      memory forever, as well as their caches, so after they are first loaded
      we will never need to read their inode items again. For new block groups,
      once they are created they get their ->cached field set to
      BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
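
      The self-deadlock and the way around it can be illustrated with a toy
      model. Everything here is an assumption of the sketch: a non-reentrant
      Python lock stands in for the lock on one tree root leaf, and a plain
      dictionary stands in for the commit root, which is read without taking
      that lock.

```python
import threading

# A plain Lock is not reentrant; it stands in for the lock on one tree root
# leaf (an assumption of this sketch, the kernel uses its own tree locks).
leaf_lock = threading.Lock()

# Hypothetical read-only snapshot standing in for the commit root.
commit_root = {"free_space_inode_item": "committed copy"}


def nested_lookup_can_lock() -> bool:
    """Return True if a nested lookup could lock the same leaf again."""
    with leaf_lock:  # outer path: leaf locked for the inode item update
        # Inner path: loading a free space cache needs the same leaf.
        # A blocking acquire here would never return, so only probe it.
        acquired = leaf_lock.acquire(blocking=False)
        if acquired:
            leaf_lock.release()
        return acquired


def lookup_in_commit_root() -> str:
    """Read the inode item from the snapshot while the leaf stays locked."""
    with leaf_lock:
        return commit_root["free_space_inode_item"]


print(nested_lookup_can_lock())  # False
print(lookup_in_commit_root())   # committed copy
```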
      
      Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
      Fixes: 9d66e233 ("Btrfs: load free space cache if it exists")
      Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4222ea71
    • Arnd Bergmann's avatar
      btrfs: avoid link error with CONFIG_NO_AUTO_INLINE · 7e17916b
      Arnd Bergmann authored
      Note: this patch fixes a problem in a feature outside of btrfs ("kernel
      hacking: add a config option to disable compiler auto-inlining") and is
      applied ahead of time due to cross-subsystem dependencies.
      
      On 32-bit ARM with gcc-8, I see a link error with the addition of the
      CONFIG_NO_AUTO_INLINE option:
      
      fs/btrfs/super.o: In function `btrfs_statfs':
      super.c:(.text+0x67b8): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x67fc): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x6858): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x6920): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x693c): undefined reference to `__aeabi_uldivmod'
      fs/btrfs/super.o:super.c:(.text+0x6958): more undefined references to `__aeabi_uldivmod' follow
      
      So far this is the only file that shows the behavior, so I'd propose
      to just work around it by marking the functions that normally get
      inlined here as 'static inline'.
      
      The reference to __aeabi_uldivmod comes from a div_u64() which has an
      optimization for a constant division that uses a straight '/' operator
      when the result should be known to the compiler. My interpretation is
      that as we turn off inlining, gcc still expects the result to be constant
      but fails to use that constant value.
      
      Link: https://lkml.kernel.org/r/20181103153941.1881966-1-arnd@arndb.de
      
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Changbin Du <changbin.du@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      [ add the note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      7e17916b
    • Shaokun Zhang's avatar
      btrfs: tree-checker: Fix misleading group system information · 761333f2
      Shaokun Zhang authored
      block_group_err shows the group system as a decimal value with a '0x'
      prefix, which is somewhat misleading.
      
      Fix it to print hexadecimal, as was intended.
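
      Why the old output misleads can be seen in a small example. The value
      34 is chosen only for illustration and is not taken from the patch.

```python
flags = 34  # illustrative value only; equals 0x22

misleading = f"0x{flags:d}"  # decimal digits behind a hex prefix
intended = f"0x{flags:x}"    # hexadecimal digits, as the prefix implies

print(misleading)  # 0x34
print(intended)    # 0x22
```

      A reader who parses "0x34" as hexadecimal gets 52, not the 34 that was
      actually stored.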
      
      Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      761333f2
    • Filipe Manana's avatar
      Btrfs: fix missing data checksums after a ranged fsync (msync) · 008c6753
      Filipe Manana authored
      Recently we got a massive simplification for fsync, where for the fast
      path we no longer log new extents while their respective ordered extents
      are still running.
      
      However that simplification introduced a subtle regression for the case
      where we use a ranged fsync (msync). Consider the following example:
      
                     CPU 0                                    CPU 1
      
                                                  mmap write to range [2Mb, 4Mb[
        mmap write to range [512Kb, 1Mb[
        msync range [512Kb, 1Mb[
          --> triggers fast fsync
              (BTRFS_INODE_NEEDS_FULL_SYNC
               not set)
          --> creates extent map A for this
              range and adds it to list of
              modified extents
          --> starts ordered extent A for
              this range
          --> waits for it to complete
      
                                                  writeback triggered for range
                                                  [2Mb, 4Mb[
                                                    --> create extent map B and
                                                        adds it to the list of
                                                        modified extents
                                                    --> creates ordered extent B
      
          --> start looking for and logging
              modified extents
          --> logs extent maps A and B
          --> finds checksums for extent A
              in the csum tree, but not for
              extent B
        fsync (msync) finishes
      
                                                    --> ordered extent B
                                                        finishes and its
                                                        checksums are added
                                                        to the csum tree
      
                                      <power cut>
      
      After replaying the log, we have the extent covering the range [2Mb, 4Mb[
      but do not have the data checksum items covering that file range.
      
      This happens because at the very beginning of an fsync (btrfs_sync_file())
      we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
      wait for any ordered extents in that range to complete before we start
      logging the extents. However if right before we start logging the extent
      in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
      such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
      or msync (btrfs_sync_file() starts writeback before acquiring the inode's
      lock), an ordered extent is created for that other range and a new extent
      map is created to represent that range and added to the inode's list of
      modified extents.
      
      That means that we will see that other extent in that list when collecting
      extents for logging (done at btrfs_log_changed_extents()) and log the
      extent before the respective ordered extent finishes - namely before the
      checksum items are added to the checksums tree, which is where
      log_extent_csums() looks for the checksums, therefore making us log an
      extent without logging its checksums. Before that massive simplification
      of fsync, this wasn't a problem because besides looking for checksums in
      the checksums tree, we also looked for them in any ordered extent still
      running.
      
      The consequence of data checksums missing for a file range is that users
      attempting to read the affected file range will get -EIO errors and dmesg
      reports the following:
      
       [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
       [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
      
      So fix this by skipping extents outside of our logging range at
      btrfs_log_changed_extents() and leaving them on the list of modified
      extents so that any subsequent ranged fsync may collect them if needed.
      Also, if we find a hole extent outside of the range still log it, just
      to prevent having gaps between extent items after replaying the log,
      otherwise fsck will complain when we are not using the NO_HOLES feature
      (fstest btrfs/056 triggers such case).
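
      The selection rule can be sketched as follows. This is a simplified
      Python model under stated assumptions, not the code in
      btrfs_log_changed_extents(): the ExtentMap class, the collect_extents
      helper and the inclusive end offset are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExtentMap:
    start: int
    length: int
    is_hole: bool = False


def collect_extents(modified, start, end):
    """Split modified extents into (to_log, left_on_list) for [start, end]."""
    to_log, left = [], []
    for em in modified:
        outside = em.start > end or em.start + em.length <= start
        if outside and not em.is_hole:
            left.append(em)    # a later ranged fsync may still pick it up
        else:
            to_log.append(em)  # in range, or a hole logged to avoid gaps
    return to_log, left


KB, MB = 1024, 1024 * 1024
a = ExtentMap(512 * KB, 512 * KB)  # extent map A: [512Kb, 1Mb[
b = ExtentMap(2 * MB, 2 * MB)      # extent map B: [2Mb, 4Mb[
to_log, left = collect_extents([a, b], 512 * KB, 1 * MB - 1)
print(to_log == [a], left == [b])  # True True
```

      In the scenario from the diagram, only extent map A is logged by the
      msync, while extent map B stays on the list until a later fsync covers
      its range.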
      
      Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      008c6753
    • Lu Fengqi's avatar
      btrfs: fix pinned underflow after transaction aborted · fcd5e742
      Lu Fengqi authored
      When running generic/475, we may get the following warning in dmesg:
      
      [ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
      [ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
      [ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
      [ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
      [ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
      [ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
      [ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
      [ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
      [ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
      [ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 6902.137836] Call Trace:
      [ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
      [ 6902.140181]  ? kthread_stop+0x146/0x1f0
      [ 6902.141277]  generic_shutdown_super+0x6c/0x100
      [ 6902.142517]  kill_anon_super+0x14/0x30
      [ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
      [ 6902.144790]  deactivate_locked_super+0x2f/0x70
      [ 6902.146014]  cleanup_mnt+0x3b/0x70
      [ 6902.147020]  task_work_run+0x9e/0xd0
      [ 6902.148036]  do_syscall_64+0x470/0x600
      [ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 6902.151640] RIP: 0033:0x7f45077a6a7b
      [ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
      [ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
      [ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
      [ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
      [ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
      [ 6902.167553] irq event stamp: 0
      [ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 6902.176407] ---[ end trace 463138c2986b275c ]---
      [ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
      [ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536
      
      In the above line there's "pinned=18446744073708158976", which is
      -1392640 stored in an unsigned 64-bit (u64) counter, an obvious underflow.
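
      The arithmetic can be checked with a few lines of Python. The helper
      name as_signed_64 is an assumption of this sketch; the pinned value is
      copied from the log above.

```python
U64_MODULUS = 1 << 64

pinned = 18446744073708158976  # value printed in the space_info line


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return value - U64_MODULUS if value >= (1 << 63) else value


print(as_signed_64(pinned))         # -1392640
print((0 - 1392640) % U64_MODULUS)  # 18446744073708158976
```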
      
      When transaction_kthread is running cleanup_transaction(), another
      fsstress is running btrfs_commit_transaction(). The
      btrfs_finish_extent_commit() may get the same range as
      btrfs_destroy_pinned_extent() got, which causes the pinned underflow.
      
      Fixes: d4b450cd ("Btrfs: fix race between transaction commit and empty block group removal")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fcd5e742
    • Robbie Ko's avatar
      Btrfs: fix cur_offset in the error case for nocow · 506481b2
      Robbie Ko authored
      
      
      When cow_file_range() fails, the related resources are unlocked
      according to the range [start..end), so the unlock must not be repeated
      in run_delalloc_nocow().

      In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
      not updated correctly, so move the cur_offset update before the call to
      cow_file_range().
      
        kernel BUG at mm/page-writeback.c:2663!
        Internal error: Oops - BUG: 0 [#1] SMP
        CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
        Hardware name: Realtek_RTD1296 (DT)
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
        PC is at clear_page_dirty_for_io+0x1bc/0x1e8
        LR is at clear_page_dirty_for_io+0x14/0x1e8
        pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
        sp : ffffffc02e9af4f0
        Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
        Call trace:
        [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
        [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
        [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
        [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
        [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
        [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
        [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
        [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
        [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
        [<ffffffc00033d758>] do_writepages+0x30/0x88
        [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
        [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
        [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
        [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
        [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
        [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
        [<ffffffc0002b02f0>] worker_thread+0x130/0x500
        [<ffffffc0002b6344>] kthread+0x10c/0x110
        [<ffffffc000284590>] ret_from_fork+0x10/0x40
        Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      506481b2
  8. Oct 23, 2018
    • Filipe Manana's avatar
      Btrfs: fix use-after-free when dumping free space · 9084cb6a
      Filipe Manana authored
      
      
      We were iterating a block group's free space cache rbtree without first
      taking the lock that protects it (the free_space_ctl->free_space_offset
      rbtree is protected by the free_space_ctl->tree_lock spinlock).
      
      KASAN reported a use-after-free problem when iterating such an rbtree due
      to a concurrent rbtree delete:
      
      [ 9520.359168] ==================================================================
      [ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90
      [ 9520.359949] Read of size 8 at addr ffff8800b7ada500 by task btrfs-transacti/1721
      [ 9520.360357]
      [ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G             L    4.19.0-rc8-nbor #555
      [ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [ 9520.362682] Call Trace:
      [ 9520.362887]  dump_stack+0xa4/0xf5
      [ 9520.363146]  print_address_description+0x78/0x280
      [ 9520.363412]  kasan_report+0x263/0x390
      [ 9520.363650]  ? rb_next+0x13/0x90
      [ 9520.363873]  __asan_load8+0x54/0x90
      [ 9520.364102]  rb_next+0x13/0x90
      [ 9520.364380]  btrfs_dump_free_space+0x146/0x160 [btrfs]
      [ 9520.364697]  dump_space_info+0x2cd/0x310 [btrfs]
      [ 9520.364997]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
      [ 9520.365310]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
      [ 9520.365646]  ? btrfs_update_time+0x180/0x180 [btrfs]
      [ 9520.365923]  ? _raw_spin_unlock+0x27/0x40
      [ 9520.366204]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
      [ 9520.366549]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
      [ 9520.366880]  cache_save_setup+0x42e/0x580 [btrfs]
      [ 9520.367220]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
      [ 9520.367518]  ? lock_downgrade+0x2f0/0x2f0
      [ 9520.367799]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
      [ 9520.368104]  ? kasan_check_read+0x11/0x20
      [ 9520.368349]  ? do_raw_spin_unlock+0xa8/0x140
      [ 9520.368638]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
      [ 9520.368978]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
      [ 9520.369282]  ? do_raw_spin_unlock+0xa8/0x140
      [ 9520.369534]  ? _raw_spin_unlock+0x27/0x40
      [ 9520.369811]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
      [ 9520.370137]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
      [ 9520.370560]  ? commit_fs_roots+0x350/0x350 [btrfs]
      [ 9520.370926]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
      [ 9520.371285]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
      [ 9520.371612]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
      [ 9520.371943]  ? start_transaction+0x168/0x6c0 [btrfs]
      [ 9520.372257]  transaction_kthread+0x21c/0x240 [btrfs]
      [ 9520.372537]  kthread+0x1d2/0x1f0
      [ 9520.372793]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
      [ 9520.373090]  ? kthread_park+0xb0/0xb0
      [ 9520.373329]  ret_from_fork+0x3a/0x50
      [ 9520.373567]
      [ 9520.373738] Allocated by task 1804:
      [ 9520.373974]  kasan_kmalloc+0xff/0x180
      [ 9520.374208]  kasan_slab_alloc+0x11/0x20
      [ 9520.374447]  kmem_cache_alloc+0xfc/0x2d0
      [ 9520.374731]  __btrfs_add_free_space+0x40/0x580 [btrfs]
      [ 9520.375044]  unpin_extent_range+0x4f7/0x7a0 [btrfs]
      [ 9520.375383]  btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs]
      [ 9520.375707]  btrfs_commit_transaction+0xb06/0x10e0 [btrfs]
      [ 9520.376027]  btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs]
      [ 9520.376365]  btrfs_check_data_free_space+0x81/0xd0 [btrfs]
      [ 9520.376689]  btrfs_delalloc_reserve_space+0x25/0x80 [btrfs]
      [ 9520.377018]  btrfs_direct_IO+0x42e/0x6d0 [btrfs]
      [ 9520.377284]  generic_file_direct_write+0x11e/0x220
      [ 9520.377587]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
      [ 9520.377875]  aio_write+0x25c/0x360
      [ 9520.378106]  io_submit_one+0xaa0/0xdc0
      [ 9520.378343]  __se_sys_io_submit+0xfa/0x2f0
      [ 9520.378589]  __x64_sys_io_submit+0x43/0x50
      [ 9520.378840]  do_syscall_64+0x7d/0x240
      [ 9520.379081]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 9520.379387]
      [ 9520.379557] Freed by task 1802:
      [ 9520.379782]  __kasan_slab_free+0x173/0x260
      [ 9520.380028]  kasan_slab_free+0xe/0x10
      [ 9520.380262]  kmem_cache_free+0xc1/0x2c0
      [ 9520.380544]  btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs]
      [ 9520.380866]  find_free_extent+0xa99/0x17e0 [btrfs]
      [ 9520.381166]  btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
      [ 9520.381474]  btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs]
      [ 9520.381761]  __blockdev_direct_IO+0x10ee/0x58a1
      [ 9520.382059]  btrfs_direct_IO+0x25a/0x6d0 [btrfs]
      [ 9520.382321]  generic_file_direct_write+0x11e/0x220
      [ 9520.382623]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
      [ 9520.382904]  aio_write+0x25c/0x360
      [ 9520.383172]  io_submit_one+0xaa0/0xdc0
      [ 9520.383416]  __se_sys_io_submit+0xfa/0x2f0
      [ 9520.383678]  __x64_sys_io_submit+0x43/0x50
      [ 9520.383927]  do_syscall_64+0x7d/0x240
      [ 9520.384165]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 9520.384439]
      [ 9520.384610] The buggy address belongs to the object at ffff8800b7ada500
                      which belongs to the cache btrfs_free_space of size 72
      [ 9520.385175] The buggy address is located 0 bytes inside of
                      72-byte region [ffff8800b7ada500, ffff8800b7ada548)
      [ 9520.385691] The buggy address belongs to the page:
      [ 9520.385957] page:ffffea0002deb680 count:1 mapcount:0 mapping:ffff880108a1d700 index:0x0 compound_mapcount: 0
      [ 9520.388030] flags: 0x8100(slab|head)
      [ 9520.388281] raw: 0000000000008100 ffffea0002deb608 ffffea0002728808 ffff880108a1d700
      [ 9520.388722] raw: 0000000000000000 0000000000130013 00000001ffffffff 0000000000000000
      [ 9520.389169] page dumped because: kasan: bad access detected
      [ 9520.389473]
      [ 9520.389658] Memory state around the buggy address:
      [ 9520.389943]  ffff8800b7ada400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 9520.390368]  ffff8800b7ada480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 9520.390796] >ffff8800b7ada500: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
      [ 9520.391223]                    ^
      [ 9520.391461]  ffff8800b7ada580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 9520.391885]  ffff8800b7ada600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 9520.392313] ==================================================================
      [ 9520.392772] BTRFS critical (device vdc): entry offset 2258497536, bytes 131072, bitmap no
      [ 9520.393247] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
      [ 9520.393705] PGD 800000010dbab067 P4D 800000010dbab067 PUD 107551067 PMD 0
      [ 9520.394059] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [ 9520.394378] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G    B        L    4.19.0-rc8-nbor #555
      [ 9520.394858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [ 9520.395350] RIP: 0010:rb_next+0x3c/0x90
      [ 9520.396461] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
      [ 9520.396762] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
      [ 9520.397115] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
      [ 9520.397468] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
      [ 9520.397821] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
      [ 9520.398188] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
      [ 9520.398555] FS:  0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
      [ 9520.399007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 9520.399335] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
      [ 9520.399679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 9520.400023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 9520.400400] Call Trace:
      [ 9520.400648]  btrfs_dump_free_space+0x146/0x160 [btrfs]
      [ 9520.400974]  dump_space_info+0x2cd/0x310 [btrfs]
      [ 9520.401287]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
      [ 9520.401609]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
      [ 9520.401952]  ? btrfs_update_time+0x180/0x180 [btrfs]
      [ 9520.402232]  ? _raw_spin_unlock+0x27/0x40
      [ 9520.402522]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
      [ 9520.402882]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
      [ 9520.403261]  cache_save_setup+0x42e/0x580 [btrfs]
      [ 9520.403570]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
      [ 9520.403871]  ? lock_downgrade+0x2f0/0x2f0
      [ 9520.404161]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
      [ 9520.404481]  ? kasan_check_read+0x11/0x20
      [ 9520.404732]  ? do_raw_spin_unlock+0xa8/0x140
      [ 9520.405026]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
      [ 9520.405375]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
      [ 9520.405694]  ? do_raw_spin_unlock+0xa8/0x140
      [ 9520.405958]  ? _raw_spin_unlock+0x27/0x40
      [ 9520.406243]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
      [ 9520.406574]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
      [ 9520.406899]  ? commit_fs_roots+0x350/0x350 [btrfs]
      [ 9520.407253]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
      [ 9520.407589]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
      [ 9520.407925]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
      [ 9520.408262]  ? start_transaction+0x168/0x6c0 [btrfs]
      [ 9520.408582]  transaction_kthread+0x21c/0x240 [btrfs]
      [ 9520.408870]  kthread+0x1d2/0x1f0
      [ 9520.409138]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
      [ 9520.409440]  ? kthread_park+0xb0/0xb0
      [ 9520.409682]  ret_from_fork+0x3a/0x50
      [ 9520.410508] Dumping ftrace buffer:
      [ 9520.410764]    (ftrace buffer empty)
      [ 9520.411007] CR2: 0000000000000011
      [ 9520.411297] ---[ end trace 01a0863445cf360a ]---
      [ 9520.411568] RIP: 0010:rb_next+0x3c/0x90
      [ 9520.412644] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
      [ 9520.412932] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
      [ 9520.413274] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
      [ 9520.413616] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
      [ 9520.414007] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
      [ 9520.414349] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
      [ 9520.416074] FS:  0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
      [ 9520.416536] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 9520.416848] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
      [ 9520.418477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 9520.418846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 9520.419204] Kernel panic - not syncing: Fatal exception
      [ 9520.419666] Dumping ftrace buffer:
      [ 9520.419930]    (ftrace buffer empty)
      [ 9520.420168] Kernel Offset: disabled
      [ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Fix this by acquiring the respective lock before iterating the rbtree.
      
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9084cb6a
  9. Oct 19, 2018
    • Btrfs: fix use-after-free during inode eviction · 421f0922
      Filipe Manana authored
      At inode.c:evict_inode_truncate_pages(), when we iterate over the
      inode's extent states, we access an extent state record's "state" field
      after we unlocked the inode's io tree lock. This can lead to a
      use-after-free issue because after we unlock the io tree that extent
      state record might have been freed due to being merged into another
      adjacent extent state record (a previous inflight bio for a read
      operation finished in the meantime, which unlocked a range in the io
      tree and caused a merge of extent state records, as explained in the
      comment before the while loop added in commit 6ca07097 ("Btrfs: fix
      hang during inode eviction due to concurrent readahead")).
      
      Fix this by keeping a copy of the extent state's flags in a local
      variable and using it after unlocking the io tree.
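
The pattern behind this fix can be illustrated outside the kernel. The sketch below is only an analogy in Python (which has no use-after-free): it copies the fields it needs while the lock is held and never touches the shared record afterwards. `ExtentState`, `read_first_state` and the flag value are hypothetical names chosen for illustration, not the kernel's definitions.

```python
import threading
from dataclasses import dataclass

EXTENT_DELALLOC = 1 << 0  # hypothetical flag value, for illustration only


@dataclass
class ExtentState:
    start: int
    end: int
    state: int


def read_first_state(tree_lock, states):
    """Return copies of the fields the caller needs after unlocking."""
    with tree_lock:
        record = states[0]
        # Snapshot everything that is needed later while the lock is held.
        start, end, flags = record.start, record.end, record.state
    # After the lock is dropped the record itself must not be dereferenced:
    # in the kernel it may have been merged into a neighbour and freed.
    return start, end, flags


lock = threading.Lock()
states = [ExtentState(0, 4095, EXTENT_DELALLOC)]
start, end, flags = read_first_state(lock, states)
if flags & EXTENT_DELALLOC:
    print(f"range [{start}, {end}] had delalloc set when it was sampled")
```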
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189
      Fixes: b9d0b389 ("btrfs: Add handler for invalidate page")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      421f0922
    • btrfs: move the dio_sem higher up the callchain · c495144b
      Josef Bacik authored
      
      
      We're getting a lockdep splat because we take the dio_sem under the
      log_mutex.  What we really need is to protect fsync() from logging an
      extent map for an extent we never waited on higher up, so just guard the
      whole thing with dio_sem.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted
      ------------------------------------------------------
      aio-dio-invalid/30928 is trying to acquire lock:
      0000000092621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0
      
      but task is already holding lock:
      00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (&ei->dio_sem){++++}:
             lock_acquire+0xbd/0x220
             down_write+0x51/0xb0
             btrfs_log_changed_extents+0x80/0xa40
             btrfs_log_inode+0xbaf/0x1000
             btrfs_log_inode_parent+0x26f/0xa80
             btrfs_log_dentry_safe+0x50/0x70
             btrfs_sync_file+0x357/0x540
             do_fsync+0x38/0x60
             __ia32_sys_fdatasync+0x12/0x20
             do_fast_syscall_32+0x9a/0x2f0
             entry_SYSENTER_compat+0x84/0x96
      
      -> #4 (&ei->log_mutex){+.+.}:
             lock_acquire+0xbd/0x220
             __mutex_lock+0x86/0xa10
             btrfs_record_unlink_dir+0x2a/0xa0
             btrfs_unlink+0x5a/0xc0
             vfs_unlink+0xb1/0x1a0
             do_unlinkat+0x264/0x2b0
             do_fast_syscall_32+0x9a/0x2f0
             entry_SYSENTER_compat+0x84/0x96
      
      -> #3 (sb_internal#2){.+.+}:
             lock_acquire+0xbd/0x220
             __sb_start_write+0x14d/0x230
             start_transaction+0x3e6/0x590
             btrfs_evict_inode+0x475/0x640
             evict+0xbf/0x1b0
             btrfs_run_delayed_iputs+0x6c/0x90
             cleaner_kthread+0x124/0x1a0
             kthread+0x106/0x140
             ret_from_fork+0x3a/0x50
      
      -> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}:
             lock_acquire+0xbd/0x220
             __mutex_lock+0x86/0xa10
             btrfs_alloc_data_chunk_ondemand+0x197/0x530
             btrfs_check_data_free_space+0x4c/0x90
             btrfs_delalloc_reserve_space+0x20/0x60
             btrfs_page_mkwrite+0x87/0x520
             do_page_mkwrite+0x31/0xa0
             __handle_mm_fault+0x799/0xb00
             handle_mm_fault+0x7c/0xe0
             __do_page_fault+0x1d3/0x4a0
             async_page_fault+0x1e/0x30
      
      -> #1 (sb_pagefaults){.+.+}:
             lock_acquire+0xbd/0x220
             __sb_start_write+0x14d/0x230
             btrfs_page_mkwrite+0x6a/0x520
             do_page_mkwrite+0x31/0xa0
             __handle_mm_fault+0x799/0xb00
             handle_mm_fault+0x7c/0xe0
             __do_page_fault+0x1d3/0x4a0
             async_page_fault+0x1e/0x30
      
      -> #0 (&mm->mmap_sem){++++}:
             __lock_acquire+0x42e/0x7a0
             lock_acquire+0xbd/0x220
             down_read+0x48/0xb0
             get_user_pages_unlocked+0x5a/0x1e0
             get_user_pages_fast+0xa4/0x150
             iov_iter_get_pages+0xc3/0x340
             do_direct_IO+0xf93/0x1d70
             __blockdev_direct_IO+0x32d/0x1c20
             btrfs_direct_IO+0x227/0x400
             generic_file_direct_write+0xcf/0x180
             btrfs_file_write_iter+0x308/0x58c
             aio_write+0xf8/0x1d0
             io_submit_one+0x3a9/0x620
             __ia32_compat_sys_io_submit+0xb2/0x270
             do_int80_syscall_32+0x5b/0x1a0
             entry_INT80_compat+0x88/0xa0
      
      other info that might help us debug this:
      
      Chain exists of:
        &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ei->dio_sem);
                                     lock(&ei->log_mutex);
                                     lock(&ei->dio_sem);
        lock(&mm->mmap_sem);
      
       *** DEADLOCK ***
      
      1 lock held by aio-dio-invalid/30928:
       #0: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400
      
      stack backtrace:
      CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Call Trace:
       dump_stack+0x7c/0xbb
       print_circular_bug.isra.37+0x297/0x2a4
       check_prev_add.constprop.45+0x781/0x7a0
       ? __lock_acquire+0x42e/0x7a0
       validate_chain.isra.41+0x7f0/0xb00
       __lock_acquire+0x42e/0x7a0
       lock_acquire+0xbd/0x220
       ? get_user_pages_unlocked+0x5a/0x1e0
       down_read+0x48/0xb0
       ? get_user_pages_unlocked+0x5a/0x1e0
       get_user_pages_unlocked+0x5a/0x1e0
       get_user_pages_fast+0xa4/0x150
       iov_iter_get_pages+0xc3/0x340
       do_direct_IO+0xf93/0x1d70
       ? __alloc_workqueue_key+0x358/0x490
       ? __blockdev_direct_IO+0x14b/0x1c20
       __blockdev_direct_IO+0x32d/0x1c20
       ? btrfs_run_delalloc_work+0x40/0x40
       ? can_nocow_extent+0x490/0x490
       ? kvm_clock_read+0x1f/0x30
       ? can_nocow_extent+0x490/0x490
       ? btrfs_run_delalloc_work+0x40/0x40
       btrfs_direct_IO+0x227/0x400
       ? btrfs_run_delalloc_work+0x40/0x40
       generic_file_direct_write+0xcf/0x180
       btrfs_file_write_iter+0x308/0x58c
       aio_write+0xf8/0x1d0
       ? kvm_clock_read+0x1f/0x30
       ? __might_fault+0x3e/0x90
       io_submit_one+0x3a9/0x620
       ? io_submit_one+0xe5/0x620
       __ia32_compat_sys_io_submit+0xb2/0x270
       do_int80_syscall_32+0x5b/0x1a0
       entry_INT80_compat+0x88/0xa0
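
The dependency chain in the report can be read as a directed graph in which an edge means "this lock was held while that lock was acquired". The sketch below transcribes the #0 to #5 entries into such a graph and searches it for a cycle. It is an illustration of the reasoning only, not lockdep's algorithm, and the "after" graph assumes the change makes fsync take dio_sem before log_mutex.

```python
# Edge "A -> B": some task held A while acquiring B.
# Transcribed from entries #0..#5 of the report above.
held_then_acquired = {
    "dio_sem": ["mmap_sem"],                          # entry #0
    "mmap_sem": ["sb_pagefaults"],                    # entry #1
    "sb_pagefaults": ["cleaner_delayed_iput_mutex"],  # entry #2
    "cleaner_delayed_iput_mutex": ["sb_internal"],    # entry #3
    "sb_internal": ["log_mutex"],                     # entry #4
    "log_mutex": ["dio_sem"],                         # entry #5
}


def find_cycle(graph, start):
    """Return a path start -> ... -> start if one exists, else None."""
    stack = [(start, [start])]
    seen = set()
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == start:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None


cycle = find_cycle(held_then_acquired, "dio_sem")
print(" -> ".join(cycle))

# Assumed ordering after the change: dio_sem is taken first, log_mutex inside.
after_fix = dict(held_then_acquired)
after_fix["log_mutex"] = []
after_fix["dio_sem"] = ["mmap_sem", "log_mutex"]
print(find_cycle(after_fix, "dio_sem"))
```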
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c495144b
    • btrfs: don't run delayed_iputs in commit · 30928e9b
      Josef Bacik authored
      
      
      This could result in a really bad case where we do something like
      
      evict
        evict_refill_and_join
          btrfs_commit_transaction
            btrfs_run_delayed_iputs
              evict
                evict_refill_and_join
                  btrfs_commit_transaction
      ... forever
      
      We have plenty of other places where we run delayed iputs that are much
      safer, let those do the work.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      30928e9b
    • btrfs: fix insert_reserved error handling · 80ee54bf
      Josef Bacik authored
      
      
      We were not handling the reserved byte accounting properly for data
      references.  Metadata was mostly fine: if it errored out, the error
      paths would free the bytes_reserved count and pin the extent, but even
      that missed one of the error cases.  So instead move this handling up
      into run_one_delayed_ref so we are sure that both cases are properly
      cleaned up in case of a transaction abort.
      
      CC: stable@vger.kernel.org # 4.18+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      80ee54bf
    • btrfs: only free reserved extent if we didn't insert it · 49940bdd
      Josef Bacik authored
      
      
      When we insert the file extent once the ordered extent completes, we
      free the reserved extent reservation, as it'll have been migrated to the
      bytes_used counter.  However, if we error out after this step we'll
      still clear the reserved extent reservation, resulting in a negative
      accounting of the reserved bytes for the block group and space info.
      Fix this by only doing the free if we didn't successfully insert a file
      extent for this extent.
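
The double release described above can be reproduced with a toy counter model. This is a minimal sketch under simplifying assumptions: `SpaceCounters` and `finish_ordered_extent` are hypothetical names, and the real kernel accounting involves more state than two integers.

```python
class SpaceCounters:
    """Toy model of the reserved/used byte counters."""

    def __init__(self, reserved):
        self.bytes_reserved = reserved
        self.bytes_used = 0

    def insert_file_extent(self, nbytes):
        # Inserting the extent migrates the reservation to bytes_used.
        self.bytes_reserved -= nbytes
        self.bytes_used += nbytes

    def free_reserved(self, nbytes):
        self.bytes_reserved -= nbytes


def finish_ordered_extent(nbytes, fail_after_insert, check_inserted):
    counters = SpaceCounters(reserved=nbytes)
    counters.insert_file_extent(nbytes)
    inserted = True
    if fail_after_insert:
        # Error path: the old behaviour frees unconditionally, the fixed
        # behaviour frees only when the extent was never inserted.
        if not check_inserted or not inserted:
            counters.free_reserved(nbytes)
    return counters.bytes_reserved


old = finish_ordered_extent(4096, fail_after_insert=True, check_inserted=False)
new = finish_ordered_extent(4096, fail_after_insert=True, check_inserted=True)
print(old, new)
```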
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      49940bdd
    • btrfs: don't use ctl->free_space for max_extent_size · fb5c39d7
      Josef Bacik authored
      
      
      max_extent_size is supposed to be the largest contiguous range for the
      space info, and ctl->free_space is the total free space in the block
      group.  We need to keep track of these separately and _only_ use the
      max_free_space if we don't have a max_extent_size, as that means our
      original request was too large to search any of the block groups for and
      therefore wouldn't have a max_extent_size set.
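
The difference between total free space and the largest contiguous range is easy to see with numbers. The sketch below uses a made-up list of (offset, length) free extents; `summarize` is a hypothetical helper, not a kernel function.

```python
def summarize(free_extents):
    """Return (total free bytes, largest contiguous free range)."""
    total_free = sum(length for _, length in free_extents)
    largest = max((length for _, length in free_extents), default=0)
    return total_free, largest


# Three separate free ranges in one block group (illustrative values).
free_extents = [(0, 4096), (16384, 8192), (65536, 4096)]
total_free, largest = summarize(free_extents)

request = 12288
# The total suggests the request fits, the largest range shows it does not.
print(total_free >= request, largest >= request)
```

Reporting the total as the maximum extent size would tell an allocator that a 12288-byte request can succeed here, although no single free range is that large.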
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb5c39d7
    • btrfs: set max_extent_size properly · ad22cf6e
      Josef Bacik authored
      
      
      We can't use entry->bytes if our entry is a bitmap entry; we need to use
      entry->max_extent_size in that case.  Fix up all the logic to make this
      consistent.
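
Why a bitmap entry needs different treatment can be shown with a small model. In this sketch a bitmap entry's byte count covers all free units, which need not be adjacent, so the contiguous size has to come from the longest run of set bits. `FreeSpaceEntry`, `UNIT` and the helpers are assumptions made for illustration and do not mirror the kernel structures.

```python
from dataclasses import dataclass

UNIT = 4096  # bytes covered by one bitmap bit (assumed for this sketch)


@dataclass
class FreeSpaceEntry:
    nbytes: int     # total free bytes described by the entry
    bits: str = ""  # non-empty only for bitmap entries, "1" = free unit

    @property
    def is_bitmap(self):
        return bool(self.bits)


def longest_free_run(bits):
    """Length of the longest run of consecutive '1' characters."""
    return max((len(run) for run in bits.split("0")), default=0)


def contiguous_bytes(entry):
    """Largest contiguous free range the entry can actually satisfy."""
    if entry.is_bitmap:
        return longest_free_run(entry.bits) * UNIT
    return entry.nbytes


extent_entry = FreeSpaceEntry(nbytes=8192)
bitmap_entry = FreeSpaceEntry(nbytes=7 * UNIT, bits="1101110011")
print(contiguous_bytes(extent_entry), contiguous_bytes(bitmap_entry))
```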
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad22cf6e
    • btrfs: reset max_extent_size properly · 21a94f7a
      Josef Bacik authored
      
      
      If we use up our block group before allocating a new one we'll easily
      get a max_extent_size that's set really really low, which will result in
      a lot of fragmentation.  We need to make sure we're resetting the
      max_extent_size when we add a new chunk or add new space.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      21a94f7a
    • MAINTAINERS: update my email address for btrfs · b2b5b650
      Josef Bacik authored
      
      
      My work email is completely useless, switch it to my personal address so
      I get emails on an account I actually pay attention to.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b2b5b650
  10. Oct 18, 2018
  11. Oct 17, 2018
    • Btrfs: fix deadlock when writing out free space caches · 5ce55557
      Filipe Manana authored
      When writing out a block group free space cache we can end up
      deadlocking with ourselves on an extent buffer lock, resulting in a
      warning like the following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G        W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallow the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block group
      creation modifies. Once the new extent buffer is allocated, it allows
      creation of pending block groups to happen again.
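
The mechanism can be sketched as a flag that is cleared around the allocation and set again afterwards. The model below is a loose Python analogy, not the kernel code: `Transaction`, `cow_block`, `allocate_tree_block` and the tree names are hypothetical, and the allocation is assumed to always queue one pending block group.

```python
from contextlib import contextmanager

TREES_TOUCHED_BY_BG_CREATION = {"extent", "chunk", "device"}


class Transaction:
    def __init__(self):
        self.can_flush_pending_bgs = True
        self.pending_bgs = []
        self.created_bgs = []

    def maybe_create_pending_block_groups(self):
        # Finalizing block groups modifies the extent, chunk and device trees.
        if self.can_flush_pending_bgs:
            self.created_bgs.extend(self.pending_bgs)
            self.pending_bgs.clear()


@contextmanager
def pending_bgs_blocked(trans):
    trans.can_flush_pending_bgs = False
    try:
        yield
    finally:
        trans.can_flush_pending_bgs = True


def allocate_tree_block(trans):
    trans.pending_bgs.append("bg")  # assume the allocation needed a new chunk
    trans.maybe_create_pending_block_groups()


def cow_block(trans, tree):
    if tree in TREES_TOUCHED_BY_BG_CREATION:
        # A leaf of one of these trees may already be locked by the caller.
        with pending_bgs_blocked(trans):
            allocate_tree_block(trans)
    else:
        allocate_tree_block(trans)


trans = Transaction()
cow_block(trans, "extent")
deferred = list(trans.pending_bgs)
cow_block(trans, "fs")
print(deferred, trans.created_bgs)
```

In this model the block group queued while COWing the extent tree stays pending and is only finalized by a later call made from a path where the flag is set.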
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
      Reported-by: E V <eliventer@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5ce55557
    • Btrfs: fix assertion on fsync of regular file when using no-holes feature · 7ed586d0
      Filipe Manana authored
      
      
      When using the NO_HOLES feature and logging a regular file, we were
      expecting that if we find an inline extent, either its size in RAM
      (uncompressed and unencoded) matches the size of the file or, if it does
      not, it matches the sector size and represents compressed data.
      This assertion does not cover the case where the length of the inline
      extent is smaller than the sector size and also smaller than the file's
      size; such a case is possible through fallocate. Example:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
        $ xfs_io -c "falloc 40 40" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
      In the above example we trigger the assertion because the inline extent's
      length is 21 bytes while the file size is 80 bytes. The fallocate() call
      merely updated the file's size and did not touch the existing inline
      extent, as expected.
      
      So fix this by adjusting the assertion so that an inline extent length
      smaller than the file size is valid if the file size is smaller than the
      filesystem's sector size.
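
The old and adjusted conditions can be written as two predicates and checked against the numbers from the example. This is a sketch of the logic as described in the text, not the kernel's assertion; the function names are hypothetical and a 4096-byte sector size is assumed.

```python
def old_inline_len_ok(length, i_size, sectorsize, compressed):
    """Condition as described before the fix."""
    return length == i_size or (length == sectorsize and compressed)


def new_inline_len_ok(length, i_size, sectorsize, compressed):
    """Adds the case of a short inline extent in a file below one sector."""
    return (old_inline_len_ok(length, i_size, sectorsize, compressed)
            or (length < i_size and i_size < sectorsize))


# Values from the example: 21-byte inline extent, file size 80 after falloc.
SECTORSIZE = 4096  # assumed sector size
print(old_inline_len_ok(21, 80, SECTORSIZE, compressed=False))
print(new_inline_len_ok(21, 80, SECTORSIZE, compressed=False))
```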
      
      A test case for fstests follows soon.
      
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Fixes: a89ca6f2 ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
      CC: stable@vger.kernel.org # 4.14+
      Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7ed586d0
    • Btrfs: fix null pointer dereference on compressed write path error · 3527a018
      Filipe Manana authored
      At inode.c:compress_file_range(), under the "free_pages_out" label, we can
      end up dereferencing the "pages" pointer when it has a NULL value. This
      case happens when "start" has a value of 0 and we fail to allocate memory
      for the "pages" pointer. When that happens we jump to the "cont" label and
      then enter the "if (start == 0)" branch where we immediately call the
      cow_file_range_inline() function. If that function returns 0 (success
      creating an inline extent) or an error (like -ENOMEM for example) we jump
      to the "free_pages_out" label and then access "pages[i]" leading to a NULL
      pointer dereference, since "nr_pages" has a value greater than zero at
      that point.
      
      Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
      the "pages" pointer.
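
The failure mode has a direct analogue in any language: a cleanup loop driven by a count that no longer matches the array it indexes. The sketch below mimics it in Python, where indexing `None` raises `TypeError` instead of crashing; `compress_range` and `release_pages` are hypothetical stand-ins, not the kernel functions.

```python
def release_pages(pages, nr_pages):
    """Cleanup loop: touches pages[i] for every counted page."""
    released = 0
    for i in range(nr_pages):
        _ = pages[i]  # fails if pages is None and nr_pages > 0
        released += 1
    return released


def compress_range(alloc_fails, reset_count_on_failure):
    nr_pages = 4
    pages = None if alloc_fails else [object() for _ in range(nr_pages)]
    if pages is None and reset_count_on_failure:
        # Keep the count consistent with the (missing) array.
        nr_pages = 0
    return release_pages(pages, nr_pages)


print(compress_range(alloc_fails=True, reset_count_on_failure=True))
try:
    compress_range(alloc_fails=True, reset_count_on_failure=False)
except TypeError as exc:
    print("without the reset:", exc)
```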
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3527a018
  12. Oct 15, 2018