  1. Jan 31, 2022
    • btrfs: fix deadlock between quota disable and qgroup rescan worker · e804861b
      Shin'ichiro Kawasaki authored
      
      
      The quota disable ioctl starts a transaction before waiting for the
      qgroup rescan worker to complete. However, this wait can be infinite and
      results in a deadlock because of a circular dependency among the quota
      disable ioctl, the qgroup rescan worker and another task holding a
      transaction, such as a block group relocation task.
      
      The deadlock happens with the following steps:
      
      1) Task A calls the ioctl to disable quota. It starts a transaction and
         waits for the qgroup rescan worker to complete.
      2) Task B, such as a block group relocation task, starts a transaction
         and joins the transaction that task A started. Then task B commits
         that transaction. In this commit, task B waits for the commit by
         task A.
      3) Task C, the qgroup rescan worker, starts its job and starts a
         transaction. In this transaction start, task C waits for the
         completion of the transaction that task A started and task B
         committed.
      
      This deadlock was found with fstests test case btrfs/115 and a zoned
      null_blk device. The test case enables and disables quota, and the
      block group reclaim was triggered during the quota disable by chance.
      The deadlock was also observed by running quota enable and disable in
      parallel with the 'btrfs balance' command on regular null_blk devices.
      
      An example report of the deadlock:
      
        [372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds.
        [372.479944]       Not tainted 5.16.0-rc8 #7
        [372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [372.493898] task:kworker/u16:6   state:D stack:    0 pid:  103 ppid:     2 flags:0x00004000
        [372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs]
        [372.510782] Call Trace:
        [372.514092]  <TASK>
        [372.521684]  __schedule+0xb56/0x4850
        [372.530104]  ? io_schedule_timeout+0x190/0x190
        [372.538842]  ? lockdep_hardirqs_on+0x7e/0x100
        [372.547092]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [372.555591]  schedule+0xe0/0x270
        [372.561894]  btrfs_commit_transaction+0x18bb/0x2610 [btrfs]
        [372.570506]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
        [372.578875]  ? free_unref_page+0x3f2/0x650
        [372.585484]  ? finish_wait+0x270/0x270
        [372.591594]  ? release_extent_buffer+0x224/0x420 [btrfs]
        [372.599264]  btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs]
        [372.607157]  ? lock_release+0x3a9/0x6d0
        [372.613054]  ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs]
        [372.620960]  ? do_raw_spin_lock+0x11e/0x250
        [372.627137]  ? rwlock_bug.part.0+0x90/0x90
        [372.633215]  ? lock_is_held_type+0xe4/0x140
        [372.639404]  btrfs_work_helper+0x1ae/0xa90 [btrfs]
        [372.646268]  process_one_work+0x7e9/0x1320
        [372.652321]  ? lock_release+0x6d0/0x6d0
        [372.658081]  ? pwq_dec_nr_in_flight+0x230/0x230
        [372.664513]  ? rwlock_bug.part.0+0x90/0x90
        [372.670529]  worker_thread+0x59e/0xf90
        [372.676172]  ? process_one_work+0x1320/0x1320
        [372.682440]  kthread+0x3b9/0x490
        [372.687550]  ? _raw_spin_unlock_irq+0x24/0x50
        [372.693811]  ? set_kthread_struct+0x100/0x100
        [372.700052]  ret_from_fork+0x22/0x30
        [372.705517]  </TASK>
        [372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds.
        [372.729827]       Not tainted 5.16.0-rc8 #7
        [372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [372.767106] task:btrfs-transacti state:D stack:    0 pid: 2347 ppid:     2 flags:0x00004000
        [372.787776] Call Trace:
        [372.801652]  <TASK>
        [372.812961]  __schedule+0xb56/0x4850
        [372.830011]  ? io_schedule_timeout+0x190/0x190
        [372.852547]  ? lockdep_hardirqs_on+0x7e/0x100
        [372.871761]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [372.886792]  schedule+0xe0/0x270
        [372.901685]  wait_current_trans+0x22c/0x310 [btrfs]
        [372.919743]  ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs]
        [372.938923]  ? finish_wait+0x270/0x270
        [372.959085]  ? join_transaction+0xc75/0xe30 [btrfs]
        [372.977706]  start_transaction+0x938/0x10a0 [btrfs]
        [372.997168]  transaction_kthread+0x19d/0x3c0 [btrfs]
        [373.013021]  ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs]
        [373.031678]  kthread+0x3b9/0x490
        [373.047420]  ? _raw_spin_unlock_irq+0x24/0x50
        [373.064645]  ? set_kthread_struct+0x100/0x100
        [373.078571]  ret_from_fork+0x22/0x30
        [373.091197]  </TASK>
        [373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds.
        [373.114147]       Not tainted 5.16.0-rc8 #7
        [373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [373.130393] task:btrfs           state:D stack:    0 pid: 3145 ppid:  3141 flags:0x00004000
        [373.140998] Call Trace:
        [373.145501]  <TASK>
        [373.149654]  __schedule+0xb56/0x4850
        [373.155306]  ? io_schedule_timeout+0x190/0x190
        [373.161965]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.168469]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [373.175468]  schedule+0xe0/0x270
        [373.180814]  wait_for_commit+0x104/0x150 [btrfs]
        [373.187643]  ? test_and_set_bit+0x20/0x20 [btrfs]
        [373.194772]  ? kmem_cache_free+0x124/0x550
        [373.201191]  ? btrfs_put_transaction+0x69/0x3d0 [btrfs]
        [373.208738]  ? finish_wait+0x270/0x270
        [373.214704]  ? __btrfs_end_transaction+0x347/0x7b0 [btrfs]
        [373.222342]  btrfs_commit_transaction+0x44d/0x2610 [btrfs]
        [373.230233]  ? join_transaction+0x255/0xe30 [btrfs]
        [373.237334]  ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs]
        [373.245251]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
        [373.253296]  relocate_block_group+0x105/0xc20 [btrfs]
        [373.260533]  ? mutex_lock_io_nested+0x1270/0x1270
        [373.267516]  ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs]
        [373.275155]  ? merge_reloc_roots+0x710/0x710 [btrfs]
        [373.283602]  ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs]
        [373.291934]  ? kmem_cache_free+0x124/0x550
        [373.298180]  btrfs_relocate_block_group+0x35c/0x930 [btrfs]
        [373.306047]  btrfs_relocate_chunk+0x85/0x210 [btrfs]
        [373.313229]  btrfs_balance+0x12f4/0x2d20 [btrfs]
        [373.320227]  ? lock_release+0x3a9/0x6d0
        [373.326206]  ? btrfs_relocate_chunk+0x210/0x210 [btrfs]
        [373.333591]  ? lock_is_held_type+0xe4/0x140
        [373.340031]  ? rcu_read_lock_sched_held+0x3f/0x70
        [373.346910]  btrfs_ioctl_balance+0x548/0x700 [btrfs]
        [373.354207]  btrfs_ioctl+0x7f2/0x71b0 [btrfs]
        [373.360774]  ? lockdep_hardirqs_on_prepare+0x410/0x410
        [373.367957]  ? lockdep_hardirqs_on_prepare+0x410/0x410
        [373.375327]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
        [373.383841]  ? find_held_lock+0x2c/0x110
        [373.389993]  ? lock_release+0x3a9/0x6d0
        [373.395828]  ? mntput_no_expire+0xf7/0xad0
        [373.402083]  ? lock_is_held_type+0xe4/0x140
        [373.408249]  ? vfs_fileattr_set+0x9f0/0x9f0
        [373.414486]  ? selinux_file_ioctl+0x349/0x4e0
        [373.420938]  ? trace_raw_output_lock+0xb4/0xe0
        [373.427442]  ? selinux_inode_getsecctx+0x80/0x80
        [373.434224]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.440660]  ? force_qs_rnp+0x2a0/0x6b0
        [373.446534]  ? lock_is_held_type+0x9b/0x140
        [373.452763]  ? __blkcg_punt_bio_submit+0x1b0/0x1b0
        [373.459732]  ? security_file_ioctl+0x50/0x90
        [373.466089]  __x64_sys_ioctl+0x127/0x190
        [373.472022]  do_syscall_64+0x3b/0x90
        [373.477513]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [373.484823] RIP: 0033:0x7f8f4af7e2bb
        [373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb
        [373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003
        [373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0
        [373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001
        [373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002
        [373.546506]  </TASK>
        [373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds.
        [373.559383]       Not tainted 5.16.0-rc8 #7
        [373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [373.575748] task:btrfs           state:D stack:    0 pid: 3146 ppid:  2168 flags:0x00000000
        [373.586314] Call Trace:
        [373.590846]  <TASK>
        [373.595121]  __schedule+0xb56/0x4850
        [373.600901]  ? __lock_acquire+0x23db/0x5030
        [373.607176]  ? io_schedule_timeout+0x190/0x190
        [373.613954]  schedule+0xe0/0x270
        [373.619157]  schedule_timeout+0x168/0x220
        [373.625170]  ? usleep_range_state+0x150/0x150
        [373.631653]  ? mark_held_locks+0x9e/0xe0
        [373.637767]  ? do_raw_spin_lock+0x11e/0x250
        [373.643993]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
        [373.651267]  ? _raw_spin_unlock_irq+0x24/0x50
        [373.657677]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.664103]  wait_for_completion+0x163/0x250
        [373.670437]  ? bit_wait_timeout+0x160/0x160
        [373.676585]  btrfs_quota_disable+0x176/0x9a0 [btrfs]
        [373.683979]  ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs]
        [373.691340]  ? down_write+0xd0/0x130
        [373.696880]  ? down_write_killable+0x150/0x150
        [373.703352]  btrfs_ioctl+0x3945/0x71b0 [btrfs]
        [373.710061]  ? find_held_lock+0x2c/0x110
        [373.716192]  ? lock_release+0x3a9/0x6d0
        [373.722047]  ? __handle_mm_fault+0x23cd/0x3050
        [373.728486]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
        [373.737032]  ? set_pte+0x6a/0x90
        [373.742271]  ? do_raw_spin_unlock+0x55/0x1f0
        [373.748506]  ? lock_is_held_type+0xe4/0x140
        [373.754792]  ? vfs_fileattr_set+0x9f0/0x9f0
        [373.761083]  ? selinux_file_ioctl+0x349/0x4e0
        [373.767521]  ? selinux_inode_getsecctx+0x80/0x80
        [373.774247]  ? __up_read+0x182/0x6e0
        [373.780026]  ? count_memcg_events.constprop.0+0x46/0x60
        [373.787281]  ? up_write+0x460/0x460
        [373.792932]  ? security_file_ioctl+0x50/0x90
        [373.799232]  __x64_sys_ioctl+0x127/0x190
        [373.805237]  do_syscall_64+0x3b/0x90
        [373.810947]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [373.818102] RIP: 0033:0x7f1383ea02bb
        [373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb
        [373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003
        [373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
        [373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a
        [373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000
        [373.879838]  </TASK>
        [373.884018]
                     Showing all locks held in the system:
        [373.894250] 3 locks held by kworker/4:1/58:
        [373.900356] 1 lock held by khungtaskd/63:
        [373.906333]  #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260
        [373.917307] 3 locks held by kworker/u16:6/103:
        [373.923938]  #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320
        [373.936555]  #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320
        [373.951109]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs]
        [373.964027] 2 locks held by less/1803:
        [373.969982]  #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80
        [373.981295]  #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060
        [373.992969] 1 lock held by btrfs-transacti/2347:
        [373.999893]  #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs]
        [374.015872] 3 locks held by btrfs/3145:
        [374.022298]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs]
        [374.034456]  #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs]
        [374.047646]  #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs]
        [374.063295] 4 locks held by btrfs/3146:
        [374.069647]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs]
        [374.081601]  #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs]
        [374.094283]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs]
        [374.106885]  #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs]
      
        [374.126780] =============================================
      
      To avoid the deadlock, wait for the qgroup rescan worker to complete
      before starting the transaction for the quota disable ioctl. Clear the
      BTRFS_FS_QUOTA_ENABLED flag before the wait and before the transaction,
      to request that the worker complete. On transaction start failure, set
      the BTRFS_FS_QUOTA_ENABLED flag again. These flag changes can be done
      safely since btrfs_quota_disable is not called concurrently, thanks to
      fs_info->subvol_sem.

      Also check the BTRFS_FS_QUOTA_ENABLED flag in qgroup_rescan_init to
      prevent another qgroup rescan worker from starting after the previous
      worker has completed.
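
      A minimal sketch of this ordering, using the existing btrfs helpers but
      with the qgroup tree removal and most error handling omitted (not the
      exact patch):

        int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
        {
                struct btrfs_trans_handle *trans;

                /* Ask the rescan worker to finish, then wait for it... */
                clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
                btrfs_qgroup_wait_for_completion(fs_info, false);

                /* ...and only then start the transaction. */
                trans = btrfs_start_transaction(fs_info->tree_root, 1);
                if (IS_ERR(trans)) {
                        /* Safe: callers are serialized by fs_info->subvol_sem. */
                        set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
                        return PTR_ERR(trans);
                }

                /* ... drop the qgroup tree and commit, as before ... */
                return btrfs_end_transaction(trans);
        }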
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't start transaction for scrub if the fs is mounted read-only · 2d192fc4
      Qu Wenruo authored
      [BUG]
      The following super simple script would crash btrfs at unmount time, if
      CONFIG_BTRFS_ASSERT is set.
      
       mkfs.btrfs -f $dev
       mount $dev $mnt
       xfs_io -f -c "pwrite 0 4k" $mnt/file
       umount $mnt
       mount -r ro $dev $mnt
       btrfs scrub start -Br $mnt
       umount $mnt
      
      This will trigger the ASSERT() introduced by commit 0a31daa4 ("btrfs:
      add assertion for empty list of transactions at late stage of umount").
      
      That patch is definitely not the cause, it just makes the problem noisy
      enough for developers to notice.
      
      [CAUSE]
      We start a transaction in the following call chain during scrub:
      
        scrub_enumerate_chunks()
        |- btrfs_inc_block_group_ro()
           |- btrfs_join_transaction()
      
      However for a read-only mount there is no running transaction at all,
      thus btrfs_join_transaction() will start a new transaction.

      Furthermore, since it's a read-only mount, btrfs_sync_fs() will not call
      btrfs_commit_super() to commit the new but empty transaction.

      This leads to the ASSERT().
      
      The bug has been there for a long time. Only the new ASSERT() makes it
      noisy enough to be noticed.
      
      [FIX]
      For a read-only scrub on a read-only mount, there is no need to start a
      transaction nor to allocate new chunks in btrfs_inc_block_group_ro().

      Just add an extra read-only mount check in btrfs_inc_block_group_ro(),
      and if the filesystem is read-only, skip all chunk allocation and call
      inc_block_group_ro() directly.
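
      A minimal sketch of the check, with the normal read-write path elided
      (inc_block_group_ro() is the existing static helper; this is not the
      exact patch):

        int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
                                     bool do_chunk_alloc)
        {
                struct btrfs_fs_info *fs_info = cache->fs_info;

                /*
                 * On a read-only mount no transaction is running and no new
                 * chunk can be allocated, so just mark the block group
                 * read-only without joining a transaction.
                 */
                if (sb_rdonly(fs_info->sb))
                        return inc_block_group_ro(cache, 0);

                /* ... rw mount: join a transaction, allocate a chunk if needed ... */
        }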
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. Jan 25, 2022
    • btrfs: update writeback index when starting defrag · 27cdfde1
      Filipe Manana authored
      When starting a defrag, we should update the writeback index of the
      inode's mapping in case it currently has a value beyond the start of the
      range we are defragging. This can help performance and often results in
      getting fewer extents after writeback - for example, if the current
      value of the writeback index sits somewhere in the middle of a range
      that gets dirtied by the defrag, then after writeback we can get two
      smaller extents instead of a single, larger extent.
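
      A minimal sketch of the idea, assuming it runs at the start of
      btrfs_defrag_file() with 'start' being the start offset of the range to
      defrag (not the exact patch):

        /*
         * Pull the mapping's writeback index back to the start of the range
         * we are about to dirty, so that writeback begins there.
         */
        pgoff_t start_index = start >> PAGE_SHIFT;

        if (start_index < inode->i_mapping->writeback_index)
                inode->i_mapping->writeback_index = start_index;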
      
      We used to have this before the refactoring in 5.16, but it was removed
      without any reason to do so. Originally it was added in kernel 3.1, by
      commit 2a0f7f57 ("Btrfs: fix recursive auto-defrag"), in order to
      fix a loop with autodefrag resulting in dirtying and writing pages over
      and over, but some testing on current code did not show that happening,
      at least with the test described in that commit.
      
      So add back the behaviour, as at the very least it is a nice-to-have
      optimization.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add back missing dirty page rate limiting to defrag · 3c9d31c7
      Filipe Manana authored
      A defrag operation can dirty a lot of pages, especially if operating on
      the entire file or a large file range. Any task dirtying pages should
      periodically call balance_dirty_pages_ratelimited(), as stated in that
      function's comments, otherwise it can leave too many dirty pages in the
      system. This is what we did before the refactoring in 5.16, and it
      should have remained, just like in the buffered write path and in
      relocation. So restore that behaviour.
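
      A minimal sketch of where the call goes back, assuming the 5.16 cluster
      loop in btrfs_defrag_file() (arguments abbreviated, not the exact
      patch):

        while (cur < last_byte) {
                ret = defrag_one_cluster(BTRFS_I(inode), ra, cur, cluster_len, ...);
                if (ret < 0)
                        break;

                /* Throttle this task if it has dirtied too many pages. */
                balance_dirty_pages_ratelimited(inode->i_mapping);

                cur = cluster_end + 1;
        }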
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock when reserving space during defrag · 0cb5950f
      Filipe Manana authored
      When defragging we can end up collecting a range for defrag that already
      has pages under delalloc (dirty), as long as the respective extent map
      for their range does not map to a hole or a prealloc extent and is not
      from an old generation.
      
      Most of the time that is harmless, at least from a functional
      perspective; however it can result in a deadlock:
      
      1) At defrag_collect_targets() we find an extent map that meets all
         requirements but there's delalloc for the range it covers, and we add
         its range to the list of ranges to defrag;
      
      2) The defrag_collect_targets() function is called at defrag_one_range(),
         after it locked a range that overlaps the range of the extent map;
      
      3) At defrag_one_range(), while the range is still locked, we call
         defrag_one_locked_target() for the range associated with the extent
         map we collected at step 1);
      
      4) Then finally at defrag_one_locked_target() we do a call to
         btrfs_delalloc_reserve_space(), which will reserve data and metadata
         space. If the space reservations cannot be satisfied right away, the
         flusher might be kicked in to start flushing delalloc and wait for
         the respective ordered extents to complete. If this happens we will
         deadlock, because both flushing delalloc and finishing an ordered
         extent require locking the range in the inode's io tree, which was
         already locked at defrag_collect_targets().
      
      So fix this by skipping extent maps for which there's already delalloc.
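
      A minimal sketch of the skip inside defrag_collect_targets(), assuming
      'em' is the candidate extent map and 'cur' the current offset (not the
      exact patch):

        /* Only consider the part of the extent map from 'cur' onwards. */
        range_len = em->len - (cur - em->start);

        /*
         * If any of that range is already under delalloc, skip it: reserving
         * space for it later may kick the flusher, which needs to lock the
         * very range we are holding locked.
         */
        if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
                           EXTENT_DELALLOC, 0, NULL))
                goto next;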
      
      Fixes: eb793cf8 ("btrfs: defrag: introduce helper to collect target file extents")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. Jan 20, 2022
    • btrfs: defrag: properly update range->start for autodefrag · c080b414
      Qu Wenruo authored
      [BUG]
      After commit 7b508037 ("btrfs: defrag: use defrag_one_cluster() to
      implement btrfs_defrag_file()"), autodefrag no longer properly
      re-defrags the file from the previously finished location.
      
      [CAUSE]
      The recent refactoring of defrag only focuses on subpage support for the
      defrag ioctl and doesn't take autodefrag into consideration.
      
      There are two problems involved which prevent autodefrag from restarting
      its scan:
      
      - No range->start update
        Previously, when a defrag target was found, range->start would be
        updated to indicate where the next search should start from.

        But now btrfs_defrag_file() doesn't update it anymore, making
        autodefrag always rescan from file offset 0.

        This would also make autodefrag mark the same range dirty again and
        again, causing extra IO.
      
      - No proper quick exit for defrag_one_cluster()
        Currently if we reach or exceed the @max_sectors limit, we just exit
        defrag_one_cluster(), and let the next defrag_one_cluster() call do
        a quick exit.
        This makes @cur keep increasing, leaving no way to properly know
        which range was defragged and which range was skipped.
      
      [FIX]
      The fix involves the following modifications (see the sketch after this
      list):

      - Update range->start to the next cluster start
        This is a little different from the old behavior.
        Previously range->start was updated to the next defrag target.

        But in the end, the behavior should still be pretty much the same,
        as now we skip to the next defrag target inside btrfs_defrag_file().

        Thus if autodefrag decides to re-scan, we still do the skip, just at
        a different time.

      - Make defrag_one_cluster() return >0 to indicate a quick exit
        So that btrfs_defrag_file() can also do a quick exit, without
        increasing @cur to the range end, and reuse @cur to update
        @range->start.

      - Add a comment for btrfs_defrag_file() to mention the range->start
        update
        Currently only autodefrag utilizes this behavior, as the defrag ioctl
        doesn't set the @max_to_defrag parameter, thus unless interrupted it
        will always try to defrag the whole range.
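
      A minimal sketch of both changes, assuming the 5.16 structure of
      defrag_one_cluster() and btrfs_defrag_file() (arguments abbreviated, not
      the exact patch):

        /* In defrag_one_cluster(): report that the batch limit was hit. */
        if (*sectors_defragged >= max_sectors)
                return 1;

        /* In btrfs_defrag_file()'s cluster loop: */
        ret = defrag_one_cluster(BTRFS_I(inode), ra, cur, cluster_len, ...);
        if (ret < 0)
                break;
        cur = cluster_end + 1;
        if (ret > 0) {
                /* Quick exit: @max_sectors reached; the next run resumes at @cur. */
                ret = 0;
                break;
        }

        /* After the loop: tell autodefrag where to resume on the next run. */
        range->start = cur;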
      
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: fix wrong number of defragged sectors · 484167da
      Qu Wenruo authored
      [BUG]
      There are users of the autodefrag mount option reporting an obvious
      increase in IO:
      
      > If I compare the write average (in total, I don't have it per process)
      > when taking idle periods on the same machine:
      >     Linux 5.16:
      >         without autodefrag: ~ 10KiB/s
      >         with autodefrag: between 1 and 2MiB/s.
      >
      >     Linux 5.15:
      >         with autodefrag:~ 10KiB/s (around the same as without
      > autodefrag on 5.16)
      
      [CAUSE]
      When the autodefrag mount option is enabled, btrfs_defrag_file() will be
      called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many
      sectors we can defrag in one try.

      The number of sectors defragged is then used to determine if we need to
      re-defrag.
      
      But commit b18c3ab2 ("btrfs: defrag: introduce helper to defrag one
      cluster") uses the wrong unit to increase @sectors_defragged, which
      should be counted in sectors, not bytes.
      
      This means that if we have defragged any sector, then @sectors_defragged
      will be >= sectorsize (normally 4096), which is larger than
      BTRFS_DEFRAG_BATCH.

      This makes the @max_sectors check in defrag_one_cluster() underflow,
      rendering the whole @max_sectors check useless.

      This causes way more IO with the autodefrag mount option, as there is
      now no limit on how many sectors can really be defragged.
      
      [FIX]
      Fix the problems by:

      - Using sectors as the unit when increasing @sectors_defragged (see the
        sketch after this list)

      - Including the @sectors_defragged > @max_sectors case to break the loop

      - Adding an extra comment on the return value of btrfs_defrag_file()
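
      A minimal sketch of the unit fix (simplified; in the real code the
      accounting happens in the defrag helpers, not the exact patch):

        /* Was: *sectors_defragged += len;   (bytes, so always >= sectorsize) */
        *sectors_defragged += len >> fs_info->sectorsize_bits;

        /* In defrag_one_cluster(): honour the batch limit before more work. */
        if (max_sectors && *sectors_defragged >= max_sectors)
                break;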
      
      Reported-by: Anthony Ruhier <aruhier@mailbox.org>
      Fixes: b18c3ab2 ("btrfs: defrag: introduce helper to defrag one cluster")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow defrag to be interruptible · b767c2fc
      Filipe Manana authored
      During defrag, at btrfs_defrag_file(), we have a loop that iterates
      over a file range in steps no larger than 256K subranges. If the range
      is too long, there's no way to interrupt it. So make the loop check in
      each iteration if there's a signal pending, and if there is, break and
      return -EAGAIN to userspace.
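
      A minimal sketch of the check, assuming the cluster loop in
      btrfs_defrag_file() (not the exact patch):

        while (cur < last_byte) {
                /* Allow a long defrag to be cancelled, e.g. with Ctrl-C. */
                if (signal_pending(current)) {
                        ret = -EAGAIN;
                        break;
                }

                /* ... defrag the next cluster of at most 256K ... */
        }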
      
      Before kernel 5.16, we used to allow defrag to be cancelled through a
      signal, but that was lost with commit 7b508037 ("btrfs: defrag:
      use defrag_one_cluster() to implement btrfs_defrag_file()").
      
      This change adds back the possibility to cancel a defrag with a signal
      and keeps the same semantics, returning -EAGAIN to user space (and not
      the usually more expected -EINTR).
      
      This is also motivated by a recent bug on 5.16 where defragging a 1 byte
      file resulted in iterating from file range 0 to (u64)-1, as hitting the
      bug triggered a too long loop, basically requiring one to reboot the
      machine, as it was not possible to cancel defrag.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix too long loop when defragging a 1 byte file · 6b34cd8e
      Filipe Manana authored
      
      
      When attempting to defrag a file with a single byte, we can end up in a
      loop so long it is nearly infinite, because at btrfs_defrag_file() the
      variable last_byte ends up assigned the value 18446744073709551615
      (which is (u64)-1). The problem comes from the fact that we end up
      doing:
      
          last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
      
      So if last_byte was assigned 0, which is i_size - 1, we underflow and
      end up with the value 18446744073709551615.
      
      This is trivial to reproduce and the following script triggers it:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        echo -n "X" > $MNT/foobar
      
        btrfs filesystem defragment $MNT/foobar
      
        umount $MNT
      
      So fix this by not decrementing last_byte by 1 before doing the sector
      size round up. Also, to make it easier to follow, make the round up right
      after computing last_byte.
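
      A minimal sketch of the corrected computation in btrfs_defrag_file()
      (not the exact patch):

        u64 isize = i_size_read(inode);
        u64 cur, last_byte;

        if (range->start + range->len > range->start)
                last_byte = min(isize, range->start + range->len);
        else
                last_byte = isize;

        /* Align the range: round up first, only then subtract 1. */
        cur = round_down(range->start, fs_info->sectorsize);
        last_byte = round_up(last_byte, fs_info->sectorsize) - 1;

      With a 1 byte file and a 4K sector size this now gives last_byte == 4095
      instead of underflowing to (u64)-1.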
      
      Reported-by: Anthony Ruhier <aruhier@mailbox.org>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. Jan 07, 2022