Skip to content
  1. Feb 23, 2021
    • Josef Bacik's avatar
      btrfs: tree-checker: do not error out if extent ref hash doesn't match · 1119a72e
      Josef Bacik authored
      
      
      The tree checker checks the extent ref hash at read and write time to
      make sure we do not corrupt the file system.  Generally extent
      references go inline, but if we have enough of them we need to make an
      item, which looks like
      
      key.objectid	= <bytenr>
      key.type	= <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
      key.offset	= hash(tree, owner, offset)
      
      However if key.offset collide with an unrelated extent reference we'll
      simply key.offset++ until we get something that doesn't collide.
      Obviously this doesn't match at tree checker time, and thus we error
      while writing out the transaction.  This is relatively easy to
      reproduce, simply do something like the following
      
        xfs_io -f -c "pwrite 0 1M" file
        offset=2
      
        for i in {0..10000}
        do
      	  xfs_io -c "reflink file 0 ${offset}M 1M" file
      	  offset=$(( offset + 2 ))
        done
      
        xfs_io -c "reflink file 0 17999258914816 1M" file
        xfs_io -c "reflink file 0 35998517829632 1M" file
        xfs_io -c "reflink file 0 53752752058368 1M" file
      
        btrfs filesystem sync
      
      And the sync will error out because we'll abort the transaction.  The
      magic values above are used because they generate hash collisions with
      the first file in the main subvol.
      
      The fix for this is to remove the hash value check from tree checker, as
      we have no idea which offset ours should belong to.
      
      Reported-by: default avatarTuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
      Fixes: 0785a9aa
      
       ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1119a72e
    • Filipe Manana's avatar
      btrfs: fix race between swap file activation and snapshot creation · dd0734f2
      Filipe Manana authored
      When creating a snapshot we check if the current number of swap files, in
      the root, is non-zero, and if it is, we error out and warn that we can not
      create the snapshot because there are active swap files.
      
      However this is racy because when a task started activation of a swap
      file, another task might have started already snapshot creation and might
      have seen the counter for the number of swap files as zero. This means
      that after the swap file is activated we may end up with a snapshot of the
      same root successfully created, and therefore when the first write to the
      swap file happens it has to fall back into COW mode, which should never
      happen for active swap files.
      
      Basically what can happen is:
      
      1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
         There it sees that root->nr_swapfiles has a value of 0 so it continues;
      
      2) Task B enters btrfs_swap_activate(). It is not aware that another task
         started snapshot creation but it did not finish yet. It increments
         root->nr_swapfiles from 0 to 1;
      
      3) Task B checks that the file meets all requirements to be an active
         swap file - it has NOCOW set, there are no snapshots for the inode's
         root at the moment, no file holes, no reflinked extents, etc;
      
      4) Task B returns success and now the file is an active swap file;
      
      5) Task A commits the transaction to create the snapshot and finishes.
         The swap file's extents are now shared between the original root and
         the snapshot;
      
      6) A write into an extent of the swap file is attempted - there is a
         snapshot of the file's root, so we fall back to COW mode and therefore
         the physical location of the extent changes on disk.
      
      So fix this by taking the snapshot lock during swap file activation before
      locking the extent range, as that is the order in which we lock these
      during buffered writes.
      
      Fixes: ed46ff3d
      
       ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd0734f2
    • Filipe Manana's avatar
      btrfs: fix race between writes to swap files and scrub · 195a49ea
      Filipe Manana authored
      When we active a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents to be changed by operations such as balance and device
      replace/resize/remove. We also call there can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents to be
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group into RO before it processes it
      and then back again to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group, will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents that are used
      by active swap files can not be turned into RO mode, therefore making it
      not possible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a warning
      message that mentions the block group was skipped due to active swap
      files - this is the same approach we currently use for balance.
      
      Fixes: ed46ff3d
      
       ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      195a49ea
    • Filipe Manana's avatar
      btrfs: avoid checking for RO block group twice during nocow writeback · 20903032
      Filipe Manana authored
      
      
      During the nocow writeback path, we currently iterate the rbtree of block
      groups twice: once for checking if the target block group is RO with the
      call to btrfs_extent_readonly()), and once again for getting a nocow
      reference on the block group with a call to btrfs_inc_nocow_writers().
      
      Since btrfs_inc_nocow_writers() already returns false when the target
      block group is RO, remove the call to btrfs_extent_readonly(). Not only
      we avoid searching the blocks group rbtree twice, it also helps reduce
      contention on the lock that protects it (specially since it is a spin
      lock and not a read-write lock). That may make a noticeable difference
      on very large filesystems, with thousands of allocated block groups.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      20903032
    • Nikolay Borisov's avatar
      btrfs: fix race between extent freeing/allocation when using bitmaps · 3c179165
      Nikolay Borisov authored
      
      
      During allocation the allocator will try to allocate an extent using
      cluster policy. Once the current cluster is exhausted it will remove the
      entry under btrfs_free_cluster::lock and subsequently acquire
      btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
      and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
      because there exists a race condition between removing the entry under
      one lock and doing the necessary accounting holding a different lock
      since extent freeing only uses the 2nd lock. This can result in the
      following situation:
      
      T1:                                    T2:
      btrfs_alloc_from_cluster               insert_into_bitmap <holds tree_lock>
       if (entry->bytes == 0)                   if (block_group && !list_empty(&block_group->cluster_list)) {
          rb_erase(entry)
      
       spin_unlock(&cluster->lock);
         (total_bitmaps is still 4)           spin_lock(&cluster->lock);
                                               <doesn't find entry in cluster->root>
       spin_lock(&ctl->tree_lock);             <goes to new_bitmap label, adds
      <blocked since T2 holds tree_lock>       <a new entry and calls add_new_bitmap>
      					    recalculate_thresholds  <crashes,
                                                    due to total_bitmaps
      					      becoming 5 and triggering
      					      an ASSERT>
      
      To fix this ensure that once depleted, the cluster entry is deleted when
      both cluster lock and tree locks are held in the allocator (T1), this
      ensures that even if there is a race with a concurrent
      insert_into_bitmap call it will correctly find the entry in the cluster
      and add the new space to it.
      
      CC: <stable@vger.kernel.org> # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c179165
    • Qu Wenruo's avatar
      btrfs: make check_compressed_csum() to be subpage compatible · 04d4ba4c
      Qu Wenruo authored
      
      
      Currently check_compressed_csum() completely relies on sectorsize ==
      PAGE_SIZE to do checksum verification for compressed extents.
      
      To make it subpage compatible, this patch will:
      - Do extra calculation for the csum range
        Since we have multiple sectors inside a page, we need to only hash
        the range we want, not the full page anymore.
      
      - Do sector-by-sector hash inside the page
      
      With this patch and previous conversion on
      btrfs_submit_compressed_read(), now we can read subpage compressed
      extents properly, and do proper csum verification.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      04d4ba4c
    • Qu Wenruo's avatar
      btrfs: make btrfs_submit_compressed_read() subpage compatible · be6a1361
      Qu Wenruo authored
      
      
      For compressed read, we always submit page read using page size.  This
      doesn't work well with subpage, as for subpage one page can contain
      several sectors.  Such submission will read range out of what we want,
      and cause problems.
      
      Thankfully to make it subpage compatible, we only need to change how the
      last page of the compressed extent is read.
      
      Instead of always adding a full page to the compressed read bio, if we're
      at the last page, calculate the size using compressed length, so that we
      only add part of the range into the compressed read bio.
      
      Since we are here, also change the PAGE_SIZE used in
      lookup_extent_mapping() to sectorsize.
      This modification won't cause any functional change, as
      lookup_extent_mapping() can handle the case where the search range is
      larger than found extent range.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be6a1361
    • Ira Weiny's avatar
      btrfs: fix raid6 qstripe kmap · d70cef0d
      Ira Weiny authored
      When a qstripe is required an extra page is allocated and mapped.  There
      were 3 problems:
      
      1) There is no corresponding call of kunmap() for the qstripe page.
      2) There is no reason to map the qstripe page more than once if the
         number of bits set in rbio->dbitmap is greater than one.
      3) There is no reason to map the parity page and unmap it each time
         through the loop.
      
      The page memory can continue to be reused with a single mapping on each
      iteration by raid6_call.gen_syndrome() without remapping.  So map the
      page for the duration of the loop.
      
      Similarly, improve the algorithm by mapping the parity page just 1 time.
      
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4.x: c17af965
      
      : btrfs: raid56: simplify tracking of Q stripe presence
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d70cef0d
  2. Feb 09, 2021
    • Naohiro Aota's avatar
      btrfs: zoned: enable to mount ZONED incompat flag · 9d294a68
      Naohiro Aota authored
      
      
      This final patch adds the ZONED incompat flag to the supported flags
      and enables to mount ZONED flagged file system.
      
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9d294a68
    • Naohiro Aota's avatar
      btrfs: zoned: deal with holes writing out tree-log pages · b528f467
      Naohiro Aota authored
      
      
      Since the zoned filesystem requires sequential write out of metadata, we
      cannot proceed with a hole in tree-log pages. When such a hole exists,
      btree_write_cache_pages() will return -EAGAIN. This happens when someone,
      e.g., a concurrent transaction commit, writes a dirty extent in this
      tree-log commit.
      
      If we are not going to wait for the extents, we can hope the concurrent
      writing fills the hole for us. So, we can ignore the error in this case and
      hope the next write will succeed.
      
      If we want to wait for them and got the error, we cannot wait for them
      because it will cause a deadlock. So, let's bail out to a full commit in
      this case.
      
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b528f467
    • Naohiro Aota's avatar
      btrfs: zoned: reorder log node allocation on zoned filesystem · 3ddebf27
      Naohiro Aota authored
      
      
      This is the 3/3 patch to enable tree-log on zoned filesystems.
      
      The allocation order of nodes of "fs_info->log_root_tree" and nodes of
      "root->log_root" is not the same as the writing order of them. So, the
      writing causes unaligned write errors.
      
      Reorder the allocation of them by delaying allocation of the root node of
      "fs_info->log_root_tree," so that the node buffers can go out sequentially
      to devices.
      
      Cc: Filipe Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ddebf27
    • Naohiro Aota's avatar
      btrfs: zoned: serialize log transaction on zoned filesystems · fa1a0f42
      Naohiro Aota authored
      
      
      This is the 2/3 patch to enable tree-log on zoned filesystems.
      
      Since we can start more than one log transactions per subvolume
      simultaneously, nodes from multiple transactions can be allocated
      interleaved. Such mixed allocation results in non-sequential writes at
      the time of a log transaction commit. The nodes of the global log root
      tree (fs_info->log_root_tree), also have the same problem with mixed
      allocation.
      
      Serializes log transactions by waiting for a committing transaction when
      someone tries to start a new transaction, to avoid the mixed allocation
      problem. We must also wait for running log transactions from another
      subvolume, but there is no easy way to detect which subvolume root is
      running a log transaction. So, this patch forbids starting a new log
      transaction when other subvolumes already allocated the global log root
      tree.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa1a0f42
    • Naohiro Aota's avatar
      btrfs: zoned: extend zoned allocator to use dedicated tree-log block group · 40ab3be1
      Naohiro Aota authored
      
      
      This is the 1/3 patch to enable tree log on zoned filesystems.
      
      The tree-log feature does not work on a zoned filesystem as is. Blocks for
      a tree-log tree are allocated mixed with other metadata blocks and btrfs
      writes and syncs the tree-log blocks to devices at the time of fsync(),
      which has a different timing than a global transaction commit. As a
      result, both writing tree-log blocks and writing other metadata blocks
      become non-sequential writes that zoned filesystems must avoid.
      
      Introduce a dedicated block group for tree-log blocks, so that tree-log
      blocks and other metadata blocks can be separate write streams.  As a
      result, each write stream can now be written to devices separately.
      "fs_info->treelog_bg" tracks the dedicated block group and assigns
      "treelog_bg" on-demand on tree-log block allocation time.
      
      This commit extends the zoned block allocator to use the block group.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40ab3be1
    • Naohiro Aota's avatar
      btrfs: split alloc_log_tree() · 6ab6ebb7
      Naohiro Aota authored
      
      
      This is a preparation patch for the next patch. Split alloc_log_tree()
      into two parts. The first one allocating the tree structure, remains in
      alloc_log_tree() and the second part allocating the tree node, which is
      moved into btrfs_alloc_log_tree_node().
      
      Also export the latter part is to be used in the next patch.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ab6ebb7
    • Naohiro Aota's avatar
      btrfs: zoned: relocate block group to repair IO failure in zoned filesystems · f7ef5287
      Naohiro Aota authored
      
      
      When a bad checksum is found and if the filesystem has a mirror of the
      damaged data, we read the correct data from the mirror and writes it to
      damaged blocks. This however, violates the sequential write constraints
      of a zoned block device.
      
      We can consider three methods to repair an IO failure in zoned filesystems:
      
      (1) Reset and rewrite the damaged zone
      (2) Allocate new device extent and replace the damaged device extent to
          the new extent
      (3) Relocate the corresponding block group
      
      Method (1) is most similar to a behavior done with regular devices.
      However, it also wipes non-damaged data in the same device extent, and
      so it unnecessary degrades non-damaged data.
      
      Method (2) is much like device replacing but done in the same device. It
      is safe because it keeps the device extent until the replacing finish.
      However, extending device replacing is non-trivial. It assumes
      "src_dev->physical == dst_dev->physical". Also, the extent mapping
      replacing function should be extended to support replacing device extent
      position in one device.
      
      Method (3) invokes relocation of the damaged block group and is
      straightforward to implement. It relocates all the mirrored device
      extents, so it potentially is a more costly operation than method (1) or
      (2). But it relocates only used extents which reduce the total IO size.
      
      Let's apply method (3) for now. In the future, we can extend device-replace
      and apply method (2).
      
      For protecting a block group gets relocated multiple time with multiple
      IO errors, this commit introduces "relocating_repair" bit to show it's
      now relocating to repair IO failures. Also it uses a new kthread
      "btrfs-relocating-repair", not to block IO path with relocating process.
      
      This commit also supports repairing in the scrub process.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7ef5287
    • Naohiro Aota's avatar
      btrfs: zoned: enable relocation on a zoned filesystem · 32430c61
      Naohiro Aota authored
      
      
      Currently fallocate() is disabled on a zoned filesystem. Since current
      relocation process relies on preallocation to move file data extents, it
      must be handled differently.
      
      On a zoned filesystem, we just truncate the inode to the size that we
      wanted to pre-allocate. Then, we flush dirty pages on the file before
      finishing the relocation process. run_delalloc_zoned() will handle all
      the allocations and submit IOs to the underlying layers.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      32430c61
    • Naohiro Aota's avatar
      btrfs: zoned: support dev-replace in zoned filesystems · 7db1c5d1
      Naohiro Aota authored
      
      
      This is 4/4 patch to implement device-replace on zoned filesystems.
      
      Even after the copying is done, the write pointers of the source device
      and the destination device may not be synchronized. For example, when
      the last allocated extent is freed before device-replace process, the
      extent is not copied, leaving a hole there.
      
      Synchronize the write pointers by writing zeroes to the destination
      device.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7db1c5d1
    • Naohiro Aota's avatar
      btrfs: zoned: implement copying for zoned device-replace · de17addc
      Naohiro Aota authored
      This is 3/4 patch to implement device-replace on zoned filesystems.
      
      This commit implements copying. To do this, it tracks the write pointer
      during the device replace process. As device-replace's copy process is
      smart enough to only copy used extents on the source device, we have to
      fill the gap to honor the sequential write requirement in the target
      device.
      
      The device-replace process on zoned filesystems must copy or clone all
      the extents in the source device exactly once. So, we need to ensure
      allocations started just before the dev-replace process to have their
      corresponding extent information in the B-trees.
      finish_extent_writes_for_zoned() implements that functionality, which
      basically is the removed code in the commit 042528f8
      
       ("Btrfs: fix
      block group remaining RO forever after error during device replace").
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de17addc
    • Naohiro Aota's avatar
      btrfs: zoned: implement cloning for zoned device-replace · 6143c23c
      Naohiro Aota authored
      
      
      This is 2/4 patch to implement device replace for zoned filesystems.
      
      In zoned mode, a block group must be either copied (from the source
      device to the target device) or cloned (to both devices).
      
      Implement the cloning part. If a block group targeted by an IO is marked
      to copy, we should not clone the IO to the destination device, because
      the block group is eventually copied by the replace process.
      
      This commit also handles cloning of device reset.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6143c23c
    • Naohiro Aota's avatar
      btrfs: zoned: mark block groups to copy for device-replace · 78ce9fc2
      Naohiro Aota authored
      
      
      This is the 1/4 patch to support device-replace on zoned filesystems.
      
      We have two types of IOs during the device replace process. One is an IO
      to "copy" (by the scrub functions) all the device extents from the source
      device to the destination device. The other one is an IO to "clone" (by
      handle_ops_on_dev_replace()) new incoming write IOs from users to the
      source device into the target device.
      
      Cloning incoming IOs can break the sequential write rule in on target
      device. When a write is mapped in the middle of a block group, the IO is
      directed to the middle of a target device zone, which breaks the
      sequential write requirement.
      
      However, the cloning function cannot be disabled since incoming IOs
      targeting already copied device extents must be cloned so that the IO is
      executed on the target device.
      
      We cannot use dev_replace->cursor_{left,right} to determine whether a bio
      is going to a not yet copied region. Since we have a time gap between
      finishing btrfs_scrub_dev() and rewriting the mapping tree in
      btrfs_dev_replace_finishing(), we can have a newly allocated device extent
      which is never cloned nor copied.
      
      So the point is to copy only already existing device extents. This patch
      introduces mark_block_group_to_copy() to mark existing block groups as a
      target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
      check the flag to do their job.
      
      Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
      is the last stripe in the block group. With the last stripe copied,
      the to_copy flag is finally disabled. Afterwards we can safely clone
      incoming IOs on this block group.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      78ce9fc2
    • Naohiro Aota's avatar
      btrfs: zoned: do not use async metadata checksum on zoned filesystems · 4eef29ef
      Naohiro Aota authored
      
      
      On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
      the metadata write IOs.
      
      Even with this serialization, write bios sent from btree_write_cache_pages
      can be reordered by async checksum workers as these workers are per CPU
      and not per zone.
      
      To preserve write bio ordering, we disable async metadata checksum on a
      zoned filesystem. This does not result in lower performance with HDDs as
      a single CPU core is fast enough to do checksum for a single zone write
      stream with the maximum possible bandwidth of the device. If multiple
      zones are being written simultaneously, HDD seek overhead lowers the
      achievable maximum bandwidth, resulting again in a per zone checksum
      serialization not affecting the performance.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4eef29ef
    • Naohiro Aota's avatar
      btrfs: zoned: wait for existing extents before truncating · 24c0a722
      Naohiro Aota authored
      
      
      When truncating a file, file buffers which have already been allocated
      but not yet written may be truncated. Truncating these buffers could
      cause breakage of a sequential write pattern in a block group if the
      truncated blocks are for example followed by blocks allocated to another
      file. To avoid this problem, always wait for write out of all unwritten
      buffers before proceeding with the truncate execution.
      
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24c0a722
    • Naohiro Aota's avatar
      btrfs: zoned: serialize metadata IO · 0bc09ca1
      Naohiro Aota authored
      
      
      We cannot use zone append for writing metadata, because the B-tree nodes
      have references to each other using logical address. Without knowing
      the address in advance, we cannot construct the tree in the first place.
      So we need to serialize write IOs for metadata.
      
      We cannot add a mutex around allocation and submission because metadata
      blocks are allocated in an earlier stage to build up B-trees.
      
      Add a zoned_meta_io_lock and hold it during metadata IO submission in
      btree_write_cache_pages() to serialize IOs.
      
      Furthermore, this adds a per-block group metadata IO submission pointer
      "meta_write_pointer" to ensure sequential writing, which can break when
      attempting to write back blocks in an unfinished transaction. If the
      writing out failed because of a hole and the write out is for data
      integrity (WB_SYNC_ALL), it returns EAGAIN.
      
      A caller like fsync() code should handle this properly e.g. by falling
      back to a full transaction commit.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0bc09ca1
    • Naohiro Aota's avatar
      btrfs: zoned: introduce dedicated data write path for zoned filesystems · 42c01100
      Naohiro Aota authored
      
      
      If more than one IO is issued for one file extent, these IO can be
      written to separate regions on a device. Since we cannot map one file
      extent to such a separate area on a zoned filesystem, we need to follow
      the "one IO == one ordered extent" rule.
      
      The normal buffered, uncompressed and not pre-allocated write path (used
      by cow_file_range()) sometimes does not follow this rule. It can write a
      part of an ordered extent when specified a region to write e.g., when
      its called from fdatasync().
      
      Introduce a dedicated (uncompressed buffered) data write path for zoned
      filesystems, that will COW the region and write it at once.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42c01100
    • Naohiro Aota's avatar
      btrfs: zoned: enable zone append writing for direct IO · 544d24f9
      Naohiro Aota authored
      
      
      Likewise to buffered IO, enable zone append writing for direct IO when
      its used on a zoned block device.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      544d24f9
    • Naohiro Aota's avatar
      btrfs: zoned: use ZONE_APPEND write for zoned mode · d8e3fb10
      Naohiro Aota authored
      
      
      Enable zone append writing for zoned mode. When using zone append, a
      bio is issued to the start of a target zone and the device decides to
      place it inside the zone. Upon completion the device reports the actual
      written position back to the host.
      
      Three parts are necessary to enable zone append mode. First, modify the
      bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
      bi_sector to point the beginning of the zone.
      
      Second, record the returned physical address (and disk/partno) to the
      ordered extent in end_bio_extent_writepage() after the bio has been
      completed. We cannot resolve the physical address to the logical address
      because we can neither take locks nor allocate a buffer in this end_bio
      context. So, we need to record the physical address to resolve it later
      in btrfs_finish_ordered_io().
      
      And finally, rewrite the logical addresses of the extent mapping and
      checksum data according to the physical address using btrfs_rmap_block.
      If the returned address matches the originally allocated address, we can
      skip this rewriting process.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8e3fb10
    • Johannes Thumshirn's avatar
      btrfs: save irq flags when looking up an ordered extent · 24533f6a
      Johannes Thumshirn authored
      
      
      A following patch will add another caller of
      btrfs_lookup_ordered_extent(), but from a bio's endio context.
      
      btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
      disables interrupts. Change this to spin_lock_irqsave() so interrupts
      aren't disabled and re-enabled unconditionally.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24533f6a
    • Johannes Thumshirn's avatar
      btrfs: zoned: cache if block group is on a sequential zone · 08f45559
      Johannes Thumshirn authored
      
      
      On a zoned filesystem, cache if a block group is on a sequential write
      only zone.
      
      On sequential write only zones, we can use REQ_OP_ZONE_APPEND for
      writing data, therefore provide btrfs_use_zone_append() to figure out if
      IO is targeting a sequential write only zone and we can use
      REQ_OP_ZONE_APPEND for data writing.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08f45559
    • Naohiro Aota's avatar
      btrfs: extend btrfs_rmap_block for specifying a device · 138082f3
      Naohiro Aota authored
      
      
      btrfs_rmap_block currently reverse-maps the physical addresses on all
      devices to the corresponding logical addresses.
      
      Extend the function to match to a specified device. The old functionality
      of querying all devices is left intact by specifying NULL as target
      device.
      
      A block_device instead of a btrfs_device is passed into btrfs_rmap_block,
      as this function is intended to reverse-map the result of a bio, which
      only has a block_device.
      
      Also export the function for later use.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      138082f3
    • Johannes Thumshirn's avatar
      btrfs: zoned: check if bio spans across an ordered extent · cacb2cea
      Johannes Thumshirn authored
      
      
      To ensure that an ordered extent maps to a contiguous region on disk, we
      need to maintain a "one bio == one ordered extent" rule.
      
      Ensure that constructing bio does not span more than an ordered extent.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cacb2cea
    • Naohiro Aota's avatar
      btrfs: zoned: split ordered extent when bio is sent · d22002fd
      Naohiro Aota authored
      
      
      For a zone append write, the device decides the location the data is being
      written to. Therefore we cannot ensure that two bios are written
      consecutively on the device. In order to ensure that an ordered extent
      maps to a contiguous region on disk, we need to maintain a "one bio ==
      one ordered extent" rule.
      
      Implement splitting of an ordered extent and extent map on bio submission
      to adhere to the rule.
      
      extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
      corresponding ordered extent so that the ordered extent's region fits into
      one bio and the corresponding device limits.
      
      Several sanity checks need to be done in extract_ordered_extent() e.g.
      
      - We cannot split once end_bio'd ordered extent because we cannot divide
        ordered->bytes_left for the split ones
      - We do not expect a compressed ordered extent
      - We should not have checksum list because we omit the list splitting.
        Since the function is called before btrfs_wq_submit_bio() or
        btrfs_csum_one_bio(), this should be always ensured.
      
      We also need to split an extent map by creating a new one. If not,
      unpin_extent_cache() complains about the difference between the start of
      the extent map and the file's logical offset.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d22002fd
    • Naohiro Aota's avatar
      btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing · cfe94440
      Naohiro Aota authored
      
      
      Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
      devices.
      
      Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
      REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
      bio_op().
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cfe94440
    • Naohiro Aota's avatar
      btrfs: zoned: use bio_add_zone_append_page · e1326f03
      Naohiro Aota authored
      
      
      A zoned device has its own hardware restrictions e.g. max_zone_append_size
      when using REQ_OP_ZONE_APPEND. To follow these restrictions, use
      bio_add_zone_append_page() instead of bio_add_page(). We need target device
      to use bio_add_zone_append_page(), so this commit reads the chunk
      information to cache the target device to btrfs_io_bio(bio)->device.
      
      Caching only the target device is sufficient here as zoned filesystems
      only supports the single profile at the moment. Once more profiles will be
      supported btrfs_io_bio can hold an extent_map to be able to check for the
      restrictions of all devices the btrfs_bio will be mapped to.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1326f03
    • Naohiro Aota's avatar
      btrfs: factor out helper adding a page to bio · 953651eb
      Naohiro Aota authored
      
      
      Factor out adding a page to a bio from submit_extent_page().  The page
      is added only when bio_flags are the same, contiguous and the added page
      fits in the same stripe as pages in the bio.
      
      Condition checks are reordered to allow early return to avoid possibly
      heavy btrfs_bio_fits_in_stripe() calling.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      953651eb
    • Naohiro Aota's avatar
      btrfs: zoned: reset zones of unused block groups · dcba6e48
      Naohiro Aota authored
      
      
      We must reset the zones of a deleted unused block group to rewind the
      zones' write pointers to the zones' start.
      
      To do this, we can use the DISCARD_SYNC code to do the reset when the
      filesystem is running on zoned devices.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dcba6e48
    • Naohiro Aota's avatar
      btrfs: zoned: advance allocation pointer after tree log node · 011b41bf
      Naohiro Aota authored
      
      
      Since the allocation info of a tree log node is not recorded in the extent
      tree, calculate_alloc_pointer() cannot detect this node, so the pointer
      can be over a tree node.
      
      Replaying the log calls btrfs_remove_free_space() for each node in the
      log tree.
      
      So, advance the pointer after the node to not allocate over it.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      011b41bf
    • Naohiro Aota's avatar
      btrfs: zoned: redirty released extent buffers · d3575156
      Naohiro Aota authored
      
      
      Tree manipulating operations like merging nodes often release
      once-allocated tree nodes. Such nodes are cleaned so that pages in the
      node are not uselessly written out. On zoned volumes, however, such
      optimization blocks the following IOs as the cancellation of the write
      out of the freed blocks breaks the sequential write sequence expected by
      the device.
      
      Introduce a list of clean and unwritten extent buffers that have been
      released in a transaction. Redirty the buffers so that
      btree_write_cache_pages() can send proper bios to the devices.
      
      Besides it clears the entire content of the extent buffer not to confuse
      raw block scanners e.g. 'btrfs check'. By clearing the content,
      csum_dirty_buffer() complains about bytenr mismatch, so avoid the
      checking and checksum using newly introduced buffer flag
      EXTENT_BUFFER_NO_CHECK.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3575156
    • Naohiro Aota's avatar
      btrfs: zoned: implement sequential extent allocation · 2eda5708
      Naohiro Aota authored
      
      
      Implement a sequential extent allocator for zoned filesystems. This
      allocator only needs to check if there is enough space in the block group
      after the allocation pointer to satisfy the extent allocation request.
      Therefore the allocator never manages bitmaps or clusters. Also, add
      assertions to the corresponding functions.
      
      As zone append writing is used, it would be unnecessary to track the
      allocation offset, as the allocator only needs to check available space.
      But by tracking and returning the offset as an allocated region, we can
      skip modification of ordered extents and checksum information when there
      is no IO reordering.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2eda5708
    • Naohiro Aota's avatar
      btrfs: zoned: track unusable bytes for zones · 169e0da9
      Naohiro Aota authored
      
      
      In a zoned filesystem a once written then freed region is not usable
      until the underlying zone has been reset. So we need to distinguish such
      unusable space from usable free space.
      
      Therefore we need to introduce the "zone_unusable" field to the block
      group structure, and "bytes_zone_unusable" to the space_info structure
      to track the unusable space.
      
      Pinned bytes are always reclaimed to the unusable space. But, when an
      allocated region is returned before using e.g., the block group becomes
      read-only between allocation time and reservation time, we can safely
      return the region to the block group. For the situation, this commit
      introduces "btrfs_add_free_space_unused". This behaves the same as
      btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
      rewinds the allocation offset.
      
      Because the read-only bytes tracks free but unusable bytes when the block
      group is read-only, we need to migrate the zone_unusable bytes to
      read-only bytes when a block group is marked read-only.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      169e0da9
    • Naohiro Aota's avatar
      btrfs: zoned: calculate allocation offset for conventional zones · a94794d5
      Naohiro Aota authored
      
      
      Conventional zones do not have a write pointer, so we cannot use it to
      determine the allocation offset for sequential allocation if a block
      group contains a conventional zone.
      
      But instead, we can consider the end of the highest addressed extent in
      the block group for the allocation offset.
      
      For new block group, we cannot calculate the allocation offset by
      consulting the extent tree, because it can cause deadlock by taking
      extent buffer lock after chunk mutex, which is already taken in
      btrfs_make_block_group(). Since it is a new block group anyways, we can
      simply set the allocation offset to 0.
      
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a94794d5