  1. Oct 09, 2009
    • Btrfs: async delalloc flushing under space pressure · e3ccfa98
      Josef Bacik authored
      
      
      This patch moves the delalloc flushing that occurs when we are under space
      pressure off to an async thread pool.  This helps since we only free up
      metadata space when we actually insert the extent item, which means it takes
      quite a while for space to be freed up if we wait on all ordered extents.
      However, if space is freed up due to inline extents being inserted, we can
      wake waiters up early, and they can finish their work.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e3ccfa98
    • Btrfs: release delalloc reservations on extent item insertion · 32c00aff
      Josef Bacik authored
      
      
      This patch fixes an issue with the delalloc metadata space reservation
      code.  We used to free the reservation as soon as we allocated the
      delalloc region.  The problem is that if we are not inserting an inline
      extent, we don't actually insert the extent item until after the ordered
      extent is written out.  This patch does three things:
      
      1) It moves the reservation clearing into the ordered extent code, so when
      we remove the ordered extent we remove the reservation.
      2) It adds an EXTENT_DO_ACCOUNTING flag that gets passed when we clear
      delalloc bits in the cases where we want to clear the metadata reservation
      along with the delalloc extent, i.e. when we insert an inline extent
      or invalidate the page.
      3) It adds another waitqueue to the space info so that when we start an
      fs-wide delalloc flush, anybody else who also hits that area will simply wait
      for the flush to finish and then try to make their allocation.
      
      This has been tested thoroughly to make sure we did not regress on
      performance.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      32c00aff
    • Btrfs: delay clearing EXTENT_DELALLOC for compressed extents · a3429ab7
      Chris Mason authored
      
      
      When compression is on, the cow_file_range code is farmed off to
      worker threads.  This allows us to do significant CPU work in parallel
      on SMP machines.
      
      But it is a delicate balance around when we clear flags and how.  In
      the past we cleared the delalloc flag immediately, which was safe
      because the pages stayed locked.
      
      But this is causing problems with the newest ENOSPC code, and with the
      recent extent state cleanups we can now clear the delalloc bit at the
      same time the uncompressed code does.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a3429ab7
    • Btrfs: cleanup extent_clear_unlock_delalloc flags · a791e35e
      Chris Mason authored
      
      
      extent_clear_unlock_delalloc has a growing set of ugly parameters
      that is very difficult to read and maintain.
      
      This switches to a flag field and well-named flag defines.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a791e35e
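As a rough illustration of the cleanup in a791e35e, the sketch below shows how one flag word can replace a row of boolean parameters. It is a Python model rather than kernel C, and the flag names are invented for the example; the commit message does not list the real defines.

```python
from enum import IntFlag


class ClearOp(IntFlag):
    # Illustrative names only; the real defines are not given in the commit text.
    UNLOCK_RANGE = 1 << 0
    CLEAR_DELALLOC = 1 << 1
    CLEAR_DIRTY = 1 << 2
    SET_WRITEBACK = 1 << 3


def requested_actions(op: ClearOp) -> list[str]:
    """Return the names of the actions encoded in one flag word."""
    return [flag.name for flag in ClearOp if flag in op]


# One readable argument instead of several positional 0/1 values.
op = ClearOp.UNLOCK_RANGE | ClearOp.CLEAR_DELALLOC
print(requested_actions(op))
```

A call site that passes named flags documents itself, which is the readability gain the commit message is after.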
  2. Oct 06, 2009
    • Btrfs: fix possible softlockup in the allocator · 1cdda9b8
      Josef Bacik authored
      
      
      Like the cluster allocation code, the normal allocation path can lock up
      the box.  This happens when we:
      
      1) Start to cache a block group that is severely fragmented, but has a decent
      amount of free space.
      2) Start to commit a transaction
      3) Have the commit try and empty out some of the delalloc inodes with extents
      that are relatively large.
      
      The inodes will not be able to make the allocations because they will ask for
      allocations larger than a contiguous area in the free space cache.  So we will
      wait for more progress to be made on the block group, but since we're in a
      commit the caching kthread won't make any more progress and it already has
      enough free space that wait_block_group_cache_progress will just return.  So,
      if we wait and fail to make the allocation the next time around, just loop and
      go to the next block group.  This keeps us from getting stuck in a softlockup.
      Thanks,
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1cdda9b8
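The retry rule from 1cdda9b8 (wait once for caching progress, then move on to the next block group instead of spinning) can be modelled roughly as follows. This is a Python sketch under simplifying assumptions: `BlockGroup`, `pick_block_group` and the `wait_for_caching` callback are hypothetical stand-ins, not the kernel's data structures.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    name: str
    free_chunks: list[int]       # sizes of the contiguous free ranges known so far
    fully_cached: bool = True


def pick_block_group(groups, num_bytes, wait_for_caching=lambda bg: None):
    """Return the name of the first group with a large enough free chunk."""
    for bg in groups:
        waited = False
        while True:
            if any(chunk >= num_bytes for chunk in bg.free_chunks):
                return bg.name
            if bg.fully_cached or waited:
                break                # already waited once: try the next group
            wait_for_caching(bg)     # may or may not reveal a larger chunk
            waited = True
    return None


frag = BlockGroup("fragmented", [4, 4, 4, 4], fully_cached=False)
good = BlockGroup("good", [64])
print(pick_block_group([frag, good], 16))
```

The `waited` flag is what bounds the loop: a fragmented group with plenty of total free space gets exactly one second chance.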
  3. Oct 05, 2009
    • Btrfs: fix deadlock on async thread startup · 61d92c32
      Chris Mason authored
      
      
      The btrfs async worker threads are used for a wide variety of things,
      including processing bio end_io functions.  This means that when
      the endio threads aren't running, the rest of the FS isn't
      able to do the final processing required to clear PageWriteback.
      
      The endio threads also try to exit as they become idle and
      start more as the work piles up.  The problem is that starting more
      threads means kthreadd may need to allocate RAM, and that allocation
      may wait until the global number of writeback pages on the system is
      below a certain limit.
      
      The result of that throttling is that the end_io threads wait on
      kthreadd, which is waiting on IO to end, which will never happen.
      
      This commit fixes the deadlock by handing off thread startup to a
      dedicated thread.  It also fixes a bug where the on-demand thread
      creation was creating far too many threads because it didn't take into
      account threads being started by other procs.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      61d92c32
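The second fix in 61d92c32, counting threads that are still being started so that on-demand creation does not overshoot, might look roughly like this. It is a Python model; `WorkerPool` and its field names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    max_workers: int
    running: int = 0    # threads that are already processing work
    starting: int = 0   # start requests handed to the starter, not yet running

    def request_start(self) -> bool:
        """Queue a start request unless running + starting already hits the cap."""
        if self.running + self.starting >= self.max_workers:
            return False
        self.starting += 1
        return True

    def worker_started(self) -> None:
        """Called by the (hypothetical) starter thread once the worker runs."""
        self.starting -= 1
        self.running += 1
```

Checking only `running` would let every caller queue another start while earlier requests are still in flight, which matches the "far too many threads" symptom described above.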
  4. Oct 02, 2009
  5. Oct 01, 2009
  6. Sep 30, 2009
  7. Sep 29, 2009
    • Btrfs: proper -ENOSPC handling · 9ed74f2d
      Josef Bacik authored
      
      
      At the start of a transaction we do a btrfs_reserve_metadata_space() and
      specify how many items we plan on modifying.  Then, once we've done our
      modifications, we call btrfs_unreserve_metadata_space() for
      the same number of items we reserved.
      
      For keeping track of metadata needed for data I've had to add an extent_io op
      for when we merge extents.  This lets us track space properly when we are doing
      sequential writes, so we don't end up reserving way more metadata space than
      what we need.
      
      The only place where the metadata space accounting is not done is in the
      relocation code.  This is because Yan is going to be reworking that code in the
      near future, so running btrfs-vol -b could still possibly result in an
      ENOSPC-related panic.  This patch also turns off the metadata_ratio stuff in
      order to allow users to more efficiently use their disk space.
      
      This patch makes it so we track how much metadata we need for an inode's
      delayed allocation extents by tracking how many extents are currently
      waiting for allocation.  It introduces two new callbacks for the
      extent_io trees, merge_extent_hook and split_extent_hook.  These help
      us keep track of when we merge delalloc extents together and split them
      up.  Reservations are made before any actual dirtying occurs,
      and then we unreserve after we dirty.
      
      btrfs_unreserve_metadata_for_delalloc() will make the appropriate
      unreservations as needed based on the number of reservations we
      currently have and the number of extents we currently have.  Doing the
      reservation outside of the actual dirtying lets us do
      things like filemap_flush() the inode to try and force delalloc to
      happen, or as a last resort actually start allocation on all delalloc
      inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
      test.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9ed74f2d
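The delalloc accounting described in 9ed74f2d can be modelled with two counters. This is a Python sketch, not the kernel code: only merge_extent_hook and split_extent_hook are names from the commit message, while `DelallocInode`, its fields and the other method names are assumptions.

```python
class DelallocInode:
    def __init__(self):
        self.outstanding_extents = 0   # delalloc extents waiting for allocation
        self.reserved_extents = 0      # metadata reservations currently held

    def reserve_metadata(self):
        """Taken before the pages are dirtied."""
        self.reserved_extents += 1

    def set_delalloc_extent(self):
        self.outstanding_extents += 1

    def merge_extent_hook(self):
        """Two adjacent delalloc extents became one."""
        self.outstanding_extents -= 1

    def split_extent_hook(self):
        """One delalloc extent was split in two."""
        self.outstanding_extents += 1

    def unreserve_metadata(self):
        """Release reservations no longer backed by an outstanding extent."""
        excess = max(self.reserved_extents - self.outstanding_extents, 0)
        self.reserved_extents -= excess
        return excess


inode = DelallocInode()
for _ in range(2):            # two sequential writes to adjacent ranges
    inode.reserve_metadata()
    inode.set_delalloc_extent()
inode.merge_extent_hook()     # the ranges merge into a single extent
print(inode.unreserve_metadata())
```

After the merge only one extent item will ever be inserted, so one of the two reservations can be handed back; this is the sequential-write case the message mentions.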
  8. Sep 24, 2009
    • Btrfs: hash the btree inode during fill_super · c65ddb52
      Yan Zheng authored
      
      
      The snapshot deletion patches dropped this line, but the inode
      needs to be hashed.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      c65ddb52
    • Btrfs: relocate file extents in clusters · 0257bb82
      Yan, Zheng authored
      
      
      The extent relocation code copies file extents one by one when
      relocating a data block group. This is inefficient if file
      extents are small. This patch makes the relocation code copy
      file extents in clusters, so we can make better use of
      read-ahead.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      0257bb82
    • Btrfs: don't rename file into dummy directory · f679a840
      Yan, Zheng authored
      
      
      A recent change enforces only one access point to each subvolume. The first
      directory entry (the one added when the subvolume/snapshot was created) is
      treated as the valid access point; all other subvolume links are linked to
      dummy empty directories. The dummy directories are temporary inodes that exist
      only in memory, so we cannot rename files into them.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f679a840
    • Btrfs: check size of inode backref before adding hardlink · a5719521
      Yan, Zheng authored
      
      
      For every hardlink in btrfs, there is a corresponding inode back
      reference. All inode back references for hardlinks in a given
      directory are stored in a single b-tree item. The size of a b-tree item
      is limited by the size of a b-tree leaf, so we can only create a limited
      number of hardlinks to a given file in a directory.
      
      The original code lacks this check and oopses if the number of
      hardlinks goes over the limit. This patch fixes the issue by adding
      a check to btrfs_link and btrfs_rename.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      a5719521
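The size check from a5719521 amounts to asking whether one more back reference still fits in the item. The Python sketch below models that arithmetic only; the per-reference header size, the `max_item_size` parameter and the choice of EMLINK as the error are assumptions for illustration, not values taken from the commit message.

```python
import errno

REF_HEADER_SIZE = 10  # assumed fixed bytes stored in front of each name


def check_link_fits(existing_names, new_name, max_item_size):
    """Return 0 if a back reference for new_name fits, else a negative errno."""
    used = sum(REF_HEADER_SIZE + len(name) for name in existing_names)
    needed = REF_HEADER_SIZE + len(new_name)
    if used + needed > max_item_size:
        return -errno.EMLINK   # refuse the link instead of overflowing the item
    return 0


names = [b"a" * 20]                      # one existing 20-byte name: 30 bytes used
print(check_link_fits(names, b"b" * 20, max_item_size=60))
```

Because names are variable length, the limit is a byte budget rather than a fixed link count: short names allow more hardlinks per directory than long ones.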
    • Btrfs: fix releasepage to avoid unlocking extents we haven't locked · 11ef160f
      Chris Mason authored
      
      
      During releasepage, we try to drop any extent_state structs for the
      byte offsets of the page we're releasing.  But the code was incorrectly
      telling clear_extent_bit to delete the state struct unconditionally.
      
      Normally this would be fine because we have the page locked, but other
      parts of btrfs will lock down an entire extent, the most common place
      being IO completion.
      
      releasepage was deleting the extent state without first locking the extent,
      which may result in removing a state struct that another process had
      locked down.  The fix here is to leave the NODATASUM and EXTENT_LOCKED
      bits alone in releasepage.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      11ef160f
    • Btrfs: Fix test_range_bit for whole file extents · 46562cec
      Chris Mason authored
      
      
      If test_range_bit finds an extent that goes all the way to (u64)-1, it
      can incorrectly wrap the u64 instead of treating it like the end of
      the address space.
      
      This just adds a check for the highest possible offset so we don't wrap.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      46562cec
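The wraparound in 46562cec can be reproduced in a small model. Python integers do not overflow, so the sketch masks to 64 bits to imitate a u64; `range_fully_set` and the `check_top` switch are hypothetical and only loosely follow the shape of a range walk, not the kernel function itself.

```python
U64_MAX = (1 << 64) - 1


def range_fully_set(states, start, end, check_top=True):
    """states: sorted (first, last, has_bit) tuples with inclusive bounds."""
    for first, last, has_bit in states:
        if last < start:
            continue
        if first > start or not has_bit:
            return False                   # hole or missing bit inside the range
        if check_top and last == U64_MAX:
            return True                    # the guard: nothing lies beyond this
        start = (last + 1) & U64_MAX       # u64 arithmetic: wraps to 0 at the top
        if start > end:
            return True
    return False


whole_file = [(0, U64_MAX, True)]
print(range_fully_set(whole_file, 0, U64_MAX))                   # guard in place
print(range_fully_set(whole_file, 0, U64_MAX, check_top=False))  # wraps, wrong answer
```

Without the guard, `last + 1` becomes 0, the `start > end` test never fires, and a fully covered range is reported as not covered.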
    • Btrfs: fix errors handling cached state in set/clear_extent_bit · 42daec29
      Chris Mason authored
      
      
      Both set and clear_extent_bit allow passing a cached
      state struct to reduce rbtree search times.  clear_extent_bit
      was improperly bypassing some of the checks around making sure
      the extent state fields were correct for a given operation.
      
      The fix used here (from Yan Zheng) is to use the hit_next
      goto target instead of jumping all the way down to start clearing
      bits without making sure the cached state was exactly correct
      for the operation we were doing.
      
      This also fixes up the setting of the start variable for both
      ops in the case where we find an overlapping extent that
      begins before the range we want to change.  In both cases
      we were incorrectly going backwards from the original
      requested change.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      42daec29
  9. Sep 23, 2009
    • Btrfs: fix early enospc during balancing · 7ce618db
      Chris Mason authored
      
      
      We now do extra checks before a balance to make sure
      there is room for the balance to take place.  One of
      the checks was testing to see if we were trying to
      balance away the last block group of a given type.
      
      If there is no space available for new chunks, we
      should not try and balance away the last block group
      of a given type.  But the code wasn't checking for
      available chunk space, and so it was exiting too soon.
      
      The fix here is to combine some of the checks and make
      sure we try to allocate new chunks when we're balancing
      the last block group.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      7ce618db
    • Btrfs: deal with NULL space info · 33b4d47f
      Chris Mason authored
      
      
      After a balance it is briefly possible for the space info
      field in the inode to be NULL.  This adds some checks
      to make sure things properly deal with the NULL value.
      
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      33b4d47f
  10. Sep 22, 2009
    • Btrfs: account for space used by the super mirrors · 1b2da372
      Josef Bacik authored
      
      
      As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
      space accounting for the space infos.  Currently we exclude the free space for
      the super mirrors, but the space they take up isn't accounted for in any of the
      counters.  This patch introduces bytes_super, which keeps track of the number
      of bytes used for super mirrors in the block group cache and space info.  This
      makes sure that our free space calculations will be completely accurate.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1b2da372
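The effect of the bytes_super counter from 1b2da372 can be seen in a toy free-space calculation. In this Python sketch only bytes_super is a name from the commit message; the other counters and the exact formula are assumptions chosen to show why an unaccounted super mirror inflates the result.

```python
from dataclasses import dataclass


@dataclass
class SpaceInfo:
    total_bytes: int
    bytes_used: int = 0
    bytes_pinned: int = 0
    bytes_reserved: int = 0
    bytes_super: int = 0   # space occupied by super mirrors

    def free_bytes(self) -> int:
        """Space still available once every counter is subtracted."""
        return (self.total_bytes - self.bytes_used - self.bytes_pinned
                - self.bytes_reserved - self.bytes_super)


# Hypothetical numbers: a 1 MiB group containing one 64 KiB super mirror.
info = SpaceInfo(total_bytes=1 << 20, bytes_used=512 << 10, bytes_super=64 << 10)
print(info.free_bytes())
```

Leaving `bytes_super` at zero here would overstate the free space by 64 KiB, which is the kind of drift that matters once allocations are gated on these counters.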
    • Btrfs: fix extent entry threshold calculation · 25891f79
      Josef Bacik authored
      
      
      There is a slight problem with the extent entry threshold calculation for the
      free space cache.  We only adjust the threshold down as we add bitmaps, but
      never adjust it back up as we free bitmaps.  This means we could
      fragment the free space so badly that we end up using only bitmaps to describe
      the free space, then use all the free space, which would result in the bitmaps
      being freed, but then go to add free space again as we delete things and
      immediately add bitmaps since the extent threshold would still be 0.  Now as we
      free bitmaps the extent threshold will be ratcheted up to allow more extent
      entries to be added.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      25891f79
    • Btrfs: remove dead code · f61408b8
      Josef Bacik authored
      
      
      This patch removes a bunch of dead code from the snapshot removal stuff.  It
      was confusing me when doing the metadata ENOSPC stuff so I killed it.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f61408b8
    • Btrfs: fix bitmap size tracking · f019f426
      Josef Bacik authored
      
      
      When we first go to add free space, we allocate a new info and set the offset
      and bytes to the space we are adding.  This is fine, except we actually set the
      size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd
      end up counting the same space twice.  This isn't a huge deal, it just makes
      the allocator behave weirdly since it will think that a bitmap entry has more
      space than it ends up actually having.  I used a BUG_ON() to catch when this
      problem happened, and with this patch I no longer get the BUG_ON().
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f019f426
    • Btrfs: don't keep retrying a block group if we fail to allocate a cluster · 0a24325e
      Josef Bacik authored
      
      
      The box can get locked up in the allocator if we happen upon a block group
      under these conditions:
      
      1) During a commit, so caching threads cannot make progress
      2) Our block group currently is in the middle of being cached
      3) Our block group currently has plenty of free space in it
      4) Our block group is so fragmented that it ends up having no free space chunks
      larger than min_bytes calculated by btrfs_find_space_cluster.
      
      What happens is we try btrfs_find_space_cluster, which fails because it
      is unable to find enough free space chunks that are larger than min_bytes and
      are close enough together.  Since the block group is not cached we do a
      wait_block_group_cache_progress, which waits for the number of bytes we need,
      except the block group already has _plenty_ of free space, it's just severely
      fragmented, so we loop and try again, ad infinitum.  This patch keeps us from
      waiting on the block group to finish caching if we failed to find a free space
      cluster before.  It also makes sure that we don't even try to find a free space
      cluster if we are on our last loop in the allocator, since we will have tried
      everything at this point and it is futile.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      0a24325e
    • Btrfs: make balance code choose more wisely when relocating · ba1bf481
      Josef Bacik authored
      
      
      Currently, we can panic the box if the first block group we go to move is of a
      type where there is no space left to move those extents.  For example, if we
      fill the disk up with data, and then we try to balance and we have no room to
      move the data nor room to allocate new chunks, we will panic.  Change this by
      checking to see if we have room to move this chunk around, and if not, return
      -ENOSPC and move on to the next chunk.  This will make sure we remove block
      groups that are moveable, like if we have a lot of empty metadata block groups,
      and then that way we make room to be able to balance our data chunks as well.
      Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
      panics with this patch.
      
      V1->V2:
      -actually search for a free extent on the device to make sure we can allocate a
      chunk if need be.
      
      -fix btrfs_shrink_device to make sure we actually try to relocate all the
      chunks, and then, if we can't, return -ENOSPC so if we are doing a btrfs-vol -r
      we don't remove the device with data still on it.
      
      -check to make sure the block group we are going to relocate isn't the last one
      in that particular space
      
      -fix a bug in btrfs_shrink_device where we would change the device's size and
      not fix it if we fail to do our relocate
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      ba1bf481
    • Btrfs: fix arithmetic error in clone ioctl · 1fb58a60
      Sage Weil authored
      
      
      Fix an arithmetic error that was breaking extents cloned via the clone
      ioctl starting in the second half of a file.
      
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1fb58a60
    • Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c
      Yan, Zheng authored
      
      
      This patch adds a snapshot/subvolume destroy ioctl.  A subvolume that isn't being
      used and doesn't contain links to other subvolumes can be destroyed.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      76dda93c
    • Btrfs: change how subvolumes are organized · 4df27c4d
      Yan, Zheng authored
      
      
      btrfs allows subvolumes and snapshots anywhere in the directory tree.
      If we snapshot a subvolume that contains a link to another subvolume
      called subvolA, subvolA can be accessed through both the original
      subvolume and the snapshot. This is similar to creating a hard link to
      a directory, and has very similar problems.
      
      The aim of this patch is to enforce that there is only one access point to
      each subvolume. Only the first directory entry (the one added when
      the subvolume/snapshot was created) is treated as a valid access point.
      The first directory entry is distinguished by checking the root forward
      reference. If the corresponding root forward reference is missing,
      we know the entry is not the first one.
      
      This patch also adds snapshot/subvolume rename support; the code
      allows renaming a subvolume link across subvolumes.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4df27c4d
    • Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8
      Yan, Zheng authored
      
      
      The new back reference format does not allow reusing the objectid of a
      deleted snapshot/subvol. So we use ++highest_objectid to allocate the
      objectid for a new snapshot/subvol.
      
      Now we use ++highest_objectid to allocate objectids for both new inodes
      and new snapshots/subvolumes, so this patch removes the 'find hole' code in
      btrfs_find_free_objectid.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      13a8a7c8
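The allocation rule in 13a8a7c8 is a plain monotonic counter. The Python sketch below models it; `Root`, `find_free_objectid` and `delete` are hypothetical names (the model ignores any upper bound on objectids), and only the ++highest_objectid behaviour comes from the commit message.

```python
class Root:
    def __init__(self, highest_objectid: int):
        self.highest_objectid = highest_objectid
        self.live = set()

    def find_free_objectid(self) -> int:
        """Always hand out highest + 1; never search for holes."""
        self.highest_objectid += 1
        self.live.add(self.highest_objectid)
        return self.highest_objectid

    def delete(self, objectid: int) -> None:
        """Deleting leaves a hole that is intentionally never reused."""
        self.live.discard(objectid)


root = Root(highest_objectid=256)
first = root.find_free_objectid()
second = root.find_free_objectid()
root.delete(first)
third = root.find_free_objectid()
print(first, second, third)
```

Since a deleted id is never handed out again, a stale back reference can never be mistaken for one that belongs to a newer snapshot or subvolume.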
    • Btrfs: speed up snapshot dropping · 1c4850e2
      Yan, Zheng authored
      
      
      This patch contains two changes to avoid unnecessary tree block reads during
      snapshot dropping.
      
      First, check the tree block's reference count and flags before reading the tree
      block. If the reference count is > 1 and there is no need to update backrefs,
      we can avoid reading the tree block.
      
      Second, save when the snapshot was created in root_key.offset. We can compare
      the block pointer's generation with the snapshot's creation generation while
      updating backrefs. If a given block was created before the snapshot was
      created, the snapshot can't be the tree block's owner, so we can avoid reading
      the block.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1c4850e2
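The two shortcuts in 1c4850e2 reduce to two small predicates. This Python sketch uses hypothetical function and parameter names, and treating "created before" as a strictly smaller generation is an assumption; the commit message does not say how the boundary generation is handled.

```python
def skip_before_read(refs: int, need_backref_update: bool) -> bool:
    """First check: a shared block whose backrefs need no update is not read."""
    return refs > 1 and not need_backref_update


def skip_during_backref_update(block_generation: int,
                               snapshot_created_generation: int) -> bool:
    """Second check: a block older than the snapshot cannot be owned by it."""
    return block_generation < snapshot_created_generation


print(skip_before_read(refs=3, need_backref_update=False))
print(skip_during_backref_update(block_generation=10,
                                 snapshot_created_generation=42))
```

Both checks rely only on information available from the parent's block pointer and the extent tree, which is why they can be made before paying for the read.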
  11. Sep 19, 2009
    • Btrfs: search for an allocation hint while filling file COW · b917b7c3
      Chris Mason authored
      
      
      The allocator has some nice knobs for sending hints about where
      to try and allocate new blocks, but when we're doing file allocations
      we're not sending any hint at all.
      
      This commit adds a simple extent map search to see if we can
      quickly and easily find a hint for the allocator.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b917b7c3
    • Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c
      Chris Mason authored
      
      
      When btrfs fills a delayed allocation, it tries to increase
      the wbc nr_to_write to cover a big part of the allocation.  The
      theory is that we're doing contiguous IO and writing a few
      more blocks will save seeks overall at a very low cost.
      
      The problem is that extent_write_cache_pages could ignore
      the new higher nr_to_write if nr_to_write had already gone
      down to zero.  We fix that by rechecking the nr_to_write
      for every page that is processed in the pagevec.
      
      This updates the math around bumping the nr_to_write value
      to make sure we don't leave a tiny amount of IO hanging
      around for the very end of a new extent.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f85d7d6c
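The per-page recheck from f85d7d6c can be modelled with a short loop. This is a Python sketch, not the kernel's extent_write_cache_pages: `write_pages`, the `writepage` callback and the `Wbc` class are hypothetical, and only the idea of re-reading nr_to_write for every page comes from the commit message.

```python
class Wbc:
    """Minimal stand-in for a writeback control structure."""
    def __init__(self, nr_to_write: int):
        self.nr_to_write = nr_to_write


def write_pages(pages, wbc, writepage):
    written = []
    for page in pages:
        if wbc.nr_to_write <= 0:   # rechecked for every page, not once per batch
            break
        writepage(page, wbc)       # the callback may raise wbc.nr_to_write
        wbc.nr_to_write -= 1
        written.append(page)
    return written


def fill_delalloc(page, wbc):
    # Hypothetical policy: the first page starts an extent and bumps the budget.
    if page == 0:
        wbc.nr_to_write += 3


wbc = Wbc(nr_to_write=1)
print(write_pages(list(range(10)), wbc, fill_delalloc))
```

A loop that latched a "done" decision from the original budget of one page would stop after page 0 and leave the rest of the contiguous extent for a later pass.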
  12. Sep 18, 2009
    • Btrfs: improve async block group caching · 11833d66
      Yan Zheng authored
      
      
      This patch gets rid of two limitations of async block group caching.
      The old code delays handling pinned extents while a block group is being
      cached. To allocate logged file extents, the old code needs to wait
      until the block group is fully cached. To get rid of these limitations,
      this patch introduces a data structure to track the progress of
      caching. Based on the caching progress, we know which extents should
      be added to the free space cache when handling the pinned extents.
      The logged file extents are handled in a similar way.
      
      This patch also changes how pinned extents are tracked. The old
      code uses one tree to track pinned extents, and copies the pinned
      extents tree at transaction commit time. This patch makes it use
      two trees to track pinned extents: one tree for extents that are
      pinned in the running transaction, and one tree for extents that can
      be unpinned. At transaction commit time, we swap the two trees.
      
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      11833d66
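The two-tree scheme for pinned extents in 11833d66 can be sketched as a swap of roles at commit time. In this Python model plain sets stand in for the trees, and `PinnedExtents` with its method names is hypothetical; it also assumes the unpinnable tree has been drained before the next commit.

```python
class PinnedExtents:
    def __init__(self):
        self.trees = [set(), set()]
        self.current = 0   # index of the tree used by the running transaction

    def pin(self, start: int, length: int) -> None:
        self.trees[self.current].add((start, length))

    def commit(self) -> set:
        """Swap roles and return the tree whose extents may now be unpinned."""
        committed = self.trees[self.current]
        self.current ^= 1          # the other tree serves the new transaction
        return committed

    def finish_unpin(self, tree: set) -> list:
        """Drain a committed tree; the caller would free these extents."""
        freed = sorted(tree)
        tree.clear()
        return freed


pinned = PinnedExtents()
pinned.pin(0, 4096)
to_unpin = pinned.commit()
pinned.pin(8192, 4096)             # lands in the other tree
print(pinned.finish_unpin(to_unpin))
```

Swapping an index is constant time, whereas the old approach of copying the whole tree at commit grows with the number of pinned extents.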
  13. Sep 16, 2009