  1. Oct 14, 2009
    • Btrfs: always pin metadata in discard mode · 444528b3
      Chris Mason authored

      We have an optimization in btrfs to allow blocks to be
      immediately freed if they were allocated in this transaction and never
      written.  Otherwise they are pinned and freed when the transaction
      commits.

      This isn't optimal for discard mode because immediately freeing
      them means immediately discarding them.  It is better to give the
      block to the pinning code and let the (slow) discard happen later.

      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: enable discard support · 06348574
      Christoph Hellwig authored

      The discard support code in btrfs is currently guarded by ifdefs for
      BIO_RW_DISCARD, which is never defined as it's the name of an enum
      member.  Just remove the useless ifdefs to actually enable the code.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
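Why an #ifdef on an enumerator never fires can be shown with a small user-space C sketch. The enum and function names below are hypothetical stand-ins, not the kernel's definitions; the point is only that the preprocessor runs before the compiler sees enumerators, so the guarded branch is silently dropped.

```c
#include <assert.h>

/* Hypothetical stand-in for a flag enum; not the kernel's definition. */
enum demo_bio_rw_flags {
    DEMO_RW_BARRIER,
    DEMO_RW_DISCARD    /* an enumerator, not a preprocessor macro */
};

/* Returns 1 if the guarded code was compiled in, 0 otherwise. */
int demo_discard_guard_compiled_in(void)
{
#ifdef DEMO_RW_DISCARD
    /* Never compiled: the preprocessor knows nothing about enumerators. */
    return 1;
#else
    return 0;
#endif
}
```

Calling demo_discard_guard_compiled_in() returns 0 even though DEMO_RW_DISCARD is a perfectly valid identifier in the C code that follows.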
    • Btrfs: add -o discard option · e244a0ae
      Christoph Hellwig authored

      Enabling discard by default is not a good idea given the trim speed
      of SSD prototypes we've seen, and the characteristics of many high-end
      arrays.  Turn off discards by default and require the -o discard option
      to turn them on.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: properly wait log writers during log sync · 86df7eb9
      Yan, Zheng authored

      A recent fsync optimization makes btrfs_sync_log skip calling
      wait_for_writer in the single log writer case. This is incorrect
      since the writer count can also be increased by btrfs_pin_log.

      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix possible ENOSPC problems with truncate · 5d5e103a
      Josef Bacik authored

      There's a problem where we don't do any space reservation for truncates, which
      can cause an oops because you will be allowed to go off into the weeds a bit
      since we don't account for the delalloc bytes that are created as a result of
      the truncate.

      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix btrfs acl #ifdef checks · 0eda294d
      Chris Mason authored
      
      
      The btrfs acl code was #ifdefing for a define
      that didn't exist.  This correctly matches it
      to the values used by the Kconfig file.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: streamline tree-log btree block writeout · 690587d1
      Chris Mason authored
      
      
      Syncing the tree log is a 3 phase operation.
      
      1) write and wait for all the tree log blocks for a given root.
      
      2) write and wait for all the tree log blocks for the
      tree of tree log roots.
      
      3) write and wait for the super blocks (barriers here)
      
      This isn't as efficient as it could be because there is
      no requirement to wait for the blocks from step one to hit the disk
      before we start writing the blocks from step two.  This commit
      changes the sequence so that we don't start waiting until
      all the tree blocks from both steps one and two have been sent
      to disk.
      
      We do this by breaking up btrfs_write_wait_marked_extents into
      two functions, which is trivial because it was already broken
      up into two parts.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
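The reordering described in the commit above can be sketched in user-space C by recording the order of operations. Every name here is hypothetical and the stubs only log events; this illustrates "submit both sets of writes, then wait" under those assumptions and is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event labels; they only name the order of operations. */
enum demo_event {
    DEMO_WRITE_LOG_TREE,
    DEMO_WAIT_LOG_TREE,
    DEMO_WRITE_LOG_ROOTS,
    DEMO_WAIT_LOG_ROOTS,
    DEMO_WRITE_SUPER
};

struct demo_trace {
    enum demo_event ev[8];
    size_t n;
};

/* Append one event to the trace, ignoring overflow. */
void demo_record(struct demo_trace *t, enum demo_event e)
{
    if (t->n < sizeof(t->ev) / sizeof(t->ev[0]))
        t->ev[t->n++] = e;
}

/* Before: each phase writes and then waits before the next one starts. */
void demo_sync_log_before(struct demo_trace *t)
{
    demo_record(t, DEMO_WRITE_LOG_TREE);
    demo_record(t, DEMO_WAIT_LOG_TREE);
    demo_record(t, DEMO_WRITE_LOG_ROOTS);
    demo_record(t, DEMO_WAIT_LOG_ROOTS);
    demo_record(t, DEMO_WRITE_SUPER);
}

/* After: both sets of tree blocks are submitted before any waiting. */
void demo_sync_log_after(struct demo_trace *t)
{
    demo_record(t, DEMO_WRITE_LOG_TREE);
    demo_record(t, DEMO_WRITE_LOG_ROOTS);
    demo_record(t, DEMO_WAIT_LOG_TREE);
    demo_record(t, DEMO_WAIT_LOG_ROOTS);
    demo_record(t, DEMO_WRITE_SUPER);
}
```

In the "after" variant the device can work on both batches of blocks concurrently, and the super block write still happens only once both waits have completed.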
    • Btrfs: avoid tree log commit when there are no changes · 257c62e1
      Chris Mason authored
      
      
      rpm has a habit of running fdatasync when the file hasn't
      changed.  We already detect if a file hasn't been changed
      in the current transaction but it might have been sent to
      the tree-log in this transaction and not changed since
      the last call to fsync.
      
      In this case, we want to avoid a tree log sync, which includes
      a number of synchronous writes and barriers.  This commit
      extends the existing tracking of the last transaction to change
      a file to also track the last sub-transaction.
      
      The end result is that rpm -ivh and -Uvh are roughly twice as fast,
      and on par with ext3.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
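A rough sketch of the check described in the commit above, in user-space C. The struct, field, and function names are illustrative assumptions rather than the kernel's definitions; the idea is that an fsync can skip the log sync when the file has not changed since the last committed transaction, or when it was already logged in the current transaction and has not changed since the last log commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode tracking state. */
struct demo_inode {
    uint64_t last_trans;     /* last transaction that changed the file */
    uint64_t last_sub_trans; /* last sub-transaction that changed the file */
    uint64_t logged_trans;   /* transaction in which the file was last logged */
};

/*
 * Returns true when fsync has nothing new to push to the tree log.
 * last_log_commit is the sub-transaction covered by the last log sync.
 */
bool demo_fsync_can_skip_log(const struct demo_inode *inode,
                             uint64_t cur_transid,
                             uint64_t last_committed_transid,
                             uint64_t last_log_commit)
{
    /* Unchanged since the last full commit: already safely on disk. */
    if (inode->last_trans <= last_committed_transid)
        return true;

    /* Logged in this transaction and untouched since that log commit. */
    if (inode->logged_trans == cur_transid &&
        inode->last_sub_trans <= last_log_commit)
        return true;

    return false;
}
```

The second test is what the sub-transaction counter adds: without it, a file that was logged earlier in the same transaction would trigger another full log sync on every fsync.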
    • Btrfs: only write one super copy during fsync · 4722607d
      Chris Mason authored
      
      
      During a tree-log commit for fsync, we've been writing at least
      two copies of the super block and forcing them to disk.
      
      The other filesystems write only one, and this change brings us on
      par with them.  A full transaction commit will write all the super
      copies, so we still have redundant info written on a regular
      basis.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  2. Oct 09, 2009
  3. Oct 06, 2009
    • Btrfs: fix possible softlockup in the allocator · 1cdda9b8
      Josef Bacik authored
      
      
      As with the cluster allocation code, we can lock up the box in the normal
      allocation path.  This happens when we
      
      1) Start to cache a block group that is severely fragmented, but has a decent
      amount of free space.
      2) Start to commit a transaction
      3) Have the commit try and empty out some of the delalloc inodes with extents
      that are relatively large.
      
      The inodes will not be able to make the allocations because they will ask for
      allocations larger than a contiguous area in the free space cache.  So we will
      wait for more progress to be made on the block group, but since we're in a
      commit the caching kthread won't make any more progress and it already has
      enough free space that wait_block_group_cache_progress will just return.  So,
      if we wait and fail to make the allocation the next time around, just loop and
      go to the next block group.  This keeps us from getting stuck in a softlockup.
      Thanks,
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. Oct 05, 2009
    • Btrfs: fix deadlock on async thread startup · 61d92c32
      Chris Mason authored
      
      
      The btrfs async worker threads are used for a wide variety of things,
      including processing bio end_io functions.  This means that when
      the endio threads aren't running, the rest of the FS isn't
      able to do the final processing required to clear PageWriteback.
      
      The endio threads also try to exit as they become idle and
      start more as the work piles up.  The problem is that starting more
      threads means kthreadd may need to allocate ram, and that allocation
      may wait until the global number of writeback pages on the system is
      below a certain limit.
      
      The result of that throttling is that end IO threads wait on
      kthreadd, who is waiting on IO to end, which will never happen.
      
      This commit fixes the deadlock by handing off thread startup to a
      dedicated thread.  It also fixes a bug where the on-demand thread
      creation was creating far too many threads because it didn't take into
      account threads being started by other procs.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. Oct 02, 2009
  6. Oct 01, 2009
  7. Sep 30, 2009
  8. Sep 29, 2009
    • Btrfs: proper -ENOSPC handling · 9ed74f2d
      Josef Bacik authored
      
      
      At the start of a transaction we do a btrfs_reserve_metadata_space() and
      specify how many items we plan on modifying.  Then once we've done our
      modifications and such, just call btrfs_unreserve_metadata_space() for
      the same number of items we reserved.
      
      For keeping track of metadata needed for data I've had to add an extent_io op
      for when we merge extents.  This lets us track space properly when we are doing
      sequential writes, so we don't end up reserving way more metadata space than
      what we need.
      
      The only place where the metadata space accounting is not done is in the
      relocation code.  This is because Yan is going to be reworking that code in the
      near future, so running btrfs-vol -b could still possibly result in an
      ENOSPC-related panic.  This patch also turns off the metadata_ratio stuff in
      order to allow users to use their disk space more efficiently.
      
      This patch makes it so we track how much metadata we need for an inode's
      delayed allocation extents by tracking how many extents are currently
      waiting for allocation.  It introduces two new callbacks for the
      extent_io trees, merge_extent_hook and split_extent_hook.  These help
      us keep track of when we merge delalloc extents together and split them
      up.  Reservations are handled before any actual dirtying occurs,
      and then we unreserve after we dirty.

      btrfs_unreserve_metadata_for_delalloc() will make the appropriate
      unreservations as needed based on the number of reservations we
      currently have and the number of extents we currently have.  Doing the
      reservation outside of any of the actual dirtying lets us do
      things like filemap_flush() the inode to try and force delalloc to
      happen, or as a last resort actually start allocation on all delalloc
      inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
      test.
      
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
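The reserve-then-unreserve pattern from the commit above can be sketched as a tiny user-space model. The struct, the per-item byte cost, and the function names are made up for illustration and do not mirror the signatures of btrfs_reserve_metadata_space() or btrfs_unreserve_metadata_space().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical worst-case metadata cost of modifying one item. */
#define DEMO_BYTES_PER_ITEM 4096u

/* Hypothetical space counters, loosely modelled on a space info. */
struct demo_space {
    uint64_t total;    /* bytes available for metadata */
    uint64_t used;     /* bytes already allocated */
    uint64_t reserved; /* bytes promised to in-flight operations */
};

/* Reserve room for `items` modifications, or fail early with -ENOSPC. */
int demo_reserve_metadata(struct demo_space *s, unsigned int items)
{
    uint64_t bytes = (uint64_t)items * DEMO_BYTES_PER_ITEM;

    if (s->used + s->reserved + bytes > s->total)
        return -ENOSPC;
    s->reserved += bytes;
    return 0;
}

/* Give back a reservation made for the same number of items. */
void demo_unreserve_metadata(struct demo_space *s, unsigned int items)
{
    uint64_t bytes = (uint64_t)items * DEMO_BYTES_PER_ITEM;

    s->reserved = bytes > s->reserved ? 0 : s->reserved - bytes;
}
```

Failing at reservation time is what lets the caller back out cleanly or flush delalloc and retry, instead of discovering the shortage in the middle of a modification.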
  9. Sep 24, 2009
    • Btrfs: hash the btree inode during fill_super · c65ddb52
      Yan Zheng authored

      The snapshot deletion patches dropped this line, but the inode
      needs to be hashed.

      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: relocate file extents in clusters · 0257bb82
      Yan, Zheng authored

      The extent relocation code copies file extents one by one when
      relocating a data block group. This is inefficient if file
      extents are small. This patch makes the relocation code copy
      file extents in clusters, so we can make better use of
      read-ahead.

      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't rename file into dummy directory · f679a840
      Yan, Zheng authored

      A recent change enforces only one access point to each subvolume. The first
      directory entry (the one added when the subvolume/snapshot was created) is
      treated as the valid access point; all other subvolume links are linked to
      dummy empty directories. The dummy directories are temporary inodes that
      exist only in memory, so we cannot rename files into them.

      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: check size of inode backref before adding hardlink · a5719521
      Yan, Zheng authored

      For every hardlink in btrfs, there is a corresponding inode back
      reference. All inode back references for hardlinks in a given
      directory are stored in a single b-tree item. The size of a b-tree item
      is limited by the size of a b-tree leaf, so we can only create a limited
      number of hardlinks to a given file in a directory.

      The original code lacks this check and oopses if the number of
      hardlinks goes over the limit. This patch fixes the issue by adding
      a check to btrfs_link and btrfs_rename.

      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
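The kind of check described in the commit above might look like the following user-space sketch. The size constants, the function name, and the choice of EMLINK as the error are all assumptions for illustration; the real limit depends on the leaf size and on-disk structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical limits; the real values depend on the b-tree leaf size. */
#define DEMO_MAX_ITEM_SIZE   3900u
#define DEMO_REF_HEADER_SIZE 10u

/*
 * Check whether one more back reference (fixed header plus the link name)
 * still fits in the item that holds all back references for this
 * file/directory pair. Returns 0 if it fits, -EMLINK otherwise.
 */
int demo_check_backref_room(uint32_t cur_item_size, uint16_t name_len)
{
    uint32_t needed = DEMO_REF_HEADER_SIZE + (uint32_t)name_len;

    if (cur_item_size + needed > DEMO_MAX_ITEM_SIZE)
        return -EMLINK;
    return 0;
}
```

Running the check before the link or rename is attempted turns what used to be an oops into an ordinary error return to the caller.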
    • Btrfs: fix releasepage to avoid unlocking extents we haven't locked · 11ef160f
      Chris Mason authored

      During releasepage, we try to drop any extent_state structs for the
      byte offsets of the page we're releasing.  But the code was incorrectly
      telling clear_extent_bit to delete the state struct unconditionally.

      Normally this would be fine because we have the page locked, but other
      parts of btrfs will lock down an entire extent, the most common place
      being IO completion.

      releasepage was deleting the extent state without first locking the extent,
      which may result in removing a state struct that another process had
      locked down.  The fix here is to leave the NODATASUM and EXTENT_LOCKED
      bits alone in releasepage.

      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix test_range_bit for whole file extents · 46562cec
      Chris Mason authored

      If test_range_bit finds an extent that goes all the way to (u64)-1, it
      can incorrectly wrap the u64 instead of treating it like the end of
      the address space.

      This just adds a check for the highest possible offset so we don't wrap.

      Signed-off-by: Chris Mason <chris.mason@oracle.com>
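The wrap-around hazard described in the commit above is easy to reproduce in user-space C. The sketch below is not test_range_bit itself; it is a hypothetical range walker over sorted, inclusive extents that needs the same guard before computing end + 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent with inclusive bounds; arrays are sorted by start. */
struct demo_extent {
    uint64_t start;
    uint64_t end;
};

/* Returns true if every byte in [start, end] is covered by the extents. */
bool demo_range_covered(const struct demo_extent *ext, size_t n,
                        uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < n && start <= end; i++) {
        if (ext[i].end < start)
            continue;               /* entirely before the range */
        if (ext[i].start > start)
            return false;           /* hole before this extent */
        if (ext[i].end == UINT64_MAX)
            return true;            /* end + 1 would wrap to 0 */
        start = ext[i].end + 1;     /* continue after this extent */
    }
    return start > end;
}
```

Without the UINT64_MAX guard, an extent ending at the top of the address space would reset start to 0 and the function would report the range as not covered.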
    • Btrfs: fix errors handling cached state in set/clear_extent_bit · 42daec29
      Chris Mason authored
      
      
      Both set and clear_extent_bit allow passing a cached
      state struct to reduce rbtree search times.  clear_extent_bit
      was improperly bypassing some of the checks around making sure
      the extent state fields were correct for a given operation.
      
      The fix used here (from Yan Zheng) is to use the hit_next
      goto target instead of jumping all the way down to start clearing
      bits without making sure the cached state was exactly correct
      for the operation we were doing.
      
      This also fixes up the setting of the start variable for both
      ops in the case where we find an overlapping extent that
      begins before the range we want to change.  In both cases
      we were incorrectly going backwards from the original
      requested change.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  10. Sep 23, 2009
    • Btrfs: fix early enospc during balancing · 7ce618db
      Chris Mason authored
      
      
      We now do extra checks before a balance to make sure
      there is room for the balance to take place.  One of
      the checks was testing to see if we were trying to
      balance away the last block group of a given type.
      
      If there is no space available for new chunks, we
      should not try and balance away the last block group
      of a given type.  But, the code wasn't checking for
      available chunk space, and so it was exiting too soon.
      
      The fix here is to combine some of the checks and make
      sure we try to allocate new chunks when we're balancing
      the last block group.
      
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: deal with NULL space info · 33b4d47f
      Chris Mason authored

      After a balance it is briefly possible for the space info
      field in the inode to be NULL.  This adds some checks
      to make sure things properly deal with the NULL value.

      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. Sep 22, 2009
    • Btrfs: account for space used by the super mirrors · 1b2da372
      Josef Bacik authored

      As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
      space accounting for the space infos.  Currently we exclude the free space for
      the super mirrors, but the space they take up isn't accounted for in any of the
      counters.  This patch introduces bytes_super, which keeps track of the number
      of bytes used for a super mirror in the block group cache and space info.  This
      makes sure that our free space calculations will be completely accurate.

      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
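How a bytes_super counter feeds into a free-space calculation can be sketched as follows. Only bytes_super is named in the commit message; the struct layout, the other counters, and the function are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters, loosely modelled on a space info. */
struct demo_space_info {
    uint64_t total_bytes;
    uint64_t bytes_used;
    uint64_t bytes_pinned;
    uint64_t bytes_reserved;
    uint64_t bytes_super;   /* space taken up by super block mirrors */
};

/* Free space once every kind of claimed space has been subtracted. */
uint64_t demo_free_bytes(const struct demo_space_info *s)
{
    uint64_t claimed = s->bytes_used + s->bytes_pinned +
                       s->bytes_reserved + s->bytes_super;

    return claimed >= s->total_bytes ? 0 : s->total_bytes - claimed;
}
```

If the super mirror bytes were left out of the sum, the result would overstate the usable space by exactly that amount, which is the inaccuracy the commit sets out to remove.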