Skip to content
  1. Oct 13, 2014
  2. Oct 03, 2014
    • Mark Tinguely's avatar
      xfs: xfs_iflush_done checks the wrong log item callback · 52177937
      Mark Tinguely authored
      Commit 30136832
      
       ("xfs: remove all the inodes on a buffer from the AIL
      in bulk") made the xfs inode flush callback more efficient by
      combining all the inode writes on the buffer and the deletions of
      the inode log item from AIL.
      
      The initial loop in this patch should be looping through all
      the log items on the buffer to see which items have
      xfs_iflush_done as their callback function. But currently,
      only the log item passed to the function has its callback
      compared to xfs_iflush_done. If the log item pointer passed to
      the function does have the xfs_iflush_done callback function,
      then all the log items on the buffer are removed from the
      li_bio_list on the buffer b_fspriv and could be removed from
      the AIL even though they may have not been written yet.
      
      This problem is masked by the fact that currently all inodes on a
      buffer will have the same calback function - either xfs_iflush_done
      or xfs_istale_done - and hence the bug cannot manifest in any way.
      Still, we need to remove the landmine so that if we add new
      callbacks in future this doesn't cause us problems.
      
      Signed-off-by: default avatarMark Tinguely <tinguely@sgi.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      52177937
  3. Oct 02, 2014
    • Brian Foster's avatar
      xfs: flush the range before zero range conversion · da5f1096
      Brian Foster authored
      
      
      XFS currently discards delalloc blocks within the target range of a
      zero range request. Unaligned start and end offsets are zeroed
      through the page cache and the internal, aligned blocks are
      converted to unwritten extents.
      
      If EOF is page aligned and covered by a delayed allocation extent.
      The inode size is not updated until I/O completion. If a zero range
      request discards a delalloc range that covers page aligned EOF as
      such, the inode size update never occurs. For example:
      
      $ rm -f /mnt/file
      $ xfs_io -fc "pwrite 0 64k" -c "zero 60k 4k" /mnt/file
      $ stat -c "%s" /mnt/file
      65536
      $ umount /mnt
      $ mount <dev> /mnt
      $ stat -c "%s" /mnt/file
      61440
      
      Update xfs_zero_file_space() to flush the range rather than discard
      delalloc blocks to ensure that inode size updates occur
      appropriately.
      
      [dchinner: Note that this is really a workaround to avoid the
      underlying problems. More work is needed (and ongoing) to fix those
      issues so this fix is being added as a temporary stop-gap measure. ]
      
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      da5f1096
    • Brian Foster's avatar
      xfs: restore buffer_head unwritten bit on ioend cancel · 07d08681
      Brian Foster authored
      
      
      xfs_vm_writepage() walks each buffer_head on the page, maps to the block
      on disk and attaches to a running ioend structure that represents the
      I/O submission. A new ioend is created when the type of I/O (unwritten,
      delayed allocation or overwrite) required for a particular buffer_head
      differs from the previous. If a buffer_head is a delalloc or unwritten
      buffer, the associated bits are cleared by xfs_map_at_offset() once the
      buffer_head is added to the ioend.
      
      The process of mapping each buffer_head occurs in xfs_map_blocks() and
      acquires the ilock in blocking or non-blocking mode, depending on the
      type of writeback in progress. If the lock cannot be acquired for
      non-blocking writeback, we cancel the ioend, redirty the page and
      return. Writeback will revisit the page at some later point.
      
      Note that we acquire the ilock for each buffer on the page. Therefore
      during non-blocking writeback, it is possible to add an unwritten buffer
      to the ioend, clear the unwritten state, fail to acquire the ilock when
      mapping a subsequent buffer and cancel the ioend. If this occurs, the
      unwritten status of the buffer sitting in the ioend has been lost. The
      page will eventually hit writeback again, but xfs_vm_writepage() submits
      overwrite I/O instead of unwritten I/O and does not perform unwritten
      extent conversion at I/O completion. This leads to data corruption
      because unwritten extents are treated as holes on reads and zeroes are
      returned instead of reading from disk.
      
      Modify xfs_cancel_ioend() to restore the buffer unwritten bit for ioends
      of type XFS_IO_UNWRITTEN. This ensures that unwritten extent conversion
      occurs once the page is eventually written back.
      
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      07d08681
    • Eric Sandeen's avatar
      xfs: check for null dquot in xfs_quota_calc_throttle() · 5cca3f61
      Eric Sandeen authored
      
      
      Coverity spotted this.
      
      Granted, we *just* checked xfs_inod_dquot() in the caller (by
      calling xfs_quota_need_throttle). However, this is the only place we
      don't check the return value but the check is cheap and future-proof
      so add it.
      
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      5cca3f61
    • Eric Sandeen's avatar
      xfs: fix crc field handling in xfs_sb_to/from_disk · 04dd1a0d
      Eric Sandeen authored
      
      
      I discovered this in userspace, but the same change applies
      to the kernel.
      
      If we xfs_mdrestore an image from a non-crc filesystem, lo
      and behold the restored image has gained a CRC:
      
      # db/xfs_metadump.sh -o /dev/sdc1 - | xfs_mdrestore - test.img
      # xfs_db -c "sb 0" -c "p crc" /dev/sdc1
      crc = 0 (correct)
      # xfs_db -c "sb 0" -c "p crc" test.img
      crc = 0xb6f8d6a0 (correct)
      
      This is because xfs_sb_from_disk doesn't fill in sb_crc,
      but xfs_sb_to_disk(XFS_SB_ALL_BITS) does write the in-memory
      CRC to disk - so we get uninitialized memory on disk.
      
      Fix this by always initializing sb_crc to 0 when we read
      the superblock, and masking out the CRC bit from ALL_BITS
      when we write it.
      
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      04dd1a0d
    • Eric Sandeen's avatar
      xfs: don't send null bp to xfs_trans_brelse() · 6ee49a20
      Eric Sandeen authored
      
      
      In this case, if bp is NULL, error is set, and we send a
      NULL bp to xfs_trans_brelse, which will try to dereference it.
      
      Test whether we actually have a buffer before we try to
      free it.
      
      Coverity spotted this.
      
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      6ee49a20
    • Brian Foster's avatar
      xfs: check for inode size overflow in xfs_new_eof() · ce57bcf6
      Brian Foster authored
      
      
      If we write to the maximum file offset (2^63-2), XFS fails to log the
      inode size update when the page is flushed. For example:
      
      $ xfs_io -fc "pwrite `echo "2^63-1-1" | bc` 1" /mnt/file
      wrote 1/1 bytes at offset 9223372036854775806
      1.000000 bytes, 1 ops; 0.0000 sec (22.711 KiB/sec and 23255.8140 ops/sec)
      $ stat -c %s /mnt/file
      9223372036854775807
      $ umount /mnt ; mount <dev> /mnt/
      $ stat -c %s /mnt/file
      0
      
      This occurs because XFS calculates the new file size as io_offset +
      io_size, I/O occurs in block sized requests, and the maximum supported
      file size is not block aligned. Therefore, a write to the max allowable
      offset on a 4k blocksize fs results in a write of size 4k to offset
      2^63-4096 (e.g., equivalent to round_down(2^63-1, 4096), or IOW the
      offset of the block that contains the max file size). The offset plus
      size calculation (2^63 - 4096 + 4096 == 2^63) overflows the signed
      64-bit variable which goes negative and causes the > comparison to the
      on-disk inode size to fail. This returns 0 from xfs_new_eof() and
      results in no change to the inode on-disk.
      
      Update xfs_new_eof() to explicitly detect overflow of the local
      calculation and use the VFS inode size in this scenario. The VFS inode
      size is capped to the maximum and thus XFS writes the correct inode size
      to disk.
      
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      ce57bcf6
    • Dave Chinner's avatar
      xfs: only set extent size hint when asked · a872703f
      Dave Chinner authored
      
      
      Currently the extent size hint is set unconditionally in
      xfs_ioctl_setattr() when the FSX_EXTSIZE flag is set. Hence we can
      set hints when the inode flags indicating the hint should be used
      are not set.  Hence only set the extent size hint from userspace
      when the inode has the XFS_DIFLAG_EXTSIZE flag set to indicate that
      we should have an extent size hint set on the inode.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      a872703f
    • Dave Chinner's avatar
      xfs: project id inheritance is a directory only flag · 9336e3a7
      Dave Chinner authored
      
      
      xfs_set_diflags() allows it to be set on non-directory inodes, and
      this flags errors in xfs_repair. Further, inode allocation allows
      the same directory-only flag to be inherited to non-directories.
      Make sure directory inode flags don't appear on other types of
      inodes.
      
      This fixes several xfstests scratch fileystem corruption reports
      (e.g. xfs/050) now that xfstests checks scratch filesystems after
      test completion.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      9336e3a7
    • Dave Chinner's avatar
      xfs: kill time.h · e076b0f3
      Dave Chinner authored
      
      
      The typedef for timespecs and nanotime() are completely unnecessary,
      and delay() can be moved to fs/xfs/linux.h, which means this file
      can go away.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e076b0f3
    • Dave Chinner's avatar
      xfs: compat_xfs_bstat does not have forkoff · b1d6cc02
      Dave Chinner authored
      
      
      struct compat_xfs_bstat is missing the di_forkoff field and so does
      not fully translate the structure correctly. Fix it.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      b1d6cc02
    • Dave Chinner's avatar
      75e58ce4
    • Christoph Hellwig's avatar
      xfs: simplify xfs_zero_remaining_bytes · 8c156125
      Christoph Hellwig authored
      
      
      xfs_zero_remaining_bytes() open codes a log of buffer manupulations
      to do a read forllowed by a write. It can simply be replaced by an
      uncached read followed by a xfs_bwrite() call.
      
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      8c156125
    • Dave Chinner's avatar
      xfs: check xfs_buf_read_uncached returns correctly · ba372674
      Dave Chinner authored
      
      
      xfs_buf_read_uncached() has two failure modes. If can either return
      NULL or bp->b_error != 0 depending on the type of failure, and not
      all callers check for both. Fix it so that xfs_buf_read_uncached()
      always returns the error status, and the buffer is returned as a
      function parameter. The buffer will only be returned on success.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      ba372674
    • Dave Chinner's avatar
      xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Dave Chinner authored
      
      
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO smeantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle on asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      595bff75
    • Dave Chinner's avatar
      xfs: kill xfs_bioerror_relse · 8b131973
      Dave Chinner authored
      
      
      There is only one caller now - xfs_trans_read_buf_map() - and it has
      very well defined call semantics - read, synchronous, and b_iodone
      is NULL. Hence it's pretty clear what error handling is necessary
      for this case. The bigger problem of untangling
      xfs_trans_read_buf_map error handling is left to a future patch.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      8b131973
    • Dave Chinner's avatar
      xfs: xfs_bioerror can die. · 27187754
      Dave Chinner authored
      
      
      Internal buffer write error handling is a mess due to the unnatural
      split between xfs_bioerror and xfs_bioerror_relse().
      
      xfs_bwrite() only does sync IO and determines the handler to
      call based on b_iodone, so for this caller the only difference
      between xfs_bioerror() and xfs_bioerror_release() is the XBF_DONE
      flag. We don't care what the XBF_DONE flag state is because we stale
      the buffer in both paths - the next buffer lookup will clear
      XBF_DONE because XBF_STALE is set. Hence we can use common
      error handling for xfs_bwrite().
      
      __xfs_buf_delwri_submit() is a similar - it's only ever called
      on writes - all sync or async - and again there's no reason to
      handle them any differently at all.
      
      Clean up the nasty error handling and remove xfs_bioerror().
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      27187754
    • Dave Chinner's avatar
      xfs: kill xfs_bdstrat_cb · 8dac3921
      Dave Chinner authored
      
      
      Only has two callers, and is just a shutdown check and error handler
      around xfs_buf_iorequest. However, the error handling is a mess of
      read and write semantics, and both internal callers only call it for
      writes. Hence kill the wrapper, and follow up with a patch to
      sanitise the error handling.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      8dac3921
    • Dave Chinner's avatar
      xfs: rework xfs_buf_bio_endio error handling · 61be9c52
      Dave Chinner authored
      
      
      Currently the report of a bio error from completion
      immediately marks the buffer with an error. The issue is that this
      is racy w.r.t. synchronous IO - the submitter can see b_error being
      set before the IO is complete, and hence we cannot differentiate
      between submission failures and completion failures.
      
      Add an internal b_io_error field protected by the b_lock to catch IO
      completion errors, and only propagate that to the buffer during
      final IO completion handling. Hence we can tell in xfs_buf_iorequest
      if we've had a submission failure bey checking bp->b_error before
      dropping our b_io_remaining reference - that reference will prevent
      b_io_error values from being propagated to b_error in the event that
      completion races with submission.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      61be9c52
    • Dave Chinner's avatar
      xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality · e8aaba9a
      Dave Chinner authored
      
      
      We do some work in xfs_buf_ioend, and some work in
      xfs_buf_iodone_work, but much of that functionality is the same.
      This work can all be done in a single function, leaving
      xfs_buf_iodone just a wrapper to determine if we should execute it
      by workqueue or directly. hence rename xfs_buf_iodone_work to
      xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
      need async processing.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e8aaba9a
    • Dave Chinner's avatar
      xfs: synchronous buffer IO needs a reference · e11bb805
      Dave Chinner authored
      
      
      When synchronous IO runs IO completion work, it does so without an
      IO reference or a hold reference on the buffer. The IO "hold
      reference" is owned by the submitter, and released when the
      submission is complete. The IO reference is released when both the
      submitter and the bio end_io processing is run, and so if the io
      completion work is run from IO completion context, it is run without
      an IO reference.
      
      Hence we can get the situation where the submitter can submit the
      IO, see an error on the buffer and unlock and free the buffer while
      there is still IO in progress. This leads to use-after-free and
      memory corruption.
      
      Fix this by taking a "sync IO hold" reference that is owned by the
      IO and not released until after the buffer completion calls are run
      to wake up synchronous waiters. This means that the buffer will not
      be freed in any circumstance until all IO processing is completed.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      e11bb805
    • Dave Chinner's avatar
      xfs: Don't use xfs_buf_iowait in the delwri buffer code · cf53e99d
      Dave Chinner authored
      
      
      For the special case of delwri buffer submission and waiting, we
      don't need to issue IO synchronously at all. The second pass to call
      xfs_buf_iowait() can be replaced with  blocking on xfs_buf_lock() -
      the buffer will be unlocked when the async IO is complete.
      
      This formalises a sane the method of waiting for async IO - take an
      extra reference, submit the IO, call xfs_buf_lock() when you want to
      wait for IO completion. i.e.:
      
      	bp = xfs_buf_find();
      	xfs_buf_hold(bp);
      	bp->b_flags |= XBF_ASYNC;
      	xfs_buf_iosubmit(bp);
      	xfs_buf_lock(bp)
      	error = bp->b_error;
      	....
      	xfs_buf_relse(bp);
      
      While this is somewhat racy for gathering IO errors, none of the
      code that calls xfs_buf_delwri_submit() will race against other
      users of the buffers being submitted. Even if they do, we don't
      really care if the error is detected by the delwri code or the user
      we raced against. Either way, the error will be detected and
      handled.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      cf53e99d
    • Dave Chinner's avatar
      xfs: force the log before shutting down · a870fe6d
      Dave Chinner authored
      
      
      When we have marked the filesystem for shutdown, we want to prevent
      any further buffer IO from being submitted. However, we currently
      force the log after marking the filesystem as shut down, hence
      allowing IO to the log *after* we have marked both the filesystem
      and the log as in an error state.
      
      Clean this up by forcing the log before we mark the filesytem with
      an error. This replaces the pure CIL flush that we currently have
      which works around this same issue (i.e the CIL can't be flushed
      once the shutdown flags are set) and hence enables us to clean up
      the logic substantially.
      
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      a870fe6d
  4. Sep 29, 2014
  5. Sep 23, 2014