Skip to content
  1. Jun 08, 2011
    • Wu Fengguang's avatar
      writeback: the kupdate expire timestamp should be a moving target · ba9aa839
      Wu Fengguang authored
      
      
      Dynamically compute the dirty expire timestamp at queue_io() time.
      
      writeback_control.older_than_this used to be determined at entrance to
      the kupdate writeback work. This _static_ timestamp may go stale if the
      kupdate work runs on and on. The flusher may then stuck with some old
      busy inodes, never considering newly expired inodes thereafter.
      
      This has two possible problems:
      
      - It is unfair for a large dirty inode to delay (for a long time) the
        writeback of small dirty inodes.
      
      - As time goes by, the large and busy dirty inode may contain only
        _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
        delaying the expired dirty pages to the end of LRU lists, triggering
        the evil pageout(). Nevertheless this patch merely addresses part
        of the problem.
      
      v2: keep policy changes inside wb_writeback() and keep the
      wbc.older_than_this visibility as suggested by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarItaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      ba9aa839
    • Wu Fengguang's avatar
      writeback: try more writeback as long as something was written · e6fb6da2
      Wu Fengguang authored
      
      
      writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
      they only populate possibly a subset of eligible inodes into b_io at
      entrance time. When the queued set of inodes are all synced, they just
      return, possibly with all queued inode pages written but still
      wbc.nr_to_write > 0.
      
      For kupdate and background writeback, there may be more eligible inodes
      sitting in b_dirty when the current set of b_io inodes are completed. So
      it is necessary to try another round of writeback as long as we made some
      progress in this round. When there are no more eligible inodes, no more
      inodes will be enqueued in queue_io(), hence nothing could/will be
      synced and we may safely bail.
      
      For example, imagine 100 inodes
      
              i0, i1, i2, ..., i90, i91, i99
      
      At queue_io() time, i90-i99 happen to be expired and moved to s_io for
      IO. When finished successfully, if their total size is less than
      MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
      quit the background work (w/o this patch) while it's still over
      background threshold. This will be a fairly normal/frequent case I guess.
      
      Now that we do tagged sync and update inode->dirtied_when after the sync,
      this change won't livelock sync(1).  I actually tried to write 1 page
      per 1ms with this command
      
      	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
      
      and do sync(1) at the same time. The sync completes quickly on ext4,
      xfs, btrfs.
      
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e6fb6da2
    • Wu Fengguang's avatar
      writeback: introduce writeback_control.inodes_written · cb9bd115
      Wu Fengguang authored
      
      
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the batch of inodes happen to be metadata-only dirtied: in this case
      wbc->nr_to_write won't be decreased at all, which stands for "no pages
      written" but also mis-interpreted as "no progress".
      
      So introduce writeback_control.inodes_written to count the inodes get
      cleaned from VFS POV.  A non-zero value means there are some progress on
      writeback, in which case more writeback can be tried.
      
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      cb9bd115
    • Wu Fengguang's avatar
      writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb
      Wu Fengguang authored
      
      
      Explicitly update .dirtied_when on synced inodes, so that they are no
      longer considered for writeback in the next round.
      
      It can prevent both of the following livelock schemes:
      
      - while true; do echo data >> f; done
      - while true; do touch f;        done (in theory)
      
      The exact livelock condition is, during sync(1):
      
      (1) no new inodes are dirtied
      (2) an inode being actively dirtied
      
      On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
      When finished, it will be redirty_tail()ed because it's still dirty
      and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
      on condition (1). The sync work will then revisit it on the next
      queue_io() and find it eligible again because its old ->dirtied_when
      predates the sync work start time.
      
      We'll do more aggressive "keep writeback as long as we wrote something"
      logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
      b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
      no longer be enough to stop sync livelock.
      
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      94c3dcbb
    • Wu Fengguang's avatar
      writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Wu Fengguang authored
      
      
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      it may due to some "redirty_tail() after pages_skipped" effect which
      is by no means a guarantee for _all_ the file systems.
      
      Note that writeback_inodes_sb() is called by not only sync(), they are
      treated the same because the other callers also need livelock prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      
      Acked-by: default avatarJan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      6e6938b6
  2. Jun 06, 2011
  3. Jun 05, 2011
  4. Jun 04, 2011
  5. Jun 03, 2011