  1. Jun 08, 2011
    • writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig authored
      
      
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contention to roughly 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB RAM
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
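As a quick check of the "1/4" claim, the ratios can be computed from the two summary rows above. This is plain arithmetic on figures copied from the table; the dictionary and function names are only illustrative.

```python
# Summary rows copied from the lock_stat output above.
vanilla = {"contentions": 44433, "waittime_total": 144127.35, "acquisitions": 886792}
patched = {"contentions": 11657, "waittime_total": 40429.51, "acquisitions": 527918}


def ratio(field):
    """Patched value as a fraction of the vanilla value."""
    return patched[field] / vanilla[field]


for field in vanilla:
    print(f"{field}: {ratio(field):.2f}")
```

Both the contention count and the total wait time land a little above one quarter of the vanilla figures.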
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
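Once each bdi_writeback has its own list_lock, moving an inode between two of them means holding two locks at once. The sketch below shows the general technique of taking two locks in a fixed order, plus one way to handle the case where both arguments are the same object. It is a Python illustration under assumed names (`Writeback`, `lock_two`, `unlock_all`), not the kernel's bdi_lock_two(), and how the kernel handles the "new same as old" case is not shown in the message above.

```python
import threading


class Writeback:
    """Hypothetical stand-in for a bdi_writeback with its own list lock."""

    def __init__(self, name):
        self.name = name
        self.list_lock = threading.Lock()


def lock_two(a, b):
    """Acquire both list locks in a fixed global order; return the locks taken."""
    if a is b:
        # Same object: take the lock once. Acquiring a non-recursive
        # lock a second time would deadlock.
        a.list_lock.acquire()
        return [a.list_lock]
    # Ordering by id() means every caller locks any given pair in the
    # same order, which rules out ABBA deadlocks.
    first, second = sorted((a, b), key=id)
    first.list_lock.acquire()
    second.list_lock.acquire()
    return [first.list_lock, second.list_lock]


def unlock_all(locks):
    """Release the locks in reverse order of acquisition."""
    for lock in reversed(locks):
        lock.release()
```

A caller would wrap the list manipulation between `held = lock_two(old, new)` and `unlock_all(held)`.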
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: refill b_io iff empty · 424b351f
      Wu Fengguang authored
      
      
      There is no point in carrying different refill policies for for_kupdate
      and other types of work. Use a consistent "refill b_io iff empty" policy,
      which guarantees fairness in a way that is easy to understand.
      
      A b_io refill sets up a _fixed_ work set containing all currently eligible
      inodes and starts a new walk through b_io. "Fixed" means that no new
      inodes are added to the work set during the walk. Only when a complete
      walk over b_io is done are the inodes eligible at that time enqueued and
      the walk started over.
      
      This procedure provides fairness among the inodes because it guarantees
      that each inode is synced once and only once in each round, so no inode
      is starved.
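The "refill b_io iff empty" policy can be illustrated with a toy queue model. This is a minimal sketch, not the kernel code: the `Flusher` class and its methods are hypothetical, and eligibility (expiry) is reduced to "present in b_dirty".

```python
from collections import deque


class Flusher:
    """Toy model: b_dirty holds eligible inodes, b_io is the current work set."""

    def __init__(self):
        self.b_dirty = deque()
        self.b_io = deque()
        self.refills = 0  # number of refill attempts made so far

    def mark_dirty(self, inode):
        self.b_dirty.append(inode)

    def sync_next(self):
        """Sync one inode; refill b_io from b_dirty only when b_io is empty."""
        if not self.b_io:
            self.b_io.extend(self.b_dirty)
            self.b_dirty.clear()
            self.refills += 1
        if not self.b_io:
            return None  # nothing eligible at all
        return self.b_io.popleft()
```

An inode dirtied in the middle of a walk waits in b_dirty until the current work set is exhausted, so every inode already queued is visited exactly once before any newcomer.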
      
      This change relies on wb_writeback() to keep retrying as long as we made
      some progress on cleaning pages and/or inodes. Without that ability,
      the old logic for background work relied on aggressively queuing all
      eligible inodes into b_io every time, and even that was no guarantee.
      
      The test script below completes slightly faster now:

                      2.6.39-rc3   2.6.39-rc3-dyn-expire+
      ----------------------------------------------------
      all elapsed        256.043                  252.367
      stddev              24.381                   12.530

      tar elapsed         30.097                   28.808
      dd  elapsed         13.214                   11.782
      
      	#!/bin/zsh
      
      	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
      
      	umount /dev/sda7
      	mkfs.xfs -f /dev/sda7
      	mount /dev/sda7 /fs
      
      	echo 3 > /proc/sys/vm/drop_caches
      
      	tic=$(cat /proc/uptime|cut -d' ' -f2)
      
      	cd /fs
      	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
      	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
      
      	wait
      	sync
      	tac=$(cat /proc/uptime|cut -d' ' -f2)
      	echo elapsed: $((tac - tic))
      
      It maintains roughly the same small vs. large file writeout shares, and
      offers large files better chances to be written in nice 4M chunks.
      
      Detailed analysis from Dave Chinner:
      
      Let's say we have lots of inodes with 100 dirty pages being created,
      and one large writeback going on. We expire 8 new inodes for every
      1024 pages we write back.
      
      With the old code, we do:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      
      	writeback  8 small inodes 800 pages
      		   1 large inode 224 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      	.....
      
      Your new code:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 8s)
      	writeback  8 small inodes 800 pages
      
      	b_io empty: (1800 pages written)
      		b_more_io (large inode) -> b_io (1l)
      		14 newly expired inodes -> b_io (1l, 14s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 14s)
      	writeback  10 small inodes 1000 pages
      		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
      	writeback  5 small inodes 500 pages
      	b_io empty: (2548 pages written)
      		b_more_io (large inode) -> b_io (1l, 1s(24))
      		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
      	......
      
      Rough progression of pages written at b_io refill:
      
      Old code:
      
      	total	large file	% of writeback
      	1024	224		21.9% (fixed)
      
      New code:
      	total	large file	% of writeback
      	1800	1024		~55%
      	2550	1024		~40%
      	3050	1024		~33%
      	3500	1024		~29%
      	3950	1024		~26%
      	4250	1024		~24%
      	4500	1024		~22.7%
      	4700	1024		~21.7%
      	4800	1024		~21.3%
      	4800	1024		~21.3%
      	(pretty much steady state from here)
      
      Ok, so the steady state is reached with a similar percentage of
      writeback to the large file as the existing code. Ok, that's good,
      but some evidence that it doesn't change the share of writeback
      to the large file should be in the commit message ;)
      
      The other advantage to this is that we always write 1024 page chunks
      to the large file, rather than smaller "whatever remains" chunks.
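The steady state in the analysis above can be checked with a small idealized model. It assumes, as in the analysis, 1024 pages per visit to the large inode, 100 pages per small inode, and 8 small inodes expiring per 1024 pages written. Unlike the hand-worked trace, it allows fractional inode counts and ignores partially written small inodes, so the early rounds differ slightly from the rough figures in the table. The function and constant names are illustrative.

```python
CHUNK = 1024          # pages written to the large inode per visit
SMALL_PAGES = 100     # dirty pages per small inode
EXPIRE_PER_CHUNK = 8  # small inodes expiring per 1024 pages written


def large_file_shares(rounds):
    """Share of each round's writeback that goes to the large inode."""
    small_inodes = float(EXPIRE_PER_CHUNK)  # work set of the first round
    shares = []
    for _ in range(rounds):
        total = CHUNK + SMALL_PAGES * small_inodes
        shares.append(CHUNK / total)
        # Inodes that expired while this round's pages were being written
        # form the small-inode part of the next round's work set.
        small_inodes = EXPIRE_PER_CHUNK * total / 1024
    return shares


if __name__ == "__main__":
    for share in large_file_shares(12):
        print(f"{share:.1%}")
```

Under these assumptions the share converges to 224/1024, about 21.9%, which is the fixed figure of the old code, while every visit to the large file still writes a full 1024-page chunk.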
      
      CC: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: the kupdate expire timestamp should be a moving target · ba9aa839
      Wu Fengguang authored
      
      
      Dynamically compute the dirty expire timestamp at queue_io() time.
      
      writeback_control.older_than_this used to be determined on entry to
      the kupdate writeback work. This _static_ timestamp may go stale if the
      kupdate work runs on and on. The flusher may then get stuck with some old
      busy inodes, never considering newly expired inodes thereafter.
      
      This has two possible problems:
      
      - It is unfair for a large dirty inode to delay (for a long time) the
        writeback of small dirty inodes.
      
      - As time goes by, the large and busy dirty inode may contain only
        _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
        delaying the expired dirty pages to the end of LRU lists, triggering
        the evil pageout(). Nevertheless this patch merely addresses part
        of the problem.
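The difference between a static and a moving expire cutoff can be sketched as follows. This is a toy model with hypothetical names: time is an integer, inodes never become clean (the "busy inode" case), and `queue_io` here only selects inodes by timestamp.

```python
def queue_io(dirtied_when, cutoff):
    """Return the inodes whose dirty timestamp is at or before the cutoff."""
    return sorted(i for i, t in dirtied_when.items() if t <= cutoff)


def kupdate_work(dirtied_when, start, passes, expire_interval, dynamic):
    """Run several queue_io() passes one time unit apart.

    Returns the list of inodes queued by each pass.
    """
    static_cutoff = start - expire_interval  # computed once, on entry
    queued = []
    for step in range(passes):
        now = start + step
        # Dynamic mode recomputes the cutoff at every queue_io() call.
        cutoff = now - expire_interval if dynamic else static_cutoff
        queued.append(queue_io(dirtied_when, cutoff))
    return queued
```

With `{"big": 0, "small": 5}`, `start=30` and `expire_interval=30`, the static variant only ever queues `big`, while the dynamic variant picks up `small` once it has aged past the interval.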
      
      v2: keep policy changes inside wb_writeback() and keep the
      wbc.older_than_this visibility as suggested by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: try more writeback as long as something was written · e6fb6da2
      Wu Fengguang authored
      
      
      writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive: they
      may populate only a subset of the eligible inodes into b_io on entry.
      When the queued set of inodes has been synced, they simply return,
      possibly with all queued inode pages written but wbc.nr_to_write
      still > 0.
      
      For kupdate and background writeback, there may be more eligible inodes
      sitting in b_dirty when the current set of b_io inodes are completed. So
      it is necessary to try another round of writeback as long as we made some
      progress in this round. When there are no more eligible inodes, no more
      inodes will be enqueued in queue_io(), hence nothing could/will be
      synced and we may safely bail.
      
      For example, imagine 100 inodes:

              i0, i1, i2, ..., i90, i91, ..., i99

      At queue_io() time, i90-i99 happen to be expired and are moved to b_io for
      IO. When finished successfully, if their total size is less than
      MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
      quit the background work (w/o this patch) while it's still over
      background threshold. This will be a fairly normal/frequent case I guess.
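The example above can be modelled in a few lines. This is a sketch under assumptions: the budget value, the batch size and the function signature are made up for illustration, and each queue_io() call is reduced to taking the next `batch` inodes from a list.

```python
MAX_WRITEBACK_PAGES = 1024  # per-round page budget (value assumed for the sketch)


def wb_writeback(dirty_pages, batch, retry_on_progress):
    """dirty_pages: dirty page count per eligible inode.

    Returns how many inodes were synced before the work quit.
    """
    pending = list(dirty_pages)
    synced = 0
    while pending:
        nr_to_write = MAX_WRITEBACK_PAGES
        b_io, pending = pending[:batch], pending[batch:]  # queue_io()
        for pages in b_io:
            nr_to_write -= pages
            synced += 1
        wrote = MAX_WRITEBACK_PAGES - nr_to_write
        if nr_to_write <= 0:
            continue  # budget used up, keep going
        if retry_on_progress and wrote > 0:
            continue  # something was written, try another round
        break  # budget left over and no reason to retry
    return synced
```

With 100 inodes of 50 pages each and a batch of 10, the variant without retry stops after the first batch, while the retrying variant works through all of them.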
      
      Now that we do tagged sync and update inode->dirtied_when after the sync,
      this change won't livelock sync(1).  I actually tried to write 1 page
      per 1ms with this command
      
      	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
      
      and do sync(1) at the same time. The sync completes quickly on ext4,
      xfs, btrfs.
      
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: introduce writeback_control.inodes_written · cb9bd115
      Wu Fengguang authored
      
      
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the inodes in a batch happen to be dirtied only in their metadata: in
      this case wbc->nr_to_write is not decreased at all, which means "no pages
      written" but is misinterpreted as "no progress".

      So introduce writeback_control.inodes_written to count the inodes that get
      cleaned from the VFS point of view. A non-zero value means some progress
      was made on writeback, in which case more writeback can be tried.
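A minimal sketch of the progress check follows. The `WritebackControl` dataclass mirrors only the two fields named in the message; `write_batch` and `made_progress` are hypothetical helpers, not kernel functions.

```python
from dataclasses import dataclass


@dataclass
class WritebackControl:
    nr_to_write: int         # remaining page budget
    inodes_written: int = 0  # inodes cleaned from the VFS point of view


def write_batch(wbc, batch):
    """batch: dirty page count per inode; 0 means metadata-only dirty."""
    for pages in batch:
        wbc.nr_to_write -= pages
        wbc.inodes_written += 1


def made_progress(wbc, budget):
    """Progress means pages were written or at least one inode was cleaned."""
    pages_written = budget - wbc.nr_to_write
    return pages_written > 0 or wbc.inodes_written > 0
```

For a batch of metadata-only inodes, `nr_to_write` is unchanged, yet `made_progress` still reports progress because `inodes_written` is non-zero.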
      
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb
      Wu Fengguang authored
      Explicitly update .dirtied_when on synced inodes, so that they are no
      longer considered for writeback in the next round.
      
      It can prevent both of the following livelock schemes:
      
      - while true; do echo data >> f; done
      - while true; do touch f;        done (in theory)
      
      The exact livelock condition is, during sync(1):
      
      (1) no new inodes are dirtied
      (2) an inode being actively dirtied
      
      On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
      When finished, it will be redirty_tail()ed because it's still dirty
      and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
      on condition (1). The sync work will then revisit it on the next
      queue_io() and find it eligible again because its old ->dirtied_when
      predates the sync work start time.
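The livelock condition can be reproduced in a toy model. Everything here is an assumption made for illustration: time is an integer, there is a single inode, the writer redirties it during every pass, and `max_passes` stands in for "forever".

```python
def sync_passes(update_dirtied_when, max_passes=50):
    """Count the queue_io() passes sync spends on one inode that is
    redirtied during every writeback. Returns max_passes if the inode
    is still eligible after that many passes."""
    sync_start = 100
    dirtied_when = 10  # inode first dirtied before sync started
    now = sync_start
    for passes in range(1, max_passes + 1):
        if dirtied_when > sync_start:
            return passes - 1  # no longer eligible: sync can finish
        now += 1  # write the inode; the writer redirties it meanwhile
        if update_dirtied_when:
            dirtied_when = now  # timestamp now postdates the sync start
    return max_passes
```

With the timestamp update the inode is written once and then drops out of the eligible set; without it, the old timestamp keeps it eligible on every pass.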
      
      We'll do more aggressive "keep writeback as long as we wrote something"
      logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
      b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
      no longer be enough to stop sync livelock.
      
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Wu Fengguang authored
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      this may be due to some "redirty_tail() after pages_skipped" effect, which
      is by no means a guarantee for _all_ file systems.
      
      Note that writeback_inodes_sb() is called by more than just sync(). All
      callers are treated the same because the others also need livelock
      prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
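The idea behind tagging can be sketched with a set of dirty page indexes. This is a toy model, not the page-cache implementation: the function name, the `redirty_per_write` writer model and the `limit` cap (a stand-in for an effectively unbounded nr_to_write) are all assumptions.

```python
def write_pages(dirty, tagged_writepages, redirty_per_write=1, limit=1000):
    """dirty: dirty page indexes of one inode. A concurrent writer dirties
    `redirty_per_write` new pages after each page written.
    Returns the number of pages written before the pass ends."""
    dirty = set(dirty)
    next_new = max(dirty, default=-1) + 1
    # Tagged mode: snapshot the pages that are dirty right now and write
    # only those. Untagged mode walks the live dirty set.
    todo = set(dirty) if tagged_writepages else dirty
    written = 0
    while todo and written < limit:
        page = min(todo)
        todo.discard(page)
        dirty.discard(page)
        written += 1
        for _ in range(redirty_per_write):
            dirty.add(next_new)  # writer extends the file
            next_new += 1
    return written
```

In tagged mode the pass ends after the pages that were dirty at the start, however fast the writer appends; in untagged mode it only stops at the cap.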
      
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  2. Jun 06, 2011
  3. Jun 05, 2011
  4. Jun 04, 2011