  1. Jan 14, 2011
    • mm: add FOLL_MLOCK follow_page flag. · 110d74a9
      Michel Lespinasse authored
      
      
      Move the code to mlock pages from __mlock_vma_pages_range() to
      follow_page().
      
      This allows __mlock_vma_pages_range() to not have to break down work into
      16-page batches.
      
      An additional motivation for doing this within the present patch series is
      that it'll make it easier for a later change to drop mmap_sem when
      blocking on disk (we'd like to be able to resume at the page that was read
      from disk instead of at the start of a 16-page batch).
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mlock: only hold mmap_sem in shared mode when faulting in pages · fed067da
      Michel Lespinasse authored
      
      
      Currently mlock() holds mmap_sem in exclusive mode while the pages get
      faulted in.  In the case of a large mlock, this can potentially take a
      very long time, during which various commands such as 'ps auxw' will
      block.  This makes sysadmins unhappy:
      
      real    14m36.232s
      user    0m0.003s
      sys     0m0.015s
      (output from 'time ps auxw' while a 20GB file was being mlocked without
      being previously preloaded into page cache)
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages, in small batches, releasing mmap_sem when we
      block on disk access or when we detect some contention.
      
      This patch:
      
      Before this change, mlock() holds mmap_sem in exclusive mode while the
      pages get faulted in.  In the case of a large mlock, this can potentially
      take a very long time.  Various things will block while mmap_sem is held,
      including 'ps auxw'.  This can make sysadmins angry.
      
      I propose that mlock() could release mmap_sem after the VM_LOCKED bits
      have been set in all appropriate VMAs.  Then a second pass could be done
      to actually mlock the pages with mmap_sem held for reads only.  We need to
      recheck the vma flags after we re-acquire mmap_sem, but this is easy.
      
      In the case where a vma has been munlocked before mlock completes, pages
      that were already marked as PageMlocked() are handled by the munlock()
      call, and mlock() is careful to not mark new page batches as PageMlocked()
      after the munlock() call has cleared the VM_LOCKED vma flags.  So, the end
      result will be identical to what'd happen if munlock() had executed after
      the mlock() call.
      
      In a later change, I will allow the second pass to release mmap_sem when
      blocking on disk accesses or when it is otherwise contended, so that it
      won't be held for long periods of time even in shared mode.
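
The two-pass scheme and the flag recheck described above can be modelled in a few lines of userspace Python. This is only a toy sketch, not kernel code: the `Vma` class and the three helper names are hypothetical, locking is indicated in comments rather than implemented, and whole VMAs are flagged instead of being split at the range boundaries. `VM_LOCKED` reuses the kernel's bit value purely as a label.

```python
from dataclasses import dataclass, field

VM_LOCKED = 0x2000  # label only; mirrors the kernel flag's bit value


@dataclass
class Vma:
    start: int                      # first page index covered by this toy VMA
    end: int                        # one past the last page index
    flags: int = 0
    mlocked_pages: set = field(default_factory=set)


def overlaps(vma, start, end):
    return vma.start < end and start < vma.end


def pass1_set_flags(vmas, start, end):
    """Pass 1 (would hold mmap_sem for write): only set VM_LOCKED."""
    for vma in vmas:
        if overlaps(vma, start, end):
            vma.flags |= VM_LOCKED


def pass2_fault_in(vmas, start, end):
    """Pass 2 (would hold mmap_sem for read): mark the pages as mlocked.

    The flags are rechecked because munlock() may have run between passes.
    """
    for vma in vmas:
        if not overlaps(vma, start, end):
            continue
        if not (vma.flags & VM_LOCKED):
            continue  # munlocked in the meantime: mark no new pages
        for page in range(max(vma.start, start), min(vma.end, end)):
            vma.mlocked_pages.add(page)


def munlock(vmas, start, end):
    """Clear VM_LOCKED and drop the state of pages already marked."""
    for vma in vmas:
        if overlaps(vma, start, end):
            vma.flags &= ~VM_LOCKED
            vma.mlocked_pages.clear()
```

If `munlock()` runs on one VMA between the two passes, the second pass skips that VMA, so the outcome matches what would happen had `munlock()` run after the whole `mlock()` call.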
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mlock: avoid dirtying pages and triggering writeback · 5ecfda04
      Michel Lespinasse authored
      
      
      When faulting in pages for mlock(), we want to break COW for anonymous or
      file pages within VM_WRITE, non-VM_SHARED vmas.  However, there is no
      need to write-fault into VM_SHARED vmas since shared file pages can be
      mlocked first and dirtied later, when/if they actually get written to.
      Skipping the write fault is desirable, as we don't want to unnecessarily
      cause these pages to be dirtied and queued for writeback.
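
The rule above reduces to a single mask test on the vma flags. A minimal Python sketch follows; the bit values mirror the kernel's VM_WRITE and VM_SHARED definitions but serve only as labels here, and the helper name `needs_write_fault()` is hypothetical.

```python
VM_WRITE = 0x2   # mapping may be written
VM_SHARED = 0x8  # mapping is shared rather than private


def needs_write_fault(vm_flags: int) -> bool:
    """Return True when mlock should fault pages in with a write access.

    Only writable private mappings need it, so that COW is broken up front.
    Shared mappings get a read fault, so nothing is dirtied or queued for
    writeback until the pages are actually written to.
    """
    return (vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE
```

Comparing against `VM_WRITE` rather than testing for non-zero is what excludes the writable shared case.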
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • do_wp_page: clarify dirty_page handling · 72ddc8f7
      Michel Lespinasse authored
      
      
      Reorganize the code so that dirty pages are handled closer to the place
      that makes them dirty (handling write fault into shared, writable VMAs).
      No behavior changes.
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • do_wp_page: remove the 'reuse' flag · b009c024
      Michel Lespinasse authored
      
      
      mlocking a shared, writable vma currently causes the corresponding pages
      to be marked as dirty and queued for writeback.  This seems rather
      unnecessary given that the pages are not being actually modified during
      mlock.  It is understood that for non-shared mappings (file or anon) we
      want to use a write fault in order to break COW, but there is just no such
      need for shared mappings.
      
      The first two patches in this series do not introduce any behavior change.
      The intent there is to make it obvious that dirtying file pages is only
      done in the (writable, shared) case.  I think this clarifies the code, but
      I wouldn't mind dropping these two patches if there is no consensus about
      them.
      
      The last patch is where we actually avoid dirtying shared mappings during
      mlock.  Note that as a side effect of this, we won't call page_mkwrite()
      for the mappings that define it, and won't be pre-allocating data blocks
      at the FS level if the mapped file was sparsely allocated.  My
      understanding is that mlock does not need to provide such guarantee, as
      evidenced by the fact that it never did for the filesystems that don't
      define page_mkwrite() - including some common ones like ext3.  However, I
      would like to gather feedback on this from filesystem people as a
      precaution.  If this turns out to be a showstopper, maybe block
      preallocation can be added back on using a different interface.
      
      Large shared mlocks are getting significantly (>2x) faster in my tests, as
      the disk can be fully used for reading the file instead of having to share
      between this and writeback.
      
      This patch:
      
      Reorganize the code to remove the 'reuse' flag.  No behavior changes.
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Theodore Tso <tytso@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clear PageError bit in msync & fsync · 212260aa
      Rik van Riel authored
      
      
      Temporary IO failures, e.g. due to loss of both multipath paths, can
      permanently leave the PageError bit set on a page, resulting in msync or
      fsync returning -EIO over and over again, even if IO is now getting to the
      disk correctly.
      
      We already clear the AS_ENOSPC and AS_EIO bits in mapping->flags in the
      filemap_fdatawait_range function.  Also clearing the PageError bit on the
      page allows subsequent msync or fsync calls on this file to return without
      an error, if the subsequent IO succeeds.
      
      Unfortunately data written out in the msync or fsync call that returned
      -EIO can still get lost, because the page dirty bit appears to not get
      restored on IO error.  However, the alternative could be potentially all
      of memory filling up with uncleanable dirty pages, hanging the system, so
      there is no nice choice here...
      
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Valerie Aurora <vaurora@redhat.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: allow a non-CAP_SYS_RESOURCE process to oom_score_adj down · dabb16f6
      Mandeep Singh Baines authored
      
      
      We'd like to be able to oom_score_adj a process up/down as it
      enters/leaves the foreground.  Currently, it is not possible to oom_adj
      down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
      oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
      or its inherited value at fork.  Assuming the thread that has forked it
      has oom_score_adj of 0, each process could decrease it back from 0 upon
      activation unless a CAP_SYS_RESOURCE thread elevated it to something
      higher.
      
      Alternatives considered:
      
      * a setuid binary
      * a daemon with CAP_SYS_RESOURCE
      
      Since you don't want all processes to be able to reduce their oom_adj, a
      setuid or daemon implementation would be complex.  The alternatives also
      have much higher overhead.
      
      This patch updated from original patch based on feedback from David
      Rientjes.
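
The permission rule described above can be sketched as a small userspace model. This is an illustration under stated assumptions, not the kernel implementation: the `Task` class, the `set_oom_score_adj()` helper, and the `oom_score_adj_min` attribute used to remember the floor are hypothetical names, and the capability check is reduced to a boolean argument. The valid range of -1000 to 1000 is the documented range of oom_score_adj.

```python
import errno

OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


class Task:
    def __init__(self, oom_score_adj=0):
        self.oom_score_adj = oom_score_adj
        # Lowest value the task may return to without privilege: the value
        # inherited at fork, or the last value set by a privileged writer.
        self.oom_score_adj_min = oom_score_adj


def set_oom_score_adj(task, value, capable=False):
    """Return 0 on success or a negative errno, like a /proc write would."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        return -errno.EINVAL
    if value < task.oom_score_adj_min and not capable:
        return -errno.EACCES          # cannot go below the recorded floor
    task.oom_score_adj = value
    if capable:
        task.oom_score_adj_min = value  # a privileged write moves the floor
    return 0
```

An unprivileged task can therefore raise its score when it leaves the foreground and lower it again on activation, but never below the floor.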
      
      Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: unify module_alloc code for vmalloc · d0a21265
      David Rientjes authored
      
      
      Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
      module_alloc().  Much of the code is duplicated and can be generalized in a
      globally accessible function, __vmalloc_node_range().
      
      __vmalloc_node() now calls into __vmalloc_node_range() with a range of
      [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.
      
      Each architecture may then use __vmalloc_node_range() directly to remove
      the duplication of code.
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove gfp mask from pcpu_get_vm_areas · ec3f64fc
      David Rientjes authored
      
      
      pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
      formal and use the mask internally.
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unused get_vm_area_node · e5a5623b
      David Rientjes authored
      
      
      get_vm_area_node() is unused in the kernel and can thus be removed.
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: rename lumpy_mode to reclaim_mode · f3a310bc
      Mel Gorman authored
      
      
      With compaction being used instead of lumpy reclaim, the name lumpy_mode
      and associated variables are a bit misleading.  Rename lumpy_mode to
      reclaim_mode which is a better fit.  There is no functional change.
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: perform a faster migration scan when migrating asynchronously · 9927af74
      Mel Gorman authored
      
      
      try_to_compact_pages() is initially called to only migrate pages
      asynchronously and kswapd always compacts asynchronously.  Both are being
      optimistic so it is important to complete the work as quickly as possible
      to minimise stalls.
      
      This patch alters the scanner when asynchronous to only consider
      MIGRATE_MOVABLE pageblocks as migration candidates.  This reduces stalls
      when allocating huge pages while not impairing allocation success rates as
      a full scan will be performed if necessary after direct reclaim.
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migration: cleanup migrate_pages API by matching types for offlining and sync · 7f0f2496
      Mel Gorman authored
      
      
      With the introduction of the boolean sync parameter, the API looks a
      little inconsistent as offlining is still an int.  Convert offlining to a
      bool for the sake of being tidy.
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path · 77f1fe6b
      Mel Gorman authored
      
      
      Migration synchronously waits for writeback if the initial passes fail.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
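
The effect of the sync parameter and the async-then-sync ordering can be illustrated with a toy userspace model. Everything here is a sketch under assumptions: pages are plain dicts, `migrate_pages()` only shares its name with the kernel function, `compact_for_allocation()` is a hypothetical stand-in for the allocator slow path, and direct reclaim is represented by a label only.

```python
def migrate_pages(pages, sync):
    """Toy migration pass over a list of page dicts.

    A page under writeback is migrated only when sync is True; clearing
    the flag below stands in for wait_on_page_writeback().
    """
    migrated, failed = [], []
    for page in pages:
        if page["writeback"]:
            if not sync:
                failed.append(page)   # async: skip rather than wait
                continue
            page["writeback"] = False
        migrated.append(page)
    return migrated, failed


def compact_for_allocation(pages, needed):
    """Return (steps taken, pages migrated) for a toy allocation attempt."""
    steps = ["async compaction"]
    migrated, failed = migrate_pages(pages, sync=False)
    if len(migrated) >= needed:
        return steps, len(migrated)
    steps.append("direct reclaim")    # placeholder: reclaim is not modelled
    steps.append("sync compaction")
    more, failed = migrate_pages(failed, sync=True)
    return steps, len(migrated) + len(more)
```

The cheap asynchronous attempt handles the common case, and the caller pays for waiting on writeback only when that attempt falls short.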
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Mel Gorman authored
      
      
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim when compaction is available called reclaim/compaction.  The basic
      operation is very simple - instead of selecting a contiguous range of
      pages to reclaim, a number of order-0 pages are reclaimed and then
      compaction is later run by either kswapd (compact_zone_order()) or direct
      compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: convert lumpy_mode into a bitmask · ee64fc93
      Mel Gorman authored
      
      
      Currently lumpy_mode is an enum and determines if lumpy reclaim is off,
      synchronous or asynchronous.  In preparation for using compaction instead of
      lumpy reclaim, this patch converts the flags into a bitmap.
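
The motivation for moving from an enum to a bitmask is that an enum names exactly one state, while independent bits can be combined. A minimal Python sketch using `enum.IntFlag`; the class name, bit names, and values are illustrative assumptions and are not taken from the patch.

```python
from enum import IntFlag


class LumpyMode(IntFlag):
    """Illustrative bit names; not the identifiers used by the patch."""
    SINGLE = 0x01          # reclaim single pages only
    ASYNC = 0x02           # do not wait on page writeback
    SYNC = 0x04            # may wait on page writeback
    CONTIGRECLAIM = 0x08   # reclaim physically contiguous ranges


def may_wait_on_writeback(mode):
    """Test one property without caring which other bits are set."""
    return bool(mode & LumpyMode.SYNC)
```

With this layout, "what is reclaimed" and "whether to wait" are separate bits, so a new reclaim strategy can later be added as one more bit without touching the sync/async tests.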
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: add trace events for memory compaction activity · b7aba698
      Mel Gorman authored
      
      
      In preparation for patches promoting the use of memory compaction over
      lumpy reclaim, this patch adds trace points for memory compaction
      activity.  Using them, we can monitor the scanning activity of the
      migration and free page scanners as well as the number and success rates
      of pages passed to page migration.
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: smaps: export mlock information · 2d90508f
      Nikanth Karthikesan authored
      
      
      Currently there is no way to find out whether a process has locked its
      pages in memory, or which of its memory regions are locked.
      
      Add a new field "Locked" to export this information via the smaps file.
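
A consumer of the new field might look like the sketch below, which lists the mappings that report a non-zero "Locked" value. The sample text is made up for illustration and shows only a few of the fields a real smaps entry contains; the helper name `locked_regions()` is hypothetical.

```python
SAMPLE_SMAPS = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/example
Size:                 44 kB
Rss:                  40 kB
Locked:                0 kB
7f0000000000-7f0000100000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                1024 kB
Locked:             1024 kB
"""


def locked_regions(smaps_text):
    """Return (address range, locked kB) for each mapping with locked pages."""
    regions = []
    current = None
    for line in smaps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "-" in fields[0] and not fields[0].endswith(":"):
            current = fields[0]            # mapping header: "start-end perms ..."
        elif fields[0] == "Locked:" and current is not None:
            kb = int(fields[1])
            if kb > 0:
                regions.append((current, kb))
    return regions
```

On a kernel that carries this patch, the same function could be fed the contents of `/proc/self/smaps` instead of the sample string.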
      
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert sprintf_symbol to %pS · 62c70bce
      Joe Perches authored
      
      
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Jiri Kosina <trivial@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/mpage.c: consolidate code · c32b0d4b
      Hai Shan authored
      
      
      Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
      eliminate code duplication.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Hai Shan <shan.hai@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: find_get_pages_contig fixlet · 9cbb4cb2
      Nick Piggin authored
      
      
      Testing ->mapping and ->index without a ref is not stable as the page
      may have been reused at this point.
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: factor out kswapd sleeping logic from kswapd() · f0bc0a60
      KOSAKI Motohiro authored
      
      
      Currently, kswapd() has deep nesting and is slightly hard to read.  Clean
      this up.
      
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: fix __set_page_dirty_no_writeback() return value · c3f0da63
      Bob Liu authored
      
      
      __set_page_dirty_no_writeback() should return true if it actually
      transitioned the page from a clean to dirty state although it seems nobody
      uses its return value at present.
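
The intended contract can be stated in a few lines. This is a toy Python model with a hypothetical `Page` class, not the kernel function: the return value reports whether the call itself performed the clean-to-dirty transition.

```python
class Page:
    def __init__(self):
        self.dirty = False


def set_page_dirty_no_writeback(page):
    """Mark the page dirty; return True only on a clean -> dirty transition."""
    if not page.dirty:
        page.dirty = True
        return True
    return False   # already dirty: nothing changed
```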
      
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sync_inode_metadata: fix comment · c691b9d9
      Andrew Morton authored
      
      
      Use correct function name, remove incorrect apostrophe
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: avoid livelocking WB_SYNC_ALL writeback · b9543dac
      Jan Kara authored
      
      
      When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
      usually set to LONG_MAX.  The logic in wb_writeback() then calls
      __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
      easily end up with non-positive nr_to_write after the function returns, if
      the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.
      
      When nr_to_write is <= 0 wb_writeback() decides we need another round of
      writeback but this is wrong in some cases!  For example when a single
      large file is continuously dirtied, we would never finish syncing it
      because each pass would be able to write MAX_WRITEBACK_PAGES and inode
      dirty timestamp never gets updated (as inode is never completely clean).
      Thus __writeback_inodes_sb() would write the redirtied inode again and
      again.
      
      Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode.  We
      do not need nr_to_write in WB_SYNC_ALL mode anyway since
      write_cache_pages() does livelock avoidance using page tagging in
      WB_SYNC_ALL mode.
      
      This makes wb_writeback() call __writeback_inodes_sb() only once on
      WB_SYNC_ALL.  The latter function won't livelock because it works on
      
      - a finite set of files by doing queue_io() once at the beginning
      - a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging
      
      After this patch, the program from http://lkml.org/lkml/2010/10/24/154 is no
      longer able to stall sync forever.
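
The difference between the two strategies can be seen in a toy simulation. This is a sketch under assumptions, not the kernel logic: dirty pages are a Python set, a hypothetical `make_writer()` plays the process that keeps dirtying the file, `MAX_WRITEBACK_PAGES` is shrunk to 4, and taking a snapshot of the set stands in for PAGECACHE_TAG_TOWRITE tagging.

```python
import itertools

MAX_WRITEBACK_PAGES = 4  # toy chunk size


def make_writer(pages_per_round, first_page=1000):
    """Return a callback that dirties pages_per_round new pages per call."""
    counter = itertools.count(first_page)

    def redirty(dirty):
        for _ in range(pages_per_round):
            dirty.add(next(counter))
    return redirty


def chunked_sync(dirty, redirty, max_rounds=100):
    """Old behaviour, simplified: repeat while the chunk budget ran out."""
    for rounds in range(1, max_rounds + 1):
        budget = MAX_WRITEBACK_PAGES
        while dirty and budget > 0:
            dirty.pop()
            budget -= 1
        redirty(dirty)             # the writer runs between passes
        if budget > 0:             # budget left over: the file was clean
            return rounds
    return None                    # gave up: the toy equivalent of a livelock


def tagged_sync(dirty, redirty):
    """New behaviour, simplified: write only what was dirty at the start."""
    to_write = set(dirty)          # snapshot stands in for page tagging
    written = 0
    while to_write:
        dirty.discard(to_write.pop())
        written += 1
        if written % MAX_WRITEBACK_PAGES == 0:
            redirty(dirty)         # new dirty pages are not chased
    return written
```

With a writer that dirties a full chunk per round, `chunked_sync()` never sees leftover budget and hits the round limit, while `tagged_sync()` finishes after writing exactly the pages that were dirty when it started.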
      
      [fengguang.wu@intel.com: fix locking comment]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: stop background/kupdate works from livelocking other works · aa373cf5
      Jan Kara authored
      
      
      Background writeback is easily livelockable in a loop in wb_writeback() by
      a process continuously re-dirtying pages (or continuously appending to a
      file).  This is in fact intended as the target of background writeback is
      to write dirty pages it can find as long as we are over
      dirty_background_threshold.
      
      But the above behavior gets inconvenient at times because no other work
      queued in the flusher thread's queue gets processed.  In particular, since
      e.g.  sync(1) relies on flusher thread to do all the IO for it, sync(1)
      can hang forever waiting for flusher thread to do the work.
      
      Generally, when a flusher thread has some work queued, someone submitted
      the work to achieve a goal more specific than what background writeback
      does.  Moreover, by working on the specific work, we also reduce the amount of
      dirty pages which is exactly the target of background writeout.  So it
      makes sense to give specific work a priority over a generic page cleaning.
      
      Thus we interrupt background writeback if there is some other work to do.
      We return to the background writeback after completing all the queued
      work.
      
      This may delay the writeback of expired inodes for a while, however the
      expired inodes will eventually be flushed to disk as long as the other
      works won't livelock.
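
The priority rule can be sketched as a toy flusher loop in Python. The function name `run_flusher()`, the `arrivals` mapping, and the chunk counter are all hypothetical; one loop iteration stands for one chunk of writeback, and a finite `background_chunks` models a dirtier that eventually stops.

```python
from collections import deque


def run_flusher(arrivals, background_chunks):
    """Return the order in which the toy flusher spends its steps.

    arrivals maps a step number to a work item queued at that step.
    """
    queue = deque()
    log = []
    step = 0
    while True:
        if step in arrivals:
            queue.append(arrivals[step])
        if queue:
            log.append(queue.popleft())        # specific work goes first
        elif background_chunks > 0:
            background_chunks -= 1
            log.append("background")           # resumed once the queue is empty
        elif all(s <= step for s in arrivals):
            break                              # nothing left and nothing pending
        step += 1
    return log
```

For example, `run_flusher({2: "sync"}, 5)` logs two background chunks, then the queued sync work, then the remaining three background chunks, so the sync request is never stuck behind the background loop.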
      
      [fengguang.wu@intel.com: update comment]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: trace wakeup event for background writeback · 71927e84
      Wu Fengguang authored
      
      
      This tracks when balance_dirty_pages() tries to wake up the flusher thread
      for background writeback (if it was not started already).
      
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: integrated background writeback work · 6585027a
      Jan Kara authored
      
      
      Check whether background writeback is needed after finishing each work.
      
      When the bdi flusher thread finishes some work, it checks whether any kind
      of background writeback needs to be done (either because
      dirty_background_ratio is exceeded or because we need to start flushing
      old inodes).  If so, it just does background writeback.
      
      This way, bdi_start_background_writeback() just needs to wake up the
      flusher thread.  It will do background writeback as soon as there is no
      other work.
      
      This is a preparatory patch for the next patch which stops background
      writeback as soon as there is other work to do.
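
      The control flow described above can be illustrated with a small
      user-space sketch.  It is not the kernel code: struct flusher, run_work(),
      flusher_wakeup() and the page counts are all hypothetical, and only the
      dirty-threshold condition is modelled (the "old inodes" condition is left
      out).  The point is that starting background writeback reduces to waking
      the thread, which then notices on its own that it is over the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are kernel names. */
enum work_kind { WORK_EXPLICIT, WORK_BACKGROUND };

struct flusher {
    int queued;              /* explicit work items still queued */
    long dirty_pages;        /* pages currently dirty */
    long background_thresh;  /* background writeback is needed above this */
    int trace[16];           /* order in which work was run */
    size_t ntrace;
};

static bool over_background_thresh(const struct flusher *f)
{
    return f->dirty_pages > f->background_thresh;
}

static void run_work(struct flusher *f, enum work_kind kind)
{
    assert(f->ntrace < 16);
    f->trace[f->ntrace++] = kind;

    if (kind == WORK_BACKGROUND) {
        /* Background writeback cleans pages down to the threshold. */
        f->dirty_pages = f->background_thresh;
    } else {
        /* Assumption: each explicit item happens to clean 10 pages. */
        f->dirty_pages = f->dirty_pages > 10 ? f->dirty_pages - 10 : 0;
    }
}

/* One wakeup: drain the queued work first, then check for background work. */
static void flusher_wakeup(struct flusher *f)
{
    while (f->queued > 0) {
        f->queued--;
        run_work(f, WORK_EXPLICIT);
    }
    if (over_background_thresh(f))
        run_work(f, WORK_BACKGROUND);
}

/* Starting background writeback needs no work item, only a wakeup. */
static void start_background_writeback(struct flusher *f)
{
    flusher_wakeup(f);
}
```

      With two queued items, 100 dirty pages and a threshold of 50, the sketch
      runs both explicit items and then one background pass.  With one queued
      item and 55 dirty pages, the explicit item alone brings the count under
      the threshold, so no background pass follows.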
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6585027a
    • Mel Gorman's avatar
      mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Mel Gorman authored
      
      
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
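
      The shape of that refactoring can be sketched in user-space C.  This is an
      illustration under assumptions, not the patch itself: the _sketch types,
      the fixed NR_ZONES/NR_CPUS sizes, and the threshold values 64 and 8 are
      all made up.  A single setter walks the zones of a node and stores
      whatever threshold the supplied callback computes, so "reduce" and
      "restore" differ only in the callback that is passed.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ZONES 2
#define NR_CPUS  4

/* Hypothetical, simplified stand-ins for the zone and node structures. */
struct zone_sketch {
    int percpu_drift_mark;        /* 0: counter drift is not a concern here */
    int stat_threshold[NR_CPUS];  /* per-cpu fold threshold */
};

struct pgdat_sketch {
    struct zone_sketch zones[NR_ZONES];
};

typedef int (*calc_threshold_fn)(const struct zone_sketch *zone);

/* Assumed values: a relaxed threshold and a tight one for memory pressure. */
static int calc_normal_threshold(const struct zone_sketch *zone)
{
    (void)zone;
    return 64;
}

static int calc_pressure_threshold(const struct zone_sketch *zone)
{
    (void)zone;
    return 8;
}

/* Single setter: the callback decides the value, the loop applies it. */
static void set_pgdat_percpu_threshold_sketch(struct pgdat_sketch *pgdat,
                                              calc_threshold_fn calc)
{
    assert(pgdat != NULL && calc != NULL);

    for (size_t i = 0; i < NR_ZONES; i++) {
        struct zone_sketch *zone = &pgdat->zones[i];

        /* Zones where drift is harmless keep their current threshold. */
        if (!zone->percpu_drift_mark)
            continue;

        int threshold = calc(zone);
        for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
            zone->stat_threshold[cpu] = threshold;
    }
}
```

      In this sketch a caller would pass calc_pressure_threshold when kswapd
      wakes and calc_normal_threshold when it goes back to sleep.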
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b44129b3
    • Mel Gorman's avatar
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman authored
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when the per-cpu delta is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was at least one recorded sleep and wake event.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
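
      The trade-off described above can be made concrete with a small user-space
      model.  It is a sketch under assumptions, not the vmstat implementation:
      struct counter_sketch and its helpers are hypothetical names and NR_CPUS
      is fixed at 4.  Each CPU accumulates a private delta and folds it into the
      global value only once the delta exceeds the threshold, so the cheap read
      can be off by up to NR_CPUS * threshold, while the snapshot read is exact
      but has to visit every CPU's delta.

```c
#include <assert.h>

#define NR_CPUS 4

/* Hypothetical model of a counter with per-cpu deltas. */
struct counter_sketch {
    long global;          /* cheap to read, possibly stale */
    int  delta[NR_CPUS];  /* per-cpu updates not yet folded into global */
    int  threshold;       /* fold once |delta| exceeds this */
};

/* Update from one CPU; touches the shared value only past the threshold. */
static void counter_add(struct counter_sketch *c, int cpu, int n)
{
    assert(cpu >= 0 && cpu < NR_CPUS);

    int d = c->delta[cpu] + n;
    if (d > c->threshold || d < -c->threshold) {
        c->global += d;
        d = 0;
    }
    c->delta[cpu] = d;
}

/* Fast estimate: ignores whatever is still pending in the per-cpu deltas. */
static long counter_read(const struct counter_sketch *c)
{
    return c->global;
}

/* Accurate reading: also sums the per-cpu deltas, touching every CPU's data. */
static long counter_read_snapshot(const struct counter_sketch *c)
{
    long value = c->global;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        value += c->delta[cpu];
    return value;
}
```

      Starting from 1000 free pages with a threshold of 10, four CPUs that each
      allocate 10 pages leave the estimate at 1000 while the snapshot reports
      960.  With the threshold lowered to 2, the same pattern can hide at most
      8 pages, which is the effect the reduced thresholds are meant to achieve.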
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Nicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
    • Dave Jones's avatar
      sched: remove long deprecated CLONE_STOPPED flag · 43bb40c9
      Dave Jones authored
      This warning was added in commit bdff746a ("clone: prepare to recycle
      CLONE_STOPPED") three years ago.  2.6.26 came and went.  As far as I know,
      no-one is actually using CLONE_STOPPED.
      
      Signed-off-by: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43bb40c9
    • Claudio Scordino's avatar
      atmel_serial: fix RTS high after initialization in RS485 mode · 5dfbd1d7
      Claudio Scordino authored
      
      
      When working in RS485 mode, the atmel_serial driver keeps RTS high after
      the initialization of the serial port.  It goes low only after the first
      character has been sent.
      
      [akpm@linux-foundation.org: simplify code]
      Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
      Signed-off-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com>
      Tested-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com>
      Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5dfbd1d7
    • Eric Dumazet's avatar
      irq: use per_cpu kstat_irqs · 6c9ae009
      Eric Dumazet authored
      
      
      Use the modern per_cpu API to increment {soft|hard}irq counters, and use
      per_cpu allocation for (struct irq_desc)->kstat_irqs instead of an array.
      
      This gives better SMP/NUMA locality and saves a few instructions per irq.
      
      With small nr_cpuids values (8 for example), kstat_irqs was a small array
      (less than L1_CACHE_BYTES), potentially a source of false sharing.
      
      In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly
      kstat_irqs_all[NR_IRQS][NR_CPUS] array.
      
      Note: we still populate kstat_irqs for all possible irqs in
      early_irq_init().  We probably could use on-demand allocations.  (Code
      included in alloc_descs()).  The problem is that not all IRQs are used
      with a prior alloc_descs() call.
      
      kstat_irqs_this_cpu() is not used anymore, remove it.
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c9ae009
    • Bruce Chang's avatar
      MAINTAINERS: update entries affecting VIA Technologies · 558bbb2f
      Bruce Chang authored
      
      
      Since the original maintainer, Joseph Chan (josephchan@via.com.tw), no
      longer handles the Linux driver for VIA, I would like to request to update
      the maintainer for the SD/MMC CARD CONTROLLER DRIVER and VIA
      UNICHROME(PRO)/CHROME9 FRAMEBUFFER DRIVER before we find a better one.
      
      Signed-off-by: Bruce Chang <brucechang@via.com.tw>
      Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
      Cc: Joseph Chan <JosephChan@via.com.tw>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Harald Welte <HaraldWelte@viatech.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      558bbb2f
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm · f6bcfd94
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits)
        dm: raid456 basic support
        dm: per target unplug callback support
        dm: introduce target callbacks and congestion callback
        dm mpath: delay activate_path retry on SCSI_DH_RETRY
        dm: remove superfluous irq disablement in dm_request_fn
        dm log: use PTR_ERR value instead of ENOMEM
        dm snapshot: avoid storing private suspended state
        dm snapshot: persistent make metadata_wq multithreaded
        dm: use non reentrant workqueues if equivalent
        dm: convert workqueues to alloc_ordered
        dm stripe: switch from local workqueue to system_wq
        dm: dont use flush_scheduled_work
        dm snapshot: remove unused dm_snapshot queued_bios_work
        dm ioctl: suppress needless warning messages
        dm crypt: add loop aes iv generator
        dm crypt: add multi key capability
        dm crypt: add post iv call to iv generator
        dm crypt: use io thread for reads only if mempool exhausted
        dm crypt: scale to multiple cpus
        dm crypt: simplify compatible table output
        ...
      f6bcfd94
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://neil.brown.name/md · 509e4aef
      Linus Torvalds authored
      * 'for-linus' of git://neil.brown.name/md:
        md: Fix removal of extra drives when converting RAID6 to RAID5
        md: range check slot number when manually adding a spare.
        md/raid5: handle manually-added spares in start_reshape.
        md: fix sync_completed reporting for very large drives (>2TB)
        md: allow suspend_lo and suspend_hi to decrease as well as increase.
        md: Don't let implementation detail of curr_resync leak out through sysfs.
        md: separate meta and data devs
        md-new-param-to_sync_page_io
        md-new-param-to-calc_dev_sboffset
        md: Be more careful about clearing flags bit in ->recovery
        md: md_stop_writes requires mddev_lock.
        md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
        md: Ensure no IO request to get md device before it is properly initialised.
        md: Fix single printks with multiple KERN_<level>s
        md: fix regression resulting in delays in clearing bits in a bitmap
        md: fix regression with re-adding devices to arrays with no metadata
      509e4aef
    • Linus Torvalds's avatar
      Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 · 375b6f5a
      Linus Torvalds authored
      * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
        [IA64] fix build error - arch/ia64/kernel/perfmon.c
      375b6f5a
    • Linus Torvalds's avatar
      Revert "gpiolib: annotate gpio-intialization with __must_check" · d8a3515e
      Linus Torvalds authored
      This reverts commit 0fdae42d, which
      wasn't really supposed to go in, and causes lots of annoying warnings.
      
      Quoth Andrew:
        "Complete brainfart - I meant to drop that patch ages ago."
      
      Quoth Greg:
        "Ick, yeah, that patch isn't ok to go in as-is, all of the callers
         need to be fixed up first, which is what I thought we had agreed on..."
      
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Greg KH <greg@kroah.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8a3515e
    • Linus Torvalds's avatar
      ecryptfs: fix broken build · 6254b32b
      Linus Torvalds authored
      Stephen Rothwell reports that the vfs merge broke the build of ecryptfs.
      The breakage comes from commit 66cb7666 ("sanitize ecryptfs
      ->mount()") which was obviously not even build tested. Tssk, tssk, Al.
      
      This is the minimal build fixup for the situation, although I don't have
      a filesystem to actually test it with.
      
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6254b32b
    • Tony Luck's avatar
      [IA64] fix build error - arch/ia64/kernel/perfmon.c · 09579770
      Tony Luck authored
      arch/ia64/kernel/perfmon.c:621: error: duplicate 'static'
      
      Introduced by commit c74a1cbb
      
          pass default dentry_operations to mount_pseudo()
      
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      09579770