Skip to content
  1. Oct 09, 2012
    • Davidlohr Bueso's avatar
      oom: remove deprecated oom_adj · 01dc52eb
      Davidlohr Bueso authored
      
      
      The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.
      
      Signed-off-by: default avatarDavidlohr Bueso <dave@gnu.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01dc52eb
    • Gavin Shan's avatar
      mm/vmscan: fix error number for failed kthread · d5dc0ad9
      Gavin Shan authored
      
      
      Fix the return value while failing to create the kswapd kernel thread.
      Also, the error message is prioritized as KERN_ERR.
      
      Signed-off-by: default avatarGavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5dc0ad9
    • Gavin Shan's avatar
      mm/mmu_notifier: init notifier if necessary · e0f3c3f7
      Gavin Shan authored
      
      
      While registering MMU notifier, new instance of MMU notifier_mm will be
      allocated and later free'd if currrent mm_struct's MMU notifier_mm has
      been initialized.  That causes some overhead.  The patch tries to
      elominate that.
      
      Signed-off-by: default avatarGavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Sagi Grimberg <sagig@mellanox.co.il>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0f3c3f7
    • Sagi Grimberg's avatar
      mm: mmu_notifier: have mmu_notifiers use a global SRCU so they may safely schedule · 21a92735
      Sagi Grimberg authored
      
      
      With an RCU based mmu_notifier implementation, any callout to
      mmu_notifier_invalidate_range_{start,end}() or
      mmu_notifier_invalidate_page() would not be allowed to call schedule()
      as that could potentially allow a modification to the mmu_notifier
      structure while it is currently being used.
      
      Since srcu allocs 4 machine words per instance per cpu, we may end up
      with memory exhaustion if we use srcu per mm.  So all mms share a global
      srcu.  Note that during large mmu_notifier activity exit & unregister
      paths might hang for longer periods, but it is tolerable for current
      mmu_notifier clients.
      
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.co.il>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21a92735
    • Xiao Guangrong's avatar
      mm: mmu_notifier: fix inconsistent memory between secondary MMU and host · 48af0d7c
      Xiao Guangrong authored
      
      
      There is a bug in set_pte_at_notify() which always sets the pte to the
      new page before releasing the old page in the secondary MMU.  At this
      time, the process will access on the new page, but the secondary MMU
      still access on the old page, the memory is inconsistent between them
      
      The below scenario shows the bug more clearly:
      
      at the beginning: *p = 0, and p is write-protected by KSM or shared with
      parent process
      
      CPU 0                                       CPU 1
      write 1 to p to trigger COW,
      set_pte_at_notify will be called:
        *pte = new_page + W; /* The W bit of pte is set */
      
                                           *p = 1; /* pte is valid, so no #PF */
      
                                           return back to secondary MMU, then
                                           the secondary MMU read p, but get:
                                           *p == 0;
      
                               /*
                                * !!!!!!
                                * the host has already set p to 1, but the secondary
                                * MMU still get the old value 0
                                */
      
        call mmu_notifier_change_pte to release
        old page in secondary MMU
      
      We can fix it by release old page first, then set the pte to the new
      page.
      
      Note, the new page will be firstly used in secondary MMU before it is
      mapped into the page table of the process, but this is safe because it
      is protected by the page table lock, there is no race to change the pte
      
      [akpm@linux-foundation.org: add comment from Andrea]
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48af0d7c
    • Mel Gorman's avatar
      mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma() · 00442ad0
      Mel Gorman authored
      Commit cc9a6c87
      
       ("cpuset: mm: reduce large amounts of memory barrier
      related damage v3") introduced a potential memory corruption.
      shmem_alloc_page() uses a pseudo vma and it has one significant unique
      combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED.
      
      get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL
      and mpol_cond_put() DOES decrease a policy ref when a policy has
      MPOL_F_SHARED.  Therefore, when a cpuset update race occurs,
      alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the
      reference count and frees the policy prematurely.
      
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      00442ad0
    • KOSAKI Motohiro's avatar
      mempolicy: fix refcount leak in mpol_set_shared_policy() · 63f74ca2
      KOSAKI Motohiro authored
      
      
      When shared_policy_replace() fails to allocate new->policy is not freed
      correctly by mpol_set_shared_policy().  The problem is that shared
      mempolicy code directly call kmem_cache_free() in multiple places where
      it is easy to make a mistake.
      
      This patch creates an sp_free wrapper function and uses it. The bug was
      introduced pre-git age (IOW, before 2.6.12-rc2).
      
      [mgorman@suse.de: Editted changelog]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63f74ca2
    • Mel Gorman's avatar
      mempolicy: fix a race in shared_policy_replace() · b22d127a
      Mel Gorman authored
      
      
      shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
      be dereferenced if sp->lock is not held and 2) another thread can modify
      sp_node between spin_unlock for allocating a new sp node and next
      spin_lock.  The bug was introduced before 2.6.12-rc2.
      
      Kosaki's original patch for this problem was to allocate an sp node and
      policy within shared_policy_replace and initialise it when the lock is
      reacquired.  I was not keen on this approach because it partially
      duplicates sp_alloc().  As the paths were sp->lock is taken are not that
      performance critical this patch converts sp->lock to sp->mutex so it can
      sleep when calling sp_alloc().
      
      [kosaki.motohiro@jp.fujitsu.com: Original patch]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b22d127a
    • KOSAKI Motohiro's avatar
      mempolicy: remove mempolicy sharing · 869833f2
      KOSAKI Motohiro authored
      Dave Jones' system call fuzz testing tool "trinity" triggered the
      following bug error with slab debugging enabled
      
          =============================================================================
          BUG numa_policy (Not tainted): Poison overwritten
          -----------------------------------------------------------------------------
      
          INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
          INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
           __slab_alloc+0x3d3/0x445
           kmem_cache_alloc+0x29d/0x2b0
           mpol_new+0xa3/0x140
           sys_mbind+0x142/0x620
           system_call_fastpath+0x16/0x1b
      
          INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
           __slab_free+0x2e/0x1de
           kmem_cache_free+0x25a/0x260
           __mpol_put+0x27/0x30
           remove_vma+0x68/0x90
           exit_mmap+0x118/0x140
           mmput+0x73/0x110
           exit_mm+0x108/0x130
           do_exit+0x162/0xb90
           do_group_exit+0x4f/0xc0
           sys_exit_group+0x17/0x20
           system_call_fastpath+0x16/0x1b
      
          INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x          (null) flags=0x20000000004080
          INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0
      
      The problem is that the structure is being prematurely freed due to a
      reference count imbalance. In the following case mbind(addr, len) should
      replace the memory policies of both vma1 and vma2 and thus they will
      become to share the same mempolicy and the new mempolicy will have the
      MPOL_F_SHARED flag.
      
        +-------------------+-------------------+
        |     vma1          |     vma2(shmem)   |
        +-------------------+-------------------+
        |                                       |
       addr                                 addr+len
      
      alloc_pages_vma() uses get_vma_policy() and mpol_cond_put() pair for
      maintaining the mempolicy reference count.  The current rule is that
      get_vma_policy() only increments refcount for shmem VMA and
      mpol_conf_put() only decrements refcount if the policy has
      MPOL_F_SHARED.
      
      In above case, vma1 is not shmem vma and vma->policy has MPOL_F_SHARED!
      The reference count will be decreased even though was not increased
      whenever alloc_page_vma() is called.  This has been broken since commit
      [52cd3b07: mempolicy: rework mempolicy Reference Counting] in 2008.
      
      There is another serious bug with the sharing of memory policies.
      Currently, mempolicy rebind logic (it is called from cpuset rebinding)
      ignores a refcount of mempolicy and override it forcibly.  Thus, any
      mempolicy sharing may cause mempolicy corruption.  The bug was
      introduced by commit [68860ec1
      
      : cpusets: automatic numa mempolicy
      rebinding].
      
      Ideally, the shared policy handling would be rewritten to either
      properly handle COW of the policy structures or at least reference count
      MPOL_F_SHARED based exclusively on information within the policy.
      However, this patch takes the easier approach of disabling any policy
      sharing between VMAs.  Each new range allocated with sp_alloc will
      allocate a new policy, set the reference count to 1 and drop the
      reference count of the old policy.  This increases the memory footprint
      but is not expected to be a major problem as mbind() is unlikely to be
      used for fine-grained ranges.  It is also inefficient because it means
      we allocate a new policy even in cases where mbind_range() could use the
      new_policy passed to it.  However, it is more straight-forward and the
      change should be invisible to the user.
      
      [mgorman@suse.de: Edited changelog]
      Reported-by: default avatarDave Jones <davej@redhat.com&gt;,>
      Cc: Christoph Lameter <cl@linux.com>,
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      869833f2
    • KOSAKI Motohiro's avatar
      revert "mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages" · 8d34694c
      KOSAKI Motohiro authored
      Commit 05f144a0
      
       ("mm: mempolicy: Let vma_merge and vma_split handle
      vma->vm_policy linkages") removed vma->vm_policy updates code but it is
      the purpose of mbind_range().  Now, mbind_range() is virtually a no-op
      and while it does not allow memory corruption it is not the right fix.
      This patch is a revert.
      
      [mgorman@suse.de: Edited changelog]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d34694c
    • Mel Gorman's avatar
      mm: compaction: capture a suitable high-order page immediately when it is made available · 1fb3f8ca
      Mel Gorman authored
      
      
      While compaction is migrating pages to free up large contiguous blocks
      for allocation it races with other allocation requests that may steal
      these blocks or break them up.  This patch alters direct compaction to
      capture a suitable free page as soon as it becomes available to reduce
      this race.  It uses similar logic to split_free_page() to ensure that
      watermarks are still obeyed.
      
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1fb3f8ca
    • Mel Gorman's avatar
      mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures · 83fde0f2
      Mel Gorman authored
      
      
      If allocation fails after compaction then compaction may be deferred for
      a number of allocation attempts.  If there are subsequent failures,
      compact_defer_shift is increased to defer for longer periods.  This
      patch uses that information to scale the number of pages reclaimed with
      compact_defer_shift until allocations succeed again.  The rationale is
      that reclaiming the normal number of pages still allowed compaction to
      fail and its success depends on the number of pages.  If it's failing,
      reclaim more pages until it succeeds again.
      
      Note that this is not implying that VM reclaim is not reclaiming enough
      pages or that its logic is broken.  try_to_free_pages() always asks for
      SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
      what it does.  Direct reclaim stops normally with this check.
      
      	if (sc->nr_reclaimed >= sc->nr_to_reclaim)
      		goto out;
      
      should_continue_reclaim delays when that check is made until a minimum
      number of pages for reclaim/compaction are reclaimed.  It is possible
      that this patch could instead set nr_to_reclaim in try_to_free_pages()
      and drive it from there but that's behaves differently and not
      necessarily for the better.  If driven from do_try_to_free_pages(), it
      is also possible that priorities will rise.
      
      When they reach DEF_PRIORITY-2, it will also start stalling and setting
      pages for immediate reclaim which is more disruptive than not desirable
      in this case.  That is a more wide-reaching change that could cause
      another regression related to THP requests causing interactive jitter.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83fde0f2
    • Mel Gorman's avatar
      mm: compaction: update comment in try_to_compact_pages · 4ffb6335
      Mel Gorman authored
      Allocation success rates have been far lower since 3.4 due to commit
      fe2c2a10 ("vmscan: reclaim at order 0 when compaction is enabled").
      This commit was introduced for good reasons and it was known in advance
      that the success rates would suffer but it was justified on the grounds
      that the high allocation success rates were achieved by aggressive
      reclaim.  Success rates are expected to suffer even more in 3.6 due to
      commit 7db8889a ("mm: have order > 0 compaction start off where it
      left") which testing has shown to severely reduce allocation success
      rates under load - to 0% in one case.
      
      This series aims to improve the allocation success rates without
      regressing the benefits of commit fe2c2a10.  The series is based on
      latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going
      away.
      
      Patch 1 updates a stale comment seeing as I was in the general area.
      
      Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
      	of recent failures.
      
      Patch 3 captures suitable high-order pages freed by compaction to reduce
      	races with parallel allocation requests.
      
      Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction
      	start off where it left] to enable compaction again
      
      Patch 5 identifies when compacion is taking too long due to contention
      	and aborts.
      
      STRESS-HIGHALLOC
      		 3.6-rc1-akpm	  full-series
      Pass 1          36.00 ( 0.00%)    51.00 (15.00%)
      Pass 2          42.00 ( 0.00%)    63.00 (21.00%)
      while Rested    86.00 ( 0.00%)    86.00 ( 0.00%)
      
      From
      
        http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
      
      I know that the allocation success rates in 3.3.6 was 78% in comparison
      to 36% in in the current akpm tree.  With the full series applied, the
      success rates are up to around 51% with some variability in the results.
      This is not as high a success rate but it does not reclaim excessively
      which is a key point.
      
      MMTests Statistics: vmstat
      Page Ins                                     3050912     3078892
      Page Outs                                    8033528     8039096
      Swap Ins                                           0           0
      Swap Outs                                          0           0
      
      Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
      there were 71881 pages swapped out.
      
      Direct pages scanned                           70942      122976
      Kswapd pages scanned                         1366300     1520122
      Kswapd pages reclaimed                       1366214     1484629
      Direct pages reclaimed                         70936      105716
      Kswapd efficiency                                99%         97%
      Kswapd velocity                             1072.550    1182.615
      Direct efficiency                                99%         85%
      Direct velocity                               55.690      95.672
      
      The kswapd velocity changes very little as expected.  kswapd velocity is
      around the 1000 pages/sec mark where as in kernel 3.3.6 with the high
      allocation success rates it was 8140 pages/second.  Direct velocity is
      higher as a result of patch 2 of the series but this is expected and is
      acceptable.  The direct reclaim and kswapd velocities change very little.
      
      If these get accepted for merging then there is a difficulty in how they
      should be handled.  7db8889a ("mm: have order > 0 compaction start off
      where it left") is broken but it is already in 3.6-rc1 and needs to be
      fixed.  However, if just patch 4 from this series is applied then Jim
      Schutt's workload is known to break again as his workload also requires
      patch 5.  While it would be preferred to have all these patches in 3.6 to
      improve compaction in general, it would at least be acceptable if just
      patches 4 and 5 were merged to 3.6 to fix a known problem without breaking
      compaction completely.  On the face of it, that would force
      __GFP_NO_KSWAPD patches to be merged at the same time but I can do a
      version of this series with __GFP_NO_KSWAPD change reverted and then
      rebase it on top of this series.  That might be best overall because I
      note that the __GFP_NO_KSWAPD patch should have removed
      deferred_compaction from page_alloc.c but it didn't but fixing that causes
      collisions with this series.
      
      This patch:
      
      The comment about order applied when the check was order >
      PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d
      
       ("thp:
      use compaction for all allocation orders").  Fixing the comment while I'm
      in the general area.
      
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ffb6335
    • Hugh Dickins's avatar
      mm/mmap.c: replace find_vma_prepare() with clearer find_vma_links() · 6597d783
      Hugh Dickins authored
      People get confused by find_vma_prepare(), because it doesn't care about
      what it returns in its output args, when its callers won't be interested.
      
      Clarify by passing in end-of-range address too, and returning failure if
      any existing vma overlaps the new range: instead of returning an ambiguous
      vma which most callers then must check.  find_vma_links() is a clearer
      name.
      
      This does revert 2.6.27's dfe195fb
      
       ("mm: fix uninitialized variables
      for find_vma_prepare callers"), but it looks like gcc 4.3.0 was one of
      those releases too eager to shout about uninitialized variables: only
      copy_vma() warns with 4.5.1 and 4.7.1, which a BUG on error silences.
      
      [hughd@google.com: fix warning, remove BUG()]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6597d783
    • Robin Dong's avatar
      mm: fix nonuniform page status when writing new file with small buffer · d741c9cd
      Robin Dong authored
      
      
      When writing a new file with 2048 bytes buffer, such as write(fd, buffer,
      2048), it will call generic_perform_write() twice for every page:
      
      	write_begin
      	mark_page_accessed(page)
      	write_end
      
      	write_begin
      	mark_page_accessed(page)
      	write_end
      
      Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be
      added to active_list even they have be accessed twice because they are not
      PageLRU(page).  But when page 14th comes, all pages in lru-pvecs will be
      moved to inactive_list (by __lru_cache_add() ) in first write_begin(), now
      page 14th *is* PageLRU(page).  And after second write_end() only page 14th
      will be in active_list.
      
      In Hadoop environment, we do comes to this situation: after writing a
      file, we find out that only 14th, 28th, 42th...  page are in active_list
      and others in inactive_list.  Now kswapd works, shrinks the inactive_list,
      the file only have 14th, 28th...pages in memory, the readahead request
      size will be broken to only 52k (13*4k), system's performance falls
      dramatically.
      
      This problem can also replay by below steps (the machine has 8G memory):
      
      	1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
      	2. cat another 7.5G file to /dev/null
      	3. vmtouch -m 1G -v /test/file.out, it will show:
      
      	/test/file.out
      	[oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144
      
      	the 'o' means same pages are in memory but same are not.
      
      The solution for this problem is simple: the 14th page should be added to
      lru_add_pvecs before mark_page_accessed() just as other pages.
      
      [akpm@linux-foundation.org: tweak comment]
      [akpm@linux-foundation.org: grab better comment from the v3 patch]
      Signed-off-by: default avatarRobin Dong <sanbai@taobao.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d741c9cd
    • Konstantin Khlebnikov's avatar
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov authored
      
      
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
    • Konstantin Khlebnikov's avatar
      mm: prepare VM_DONTDUMP for using in drivers · 0103bd16
      Konstantin Khlebnikov authored
      
      
      Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags:
      VM_DONTEXPAND, VM_DONTCOPY.  Currently this flag used only for
      sys_madvise.  The next patch will use it for replacing the outdated flag
      VM_RESERVED.
      
      Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL
      (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0103bd16
    • Konstantin Khlebnikov's avatar
      mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
      Konstantin Khlebnikov authored
      
      
      Currently the kernel sets mm->exe_file during sys_execve() and then tracks
      number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
      as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
      resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
      
      VMA with VM_EXECUTABLE flag appears after mapping file with flag
      MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
      splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
      mm->exe_file and mmaps executable vmas with this file, they hold
      mm->exe_file while task is running.
      
      comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
      where all this stuff was introduced:
      
      > The kernel implements readlink of /proc/pid/exe by getting the file from
      > the first executable VMA.  Then the path to the file is reconstructed and
      > reported as the result.
      >
      > Because of the VMA walk the code is slightly different on nommu systems.
      > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      > walking the VMAs to find the first executable file-backed VMA we store a
      > reference to the exec'd file in the mm_struct.
      >
      > That reference would prevent the filesystem holding the executable file
      > from being unmounted even after unmapping the VMAs.  So we track the number
      > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      > unmapped.  This avoids pinning the mounted filesystem.
      
      exe_file's vma accounting is hooked into every file mmap/unmmap and vma
      split/merge just to fix some hypothetical pinning fs from umounting by mm,
      which already unmapped all its executable files, but still alive.
      
      Seems like currently nobody depends on this behaviour.  We can try to
      remove this logic and keep mm->exe_file until final mmput().
      
      mm->exe_file is still protected with mm->mmap_sem, because we want to
      change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
      task can change its mm->exe_file and unpin mountpoint explicitly.
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e9714acf
    • Konstantin Khlebnikov's avatar
      mm: use mm->exe_file instead of first VM_EXECUTABLE vma->vm_file · 2dd8ad81
      Konstantin Khlebnikov authored
      
      
      Some security modules and oprofile still uses VM_EXECUTABLE for retrieving
      a task's executable file.  After this patch they will use mm->exe_file
      directly.  mm->exe_file is protected with mm->mmap_sem, so locking stays
      the same.
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>			[arch/tile]
      Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	[tomoyo]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Acked-by: default avatarJames Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2dd8ad81
    • Konstantin Khlebnikov's avatar
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov authored
      
      
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
    • Konstantin Khlebnikov's avatar
      mm: kill vma flag VM_INSERTPAGE · 4b6e1e37
      Konstantin Khlebnikov authored
      
      
      Merge VM_INSERTPAGE into VM_MIXEDMAP.  VM_MIXEDMAP VMA can mix pure-pfn
      ptes, special ptes and normal ptes.
      
      Now copy_page_range() always copies VM_MIXEDMAP VMA on fork like
      VM_PFNMAP.  If driver populates whole VMA at mmap() it probably not
      expects page-faults.
      
      This patch removes special check from vma_wants_writenotify() which
      disables pages write tracking for VMA populated via vm_instert_page().
      BDI below mapped file should not use dirty-accounting, moreover
      do_wp_page() can handle this.
      
      vm_insert_page() still marks vma after first usage.  Usually it is called
      from f_op->mmap() handler under mm->mmap_sem write-lock, so it able to
      change vma->vm_flags.  Caller must set VM_MIXEDMAP at mmap time if it
      wants to call this function from other places, for example from page-fault
      handler.
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b6e1e37
    • Konstantin Khlebnikov's avatar
      mm: introduce arch-specific vma flag VM_ARCH_1 · cc2383ec
      Konstantin Khlebnikov authored
      
      
      Combine several arch-specific vma flags into one.
      
      before patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
      powerpc -               -               VM_SAO          -
      parisc  VM_GROWSUP      -               -               -
      ia64    VM_GROWSUP      -               -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               -               -               -
      
      after patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
      powerpc -               VM_SAO          -               -
      parisc  -               VM_GROWSUP      -               -
      ia64    -               VM_GROWSUP      -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               VM_ARCH_1       -               -
      
      And voila! One completely free bit.
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc2383ec
    • Konstantin Khlebnikov's avatar
      mm, x86, pat: rework linear pfn-mmap tracking · b3b9c293
      Konstantin Khlebnikov authored
      
      
      Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.
      
      We can toss mapping address from remap_pfn_range() into
      track_pfn_vma_new(), and collect all PAT-related logic together in
      arch/x86/.
      
      This patch also restores orignal frustration-free is_cow_mapping() check
      in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73
      ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3")
      
      is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
      because it already handled by VM_PFNMAP in VM_NO_THP bit-mask.
      
      [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3b9c293
    • Suresh Siddha's avatar
      x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn · 5180da41
      Suresh Siddha authored
      
      
      With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
      attribute and uses it.  Expectation is that the driver reserves the
      memory attributes for the pfn before calling vm_insert_pfn().
      
      remap_pfn_range() (when called for the whole vma) will setup a new
      attribute (based on the prot argument) for the specified pfn range.
      This addresses the legacy usage which typically calls remap_pfn_range()
      with a desired memory attribute.  For ranges smaller than the vma size
      (which is typically not the case), remap_pfn_range() will use the
      existing memory attribute for the pfn range.
      
      Expose two different API's for these different behaviors.
      track_pfn_insert() for tracking the pfn attribute set by vm_insert_pfn()
      and track_pfn_remap() for the remap_pfn_range().
      
      This cleanup also prepares the ground for the track/untrack pfn vma
      routines to take over the ownership of setting PAT specific vm_flag in
      the 'vma'.
      
      [khlebnikov@openvz.org: Clear checks in track_pfn_remap()]
      [akpm@linux-foundation.org: tweak a few comments]
      Signed-off-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5180da41
    • Suresh Siddha's avatar
      x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines · b1a86e15
      Suresh Siddha authored
      
      
      'pfn' argument for track_pfn_vma_new() can be used for reserving the
      attribute for the pfn range.  No need to depend on 'vm_pgoff'
      
      Similarly, untrack_pfn_vma() can depend on the 'pfn' argument if it is
      non-zero or can use follow_phys() to get the starting value of the pfn
      range.
      
      Also the non zero 'size' argument can be used instead of recomputing it
      from vma.
      
      This cleanup also prepares the ground for the track/untrack pfn vma
      routines to take over the ownership of setting PAT specific vm_flag in the
      'vma'.
      
      [khlebnikov@openvz.org: Clear pfn to paddr conversion]
      Signed-off-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b1a86e15
    • Rik van Riel's avatar
      mm: remove __GFP_NO_KSWAPD · c6543459
      Rik van Riel authored
      
      
      When transparent huge pages were introduced, memory compaction and swap
      storms were an issue, and the kernel had to be careful to not make THP
      allocations cause pageout or compaction.
      
      Now that we have working compaction deferral, kswapd is smart enough to
      invoke compaction and the quadratic behaviour around isolate_free_pages
      has been fixed, it should be safe to remove __GFP_NO_KSWAPD.
      
      [minchan@kernel.org: Comment fix]
      [mgorman@suse.de: Avoid direct reclaim for deferred compaction]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6543459
    • Srivatsa S. Bhat's avatar
      CPU hotplug, debug: detect imbalance between get_online_cpus() and put_online_cpus() · 075663d1
      Srivatsa S. Bhat authored
      
      
      The synchronization between CPU hotplug readers and writers is achieved
      by means of refcounting, safeguarded by the cpu_hotplug.lock.
      
      get_online_cpus() increments the refcount, whereas put_online_cpus()
      decrements it.  If we ever hit an imbalance between the two, we end up
      compromising the guarantees of the hotplug synchronization i.e, for
      example, an extra call to put_online_cpus() can end up allowing a
      hotplug reader to execute concurrently with a hotplug writer.
      
      So, add a WARN_ON() in put_online_cpus() to detect such cases where the
      refcount can go negative, and also attempt to fix it up, so that we can
      continue to run.
      
      Signed-off-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: And...
      075663d1
    • Catalin Marinas's avatar
      Kconfig: clean up the "#if defined(arch)" list for exception-trace sysctl entry · 7ac57a89
      Catalin Marinas authored
      
      
      Introduce SYSCTL_EXCEPTION_TRACE config option and selec it in the
      architectures requiring support for the "exception-trace" debug_table
      entry in kernel/sysctl.c.
      
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ac57a89
    • Catalin Marinas's avatar
      Kconfig: clean up the long arch list for the DEBUG_BUGVERBOSE config option · 9b2a60c4
      Catalin Marinas authored
      
      
      Introduce HAVE_DEBUG_BUGVERBOSE config option and select it in
      corresponding architecture Kconfig files.  Architectures that already
      select GENERIC_BUG don't need to select HAVE_DEBUG_BUGVERBOSE.
      
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b2a60c4
    • Catalin Marinas's avatar
      Kconfig: clean up the long arch list for the DEBUG_KMEMLEAK config option · b69ec42b
      Catalin Marinas authored
      
      
      Introduce HAVE_DEBUG_KMEMLEAK config option and select it in corresponding
      architecture Kconfig files.  DEBUG_KMEMLEAK now only depends on
      HAVE_DEBUG_KMEMLEAK.
      
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b69ec42b
    • Catalin Marinas's avatar
      Kconfig: clean up the long arch list for the UID16 config option · af1839eb
      Catalin Marinas authored
      
      
      Introduce HAVE_UID16 config option and select it in corresponding
      architecture Kconfig files.  UID16 now only depends on HAVE_UID16.
      
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af1839eb
    • Konrad Rzeszutek Wilk's avatar
      MAINTAINERS: add Konrad as the SWIOTLB maintainer · 6e28b761
      Konrad Rzeszutek Wilk authored
      
      
      Now that I've an IA64 box on top of the other boxes (IBM with Calgary-X,
      Intel VT-d, AMD Vi, and AMD GART - that can use SWIOTLB as fallback) I can
      reliably do regression testing.
      
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: default avatar"H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6e28b761
    • Linus Torvalds's avatar
      Merge tag 'sound-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · f5a246ea
      Linus Torvalds authored
      Pull sound updates from Takashi Iwai:
       "This contains pretty many small commits covering fairly large range of
        files in sound/ directory.  Partly because of additional API support
        and partly because of constantly developed ASoC and ARM stuff.
      
        Some highlights:
      
         - Introduced the helper function and documentation for exposing the
           channel map via control API, as discussed in Plumbers; most of PCI
           drivers are covered, will follow more drivers later
      
         - Most of drivers have been replaced with the new PM callbacks (if
           the bus is supported)
      
         - HD-audio controller got the support of runtime PM and the support
           of D3 clock-stop.  Also changing the power_save option in sysfs
           kicks off immediately to enable / disable the power-save mode.
      
         - Another significant code change in HD-audio is the rewrite of
           firmware loading code.  Other than that, most of changes in
           HD-audio are continued cleanups and standardization for the generic
           auto parser and bug fixes (HBR, device-specific fixups), in
           addition to the support of channel-map API.
      
         - Addition of ASoC bindings for the compressed API, used by the
           mid-x86 drivers.
      
         - Lots of cleanups and API refreshes for ASoC codec drivers and
           DaVinci.
      
         - Conversion of OMAP to dmaengine.
      
         - New machine driver for Wolfson Microelectronics Bells.
      
         - New CODEC driver for Wolfson Microelectronics WM0010.
      
         - Enhancements to the ux500 and wm2000 drivers
      
         - A new driver for DA9055 and the support for regulator bypass mode."
      
      Fix up various arm soc header file reorg conflicts.
      
      * tag 'sound-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (339 commits)
        ALSA: hda - Add new codec ALC283 ALC290 support
        ALSA: hda - avoid unneccesary indices on "Headphone Jack" controls
        ALSA: hda - fix indices on boost volume on Conexant
        ALSA: aloop - add locking to timer access
        ALSA: hda - Fix hang caused by race during suspend.
        sound: Remove unnecessary semicolon
        ALSA: hda/realtek - Fix detection of ALC271X codec
        ALSA: hda - Add inverted internal mic quirk for Lenovo IdeaPad U310
        ALSA: hda - make Realtek/Sigmatel/Conexant use the generic unsol event
        ALSA: hda - make a generic unsol event handler
        ASoC: codecs: Add DA9055 codec driver
        ASoC: eukrea-tlv320: Convert it to platform driver
        ALSA: ASoC: add DT bindings for CS4271
        ASoC: wm_hubs: Ensure volume updates are handled during class W startup
        ASoC: wm5110: Adding missing volume update bits
        ASoC: wm5110: Add OUT3R support
        ASoC: wm5110: Add AEC loopback support
        ASoC: wm5110: Rename EPOUT to HPOUT3
        ASoC: arizona: Add more clock rates
        ASoC: arizona: Add more DSP options for mixer input muxes
        ...
      f5a246ea
    • Oleg Nesterov's avatar
      exec: make de_thread() killable · d5bbd43d
      Oleg Nesterov authored
      
      
      Change de_thread() to use KILLABLE rather than UNINTERRUPTIBLE while
      waiting for other threads.  The only complication is that we should
      clear ->group_exit_task and ->notify_count before we return, and we
      should do this under tasklist_lock.  -EAGAIN is used to match the
      initial signal_group_exit() check/return, it doesn't really matter.
      
      This fixes the (unlikely) race with coredump.  de_thread() checks
      signal_group_exit() before it starts to kill the subthreads, but this
      can't help if another CLONE_VM (but non CLONE_THREAD) task starts the
      coredumping after de_thread() unlocks ->siglock.  In this case the
      killed sub-thread can block in exit_mm() waiting for coredump_finish(),
      execing thread waits for that sub-thead, and the coredumping thread
      waits for execing thread.  Deadlock.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5bbd43d
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 · b5356a19
      Linus Torvalds authored
      Pull arm64 changes from Catalin Marinas:
       "arm64 fixes:
         - Use swiotlb_init() instead of swiotlb_init_with_default_size().
           The latter is now a static function (commit 74838b75 "swiotlb:
           add the late swiotlb initialization function with iotlb memory").
         - Enable interrupts before calling do_notify_resume().
      
        arm64 clean-up:
         - Use the generic implementation of compat_sys_sendfile() on arm64 as
           commit 8f9c0119 (introducing the function) has been merged."
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64:
        arm64: Enable interrupts before calling do_notify_resume()
        arm64: Use the generic compat_sys_sendfile() implementation
        arm64: Call swiotlb_init() instead of swiotlb_init_with_default_size()
      b5356a19
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · 3c5af8d1
      Linus Torvalds authored
      Pull sparc changes from David S Miller:
       "There is an attempt to fix a bad interaction between syscall tracing
        and force_successful_syscall() from Al Viro, but it needs to be redone
        as it introduced regressions and thus had to be reverted for now.
      
        Al is working on an updated version.
      
        But what we do have here are some significant bzero/memset
        improvements for Niagara-4.  An 8K page can be cleared in around 600
        cycles, because we essentially have a store that behaves like
        powerpc's dcbz that we can actually make real use of."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        Revert strace hiccups fix.
        sparc64: Niagara-4 bzero/memset, plus use MRU stores in page copy.
        sparc64: Fix strace hiccups when force_successful_syscall() triggers.
        sparc64: Rearrange thread info to cheaply clear syscall noerror state.
      3c5af8d1
    • Catalin Marinas's avatar
      arm64: Enable interrupts before calling do_notify_resume() · 6916fd08
      Catalin Marinas authored
      task_work_run() implementation had the side effect of enabling
      interrupts. With commit ac3d0da8
      
       (task_work: Make task_work_add()
      lockless), interrupts are no longer enabled revealing the bug in the
      arch code. This patch enables the interrupt explicitly before calling
      do_notify_resume().
      
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      6916fd08
  2. Oct 08, 2012
    • Catalin Marinas's avatar
      arm64: Use the generic compat_sys_sendfile() implementation · e048d004
      Catalin Marinas authored
      The generic implementation of compat_sys_sendfile() has been introduced
      by commit 8f9c0119
      
      . This patch removes the arm64 implementation in
      favour of the generic one.
      
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      e048d004
    • Catalin Marinas's avatar
      arm64: Call swiotlb_init() instead of swiotlb_init_with_default_size() · 27222a3d
      Catalin Marinas authored
      Following commit 74838b75
      
       (swiotlb: add the late swiotlb initialization
      function with iotlb memory) the swiotlb_init_with_default_size() is a
      static function. This patch changes the arm64 code to call
      swiotlb_init() instead and use the default size of 64MB. It is assumed
      that AArch64 platforms have enough RAM to afford the pre-allocated
      swiotlb memory. It also removes the #ifdef around this call since
      CONFIG_SWIOTLB is always enabled.
      
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      27222a3d
    • Linus Torvalds's avatar
      Merge tag 'upstream-3.7-rc1-fastmap' of git://git.infradead.org/linux-ubi · e9eca4de
      Linus Torvalds authored
      Pull UBI fastmap changes from Artem Bityutskiy:
       "This pull request contains the UBI fastmap support implemented by
        Richard Weinberger from Linutronix.  Fastmap is designed to address
        UBI's slow scanning issues.  Namely, it introduces a new on-flash
        data-structure called "fastmap", which stores the information about
        logical<->physical eraseblocks mappings.  So now to get this
        information just read the fastmap, instead of doing full scan.  More
        information here can be found in Richard's announcement in LKML
        (Subject: UBI: Fastmap request for inclusion (v19)):
      
           http://thread.gmane.org/gmane.linux.kernel/1364922/focus=1369109
      
        One thing I want to explicitly say is that fastmap did not have large
        enough linux-next exposure.  It is partially my fault - I did not
        respond quickly enough.  I _really_ apologize for this.  But it had
        good testing and disabled by default, so I do not expect that we'll
        break anything.
      
        Fastmap is declared as experimental so far, and it is off by default.
        We did declare that the on-flash format may be changed.  The reason
        for this is that no one used it in real production so far, so there is
        a high risk that something is missing.  Besides, we do not have
        user-space tools supporting fastmap so far.
      
        Nevertheless, I suggest we merge this feature.  Many people want UBI's
        scanning bottleneck to be fixed and merging fastmap now should
        accelerate its production use.  The plan is to make it bullet-prove,
        somewhat clean-up, and make it the default for UBI.  I do not know how
        many kernel releases will it take.
      
        Basically, I what I want to do for fastmap is something like Linus did
        for btrfs few years ago."
      
      * tag 'upstream-3.7-rc1-fastmap' of git://git.infradead.org/linux-ubi:
        UBI: Wire-up fastmap
        UBI: Add fastmap core
        UBI: Add fastmap support to the WL sub-system
        UBI: Add fastmap stuff to attach.c
        UBI: Wire-up ->fm_sem
        UBI: Add fastmap bits to build.c
        UBI: Add self_check_eba()
        UBI: Export next_sqnum()
        UBI: Add fastmap stuff to ubi.h
        UBI: Add fastmap on-flash data structures
      e9eca4de