  1. Oct 08, 2016
    • zhong jiang's avatar
      mm: remove unnecessary condition in remove_inode_hugepages · 72e2936c
      zhong jiang authored
      
      
      When a huge page is added to the page cache (huge_add_to_page_cache),
      the page private flag is cleared.  Since this code
      (remove_inode_hugepages) is only called for pages in the page cache,
      PagePrivate(page) will always be false.
      
      This patch removes the code without any functional change.
      
      Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72e2936c
    • Michal Hocko's avatar
      mm: warn about allocations which stall for too long · 63f53dea
      Michal Hocko authored
      
      
      Currently we only warn about allocation failures, but small allocations
      are basically nofail and they might loop in the page allocator for a
      long time.  This is especially a problem when reclaim cannot make any
      progress - e.g.  GFP_NOFS cannot invoke the oom killer and has to rely
      on a different context to make forward progress when there is a lot of
      memory used by filesystems.
      
      Give us at least a clue when something like this happens and warn about
      allocations which take more than 10s.  Print the basic allocation
      context information along with the cumulative time spent in the
      allocation as well as the allocation stack.  Repeat the warning after
      every 10 seconds so that we know that the problem is permanent rather
      than ephemeral.
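      
      As a rough illustration only (a userspace C sketch, not the kernel
      code), the warn-and-repeat pattern described above looks like this;
      time is simulated so the example finishes instantly:
      
      	#include <stdio.h>
      
      	int main(void)
      	{
      		unsigned int elapsed_ms = 0;
      		unsigned int stall_timeout_ms = 10 * 1000;	/* first warning after 10s */
      
      		/* pretend the allocation retry loop runs for 35 simulated seconds */
      		while (elapsed_ms < 35 * 1000) {
      			/* ... one reclaim/compaction retry would happen here ... */
      			elapsed_ms += 250;			/* simulated time per attempt */
      
      			if (elapsed_ms > stall_timeout_ms) {
      				fprintf(stderr, "page allocation stalls for %ums\n",
      					elapsed_ms);
      				stall_timeout_ms += 10 * 1000;	/* repeat every further 10s */
      			}
      		}
      		return 0;
      	}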
      
      Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63f53dea
    • Michal Hocko's avatar
      mm: consolidate warn_alloc_failed users · 7877cdcc
      Michal Hocko authored
      
      
      warn_alloc_failed is currently used from the page and vmalloc
      allocators.  This is a good reuse of the code except that vmalloc would
      appreciate a slightly different warning message.  This is already
      handled by the fmt parameter except that
      
        "%s: page allocation failure: order:%u, mode:%#x(%pGg)"
      
      is printed anyway.  This might be quite misleading because the warning
      may come from a vmalloc failure while the page allocator is not the
      culprit at all.  Fix this by always using the fmt string and printing
      only the information that makes sense for the particular context (e.g.
      order makes very little sense for the vmalloc context).
      
      Rename the function so that no user is missed, and also because a later
      patch will reuse it for !failure cases as well.
      
      Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7877cdcc
    • Wei Fang's avatar
      vfs,mm: fix a dead loop in truncate_inode_pages_range() · c2a9737f
      Wei Fang authored
      
      
      We triggered a dead loop in truncate_inode_pages_range() on a 32-bit
      architecture with the test case below:
      
      	...
      	fd = open();
      	write(fd, buf, 4096);
      	preadv64(fd, &iovec, 1, 0xffffffff000);
      	ftruncate(fd, 0);
      	...
      
      Then ftruncate() never returns.
      
      The filesystem used in this case is ubifs, but it can be triggered on
      many other filesystems.
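      
      For reference, a self-contained version of the reproducer sketched above
      (the file name and flags are illustrative; the hang is only observable
      on an affected 32-bit kernel):
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <string.h>
      	#include <sys/uio.h>
      	#include <unistd.h>
      
      	int main(void)
      	{
      		char buf[4096];
      		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
      		int fd = open("testfile", O_RDWR | O_CREAT, 0600);
      
      		if (fd < 0)
      			return 1;
      		memset(buf, 0, sizeof(buf));
      		if (write(fd, buf, sizeof(buf)) < 0)
      			return 1;
      		/* adds a page with index=0xffffffff to the file's page cache */
      		preadv64(fd, &iov, 1, 0xffffffff000ULL);
      		/* on an affected 32-bit kernel this call never returns */
      		ftruncate(fd, 0);
      		close(fd);
      		return 0;
      	}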
      
      When preadv64() is called with offset=0xffffffff000, a page with
      index=0xffffffff will be added to the radix tree of ->mapping.  Then
      this page can be found in ->mapping with pagevec_lookup().  After that,
      truncate_inode_pages_range(), which is called in ftruncate(), will fall
      into an infinite loop:
      
       - find a page with index=0xffffffff, since index>=end, this page won't
         be truncated
      
       - index++, and index become 0
      
       - the page with index=0xffffffff will be found again
      
      The data type of index is unsigned long, so index won't overflow to 0 on
      a 64-bit architecture in this case, and the dead loop won't happen.
      
      Since truncate_inode_pages_range() is executed while holding
      inode->i_rwsem, any operation that needs this lock will be blocked, and
      a hung task will result, e.g.:
      
        INFO: task truncate_test:3364 blocked for more than 120 seconds.
        ...
           call_rwsem_down_write_failed+0x17/0x30
           generic_file_write_iter+0x32/0x1c0
           ubifs_write_iter+0xcc/0x170
           __vfs_write+0xc4/0x120
           vfs_write+0xb2/0x1b0
           SyS_write+0x46/0xa0
      
      The page with index=0xffffffff added to ->mapping is useless.  Fix this
      by checking the read position before allocating pages.
      
      Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
      Signed-off-by: Wei Fang <fangwei1@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2a9737f
    • Yisheng Xie's avatar
      arm64 Kconfig: select gigantic page · 14f09910
      Yisheng Xie authored
      Arm64 has supported gigantic pages since commit 084bd298 ("ARM64: mm:
      HugeTLB support."); however, they can only be allocated at boot time and
      cannot be freed.
      
      This patch selects ARCH_HAS_GIGANTIC_PAGE so that gigantic pages can be
      allocated and freed at runtime on arm64.
      
      Link: http://lkml.kernel.org/r/1475227569-63446-3-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14f09910
    • Yisheng Xie's avatar
      mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE · 461a7184
      Yisheng Xie authored
      
      
      Avoid the #ifdef getting unwieldy as more architectures support gigantic
      pages.  No functional change with this patch.
      
      Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      461a7184
    • Michal Hocko's avatar
      oom: print nodemask in the oom report · 82e7d3ab
      Michal Hocko authored
      
      
      We have received a hard-to-explain oom report from a customer.  The oom
      triggered even though there was a lot of free memory:
      
        PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
        PoolThread cpuset=/ mems_allowed=0-7
        Pid: 30055, comm: PoolThread Tainted: G           E X 3.0.101-80-default #1
        Call Trace:
          dump_trace+0x75/0x300
          dump_stack+0x69/0x6f
          dump_header+0x8e/0x110
          oom_kill_process+0xa6/0x350
          out_of_memory+0x2b7/0x310
          __alloc_pages_slowpath+0x7dd/0x820
          __alloc_pages_nodemask+0x1e9/0x200
          alloc_pages_vma+0xe1/0x290
          do_anonymous_page+0x13e/0x300
          do_page_fault+0x1fd/0x4c0
          page_fault+0x25/0x30
        [...]
        active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
         active_file:13093 inactive_file:222506 isolated_file:0
         unevictable:262144 dirty:2 writeback:0 unstable:0
         free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
         mapped:261139 shmem:166297 pagetables:2228282 bounce:0
        [...]
        Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
        lowmem_reserve[]: 0 2892 775542 775542
        Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 772650 772650
        Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
      
      So all nodes but Node 0 have a lot of free memory, which should suggest
      that memory is available, especially when mems_allowed=0-7.  One could
      speculate that a massive process managed to terminate and free up a lot
      of memory while racing with the above allocation request.  Although this
      is highly unlikely, it cannot be ruled out.
      
      Further debugging, however, showed that the faulting process had a
      mempolicy (not a cpuset) binding it to Node 0.  We cannot see that
      information from the report though.  mems_allowed turned out to be more
      confusing than really helpful.
      
      Fix this by always printing the nodemask.  It is either the mempolicy
      mask (when non-null) or the one defined by the cpusets.  The new output
      for the above oom report would be:
      
        PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0
      
      This patch doesn't touch show_mem and the node filtering based on the
      cpuset node mask because mempolicy is always a subset of cpusets and
      seeing the full cpuset oom context might be helpful for tuning more
      specific mempolicies inside cpusets (e.g.  when they turn out to be too
      restrictive).  To avoid ugly ifdefs, the mask is printed even for !NUMA
      configurations, but this should be OK (a single node will be printed).
      
      Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Sellami Abdelkader <abdelkader.sellami@sap.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82e7d3ab
    • Kirill A. Shutemov's avatar
      mm: clarify why we avoid page_mapcount() for slab pages in dump_page() · 9996f05e
      Kirill A. Shutemov authored
      
      
      Let's add a comment on why we skip page_mapcount() for sl[aou]b pages.
      
      Link: http://lkml.kernel.org/r/20160922105532.GB24593@node
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9996f05e
    • Andrea Arcangeli's avatar
      mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb · 8f26e0b1
      Andrea Arcangeli authored
      
      
      The old code was always doing:
      
         vma->vm_end = next->vm_end
         vma_rb_erase(next) // in __vma_unlink
         vma->vm_next = next->vm_next // in __vma_unlink
         next = vma->vm_next
         vma_gap_update(next)
      
      The new code still does the above for remove_next == 1 and 2, but for
      remove_next == 3 it has been changed and it does:
      
         next->vm_start = vma->vm_start
         vma_rb_erase(vma) // in __vma_unlink
         vma_gap_update(next)
      
      In the latter case, while unlinking "vma", validate_mm_rb() is told to
      ignore the "vma" that is being removed, but next->vm_start was reduced
      instead.  So for the new case, to avoid the false positive from
      validate_mm_rb, it should be "next" that is ignored when "vma" is being
      unlinked.
      
      "vma" and "next" above refer to the VMAs as they were before the swap().
      
      Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f26e0b1
    • Andrea Arcangeli's avatar
      mm: vma_adjust: minor comment correction · 86d12e47
      Andrea Arcangeli authored
      
      
      There are three cases, not two.
      
      Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86d12e47
    • Andrea Arcangeli's avatar
      mm: vma_adjust: remove superfluous check for next not NULL · 97a42cd4
      Andrea Arcangeli authored
      
      
      If next were NULL, we could not reach this code path.
      
      Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a42cd4
    • Andrea Arcangeli's avatar
      mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk · e86f15ee
      Andrea Arcangeli authored
      
      
      The rmap_walk can access vm_page_prot (and potentially vm_flags in the
      pte/pmd manipulations).  So it's not safe to wait for the caller to
      update vm_page_prot/vm_flags after vma_merge has returned, potentially
      having removed the "next" vma and extended the "current" vma over the
      next->vm_start,vm_end range, but still with the "current" vma's
      vm_page_prot, after releasing the rmap locks.
      
      The vm_page_prot/vm_flags must be transferred from the "next" vma to the
      current vma while vma_merge still holds the rmap locks.
      
      The side effect of this race condition is pte corruption during
      migration, because remove_migration_ptes, when run on an address of the
      "next" vma that got removed, used the vm_page_prot of the current vma.
      
        migrate   	      	        mprotect
        ------------			-------------
        migrating in "next" vma
      				vma_merge() # removes "next" vma and
      			        	    # extends "current" vma
      					    # current vma is not with
      					    # vm_page_prot updated
        remove_migration_ptes
        read vm_page_prot of current "vma"
        establish pte with wrong permissions
      				vm_set_page_prot(vma) # too late!
      				change_protection in the old vma range
      				only, next range is not updated
      
      This caused segmentation faults and potentially memory corruption in
      heavy mprotect loads with some light page migration caused by compaction
      in the background.
      
      Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge,
      which confirms that case 8 is the only buggy case where the race can
      trigger; in all other vma_merge cases the above cannot happen.
      
      This fix removes the oddness factor from case 8 and converts it from:
      
            AAAA
        PPPPNNNNXXXX -> PPPPNNNNNNNN
      
      to:
      
            AAAA
        PPPPNNNNXXXX -> PPPPXXXXXXXX
      
      XXXX has the right vma properties for the whole merged vma returned by
      vma_adjust, so it solves the problem fully.  It has the added benefit
      that the callers could stop updating vma properties when vma_merge
      succeeds; however, the callers are not updated by this patch (there are
      bits like VM_SOFTDIRTY that still need special care for the whole range,
      as the vma merging ignores them, but as long as they're not processed by
      rmap walks and instead are accessed with the mmap_sem held at least for
      reading, they are fine not to be updated within vma_adjust before
      releasing the rmap_locks).
      
      Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Aditya Mandaleeka <adityam@microsoft.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e86f15ee
    • Andrea Arcangeli's avatar
      mm: vma_adjust: remove superfluous confusing update in remove_next == 1 case · fb8c41e9
      Andrea Arcangeli authored
      
      
      mm->highest_vm_end doesn't need any update.
      
      After finally removing the oddness from vma_merge case 8 that was
      causing:
      
      1) constant risk of trouble whenever anybody would check vma fields
         from rmap_walks, like it happened when page migration was
         introduced and it read the vma->vm_page_prot from a rmap_walk
      
      2) the callers of vma_merge having to re-initialize any value different
         from the current vma, instead of vma_merge() more reliably returning
         a vma that already matches all fields passed as parameters
      
      .. it is also worth taking the opportunity to clean up superfluous code
      in vma_adjust() which, if not removed, adds to the poor readability of
      the function.
      
      Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb8c41e9
    • Andrea Arcangeli's avatar
      mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE · 6d2329f8
      Andrea Arcangeli authored
      
      
      vma->vm_page_prot is read locklessly from the rmap_walk; it may be
      updated concurrently, and using WRITE_ONCE/READ_ONCE prevents the risk
      of reading intermediate values.
      
      Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d2329f8
    • zhong jiang's avatar
      mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node() · 6213055f
      zhong jiang authored
      
      
      As Hugh suggested, alloc_stable_node() with GFP_KERNEL can in rare cases
      cause a hung task warning.
      
      At present, if alloc_stable_node() allocation fails, two break_cows may
      want to allocate a couple of pages, and the issue will come up when free
      memory is under pressure.
      
      We fix it by adding __GFP_HIGH to GFP, to grant access to memory
      reserves, increasing the likelihood of allocation success.
      
      [akpm@linux-foundation.org: tweak comment]
      Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6213055f
    • Yisheng Xie's avatar
      mm/page_isolation: fix typo: "paes" -> "pages" · ac34dcd2
      Yisheng Xie authored
      
      
      Fix typo in comment.
      
      Link: http://lkml.kernel.org/r/1474788764-5774-1-git-send-email-ysxie@foxmail.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac34dcd2
    • Gerald Schaefer's avatar
      mm/hugetlb: improve locking in dissolve_free_huge_pages() · eb03aa00
      Gerald Schaefer authored
      
      
      For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
      call dissolve_free_huge_page(), which takes the hugetlb spinlock, even
      if the page is not huge at all or is a hugepage that is in use.
      
      Improve this by doing the PageHuge() and page_count() checks already in
      dissolve_free_huge_pages() before calling dissolve_free_huge_page().  In
      dissolve_free_huge_page(), when holding the spinlock, those checks need
      to be revalidated.
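      
      As a rough standalone C analogue of that check-then-revalidate pattern
      (not the kernel code; a plain flag stands in for the PageHuge() and
      page_count() state):
      
      	#include <pthread.h>
      	#include <stdbool.h>
      	#include <stdio.h>
      
      	static pthread_mutex_t hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;
      	static bool page_is_free_huge = true;	/* PageHuge() && !page_count() */
      
      	static void dissolve_one(void)
      	{
      		/* cheap precheck without the lock: skip most pages early */
      		if (!page_is_free_huge)
      			return;
      
      		pthread_mutex_lock(&hugetlb_lock);
      		/* revalidate under the lock: state may have changed meanwhile */
      		if (page_is_free_huge) {
      			page_is_free_huge = false;	/* "dissolve" the page */
      			printf("dissolved\n");
      		}
      		pthread_mutex_unlock(&hugetlb_lock);
      	}
      
      	int main(void)
      	{
      		dissolve_one();
      		dissolve_one();	/* skipped by the lockless precheck */
      		return 0;
      	}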
      
      Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb03aa00
    • Gerald Schaefer's avatar
      mm/hugetlb: check for reserved hugepages during memory offline · 082d5b6b
      Gerald Schaefer authored
      In dissolve_free_huge_pages(), free hugepages will be dissolved without
      making sure that there are enough of them left to satisfy hugepage
      reservations.
      
      Fix this by adding a return value to dissolve_free_huge_pages() and
      checking h->free_huge_pages vs.  h->resv_huge_pages.  Note that this may
      lead to the situation where dissolve_free_huge_page() returns an error
      and all free hugepages that were dissolved before that error are lost,
      while the memory block still cannot be set offline.
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      082d5b6b
    • Gerald Schaefer's avatar
      mm/hugetlb: fix memory offline with hugepage size > memory block size · 2247bb33
      Gerald Schaefer authored
      Patch series "mm/hugetlb: memory offline issues with hugepages", v4.
      
      This addresses several issues with hugepages and memory offline.  While
      the first patch fixes a panic, and is therefore rather important, the
      last patch is just a performance optimization.
      
      The second patch fixes a theoretical issue with reserved hugepages,
      while still leaving an ugly usability issue; see its description.
      
      This patch (of 3):
      
      dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
      list corruption and addressing exception when trying to set a memory
      block offline that is part (but not the first part) of a "gigantic"
      hugetlb page with a size > memory block size.
      
      When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
      will trigger directly.  In the other case we will run into an addressing
      exception later, because dissolve_free_huge_page() will not work on the
      head page of the compound hugetlb page which will result in a NULL
      hstate from page_hstate().
      
      To fix this, first remove the VM_BUG_ON() because it is wrong, and then
      use the compound head page in dissolve_free_huge_page().  This means
      that an unused pre-allocated gigantic page that has any part of itself
      inside the memory block that is going offline will be dissolved
      completely.  Losing an unused gigantic hugepage is preferable to failing
      the memory offline, for example in the situation where a (possibly
      faulty) memory DIMM needs to go offline.
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2247bb33
    • Wanlong Gao's avatar
      mm: nobootmem: move the comment of free_all_bootmem · 914a0516
      Wanlong Gao authored
      Commit b4def350 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
      removed the unnecessary nodeid argument; after that, this comment became
      confusing.  Move it to the right place.
      
      Fixes: b4def350 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
      Link: http://lkml.kernel.org/r/1473996082-14603-1-git-send-email-wanlong.gao@gmail.com
      Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      914a0516
    • Rasmus Villemoes's avatar
      mm/shmem.c: constify anon_ops · 19938e35
      Rasmus Villemoes authored
      
      
      Every other dentry_operations instance is const, and this one might as
      well be.
      
      Link: http://lkml.kernel.org/r/1473890528-7009-1-git-send-email-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19938e35
    • Johannes Weiner's avatar
      mm: memcontrol: consolidate cgroup socket tracking · 2d758073
      Johannes Weiner authored
      
      
      The cgroup core and the memory controller need to track socket ownership
      for different purposes, but the tracking sites being entirely different
      is kind of ugly.
      
      Be a better citizen and rename the memory controller callbacks to match
      the cgroup core callbacks, then move them to the same place.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d758073
    • Baoyou Xie's avatar
      mm: move phys_mem_access_prot_allowed() declaration to pgtable.h · 08ea8c07
      Baoyou Xie authored
      
      
      We get 1 warning when building the kernel with W=1:
      
        drivers/char/mem.c:220:12: warning: no previous prototype for 'phys_mem_access_prot_allowed' [-Wmissing-prototypes]
         int __weak phys_mem_access_prot_allowed(struct file *file,
      
      In fact, its declaration is spread across several architecture-specific
      header files, but it should be declared in a common header file.
      
      So this patch moves the declaration of phys_mem_access_prot_allowed() to
      pgtable.h.
      
      Link: http://lkml.kernel.org/r/1473751597-12139-1-git-send-email-baoyou.xie@linaro.org
      Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08ea8c07
    • Andrew Morton's avatar
      mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE() · cc30c5d6
      Andrew Morton authored
      
      
      So they are CONFIG_DEBUG_VM-only and more informative.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc30c5d6
    • Tetsuo Handa's avatar
      mm: don't emit warning from pagefault_out_of_memory() · a104808e
      Tetsuo Handa authored
      Commit c32b3cbe ("oom, PM: make OOM detection in the freezer path
      raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
      to warn when we raced with disabling the OOM killer.
      
      Now, patch "oom, suspend: fix oom_killer_disable vs.  pm suspend
      properly" introduced a timeout for oom_killer_disable().  Even if we
      raced with disabling the OOM killer and the system is OOM livelocked,
      the OOM killer will be enabled eventually (in 20 seconds by default) and
      the OOM livelock will be solved.  Therefore, we no longer need to warn
      when we raced with disabling the OOM killer.
      
      Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a104808e
    • Vlastimil Babka's avatar
      mm, compaction: restrict fragindex to costly orders · 20311420
      Vlastimil Babka authored
      
      
      The fragmentation index and the vm.extfrag_threshold sysctl are meant as
      a heuristic to prevent excessive compaction for costly orders (i.e.  THP).
      It's unlikely to make any difference for non-costly orders, especially
      with the default threshold.  But we cannot afford any uncertainty for
      the non-costly orders where the only alternative to successful
      reclaim/compaction is OOM.  After the recent patches we are guaranteed
      maximum effort without heuristics from compaction before deciding OOM,
      and fragindex is the last remaining heuristic.  Therefore skip fragindex
      altogether for non-costly orders.
      
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20311420
    • Vlastimil Babka's avatar
      mm, compaction: ignore fragindex from compaction_zonelist_suitable() · cc5c9f09
      Vlastimil Babka authored
      
      
      The compaction_zonelist_suitable() function tries to determine if
      compaction will be able to proceed after sufficient reclaim, i.e.
      whether there are enough reclaimable pages to provide enough order-0
      freepages for compaction.
      
      This addition of reclaimable pages to the free pages works well for the
      order-0 watermark check, but in the fragmentation index check we only
      consider truly free pages.  Thus we can get a fragindex value close to
      0, which indicates failure due to lack of memory, and wrongly decide
      that compaction won't be suitable even after reclaim.
      
      Instead of trying to somehow adjust fragindex for reclaimable pages,
      let's just skip it from compaction_zonelist_suitable().
      
      Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc5c9f09
    • Vlastimil Babka's avatar
      mm, page_alloc: pull no_progress_loops update to should_reclaim_retry() · 423b452e
      Vlastimil Babka authored
      
      
      The should_reclaim_retry() makes decisions based on no_progress_loops,
      so it makes sense to also update the counter there.  It will be also
      consistent with should_compact_retry() and compaction_retries.  No
      functional change.
      
      [hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
      Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      423b452e
    • Vlastimil Babka's avatar
      mm, compaction: make full priority ignore pageblock suitability · 9f7e3387
      Vlastimil Babka authored
      
      
      Several people have reported premature OOMs for order-2 allocations
      (stack) due to OOM rework in 4.7.  In the scenario (parallel kernel
      build and dd writing to two drives) many pageblocks get marked as
      Unmovable and compaction free scanner struggles to isolate free pages.
      Joonsoo Kim pointed out that the free scanner skips pageblocks that are
      not movable to prevent filling them and forcing non-movable allocations
      to fall back to other pageblocks.  Such a heuristic makes sense to help
      prevent long-term fragmentation, but premature OOMs are a relatively
      more urgent problem.  As a compromise, this patch disables the heuristic
      only for the ultimate compaction priority.
      
      Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
      Reported-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
      Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
      Reported-by: Olaf Hering <olaf@aepfle.de>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f7e3387
    • Vlastimil Babka's avatar
      mm, compaction: restrict full priority to non-costly orders · c2033b00
      Vlastimil Babka authored
      
      
      The new ultimate compaction priority disables some heuristics, which may
      result in excessive cost.  This is fine for non-costly orders where we
      want to try hard before resorting to OOM, but might be disruptive for
      costly orders which do not trigger OOM and should generally have some
      fallback.  Thus, we disable the full priority for costly orders.
      
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2033b00
    • Vlastimil Babka's avatar
      mm, compaction: more reliably increase direct compaction priority · d9436498
      Vlastimil Babka authored
      
      
      During reclaim/compaction loop, compaction priority can be increased by
      the should_compact_retry() function, but the current code is not
      optimal.  Priority is only increased when compaction_failed() is true,
      which means that compaction has scanned the whole zone.  This may not
      happen even after multiple attempts with a lower priority due to
      parallel activity, so we might needlessly struggle on the lower
      priorities and possibly run out of compaction retry attempts in the
      process.
      
      After this patch we are guaranteed at least one attempt at the highest
      compaction priority even if we exhaust all retries at the lower
      priorities.
      
      Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9436498
    • Vlastimil Babka's avatar
      Revert "mm, oom: prevent premature OOM killer invocation for high order request" · 3250845d
      Vlastimil Babka authored
      Patch series "reintroduce compaction feedback for OOM decisions".
      
      After several people reported OOMs for order-2 allocations in 4.7 due
      to Michal Hocko's OOM rework, he reverted the part that considered
      compaction feedback [1] in the decisions to retry reclaim/compaction.
      This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
      while mmotm had an almost complete solution that instead improved
      compaction reliability.
      
      This series completes the mmotm solution and reintroduces the compaction
      feedback into OOM decisions.  The first two patches restore the state of
      mmotm before the temporary solution was merged, the last patch should be
      the missing piece for reliability.  The third patch restricts the
      hardened compaction to non-costly orders, since costly orders don't
      result in OOMs in the first place.
      
      [1] http://marc.info/?i=20160822093249.GA14916%40dhcp22.suse.cz%3E
      
      This patch (of 4):
      
      Commit 6b4e3181 ("mm, oom: prevent premature OOM killer invocation
      for high order request") was intended as a quick fix of OOM regressions
      for 4.8 and stable 4.7.x kernels.  For a better long-term solution, we
      still want to consider compaction feedback, which should be possible
      after some more improvements in the following patches.
      
      This reverts commit 6b4e3181.
      
      Link: http://lkml.kernel.org/r/20160906135258.18335-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3250845d
    • Huang Ying's avatar
      mm: remove page_file_index · 8cd79788
      Huang Ying authored
      
      
      After using the offset of the swap entry as the key of the swap cache,
      page_index() becomes exactly the same as page_file_index().  So
      page_file_index() is removed and the callers are changed to use
      page_index() instead.
      
      Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Anna Schumaker <anna.schumaker@netapp.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd79788
    • Huang Ying's avatar
      mm, swap: use offset of swap entry as key of swap cache · f6ab1f7f
      Huang Ying authored
      
      
      This patch improves the performance of swap cache operations when the
      type of the swap device is not 0.  Originally, the whole swap entry
      value is used as the key of the swap cache, even though there is one
      radix tree for each swap device.  If the type of the swap device is not
      0, the height of the radix tree of the swap cache is increased
      unnecessarily, especially on 64-bit architectures.  For example, for a 1GB
      swap device on the x86_64 architecture, the height of the radix tree of
      the swap cache is 11.  But if the offset of the swap entry is used as
      the key of the swap cache, the height of the radix tree of the swap
      cache is 4.  The increased height causes unnecessary radix tree
      descending and increased cache footprint.
      
      This patch reduces the height of the radix tree of the swap cache via
      using the offset of the swap entry instead of the whole swap entry value
      as the key of the swap cache.  In 32 processes sequential swap out test
      case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
      for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
      when the type of the swap device is 1.
      
      Use the whole swap entry as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,
      
      Use the swap offset as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
      
      Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6ab1f7f
    • Dan Williams's avatar
      mm: fix cache mode tracking in vm_insert_mixed() · 87744ab3
      Dan Williams authored
      
      
      vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
      fails to check the pgprot_t it uses for the mapping against the one
      recorded in the memtype tracking tree.  Add the missing call to
      track_pfn_insert() to preclude cases where incompatible aliased mappings
      are established for a given physical address range.
      
      Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87744ab3
    • Reza Arbab's avatar
      memory-hotplug: fix store_mem_state() return value · d66ba15b
      Reza Arbab authored
      
      
      If store_mem_state() is called to online memory which is already online,
      it will return 1, the value it got from device_online().
      
      This is wrong because store_mem_state() is a device_attribute .store
      function.  Thus a non-negative return value represents input bytes read.
      
      Set the return value to -EINVAL in this case.
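
      A sketch of the return-value convention involved (the state-change
      helper here is hypothetical and stands in for the real online/offline
      handling):

        static ssize_t store_mem_state(struct device *dev,
                                       struct device_attribute *attr,
                                       const char *buf, size_t count)
        {
                int ret = change_mem_state(dev, buf);   /* hypothetical helper */

                if (ret < 0)
                        return ret;     /* propagate the errno */
                if (ret)
                        return -EINVAL; /* don't leak 1 from device_online()
                                           as "1 byte consumed" */

                return count;           /* success: whole input consumed */
        }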
      
      Link: http://lkml.kernel.org/r/1472743777-24266-1-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d66ba15b
    • mm/memcontrol.c: make the walk_page_range() limit obvious · 0247f3f4
      James Morse authored
      
      
      mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
      walk_page_range() on the range 0 to ~0UL.  Neither provides a pte_hole
      callback, so the current implementation skips non-VMA regions.  This is
      fine today, but follow-up changes want to make walk_page_range() more
      generic, so it is better to be explicit about which range to traverse.
      Use highest_vm_end to explicitly traverse only user-mapped memory.
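
      A sketch of the call-site change (simplified; the mm_walk setup is
      omitted):

        /* before: walk everything, relying on non-VMA regions being skipped */
        walk_page_range(0, ~0UL, &mem_cgroup_count_precharge_walk);

        /* after: bound the walk to user mappings explicitly */
        walk_page_range(0, mm->highest_vm_end, &mem_cgroup_count_precharge_walk);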
      
      [mhocko@kernel.org: rewrote changelog]
      Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0247f3f4
    • thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu authored
      
      
      The global zero page is used to satisfy an anonymous read fault.  If
      THP (Transparent HugePage) is enabled, then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce accesses to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new mm_struct flag is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, a process only needs to touch
      the global counter in two cases:

       1. the first time it uses the global huge zero page;
       2. when the mm_users count of its mm_struct reaches zero.
      
      Note that right now the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the last process that ever used it exits.

      Also, because mm_users is used, kthreads are not eligible to use the
      huge zero page.  Since no kthread uses the huge zero page today, there
      is no difference after applying this patch.  But if that is not
      desired, I can change it to trigger when mm_count reaches zero.
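
      A sketch of the wrappers this scheme suggests (names follow this
      changelog; the merged helpers may differ slightly):

        /* Each mm pays the atomic cost at most twice over its lifetime. */
        struct page *mm_get_huge_zero_page(struct mm_struct *mm)
        {
                /* Fast path: this mm already holds a reference. */
                if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        return READ_ONCE(huge_zero_page);

                if (!get_huge_zero_page())
                        return NULL;

                /* Raced with another thread of this mm that set the bit
                 * first: drop the extra reference just taken. */
                if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        put_huge_zero_page();

                return READ_ONCE(huge_zero_page);
        }

        /* Called when mm_users of the mm_struct drops to zero. */
        void mm_put_huge_zero_page(struct mm_struct *mm)
        {
                if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        put_huge_zero_page();
        }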
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      This spawns 72 processes, each of which mmaps 100G of anonymous space
      and then reads that space sequentially, read-only, with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance (throughput) of the workload for base commit: 1784430792
      Performance (throughput) of the workload for this commit: 4726928591
      164% increase.

      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      57% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
    • fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious · 0f30206b
      James Morse authored
      
      
      Trying to walk all of virtual memory requires architecture-specific
      knowledge.  On x86_64, addresses must be sign-extended from bit 48,
      whereas on arm64 the top VA_BITS of the address space have their own
      set of page tables.
      
      clear_refs_write() calls walk_page_range() on the range 0 to ~0UL and
      provides a test_walk() callback that only expects to be walking over
      VMAs.  Currently walk_pmd_range() will skip memory regions that don't
      have a VMA, reporting them as a hole.
      
      As this call only expects to walk user address space, make it walk 0 to
      'highest_vm_end'.
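
      A sketch of the resulting call site in clear_refs_write() (simplified):

        /* before: 0 to ~0UL, an architecture-dependent notion of "everything" */
        walk_page_range(0, ~0UL, &clear_refs_walk);

        /* after: only the user mappings of this mm */
        walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);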
      
      Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f30206b
    • cpu: fix node state for whether it contains CPU · 03e86dba
      Tim Chen authored
      
      
      In the current kernel code, we only call node_set_state(cpu_to_node(cpu),
      N_CPU) when a CPU is hot-plugged, but we do not set the N_CPU node
      state when CPUs are brought online during boot.

      So this can lead to failures when we check whether a node contains a
      CPU with node_state(node_id, N_CPU).

      One use case is in the node_reclaim() function:
      
              /*
               * Only run node reclaim on the local node or on nodes that do not
               * have associated processors. This will favor the local processor
               * over remote processors and spread off node memory allocations
               * as wide as possible.
               */
              if (node_state(pgdat->node_id, N_CPU) &&
                  pgdat->node_id != numa_node_id())
                      return NODE_RECLAIM_NOSCAN;
      
      I instrumented the kernel to call the function below after boot, and it
      always returns 0 on an x86 desktop machine until I apply this patch.
      
         int num_cpu_node(void)
         {
                 int i, nr_cpu_nodes = 0;

                 for_each_node(i) {
                         if (node_state(i, N_CPU))
                                 nr_cpu_nodes++;
                 }

                 return nr_cpu_nodes;
         }
      
      Fix this by checking each node for online CPUs when we initialize
      vmstat, which is responsible for maintaining the node state.
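
      A minimal sketch of such an initialization step (illustrative of the
      approach; the helper name and exact call site are assumptions):

        static void __init init_cpu_node_state(void)
        {
                int cpu;

                get_online_cpus();
                /* Mark every node that already has an online CPU. */
                for_each_online_cpu(cpu)
                        node_set_state(cpu_to_node(cpu), N_CPU);
                put_online_cpus();
        }

        /* Called once from the vmstat initialization path during boot,
         * before anything relies on node_state(nid, N_CPU). */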
      
      Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03e86dba