  1. Jul 29, 2016
    • Merge branch 'akpm' (patches from Andrew) · 1c88e19b
      Linus Torvalds authored
      Merge more updates from Andrew Morton:
       "The rest of MM"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
        mm, compaction: simplify contended compaction handling
        mm, compaction: introduce direct compaction priority
        mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
        mm, page_alloc: make THP-specific decisions more generic
        mm, page_alloc: restructure direct compaction handling in slowpath
        mm, page_alloc: don't retry initial attempt in slowpath
        mm, page_alloc: set alloc_flags only once in slowpath
        lib/stackdepot.c: use __GFP_NOWARN for stack allocations
        mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
        mm, kasan: account for object redzone in SLUB's nearest_obj()
        mm: fix use-after-free if memory allocation failed in vma_adjust()
        zsmalloc: Delete an unnecessary check before the function call "iput"
        mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
        mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
        mm: optimize copy_page_to/from_iter_iovec
        mm: add cond_resched() to generic_swapfile_activate()
        Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
        mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
        mm: hwpoison: remove incorrect comments
        make __section_nr() more efficient
        ...
      1c88e19b
    • mm, compaction: simplify contended compaction handling · c3486f53
      Vlastimil Babka authored
      Async compaction detects contention either due to failing trylock on
      zone->lock or lru_lock, or by need_resched().  Since 1f9efdef ("mm,
      compaction: khugepaged should not give up due to need_resched()") the
      code got quite complicated to distinguish these two up to the
      __alloc_pages_slowpath() level, so different decisions could be taken
      for khugepaged allocations.
      
      After the recent changes, khugepaged allocations don't check for
      contended compaction anymore, so we again don't need to distinguish lock
      and sched contention, and can simplify the current convoluted code a lot.
      
      However, I believe it's also possible to simplify even more and
      completely remove the check for contended compaction after the initial
      async compaction for costly orders, which was originally aimed at THP
      page fault allocations.  There are several reasons why this can be done
      now:
      
      - with the new defaults, THP page faults no longer do reclaim/compaction at
        all, unless the system admin has overridden the default, or application has
        indicated via madvise that it can benefit from THP's. In both cases, it
        means that the potential extra latency is expected and worth the benefits.
      - even if reclaim/compaction proceeds after this patch where it previously
        wouldn't, the second compaction attempt is still async and will detect the
        contention and back off, if the contention persists
      - there are still heuristics like deferred compaction and pageblock skip bits
        in place that prevent excessive THP page fault latencies
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3486f53
    • mm, compaction: introduce direct compaction priority · a5508cd8
      Vlastimil Babka authored
      
      
      In the context of direct compaction, for some types of allocations we
      would like the compaction to either succeed or definitely fail while
      trying as hard as possible.  Current async/sync_light migration mode is
      insufficient, as there are heuristics such as caching scanner positions,
      marking pageblocks as unsuitable or deferring compaction for a zone.  At
      least the final compaction attempt should be able to override these
      heuristics.
      
      To communicate how hard compaction should try, we replace migration mode
      with a new enum compact_priority and change the relevant function
      signatures.  In compact_zone_order() where struct compact_control is
      constructed, the priority is mapped to suitable control flags.  This
      patch itself has no functional change, as the current priority levels
      are mapped back to the same migration modes as before.  Expanding them
      will be done next.
      
      Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
      removed, as the only caller exists under CONFIG_COMPACTION.
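
The mapping from priority to control flags described above can be sketched
in plain C. This is a minimal userspace model, not the kernel code: every
identifier carrying a "sketch" or "SKETCH_" marker is a hypothetical stand-in,
and only the two migration modes named in the message are modelled.

```c
/* Stand-in for the kernel's migration modes; only the two used here. */
enum migrate_mode_sketch {
	SKETCH_MIGRATE_ASYNC,
	SKETCH_MIGRATE_SYNC_LIGHT,
};

/* Hypothetical priority levels; a lower value means "try harder". */
enum compact_priority_sketch {
	SKETCH_PRIO_SYNC_LIGHT,
	SKETCH_PRIO_ASYNC,
};

/* Hypothetical subset of the control structure a compaction run consults. */
struct compact_control_sketch {
	enum migrate_mode_sketch mode;
};

/*
 * Map a priority to control flags at the point where the control
 * structure is constructed.  Each priority maps back to one existing
 * migration mode, so behaviour is unchanged.
 */
static struct compact_control_sketch
make_compact_control(enum compact_priority_sketch prio)
{
	struct compact_control_sketch cc = {
		.mode = (prio == SKETCH_PRIO_ASYNC) ? SKETCH_MIGRATE_ASYNC
						    : SKETCH_MIGRATE_SYNC_LIGHT,
	};

	return cc;
}
```

Callers pass only the priority; later patches could then extend the
constructor with further flags without touching the function signatures
again.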
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5508cd8
    • mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations · 25160354
      Vlastimil Babka authored
      
      
      After the previous patch, we can distinguish costly allocations that
      should be really lightweight, such as THP page faults, with
      __GFP_NORETRY.  This means we don't need to recognize khugepaged
      allocations via PF_KTHREAD anymore.  We can also change THP page faults
      in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
      khugepaged, as the process has indicated that it benefits from THP's and
      is willing to pay some initial latency costs.
      
      We can also make the flags handling less cryptic by distinguishing
      GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
      GFP_TRANSHUGE (only direct reclaim, khugepaged default).  Adding
      __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
      
      The patch effectively changes the current GFP_TRANSHUGE users as
      follows:
      
      * get_huge_zero_page() - the zero page lifetime should be relatively
        long and it's shared by multiple users, so it's worth spending some
        effort on it.  We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
        This also restores direct reclaim to this allocation, which was
        unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
        by default to madvise and add a stall-free defrag option")
      
      * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
        is not an issue.  So if khugepaged "defrag" is enabled (the default), do
        reclaim via GFP_TRANSHUGE without __GFP_NORETRY.  We can remove the
        PF_KTHREAD check from page alloc.
      
        As a side-effect, khugepaged will now no longer check if the initial
        compaction was deferred or contended.  This is OK, as khugepaged sleep
        times between collapse attempts are long enough to prevent noticeable
        disruption, so we should allow it to spend some effort.
      
      * migrate_misplaced_transhuge_page() - already was masking out
        __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
        equivalent.
      
      * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
        are now allocating without __GFP_NORETRY.  Other vma's keep using
        __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
        it's allowed only for madvised vma's).  The rest is conversion to
        GFP_TRANSHUGE(_LIGHT).
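
The page fault case in the last item can be modelled roughly as follows.
This is a simplified sketch under stated assumptions: the SK_* bit values
are invented for illustration and are not the kernel's gfp bits, SK_THP_BASE
stands in for all remaining THP flags, and the hypothetical reclaim_allowed
parameter collapses the "defrag" setting into a single yes/no.

```c
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's real gfp bits. */
#define SK_DIRECT_RECLAIM (1u << 0)
#define SK_NORETRY        (1u << 1)
#define SK_THP_BASE       (1u << 2)	/* stands in for the other THP flags */

/* No reclaim at all: the default mode for page faults. */
#define SK_TRANSHUGE_LIGHT (SK_THP_BASE)
/* Direct reclaim only: the khugepaged default. */
#define SK_TRANSHUGE       (SK_TRANSHUGE_LIGHT | SK_DIRECT_RECLAIM)

/*
 * Pick the allocation mask for a THP page fault.
 * vma_madvised:    the area was marked with madvise(MADV_HUGEPAGE)
 * reclaim_allowed: hypothetical summary of whether direct
 *                  reclaim/compaction is permitted for this fault
 */
static unsigned int fault_gfpmask(bool vma_madvised, bool reclaim_allowed)
{
	if (!reclaim_allowed)
		return SK_TRANSHUGE_LIGHT;
	if (vma_madvised)
		return SK_TRANSHUGE;		/* try as hard as khugepaged */
	return SK_TRANSHUGE | SK_NORETRY;	/* stay lightweight */
}
```

The point of the split is that the "light" mask is the base and callers add
reclaim or __GFP_NORETRY explicitly, rather than masking bits out of one
combined definition.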
      
      [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
      Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25160354
    • mm, page_alloc: make THP-specific decisions more generic · 3eb2771b
      Vlastimil Babka authored
      
      
      Since THP allocations during page faults can be costly, extra decisions
      are employed for them to avoid excessive reclaim and compaction, if the
      initial compaction doesn't look promising.  The detection has never been
      perfect as there is no gfp flag specific to THP allocations.  At this
      moment it checks the whole combination of flags that makes up
      GFP_TRANSHUGE, and hopes that no other users of such combination exist,
      or would mind being treated the same way.  Extra care is also taken to
      separate allocations from khugepaged, where latency doesn't matter that
      much.
      
      It is however possible to distinguish these allocations in a simpler and
      more reliable way.  The key observation is that after the initial
      compaction followed by the first iteration of "standard"
      reclaim/compaction, both __GFP_NORETRY allocations and costly
      allocations without __GFP_REPEAT are declared as failures:
      
              /* Do not loop if specifically requested */
              if (gfp_mask & __GFP_NORETRY)
                      goto nopage;
      
              /*
               * Do not retry costly high order allocations unless they are
               * __GFP_REPEAT
               */
              if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                      goto nopage;
      
      This means we can further distinguish allocations that are costly order
      *and* additionally include the __GFP_NORETRY flag.  As it happens,
      GFP_TRANSHUGE allocations do already fall into this category.  This will
      also allow other costly allocations with similar high-order benefit vs
      latency considerations to use this semantic.  Furthermore, we can
      distinguish THP allocations that should try a bit harder (such as from
      khugepaged) by removing __GFP_NORETRY, as will be done in the next
      patch.
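
The two classes discussed above can be written as predicates. The following
is a userspace sketch only: the SKETCH_* flag bits are invented for
illustration, and the function names are hypothetical. The costly-order
threshold of 3 matches the kernel's PAGE_ALLOC_COSTLY_ORDER.

```c
#include <stdbool.h>

/* Illustrative bit values; the kernel's real gfp bits differ. */
#define SKETCH_GFP_NORETRY (1u << 0)
#define SKETCH_GFP_REPEAT  (1u << 1)

/* Same value as the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define SKETCH_COSTLY_ORDER 3

/*
 * Mirrors the two quoted checks: true when the allocation is declared a
 * failure after the first iteration of reclaim/compaction.
 */
static bool fails_after_first_pass(unsigned int order, unsigned int gfp_mask)
{
	if (gfp_mask & SKETCH_GFP_NORETRY)
		return true;
	if (order > SKETCH_COSTLY_ORDER && !(gfp_mask & SKETCH_GFP_REPEAT))
		return true;
	return false;
}

/*
 * The narrower class: costly order *and* __GFP_NORETRY.  These are the
 * allocations that should stay really lightweight.
 */
static bool is_lightweight_costly(unsigned int order, unsigned int gfp_mask)
{
	return order > SKETCH_COSTLY_ORDER &&
	       (gfp_mask & SKETCH_GFP_NORETRY) != 0;
}
```

An order-9 request with the NORETRY bit set falls into the lightweight
class, while the same request without the bit still fails after the first
pass but is allowed to try harder before that.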
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3eb2771b
    • mm, page_alloc: restructure direct compaction handling in slowpath · a8161d1e
      Vlastimil Babka authored
      
      
      The retry loop in __alloc_pages_slowpath is supposed to keep trying
      reclaim and compaction (and OOM), until either the allocation succeeds,
      or returns with failure.  Success here is more probable when reclaim
      precedes compaction, as certain watermarks have to be met for compaction
      to even try, and more free pages increase the probability of compaction
      success.  On the other hand, starting with light async compaction (if
      the watermarks allow it), can be more efficient, especially for smaller
      orders, if there's enough free memory which is just fragmented.
      
      Thus, the current code starts with compaction before reclaim, and to
      make sure that the last reclaim is always followed by a final
      compaction, there's another direct compaction call at the end of the
      loop.  This makes the code hard to follow and adds some duplicated
      handling of migration_mode decisions.  It's also somewhat inefficient
      that even if reclaim or compaction decides not to retry, the final
      compaction is still attempted.  Some gfp flag combinations also shortcut
      these retry decisions by "goto noretry;", making it even harder to
      follow.
      
      This patch attempts to restructure the code with only minimal functional
      changes.  The call to the first compaction and THP-specific checks are
      now placed above the retry loop, and the "noretry" direct compaction is
      removed.
      
      The initial compaction is additionally restricted only to costly orders,
      as we can expect smaller orders to be held back by watermarks, and only
      larger orders to suffer primarily from fragmentation.  This better
      matches the checks in reclaim's shrink_zones().
      
      There are two other smaller functional changes.  One is that the upgrade
      from async migration to light sync migration will always occur after the
      initial compaction.  This is how it has been until recent patch "mm,
      oom: protect !costly allocations some more", which introduced upgrading
      the mode based on COMPACT_COMPLETE result, but kept the final compaction
      always upgraded, which made it even more special.  It's better to return
      to the simpler handling for now, as migration modes will be further
      modified later in the series.
      
      The second change is that once both reclaim and compaction declare it's
      not worth retrying the reclaim/compact loop, there is no final
      compaction attempt.  As argued above, this is intentional.  If that
      final compaction were to succeed, it would be due to a wrong retry
      decision, or simply a race with somebody else freeing memory for us.
      
      The main outcome of this patch should be simpler code.  Logically, the
      initial compaction without reclaim is the exceptional case to the
      reclaim/compaction scheme, but prior to the patch, it was the last loop
      iteration that was exceptional.  Now the code matches the logic better.
      The change also enables the following patches.
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8161d1e
    • mm, page_alloc: don't retry initial attempt in slowpath · 23771235
      Vlastimil Babka authored
      
      
      After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
      kswapd, it first tries get_page_from_freelist() with the new
      alloc_flags, as it may succeed e.g. due to using min watermark instead
      of low watermark.  It makes sense to do this attempt before adjusting
      zonelist based on alloc_flags/gfp_mask, as it's still relatively a fast
      path if we just wake up kswapd and successfully allocate.
      
      This patch therefore moves the initial attempt above the retry label and
      slightly reorganizes the part below the retry label.  We still have to
      attempt get_page_from_freelist() on each retry, as some allocations
      cannot do that as part of direct reclaim or compaction, and yet are not
      allowed to fail (even though they do a WARN_ON_ONCE() and thus should
      not exist).  We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
      and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
      it.  As a side-effect, the attempts from direct reclaim/compaction will
      also no longer obey watermarks once this is set, but there's little harm
      in that.
      
      Kswapd wakeups are also done on each retry to be safe from potential
      races resulting in kswapd going to sleep while a process (that may not
      be able to reclaim by itself) is still looping.
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23771235
    • mm, page_alloc: set alloc_flags only once in slowpath · 31a6c190
      Vlastimil Babka authored
      
      
      In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
      initialized, so move the initialization above the retry: label.  Also
      make the comment above the initialization more descriptive.
      
      The only exception to alloc_flags being constant is
      ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
      allocating thread.  We can fix this, and make the code simpler and a bit
      more effective at the same time, by moving the part that determines
      ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().
      
      This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
      places in __alloc_pages_slowpath() anymore.  The only two tests for the
      flag can instead call gfp_pfmemalloc_allowed().
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31a6c190
    • lib/stackdepot.c: use __GFP_NOWARN for stack allocations · 87cc271d
      Kirill A. Shutemov authored
      
      
      This (large, atomic) allocation attempt can fail.  We expect and handle
      that, so avoid the scary warning.
      
      Link: http://lkml.kernel.org/r/20160720151905.GB19146@node.shutemov.name
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87cc271d
    • mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB · 80a9201a
      Alexander Potapenko authored
      
      
      For KASAN builds:
       - switch SLUB allocator to using stackdepot instead of storing the
         allocation/deallocation stacks in the objects;
       - change the freelist hook so that parts of the freelist can be put
         into the quarantine.
      
      [aryabinin@virtuozzo.com: fixes]
        Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80a9201a
    • mm, kasan: account for object redzone in SLUB's nearest_obj() · c146a2b9
      Alexander Potapenko authored
      When looking up the nearest SLUB object for a given address, correctly
      calculate its offset if SLAB_RED_ZONE is enabled for that cache.
      
      Previously, when KASAN had detected an error on an object from a cache
      with SLAB_RED_ZONE set, the actual start address of the object was
      miscalculated, which led to random stacks having been reported.
      
      Fixes: 7ed2f9e6 ("mm, kasan: SLAB support")
      Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c146a2b9
    • mm: fix use-after-free if memory allocation failed in vma_adjust() · 734537c9
      Kirill A. Shutemov authored
      
      
      There's one case when vma_adjust() expands the vma, overlapping with
      *two* next vma.  See case 6 of mprotect, described in the comment to
      vma_merge().
      
      To handle this (and only this) situation we iterate twice over the main
      part of the function.  See "goto again".

      Vegard reported[1] that he sees an out-of-bounds access complaint from
      KASAN if anon_vma_clone() on the *second* iteration fails.

      This happens because we free the 'next' vma by the end of the first
      iteration and don't have a way to undo this if anon_vma_clone() fails on
      the second iteration.
      
      The solution is to do all required allocations upfront, before we touch
      vmas.
      
      The allocation on the second iteration is only required if the first two
      vmas don't have anon_vma, but the third does.  So we need, in total, one
      anon_vma_clone() call.

      It's easy to adjust 'exporter' to the third vma for such a case.
      
      [1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com
      
      Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      734537c9
    • zsmalloc: Delete an unnecessary check before the function call "iput" · c3491eca
      Markus Elfring authored
      
      
      iput() tests whether its argument is NULL and then returns immediately.
      Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.
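
The pattern being removed can be illustrated outside the kernel. In this
sketch, put_inode_sketch() is a hypothetical stand-in that, like iput()
as described above, returns immediately for a NULL argument; the struct
and the two release helpers are likewise invented for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for an inode with a reference count. */
struct inode_sketch {
	int refcount;
};

/* Like iput() as described above: a NULL argument is ignored. */
static void put_inode_sketch(struct inode_sketch *inode)
{
	if (!inode)
		return;
	inode->refcount--;
}

/* Before: the caller repeats a test that the callee already performs. */
static void release_before(struct inode_sketch *inode)
{
	if (inode)
		put_inode_sketch(inode);
}

/* After: the redundant guard is dropped; behaviour is identical. */
static void release_after(struct inode_sketch *inode)
{
	put_inode_sketch(inode);
}
```

Both helpers behave the same for NULL and non-NULL arguments, which is why
the outer test can be deleted safely.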
      
      Link: http://lkml.kernel.org/r/559cf499-4a01-25f9-c87f-24d906626a57@users.sourceforge.net
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3491eca
    • mm/memblock.c: fix index adjustment error in __next_mem_range_rev() · fb399b48
      zijun_hu authored
      
      
      Fix region index adjustment error when parameter type_b of
      __next_mem_range_rev() == NULL.
      
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Richard Leitner <dev@g0hl1n.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb399b48
    • mem-hotplug: alloc new page from a nearest neighbor node when mem-offline · 394e31d2
      Xishi Qiu authored
      
      
      If we offline a node, allocate the new page from a nearest neighbor node
      instead of the current node or other remote nodes, because migrating the
      page a second time is a waste of time and the distance to the remote
      nodes is often very large.

      Also use GFP_HIGHUSER_MOVABLE to allocate the new page if the zone is a
      movable zone or a highmem zone.
      
      Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      394e31d2
    • mm: optimize copy_page_to/from_iter_iovec · 3fa6c507
      Mikulas Patocka authored
      
      
      copy_page_to_iter_iovec() and copy_page_from_iter_iovec() copy some data
      to userspace or from userspace.  These functions have a fast path where
      they map a page using kmap_atomic and a slow path where they use kmap.
      
      kmap is slower than kmap_atomic, so the fast path is preferred.
      
      However, on kernels without highmem support, kmap just calls
      page_address, so there is no need to avoid kmap.  On kernels without
      highmem support, the fast path just increases code size (and cache
      footprint) and it doesn't improve copy performance in any way.
      
      This patch enables the fast path only if CONFIG_HIGHMEM is defined.
      
      Code size reduced by this patch:
        x86 (without highmem)    928
        x86-64                   960
        sparc64                  848
        alpha                   1136
        pa-risc                 1200
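
The reason a configuration test can shrink the code is that a compile-time
constant lets the compiler discard the unreachable branch. The sketch below
models that in userspace; SKETCH_HIGHMEM_ENABLED is a stand-in for
IS_ENABLED(CONFIG_HIGHMEM), and the enum, function and fault_in_ok
parameter are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/*
 * Stand-in for IS_ENABLED(CONFIG_HIGHMEM): a constant 0 or 1 known at
 * compile time.  Build with -DSKETCH_HIGHMEM to model a highmem kernel.
 */
#ifdef SKETCH_HIGHMEM
#define SKETCH_HIGHMEM_ENABLED 1
#else
#define SKETCH_HIGHMEM_ENABLED 0
#endif

enum copy_path_sketch {
	COPY_FAST_ATOMIC_MAP,	/* models the kmap_atomic fast path */
	COPY_SLOW_MAP,		/* models the kmap slow path */
};

/*
 * Choose the copy path.  fault_in_ok is a hypothetical flag saying the
 * user buffer could be pre-faulted, which the fast path requires.
 * When the constant is 0 the first branch is dead code and the compiler
 * can drop the fast path entirely.
 */
static enum copy_path_sketch choose_copy_path(bool fault_in_ok)
{
	if (SKETCH_HIGHMEM_ENABLED && fault_in_ok)
		return COPY_FAST_ATOMIC_MAP;
	return COPY_SLOW_MAP;
}
```

Using an ordinary `if` on a constant, instead of an `#ifdef` around the
block, keeps both branches visible to the compiler for type checking while
still letting it remove the unused one.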
      
      [akpm@linux-foundation.org: use IS_ENABLED(), per Andi]
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221711410.4818@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fa6c507
    • mm: add cond_resched() to generic_swapfile_activate() · 7e4411bf
      Mikulas Patocka authored
      
      
      generic_swapfile_activate() can take quite a long time because it
      iterates over all blocks of a file, so add cond_resched() to it.  I
      observed stalls of about 1 second when activating a swapfile that was
      almost unfragmented; this patch fixes that.
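
The general shape of the fix is a voluntary scheduling point inside a long
loop. The following is a userspace analogue only: sched_yield() stands in
for the kernel's cond_resched(), and the block array and function name are
hypothetical.

```c
#include <sched.h>
#include <stddef.h>

/*
 * Walk every block of a (hypothetical) file map and count the blocks in
 * use.  Each iteration offers the CPU to other runnable tasks, so a
 * very long scan does not monopolise it.
 */
static size_t count_used_blocks(const unsigned char *blocks, size_t nr_blocks)
{
	size_t used = 0;

	for (size_t i = 0; i < nr_blocks; i++) {
		sched_yield();	/* userspace stand-in for cond_resched() */
		if (blocks[i])
			used++;
	}

	return used;
}
```

The scheduling point does not change the result of the scan; it only bounds
how long other tasks have to wait while it runs.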
      
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e4411bf
    • Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements" · 4e390b2b
      Michal Hocko authored
      This reverts commit f9054c70 ("mm, mempool: only set __GFP_NOMEMALLOC
      if there are free elements").
      
      There has been a report about the OOM killer being invoked when swapping
      out to a dm-crypt device.  The primary reason seems to be that the
      swapout IO managed to completely deplete memory reserves.  Ondrej was
      able to bisect and explained the issue by pointing to f9054c70 ("mm,
      mempool: only set __GFP_NOMEMALLOC if there are free elements").
      
      The reason is that the swapout path is not throttled properly because
      the md-raid layer needs to allocate from the generic_make_request path
      which means it allocates from the PF_MEMALLOC context.  The dm layer uses
      mempool_alloc in order to guarantee forward progress, which used to
      inhibit access to memory reserves when using the page allocator.  This
      was changed by f9054c70 ("mm, mempool: only set __GFP_NOMEMALLOC if
      there are free elements") which has dropped the __GFP_NOMEMALLOC
      protection wh...
      4e390b2b
    • mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode · 1d2047fe
      Hugh Dickins authored
      
      
      At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
      isolate a PageWriteback page, which __unmap_and_move() then rejects with
      -EBUSY: of course the writeback might complete in between, but that's
      not what we usually expect, so probably better not to isolate it.
      
      When tested by stress-highalloc from mmtests, this has reduced the
      number of page migrate failures by 60-70%.
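
The decision described above can be reduced to one predicate. This is a
sketch under stated assumptions, not the kernel's isolation code: the enum
and function are hypothetical, and it assumes that only full synchronous
migration is prepared to wait for writeback, which is consistent with
MIGRATE_SYNC_LIGHT getting -EBUSY as noted in the message.

```c
#include <stdbool.h>

/* Stand-ins for the three migration modes. */
enum sk_migrate_mode {
	SK_MIGRATE_ASYNC,
	SK_MIGRATE_SYNC_LIGHT,
	SK_MIGRATE_SYNC,
};

/*
 * Decide whether compaction should isolate a page for migration.
 * A page under writeback is skipped unless the mode is fully
 * synchronous, because any lighter mode would only see the migration
 * rejected later.
 */
static bool should_isolate(enum sk_migrate_mode mode, bool page_under_writeback)
{
	if (page_under_writeback && mode != SK_MIGRATE_SYNC)
		return false;
	return true;
}
```

Skipping such pages at isolation time avoids the cost of isolating them and
then putting them back after a predictable failure.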
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d2047fe
    • mm: hwpoison: remove incorrect comments · 7c7fd825
      Naoya Horiguchi authored
      
      
      dequeue_hwpoisoned_huge_page() can be called without the page lock held,
      so let's remove the incorrect comment.

      The reason why the page lock is not really needed is that
      dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
      hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that
      is just allocated but not yet linked to the active list, even without
      taking the page lock.
      
      Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Zhan Chen <zhanc1@andrew.cmu.edu>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c7fd825
    • Zhou Chengming's avatar
      make __section_nr() more efficient · 91fd8b95
      Zhou Chengming authored
      
      
      When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
      section number with a subtraction directly.
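      
      As a rough illustration (a simplified sketch, not the kernel's code):
      when all sections live in one flat array, the section number is just
      the pointer's offset from the array base.  The names struct section,
      sections and section_nr_flat below are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-section descriptor. */
struct section {
    unsigned long map;
};

#define NR_SECTIONS 64

/* Assumed flat layout: every section sits in one static array. */
static struct section sections[NR_SECTIONS];

/* Section number by pointer subtraction; no search over roots is needed. */
static size_t section_nr_flat(const struct section *ms)
{
    /* The pointer must refer to an element of the flat array. */
    assert(ms >= sections && ms < sections + NR_SECTIONS);
    return (size_t)(ms - sections);
}
```

      For example, section_nr_flat(&sections[17]) evaluates to 17 without
      walking any table.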
      
      Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91fd8b95
    • Vegard Nossum's avatar
      kmemleak: don't hang if user disables scanning early · 98c42d94
      Vegard Nossum authored
      
      
      If the user tries to disable automatic scanning early in the boot
      process using e.g.:
      
        echo scan=off > /sys/kernel/debug/kmemleak
      
      then this command will hang until SECS_FIRST_SCAN (= 60) seconds have
      elapsed, even though the system is fully initialised.
      
      We can fix this using interruptible sleep and checking if we're supposed
      to stop whenever we wake up (like the rest of the code does).
      
      Link: http://lkml.kernel.org/r/1468835005-2873-1-git-send-email-vegard.nossum@oracle.com
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98c42d94
    • Dennis Chen's avatar
      arm64:acpi: fix the acpi alignment exception when 'mem=' specified · cb0a6502
      Dennis Chen authored
      
      
      When booting an ACPI enabled kernel with 'mem=x', there is the
      possibility that ACPI data regions from the firmware will lie above the
      memory limit.  Ordinarily these will be removed by
      memblock_enforce_memory_limit().
      
      Unfortunately, this means that these regions will then be mapped by
      acpi_os_ioremap() as device memory (instead of normal memory), so
      unaligned accesses will then provoke alignment faults.
      
      In this patch we adopt memblock_mem_limit_remove_map instead, and this
      preserves these ACPI data regions (marked NOMAP) thus ensuring that
      these regions are not mapped as device memory.
      
      For example, below is an alignment exception observed on an ARM platform
      when booting the kernel with 'acpi=on mem=8G':
      
        ...
        Unable to handle kernel paging request at virtual address ffff0000080521e7
        pgd = ffff000008aa0000
        [ffff0000080521e7] *pgd=000000801fffe003, *pud=000000801fffd003, *pmd=000000801fffc003, *pte=00e80083ff1c1707
        Internal error: Oops: 96000021 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc3-next-20160616+ #172
        Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1001A 02/09/2016
        task: ffff800001ef0000 ti: ffff800001ef8000 task.ti: ffff800001ef8000
        PC is at acpi_ns_lookup+0x520/0x734
        LR is at acpi_ns_lookup+0x4a4/0x734
        pc : [<ffff0000083b8b10>] lr : [<ffff0000083b8a94>] pstate: 60000045
        sp : ffff800001efb8b0
        x29: ffff800001efb8c0 x28: 000000000000001b
        x27: 0000000000000001 x26: 0000000000000000
        x25: ffff800001efb9e8 x24: ffff000008a10000
        x23: 0000000000000001 x22: 0000000000000001
        x21: ffff000008724000 x20: 000000000000001b
        x19: ffff0000080521e7 x18: 000000000000000d
        x17: 00000000000038ff x16: 0000000000000002
        x15: 0000000000000007 x14: 0000000000007fff
        x13: ffffff0000000000 x12: 0000000000000018
        x11: 000000001fffd200 x10: 00000000ffffff76
        x9 : 000000000000005f x8 : ffff000008725fa8
        x7 : ffff000008a8df70 x6 : ffff000008a8df70
        x5 : ffff000008a8d000 x4 : 0000000000000010
        x3 : 0000000000000010 x2 : 000000000000000c
        x1 : 0000000000000006 x0 : 0000000000000000
        ...
          acpi_ns_lookup+0x520/0x734
          acpi_ds_load1_begin_op+0x174/0x4fc
          acpi_ps_build_named_op+0xf8/0x220
          acpi_ps_create_op+0x208/0x33c
          acpi_ps_parse_loop+0x204/0x838
          acpi_ps_parse_aml+0x1bc/0x42c
          acpi_ns_one_complete_parse+0x1e8/0x22c
          acpi_ns_parse_table+0x8c/0x128
          acpi_ns_load_table+0xc0/0x1e8
          acpi_tb_load_namespace+0xf8/0x2e8
          acpi_load_tables+0x7c/0x110
          acpi_init+0x90/0x2c0
          do_one_initcall+0x38/0x12c
          kernel_init_freeable+0x148/0x1ec
          kernel_init+0x10/0xec
          ret_from_fork+0x10/0x40
        Code: b9009fbc 2a00037b 36380057 3219037b (b9400260)
        ---[ end trace 03381e5eb0a24de4 ]---
        Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      
      With 'efi=debug', we can see those ACPI regions loaded by firmware on
      that board as:
      
        efi:   0x0083ff185000-0x0083ff1b4fff [Reserved           |   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
        efi:   0x0083ff1b5000-0x0083ff1c2fff [ACPI Reclaim Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
        efi:   0x0083ff223000-0x0083ff224fff [ACPI Memory NVS    |   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
      
      Link: http://lkml.kernel.org/r/1468475036-5852-3-git-send-email-dennis.chen@arm.com
      Acked-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Dennis Chen <dennis.chen@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Kaly Xin <kaly.xin@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb0a6502
    • Dennis Chen's avatar
      mm/memblock.c: add new infrastructure to address the mem limit issue · a571d4eb
      Dennis Chen authored
      
      
      In some cases, memblock is queried by the kernel to determine whether a
      specified address is RAM or not.  For example, the ACPI core needs this
      information to determine which attributes to use when mapping ACPI
      regions (acpi_os_ioremap).  Use of incorrect memory types can result in
      faults, data corruption, or other issues.
      
      Removing memory with memblock_enforce_memory_limit() throws away this
      information, and so a kernel booted with 'mem=' may suffer from the
      issues described above.  To avoid this, we need to keep those NOMAP
      regions instead of removing all above the limit, which preserves the
      information we need while preventing other use of those regions.
      
      This patch adds new infrastructure to retain all NOMAP memblock regions
      while removing others, to cater for this.
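      
      The idea can be sketched in user-space C as follows.  This is a
      simplified model, not the memblock implementation: struct region,
      REGION_NOMAP and limit_keep_nomap() are hypothetical names, and the
      limit is given directly as a maximum address.

```c
#include <assert.h>
#include <stddef.h>

#define REGION_NOMAP 0x1u  /* hypothetical flag, modelled on a NOMAP marking */

struct region {
    unsigned long long base;
    unsigned long long size;
    unsigned int flags;
};

/*
 * Drop or trim everything at or above max_addr, except regions flagged
 * NOMAP, which are kept untouched.  Compacts the array in place and
 * returns the new region count.
 */
static size_t limit_keep_nomap(struct region *r, size_t n,
                               unsigned long long max_addr)
{
    size_t out = 0;

    assert(r != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        struct region cur = r[i];

        if (!(cur.flags & REGION_NOMAP)) {
            if (cur.base >= max_addr)
                continue;                       /* entirely above the limit */
            if (cur.base + cur.size > max_addr)
                cur.size = max_addr - cur.base; /* straddles the limit */
        }
        r[out++] = cur;
    }
    return out;
}
```

      A later "is this address RAM?" style query can still find the
      retained NOMAP entries, which is the information the ACPI mapping
      decision depends on.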
      
      Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com
      Signed-off-by: Dennis Chen <dennis.chen@arm.com>
      Acked-by: Steve Capper <steve.capper@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Kaly Xin <kaly.xin@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a571d4eb
    • Andy Lutomirski's avatar
      printk: when dumping regs, show the stack, not thread_info · 8b70ca65
      Andy Lutomirski authored
      
      
      We currently show:
      
        task: <current> ti: <current_thread_info()> task.ti: <task_thread_info(current)>
      
      "ti" and "task.ti" are redundant, and neither is actually what we want
      to show, which is the base of the thread stack.  Change the display to
      show the stack pointer explicitly.
      
      Link: http://lkml.kernel.org/r/543ac5bd66ff94000a57a02e11af7239571a3055.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b70ca65
    • Andy Lutomirski's avatar
      kdb: use task_cpu() instead of task_thread_info()->cpu · e558af65
      Andy Lutomirski authored
      
      
      We'll need this cleanup to make the cpu field in thread_info be
      optional.
      
      Link: http://lkml.kernel.org/r/da298328dc77ea494576c2f20a934218e758a6fa.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e558af65
    • Andy Lutomirski's avatar
      mm: fix memcg stack accounting for sub-page stacks · efdc9490
      Andy Lutomirski authored
      
      
      We should account for stacks regardless of stack size, and we need to
      account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the units
      to kilobytes and move it into account_kernel_stack().
      
      Fixes: 12580e4b ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
      Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efdc9490
    • Andy Lutomirski's avatar
      mm: track NR_KERNEL_STACK in KiB instead of number of stacks · d30dd8be
      Andy Lutomirski authored
      
      
      Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
      This only makes sense if each kernel stack exists entirely in one zone,
      and allowing vmapped stacks could break this assumption.
      
      Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
      allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
      architectures.  Keep it simple and use KiB.
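      
      The unit change amounts to simple arithmetic, sketched below under
      the assumption that every stack size is a whole multiple of 1 KiB.
      struct zone_stat and account_stack() are hypothetical names used only
      for illustration; they are not the kernel's interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone counter standing in for NR_KERNEL_STACK, in KiB. */
struct zone_stat {
    long kernel_stack_kib;
};

/* Charge (sign = +1) or uncharge (sign = -1) one stack of thread_size bytes. */
static void account_stack(struct zone_stat *z, size_t thread_size, int sign)
{
    /* The scheme relies on stack sizes being whole multiples of 1 KiB. */
    assert(thread_size % 1024 == 0);
    z->kernel_stack_kib += sign * (long)(thread_size / 1024);
}
```

      With this unit, a 16 KiB stack and a 2 KiB sub-page stack both
      contribute exact integer amounts (16 and 2), which a per-stack or
      per-page count could not express for the sub-page case.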
      
      Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d30dd8be
    • Dan Williams's avatar
      mm: cleanup ifdef guards for vmem_altmap · 11db0486
      Dan Williams authored
      
      
      Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some
      ifdef guards to just ZONE_DEVICE.
      
      Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Eric Sandeen <sandeen@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11db0486
    • Dan Williams's avatar
      mm: CONFIG_ZONE_DEVICE stop depending on CONFIG_EXPERT · c02b6aec
      Dan Williams authored
      
      
      When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
      CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
      However, now that the ZONE_DMA conflict has been eliminated it no longer
      makes sense to require CONFIG_EXPERT.
      
      Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c02b6aec
    • Christoph Hellwig's avatar
      memblock: include <asm/sections.h> instead of <asm-generic/sections.h> · c4c5ad6b
      Christoph Hellwig authored
      
      
      asm-generic headers are generic implementations for architecture
      specific code and should not be included by common code.  Thus use the
      asm/ version of sections.h to get at the linker sections.
      
      Link: http://lkml.kernel.org/r/1468285103-7470-1-git-send-email-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4c5ad6b
    • Huang Ying's avatar
      mm, THP: clean up return value of madvise_free_huge_pmd · 319904ad
      Huang Ying authored
      
      
      The meaning of the return value of madvise_free_huge_pmd was not clear
      before.  Following Minchan Kim's suggestion, change the return type to
      bool and return true if MADV_FREE was done successfully on the entire
      pmd page, and false otherwise.  Comments are added too.
      
      Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      319904ad
    • Ganesh Mahendran's avatar
      mm/zsmalloc: use helper to clear page->flags bit · 18fd06bf
      Ganesh Mahendran authored
      
      
      Use ClearPagePrivate/ClearPagePrivate2 helpers to clear
      PG_private/PG_private_2 in page->flags
      
      Link: http://lkml.kernel.org/r/1467882338-4300-7-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18fd06bf
    • Ganesh Mahendran's avatar
      mm/zsmalloc: add __init,__exit attribute · 35b3445e
      Ganesh Mahendran authored
      
      
      Add the __init/__exit attributes to functions that are only called
      during module init/exit, to save memory.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-6-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35b3445e
    • Ganesh Mahendran's avatar
      mm/zsmalloc: keep comments consistent with code · fd854463
      Ganesh Mahendran authored
      
      
      Some minor comment changes:
      
       1). update zs_malloc(),zs_create_pool() function header
       2). update "Usage of struct page fields"
      
      Link: http://lkml.kernel.org/r/1467882338-4300-5-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd854463
    • Ganesh Mahendran's avatar
      mm/zsmalloc: avoid calculate max objects of zspage twice · 64d90465
      Ganesh Mahendran authored
      
      
      Currently, if a class can not be merged, the max objects of zspage in
      that class may be calculated twice.
      
      This patch calculates the max objects of a zspage at the beginning, and
      passes the value to can_merge() to decide whether the class can be merged.
      
      This patch also removes the function get_maxobj_per_zspage(), as it has
      no other callers.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-4-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64d90465
    • Ganesh Mahendran's avatar
      mm/zsmalloc: use class->objs_per_zspage to get num of max objects · b4fd07a0
      Ganesh Mahendran authored
      
      
      The maximum number of objects in a zspage is now stored in each
      size_class, so there is no need to recalculate it.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-3-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4fd07a0
    • Ganesh Mahendran's avatar
      mm/zsmalloc: take obj index back from find_alloced_obj · cf675acb
      Ganesh Mahendran authored
      
      
      The obj index value should be updated after returning from
      find_alloced_obj() to avoid burning CPU on unnecessary object
      scanning.
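      
      A minimal sketch of the pattern, assuming a simplified slot array
      rather than zsmalloc's real object layout: the finder writes back the
      position where it stopped, so the caller resumes from there instead of
      rescanning from its old index.  find_allocated() and the slot
      representation are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scan slots[*idx .. nslots) for the first non-zero (allocated) entry.
 * Returns its value, or 0 if none remain.  *idx is updated to the position
 * where the scan stopped, so the caller does not rescan earlier slots.
 */
static unsigned long find_allocated(const unsigned long *slots, int nslots,
                                    int *idx)
{
    int i = *idx;
    unsigned long found = 0;

    assert(i >= 0);

    while (i < nslots) {
        if (slots[i]) {
            found = slots[i];
            break;
        }
        i++;
    }
    *idx = i;  /* hand the scan position back to the caller */
    return found;
}
```

      A caller that increments the returned index after handling each
      object visits every slot at most once across the whole loop.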
      
      Link: http://lkml.kernel.org/r/1467882338-4300-2-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf675acb
    • Ganesh Mahendran's avatar
      mm/zsmalloc: use obj_index to keep consistent with others · 41b88e14
      Ganesh Mahendran authored
      
      
      This is a cleanup patch.  Change "index" to "obj_index" to keep
      consistent with others in zsmalloc.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-1-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41b88e14
    • Minchan Kim's avatar
      mm: bail out in shrink_inactive_list() · 91dcade4
      Minchan Kim authored
      
      
      With node-lru, if there are enough reclaimable pages in highmem but
      nothing in lowmem, the VM can try to shrink the inactive list even
      though the requested zone is lowmem.
      
      The problem is that if the inactive list is full of highmem pages then a
      direct reclaimer searching for a lowmem page wastes CPU scanning
      uselessly.  Worse, many direct reclaimers are stalled by
      too_many_isolated if lots of parallel reclaimers are running, even
      though there is no reclaimable memory in the inactive list.
      
      I ran the experiment 4 times on a 32-bit, 2G, 8-CPU KVM machine to
      measure elapsed time.
      
      	hackbench 500 process 2
      
       = Old =
      
        1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
      
       = Now =
      
        1st: 31s  2nd: 132s 3rd: 162s 4th: 50s.
      
      [akpm@linux-foundation.org: fixes per Mel]
      Link: http://lkml.kernel.org/r/1469433119-1543-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91dcade4