  Jul 29, 2016
    • mm, page_alloc: don't retry initial attempt in slowpath · 23771235
      Vlastimil Babka authored
      
      
      After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
      kswapd, it first tries get_page_from_freelist() with the new
      alloc_flags, as it may succeed e.g. due to using min watermark instead
      of low watermark.  It makes sense to do this attempt before adjusting
      the zonelist based on alloc_flags/gfp_mask, as it's still a relatively
      fast path if we just wake up kswapd and successfully allocate.
      
      This patch therefore moves the initial attempt above the retry label
      and reorganizes the part below the retry label a bit.  We still have to
      attempt get_page_from_freelist() on each retry, as some allocations
      cannot do that as part of direct reclaim or compaction, and yet are not
      allowed to fail (even though they do a WARN_ON_ONCE() and thus should
      not exist).  We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
      and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
      it.  As a side-effect, the attempts from direct reclaim/compaction will
      also no longer obey watermarks once this is set, but there's little harm
      in that.
      
      Kswapd wakeups are also done on each retry to be safe from potential
      races resulting in kswapd going to sleep while a process (that may not
      be able to reclaim by itself) is still looping.
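
      A rough sketch of the resulting control flow (simplified and
      illustrative, not the actual mm/page_alloc.c code):

        alloc_flags = gfp_to_alloc_flags(gfp_mask);
        wake_all_kswapds(order, ac);

        /* initial attempt with the new alloc_flags, above the retry label */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;

        retry:
        /* kswapd is woken on each retry to be safe against races */
        wake_all_kswapds(order, ac);
        if (gfp_pfmemalloc_allowed(gfp_mask))
                alloc_flags = ALLOC_NO_WATERMARKS;
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;
        /* ... direct compaction and reclaim, then goto retry ... */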
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: set alloc_flags only once in slowpath · 31a6c190
      Vlastimil Babka authored
      
      
      In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
      initialized, so move the initialization above the retry: label.  Also
      make the comment above the initialization more descriptive.
      
      The only exception to alloc_flags being constant is
      ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
      allocating thread.  We can fix this, and make the code simpler and a bit
      more effective at the same time, by moving the part that determines
      ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().
      
      This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
      places in __alloc_pages_slowpath() anymore.  The only two tests for the
      flag can instead call gfp_pfmemalloc_allowed().
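
      A hedged sketch of the consolidated check, based on the description
      above (illustrative, not the verbatim kernel code):

        bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
        {
                if (unlikely(gfp_mask & __GFP_MEMALLOC))
                        return true;
                if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                        return true;
                if (!in_interrupt() &&
                    ((current->flags & PF_MEMALLOC) ||
                     unlikely(test_thread_flag(TIF_MEMDIE))))
                        return true;
                return false;
        }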
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/stackdepot.c: use __GFP_NOWARN for stack allocations · 87cc271d
      Kirill A. Shutemov authored
      
      
      This (large, atomic) allocation attempt can fail.  We expect and handle
      that, so avoid the scary warning.
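
      The change is conceptually a one-liner; a sketch of the allocation
      site (illustrative, names as in lib/stackdepot.c at the time):

        /* we expect and handle failure of this large, atomic allocation,
         * so suppress the allocation-failure warning */
        alloc_flags |= __GFP_NOWARN;
        page = alloc_pages(alloc_flags, STACK_ALLOC_ORDER);
        if (page)
                prealloc = page_address(page);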
      
      Link: http://lkml.kernel.org/r/20160720151905.GB19146@node.shutemov.name
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB · 80a9201a
      Alexander Potapenko authored
      
      
      For KASAN builds:
       - switch SLUB allocator to using stackdepot instead of storing the
         allocation/deallocation stacks in the objects;
       - change the freelist hook so that parts of the freelist can be put
         into the quarantine.
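
      A rough sketch of the first change (illustrative; the real
      structures live under mm/kasan/):

        /* store a compact stackdepot handle instead of a full stack trace */
        struct kasan_track {
                u32 pid;
                depot_stack_handle_t stack;
        };

        static void set_track(struct kasan_track *track, gfp_t flags)
        {
                track->pid = current->pid;
                track->stack = save_stack(flags); /* wraps depot_save_stack() */
        }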
      
      [aryabinin@virtuozzo.com: fixes]
        Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kasan: account for object redzone in SLUB's nearest_obj() · c146a2b9
      Alexander Potapenko authored
      When looking up the nearest SLUB object for a given address, correctly
      calculate its offset if SLAB_RED_ZONE is enabled for that cache.

      Previously, when KASAN detected an error on an object from a cache
      with SLAB_RED_ZONE set, the actual start address of the object was
      miscalculated, which led to random stacks being reported.
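
      A sketch of the corrected lookup (illustrative, simplified from
      SLUB's nearest_obj() helper):

        static inline void *nearest_obj(struct kmem_cache *cache,
                                        struct page *page, void *x)
        {
                void *object = x - (x - page_address(page)) % cache->size;
                void *last_object = page_address(page) +
                                    (page->objects - 1) * cache->size;
                void *result = (object < last_object) ? object : last_object;

                /* the fix: skip the left red zone to reach the object start */
                result = fixup_red_left(cache, result);
                return result;
        }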
      
      Fixes: 7ed2f9e6 ("mm, kasan: SLAB support")
      Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix use-after-free if memory allocation failed in vma_adjust() · 734537c9
      Kirill A. Shutemov authored
      
      
      There's one case when vma_adjust() expands the vma, overlapping with
      *two* subsequent vmas.  See case 6 of mprotect, described in the
      comment to vma_merge().
      
      To handle this (and only this) situation we iterate twice over the
      main part of the function.  See "goto again".
      
      Vegard reported[1] that he sees an out-of-bounds access complaint from
      KASAN if anon_vma_clone() on the *second* iteration fails.
      
      This happens because we free the 'next' vma by the end of the first
      iteration and have no way to undo this if anon_vma_clone() fails on
      the second iteration.
      
      The solution is to do all required allocations upfront, before we touch
      vmas.
      
      The allocation on the second iteration is only required if the first
      two vmas don't have an anon_vma, but the third does.  So we need, in
      total, one anon_vma_clone() call.

      It's easy to adjust 'exporter' to the third vma in that case.
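
      A hedged sketch of the idea (not the full patch):

        /* do the single required clone upfront, before any vma is
         * modified, so a failure here leaves everything untouched */
        if (importer && !importer->anon_vma) {
                error = anon_vma_clone(importer, exporter);
                if (error)
                        return error;
                importer->anon_vma = exporter->anon_vma;
        }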
      
      [1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com
      
      Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: Delete an unnecessary check before the function call "iput" · c3491eca
      Markus Elfring authored
      
      
      iput() tests whether its argument is NULL and then returns immediately.
      Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.
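
      The simplification Coccinelle suggests looks like this (hypothetical
      call site):

        /* before: redundant NULL check */
        if (inode)
                iput(inode);

        /* after: iput() already returns immediately for NULL */
        iput(inode);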
      
      Link: http://lkml.kernel.org/r/559cf499-4a01-25f9-c87f-24d906626a57@users.sourceforge.net
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c: fix index adjustment error in __next_mem_range_rev() · fb399b48
      zijun_hu authored
      
      
      Fix the region index adjustment error when the type_b parameter of
      __next_mem_range_rev() is NULL.
      
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
      Cc: Richard Leitner <dev@g0hl1n.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mem-hotplug: alloc new page from a nearest neighbor node when mem-offline · 394e31d2
      Xishi Qiu authored
      
      
      If we offline a node, allocate the new page from the nearest neighbor
      node instead of the current node or other remote nodes, because
      re-migration is a waste of time and the distance to remote nodes is
      often very large.

      Also use GFP_HIGHUSER_MOVABLE to allocate the new page if the zone is
      a movable or highmem zone.
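
      A sketch of the allocation callback (illustrative; simplified from
      the description above):

        static struct page *new_node_page(struct page *page, unsigned long private)
        {
                int nid = page_to_nid(page);
                nodemask_t nmask = node_online_map;

                /* exclude the node being offlined; the zonelist walk then
                 * falls back to the nearest neighbor by node distance */
                node_clear(nid, nmask);
                return __alloc_pages_nodemask(GFP_HIGHUSER_MOVABLE, 0,
                                node_zonelist(nid, GFP_HIGHUSER_MOVABLE),
                                &nmask);
        }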
      
      Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: optimize copy_page_to/from_iter_iovec · 3fa6c507
      Mikulas Patocka authored
      
      
      copy_page_to_iter_iovec() and copy_page_from_iter_iovec() copy some data
      to userspace or from userspace.  These functions have a fast path where
      they map a page using kmap_atomic and a slow path where they use kmap.
      
      kmap is slower than kmap_atomic, so the fast path is preferred.
      
      However, on kernels without highmem support, kmap just calls
      page_address, so there is no need to avoid kmap.  On kernels without
      highmem support, the fast path just increases code size (and cache
      footprint) and it doesn't improve copy performance in any way.
      
      This patch enables the fast path only if CONFIG_HIGHMEM is defined.
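
      A sketch of the guard (illustrative; "fast_path_possible" is a
      stand-in for the real atomic-copy precondition):

        if (IS_ENABLED(CONFIG_HIGHMEM) && fast_path_possible) {
                /* fast path: kmap_atomic() is cheaper than kmap() */
                kaddr = kmap_atomic(page);
                /* ... atomic copy ... */
                kunmap_atomic(kaddr);
        } else {
                /* without highmem, kmap() is just page_address() anyway */
                kaddr = kmap(page);
                /* ... sleeping copy ... */
                kunmap(page);
        }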
      
      Code size reduced by this patch:
        x86 (without highmem)	  928
        x86-64		  960
        sparc64		  848
        alpha			 1136
        pa-risc		 1200
      
      [akpm@linux-foundation.org: use IS_ENABLED(), per Andi]
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221711410.4818@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add cond_resched() to generic_swapfile_activate() · 7e4411bf
      Mikulas Patocka authored
      
      
      generic_swapfile_activate() can take quite a long time, as it iterates
      over all blocks of a file, so add cond_resched() to it.  I observed
      stalls of about 1 second when activating a swapfile that was almost
      unfragmented - this patch fixes it.
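
      A sketch of where the yield goes (illustrative, based on the block
      walk in generic_swapfile_activate()):

        while (probe_block + blocks_per_page <= last_block) {
                cond_resched();  /* the walk may cover very many blocks */
                /* ... map probe_block and check page alignment ... */
                probe_block++;
        }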
      
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements" · 4e390b2b
      Michal Hocko authored
      This reverts commit f9054c70 ("mm, mempool: only set __GFP_NOMEMALLOC
      if there are free elements").
      
      There has been a report about the OOM killer being invoked when
      swapping out to a dm-crypt device.  The primary reason seems to be
      that the swapout IO managed to completely deplete memory reserves.
      Ondrej was able to bisect and explain the issue by pointing to
      f9054c70 ("mm, mempool: only set __GFP_NOMEMALLOC if there are free
      elements").
      
      The reason is that the swapout path is not throttled properly because
      the md-raid layer needs to allocate from the generic_make_request
      path, which means it allocates from the PF_MEMALLOC context.  The dm
      layer uses mempool_alloc() in order to guarantee forward progress,
      which used to inhibit access to memory reserves when using the page
      allocator.  This has changed by f9054c70 ("mm, mempool: only set
      __GFP_NOMEMALLOC if there are free elements") which has dropped the
      __GFP_NOMEMALLOC protection wh...
    • mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode · 1d2047fe
      Hugh Dickins authored
      
      
      At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
      isolate a PageWriteback page, which __unmap_and_move() then rejects with
      -EBUSY: of course the writeback might complete in between, but that's
      not what we usually expect, so probably better not to isolate it.
      
      When tested by stress-highalloc from mmtests, this has reduced the
      number of page migrate failures by 60-70%.
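
      The isolation-side change is roughly this (illustrative; my reading
      of the description above, not the verbatim patch):

        /* treat anything short of full MIGRATE_SYNC as "async" for
         * isolation, so __isolate_lru_page() also skips PageWriteback
         * pages in MIGRATE_SYNC_LIGHT mode */
        const isolate_mode_t isolate_mode =
                (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
                (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);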
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hwpoison: remove incorrect comments · 7c7fd825
      Naoya Horiguchi authored
      
      
      dequeue_hwpoisoned_huge_page() can be called without the page lock
      held, so let's remove the incorrect comment.
      
      The reason why the page lock is not really needed is that
      dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
      hugetlb_lock, which allows us to avoid trying to dequeue a hugepage
      that is just allocated but not yet linked to the active list, even
      without taking the page lock.
      
      Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Zhan Chen <zhanc1@andrew.cmu.edu>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make __section_nr() more efficient · 91fd8b95
      Zhou Chengming authored
      
      
      When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
      section number with a subtraction directly.
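
      The non-EXTREME case reduces to pointer arithmetic (illustrative):

        #ifndef CONFIG_SPARSEMEM_EXTREME
        int __section_nr(struct mem_section *ms)
        {
                /* mem_section is one flat array: a subtraction suffices */
                return (int)(ms - mem_section[0]);
        }
        #endif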
      
      Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemleak: don't hang if user disables scanning early · 98c42d94
      Vegard Nossum authored
      
      
      If the user tries to disable automatic scanning early in the boot
      process using e.g.:
      
        echo scan=off > /sys/kernel/debug/kmemleak
      
      then this command will hang until SECS_FIRST_SCAN (= 60) seconds have
      elapsed, even though the system is fully initialised.
      
      We can fix this using interruptible sleep and checking if we're supposed
      to stop whenever we wake up (like the rest of the code does).
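
      A sketch of the fix (illustrative; the real loop lives in the
      kmemleak scan thread):

        /* sleep interruptibly and bail out early if scanning stops */
        timeout = msecs_to_jiffies(SECS_FIRST_SCAN * 1000);
        while (timeout && !kthread_should_stop())
                timeout = schedule_timeout_interruptible(timeout);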
      
      Link: http://lkml.kernel.org/r/1468835005-2873-1-git-send-email-vegard.nossum@oracle.com
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arm64:acpi: fix the acpi alignment exception when 'mem=' specified · cb0a6502
      Dennis Chen authored
      
      
      When booting an ACPI enabled kernel with 'mem=x', there is the
      possibility that ACPI data regions from the firmware will lie above the
      memory limit.  Ordinarily these will be removed by
      memblock_enforce_memory_limit(.).
      
      Unfortunately, this means that these regions will then be mapped by
      acpi_os_ioremap(.) as device memory (instead of normal) and unaligned
      accesses will then provoke alignment faults.
      
      In this patch we adopt memblock_mem_limit_remove_map instead, and this
      preserves these ACPI data regions (marked NOMAP) thus ensuring that
      these regions are not mapped as device memory.
      
      For example, below is an alignment exception observed on ARM platform
      when booting the kernel with 'acpi=on mem=8G':
      
        ...
        Unable to handle kernel paging request at virtual address ffff0000080521e7
        pgd = ffff000008aa0000
        [ffff0000080521e7] *pgd=000000801fffe003, *pud=000000801fffd003, *pmd=000000801fffc003, *pte=00e80083ff1c1707
        Internal error: Oops: 96000021 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc3-next-20160616+ #172
        Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1001A 02/09/2016
        task: ffff800001ef0000 ti: ffff800001ef8000 task.ti: ffff800001ef8000
        PC is at acpi_ns_lookup+0x520/0x734
        LR is at acpi_ns_lookup+0x4a4/0x734
        pc : [<ffff0000083b8b10>] lr : [<ffff0000083b8a94>] pstate: 60000045
        sp : ffff800001efb8b0
        x29: ffff800001efb8c0 x28: 000000000000001b
        x27: 0000000000000001 x26: 0000000000000000
        x25: ffff800001efb9e8 x24: ffff000008a10000
        x23: 0000000000000001 x22: 0000000000000001
        x21: ffff000008724000 x20: 000000000000001b
        x19: ffff0000080521e7 x18: 000000000000000d
        x17: 00000000000038ff x16: 0000000000000002
        x15: 0000000000000007 x14: 0000000000007fff
        x13: ffffff0000000000 x12: 0000000000000018
        x11: 000000001fffd200 x10: 00000000ffffff76
        x9 : 000000000000005f x8 : ffff000008725fa8
        x7 : ffff000008a8df70 x6 : ffff000008a8df70
        x5 : ffff000008a8d000 x4 : 0000000000000010
        x3 : 0000000000000010 x2 : 000000000000000c
        x1 : 0000000000000006 x0 : 0000000000000000
        ...
          acpi_ns_lookup+0x520/0x734
          acpi_ds_load1_begin_op+0x174/0x4fc
          acpi_ps_build_named_op+0xf8/0x220
          acpi_ps_create_op+0x208/0x33c
          acpi_ps_parse_loop+0x204/0x838
          acpi_ps_parse_aml+0x1bc/0x42c
          acpi_ns_one_complete_parse+0x1e8/0x22c
          acpi_ns_parse_table+0x8c/0x128
          acpi_ns_load_table+0xc0/0x1e8
          acpi_tb_load_namespace+0xf8/0x2e8
          acpi_load_tables+0x7c/0x110
          acpi_init+0x90/0x2c0
          do_one_initcall+0x38/0x12c
          kernel_init_freeable+0x148/0x1ec
          kernel_init+0x10/0xec
          ret_from_fork+0x10/0x40
        Code: b9009fbc 2a00037b 36380057 3219037b (b9400260)
        ---[ end trace 03381e5eb0a24de4 ]---
        Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      
      With 'efi=debug', we can see those ACPI regions loaded by firmware on
      that board as:
      
        efi:   0x0083ff185000-0x0083ff1b4fff [Reserved           |   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
        efi:   0x0083ff1b5000-0x0083ff1c2fff [ACPI Reclaim Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
        efi:   0x0083ff223000-0x0083ff224fff [ACPI Memory NVS    |   |  |  |  |  |  |  |   |WB|WT|WC|UC]*
      
      Link: http://lkml.kernel.org/r/1468475036-5852-3-git-send-email-dennis.chen@arm.com
      Acked-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Dennis Chen <dennis.chen@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Kaly Xin <kaly.xin@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c: add new infrastructure to address the mem limit issue · a571d4eb
      Dennis Chen authored
      
      
      In some cases, memblock is queried by the kernel to determine whether
      a specified address is RAM or not.  For example, the ACPI core needs
      this information to determine which attributes to use when mapping
      ACPI regions (acpi_os_ioremap).  Use of incorrect memory types can
      result in faults, data corruption, or other issues.
      
      Removing memory with memblock_enforce_memory_limit() throws away this
      information, and so a kernel booted with 'mem=' may suffer from the
      issues described above.  To avoid this, we need to keep those NOMAP
      regions instead of removing all above the limit, which preserves the
      information we need while preventing other use of those regions.
      
      This patch adds new infrastructure to retain all NOMAP memblock regions
      while removing others, to cater for this.
      
      Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com
      Signed-off-by: Dennis Chen <dennis.chen@arm.com>
      Acked-by: Steve Capper <steve.capper@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Kaly Xin <kaly.xin@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: when dumping regs, show the stack, not thread_info · 8b70ca65
      Andy Lutomirski authored
      
      
      We currently show:
      
        task: <current> ti: <current_thread_info()> task.ti: <task_thread_info(current)>"
      
      "ti" and "task.ti" are redundant, and neither is actually what we want
      to show, which the the base of the thread stack.  Change the display to
      show the stack pointer explicitly.
      
      Link: http://lkml.kernel.org/r/543ac5bd66ff94000a57a02e11af7239571a3055.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdb: use task_cpu() instead of task_thread_info()->cpu · e558af65
      Andy Lutomirski authored
      
      
      We'll need this cleanup to make the cpu field in thread_info be
      optional.
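
      The substitution itself (hypothetical call site):

        /* before */
        kdb_printf("%d", task_thread_info(p)->cpu);

        /* after: go through the accessor so thread_info need not carry cpu */
        kdb_printf("%d", task_cpu(p));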
      
      Link: http://lkml.kernel.org/r/da298328dc77ea494576c2f20a934218e758a6fa.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix memcg stack accounting for sub-page stacks · efdc9490
      Andy Lutomirski authored
      We should account for stacks regardless of stack size, and we need to
      account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
      units to kilobytes and move the accounting into
      account_kernel_stack().
      
      Fixes: 12580e4b ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
      Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: track NR_KERNEL_STACK in KiB instead of number of stacks · d30dd8be
      Andy Lutomirski authored
      
      
      Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
      This only makes sense if each kernel stack exists entirely in one zone,
      and allowing vmapped stacks could break this assumption.
      
      Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
      allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
      architectures.  Keep it simple and use KiB.
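
      A sketch of the accounting in KiB (illustrative):

        /* account is +1 on stack allocation and -1 on free */
        mod_zone_page_state(page_zone(virt_to_page(tsk->stack)),
                            NR_KERNEL_STACK_KB,
                            account * (THREAD_SIZE / 1024));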
      
      Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cleanup ifdef guards for vmem_altmap · 11db0486
      Dan Williams authored
      
      
      Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some
      ifdef guards to just ZONE_DEVICE.
      
      Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Eric Sandeen <sandeen@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: CONFIG_ZONE_DEVICE stop depending on CONFIG_EXPERT · c02b6aec
      Dan Williams authored
      
      
      When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
      CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
      However, now that the ZONE_DMA conflict has been eliminated it no longer
      makes sense to require CONFIG_EXPERT.
      
      Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: include <asm/sections.h> instead of <asm-generic/sections.h> · c4c5ad6b
      Christoph Hellwig authored
      
      
      asm-generic headers are generic implementations for architecture
      specific code and should not be included by common code.  Thus use the
      asm/ version of sections.h to get at the linker sections.
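
      In essence:

        /* before */
        #include <asm-generic/sections.h>

        /* after: the asm/ header is the architecture's entry point and
         * may simply forward to the generic one */
        #include <asm/sections.h>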
      
      Link: http://lkml.kernel.org/r/1468285103-7470-1-git-send-email-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, THP: clean up return value of madvise_free_huge_pmd · 319904ad
      Huang Ying authored
      
      
      The definition of the return value of madvise_free_huge_pmd() was not
      clear before.  Following Minchan Kim's suggestion, change the return
      type to bool and return true if we do MADV_FREE successfully on the
      entire pmd page, otherwise return false.  Comments are added too.
      
      Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: use helper to clear page->flags bit · 18fd06bf
      Ganesh Mahendran authored
      
      
      Use the ClearPagePrivate/ClearPagePrivate2 helpers to clear
      PG_private/PG_private_2 in page->flags.
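
      For example (hypothetical call site):

        /* before: open-coded flag manipulation */
        clear_bit(PG_private, &page->flags);
        clear_bit(PG_private_2, &page->flags);

        /* after: the standard page-flag helpers */
        ClearPagePrivate(page);
        ClearPagePrivate2(page);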
      
      Link: http://lkml.kernel.org/r/1467882338-4300-7-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: add __init,__exit attribute · 35b3445e
      Ganesh Mahendran authored
      
      
      Add the __init/__exit attributes to functions that are only called at
      module init/exit, to save memory.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-6-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: keep comments consistent with code · fd854463
      Ganesh Mahendran authored
      
      
      Some minor comment changes:

       1) update the zs_malloc()/zs_create_pool() function headers
       2) update "Usage of struct page fields"
      
      Link: http://lkml.kernel.org/r/1467882338-4300-5-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: avoid calculate max objects of zspage twice · 64d90465
      Ganesh Mahendran authored
      
      
      Currently, if a class can not be merged, the max objects of a zspage
      in that class may be calculated twice.

      This patch calculates the max objects of a zspage at the beginning,
      and passes the value to can_merge() to decide whether the class can
      be merged.

      Also this patch removes the function get_maxobj_per_zspage(), as
      there is no other place that calls it.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-4-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: use class->objs_per_zspage to get num of max objects · b4fd07a0
      Ganesh Mahendran authored
      
      
      The max number of objects in a zspage is stored in each size_class
      now, so there is no need to re-calculate it.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-3-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: take obj index back from find_alloced_obj · cf675acb
      Ganesh Mahendran authored
      
      
      The obj index value should be updated after returning from
      find_alloced_obj() to avoid burning CPU on unnecessary object
      scanning.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-2-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: use obj_index to keep consistent with others · 41b88e14
      Ganesh Mahendran authored
      
      
      This is a cleanup patch.  Rename "index" to "obj_index" to stay
      consistent with the rest of zsmalloc.
      
      Link: http://lkml.kernel.org/r/1467882338-4300-1-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bail out in shrink_inactive_list() · 91dcade4
      Minchan Kim authored
      
      
      With node-lru, if there are enough reclaimable pages in highmem but
      nothing in lowmem, the VM can try to shrink the inactive list although
      the requested zone is lowmem.

      The problem is that if the inactive list is full of highmem pages
      then a direct reclaimer searching for a lowmem page wastes CPU
      scanning it uselessly.  Worse, many direct reclaimers are stalled by
      too_many_isolated if lots of parallel reclaimers are running, even
      though there is no reclaimable memory in the inactive list.
      
      I ran the experiment 4 times on a 32-bit 2G 8-CPU KVM machine to get
      the elapsed times.
      
      	hackbench 500 process 2
      
       = Old =
      
        1st: 289s 2nd: 310s 3rd: 112s 4th: 272s
      
       = Now =
      
        1st: 31s  2nd: 132s 3rd: 162s 4th: 50s.
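
      A sketch of the bail-out (illustrative, simplified):

        /* in shrink_inactive_list(): if no zone eligible for this
         * request (zone index <= sc->reclaim_idx) holds any inactive
         * pages, this scan cannot reclaim anything - bail out */
        if (!inactive_reclaimable_pages(lruvec, sc, lru))
                return 0;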
      
      [akpm@linux-foundation.org: fixes per Mel]
      Link: http://lkml.kernel.org/r/1469433119-1543-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: account for skipped pages as a partial scan · d7f05528
      Mel Gorman authored
      
      
      Page reclaim determines whether a pgdat is unreclaimable by examining
      how many pages have been scanned since a page was freed and comparing
      that to the LRU sizes.  Skipped pages are not reclaim candidates but
      contribute to scanned.  This can prematurely mark a pgdat as
      unreclaimable and trigger an OOM kill.
      
      This patch accounts for skipped pages as a partial scan so that an
      unreclaimable pgdat will still be marked as such but by scaling the cost
      of a skip, it'll avoid the pgdat being marked prematurely.
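
      Conceptually (illustrative; the exact scaling factor is a heuristic,
      shown here as a hypothetical shift):

        /* count skipped pages as a fraction of a real scan, so the
         * pgdat can still go unreclaimable, just not prematurely */
        for (zid = 0; zid < MAX_NR_ZONES; zid++)
                total_skipped += nr_skipped[zid];
        nr_scanned += total_skipped >> 3;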
      
      Link: http://lkml.kernel.org/r/1469110261-7365-6-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: consider whether to decivate based on eligible zones inactive ratio · f8d1a311
      Mel Gorman authored
      
      
      Minchan Kim reported that with per-zone lru state it was possible to
      identify that a normal zone with 8^M anonymous pages could trigger
      OOM with non-atomic order-0 allocations, as all pages in the zone
      were on the active list.
      
         gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
         Call Trace:
           __alloc_pages_nodemask+0xe52/0xe60
           ? new_slab+0x39c/0x3b0
           new_slab+0x39c/0x3b0
           ___slab_alloc.constprop.87+0x6da/0x840
           ? __alloc_skb+0x3c/0x260
           ? enqueue_task_fair+0x73/0xbf0
           ? poll_select_copy_remaining+0x140/0x140
           __slab_alloc.isra.81.constprop.86+0x40/0x6d
           ? __alloc_skb+0x3c/0x260
           kmem_cache_alloc+0x22c/0x260
           ? __alloc_skb+0x3c/0x260
           __alloc_skb+0x3c/0x260
           alloc_skb_with_frags+0x4e/0x1a0
           sock_alloc_send_pskb+0x16a/0x1b0
           ? wait_for_unix_gc+0x31/0x90
           unix_stream_sendmsg+0x28d/0x340
           sock_sendmsg+0x2d/0x40
           sock_write_iter+0x6c/0xc0
           __vfs_write+0xc0/0x120
           vfs_write+0x9b/0x1a0
           ? __might_fault+0x49/0xa0
           SyS_write+0x44/0x90
           do_fast_syscall_32+0xa6/0x1e0
      
         Mem-Info:
         active_anon:101103 inactive_anon:102219 isolated_anon:0
          active_file:503 inactive_file:544 isolated_file:0
          unevictable:0 dirty:0 writeback:34 unstable:0
          slab_reclaimable:6298 slab_unreclaimable:74669
          mapped:863 shmem:0 pagetables:100998 bounce:0
          free:23573 free_pcp:1861 free_cma:0
         Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
         DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
         lowmem_reserve[]: 0 809 1965 1965
         Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
         lowmem_reserve[]: 0 0 9247 9247
         HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
         lowmem_reserve[]: 0 0 0 0
         DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
         Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
         HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
         Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
         54409 total pagecache pages
         53215 pages in swap cache
         Swap cache stats: add 300982, delete 247765, find 157978/226539
         Free swap  = 3803244kB
         Total swap = 4192252kB
         524186 pages RAM
         295934 pages HighMem/MovableOnly
         9642 pages reserved
         0 pages cma reserved
      
      The problem is due to the active deactivation logic in
      inactive_list_is_low:
      
      	Node 0 active_anon:404412kB inactive_anon:409040kB
      
      IOW, (inactive_anon of node * inactive_ratio > active_anon of node)
      holds due to the highmem anonymous stat, so the VM never deactivates
      the normal zone's anonymous pages.
      
      This patch is a modified version of Minchan's original solution.  The
      problem with Minchan's patch was that any low zone with an imbalanced
      list could force a rotation.
      
      In this patch, a zone-constrained global reclaim will rotate the list if
      the inactive/active ratio of all eligible zones needs to be corrected.
      It is possible that higher zone pages will be initially rotated
      prematurely but this is the safer choice to maintain overall LRU age.
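
      A sketch of the eligible-zone ratio check (illustrative; the
      per-zone helpers are hypothetical names):

        unsigned long inactive = 0, active = 0;
        int zid;

        /* sum only over zones usable for this allocation request */
        for (zid = 0; zid <= sc->reclaim_idx; zid++) {
                inactive += zone_inactive_lru_pages(pgdat, zid);
                active += zone_active_lru_pages(pgdat, zid);
        }
        return inactive * inactive_ratio < active;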
      
      Link: http://lkml.kernel.org/r/20160722090929.GJ10438@techsingularity.net
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove reclaim and compaction retry approximations · 5a1c84b4
      Mel Gorman authored
      
      
      If per-zone LRU accounting is available then there is no point
      approximating whether reclaim and compaction should retry based on pgdat
      statistics.  This is effectively a revert of "mm, vmstat: remove zone
      and node double accounting by approximating retries" with the
      difference that inactive/active stats are still available.  This
      preserves the history of why the approximation was introduced and why
      it had to be reverted to handle OOM kills on 32-bit systems.
      
      Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: remove highmem_file_pages · bb4cc2be
      Mel Gorman authored
      
      
      With the reintroduction of per-zone LRU stats, highmem_file_pages is
      redundant so remove it.
      
      [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory]
        Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.net
      Link: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add per-zone lru list stat · 71c799f4
      Minchan Kim authored
      
      
      When I did a stress test with hackbench, I got OOM messages
      frequently, which never happened with zone-lru.
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        ..
        ..
         __alloc_pages_nodemask+0xe52/0xe60
         ? new_slab+0x39c/0x3b0
         new_slab+0x39c/0x3b0
         ___slab_alloc.constprop.87+0x6da/0x840
         ? __alloc_skb+0x3c/0x260
         ? _raw_spin_unlock_irq+0x27/0x60
         ? trace_hardirqs_on_caller+0xec/0x1b0
         ? finish_task_switch+0xa6/0x220
         ? poll_select_copy_remaining+0x140/0x140
         __slab_alloc.isra.81.constprop.86+0x40/0x6d
         ? __alloc_skb+0x3c/0x260
         kmem_cache_alloc+0x22c/0x260
         ? __alloc_skb+0x3c/0x260
         __alloc_skb+0x3c/0x260
         alloc_skb_with_frags+0x4e/0x1a0
         sock_alloc_send_pskb+0x16a/0x1b0
         ? wait_for_unix_gc+0x31/0x90
         ? alloc_set_pte+0x2ad/0x310
         unix_stream_sendmsg+0x28d/0x340
         sock_sendmsg+0x2d/0x40
         sock_write_iter+0x6c/0xc0
         __vfs_write+0xc0/0x120
         vfs_write+0x9b/0x1a0
         ? __might_fault+0x49/0xa0
         SyS_write+0x44/0x90
         do_fast_syscall_32+0xa6/0x1e0
         sysenter_past_esp+0x45/0x74
      
        Mem-Info:
        active_anon:104698 inactive_anon:105791 isolated_anon:192
         active_file:433 inactive_file:283 isolated_file:22
         unevictable:0 dirty:0 writeback:296 unstable:0
         slab_reclaimable:6389 slab_unreclaimable:78927
         mapped:474 shmem:0 pagetables:101426 bounce:0
         free:10518 free_pcp:334 free_cma:0
        Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
        DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
        Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
        HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        25121 total pagecache pages
        24160 pages in swap cache
        Swap cache stats: add 86371, delete 62211, find 42865/60187
        Free swap  = 4015560kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9658 pages reserved
        0 pages cma reserved
      
      The order-0 allocation for the normal zone failed while there was a
      lot of reclaimable memory (i.e., anonymous memory with free swap).  I
      wanted to analyze the problem but it was hard because we had removed
      the per-zone lru stat, so I couldn't tell how much anonymous memory
      there was in the normal/dma zones.

      When we investigate an OOM problem, the reclaimable memory count is a
      crucial stat to find the problem.  Without it, it's hard to parse the
      OOM message, so I believe we should keep it.
      
      With per-zone lru stat,
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        Mem-Info:
        active_anon:101103 inactive_anon:102219 isolated_anon:0
         active_file:503 inactive_file:544 isolated_file:0
         unevictable:0 dirty:0 writeback:34 unstable:0
         slab_reclaimable:6298 slab_unreclaimable:74669
         mapped:863 shmem:0 pagetables:100998 bounce:0
         free:23573 free_pcp:1861 free_cma:0
        Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
        DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
        Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
        HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        54409 total pagecache pages
        53215 pages in swap cache
        Swap cache stats: add 300982, delete 247765, find 157978/226539
        Free swap  = 3803244kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9642 pages reserved
        0 pages cma reserved
      
      With that, we can see the normal zone has 86M of reclaimable memory,
      so we can tell something goes wrong in reclaim (I will fix the
      problem in the next patch).
      
      [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
       Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
      Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: release/reacquire lru_lock on pgdat change · 785b99fe
      Mel Gorman authored
      
      
      With node-lru, the locking is based on the pgdat.  As Minchan pointed
      out, there is an opportunity to reduce LRU lock release/acquire in
      check_move_unevictable_pages by only changing the lock on a pgdat
      change.
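
      A sketch of the pattern (illustrative):

        struct pglist_data *locked_pgdat = NULL;

        for (i = 0; i < nr_pages; i++) {
                struct pglist_data *pgdat = page_pgdat(pages[i]);

                /* only cycle the LRU lock when crossing into another pgdat */
                if (pgdat != locked_pgdat) {
                        if (locked_pgdat)
                                spin_unlock_irq(&locked_pgdat->lru_lock);
                        locked_pgdat = pgdat;
                        spin_lock_irq(&locked_pgdat->lru_lock);
                }
                /* ... move the page between LRU lists ... */
        }
        if (locked_pgdat)
                spin_unlock_irq(&locked_pgdat->lru_lock);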
      
      [mgorman@techsingularity.net: remove double initialisation]
        Link: http://lkml.kernel.org/r/20160719074835.GC10438@techsingularity.net
      Link: http://lkml.kernel.org/r/1468853426-12858-3-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>