  1. Apr 06, 2018
    • block_invalidatepage(): only release page if the full page was invalidated · 3172485f
      Jeff Moyer authored
      Prior to commit d47992f8 ("mm: change invalidatepage prototype to
      accept length"), an offset of 0 meant that the full page was being
      invalidated.  After that commit, we need to instead check the length.
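
      A minimal sketch of the resulting check (assuming the fs/buffer.c shape
      of block_invalidatepage(); only the relevant lines are shown):

          void block_invalidatepage(struct page *page, unsigned int offset,
                                    unsigned int length)
          {
                  /* ... drop the affected buffers ... */

                  /*
                   * Releasing buffers is only valid when the whole page is
                   * invalidated; with partial-page invalidation we must key
                   * this off the length, not "offset == 0".
                   */
                  if (length == PAGE_SIZE)
                          try_to_release_page(page, 0);
          }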
      
      Jan said:
      :
      : The only possible issue is that try_to_release_page() was called more
      : often than necessary.  Otherwise the issue is harmless but still it's good
      : to have this fixed.
      
      Link: http://lkml.kernel.org/r/x49fu5rtnzs.fsf@segfault.boston.devel.redhat.com
      Fixes: d47992f8 ("mm: change invalidatepage prototype to accept length")
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3172485f
    • mm: kernel-doc: add missing parameter descriptions · e8b098fc
      Mike Rapoport authored
      
      
      Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8b098fc
    • Mike Rapoport's avatar
      mm/swap.c: remove @cold parameter description for release_pages() · 002843de
      Mike Rapoport authored
      The 'cold' parameter was removed from the release_pages() function by
      commit c6f92f9f ("mm: remove cold parameter for release_pages").
      
      Update the description to match the code.
      
      Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      002843de
    • mm/nommu: remove description of alloc_vm_area · e48e3c59
      Mike Rapoport authored
      
      
      The alloc_vm_area() in nommu is a stub, but its description states it
      allocates kernel address space.  Remove the description to make the code
      and the documentation agree.
      
      Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e48e3c59
    • zram: drop max_zpage_size and use zs_huge_class_size() · 60f5921a
      Sergey Senozhatsky authored
      
      
      Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class
      watermark instead, which makes more sense.
      
      TEST
      - I used a 1G zram device with the LZO compression back-end; the
        original data set size was 444MB. Judging by the zsmalloc class
        stats, the test ended up being pretty fair.
      
      BASE ZRAM/ZSMALLOC
      =====================
      zram mm_stat
      
      498978816 191482495 199831552        0 199831552    15634        0
      
      zsmalloc classes
      
       class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
      ...
         151  2448           0            0          1240       1240        744                3        0
         168  2720           0            0          4200       4200       2800                2        0
         190  3072           0            0         10100      10100       7575                3        0
         202  3264           0            0           380        380        304                4        0
         254  4096           0            0         10620      10620      10620                1        0
      
       Total                 7           46        106982     106187      48787                         0
      
      PATCHED ZRAM/ZSMALLOC
      =====================
      
      zram mm_stat
      
      498978816 182579184 194248704        0 194248704    15628        0
      
      zsmalloc classes
      
       class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
      ...
         151  2448           0            0          1240       1240        744                3        0
         168  2720           0            0          4200       4200       2800                2        0
         190  3072           0            0         10100      10100       7575                3        0
         202  3264           0            0          7180       7180       5744                4        0
         254  4096           0            0          3820       3820       3820                1        0
      
       Total                 8           45        106959     106193      47424                         0
      
      As we can see, we reduced the number of objects stored in class-4096,
      because a huge number of objects which we previously forcibly stored in
      class-4096 are now stored in the non-huge class-3264.  This results in
      lower memory consumption:
      
      - zsmalloc now uses 47424 physical pages, which is less than 48787 pages
        zsmalloc used before.
      
      - objects that we store in class-3264 share zspages.  That's why overall
        the number of pages that both class-4096 and class-3264 consumed went
        down from 10924 to 9564.
      
      [sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
        Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20180306070639.7389-3-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60f5921a
    • zsmalloc: introduce zs_huge_class_size() · 010b495e
      Sergey Senozhatsky authored
      
      
      Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.
      
      ZRAM's max_zpage_size is a bad thing.  It forces zsmalloc to store
      normal objects as huge ones, which results in bigger zsmalloc memory
      usage.  Drop it and use the actual zsmalloc huge-class value when
      deciding if the object is huge or not.
      
      This patch (of 2):
      
      Not every object can share its zspage with other objects, e.g.  when the
      object is as big as a zspage or nearly as big as a zspage.  For such
      objects zsmalloc has a so-called huge class - every object which belongs
      to a huge class consumes the entire zspage (which consists of a physical
      page).  On an x86_64, PAGE_SHIFT 12 box, the first non-huge class size
      is 3264, so starting down from size 3264, objects can share page(-s) and
      thus minimize memory wastage.
      
      ZRAM, however, has its own statically defined watermark for huge
      objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
      object larger than this watermark (3072) as a PAGE_SIZE object, in other
      words, to a huge class, while zsmalloc can keep some of those objects in
      non-huge classes.  This results in increased memory consumption.
      
      zsmalloc knows better if the object is huge or not.  Introduce the
      zs_huge_class_size() function, which tells if the given object can be
      stored in one of the non-huge classes or not.  This will let us drop
      ZRAM's huge object watermark and fully rely on zsmalloc when we decide
      if the object is huge.
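
      A rough sketch of the intended usage on the zram side (hypothetical
      variable names; the zram patch in this series does the real conversion):

          /* Cached once when the zsmalloc pool is created. */
          size_t huge_class_size = zs_huge_class_size(pool);

          /*
           * Instead of comparing against a static 3 * PAGE_SIZE / 4
           * watermark, ask zsmalloc: anything that would land in the huge
           * class anyway is stored as a full PAGE_SIZE object.
           */
          if (comp_len >= huge_class_size)
                  comp_len = PAGE_SIZE;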
      
      [sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
        Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      010b495e
    • mm: fix races between swapoff and flush dcache · cb9f753a
      Huang Ying authored
      Thanks to commit 4b3ef9da ("mm/swap: split swap cache into 64MB
      trunks"), after swapoff the address_space associated with the swap
      device will be freed.  So page_mapping() users which may touch the
      address_space need some kind of mechanism to prevent the address_space
      from being freed while it is being accessed.
      
      The dcache flushing functions (flush_dcache_page(), etc) in architecture
      specific code may access the address_space of swap device for anonymous
      pages in swap cache via page_mapping() function.  But in some cases
      there are no mechanisms to prevent the swap device from being swapoff,
      for example,
      
        CPU1					CPU2
        __get_user_pages()			swapoff()
          flush_dcache_page()
            mapping = page_mapping()
              ...				  exit_swap_address_space()
              ...				    kvfree(spaces)
              mapping_mapped(mapping)
      
      The address space may be accessed after being freed.
      
      But according to cachetlb.txt and Russell King, flush_dcache_page() only
      cares about file cache pages; for anonymous pages, flush_anon_page()
      should be used.  The implementation of flush_dcache_page() in all
      architectures follows this too.  They check whether page_mapping() is
      NULL and whether mapping_mapped() is true to determine whether to flush
      the dcache immediately, and they use the interval tree (mapping->i_mmap)
      to find all user space mappings, while mapping_mapped() and
      mapping->i_mmap aren't used by anonymous pages in swap cache at all.
      
      So, to fix the race between swapoff and dcache flushing,
      page_mapping_file() is added to return the address_space for file cache
      pages and NULL otherwise.  All page_mapping() calls in the dcache
      flushing functions are replaced with page_mapping_file().
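
      A sketch of the helper described above (assumed shape after the akpm
      simplification noted below):

          static inline struct address_space *page_mapping_file(struct page *page)
          {
                  /*
                   * Anonymous pages in swap cache must not expose the swap
                   * device's address_space to dcache flushing code, so only
                   * file cache pages return a mapping.
                   */
                  if (unlikely(PageSwapCache(page)))
                          return NULL;
                  return page_mapping(page);
          }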
      
      [akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
      Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb9f753a
    • fs/direct-io.c: minor cleanups in do_blockdev_direct_IO · 1c0ff0f1
      Nikolay Borisov authored
      
      
      We already get the block counts and calculate the end block at the
      beginning of the function.  Let's use the local variables for
      consistency and readability.  No functional changes
      
      [akpm@linux-foundation.org: constify the locals to prevent future slipups]
      Link: http://lkml.kernel.org/r/1519638870-17756-1-git-send-email-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c0ff0f1
    • include/linux/mm.h: provide consistent declaration for num_poisoned_pages · 5844a486
      Guenter Roeck authored
      
      
      clang reports the following compile warning.
      
        In file included from mm/vmscan.c:56:
        ./include/linux/swapops.h:327:22: warning:
      	section attribute is specified on redeclared variable [-Wsection]
        extern atomic_long_t num_poisoned_pages __read_mostly;
                             ^
        ./include/linux/mm.h:2585:22: note: previous declaration is here
        extern atomic_long_t num_poisoned_pages;
                           ^
      
      Let's use __read_mostly everywhere.
      
      Link: http://lkml.kernel.org/r/1519686565-8224-1-git-send-email-linux@roeck-us.net
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5844a486
    • device-dax: implement ->pagesize() for smaps to report MMUPageSize · c1d53b92
      Dan Williams authored
      
      
      Given that device-dax is making similar page mapping size guarantees as
      hugetlbfs, emit the size in smaps and any other kernel path that
      requests the mapping size of a vma.
      
      Link: http://lkml.kernel.org/r/151996255287.27922.18397777516059080245.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1d53b92
    • mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct · 05ea8860
      Dan Williams authored
      When device-dax is operating in huge-page mode we want it to behave like
      hugetlbfs and report the MMU page mapping size that is being enforced by
      the vma.
      
      Similar to commit 31383c68 ("mm, hugetlbfs: introduce ->split() to
      vm_operations_struct"), it would be messy to teach vma_mmu_pagesize()
      about device-dax page mapping sizes in the same (hstate) way that
      hugetlbfs communicates this attribute.  Instead, these patches introduce
      a new ->pagesize() vm operation.
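
      With the new operation, the generic lookup can become (sketch;
      hugetlbfs then simply wires up its own ->pagesize() handler):

          unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
          {
                  if (vma->vm_ops && vma->vm_ops->pagesize)
                          return vma->vm_ops->pagesize(vma);
                  return PAGE_SIZE;
          }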
      
      Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05ea8860
    • mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize() · 09135cc5
      Dan Williams authored
      Patch series "mm, smaps: MMUPageSize for device-dax", v3.
      
      Similar to commit 31383c68 ("mm, hugetlbfs: introduce ->split() to
      vm_operations_struct"), here is another occasion where we want
      special-case hugetlbfs/hstate enabling to also apply to device-dax.
      
      This prompts the question what other hstate conversions we might do
      beyond ->split() and ->pagesize(), but this appears to be the last of
      the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.
      
      This patch (of 3):
      
      The current powerpc definition of vma_mmu_pagesize() open codes looking
      up the page size via hstate.  It is identical to the generic
      vma_kernel_pagesize() implementation.
      
      Now, vma_kernel_pagesize() is growing support for determining the page
      size of Device-DAX vmas in addition to the existing Hugetlbfs page size
      determination.
      
      Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
      would automatically benefit from any new vma-type support that is added
      to vma_kernel_pagesize().  However, the powerpc vma_mmu_pagesize() is
      prevented from calling vma_kernel_pagesize() due to a circular header
      dependency that requires vma_mmu_pagesize() to be defined before
      including <linux/hugetlb.h>.
      
      Break this circular dependency by defining the default vma_mmu_pagesize()
      as a __weak symbol to be overridden by the powerpc version.
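
      That is, the generic side becomes something like (sketch):

          /* Overridden by the powerpc implementation. */
          __weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
          {
                  return vma_kernel_pagesize(vma);
          }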
      
      Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Jane Chu <jane.chu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09135cc5
    • mm/gup.c: fix coding style issues. · 2923117b
      Mario Leinweber authored
      
      
      - Fixed style error: 8 spaces -> 1 tab.
      - Fixed style warning: Corrected misleading indentation.
      
      Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
      Signed-off-by: Mario Leinweber <marioleinweber@web.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2923117b
    • mm/free_pcppages_bulk: prefetch buddy while not holding lock · 97334162
      Aaron Lu authored
      
      
      When a page is freed back to the global pool, its buddy will be checked
      to see if it's possible to do a merge.  This requires accessing buddy's
      page structure and that access could take a long time if it's cache
      cold.
      
      This patch adds a prefetch of the to-be-freed page's buddy outside of
      zone->lock, in the hope that accessing the buddy's page structure later
      under zone->lock will be faster.  Since we *always* do buddy merging and
      check an order-0 page's buddy to try to merge it when it goes into the
      main allocator, the cacheline will always come in, i.e.  the prefetched
      data will never be unused.
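
      A sketch of the prefetch helper this patch adds (order-0 buddy only, as
      described):

          static inline void prefetch_buddy(struct page *page)
          {
                  unsigned long pfn = page_to_pfn(page);
                  unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
                  struct page *buddy = page + (buddy_pfn - pfn);

                  /* Warm the buddy's struct page before zone->lock is taken. */
                  prefetch(buddy);
          }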
      
      Normally, the number of prefetches will be pcp->batch (default 31, with
      an upper limit of (PAGE_SHIFT * 8) = 96 on x86_64), but in the case
      where the pcp's pages get completely drained, it will be pcp->count,
      which has an upper limit of pcp->high.  pcp->high, although it has a
      default value of 186 (pcp->batch=31 * 6), can be changed by the user
      through /proc/sys/vm/percpu_pagelist_fraction, and there is no software
      upper limit so it could be large, like several thousand.  For this
      reason, only the first pcp->batch pages' buddy structures are prefetched
      to avoid excessive prefetching.
      
      In the meantime, there are two concerns:
      
       1. the prefetch could potentially evict existing cachelines, especially
          for L1D cache since it is not huge
      
       2. there is some additional instruction overhead, namely calculating
          buddy pfn twice
      
      For 1, it's hard to say; this microbenchmark shows a good result, but
      the actual benefit of this patch will be workload/CPU dependent;
      
      For 2, since the calculation is a XOR on two local variables, it's
      expected in many cases that cycles spent will be offset by reduced
      memory latency later.  This is especially true for NUMA machines where
      multiple CPUs are contending on zone->lock and the most time consuming
      part under zone->lock is the wait of 'struct page' cacheline of the
      to-be-freed pages and their buddies.
      
      Test with will-it-scale/page_fault1 full load:
      
        kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
        v4.16-rc2+  9034215        7971818       13667135       15677465
        patch2/3    9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
        this patch 10180856 +6.8%  8506369 +2.3% 14756865 +4.9% 17325324 +3.9%
      
      Note: this patch's performance improvement percent is against patch2/3.
      
      (Changelog stolen from Dave Hansen and Mel Gorman's comments at
      http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)
      
      [aaron.lu@intel.com: use helper function, avoid disordering pages]
        Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
        Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
      [aaron.lu@intel.com: v4]
        Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
        Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
      Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Suggested-by: Ying Huang <ying.huang@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97334162
    • mm/free_pcppages_bulk: do not hold lock when picking pages to free · 0a5f4e5b
      Aaron Lu authored
      
      
      When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
      the zone->lock is held and then pages are chosen from the PCP's
      migratetype list.  There is actually no need to do this 'choose' part
      under the lock: since these are PCP pages, the only CPU that can touch
      them is us, and irqs are also disabled.
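
      Roughly, the restructuring looks like this (heavily simplified sketch;
      the real code also rotates the migratetype lists and handles corner
      cases):

          struct page *page, *tmp;
          LIST_HEAD(head);

          /* Phase 1: pick pages off the CPU-local pcp lists; irqs are off,
           * so no zone->lock is needed here. */
          while (count--) {
                  page = list_last_entry(&pcp->lists[migratetype],
                                         struct page, lru);
                  list_del(&page->lru);
                  list_add_tail(&page->lru, &head);
          }

          /* Phase 2: only the actual freeing/merging needs zone->lock. */
          spin_lock(&zone->lock);
          list_for_each_entry_safe(page, tmp, &head, lru)
                  __free_one_page(page, page_to_pfn(page), zone, 0,
                                  get_pcppage_migratetype(page));
          spin_unlock(&zone->lock);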
      
      Moving this part outside could reduce lock held time and improve
      performance.  Test with will-it-scale/page_fault1 full load:
      
        kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
        v4.16-rc2+  9034215        7971818       13667135       15677465
        this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
      
      What the test does is: start $nr_cpu processes, each of which
      repeatedly does the following for 5 minutes:
      
       - mmap 128M of anonymous space
      
       - write access to that space
      
       - munmap.
      
      The score is the aggregated number of iterations.
      
      https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
      
      Link: http://lkml.kernel.org/r/20180301062845.26038-3-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a5f4e5b
    • mm/free_pcppages_bulk: update pcp->count inside · 77ba9062
      Aaron Lu authored
      
      
      Matthew Wilcox found that all callers of free_pcppages_bulk() currently
      update pcp->count immediately after so it's natural to do it inside
      free_pcppages_bulk().
      
      No functionality or performance change is expected from this patch.
      
      Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77ba9062
    • mm, compaction: drain pcps for zone when kcompactd fails · bc3106b2
      David Rientjes authored
      
      
      It's possible for free pages to become stranded on per-cpu pagesets
      (pcps) that, if drained, could be merged with buddy pages on the zone's
      free area to form large order pages, including up to MAX_ORDER.
      
      Consider a verbose example using the tools/vm/page-types tool at the
      beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
      a slab page).  Pages on pcps do not have any page flags set.
      
        109954  1       _______S________________________________________________________
        109955  2       __________B_____________________________________________________
        109957  1       ________________________________________________________________
        109958  1       __________B_____________________________________________________
        109959  7       ________________________________________________________________
        109960  1       __________B_____________________________________________________
        109961  9       ________________________________________________________________
        10996a  1       __________B_____________________________________________________
        10996b  3       ________________________________________________________________
        10996e  1       __________B_____________________________________________________
        10996f  1       ________________________________________________________________
        ...
        109f8c  1       __________B_____________________________________________________
        109f8d  2       ________________________________________________________________
        109f8f  2       __________B_____________________________________________________
        109f91  f       ________________________________________________________________
        109fa0  1       __________B_____________________________________________________
        109fa1  7       ________________________________________________________________
        109fa8  1       __________B_____________________________________________________
        109fa9  1       ________________________________________________________________
        109faa  1       __________B_____________________________________________________
        109fab  1       _______S________________________________________________________
      
      The compaction migration scanner is attempting to defragment this memory
      since it is at the beginning of the zone.  It has done so quite well,
      all movable pages have been migrated.  From pfn [0x109955, 0x109fab),
      there are only buddy pages and pages without flags set.
      
      These pages may be stranded on pcps that could otherwise allow this
      memory to be coalesced if freed back to the zone free area.  It is
      possible that some of these pages may not be on pcps and that something
      has called alloc_pages() and used the memory directly, but we rely on
      the absence of __GFP_MOVABLE in these cases to allocate from
      MIGRATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE
      pageblocks as free as possible.
      
      These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
      allow for three transparent hugepages to be dynamically allocated.
      Running the numbers for all such spans on the system, it was found that
      there were over 400 such spans of only buddy pages and pages without
      flags set at the time this /proc/kpageflags sample was collected.
      Without this support, there were _no_ order-9 or order-10 pages free.
      
      When kcompactd fails to defragment memory such that a cc.order page can
      be allocated, drain all pcps for the zone back to the buddy allocator so
      this stranding cannot occur.  Compaction for that order will
      subsequently be deferred, which acts as a ratelimit on this drain.
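
      A sketch of where the drain lands (inside kcompactd's per-zone loop,
      simplified):

          status = compact_zone(zone, &cc);

          if (status == COMPACT_SUCCESS) {
                  compaction_defer_reset(zone, cc.order, false);
          } else if (status == COMPACT_PARTIAL_SKIPPED ||
                     status == COMPACT_COMPLETE) {
                  /*
                   * Free pages may be stranded on pcps; give them back to
                   * the buddy allocator so they can coalesce before we
                   * defer further compaction attempts.
                   */
                  drain_all_pages(zone);
                  defer_compaction(zone, cc.order);
          }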
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc3106b2
    • mm: make should_failslab always available for fault injection · 4f6923fb
      Howard McLauchlan authored
      should_failslab() is a convenient function to hook into for directed
      error injection into kmalloc().  However, it is only available if a
      config flag is set.
      
      The following BCC script, for example, fails kmalloc() calls after a
      btrfs umount:
      
          from bcc import BPF
      
          prog = r"""
          BPF_HASH(flag);
      
          #include <linux/mm.h>
      
          int kprobe__btrfs_close_devices(void *ctx) {
                  u64 key = 1;
                  flag.update(&key, &key);
                  return 0;
          }
      
          int kprobe__should_failslab(struct pt_regs *ctx) {
                  u64 key = 1;
                  u64 *res;
                  res = flag.lookup(&key);
                  if (res != 0) {
                      bpf_override_return(ctx, -ENOMEM);
                  }
                  return 0;
          }
          """
          b = BPF(text=prog)
      
          while 1:
              b.kprobe_poll()
      
      This patch refactors the should_failslab implementation so that the
      function is always available for error injection, independent of flags.
      
      This change woul...
      4f6923fb
    • mm/page_poison.c: make early_page_poison_param() __init · 14298d36
      Dou Liyang authored
      
      
      Functions registered with early_param() are only called during kernel
      initialization, so Linux marks them with the __init macro to save
      memory.
      
      But early_page_poison_param() was missed.  So, make it __init as well.
      
      Link: http://lkml.kernel.org/r/20180117034757.27024-1-douly.fnst@cn.fujitsu.com
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14298d36
    • mm/page_owner.c: make early_page_owner_param() __init · 1173194e
      Dou Liyang authored
      
      
      Functions registered with early_param() are only called during kernel
      initialization, so Linux marks them with the __init macro to save
      memory.
      
      But early_page_owner_param() was missed.  So, make it __init as well.
      
      Link: http://lkml.kernel.org/r/20180117034736.26963-1-douly.fnst@cn.fujitsu.com
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1173194e
    • mm/kmemleak.c: make kmemleak_boot_config() __init · 8bd30c10
      Dou Liyang authored
      
      
      Functions registered with early_param() are only called during kernel
      initialization, so Linux marks them with the __init macro to save
      memory.
      
      But kmemleak_boot_config() was missed.  So, make it __init as well.
      
      Link: http://lkml.kernel.org/r/20180117034720.26897-1-douly.fnst@cn.fujitsu.com
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bd30c10
    • mm: swap: unify cluster-based and vma-based swap readahead · e9e9b7ec
      Minchan Kim authored
      
      
      This patch makes do_swap_page() not need to be aware of the two
      different swap readahead algorithms.  It just calls one unified
      cluster-based/vma-based readahead function.
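
      After the unification, the fault path only sees one entry point; a
      sketch of the dispatch (function names as used by this series):

          struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                        struct vm_fault *vmf)
          {
                  /* Pick the algorithm here instead of in do_swap_page(). */
                  return swap_use_vma_readahead() ?
                          swap_vma_readahead(entry, gfp_mask, vmf) :
                          swap_cluster_readahead(entry, gfp_mask, vmf);
          }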
      
      Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9e9b7ec
    • mm: swap: clean up swap readahead · eaf649eb
      Minchan Kim authored
      
      
      Looking at the recent swap readahead changes, I am very unhappy about
      the current code structure, which diverges into two swap readahead
      algorithms in do_swap_page().  This patch cleans it up.
      
      The main motivation is that the fault handler doesn't need to be aware
      of the readahead algorithms; it should just call swapin_readahead().
      
      As a first step, this patch cleans up a little bit but is not perfect (I
      just split it out to make review easier), so the next patch will
      complete the goal.
      
      [minchan@kernel.org: do not check readahead flag with THP anon]
        Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
        Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eaf649eb
    • mm,vmscan: don't pretend forward progress upon shrinker_rwsem contention · e830c63a
      Tetsuo Handa authored
      
      
      Since we no longer use the return value of shrink_slab() for normal
      reclaim, the comment is no longer true.  If some do_shrink_slab() call
      takes unexpectedly long (the root cause of the stall is currently
      unknown) while register_shrinker()/unregister_shrinker() is pending,
      trying to drop caches via /proc/sys/vm/drop_caches could become an
      infinite cond_resched() loop if many mem_cgroups are defined.  For
      safety, let's not pretend forward progress.
      
      Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e830c63a
    • z3fold: limit use of stale list for allocation · 5c9bab59
      Vitaly Wool authored
      
      
      Currently if z3fold couldn't find an unbuddied page it would first try
      to pull a page off the stale list.  The problem with this approach is
      that we can't 100% guarantee that the page is not processed by the
      workqueue thread at the same time unless we run cancel_work_sync() on
      it, which we can't do if we're in an atomic context.  So let's just
      limit stale list usage to non-atomic contexts only.
      
      Link: http://lkml.kernel.org/r/47ab51e7-e9c1-d30e-ab17-f734dbc3abce@gmail.com
      Signed-off-by: Vitaly Vul <vitaly.vul@sony.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: <Oleksiy.Avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c9bab59
    • mm/huge_memory.c: reorder operations in __split_huge_page_tail() · 605ca5ed
      Konstantin Khlebnikov authored
      THP split makes non-atomic change of tail page flags.  This is almost ok
      because tail pages are locked and isolated but this breaks recent
      changes in page locking: non-atomic operation could clear bit
      PG_waiters.
      
      As a result concurrent sequence get_page_unless_zero() -> lock_page()
      might block forever.  Especially if this page was truncated later.
      
      Fix is trivial: clone flags before unfreezing page reference counter.
      
      This race exists since commit 62906027 ("mm: add PageWaiters indicating
      tasks are waiting for a page bit"), while the unsafe unfreeze itself was
      added in commit 8df651c7 ("thp: cleanup split_huge_page()").
      
      clear_compound_head() also must be called before unfreezing page
      reference because after successful get_page_unless_zero() might follow
      put_page() which needs correct compound_head().
      
      And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
      is made especially for that and has semantic of smp_store_release().
      
      Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      605ca5ed
    • mm/page_ref: use atomic_set_release in page_ref_unfreeze · 03f5d58f
      Konstantin Khlebnikov authored
      
      
      page_ref_unfreeze() has exactly that semantic.  No functional changes:
      just minus one barrier and proper handling of PPro errata.
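
      The resulting helper, roughly (sketch based on the description above):

          static inline void page_ref_unfreeze(struct page *page, int count)
          {
                  VM_BUG_ON_PAGE(page_count(page) != 0, page);
                  VM_BUG_ON(count == 0);

                  /* Release semantics replace the old smp_mb() + atomic_set(). */
                  atomic_set_release(&page->_refcount, count);
          }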
      
      Link: http://lkml.kernel.org/r/151844393004.210639.4672319312617954272.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03f5d58f
    • mm: fix races between address_space dereference and free in page_evicatable · e92bb4dd
      Huang Ying authored
      
      
      When page_mapping() is called and the mapping is dereferenced in
      page_evictable() through shrink_active_list(), it is possible for the
      inode to be truncated and the embedded address space to be freed at the
      same time.  This may lead to the following race.
      
      CPU1                                                CPU2
      
      truncate(inode)                                     shrink_active_list()
        ...                                                 page_evictable(page)
        truncate_inode_page(mapping, page);
          delete_from_page_cache(page)
            spin_lock_irqsave(&mapping->tree_lock, flags);
              __delete_from_page_cache(page, NULL)
                page_cache_tree_delete(..)
                  ...                                         mapping = page_mapping(page);
                  page->mapping = NULL;
                  ...
            spin_unlock_irqrestore(&mapping->tree_lock, flags);
            page_cache_free_page(mapping, page)
              put_page(page)
                if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded address_space
      
                                                              mapping_unevictable(mapping)
      							  test_bit(AS_UNEVICTABLE, &mapping->flags);
      - we've dereferenced mapping which is potentially already free.
      
      A similar race exists between swap cache freeing and page_evictable()
      too.
      
      The address_space in the inode and swap cache will be freed after an RCU
      grace period.  So the races are fixed by enclosing the page_mapping()
      and address_space usage in rcu_read_lock/unlock().  Some comments are
      added in the code to make it clear what is protected by the RCU read
      lock.
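
      For instance, page_evictable() ends up looking roughly like this:

          int page_evictable(struct page *page)
          {
                  int ret;

                  /* Prevent address_space of inode and swap cache from being freed */
                  rcu_read_lock();
                  ret = !mapping_unevictable(page_mapping(page)) &&
                        !PageMlocked(page);
                  rcu_read_unlock();
                  return ret;
          }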
      
      Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e92bb4dd
    • Andy Shevchenko's avatar
      mm: reuse DEFINE_SHOW_ATTRIBUTE() macro · 5ad35093
      Andy Shevchenko authored
      
      
      ...instead of open coding file operations followed by custom ->open()
      callbacks per each attribute.
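
      For reference, the pattern looks like this (hypothetical foo_stats
      attribute):

          static int foo_stats_show(struct seq_file *m, void *v)
          {
                  seq_puts(m, "example\n");
                  return 0;
          }
          /* Generates foo_stats_open() and foo_stats_fops. */
          DEFINE_SHOW_ATTRIBUTE(foo_stats);

          /* ... debugfs_create_file("foo_stats", 0444, root, NULL,
           *                         &foo_stats_fops); */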
      
      [andriy.shevchenko@linux.intel.com: add tags, fix compilation issue]
        Link: http://lkml.kernel.org/r/20180217144253.58604-1-andriy.shevchenko@linux.intel.com
      Link: http://lkml.kernel.org/r/20180214154644.54505-1-andriy.shevchenko@linux.intel.com
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ad35093
    • mm, page_alloc: move mirrored_kernelcore to __meminitdata · 7f16f91f
      David Rientjes authored
      
      
      mirrored_kernelcore can be in __meminitdata, so move it there.
      
      At the same time, fixup section specifiers to be after the name of the
      variable per checkpatch.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121623280.179479@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f16f91f
    • mm, page_alloc: extend kernelcore and movablecore for percent · a5c6d650
      David Rientjes authored
      
      
      Both kernelcore= and movablecore= can be used to define the amount of
      ZONE_NORMAL and ZONE_MOVABLE on a system, respectively.  This requires
      the system memory capacity to be known when specifying the command line,
      however.
      
      This introduces the ability to define both kernelcore= and movablecore=
      as a percentage of total system memory.  This is convenient for systems
      software that wants to define the amount of ZONE_MOVABLE, for example,
      as a proportion of a system's memory rather than a hardcoded byte value.
      
      To define the percentage, the final character of the parameter should be
      a '%'.
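
      For example (hypothetical values), instead of a machine-specific byte
      value such as:

          kernelcore=16G

      systems software can now pass:

          kernelcore=12%

      and the kernel computes the corresponding amount from the total system
      memory.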
      
      mhocko: "why is anyone using these options nowadays?"
      
      rientjes:
      :
      : Fragmentation of non-__GFP_MOVABLE pages due to low on memory
      : situations can pollute most pageblocks on the system, as much as 1GB of
      : slab being fragmented over 128GB of memory, for example.  When the
      : amount of kernel memory is well bounded for certain systems, it is
      : better to aggressively reclaim from existing MIGRATE_UNMOVABLE
      : pageblocks rather than eagerly fallback to others.
      :
      : We have additional patches that help with this fragmentation if you're
      : interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
      : pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
      : draining of pcp lists back to the zone free area to prevent stranding.
      
      [rientjes@google.com: updates]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5c6d650
    • mm: hwpoison: disable memory error handling on 1GB hugepage · 31286a84
      Naoya Horiguchi authored
      
      
      Recently the following BUG was reported:
      
          Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
          Memory failure: 0x3c0000: recovery action for huge page: Recovered
          BUG: unable to handle kernel paging request at ffff8dfcc0003000
          IP: gup_pgd_range+0x1f0/0xc20
          PGD 17ae72067 P4D 17ae72067 PUD 0
          Oops: 0000 [#1] SMP PTI
          ...
          CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
      
      You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
      a 1GB hugepage.  This happens because get_user_pages_fast() is not aware
      of a migration entry on pud that was created in the 1st madvise() event.
      
      I think that the conversion to a pud-aligned migration entry is working,
      but other MM code walking over page tables isn't prepared for it.  We
      need some time and effort to make all this work properly, so this patch
      avoids the reported bug by just disabling error handling for 1GB
      hugepages.
      
      [n-horiguchi@ah.jp.nec.com: v2]
        Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31286a84
    • mm/memory_hotplug: optimize memory hotplug · d0dc12e8
      Pavel Tatashin authored
      
      
      During memory hotplugging we traverse struct pages three times:
      
      1. memset(0) in sparse_add_one_section()
      2. loop in __add_section() to do set_page_node(page, nid); and
         SetPageReserved(page);
      3. loop in memmap_init_zone() to call __init_single_pfn()
      
      This patch removes the first two loops, and leaves only loop 3.  All
      struct pages are initialized in one place, the same as it is done during
      boot.
      
      The benefits:
      
       - We improve memory hotplug performance because we are not evicting the
         cache several times and also reduce loop branching overhead.
      
       - Remove the condition from the hotpath in __init_single_pfn() that was
         added in order to fix the problem reported by Bharata in the email
         thread above, thus also improving performance during normal boot.
      
       - Make memory hotplug more similar to the boot memory initialization
         path because we zero and initialize struct pages only in one
         function.
      
       - Simplifies memory hotplug struct page initialization code, and thus
         enables future improvements, such as multi-threading the
         initialization of struct pages in order to improve hotplug
         performance even further on larger machines.
      
      [pasha.tatashin@oracle.com: v5]
        Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0dc12e8
    • mm/memory_hotplug: don't read nid from struct page during hotplug · fc44f7f9
      Pavel Tatashin authored
      
      
      During memory hotplugging the probe routine will leave struct pages
      uninitialized, the same as it is currently done during boot.  Therefore,
      we do not want to access the inside of struct pages before
      __init_single_page() is called during onlining.
      
      Because during hotplug we know that pages in one memory block belong to
      the same numa node, we can skip the checking.  We should keep checking
      for the boot case.
      
      [pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()]
        Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc44f7f9
    • Pavel Tatashin's avatar
      mm/memory_hotplug: optimize probe routine · b77eab70
      Pavel Tatashin authored
      
      
      When memory is hotplugged pages_correctly_reserved() is called to verify
      that the added memory is present, this routine traverses through every
      struct page and verifies that PageReserved() is set.  This is a slow
      operation especially if a large amount of memory is added.
      
      Instead of checking every page, it is enough to simply check that the
      section is present, has mapping (struct page array is allocated), and
      the mapping is online.
      
      In addition, we should not expect the probe routine to set flags in
      struct page, as the struct pages have not yet been initialized.  The
      initialization should be done in __init_single_page(), the same as
      during boot.
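
      As a rough user-space sketch of that check (the section flags and
      block_correctly_probed() below are simplified stand-ins, not the
      kernel's data structures):

         #include <stdbool.h>
         #include <stdio.h>

         #define SECTION_PRESENT  0x1u   /* section was added               */
         #define SECTION_HAS_MAP  0x2u   /* struct page array is allocated  */
         #define SECTION_ONLINE   0x4u   /* mapping is online               */

         struct fake_section { unsigned int flags; };

         /* Check per-section state only: per-page flags are not initialized
          * at probe time, so walking every struct page proves nothing. */
         static bool block_correctly_probed(const struct fake_section *secs, int n)
         {
                 const unsigned int want =
                         SECTION_PRESENT | SECTION_HAS_MAP | SECTION_ONLINE;
                 int i;

                 for (i = 0; i < n; i++)
                         if ((secs[i].flags & want) != want)
                                 return false;
                 return true;
         }

         int main(void)
         {
                 struct fake_section block[2] = {
                         { SECTION_PRESENT | SECTION_HAS_MAP | SECTION_ONLINE },
                         { SECTION_PRESENT | SECTION_HAS_MAP | SECTION_ONLINE },
                 };

                 printf("probed ok: %d\n", block_correctly_probed(block, 2));
                 return 0;
         }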
      
      Link: http://lkml.kernel.org/r/20180215165920.8570-5-pasha.tatashin@oracle.com
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b77eab70
    • Pavel Tatashin's avatar
      mm: uninitialized struct page poisoning sanity checking · f165b378
      Pavel Tatashin authored
      
      
      During boot we poison struct page memory in order to ensure that no one
      is accessing this memory until the struct pages are initialized in
      __init_single_page().
      
      This patch adds more scrutiny to this checking by making sure that flags
      do not equal the poison pattern when they are accessed.  The pattern is
      all ones.
      
      Since the node id is also stored in struct page and may be accessed
      quite early, we add this enforcement to the page_to_nid() function as
      well.  Note that this applies only when NODE_NOT_IN_PAGE_FLAGS=n.
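
      A minimal user-space sketch of the check, assuming an all-ones poison
      pattern as described above; fake_page and page_is_poisoned() are
      illustrative stand-ins, not the kernel implementation:

         #include <assert.h>
         #include <string.h>

         #define POISON_PATTERN (~0UL)              /* all ones, as above */

         struct fake_page { unsigned long flags; }; /* stand-in for struct page */

         static int page_is_poisoned(const struct fake_page *p)
         {
                 return p->flags == POISON_PATTERN;
         }

         int main(void)
         {
                 struct fake_page page;

                 memset(&page, 0xff, sizeof(page)); /* boot-time poisoning */
                 assert(page_is_poisoned(&page));   /* flags must not be read
                                                     * before initialization */

                 page.flags = 0;                    /* what init would do */
                 assert(!page_is_poisoned(&page));
                 return 0;
         }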
      
      [pasha.tatashin@oracle.com: v4]
        Link: http://lkml.kernel.org/r/20180215165920.8570-4-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180213193159.14606-4-pasha.tatashin@oracle.com
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarIngo Molnar <mingo@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f165b378
    • Pavel Tatashin's avatar
      x86/mm/memory_hotplug: determine block size based on the end of boot memory · 078eb6aa
      Pavel Tatashin authored
      
      
      Memory sections are combined into "memory block" chunks.  These chunks
      are the units upon which memory can be added and removed.
      
      On x86, new memory may be added after the end of the boot memory;
      therefore, if the block size does not align with the end of boot
      memory, memory hot-plugging/hot-removing can break.
      
      Currently, whenever a machine is booted with more than 64G, the block
      size is unconditionally increased from the base 128M to 2G.  This is
      done in order to reduce the number of memory device files in sysfs:
      
      	/sys/devices/system/memory/memoryXXX
      
      We must use the largest allowed block size that aligns to the next
      address to be able to hotplug the next block of memory.
      
      So, when memory is larger than or equal to 64G, we check the end
      address and find the largest block size that is still a power of two
      but smaller than or equal to 2G.
      
      Before the fix:
      Run qemu with:
      -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
      
      (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
      							size 0x80000000
      acpi PNP0C80:00: add_memory failed
      acpi PNP0C80:00: acpi_memory_enable_device() error
      acpi PNP0C80:00: Enumeration failure
      
      With the fix, memory is added successfully because the block size is
      set to 1G and therefore aligns with the start address 0x1040000000.
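
      A minimal user-space sketch of this selection logic (pick_block_size()
      is an illustrative stand-in, not the kernel function); for the 65G end
      address above it yields a 1G block size:

         #include <stdint.h>
         #include <stdio.h>

         #define MIN_BLOCK_SIZE (128ULL << 20)  /* 128M base                   */
         #define MAX_BLOCK_SIZE (2ULL << 30)    /* 2G cap for >= 64G machines  */

         /* Walk down from 2G; the first power of two that divides the end
          * of boot memory keeps future hotplug ranges aligned. */
         static uint64_t pick_block_size(uint64_t boot_mem_end)
         {
                 uint64_t bz;

                 for (bz = MAX_BLOCK_SIZE; bz > MIN_BLOCK_SIZE; bz >>= 1)
                         if ((boot_mem_end & (bz - 1)) == 0)
                                 break;
                 return bz;
         }

         int main(void)
         {
                 uint64_t end = 0x1040000000ULL;  /* 65G, as in the example */

                 printf("block size: %lluM\n",
                        (unsigned long long)(pick_block_size(end) >> 20));
                 return 0;                        /* prints "block size: 1024M" */
         }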
      
      [pasha.tatashin@oracle.com: v4]
        Link: http://lkml.kernel.org/r/20180215165920.8570-3-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180213193159.14606-3-pasha.tatashin@oracle.com
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      078eb6aa
    • Pavel Tatashin's avatar
      mm/memory_hotplug: enforce block size aligned range check · ba325585
      Pavel Tatashin authored
      
      
      Patch series "optimize memory hotplug", v3.
      
      This patchset:
      
       - Improves hotplug performance by eliminating a number of struct page
         traverses during memory hotplug.
      
       - Fixes some issues with hotplugging where boundaries were not
         properly checked, and where, on x86, the block size was not
         properly aligned with the end of memory.
      
       - Also potentially improves boot performance by eliminating a
         condition from __init_single_page().

       - Adds robustness by verifying that struct pages are correctly
         poisoned when flags are accessed.
      
      The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
      2.60GHz with 1T RAM:
      
      Booting in qemu with 960G of memory, time to initialize struct pages:
      
      no-kvm:
      	TRY1		TRY2
      BEFORE:	39.433668	39.39705
      AFTER:	36.903781	36.989329
      
      with-kvm:
      BEFORE:	10.977447	11.103164
      AFTER:	10.929072	10.751885
      
      Hotplug 896G memory:
      no-kvm:
      	TRY1		TRY2
      BEFORE: 848.740000	846.910000
      AFTER:  783.070000	786.560000
      
      with-kvm:
      	TRY1		TRY2
      BEFORE: 34.410000	33.57
      AFTER:	29.810000	29.580000
      
      This patch (of 6):
      
      Start qemu with the following arguments:
      
        -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
      
      This boots the machine with 64G and adds a 2G device, mem1, which can
      be hotplugged later.
      
      Also make sure that config has the following turned on:
        CONFIG_MEMORY_HOTPLUG
        CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
        CONFIG_ACPI_HOTPLUG_MEMORY
      
      Using the qemu monitor, hotplug the memory:

         (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      
      The operation will fail with the following trace:
      
          WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
          pages_correctly_reserved+0xe6/0x110
          Modules linked in:
          CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
          BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
          RIP: 0010:pages_correctly_reserved+0xe6/0x110
          Call Trace:
           memory_subsys_online+0x44/0xa0
           device_online+0x51/0x80
           store_mem_state+0x5e/0xe0
           kernfs_fop_write+0xfa/0x170
           __vfs_write+0x2e/0x150
           vfs_write+0xa8/0x1a0
           SyS_write+0x4d/0xb0
           do_syscall_64+0x5d/0x110
           entry_SYSCALL_64_after_hwframe+0x21/0x86
          ---[ end trace 6203bc4f1a5d30e8 ]---
      
      The problem is detected in: drivers/base/memory.c
      
         static bool pages_correctly_reserved(unsigned long start_pfn)
         205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))
      
      This function loops through every section in the newly added memory
      block and verifies that the first pfn is valid, meaning the section
      exists, has a mapping (a struct page array), and is online.
      
      The block size on x86 is usually 128M, but when the machine is booted
      with more than 64G of memory, the block size is changed to 2G:

         $ cat /sys/devices/system/memory/block_size_bytes
         80000000
      
      or
      
         $ dmesg | grep "block size"
         [    0.086469] x86/mm: Memory block size: 2048MB
      
      During memory hotplug and hotremove we verify that the range is
      section size aligned, but we actually must verify that it is block
      size aligned, because that is the proper unit for hotplug operations.
      See Documentation/memory-hotplug.txt.
      
      So, when the start_pfn of newly added memory is not block size
      aligned, we can get a memory block in which only some of the sections
      are properly populated.
      
      In our case the start_pfn starts from the last_pfn (end of physical
      memory).
      
         $ dmesg | grep last_pfn
         [    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
      
      pfn 0x1040000 corresponds to address 0x1040000000 (65G), which is not
      2G aligned!
      
      The fix is to enforce that hotplugged and hotremoved memory ranges are
      block size aligned.
      
      With this fix, running the above sequence yields the following result:
      
         (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
         Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
         							size 0x80000000
         acpi PNP0C80:00: add_memory failed
         acpi PNP0C80:00: acpi_memory_enable_device() error
         acpi PNP0C80:00: Enumeration failure
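
      The enforced check amounts to something like the following user-space
      sketch (hotplug_range_aligned() is an illustrative stand-in, not the
      kernel code); for the range above it reports the same misalignment:

         #include <stdbool.h>
         #include <stdint.h>
         #include <stdio.h>

         static bool hotplug_range_aligned(uint64_t start, uint64_t size,
                                           uint64_t block_sz)
         {
                 return size != 0 &&
                        (start & (block_sz - 1)) == 0 &&
                        (size  & (block_sz - 1)) == 0;
         }

         int main(void)
         {
                 uint64_t block_sz = 2ULL << 30;      /* 2G block size          */
                 uint64_t start    = 0x1040000000ULL; /* 65G, as in the example */
                 uint64_t size     = 0x80000000ULL;   /* 2G DIMM                */

                 if (!hotplug_range_aligned(start, size, block_sz))
                         printf("Block size [%#llx] unaligned hotplug range: "
                                "start %#llx, size %#llx\n",
                                (unsigned long long)block_sz,
                                (unsigned long long)start,
                                (unsigned long long)size);
                 return 0;
         }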
      
      Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarIngo Molnar <mingo@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba325585
    • Yang Shi's avatar
      mm: thp: fix potential clearing to referenced flag in page_idle_clear_pte_refs_one() · f0849ac0
      Yang Shi authored
      
      
      For a PTE-mapped THP, the compound THP has not been split into normal
      4K pages yet, so the whole THP is considered referenced if any one of
      its sub-pages is referenced.

      When walking a PTE-mapped THP via pvmw, all relevant PTEs are checked
      to retrieve the referenced bit.  But the current code just returns the
      result for the last PTE.  If the last PTE has not been referenced, the
      referenced flag will be cleared.
      
      Just set referenced when ptep{pmdp}_clear_young_notify() returns true.
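
      The idea, as a tiny user-space sketch (thp_referenced() and the
      boolean array are stand-ins for the pvmw walk and
      ptep{pmdp}_clear_young_notify(), not the kernel code):

         #include <stdbool.h>
         #include <stddef.h>
         #include <stdio.h>

         /* Accumulate the young bit across all mapped PTEs of the THP:
          * set "referenced" once seen, never overwrite it later. */
         static bool thp_referenced(const bool *pte_young, size_t nr_ptes)
         {
                 bool referenced = false;
                 size_t i;

                 for (i = 0; i < nr_ptes; i++)
                         if (pte_young[i])
                                 referenced = true;
                 return referenced;
         }

         int main(void)
         {
                 bool ptes[] = { false, true, false }; /* last PTE not young */

                 /* Prints 1: one referenced sub-page marks the whole THP. */
                 printf("referenced=%d\n", thp_referenced(ptes, 3));
                 return 0;
         }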
      
      Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Reported-by: default avatarGang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0849ac0
    • Pavel Tatashin's avatar
      mm: initialize pages on demand during boot · c9e97a19
      Pavel Tatashin authored
      
      
      Deferred page initialization allows the boot cpu to initialize a small
      subset of the system's pages early in boot, with other cpus doing the
      rest later on.
      
      It is, however, problematic to know how many pages the kernel needs
      during boot.  Different modules and kernel parameters may change the
      requirement, so the boot cpu either initializes too many pages or runs
      out of memory.
      
      To fix that, initialize early pages on demand.  This ensures the kernel
      does the minimum amount of work to initialize pages during boot and
      leaves the rest to be divided in the multithreaded initialization path
      (deferred_init_memmap).
      
      The on-demand code is permanently disabled using static branching once
      deferred pages are initialized.  After the static branch is changed to
      false, the overhead is up to two branch-always instructions if the zone
      watermark check fails or if rmqueue fails.
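
      A toy user-space sketch of the on-demand scheme (the counters and the
      deferred_grow_zone() body here are illustrative stand-ins, not the
      kernel implementation):

         #include <stdbool.h>
         #include <stdio.h>

         static bool deferred_pages = true;       /* analogue of the static branch */
         static unsigned long initialized = 1024; /* pages with struct page set up */
         static const unsigned long total  = 8192;
         static const unsigned long chunk  = 1024;

         /* Initialize just enough extra chunks to satisfy the request. */
         static bool deferred_grow_zone(unsigned long needed)
         {
                 if (!deferred_pages || initialized >= total)
                         return false;
                 while (initialized < total && initialized < needed)
                         initialized += chunk;
                 return true;
         }

         static bool watermark_ok(unsigned long needed)
         {
                 if (needed <= initialized)
                         return true;
                 return deferred_grow_zone(needed); /* grow instead of failing */
         }

         int main(void)
         {
                 printf("%d\n", watermark_ok(3000)); /* grows to 3072 -> 1 */

                 deferred_pages = false;             /* deferred init done  */
                 printf("%d\n", watermark_ok(4096)); /* no growth now -> 0  */
                 return 0;
         }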
      
      Sergey Senozhatsky noticed that while deferred pages currently make
      sense only on NUMA machines (we start one thread per latency node),
      CONFIG_NUMA is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT,
      so that must also be addressed in this patch.
      
      [akpm@linux-foundation.org: fix typo in comment, make deferred_pages static]
      [pasha.tatashin@oracle.com: fix min() type mismatch warning]
        Link: http://lkml.kernel.org/r/20180212164543.26592-1-pasha.tatashin@oracle.com
      [pasha.tatashin@oracle.com: use zone_to_nid() in deferred_grow_zone()]
        Link: http://lkml.kernel.org/r/20180214163343.21234-2-pasha.tatashin@oracle.com
      [pasha.tatashin@oracle.com: might_sleep warning]
        Link: http://lkml.kernel.org/r/20180306192022.28289-1-pasha.tatashin@oracle.com
      [akpm@linux-foundation.org: s/spin_lock/spin_lock_irq/ in page_alloc_init_late()]
      [pasha.tatashin@oracle.com: v5]
        Link: http://lkml.kernel.org/r/20180309220807.24961-3-pasha.tatashin@oracle.com
      [akpm@linux-foundation.org: tweak comments]
      [pasha.tatashin@oracle.com: v6]
        Link: http://lkml.kernel.org/r/20180313182355.17669-3-pasha.tatashin@oracle.com
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20180209192216.20509-2-pasha.tatashin@oracle.com
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Tested-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9e97a19