Skip to content
  1. Jul 11, 2017
    • zhenwei.pi's avatar
      mm/balloon_compaction.c: enqueue zero page to balloon device · bb01b64c
      zhenwei.pi authored
      
      
      presently pages in the balloon device have random value, and these pages
      will be scanned by ksmd on the host.  They usually cannot be merged.
      Enqueue zero pages will resolve this problem.
      
      Link: http://lkml.kernel.org/r/1498698637-26389-1-git-send-email-zhenwei.pi@youruncloud.com
      Signed-off-by: default avatarzhenwei.pi <zhenwei.pi@youruncloud.com>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb01b64c
    • Doug Berger's avatar
      cma: fix calculation of aligned offset · e048cb32
      Doug Berger authored
      The align_offset parameter is used by bitmap_find_next_zero_area_off()
      to represent the offset of map's base from the previous alignment
      boundary; the function ensures that the returned index, plus the
      align_offset, honors the specified align_mask.
      
      The logic introduced by commit b5be83e3 ("mm: cma: align to physical
      address, not CMA region position") has the cma driver calculate the
      offset to the *next* alignment boundary.  In most cases, the base
      alignment is greater than that specified when making allocations,
      resulting in a zero offset whether we align up or down.  In the example
      given with the commit, the base alignment (8MB) was half the requested
      alignment (16MB) so the math also happened to work since the offset is
      8MB in both directions.  However, when requesting allocations with an
      alignment greater than twice that of the base, the returned index would
      not be correctly aligned.
      
      Also, the align_order arguments of cma_bitmap_aligned_mask() and
      cma_bitmap_aligned_offset() should not be negative so the argument type
      was made unsigned.
      
      Fixes: b5be83e3
      
       ("mm: cma: align to physical address, not CMA region position")
      Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com
      Signed-off-by: default avatarAngus Clark <angus@angusclark.org>
      Signed-off-by: default avatarDoug Berger <opendmb@gmail.com>
      Acked-by: default avatarGregory Fong <gregory.0xf0@gmail.com>
      Cc: Doug Berger <opendmb@gmail.com>
      Cc: Angus Clark <angus@angusclark.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Shiraz Hashim <shashim@codeaurora.org>
      Cc: Jaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e048cb32
    • John Hubbard's avatar
      mm/memory_hotplug.c: remove unused local zone_type from __remove_zone() · a52149f1
      John Hubbard authored
      
      
      __remove_zone() sets up up zone_type, but never uses it for anything.
      This does not cause a warning, due to the (necessary) use of
      -Wno-unused-but-set-variable.  However, it's noise, so just delete it.
      
      Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a52149f1
    • Michal Hocko's avatar
      mm: document highmem_is_dirtyable sysctl · d09b6468
      Michal Hocko authored
      
      
      It seems that there are still people using 32b kernels which a lot of
      memory and the IO tend to suck a lot for them by default.  Mostly
      because writers are throttled too when the lowmem is used.  We have
      highmem_is_dirtyable to work around that issue but it seems we never
      bothered to document it.  Let's do it now, finally.
      
      Link: http://lkml.kernel.org/r/20170626093200.18958-1-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Alkis Georgopoulos <alkisg@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d09b6468
    • Nikolay Borisov's avatar
      include/linux/backing-dev.h: simplify wb_stat_sum · e3d3910a
      Nikolay Borisov authored
      
      
      wb_stat_sum() disables interrupts and calls __wb_stat_sum() which
      eventually calls __percpu_counter_sum().  However, the percpu routine is
      already irq-safe.  Simplify the code a bit by making wb_stat_sum()
      directly call percpu_counter_sum_positive() and not disable interrupts.
      
      Also remove the now-uneeded __wb_stat_sum() which was just a wrapper
      over percpu_counter_sum_positive().
      
      Link: http://lkml.kernel.org/r/1498230681-29103-1-git-send-email-nborisov@suse.com
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3d3910a
    • Nikolay Borisov's avatar
      include/linux/mmzone.h: remove ancient/ambiguous comment · 618b8c20
      Nikolay Borisov authored
      
      
      Currently pg_data_t is just a struct which describes a NUMA node memory
      layout.  Let's keep the comment simple and remove ambiguity.
      
      Link: http://lkml.kernel.org/r/1498220534-22717-1-git-send-email-nborisov@suse.com
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      618b8c20
    • Sebastian Andrzej Siewior's avatar
      mm/swap_slots.c: don't disable preemption while taking the per-CPU cache · f07e0f84
      Sebastian Andrzej Siewior authored
      
      
      get_cpu_var() disables preemption and returns the per-CPU version of the
      variable.  Disabling preemption is useful to ensure atomic access to the
      variable within the critical section.
      
      In this case however, after the per-CPU version of the variable is
      obtained the ->free_lock is acquired.  For that reason it seems the raw
      accessor could be used.  It only seems that ->slots_ret should be
      retested (because with disabled preemption this variable can not be set
      to NULL otherwise).
      
      This popped up during PREEMPT-RT testing because it tries to take
      spinlocks in a preempt disabled section.  In RT, spinlocks can sleep.
      
      Link: http://lkml.kernel.org/r/20170623114755.2ebxdysacvgxzott@linutronix.de
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd...
      f07e0f84
    • Rasmus Villemoes's avatar
      mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallback · b002529d
      Rasmus Villemoes authored
      
      
      Since current_order starts as MAX_ORDER-1 and is then only decremented,
      the second half of the loop condition seems superfluous.  However, if
      order is 0, we may decrement current_order past 0, making it UINT_MAX.
      This is obviously too subtle ([1], [2]).
      
      Since we need to add some comment anyway, change the two variables to
      signed, making the counting-down for loop look more familiar, and
      apparently also making gcc generate slightly smaller code.
      
      [1] https://lkml.org/lkml/2016/6/20/493
      [2] https://lkml.org/lkml/2017/6/19/345
      
      [akpm@linux-foundation.org: fix up reject fixupping]
      Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Reported-by: default avatarHao Lee <haolee.swjtu@gmail.com>
      Acked-by: default avatarWei Yang <weiyang@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b002529d
    • Vasily Averin's avatar
      fs/proc/task_mmu.c: remove obsolete comment in show_map_vma() · 8c03cc85
      Vasily Averin authored
      After commit 1be7107f
      
       ("mm: larger stack guard gap, between vmas")
      we do not hide stack guard page in /proc/<pid>/maps
      
      Link: http://lkml.kernel.org/r/211f3c2a-f7ef-7c13-82bf-46fd426f6e1b@virtuozzo.com
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c03cc85
    • Dou Liyang's avatar
      mm: drop useless local parameters of __register_one_node() · a7be6e5a
      Dou Liyang authored
      
      
      __register_one_node() initializes local parameters "p_node" & "parent"
      for register_node().
      
      But, register_node() does not use them.
      
      Remove the related code of "parent" node, cleanup __register_one_node()
      and register_node().
      
      Link: http://lkml.kernel.org/r/1498013846-20149-1-git-send-email-douly.fnst@cn.fujitsu.com
      Signed-off-by: default avatarDou Liyang <douly.fnst@cn.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7be6e5a
    • Vinayak Menon's avatar
      mm: avoid taking zone lock in pagetypeinfo_showmixed() · 727c080f
      Vinayak Menon authored
      
      
      pagetypeinfo_showmixedcount_print is found to take a lot of time to
      complete and it does this holding the zone lock and disabling
      interrupts.  In some cases it is found to take more than a second (On a
      2.4GHz,8Gb RAM,arm64 cpu).
      
      Avoid taking the zone lock similar to what is done by read_page_owner,
      which means possibility of inaccurate results.
      
      Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
      Signed-off-by: default avatarVinayak Menon <vinmenon@codeaurora.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      727c080f
    • Michal Hocko's avatar
      mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration · ef77ba5c
      Michal Hocko authored
      
      
      new_page is yet another duplication of the migration callback which has
      to handle hugetlb migration specially.  We can safely use the generic
      new_page_nodemask for the same purpose.
      
      Please note that gigantic hugetlb pages do not need any special handling
      because alloc_huge_page_nodemask will make sure to check pages in all
      per node pools.  The reason this was done previously was that
      alloc_huge_page_node treated NO_NUMA_NODE and a specific node
      differently and so alloc_huge_page_node(nid) would check on this
      specific node.
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew M...
      ef77ba5c
    • Michal Hocko's avatar
      hugetlb: add support for preferred node to alloc_huge_page_nodemask · 3e59fcb0
      Michal Hocko authored
      
      
      alloc_huge_page_nodemask tries to allocate from any numa node in the
      allowed node mask starting from lower numa nodes.  This might lead to
      filling up those low NUMA nodes while others are not used.  We can
      reduce this risk by introducing a concept of the preferred node similar
      to what we have in the regular page allocator.  We will start allocating
      from the preferred nid and then iterate over all allowed nodes in the
      zonelist order until we try them all.
      
      This is mimicing the page allocator logic except it operates on per-node
      mempools.  dequeue_huge_page_vma already does this so distill the
      zonelist logic into a more generic dequeue_huge_page_nodemask and use it
      in alloc_huge_page_nodemask.
      
      This will allow us to use proper per numa distance fallback also for
      alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
      can get rid of alloc_huge_page_node helper which doesn't have any user
      anymore.
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e59fcb0
    • Michal Hocko's avatar
      mm, hugetlb: unclutter hugetlb allocation layers · aaf14e40
      Michal Hocko authored
      
      
      Patch series "mm, hugetlb: allow proper node fallback dequeue".
      
      While working on a hugetlb migration issue addressed in a separate
      patchset[1] I have noticed that the hugetlb allocations from the
      preallocated pool are quite subotimal.
      
       [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
      
      There is no fallback mechanism implemented and no notion of preferred
      node.  I have tried to work around it but Vlastimil was right to push
      back for a more robust solution.  It seems that such a solution is to
      reuse zonelist approach we use for the page alloctor.
      
      This series has 3 patches.  The first one tries to make hugetlb
      allocation layers more clear.  The second one implements the zonelist
      hugetlb pool allocation and introduces a preferred node semantic which
      is used by the migration callbacks.  The last patch is a clean up.
      
      This patch (of 3):
      
      Hugetlb allocation path for fresh huge pages is unnecessarily complex
      and it mixes different interfaces between layers.
      
      __alloc_buddy_huge_page is the central place to perform a new
      allocation.  It checks for the hugetlb overcommit and then relies on
      __hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
      all good except that __alloc_buddy_huge_page pushes vma and address down
      the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
      two different allocation modes - one for memory policy and other node
      specific (or to make it more obscure node non-specific) requests.
      
      This just screams for a reorganization.
      
      This patch pulls out all the vma specific handling up to
      __alloc_buddy_huge_page_with_mpol where it belongs.
      __alloc_buddy_huge_page will get nodemask argument and
      __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
      page allocator.
      
      In short:
      __alloc_buddy_huge_page_with_mpol - memory policy handling
        __alloc_buddy_huge_page - overcommit handling and accounting
          __hugetlb_alloc_buddy_huge_page - page allocator layer
      
      Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
      is not really needed because the page allocator already handles the
      cpusets update.
      
      Finally __hugetlb_alloc_buddy_huge_page had a special case for node
      specific allocations (when no policy is applied and there is a node
      given).  This has relied on __GFP_THISNODE to not fallback to a different
      node.  alloc_huge_page_node is the only caller which relies on this
      behavior so move the __GFP_THISNODE there.
      
      Not only does this remove quite some code it also should make those
      layers easier to follow and clear wrt responsibilities.
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aaf14e40
    • Roman Gushchin's avatar
      mm/oom_kill.c: add tracepoints for oom reaper-related events · 422580c3
      Roman Gushchin authored
      
      
      During the debugging of the problem described in
      https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
      https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
      output is not really useful to understand issues related to the oom
      reaper.
      
      So, I assume, that adding some tracepoints might help with debugging of
      similar issues.
      
      Trace the following events:
       1) a process is marked as an oom victim,
       2) a process is added to the oom reaper list,
       3) the oom reaper starts reaping process's mm,
       4) the oom reaper finished reaping,
       5) the oom reaper skips reaping.
      
      How it works in practice? Below is an example which show how the problem
      mentioned above can be found: one process is added twice to the
      oom_reaper list:
      
        $ cd /sys/kernel/debug/tracing
        $ echo "oom:mark_victim" > set_event
        $ echo "oom:wake_reaper" >> set_event
        $ echo "oom:skip_task_reaping" >> set_event
        $ echo "oom:start_task_reaping" >> set_event
        $ echo "oom:finish_task_reaping" >> set_event
        $ cat trace_pipe
                allocate-502   [001] ....    91.836405: mark_victim: pid=502
                allocate-502   [001] .N..    91.837356: wake_reaper: pid=502
                allocate-502   [000] .N..    91.871149: wake_reaper: pid=502
              oom_reaper-23    [000] ....    91.871177: start_task_reaping: pid=502
              oom_reaper-23    [000] .N..    91.879511: finish_task_reaping: pid=502
              oom_reaper-23    [000] ....    91.879580: skip_task_reaping: pid=502
      
      Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      422580c3
    • Mike Rapoport's avatar
      userfaultfd: non-cooperative: add madvise() event for MADV_FREE request · 230ca982
      Mike Rapoport authored
      
      
      MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd
      monitor.  The monitor has to stop handling #PF events in the range being
      freed.  We are reusing userfaultfd_remove callback along with the logic
      required to re-get and re-validate the VMA which may change or disappear
      because userfaultfd_remove releases mmap_sem.
      
      Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      230ca982
    • Jan Kara's avatar
      mm/truncate.c: fix THP handling in invalidate_mapping_pages() · 76b6f9b7
      Jan Kara authored
      
      
      The condition checking for THP straddling end of invalidated range is
      wrong - it checks 'index' against 'end' but 'index' has been already
      advanced to point to the end of THP and thus the condition can never be
      true.  As a result THP straddling 'end' has been fully invalidated.
      Given the nature of invalidate_mapping_pages(), this could be only
      performance issue.  In fact, we are lucky the condition is wrong because
      if it was ever true, we'd leave locked page behind.
      
      Fix the condition checking for THP straddling 'end' and also properly
      unlock the page.  Also update the comment before the condition to
      explain why we decide not to invalidate the page as it was not clear to
      me and I had to ask Kirill.
      
      Link: http://lkml.kernel.org/r/20170619124723.21656-1-jack@suse.cz
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      76b6f9b7
    • Matthew Wilcox's avatar
      mm/hugetlb.c: replace memfmt with string_get_size · c6247f72
      Matthew Wilcox authored
      
      
      The hugetlb code has its own function to report human-readable sizes.
      Convert it to use the shared string_get_size() function.  This will lead
      to a minor difference in user visible output (MiB/GiB instead of MB/GB),
      but some would argue that's desirable anyway.
      
      Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org
      Signed-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6247f72
    • Michal Hocko's avatar
      mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit() · 6a1a8b80
      Michal Hocko authored
      
      
      Alice has reported the following UBSAN splat:
      
        UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
        signed integer overflow:
        -2147483644 - 2147483525 cannot be represented in type 'long int'
        CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P           O 4.9.25-gentoo #4
        Hardware name: XXXXXX, BIOS YYYYYY
        Call Trace:
          dump_stack+0x59/0x87
          ubsan_epilogue+0xe/0x40
          handle_overflow+0xbb/0xf0
          __ubsan_handle_sub_overflow+0x12/0x20
          memcg_check_events.isra.36+0x223/0x360
          mem_cgroup_commit_charge+0x55/0x140
          wp_page_copy+0x34e/0xb80
          do_wp_page+0x1e6/0x1300
          handle_mm_fault+0x88b/0x1990
          __do_page_fault+0x2de/0x8a0
          do_page_fault+0x1a/0x20
          error_code+0x67/0x6c
      
      The reason is that we subtract two signed types.  Let's fix this by
      truly mimicing time_after and cast the result of the subtraction.
      
      Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarAlice Ferrazzi <alicef@gentoo.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a1a8b80
    • David Rientjes's avatar
      mm, hugetlb: schedule when potentially allocating many hugepages · 69ed779a
      David Rientjes authored
      
      
      A few hugetlb allocators loop while calling the page allocator and can
      potentially prevent rescheduling if the page allocator slowpath is not
      utilized.
      
      Conditionally schedule when large numbers of hugepages can be allocated.
      
      Anshuman:
       "Fixes a task which was getting hung while writing like 10000 hugepages
        (16MB on POWER8) into /proc/sys/vm/nr_hugepages."
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69ed779a
    • Michal Hocko's avatar
      mm: unify new_node_page and alloc_migrate_target · 8b913238
      Michal Hocko authored
      Commit 394e31d2
      
       ("mem-hotplug: alloc new page from a nearest
      neighbor node when mem-offline") has duplicated a large part of
      alloc_migrate_target with some hotplug specific special casing.
      
      To be more precise it tried to enfore the allocation from a different
      node than the original page.  As a result the two function diverged in
      their shared logic, e.g.  the hugetlb allocation strategy.
      
      Let's unify the two and express different NUMA requirements by the given
      nodemask.  new_node_page will simply exclude the node it doesn't care
      about and alloc_migrate_target will use all the available nodes.
      alloc_migrate_target will then learn to migrate hugetlb pages more
      sanely and use preallocated pool when possible.
      
      Please note that alloc_migrate_target used to call alloc_page resp.
      alloc_pages_current so the memory policy of the current context which is
      quite strange when we consider that it is used in the context of
      alloc_contig_range which just tries to migrate pages which stand in the
      way.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b913238
    • Michal Hocko's avatar
      hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      
      
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite subotpimal for any
      configuration when hugetlb pages are no distributed to all NUMA nodes
      evenly.  Say we have a hotplugable node 4 and spare hugetlb pages are
      node 0
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes migration source and try
      to find a first node which has a page in the preallocated pool first and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
      consumed.
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4db9b2ef
    • Michal Hocko's avatar
      mm, memory_hotplug: simplify empty node mask handling in new_node_page · 7f252f27
      Michal Hocko authored
      
      
      new_node_page tries to allocate the target page on a different NUMA node
      than the source page.  This makes sense in most cases during the hotplug
      because we are likely to offline the whole numa node.  But there are
      cases where there are no other nodes to fallback (e.g.  when offlining
      parts of the only existing node) and we have to fallback to allocating
      from the source node.  The current code does that but it can be
      simplified by checking the nmask and updating it before we even try to
      allocate rather than special casing it.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f252f27
    • Michal Hocko's avatar
      mm, memory_hotplug: support movable_node for hotpluggable nodes · 9f123ab5
      Michal Hocko authored
      
      
      movable_node kernel parameter allows making hotpluggable NUMA nodes to
      put all the hotplugable memory into movable zone which allows more or
      less reliable memory hotremove.  At least this is the case for the NUMA
      nodes present during the boot (see find_zone_movable_pfns_for_nodes).
      
      This is not the case for the memory hotplug, though.
      
      	echo online > /sys/devices/system/memory/memoryXYZ/state
      
      will default to a kernel zone (usually ZONE_NORMAL) unless the
      particular memblock is already in the movable zone range which is not
      the case normally when onlining the memory from the udev rule context
      for a freshly hotadded NUMA node.  The only option currently is to have
      a special udev rule to echo online_movable to all memblocks belonging to
      such a node which is rather clumsy.  Not to mention this is inconsistent
      as well because what ended up in the movable zone during the boot will
      end up in a kernel zone after hotremove & hotadd without special care.
      
      It would be nice to reuse memblock_is_hotpluggable but the runtime
      hotplug doesn't have that information available because the boot and
      hotplug paths are not shared and it would be really non trivial to make
      them use the same code path because the runtime hotplug doesn't play
      with the memblock allocator at all.
      
      Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
      movable_node is enabled and the range doesn't overlap with the existing
      normal zone.  This should provide a reasonable default onlining
      strategy.
      
      Strictly speaking the semantic is not identical with the boot time
      initialization because find_zone_movable_pfns_for_nodes covers only the
      hotplugable range as described by the BIOS/FW.  From my experience this
      is usually a full node though (except for Node0 which is special and
      never goes away completely).  If this turns out to be a problem in the
      real life we can tweak the code to store hotplug flag into memblocks but
      let's keep this simple now.
      
      Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f123ab5
    • Andy Shevchenko's avatar
      zram: use __sysfs_match_string() helper · ed8a5553
      Andy Shevchenko authored
      
      
      Use __sysfs_match_string() helper instead of open coded variant.
      
      Link: http://lkml.kernel.org/r/20170609120835.22156-1-andriy.shevchenko@linux.intel.com
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed8a5553
    • Will Deacon's avatar
      mm/migrate.c: stabilise page count when migrating transparent hugepages · f4e177d1
      Will Deacon authored
      
      
      When migrating a transparent hugepage, migrate_misplaced_transhuge_page
      guards itself against a concurrent fastgup of the page by checking that
      the page count is equal to 2 before and after installing the new pmd.
      
      If the page count changes, then the pmd is reverted back to the original
      entry, however there is a small window where the new (possibly writable)
      pmd is installed and the underlying page could be written by userspace.
      Restoring the old pmd could therefore result in loss of data.
      
      This patch fixes the problem by freezing the page count whilst updating
      the page tables, which protects against a concurrent fastgup without the
      need to restore the old pmd in the failure case (since the page count
      can no longer change under our feet).
      
      Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc...
      f4e177d1
    • Will Deacon's avatar
      include/linux/page_ref.h: ensure page_ref_unfreeze is ordered against prior accesses · 108a7ac4
      Will Deacon authored
      
      
      page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
      wrapping a critical section where struct pages can be modified without
      having to worry about consistency for a concurrent fast-GUP.
      
      Whilst page_ref_freeze has full barrier semantics due to its use of
      atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
      doesn't provide any barrier semantics and allows the operation to be
      reordered with respect to page modifications in the critical section.
      
      This patch ensures that page_ref_unfreeze is ordered after any critical
      section updates, by invoking smp_mb() prior to the atomic_set.
      
      Link: http://lkml.kernel.org/r/1497349722-6731-3-git-send-email-will.deacon@arm.com
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      108a7ac4
    • Dan Williams's avatar
      mm: always enable thp for dax mappings · baabda26
      Dan Williams authored
      
      
      The madvise policy for transparent huge pages is meant to avoid unwanted
      allocations of transparent huge pages.  It allows a policy of disabling
      the extra memory pressure and effort to arrange for a huge page when it
      is not needed.
      
      DAX by definition never incurs this overhead since it is statically
      allocated.  The policy choice makes even less sense for device-dax which
      tries to guarantee a given tlb-fault size.  Specifically, the following
      setting:
      
      	echo never > /sys/kernel/mm/transparent_hugepage/enabled
      
      ...violates that guarantee and silently disables all device-dax
      instances with a 2M or 1G alignment.  So, let's avoid that non-obvious
      side effect by force enabling thp for dax mappings in all cases.
      
      It is worth noting that the reason this uses vma_is_dax(), and the
      resulting header include changes, is that previous attempts to add a
      VM_DAX flag were NAKd.
      
      Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      baabda26
    • Dan Williams's avatar
      mm: improve readability of transparent_hugepage_enabled() · 16981d76
      Dan Williams authored
      
      
      Turn the macro into a static inline and rewrite the condition checks for
      better readability in preparation for adding another condition.
      
      [ross.zwisler@linux.intel.com: fix logic to make conversion equivalent]
      [akpm@linux-foundation.org: resolve vs mm-make-pr_set_thp_disable-immediately-active.patch]
      [akpm@linux-foundation.org: include coredump.h for MMF_DISABLE_THP]
      Link: http://lkml.kernel.org/r/149739530612.20686.14760671150202647861.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: default avatar"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16981d76
    • Steven Rostedt (VMware)'s avatar
      oom, trace: remove ENUM evaluation of COMPACTION_FEEDBACK · 7ab0e50a
      Steven Rostedt (VMware) authored
      
      
      After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to
      CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have
      been evaluated:
      
       # cat /sys/kernel/debug/tracing/enum_map
      
      (which will soon be renamed to eval_map)
      
      And it showed some interesting results:
      
        [..]
        ZONE_MOVABLE 3 (oom)
        ZONE_NORMAL 2 (oom)
        ZONE_DMA32 1 (oom)
        ZONE_DMA 0 (oom)
        3 3 (oom)
        2 2 (oom)
        1 1 (oom)
        COMPACT_PRIO_ASYNC 2 (oom)
        COMPACT_PRIO_SYNC_LIGHT 1 (oom)
        COMPACT_PRIO_SYNC_FULL 0 (oom)
        [..]
        ZONE_DMA 0 (vmscan)
        3 3 (vmscan)
        2 2 (vmscan)
        1 1 (vmscan)
        COMPACT_PRIO_ASYNC 2 (vmscan)
        [..]
        ZONE_DMA 0 (kmem)
        3 3 (kmem)
        2 2 (kmem)
        1 1 (kmem)
        COMPACT_PRIO_ASYNC 2 (kmem)
        [..]
        ZONE_DMA 0 (compaction)
        3 3 (compaction)
        2 2 (compaction)
        1 1 (compaction)
        COMPACT_PRIO_ASYNC 2 (compaction)
        [..]
      
      The name within the parenthesis are the trace systems that the enum/eval
      maps are associated with. When there's a number evaluated to another
      number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define
      and not an enum. As #defines get converted normally, they are not needed
      to be evaluated.
      
      Each of the above trace systems with the number to number evaluation
      included the file include/trace/events/mmflags.h which has:
      
       /* High-level compaction status feedback */
       #define COMPACTION_FAILED       1
       #define COMPACTION_WITHDRAWN    2
       #define COMPACTION_PROGRESS     3
      
      [..]
      
       #define COMPACTION_FEEDBACK             \
              EM(COMPACTION_FAILED,           "failed")       \
              EM(COMPACTION_WITHDRAWN,        "withdrawn")    \
              EMe(COMPACTION_PROGRESS,        "progress")
      
      Which is still needed for the __print_symbolic() usage in the
      trace_event.  But it is not needed to be evaluated.
      
      Removing the evaluation part removes the unnecessary evaluations of
      numbers to numbers.
      
      Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.home
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ab0e50a
    • Liam R. Howlett's avatar
      mm/hugetlb.c: warn the user when issues arise on boot due to hugepages · d715cf80
      Liam R. Howlett authored
      
      
      When the user specifies too many hugepages or an invalid
      default_hugepagesz the communication to the user is implicit in the
      allocation message.  This patch adds a warning when the desired page
      count is not allocated and prints an error when the default_hugepagesz
      is invalid on boot.
      
      During boot hugepages will allocate until there is a fraction of the
      hugepage size left.  That is, we allocate until either the request is
      satisfied or memory for the pages is exhausted.  When memory for the
      pages is exhausted, it will most likely lead to the system failing with
      the OOM manager not finding enough (or anything) to kill (unless you're
      using really big hugepages in the order of 100s of MB or in the GBs).
      The user will most likely see the OOM messages much later in the boot
      sequence than the implicitly stated message.  Worse yet, you may even
      get an OOM for each processor which causes many pages of OOMs on modern
      systems.  Although these messages will be printed earlier than the OOM
      messages, at least giving the user errors and warnings will highlight
      the configuration as an issue.  I'm trying to point the user in the
      right direction by providing a more robust statement of what is failing.
      
      During the sysctl or echo command, the user can check the results much
      easier than if the system hangs during boot and the scenario of having
      nothing to OOM for kernel memory is highly unlikely.
      
      Mike said:
       "Before sending out this patch, I asked Liam off list why he was doing
        it. Was it something he just thought would be useful? Or, was there
        some type of user situation/need. He said that he had been called in
        to assist on several occasions when a system OOMed during boot. In
        almost all of these situations, the user had grossly misconfigured
        huge pages.
      
        DB users want to pre-allocate just the right amount of huge pages, but
        sometimes they can be really off. In such situations, the huge page
        init code just allocates as many huge pages as it can and reports the
        number allocated. There is no indication that it quit allocating
        because it ran out of memory. Of course, a user could compare the
        number in the message to what they requested on the command line to
        determine if they got all the huge pages they requested. The thought
        was that it would be useful to at least flag this situation. That way,
        the user might be able to better relate the huge page allocation
        failure to the OOM.
      
        I'm not sure if the e-mail discussion made it obvious that this is
        something he has seen on several occasions.
      
        I see Michal's point that this will only flag the situation where
        someone configures huge pages very badly. And, a more extensive look
        at the situation of misconfiguring huge pages might be in order. But,
        this has happened on several occasions which led to the creation of
        this patch"
      
      [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
      Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d715cf80
    • Anshuman Khandual's avatar
      mm/cma.c: warn if the CMA area could not be activated · e35ef639
      Anshuman Khandual authored
      
      
      While activating a CMA area we check to make sure that all the PFNs in
      the range are inside the same zone.  This is a requirement for
      alloc_contig_range() to work.  Any CMA area failing the check is
      disabled for good.  This happens silently right now making all future
      cma_alloc() allocations failure inevitable.
      
      Here we add an error message stating that the CMA area could not be
      activated which makes it easier to explain any future cma_alloc()
      failures on it.  While in there, change the bail out goto label from
      'err' to 'not_in_zone' which makes more sense.
      
      Link: http://lkml.kernel.org/r/20170605023729.26303-1-khandual@linux.vnet.ibm.com
      Signed-off-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e35ef639
    • Yisheng Xie's avatar
      vmalloc: show lazy-purged vma info in vmallocinfo · 78c72746
      Yisheng Xie authored
      
      
      When ioremap a 67112960 bytes vm_area with the vmallocinfo:
       [..]
       0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
       0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap
      
      we get the result:
       0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap
      
      For the align for ioremap must be less than '1 << IOREMAP_MAX_ORDER':
      
      	if (flags & VM_IOREMAP)
      		align = 1ul << clamp_t(int, get_count_order_long(size),
      			PAGE_SHIFT, IOREMAP_MAX_ORDER);
      
      So it makes idiot like me a litte puzzled why this was a jump the
      vm_area from 0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, and leaving
      0xed000000-0xf1000000 as a big hole.
      
      This patch is to show all of vm_area, including vmas which are freeing
      but still in the vmap_area_list, to make it more clear about why we will
      get 0xf1000000-0xf5001000 in the above case.  And we will get a
      vmallocinfo like:
      
       [..]
       0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
       0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap
       [..]
       0xece7c000-0xece7e000    8192 unpurged vm_area
       0xece7e000-0xece83000   20480 vm_map_ram
       0xf0099000-0xf00aa000   69632 vm_map_ram
      
      after this patch.
      
      Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: zijun_hu <zijun_hu@htc.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      78c72746
    • Sean Christopherson's avatar
      mm/memcontrol: exclude @root from checks in mem_cgroup_low · 34c81057
      Sean Christopherson authored
      Make @root exclusive in mem_cgroup_low; it is never considered low when
      looked at directly and is not checked when traversing the tree.  In
      effect, @root is handled identically to how root_mem_cgroup was
      previously handled by mem_cgroup_low.
      
      If @root is not excluded from the checks, a cgroup underneath @root will
      never be considered low during targeted reclaim of @root, e.g.  due to
      memory.current > memory.high, unless @root is misconfigured to have
      memory.low > memory.high.
      
      Excluding @root enables using memory.low to prioritize memory usage
      between cgroups within a subtree of the hierarchy that is limited by
      memory.high or memory.max, e.g.  when ROOT owns @root's controls but
      delegates the @root directory to a USER so that USER can create and
      administer children of @root.
      
      For example, given cgroup A with children B and C:
      
          A
         / \
        B   C
      
      and
      
        1. A/memory.current > A/memory.high
        2. A/B/memory.current < A/B/memory.low
        3. A/C/memory.current >= A/C/memory.low
      
      As 'A' is high, i.e.  triggers reclaim from 'A', and 'B' is low, we
      should reclaim from 'C' until 'A' is no longer high or until we can no
      longer reclaim from 'C'.  If 'A', i.e.  @root
      
      , isn't excluded by
      mem_cgroup_low when reclaming from 'A', then 'B' won't be considered low
      and we will reclaim indiscriminately from both 'B' and 'C'.
      
      Here is the test I used to confirm the bug and the patch.
      
      20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
      #!/bin/bash
      
      x62mb=$((62<<20))
      x66mb=$((66<<20))
      x94mb=$((94<<20))
      x98mb=$((98<<20))
      
      setup() {
          set -e
      
          if [[ -n $DEBUG ]]; then
              set -x
          fi
      
          trap teardown EXIT HUP INT TERM
      
          if [[ ! -e /mnt/1gb.swap ]]; then
              sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
              sudo mkswap /mnt/1gb.swap > /dev/null
          fi
          if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
              sudo swapon /mnt/1gb.swap
          fi
      
          if [[ ! -e /cgroup/cgroup.controllers ]]; then
              sudo mount -t cgroup2 none /cgroup
          fi
      
          grep -q memory /cgroup/cgroup.controllers
      
          sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"
      
          sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
          sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
          sudo sh -c "echo '96m' > /cgroup/A/memory.high"
      
          mkdir /cgroup/A/0
          mkdir /cgroup/A/1
      
          echo 64m > /cgroup/A/0/memory.low
      }
      
      teardown() {
          set +e
      
          trap - EXIT HUP INT TERM
      
          if [[ -z $1 ]]; then
              printf "\n"
              printf "%0.s*" {1..35}
              printf "\nFAILED!\n\n"
              tail /cgroup/A/**/memory.current
              printf "%0.s*" {1..35}
              printf "\n\n"
          fi
      
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      
          sleep 2
      
          if [[ -e /cgroup/A/0 ]]; then
              rmdir /cgroup/A/0
          fi
          if [[ -e /cgroup/A/1 ]]; then
              rmdir /cgroup/A/1
          fi
          if [[ -e /cgroup/A ]]; then
              sudo rmdir /cgroup/A
          fi
      }
      
      stress_test() {
          sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/cgroup.procs"
      
          sleep 1
      
          # A/0 should be consuming more memory than A/1
          [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]
      
          # A/0 should be consuming ~64mb
          [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]
      
          # A should cumulatively be consuming ~96mb
          [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]
      
          # Stop the stressors
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      }
      
      teardown 1
      setup
      
      for ((i=1;i<=$1;i++)); do
          printf "ITERATION $i of $1 - stress_test 0 1"
          stress_test 0 1
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - stress_test 1 0"
          stress_test 1 0
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - PASSED\n"
      done
      
      teardown 1
      
      echo PASSED!
      
      20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10
      
      Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34c81057
    • Michal Hocko's avatar
      mm: make PR_SET_THP_DISABLE immediately active · 18600332
      Michal Hocko authored
      PR_SET_THP_DISABLE has a rather subtle semantic.  It doesn't affect any
      existing mapping because it only updated mm->def_flags which is a
      template for new mappings.
      
      The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
      flag set.  This can be quite surprising for all those applications which
      do not do prctl(); fork() & exec() and want to control their own THP
      behavior.
      
      Another usecase when the immediate semantic of the prctl might be useful
      is a combination of pre- and post-copy migration of containers with
      CRIU.  In this case CRIU populates a part of a memory region with data
      that was saved during the pre-copy stage.  Afterwards, the region is
      registered with userfaultfd and CRIU expects to get page faults for the
      parts of the region that were not yet populated.  However, khugepaged
      collapses the pages and the expected page faults do not occur.
      
      In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
      temporary mechanism for enabling/disabling THP process wide.
      
      Implementation wise, a new MMF_DISABLE_THP flag is added.  This flag is
      tested when decision whether to use huge pages is taken either during
      page fault of at the time of THP collapse.
      
      It should be noted, that the new implementation makes PR_SET_THP_DISABLE
      master override to any per-VMA setting, which was not the case
      previously.
      
      Fixes: a0715cc2
      
       ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
      Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18600332
    • David Rientjes's avatar
      mm, vmpressure: pass-through notification support · b6bb9811
      David Rientjes authored
      
      
      By default, vmpressure events are not pass-through, i.e.  they propagate
      up through the memcg hierarchy until an event notifier is found for any
      threshold level.
      
      This presents a difficulty when a thread waiting on a read(2) for a
      vmpressure event cannot distinguish between local memory pressure and
      memory pressure in a descendant memcg, especially when that thread may
      not control the memcg hierarchy.
      
      Consider a user-controlled child memcg with a smaller limit than a
      top-level memcg controlled by the "Activity Manager" specified in
      Documentation/cgroup-v1/memory.txt.  It may register for memory pressure
      notification for descendant memcgs to make a policy decision: oom kill a
      low priority job, increase the limit, decrease other limits, etc.  If it
      registers for memory pressure notification on the top-level memcg, it
      currently cannot distinguish between memory pressure in its own memcg or
      a descendant memcg, which is user-controlled.
      
      Conversely, if a user registers for memory pressure notification on
      their own descendant memcg, the Activity Manager does not receive any
      pressure notification for that child memcg hierarchy.  Vmpressure events
      are not received for ancestor memcgs if the memcg experiencing pressure
      have notifiers registered, perhaps outside the knowledge of the thread
      waiting on read(2) at the top level.
      
      Both of these are consequences of vmpressure notification not being
      pass-through.
      
      This implements a pass-through behavior for vmpressure events.  When
      writing to control.event_control, vmpressure event handlers may
      optionally specify a mode.  There are two new modes:
      
       - "hierarchy": always propagate memory pressure events up the hierarchy
         regardless if descendant memcgs have their own notifiers registered,
         and
      
       - "local": only receive notifications when the memcg for which the
         event is registered experiences memory pressure.
      
      Of course, processes may register for one notification of "low,local",
      for example, and another for "low".
      
      If no mode is specified, the current behavior is maintained for
      backwards compatibility.
      
      See the change to Documentation/cgroup-v1/memory.txt for full
      specification.
      
      [dan.carpenter@oracle.com: free the same pointer we allocated]
        Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6bb9811
    • Naoya Horiguchi's avatar
      mm: hwpoison: introduce idenfity_page_state · 0348d2eb
      Naoya Horiguchi authored
      
      
      Factoring duplicate code into a function.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0348d2eb
    • Naoya Horiguchi's avatar
      mm: hugetlb: delete dequeue_hwpoisoned_huge_page() · ddd40d8a
      Naoya Horiguchi authored
      
      
      dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ddd40d8a
    • Naoya Horiguchi's avatar
      mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error · 78bb9203
      Naoya Horiguchi authored
      Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
      keep the error hugepage away from the system, which is OK but not good
      enough because the hugepage still has a refcount and unpoison doesn't
      work on the error hugepage (PageHWPoison flags are cleared but pages are
      still leaked.) And there's "wasting health subpages" issue too.  This
      patch reworks on me_huge_page() to solve these issues.
      
      For hugetlb file, recently we have truncating code so let's use it in
      hugetlbfs specific ->error_remove_page().
      
      For anonymous hugepage, it's helpful to dissolve the error page after
      freeing it into free hugepage list.  Migration entry and PageHWPoison in
      the head page prevent the access to it.
      
      TODO: dissolve_free_huge_page() can fail but we don't considered it yet.
      It's not critical (and at least no worse that now) because in such case
      the error hugepage just stays in free hugepage list without being
      dissolved.  By virtue of PageHWPoison in head page, it's never allocated
      to processes.
      
      [akpm@linux-foundation.org: fix unused var warnings]
      Fixes: 23a003bf
      
       ("mm/madvise: pass return code of memory_failure() to userspace")
      Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
      Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      78bb9203
    • Naoya Horiguchi's avatar
      mm: hwpoison: introduce memory_failure_hugetlb() · 761ad8d7
      Naoya Horiguchi authored
      
      
      memory_failure() is a big function and hard to maintain.  Handling
      hugetlb- and non-hugetlb- case in a single function is not good, so this
      patch separates PageHuge() branch into a new function, which saves many
      PageHuge() check.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      761ad8d7