  1. Dec 13, 2016
    • mm: workingset: move shadow entry tracking to radix tree exceptional tracking · 14b46879
      Johannes Weiner authored
      Currently, we track the shadow entries in the page cache in the upper
      bits of the radix_tree_node->count, behind the back of the radix tree
      implementation.  Because the radix tree code has no awareness of them,
      we rely on random subtleties throughout the implementation (such as the
      node->count != 1 check in the shrinking code, which is meant to exclude
      multi-entry nodes but also happens to skip nodes with only one shadow
      entry, as that's accounted in the upper bits).  This is error prone and
      has, in fact, caused the bug fixed in d3798ae8 ("mm: filemap: don't
      plant shadow entries without radix tree node").
      
      To remove these subtleties, this patch moves shadow entry tracking from
      the upper bits of node->count to the existing counter for exceptional
      entries.  node->count goes back to being a simple counter of valid
      entries in the tree node and can be shrunk to a single byte.
      
      This vastly simplifies the page cache code.  All accounting happens
      natively inside the radix tree implementation, and maintaining the LRU
      linkage of shadow nodes is consolidated into a single function in the
      workingset code that is called for leaf nodes affected by a change in
      the page cache tree.
      
      This also removes the last user of the __radix_tree_delete_node() return
      value.  Eliminate it.
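
      For illustration, the consolidated hook can be pictured like this (a
      minimal sketch, not the exact kernel function; the shadow-LRU helpers
      are hypothetical and list handling/locking are elided):

        /* Called for each leaf node affected by a page cache change. */
        static void shadow_node_update(struct radix_tree_node *node)
        {
                /*
                 * A node whose valid entries are all exceptional holds
                 * only shadow entries: keep it on the shadow LRU so the
                 * shrinker can reclaim it.  Anything else comes off.
                 */
                if (node->count && node->count == node->exceptional)
                        shadow_lru_add(node);   /* hypothetical helper */
                else
                        shadow_lru_del(node);   /* hypothetical helper */
        }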
      
      Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib: radix-tree: update callback for changing leaf nodes · 4d693d08
      Johannes Weiner authored
      
      
      Support handing __radix_tree_replace() a callback that gets invoked for
      all leaf nodes that change or get freed as a result of the slot
      replacement, to assist users tracking nodes with node->private_list.
      
      This prepares for putting page cache shadow entries into the radix tree
      root again and drastically simplifying the shadow tracking.
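
      A sketch of the shape this gives the API (signatures illustrative of
      the description above, not necessarily the exact prototypes):

        /* Invoked for each leaf node that changes or gets freed. */
        typedef void (*radix_tree_update_node_t)(struct radix_tree_node *node,
                                                 void *private);

        void __radix_tree_replace(struct radix_tree_root *root,
                                  struct radix_tree_node *node,
                                  void **slot, void *item,
                                  radix_tree_update_node_t update_node,
                                  void *private);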
      
      Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib: radix-tree: add entry deletion support to __radix_tree_replace() · f4b109c6
      Johannes Weiner authored
      
      
      Page cache shadow entry handling will be a lot simpler when it can use a
      single generic replacement function for pages, shadow entries, and
      emptying slots.
      
      Make __radix_tree_replace() properly account insertions and deletions in
      node->count and garbage collect nodes as they become empty.  Then
      re-implement radix_tree_delete() on top of it.
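
      The accounting can be sketched as deltas computed from the old and the
      new slot contents (a minimal sketch of the idea, not the exact code):

        void *old = rcu_dereference_raw(*slot);

        /* +1, 0 or -1 depending on the NULL-ness transition: */
        node->count += !!item - !!old;
        node->exceptional += !!radix_tree_exceptional_entry(item) -
                             !!radix_tree_exceptional_entry(old);

        rcu_assign_pointer(*slot, item);
        if (!node->count)
                __radix_tree_delete_node(root, node);   /* GC empty nodes */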
      
      Link: http://lkml.kernel.org/r/20161117193058.GC23430@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib: radix-tree: check accounting of existing slot replacement users · 6d75f366
      Johannes Weiner authored
      
      
      The bug in khugepaged fixed earlier in this series shows that radix tree
      slot replacement is fragile, and it will become more so when not only
      NULL<->!NULL transitions need to be caught but transitions from and to
      exceptional entries as well.  We need checks.
      
      Re-implement radix_tree_replace_slot() on top of the sanity-checked
      __radix_tree_replace().  This requires existing callers to also pass the
      radix tree root, but it'll warn us when somebody replaces slots with
      contents that need proper accounting (transitions between NULL entries,
      real entries, exceptional entries) and where a replacement through the
      slot pointer would corrupt the radix tree node counts.
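
      Sketched, the check boils down to warning whenever a raw slot store
      would have changed the per-node accounting (illustrative only):

        void *old = rcu_dereference_raw(*slot);

        /* A plain pointer store must not change either counter: */
        WARN_ON_ONCE(!!item - !!old);
        WARN_ON_ONCE(!!radix_tree_exceptional_entry(item) -
                     !!radix_tree_exceptional_entry(old));

        rcu_assign_pointer(*slot, item);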
      
      Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib: radix-tree: native accounting of exceptional entries · f7942430
      Johannes Weiner authored
      
      
      The way the page cache is sneaking shadow entries of evicted pages into
      the radix tree past the node entry accounting and tracking them manually
      in the upper bits of node->count is fraught with problems.
      
      These shadow entries are marked in the tree as exceptional entries,
      which are a native concept to the radix tree.  Maintain an explicit
      counter of exceptional entries in the radix tree node.  Subsequent
      patches will switch shadow entry tracking over to that counter.
      
      DAX and shmem are the other users of exceptional entries.  Since slot
      replacements that change the entry type from regular to exceptional must
      now be accounted, introduce a __radix_tree_replace() function that does
      replacement and accounting, and switch DAX and shmem over.
      
      The increase in radix tree node size is temporary.  A followup patch
      switches the shadow tracking to this new scheme, at which point the
      upper bits of node->count are no longer needed and it can shrink back
      to a single byte.
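
      For reference, the node layout this series converges on looks roughly
      like this (a sketch; comments and field order are illustrative and
      trimmed to the parts relevant here):

        struct radix_tree_node {
                unsigned char   shift;          /* bits remaining per slot */
                unsigned char   offset;         /* slot offset in parent */
                unsigned char   count;          /* valid entry count */
                unsigned char   exceptional;    /* exceptional entry count */
                struct radix_tree_node *parent;
                struct list_head private_list;  /* e.g. shadow node LRU */
                void __rcu      *slots[RADIX_TREE_MAP_SIZE];
                /* ... tags ... */
        };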
      
      Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: turn shadow node shrinker bugs into warnings · b936887e
      Johannes Weiner authored
      
      
      When the shadow page shrinker tries to reclaim a radix tree node but
      finds it in an unexpected state (it should contain no pages and a
      non-zero number of shadow entries), there is no need to kill the
      executing task or even the entire system.  Warn about the invalid
      state, then leave that tree node be.  Simply don't put it back on the
      shadow LRU for future reclaim and move on.
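
      Sketch of the resulting pattern in the shrinker (illustrative):

        /* The node should hold nothing but shadow entries: */
        if (WARN_ON_ONCE(!node->exceptional) ||
            WARN_ON_ONCE(node->count != node->exceptional)) {
                /* Invalid state: leave the node alone, off the LRU. */
                goto out_invalid;
        }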
      
      Link: http://lkml.kernel.org/r/20161117191138.22769-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: khugepaged: fix radix tree node leak in shmem collapse error path · 59749e6c
      Johannes Weiner authored
      The radix tree counts valid entries in each tree node.  Entries stored
      in the tree cannot be removed by simply storing NULL in the slot, or
      the internal counters will be off and the node never gets freed again.
      
      When collapsing a shmem page fails, restore the holes that were filled
      with radix_tree_insert() with a proper radix tree deletion.
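
      The difference, sketched (hypothetical surrounding code):

        /* Buggy: storing NULL leaves the node's entry count stale: */
        *slot = NULL;

        /* Correct: a proper deletion maintains the counters and lets
         * empty nodes be freed: */
        radix_tree_delete(&mapping->page_tree, index);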
      
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Link: http://lkml.kernel.org/r/20161117191138.22769-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: khugepaged: close use-after-free race during shmem collapsing · 91a45f71
      Johannes Weiner authored
      Patch series "mm: workingset: radix tree subtleties & single-page file
      refaults", v3.
      
      This is another revision of the radix tree / workingset patches based on
      feedback from Jan and Kirill.
      
      This is a follow-up to d3798ae8 ("mm: filemap: don't plant shadow
      entries without radix tree node").  That patch fixed an issue that was
      caused mainly by the page cache sneaking special shadow page entries
      into the radix tree and relying on subtleties in the radix tree code to
      make that work.  The fix also had to stop tracking refaults for
      single-page files because shadow pages stored as direct pointers in
      radix_tree_root->rnode weren't properly handled during tree extension.
      
      These patches make the radix tree code explicitly support and track
      such special entries, to eliminate the subtleties and to restore the
      thrash detection for single-page files.
      
      This patch (of 9):
      
      When a radix tree iteration drops the tree lock, another thread might
      swoop in and free the node holding the current slot.  The iteration
      needs to do another tree lookup from the current index to continue.
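
      Sketched, the fixed pattern looks like this (illustrative; lock and
      tree names as in the page cache of that era):

        spin_unlock_irq(&mapping->tree_lock);
        /* ... sleeping allocation, page lock, etc. ... */
        spin_lock_irq(&mapping->tree_lock);

        /* The node behind 'slot' may have been freed meanwhile; do a
         * fresh lookup from the current index before using it again: */
        slot = radix_tree_lookup_slot(&mapping->page_tree, index);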
      
      [kirill.shutemov@linux.intel.com: re-lookup for replacement]
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Link: http://lkml.kernel.org/r/20161117191138.22769-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/backing-dev-defs.h: shrink struct backing_dev_info · 8db378a5
      Andrew Morton authored
      
      
      Move the 4-byte `capabilities' field next to other 4-byte things.
      Shrinks sizeof(backing_dev_info) by 8 bytes on x86_64.
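
      The effect is ordinary structure packing; a self-contained
      illustration with hypothetical members (LP64 sizes):

        struct padded {                 /* 32 bytes on x86_64 */
                void *a;                /* 8 */
                unsigned int cap;       /* 4, then 4 bytes padding */
                void *b;                /* 8 */
                unsigned int x;         /* 4, then 4 bytes tail padding */
        };

        struct packed {                 /* 24 bytes on x86_64 */
                void *a;                /* 8 */
                void *b;                /* 8 */
                unsigned int cap;       /* 4 */
                unsigned int x;         /* 4 */
        };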
      
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't cap request size based on read-ahead setting · 9491ae4a
      Jens Axboe authored
      
      
      We ran into a funky issue, where someone doing 256K buffered reads saw
      128K requests at the device level.  Turns out it is read-ahead capping
      the request size, since we use 128K as the default setting.  This
      doesn't make a lot of sense - if someone is issuing 256K reads, they
      should see 256K reads, regardless of the read-ahead setting, if the
      underlying device can support a 256K read in a single command.
      
      This patch introduces a bdi hint, io_pages.  This is the soft max IO
      size for the lower level, I've hooked it up to the bdev settings here.
      Read-ahead is modified to issue the maximum of the user request size,
      and the read-ahead max size, but capped to the max request size on the
      device side.  The latter is done to avoid reading ahead too much, if the
      application asks for a huge read.  With this patch, the kernel behaves
      like the application expects.
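
      The sizing rule described above, sketched with hypothetical variable
      names:

        unsigned long max_pages;

        /* Let an explicit large read override the read-ahead default, */
        max_pages = max(ra->ra_pages, req_size);
        /* but never exceed what the device wants per request: */
        max_pages = min(max_pages, bdi->io_pages);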
      
      Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: fix compilation warnings on unused functions · f1f5929c
      Jérémy Lefaure authored
      
      
      Compiling shmem.c with SHMEM and TRANSPARENT_HUGE_PAGECACHE enabled
      raises warnings on two unused functions when CONFIG_TMPFS and
      CONFIG_SYSFS are both disabled:
      
        mm/shmem.c:390:20: warning: `shmem_format_huge' defined but not used [-Wunused-function]
         static const char *shmem_format_huge(int huge)
                            ^~~~~~~~~~~~~~~~~
        mm/shmem.c:373:12: warning: `shmem_parse_huge' defined but not used [-Wunused-function]
         static int shmem_parse_huge(const char *str)
                     ^~~~~~~~~~~~~~~~
      
      A conditional compilation on tmpfs or sysfs removes the warnings.
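
      I.e., roughly (a sketch; the exact guard per function may differ):

        #if defined(CONFIG_SYSFS) || defined(CONFIG_TMPFS)
        static int shmem_parse_huge(const char *str)
        { /* ... */ }

        static const char *shmem_format_huge(int huge)
        { /* ... */ }
        #endif /* CONFIG_SYSFS || CONFIG_TMPFS */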
      
      Link: http://lkml.kernel.org/r/20161118055749.11313-1-jeremy.lefaure@lse.epita.fr
      Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/fs-writeback.c: remove redundant if check · bace9248
      Tahsin Erdogan authored
      
      
      The b_more_io non-empty check is already preceded by an opposite check.
      
      Link: http://lkml.kernel.org/r/1478591249-30641-1-git-send-email-tahsin@google.com
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap.c: add comment for confusing logic in page_cache_tree_insert() · c70b647d
      Kirill A. Shutemov authored
      
      
      Unlike THP, hugetlb pages are represented by one entry in the
      radix-tree.
      
      [akpm@linux-foundation.org: tweak comment]
      Link: http://lkml.kernel.org/r/20161110163640.126124-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: make linux/cma.h standalone includible · d5e6eff2
      Thierry Reding authored
      
      
      The header uses types and definitions from the linux/init.h as well as
      linux/types.h headers without explicitly including them.  This causes a
      failure to compile if they are not implicitly pulled in by includers.
      
      Link: http://lkml.kernel.org/r/20161115133235.13387-1-thierry.reding@gmail.com
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: disable numa migration faults for dax vmas · c1ef8e2c
      Dan Williams authored
      
      
      Mark dax vmas as not migratable to exclude them from task_numa_work().
      This is especially relevant for device-dax which wants to ensure
      predictable access latency and not incur periodic faults.
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/147892450132.22062.16875659431109209179.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pkeys: generate pkey system call code only if ARCH_HAS_PKEYS is selected · c7142aea
      Heiko Carstens authored
      
      
      Having code for the pkey_mprotect, pkey_alloc and pkey_free system
      calls only makes sense if ARCH_HAS_PKEYS is selected.  If it is not
      selected, these system calls will always return -ENOSPC or -EINVAL.

      To simplify things and generate less code, build the pkey system call
      code only if ARCH_HAS_PKEYS is selected.
      
      For architectures which have already wired up the system calls, but do
      not select ARCH_HAS_PKEYS this will result in less generated code and a
      different return code: the three system calls will now always return
      -ENOSYS, using the cond_syscall mechanism.
      
      For architectures which have not wired up the system calls, less
      unreachable code will be generated.
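
      The cond_syscall mechanism referred to above amounts to entries of
      this form in kernel/sys_ni.c (sketch):

        /* Fall back to sys_ni_syscall() (-ENOSYS) when not built in: */
        cond_syscall(sys_pkey_mprotect);
        cond_syscall(sys_pkey_alloc);
        cond_syscall(sys_pkey_free);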
      
      Link: http://lkml.kernel.org/r/20161114111251.70084-1-heiko.carstens@de.ibm.com
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dt: add documentation of "hotpluggable" memory property · c3352cbb
      Reza Arbab authored
      
      
      Summarize the "hotpluggable" property of dt memory nodes.
      
      Link: http://lkml.kernel.org/r/1479160961-25840-6-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • of/fdt: mark hotpluggable memory · 41a9ada3
      Reza Arbab authored
      
      
      When movable nodes are enabled, any node containing only hotpluggable
      memory is made movable at boot time.
      
      On x86, hotpluggable memory is discovered by parsing the ACPI SRAT,
      making corresponding calls to memblock_mark_hotplug().
      
      If we introduce a dt property to describe memory as hotpluggable,
      configs supporting early fdt may then also do this marking and use
      movable nodes.
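
      A sketch of what the fdt-side marking can look like (illustrative
      placement inside early memory scanning; error handling elided):

        /* While scanning a /memory node in the flattened device tree: */
        if (of_get_flat_dt_prop(node, "hotpluggable", NULL))
                memblock_mark_hotplug(base, size);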
      
      Link: http://lkml.kernel.org/r/1479160961-25840-5-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: enable CONFIG_MOVABLE_NODE on non-x86 arches · 114cf3cc
      Reza Arbab authored
      
      
      To support movable memory nodes (CONFIG_MOVABLE_NODE), at least one of
      the following must be true:
      
      1. This config has the capability to identify movable nodes at boot.
         Right now, only x86 can do this.
      
      2. Our config supports memory hotplug, which means that a movable node
         can be created by hotplugging all of its memory into ZONE_MOVABLE.
      
      Fix the Kconfig definition of CONFIG_MOVABLE_NODE, which currently
      recognizes (1), but not (2).
      
      Link: http://lkml.kernel.org/r/1479160961-25840-4-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove x86-only restriction of movable_node · 39fa104d
      Reza Arbab authored
      In commit c5320926 ("mem-hotplug: introduce movable_node boot
      option"), the memblock allocation direction is changed to bottom-up and
      then back to top-down like this:
      
      1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
      2. memblock_set_bottom_up(false), called by x86's numa_init().
      
      Even though (1) occurs in generic mm code, it is wrapped by #ifdef
      CONFIG_MOVABLE_NODE, which depends on X86_64.
      
      This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
      things will be unbalanced.  (1) will happen for them, but (2) will not.
      
      This toggle was added in the first place because x86 has a delay between
      adding memblocks and marking them as hotpluggable.  Since other arches
      do this marking either immediately or not at all, they do not require
      the bottom-up toggle.
      
      So, resolve things by moving (1) from cmdline_parse_movable_node() to
      x86's setup_arch(), immediately after the movable_node parameter has
      been parsed.
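
      I.e., something of this shape in x86's setup_arch() (sketch):

        #ifdef CONFIG_MEMORY_HOTPLUG
        /* Right after the movable_node parameter is parsed: keep early
         * allocations out of (potentially) hotpluggable memory. */
        if (movable_node_is_enabled())
                memblock_set_bottom_up(true);
        #endif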
      
      Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • powerpc/mm: allow memory hotplug into a memoryless node · 4a3bac4e
      Reza Arbab authored
      Patch series "enable movable nodes on non-x86 configs", v7.
      
      This patchset allows more configs to make use of movable nodes.  When
      CONFIG_MOVABLE_NODE is selected, there are two ways to introduce such
      nodes into the system:
      
      1. Discover movable nodes at boot. Currently this is only possible on
         x86, but we will enable configs supporting fdt to do the same.
      
      2. Hotplug and online all of a node's memory using online_movable. This
         is already possible on any config supporting memory hotplug, not
         just x86, but the Kconfig doesn't say so. We will fix that.
      
      We'll also remove some cruft on power which would prevent (2).
      
      This patch (of 5):
      
      Remove the check which prevents us from hotplugging into an empty node.
      
      The original commit b226e462 ("[PATCH] powerpc: don't add memory to
      empty node/zone") states that this was intended to be a temporary
      measure.  It is a workaround for an oops which no longer occurs.
      
      Link: http://lkml.kernel.org/r/1479160961-25840-2-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: forbid static or relative flags for local NUMA mode · 8d303e44
      Piotr Kwapulinski authored
      
      
      The MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES flags are irrelevant
      when setting them for MPOL_LOCAL NUMA memory policy via set_mempolicy or
      mbind.
      
      Return "invalid argument" (-EINVAL) from set_mempolicy and mbind
      whenever any of these flags is passed along with MPOL_LOCAL.

      This is consistent with MPOL_PREFERRED passed with an empty nodemask.

      It slightly shortens the execution time in paths where these flags are
      used, e.g. when rebinding the NUMA nodes for changes in cgroup cpuset
      mems (mpol_rebind_preferred()) or when just printing the mempolicy
      structure (/proc/PID/numa_maps).  Isolated tests were done.
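
      The validation amounts to something of this shape in the mode/flags
      sanity checking (sketch):

        if (mode == MPOL_LOCAL &&
            (flags & (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)))
                return ERR_PTR(-EINVAL);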
      
      Link: http://lkml.kernel.org/r/20161027163037.4089-1-kwapulinski.piotr@gmail.com
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Liang Chen <liangchen.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix up get_user_pages* comments · 80a79516
      Lorenzo Stoakes authored
      In the previous round of get_user_pages* changes, comments attached to
      __get_user_pages_unlocked() and get_user_pages_unlocked() were rendered
      incorrect; this patch corrects them.

      In addition, the get_user_pages_unlocked() comment seems to have
      already been outdated, as it referred to tsk, mm parameters which were
      removed in c12d2da5 ("mm/gup: Remove the macro overload API migration
      helpers from the get_user*() APIs"); this patch fixes that as well.
      
      Link: http://lkml.kernel.org/r/20161025233435.5338-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove the page size change check in tlb_remove_page · 692a68c1
      Aneesh Kumar K.V authored
      Now that we check for page size change early in the loop, we can
      partially revert e9d55e15 ("mm: change the interface for
      __tlb_remove_page").

      This simplifies the code considerably by removing the need to track the
      last address with which we adjusted the range.  We also go back to the
      older way of filling the mmu_gather array, i.e., we add an entry and
      then check whether the gather batch is full.
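
      The older fill pattern being restored, roughly (sketch):

        /* Add first, then check whether the batch just became full: */
        batch->pages[batch->nr++] = page;
        if (batch->nr == batch->max) {
                if (!tlb_next_batch(tlb))
                        return true;    /* full: caller must flush */
        }
        return false;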
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-6-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add tlb_remove_check_page_size_change to track page size change · 07e32661
      Aneesh Kumar K.V authored
      With commit e77b0852 ("mm/mmu_gather: track page size with mmu gather
      and force flush if page size change") we added the ability to force a
      TLB flush when the page size changes in a mmu_gather loop.  We did that
      by checking for a page size change every time we added a page to
      mmu_gather for lazy flush/remove.  We can improve on that by moving the
      page size change check early and not doing it every time we add a page.
      
      This also helps us to do a TLB flush when invalidating a range
      covering a DAX mapping.  W.r.t. DAX mappings we don't have a backing
      struct page, and hence we don't call tlb_remove_page, which earlier
      forced the TLB flush on page size change.  Moving the page size change
      check earlier means we will do the same even for DAX mappings.
      
      We also avoid doing this check on architectures other than powerpc.

      In a later patch we will remove the page size check from
      tlb_remove_page().
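
      One plausible shape for the new hook (a sketch only; per the above,
      the generic version can be a no-op and only powerpc needs the check):

        static inline void
        tlb_remove_check_page_size_change(struct mmu_gather *tlb,
                                          unsigned int page_size)
        {
                /* Force a flush if the page size changed mid-gather: */
                if (tlb->page_size && tlb->page_size != page_size)
                        tlb_flush_mmu(tlb);
                tlb->page_size = page_size;
        }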
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-5-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: add tlb_remove_hugetlb_entry for handling hugetlb pages · b528e4b6
      Aneesh Kumar K.V authored
      
      
      This adds tlb_remove_hugetlb_entry, similar to tlb_remove_pmd_tlb_entry.
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-4-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update mmu_gather range correctly · b5bc66b7
      Aneesh Kumar K.V authored
      
      
      We use __tlb_adjust_range to update the range covered by the
      mmu_gather struct.  We later use the 'start' and 'end' to do a
      mmu_notifier_invalidate_range in tlb_flush_mmu_tlbonly().  Update the
      'end' correctly in __tlb_adjust_range so that we call
      mmu_notifier_invalidate_range with the correct range values.

      W.r.t. the TLB flush, this should not have any impact, because a flush
      with the correct start address will flush the TLB mappings for the
      range.

      Also add a comment w.r.t. updating the range when we free page table
      pages.  For now we don't support a range-based page table cache flush.
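
      The corrected helper, sketched:

        static inline void __tlb_adjust_range(struct mmu_gather *tlb,
                                              unsigned long address,
                                              unsigned int range_size)
        {
                tlb->start = min(tlb->start, address);
                /* Previously the end was not grown by the entry size: */
                tlb->end = max(tlb->end, address + range_size);
        }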
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-3-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use the correct page size when removing the page · c0f2e176
      Aneesh Kumar K.V authored
      
      
      We are removing a pmd hugepage here.  Use the correct page size.
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-2-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: avoid maybe-uninitialized warning · 23f919d4
      Arnd Bergmann authored
      
      
      After enabling -Wmaybe-uninitialized warnings, we get a false-positive
      warning for shmem:
      
        mm/shmem.c: In function `shmem_getpage_gfp':
        include/linux/spinlock.h:332:21: error: `info' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This can be easily avoided, since the correct 'info' pointer is known at
      the time we first enter the function, so we can simply move the
      initialization up.  Moving it before the first label avoids the warning
      and lets us remove two later initializations.
      
      Note that the function is so hard to read that it not only confuses
      the compiler, but also most readers; without this patch it could easily
      break if one of the 'goto's changed.
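
      The shape of the fix (sketch):

        static int shmem_getpage_gfp(struct inode *inode, /* ... */)
        {
                /* Initialize once, before the first label, instead of
                 * (re)assigning on paths the compiler cannot follow: */
                struct shmem_inode_info *info = SHMEM_I(inode);
                /* ... */
        repeat:
                /* 'info' is already valid here on every pass. */
                /* ... */
        }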
      
      Link: https://www.spinics.net/lists/kernel/msg2368133.html
      Link: http://lkml.kernel.org/r/20161024205725.786455-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: fix NR_ISOLATED_* stats for pfn based migration · 6afcf8ef
      Ming Ling authored
      Since commit bda807d4 ("mm: migrate: support non-lru movable page
      migration"), isolate_migratepages_block() can isolate !PageLRU pages,
      which acct_isolated() would then account as NR_ISOLATED_*.  Accounting
      these non-lru pages as NR_ISOLATED_{ANON,FILE} doesn't make any sense,
      and it can misguide heuristics based on those counters, such as
      pgdat_reclaimable_pages resp. too_many_isolated, which would lead to
      unexpected stalls during direct reclaim without any good reason.  Note
      that __alloc_contig_migrate_range can isolate a lot of pages at once.

      On mobile devices such as a 512MB RAM Android phone, a big zram swap
      may be in use.  In some cases zram (zsmalloc) uses too many non-lru but
      migratable pages, such as:
      
            MemTotal: 468148 kB
            Normal free:5620kB
            Free swap:4736kB
            Total swap:409596kB
            ZRAM: 164616kB(zsmalloc non-lru pages)
            active_anon:60700kB
            inactive_anon:60744kB
            active_file:34420kB
            inactive_file:37532kB
      
      Fix this by only accounting lru pages to NR_ISOLATED_* in
      isolate_migratepages_block right after they were isolated and we still
      know they were on LRU.  Drop acct_isolated because it is called after
      the fact and we've lost that information.  Batching per-cpu counter
      doesn't make much improvement anyway.  Also make sure that we uncharge
      only LRU pages when putting them back on the LRU in
      putback_movable_pages resp.  when unmap_and_move migrates the page.
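
      Sketched, the accounting rule becomes (illustrative):

        /* Inside isolate_migratepages_block(), right after isolation,
         * while we still know whether the page came off the LRU: */
        if (page_was_lru)       /* hypothetical flag */
                inc_node_page_state(page, NR_ISOLATED_ANON +
                                          page_is_file_cache(page));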
      
      [mhocko@suse.com: replace acct_isolated() with direct counting]
      Fixes: bda807d4 ("mm: migrate: support non-lru movable page migration")
      Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
      Signed-off-by: Ming Ling <ming.ling@spreadtrum.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist · 6d840958
      Michal Hocko authored
      
      
      __GFP_THISNODE is documented to enforce the allocation to be satisfied
      from the requested node with no fallbacks or placement policy
      enforcements.  policy_zonelist seemingly breaks this semantic if the
      current policy is MPOL_BIND: instead of taking the node it will fall
      back to the first node in the mask if the requested one is not in the
      mask.  This is confusing, to say the least, because in fact we
      shouldn't ever go down that path.  First, tasks shouldn't be scheduled
      on CPUs with nodes outside of their mempolicy binding.  Secondly,
      policy_zonelist is called only from 3 places:

       - huge_zonelist - never should do __GFP_THISNODE when going this path

       - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either

       - alloc_pages_current - which uses default_policy if __GFP_THISNODE
         is used
      
      So we shouldn't even need to care about this possibility and can drop
      the confusing code.  Let's keep a WARN_ON_ONCE in place to catch
      potential users and fix them up properly (aka use a different allocation
      function which ignores mempolicy).
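
      I.e., roughly (sketch of policy_zonelist()'s MPOL_BIND branch):

        case MPOL_BIND:
                /*
                 * __GFP_THISNODE shouldn't even reach here; keep a
                 * tripwire instead of silently redirecting the node:
                 */
                WARN_ON_ONCE(gfp & __GFP_THISNODE);
                break;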
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: avoid unlikely branches for split_huge_pmd · fd60775a
      David Rientjes authored
      
      
      While doing MADV_DONTNEED on a large area of thp memory, I noticed we
      encountered many unlikely() branches in profiles for each backing
      hugepage.  This is because zap_pmd_range() would call split_huge_pmd(),
      which rechecked the conditions that were already validated, but as part
      of an unlikely() branch.
      
      Avoid the unlikely() branch when in a context where pmd is known to be
      good for __split_huge_pmd() directly.
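
      I.e., in a caller like zap_pmd_range(), where the pmd was just
      validated (sketch; signature of that era):

        /* pmd known good here: skip split_huge_pmd()'s re-checks. */
        __split_huge_pmd(vma, pmd, addr, false, NULL);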
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: simplify /proc/vmallocinfo implementation · 3f500069
      zijun_hu authored
      
      
      Many seq_file helpers exist for simplifying the implementation of
      virtual files, especially /proc nodes.  However, the helpers for
      iteration over a list_head, while available, aren't currently used to
      implement /proc/vmallocinfo.
      
      Simplify /proc/vmallocinfo implementation by using existing seq_file
      helpers.
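
      Sketched with the list_head seq_file helpers (locking elided;
      vmap_area_list is the existing list of vmap areas):

        static void *s_start(struct seq_file *m, loff_t *pos)
        {
                return seq_list_start(&vmap_area_list, *pos);
        }

        static void *s_next(struct seq_file *m, void *p, loff_t *pos)
        {
                return seq_list_next(p, &vmap_area_list, pos);
        }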
      
      Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make unreserve highatomic functions reliable · 29fac03b
      Minchan Kim authored
      
      
      Currently, unreserve_highatomic_pageblock bails out as soon as it
      finds a highatomic pageblock, regardless of whether it actually moved
      any free pages off of it.  That can defeat the goal of the unreserve
      logic, which is to save a process from OOM.

      This patch makes the unreserve functions bail out only if they moved
      some pages out of the !highatomic free list, to avoid such false
      positives.

      Another potential problem is that, by a race between page freeing and
      the reserve highatomic function, pages could be in the highatomic free
      list even though the pageblock is of a !highatomic migratetype.  In
      that case, unreserve_highatomic_pageblock can be ineffective if the
      count of the highatomic reserve is less than pageblock_nr_pages.  We
      could solve it simply by draining all of the reserved pages before the
      OOM.  That acts as a safeguard, exhausting reserved pages before
      converging to OOM.
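
      Sketch of the tightened contract (illustrative):

        set_pageblock_migratetype(page, ac->migratetype);
        ret = move_freepages_block(zone, page, ac->migratetype);
        if (ret) {
                spin_unlock_irqrestore(&zone->lock, flags);
                return true;    /* pages were really moved */
        }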
      
      Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: try to exhaust highatomic reserve before the OOM · 04c8716f
      Minchan Kim authored
      
      
      I got an OOM report from the production team with a v4.4 kernel.  It
      had enough free memory but failed to allocate a GFP_KERNEL order-0 page
      and finally encountered an OOM kill.  It occurred during a QA process
      which launches several apps, switches between them and so on.  It
      happened rarely.  IOW, in a normal situation it was not a problem, but
      if we are unlucky and several apps use peak memory at the same time, it
      can happen.  If we manage to pass that phase, the system keeps working
      well.

      I could reproduce it easily with my test (a memory spike).  Look at the
      below.

      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded due to the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      It's weird that the zone shows enough free memory above the min
      watermark but OOMs on a 4K GFP_KERNEL allocation due to reserved
      highatomic pages.  As a last resort, try to unreserve highatomic pages
      again, and if that moved pages to the non-highatomic free list, retry
      reclaim once more.
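
      I.e., in the allocator's retry logic, something like this (sketch; the
      'force' flag is illustrative of draining the reserves completely):

        /* Last resort before OOM: drain the highatomic reserves and
         * retry reclaim once more if that freed anything up: */
        if (unreserve_highatomic_pageblock(ac, true))
                return true;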
      
      Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: prevent double decrease of nr_reserved_highatomic · 4855e4a7
      Minchan Kim authored
      
      
      There is a race between page freeing and unreserving a highatomic
      pageblock:
      
       CPU 0				    CPU 1
      
          free_hot_cold_page
            mt = get_pfnblock_migratetype
            set_pcppage_migratetype(page, mt)
          				    unreserve_highatomic_pageblock
          				    spin_lock_irqsave(&zone->lock)
          				    move_freepages_block
          				    set_pageblock_migratetype(page)
          				    spin_unlock_irqrestore(&zone->lock)
            free_pcppages_bulk
              __free_one_page(mt) <- mt is stale
      
      Through the above race, a page on CPU 0 could go onto a non-highatomic
      free list, since the pageblock's type was changed.  Because of that,
      the unreserve logic for highatomic pageblocks can decrease the reserved
      count on the same pageblock several times, creating a mismatch between
      nr_reserved_highatomic and the number of reserved pageblocks.

      So, this patch verifies whether the pageblock is highatomic or not, and
      decreases the count only if the pageblock is highatomic.
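
      Sketched (illustrative):

        /* Only credit back the reserve if the pageblock still is
         * highatomic; the freeing race may have changed it already: */
        if (get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC) {
                zone->nr_reserved_highatomic -= pageblock_nr_pages;
                set_pageblock_migratetype(page, ac->migratetype);
        }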
      
      Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4855e4a7
    • Minchan Kim's avatar
      mm: don't steal highatomic pageblock · 88ed365e
      Minchan Kim authored
      
      
      Patch series "use up highorder free pages before OOM", v3.
      
      I got an OOM report from the production team on a v4.4 kernel.  The
      system had enough free memory but failed to allocate a GFP_KERNEL
      order-0 page and finally hit the OOM killer.  It occurred during a
      QA process that launches several apps, switches between them, and so
      on, and it happened rarely.  IOW, in the normal situation it was not
      a problem, but if we are unlucky and several apps hit their peak
      memory use at the same time, it can happen.  If the system makes it
      past that phase, it keeps working fine.
      
      I could reproduce it easily with my test (a memory spike).  Look at
      the report below.
      
      The reason is that the free pages (19M) of the DMA32 zone are
      reserved for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded due to the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      During the investigation, I found some problems with highatomic
      reserves, so this patch series aims to solve them; the final goal is
      to unreserve every highatomic free page before the OOM kill.
      
      This patch (of 4):
      
      In the page freeing path, the migratetype is racy, so a highatomic
      page can be freed onto a non-highatomic free list.  If that page is
      then allocated, the VM can change the pageblock from highatomic to
      some other type.  In that case, the highatomic pageblock accounting
      is broken and stops working (e.g., the VM can no longer reserve
      highatomic pageblocks even though it has not reached the 1% limit).
      
      So, this patch prohibits changing a pageblock from highatomic to any
      other type.  That is safe because MIGRATE_HIGHATOMIC is not listed
      in the fallback array, so stealing can only happen through
      unexpected races, which are really rare.  Also, such a prohibition
      keeps highatomic pageblocks around longer, which is better for
      highatomic page allocation.
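
      A minimal sketch of the prohibition in a fallback/stealing path; the
      function shape and variable names are assumptions, since the real
      patch touches several call sites in mm/page_alloc.c:

        int mt = get_pageblock_migratetype(page);

        /*
         * Never convert a highatomic pageblock to another migratetype.
         * MIGRATE_HIGHATOMIC is not in the fallback array, so any steal
         * attempt that lands here is the result of a rare race, and
         * giving in would break nr_reserved_highatomic accounting.
         */
        if (mt != MIGRATE_HIGHATOMIC &&
            (can_steal || page_group_by_mobility_disabled))
                set_pageblock_migratetype(page, start_migratetype);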
      
      Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      88ed365e
    • Andreas Platschek's avatar
      kmemleak: fix reference to Documentation · 22901c6c
      Andreas Platschek authored
      
      
      Documentation/kmemleak.txt was moved to
      Documentation/dev-tools/kmemleak.rst; this fixes the reference to
      point at the new location.
      
      Link: http://lkml.kernel.org/r/1476544946-18804-1-git-send-email-andreas.platschek@opentech.at
      Signed-off-by: default avatarAndreas Platschek <andreas.platschek@opentech.at>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      22901c6c
    • Aneesh Kumar K.V's avatar
      mm/hugetlb.c: use huge_pte_lock instead of opencoding the lock · 8bea8052
      Aneesh Kumar K.V authored
      
      
      No functional change by this patch.
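
      For illustration, the kind of conversion the title describes; the
      exact call sites are assumed rather than taken from the diff:

        /* Before: opencoded lookup + lock */
        ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
        spin_lock(ptl);

        /* After: the helper does both steps */
        ptl = huge_pte_lock(hstate_vma(vma), mm, ptep);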
      
      Link: http://lkml.kernel.org/r/20161018090234.22574-1-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8bea8052
    • Aneesh Kumar K.V's avatar
      mm/hugetlb.c: use the right pte val for compare in hugetlb_cow · 3999f52e
      Aneesh Kumar K.V authored
      We cannot use the pte value passed to set_pte_at for the pte_same
      comparison, because archs like ppc64 filter out or add pte flags in
      set_pte_at.  Instead, fetch the pte value inside hugetlb_cow.  We
      compare pte values to make sure the pte didn't change after we
      dropped the page table lock.  hugetlb_cow gets called with the page
      table lock held, so we can take a copy of the pte value before we
      drop the lock.

      With hugetlbfs, we optimize the MAP_PRIVATE write fault path when
      there is no previous mapping (huge_pte_none entries) by forcing a
      cow in the fault path.  This avoids taking an additional fault to
      convert a read-only mapping to read/write.  Here we were comparing a
      recently instantiated pte (via set_pte_at) to the pte values from
      the Linux page table.  As explained above, on ppc64 such a pte_same
      check returned the wrong result, causing us to take an additional
      fault on ppc64.
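
      A hedged sketch of the pattern the fix relies on, heavily simplified
      from hugetlb_cow() (allocation, reservation, and error handling are
      elided; the huge_pte_lock usage here is an assumption):

        spinlock_t *ptl = huge_pte_lock(h, mm, ptep);

        /*
         * Snapshot the pte from the page table itself, not the value
         * handed to set_pte_at(): archs like ppc64 filter/add flags in
         * set_pte_at(), so the stored pte differs from the constructed
         * one and pte_same() on the latter would spuriously fail.
         */
        pte_t pte = huge_ptep_get(ptep);

        spin_unlock(ptl);
        /* ... allocate and copy the new page without the lock held ... */
        spin_lock(ptl);

        /* Install the new page only if the pte is still unchanged. */
        if (pte_same(huge_ptep_get(ptep), pte)) {
                /* ... set_huge_pte_at() with the new page ... */
        }
        spin_unlock(ptl);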
      
      Fixes: 6a119eae
      
       ("powerpc/mm: Add a _PAGE_PTE bit")
      Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reported-by: default avatarJan Stancek <jstancek@redhat.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Scott Wood <scottwood@freescale.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3999f52e