  1. Apr 12, 2018
    • mm, numa: rework do_pages_move · a49bd4d7
      Michal Hocko authored
      
      
      Patch series "unclutter thp migration"
      
      Motivation:
      
      THP migration is hacked into the generic migration path with rather
      surprising semantics.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and, if that is not the
      case, it allocates a simple page to migrate instead.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  The remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [2].
      
      I also believe that splitting the huge page onto the LRU lists is
      inherently wrong because the tail pages are not migrated.  Some callers
      will just work around that by retrying (e.g. memory hotplug).  Other pfn
      walkers are simply broken, though.  E.g. madvise_inject_error will
      migrate the head page and then advance the next pfn by the huge page
      size.  do_move_page_to_node_array and queue_pages_range (migrate_pages,
      mbind) will simply split the THP before migration if THP migration is
      not supported and fall back to single page migration, but they do not
      handle the tail pages if the THP migration path cannot allocate a fresh
      THP, so we end up with ENOMEM and fail the whole migration, which is
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      The first patch reworks do_pages_move, which relies on a very ugly
      calling convention where the return status is pushed to the migration
      path via a private pointer.  It uses preallocated fixed size batching to
      achieve that.  We simply cannot do the same if a THP is to be split
      during the migration path, which is done in patch 3.  Patch 2 is a
      follow-up cleanup which removes the mentioned return status calling
      convention ugliness.
      
      On a side note:
      
      There are some semantic issues I have encountered along the way while
      working on patch 1, but I am not addressing them here.  E.g. trying to
      move THP tail pages will result in either success or EBUSY (the latter
      more likely once we isolate the head from the LRU list).  Hugetlb
      reports EACCES on tail pages.  Some errors are reported via the status
      parameter but migration failures are not, even though the original
      `reason' argument suggests there was an intention to do so.  From a
      quick look into git history this never worked.  I have tried to keep the
      semantics unchanged.
      
      Then there is a relatively minor thing that the page isolation might
      fail because of pages not being on the LRU - e.g. because they are
      sitting on the per-cpu LRU caches.  Easily fixable.
      
      This patch (of 3):
      
      do_pages_move is supposed to move user defined memory (an array of
      addresses) to user defined numa nodes (an array of nodes, one for
      each address).  The user provided status array then contains the
      resulting numa node for each address, or an error.  The semantics of
      this function are a little bit confusing because only some errors are
      reported back.  Notably, a migrate_pages error is only reported via the
      return value.  This patch doesn't try to address these semantic nuances
      but rather changes the underlying implementation.
      
      Currently we are processing user input (which can be really large) in
      batches which are stored to a temporarily allocated page.  Each address
      is resolved to its struct page and stored to page_to_node structure
      along with the requested target numa node.  The array of these
      structures is then conveyed down the page migration path via private
      argument.  new_page_node then finds the corresponding structure and
      allocates the proper target page.
      
      What is the problem with the current implementation and why change
      it?  Apart from being quite ugly, it also doesn't cope with unexpected
      pages showing up on the migration list inside the migrate_pages path.
      That doesn't happen currently, but the follow-up patch would like to
      make the thp migration code clearer, and that would need to split a THP
      onto the list in some cases.
      
      How does the new implementation work? Well, instead of batching into a
      fixed size array we simply batch all pages that should be migrated to
      the same node and isolate all of them into a linked list which doesn't
      require any additional storage.  This should work reasonably well
      because page migration usually migrates larger ranges of memory to a
      specific node.  So the common case should work equally well as the
      current implementation.  Even if somebody constructs an input where the
      target numa nodes would be interleaved we shouldn't see a large
      performance impact because page migration alone doesn't really benefit
      from batching.  mmap_sem batching for the lookup is quite questionable,
      and isolate_lru_page, which would benefit from batching, is not using it
      even in the current implementation.
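
      As a rough standalone illustration of the batching scheme described
      above (not the kernel code; all names are invented for the example),
      pages destined for the same node are accumulated and flushed as a
      single migration whenever the target node changes:

        #include <stdio.h>

        struct batch { int target_node; unsigned long count; };

        /* Stand-in for isolating the batch onto a list and migrating it. */
        static void flush_batch(struct batch *b)
        {
                if (b->count)
                        printf("migrate %lu pages to node %d\n",
                               b->count, b->target_node);
                b->count = 0;
        }

        int main(void)
        {
                /* user-supplied target node for each address (addresses elided) */
                int nodes[] = { 0, 0, 0, 1, 1, 0 };
                struct batch b = { .target_node = -1 };

                for (unsigned long i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
                        if (nodes[i] != b.target_node) {
                                flush_batch(&b);        /* node changed: migrate what we have */
                                b.target_node = nodes[i];
                        }
                        b.count++;                      /* "isolate" this page into the batch */
                }
                flush_batch(&b);                        /* migrate the trailing batch */
                return 0;
        }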
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/swapfile.c: make pointer swap_avail_heads static · bfc6b1ca
      Colin Ian King authored
      
      
      The pointer swap_avail_heads is local to the source and does not need to
      be in global scope, so make it static.
      
      Cleans up sparse warning:
      
        mm/swapfile.c:88:19: warning: symbol 'swap_avail_heads' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20180206215836.12366-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • memcg: fix per_node_info cleanup · 4eaf431f
      Michal Hocko authored
      
      
      syzbot has triggered a NULL ptr dereference when allocation fault
      injection enforces a failure and alloc_mem_cgroup_per_node_info
      initializes memcg->nodeinfo only halfway through.
      
      But __mem_cgroup_free still tries to free all per-node data and
      dereferences pn->lruvec_stat_cpu unconditionally even if the specific
      per-node data hasn't been initialized.
      
      The bug is quite unlikely to hit because small allocations do not fail
      and we would need quite a few numa nodes to make struct
      mem_cgroup_per_node large enough to exceed the costly order.
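
      A standalone sketch of the shape of the fix (not the kernel code; the
      structure and function names are invented for the example): the
      per-node free path simply has to tolerate slots that were never
      initialized.

        #include <stdlib.h>

        #define MAX_NODES 4

        struct per_node { void *lruvec_stat; };

        static struct per_node *nodeinfo[MAX_NODES];

        /* Free path: a fault-injected allocation may have left later slots
         * NULL, so check the per-node pointer before dereferencing it. */
        static void free_all_nodes(void)
        {
                for (int n = 0; n < MAX_NODES; n++) {
                        struct per_node *pn = nodeinfo[n];

                        if (!pn)        /* half-initialized: nothing to free */
                                continue;
                        free(pn->lruvec_stat);
                        free(pn);
                }
        }

        int main(void)
        {
                /* simulate the allocation failing after the first node */
                nodeinfo[0] = calloc(1, sizeof(*nodeinfo[0]));
                free_all_nodes();
                return 0;
        }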
      
      Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
      Reported-by: <syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com>
      Fixes: 00f3ca2c ("mm: memcontrol: per-lruvec stats infrastructure")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • swap: divide-by-zero when zero length swap file on ssd · a06ad633
      Tom Abraham authored
      
      
      Calling swapon() on a zero length swap file on SSD can lead to a
      divide-by-zero.
      
      Although creating such files isn't possible with mkswap and they would be
      considered invalid, it would be better for the swapon code to be more
      robust and handle this condition gracefully (return -EINVAL), especially
      since the fix is small and straightforward.
      
      To help with wear leveling on SSD, the swapon syscall calculates a
      random position in the swap file using modulo p->highest_bit, which is
      set to maxpages - 1 in read_swap_header.
      
      If the swap file is zero length, read_swap_header sets maxpages=1 and
      last_page=0, resulting in p->highest_bit=0, and we divide by zero when we
      take the modulo of p->highest_bit in the swapon syscall.
      
      This can be prevented by having read_swap_header return zero if
      last_page is zero.
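
      A standalone sketch of the arithmetic (function and variable names are
      invented for the example, not the kernel code): highest_bit ends up as
      maxpages - 1, so a zero last_page must be rejected before the modulo.

        #include <stdio.h>

        /* Returns maxpages, or 0 for an invalid (zero length) swap header. */
        static unsigned long parse_swap_header(unsigned long last_page)
        {
                if (!last_page)         /* zero length swap file: reject */
                        return 0;
                return last_page + 1;   /* maxpages; highest_bit = maxpages - 1 */
        }

        int main(void)
        {
                unsigned long maxpages = parse_swap_header(0);

                if (!maxpages) {
                        fprintf(stderr, "swapon: invalid header (-EINVAL)\n");
                        return 1;
                }
                /* wear-levelling offset; would divide by zero without the check */
                printf("offset = %lu\n", 123456UL % (maxpages - 1));
                return 0;
        }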
      
      Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
      Signed-off-by: Thomas Abraham <tabraham@suse.com>
      Reported-by: <Mark.Landis@Teradata.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm: memcg: make sure memory.events is uptodate when waking pollers · e27be240
      Johannes Weiner authored
      Commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") added per-cpu drift to all memory cgroup stats
      and events shown in memory.stat and memory.events.
      
      For memory.stat this is acceptable.  But memory.events issues file
      notifications, and somebody polling the file for changes will be
      confused when the counters in it are unchanged after a wakeup.
      
      Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
      MEMCG_OOM - are sufficiently rare and high-level that we don't need
      per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
      frequent, but they're counting invocations of reclaim, which is a
      complex operation that touches many shared cachelines.
      
      This splits memory.events from the generic VM events and tracks them in
      their own, unbuffered atomic counters.  That's also cleaner, as it
      eliminates the ugly enum nesting of VM and cgroup events.
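
      A minimal user-space sketch of the unbuffered counters (stdatomic and
      the enum below stand in for the kernel's atomic counters and event
      names; the notification is just a printf here):

        #include <stdatomic.h>
        #include <stdio.h>

        enum memcg_event { EV_LOW, EV_HIGH, EV_MAX, EV_OOM, EV_NR };

        /* One plain atomic counter per event: no per-cpu drift, so a reader
         * woken by a notification always sees the updated value. */
        static atomic_ulong memory_events[EV_NR];

        static void memory_event(enum memcg_event ev)
        {
                unsigned long v = atomic_fetch_add(&memory_events[ev], 1) + 1;

                /* the kernel would notify pollers of memory.events here */
                printf("event %d is now %lu\n", ev, v);
        }

        int main(void)
        {
                memory_event(EV_HIGH);
                memory_event(EV_MAX);
                return 0;
        }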
      
      [hannes@cmpxchg.org: "array subscript is above array bounds"]
        Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/ksm.c: fix inconsistent accounting of zero pages · a38c015f
      Claudio Imbrenda authored
      When using KSM with use_zero_pages, we replace anonymous pages
      containing only zeroes with actual zero pages, which are not anonymous.
      We need to do proper accounting of the mm counters, otherwise we will
      get wrong values in /proc and a BUG message in dmesg when tearing down
      the mm.
      
      Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
      Fixes: e86c59b1 ("mm/ksm: improve deduplication of zero pages with colouring")
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/z3fold.c: use gfpflags_allow_blocking · 8a97ea54
      Matthew Wilcox authored
      
      
      We have a perfectly good macro to determine whether the gfp flags allow
      you to sleep or not; use it instead of trying to infer it.
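
      For reference, a standalone rendering of the helper's idea (the flag
      values below are placeholders, not the real <linux/gfp.h> masks):

        #include <stdbool.h>
        #include <stdio.h>

        #define __GFP_DIRECT_RECLAIM    0x400u                  /* placeholder value */
        #define GFP_KERNEL              __GFP_DIRECT_RECLAIM    /* may sleep */
        #define GFP_ATOMIC              0x0u                    /* must not sleep */

        /* Blocking is allowed exactly when direct reclaim is allowed. */
        static bool gfpflags_allow_blocking(unsigned int gfp_flags)
        {
                return gfp_flags & __GFP_DIRECT_RECLAIM;
        }

        int main(void)
        {
                printf("GFP_KERNEL may block: %d\n", gfpflags_allow_blocking(GFP_KERNEL));
                printf("GFP_ATOMIC may block: %d\n", gfpflags_allow_blocking(GFP_ATOMIC));
                return 0;
        }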
      
      Link: http://lkml.kernel.org/r/20180408062206.GC16007@bombadil.infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • z3fold: fix memory leak · 1ec6995d
      Xidong Wang authored
      
      
      In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
      released on the error path taken when pool->compact_wq, which holds the
      return value of create_singlethread_workqueue(), is NULL.  This will
      result in a memory leak bug.
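
      A standalone sketch of the unwind pattern the fix restores (malloc/free
      stand in for __alloc_percpu()/free_percpu() and the workqueue calls;
      names are invented for the example): every resource acquired before a
      failing step must be released on that step's error path.

        #include <stdio.h>
        #include <stdlib.h>

        struct pool { void *percpu; void *wq; };

        static struct pool *pool_create(int fail_wq)
        {
                struct pool *pool = calloc(1, sizeof(*pool));

                if (!pool)
                        return NULL;
                pool->percpu = malloc(64);
                if (!pool->percpu)
                        goto out_pool;
                pool->wq = fail_wq ? NULL : malloc(64);
                if (!pool->wq)
                        goto out_percpu;        /* the step the original code missed */
                return pool;

        out_percpu:
                free(pool->percpu);
        out_pool:
                free(pool);
                return NULL;
        }

        int main(void)
        {
                printf("create with wq failure -> %p\n", (void *)pool_create(1));
                return 0;
        }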
      
      [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
      Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
      Signed-off-by: Xidong Wang <wangxidong_97@163.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • memcg, thp: do not invoke oom killer on thp charges · 2a70f6a7
      Michal Hocko authored
      A THP memcg charge can trigger the oom killer since 25160354 ("mm,
      thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
      We previously used an explicit __GFP_NORETRY, which ruled out the OOM
      killer automagically.
      
      The memcg charge path should be semantically compliant with the
      allocation path, and that means that if we do not trigger the OOM killer
      for costly orders there, we should do the same in the memcg charge path
      as well.  Otherwise we are forcing callers to distinguish the two and use
      different gfp masks, which is both non-intuitive and bug prone.  As soon
      as we get a costly high order kmalloc user, we do not even have any means
      to pass a memcg specific gfp mask to prevent the OOM because the
      charging is deep within the guts of the slab allocator.
      
      The unexpected memcg OOM on THP has already been fixed upstream by
      9d3c3354 ("mm, thp: do not cause memcg oom for thp") but this is a
      one-off fix rather than a generic solution.  Teach mem_cgroup_oom to
      bail out on costly order requests to fix the THP issue as well as any
      other costly OOM eligible allocations to be added in future.
      
      Also revert 9d3c3354 because special gfp for THP is no longer
      needed.
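
      A tiny sketch of the resulting policy, assuming the page allocator's
      costly-order threshold (PAGE_ALLOC_COSTLY_ORDER, i.e. 3); the helper
      name is invented for the example:

        #include <stdbool.h>
        #include <stdio.h>

        #define PAGE_ALLOC_COSTLY_ORDER 3       /* same threshold the page allocator uses */

        /* The memcg charge path should only invoke the OOM killer for
         * non-costly orders, mirroring the page allocator's behaviour. */
        static bool memcg_may_oom(unsigned int order)
        {
                return order <= PAGE_ALLOC_COSTLY_ORDER;
        }

        int main(void)
        {
                printf("order-0 charge may OOM: %d\n", memcg_may_oom(0));
                printf("order-9 THP charge may OOM: %d\n", memcg_may_oom(9));
                return 0;
        }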
      
      Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/migrate: properly preserve write attribute in special migrate entry · 07707125
      Ralph Campbell authored
      
      
      Use of pte_write(pte) is only valid for a present pte; the common code
      which sets the migration entry can be reached for both a valid present
      pte and a special swap entry (for device memory).  Fix the code to use
      the mpfn value, which properly handles both cases.
      
      On x86 this did not have any bad side effect because the pte write bit is
      below PAGE_BIT_GLOBAL and thus special swap entries have it set to 0,
      which in turn means we were always creating read-only special migration
      entries.
      
      So once migration finished we always write-protected the CPU page
      table entry (moreover, this is only an issue when migrating from device
      memory to system memory).  The end effect is that a CPU write access
      would fault again and restore write permission.
      
      This behaviour isn't too bad; it just burns CPU cycles by forcing the CPU
      to take a second fault on write access, i.e. double faulting the same
      address.  There is no corruption or incorrect state (it behaves as a
      COWed page from a fork with a mapcount of 1).
      
      Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm: check __highest_present_section_nr directly in memory_dev_init() · bc8755ba
      Wei Yang authored
      
      
      __highest_present_section_nr is a more strict boundary than
      NR_MEM_SECTIONS.  So checking __highest_present_section_nr directly is
      enough.
      
      Link: http://lkml.kernel.org/r/20180330032044.21647-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages · 09a913a7
      Mel Gorman authored
      
      
      change_pte_range is called from task work context to mark PTEs for
      receiving NUMA faulting hints.  If the marked pages are dirty then
      migration may fail.  Some filesystems cannot migrate dirty pages without
      blocking, so they are skipped in MIGRATE_ASYNC mode, which just wastes
      CPU.  Even when they can, it can be a waste of cycles when the pages are
      shared, forcing higher scan rates.  This patch avoids marking shared
      dirty pages for hinting faults and will also skip a migration if the
      page was dirtied after the scanner updated a clean page.
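
      A user-space sketch of the scanner's new decision (the struct and field
      names are invented for the example, not the kernel code): dirty
      file-backed pages are simply not marked for hinting faults.

        #include <stdbool.h>
        #include <stdio.h>

        struct fake_page { bool file_backed; bool dirty; };

        /* Skip pages whose migration could block or fail in MIGRATE_ASYNC. */
        static bool skip_numa_hint(const struct fake_page *p)
        {
                return p->file_backed && p->dirty;
        }

        int main(void)
        {
                struct fake_page clean = { true, false }, dirty = { true, true };

                printf("clean file page skipped: %d\n", skip_numa_hint(&clean));
                printf("dirty file page skipped: %d\n", skip_numa_hint(&dirty));
                return 0;
        }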
      
      This is most noticeable running the NASA Parallel Benchmark when backed
      by btrfs, the default root filesystem for some distributions, but also
      noticeable when using XFS.
      
      The following are results from a 4-socket machine running a 4.16-rc4
      kernel with some scheduler patches that are pending for the next merge
      window.
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1
        Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
        Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
        Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
        Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
        Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)
      
      is.D regresses slightly in terms of absolute time but note that that
      particular load varies quite a bit from run to run.  The more relevant
      observation is the total system CPU usage.
      
                  4.16.0-rc4  4.16.0-rc4
                schedtip-20180309 nodirty-v1
        User        71471.91    70627.04
        System      11078.96     8256.13
        Elapsed       661.66      632.74
      
      That is a substantial drop in system CPU usage and overall the workload
      completes faster.  The NUMA balancing statistics are also interesting
      
        NUMA base PTE updates        111407972   139848884
        NUMA huge PMD updates           206506      264869
        NUMA page range updates      217139044   275461812
        NUMA hint faults               4300924     3719784
        NUMA hint local faults         3012539     3416618
        NUMA hint local percent             70          91
        NUMA pages migrated            1517487     1358420
      
      While more PTEs are scanned due to changes in what faults are gathered,
      it's clear that a far higher percentage of faults are local as the bulk
      of the remote hits were dirty pages that, in this case with btrfs, had
      no chance of migrating.
      
      The following is a comparison when using XFS as that is a more realistic
      filesystem choice for a data partition
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1r47
        Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
        Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
        Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
        Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
        Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)
      
      That is a reasonable gain on two relatively long-lived workloads.  While
      not presented, there is also a substantial drop in system CPU usage and
      the NUMA balancing stats show similar improvements in locality as btrfs
      did.
      
      Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • Documentation/vm/hmm.txt: typos and syntaxes fixes · e8eddfd2
      Jérôme Glisse authored
      
      
      This fixes typos and syntax errors; thanks to Randy Dunlap for pointing
      them out (they were all my fault).
      
      Link: http://lkml.kernel.org/r/20180409151859.4713-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: fix header file if/else/endif maze, again · 9d8a463a
      Arnd Bergmann authored
      
      
      The last fix was still wrong, as we need the inline dummy functions also
      for the case that CONFIG_HMM is enabled but CONFIG_HMM_MIRROR is not:
      
        kernel/fork.o: In function `__mmdrop':
        fork.c:(.text+0x14f6): undefined reference to `hmm_mm_destroy'
      
      This adds back the second copy of the dummy functions, hopefully
      this time in the right place.
      
      Link: http://lkml.kernel.org/r/20180404110236.804484-1-arnd@arndb.de
      Fixes: 8900d06a277a ("mm/hmm: fix header file if/else/endif maze")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm.c: remove superfluous RCU protection around radix tree lookup · 18be460e
      Tejun Heo authored
      
      
      hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
      actually uses the RCU protection.  The only caller is
      hmm_devmem_pages_create() which already grabs the mutex and does
      superfluous rcu_read_lock/unlock() around the function.
      
      This doesn't add anything and just adds to confusion.  Remove the RCU
      protection and open-code the radix tree lookup.  If this needs to become
      more sophisticated in the future, let's add them back when necessary.
      
      Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: use device driver encoding for HMM pfn · f88a1e90
      Jérôme Glisse authored
      
      
      Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
      a pfn shift value, allowing them to define their own encoding for the HMM
      pfns that are filled inside the pfns array of the hmm_range struct.  With
      this, a device driver can get pfns that match its own private encoding
      out of HMM without having to do any conversion.
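
      A standalone sketch of such a driver-defined encoding (the shift and
      flag bits are invented for the example; in practice they come from the
      driver's flags array and pfn shift):

        #include <inttypes.h>
        #include <stdio.h>

        #define DRV_PFN_SHIFT   6                       /* driver-chosen shift */
        #define DRV_PFN_VALID   (UINT64_C(1) << 0)
        #define DRV_PFN_WRITE   (UINT64_C(1) << 1)

        /* Pack a pfn plus driver flags into one 64-bit entry, the format the
         * driver wants to find in the pfns array without further conversion. */
        static uint64_t drv_encode(uint64_t pfn, int writable)
        {
                uint64_t v = (pfn << DRV_PFN_SHIFT) | DRV_PFN_VALID;

                return writable ? v | DRV_PFN_WRITE : v;
        }

        int main(void)
        {
                uint64_t e = drv_encode(0x1234, 1);

                printf("entry %#" PRIx64 " -> pfn %#" PRIx64 ", writable %d\n",
                       e, e >> DRV_PFN_SHIFT, !!(e & DRV_PFN_WRITE));
                return 0;
        }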
      
      [rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
        Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
      [rcampbell@nvidia.com: clarify fault logic for device private memory]
        Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: change hmm_vma_fault() to allow write fault on page basis · 2aee09d8
      Jérôme Glisse authored
      
      
      This changes hmm_vma_fault() to not take a global write fault flag for a
      range, but instead rely on the caller to populate the HMM pfns array with
      the proper fault flag, i.e. HMM_PFN_VALID if the driver wants a read
      fault for that address, or HMM_PFN_VALID and HMM_PFN_WRITE for a write
      fault.
      
      Moreover, by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
      device private memory to be migrated back to system memory through a page
      fault.
      
      This is a more flexible API and it better reflects how a device handles
      and reports faults.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd() · 53f5c3f4
      Jérôme Glisse authored
      
      
      No functional change, just create one function to handle pmd and one to
      handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: move hmm_pfns_clear() closer to where it is used · 33cd47dc
      Jérôme Glisse authored
      
      
      Move hmm_pfns_clear() closer to where it is used to make it clear it is
      not used by page table walkers.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE · b2744118
      Jérôme Glisse authored
      
      
      Make naming consistent across the code; DEVICE_PRIVATE is the name used
      outside the HMM code, so use that one.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: do not differentiate between empty entry or missing directory · 5504ed29
      Jérôme Glisse authored
      
      
      There is no point in differentiating between a range for which there is
      not even a directory (and thus no entries) and an empty entry (pte_none()
      or pmd_none() returns true).
      
      Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag and merge
      the now duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: cleanup special vma handling (VM_SPECIAL) · 855ce7d2
      Jérôme Glisse authored
      
      
      A special vma (one with any of the VM_SPECIAL flags) can not be accessed
      by a device because there is no consistent model across device drivers
      for those vmas and their backing memory.
      
      This patch directly uses the hmm_range struct as the hmm_pfns_special()
      argument, as it always affects the whole vma and thus the whole range.
      
      It also makes the behavior consistent: after this patch both
      hmm_vma_fault() and hmm_vma_get_pfns() return -EINVAL when facing such a
      vma.  Previously hmm_vma_fault() returned 0 and hmm_vma_get_pfns()
      returned -EINVAL, but both were filling the HMM pfn array with the
      special entry.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong · ff05c0c6
      Jérôme Glisse authored
      
      
      All the device drivers we care about are using 64-bit page table entries.
      In order to match this and to avoid a useless define, convert all HMM
      pfns to directly use uint64_t.  It is a first step on the road to
      allowing drivers to directly use the pfn values returned by HMM (saving
      memory and the CPU cycles used for conversion between the two).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture · 86586a41
      Jérôme Glisse authored
      
      
      Only peculiar architectures allow write without read, thus assume that
      any valid pfn allows read.  Note we do not care about write-only because
      it does make sense with things like atomic compare and exchange or any
      other operations that allow you to get the memory value through them.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters · 08232a45
      Jérôme Glisse authored
      
      
      Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range
      struct as a parameter and were initializing that struct with others of
      their parameters.  Have the callers of those functions do this, as they
      are likely to already, and only pass this struct to both functions.  This
      shortens the function signatures and makes it easier in the future to add
      new parameters by simply adding them to the structure.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: hmm_pfns_bad() was accessing wrong struct · c719547f
      Jérôme Glisse authored
      
      
      The private field of the mm_walk struct points to an hmm_vma_walk struct
      and not to the hmm_range struct as desired.  Fix it to get the proper
      struct pointer.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: unregister mmu_notifier when last HMM client quit · c01cbba2
      Jérôme Glisse authored
      
      
      This code was lost in translation at one point.  This properly calls
      mmu_notifier_unregister_no_release() once the last user is gone.  This
      fixes the zombie mm_struct, as without this patch we do not drop the
      refcount we have on it.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: HMM should have a callback before MM is destroyed · e1401513
      Ralph Campbell authored
      
      
      hmm_mirror_register() registers a callback for when the CPU pagetable is
      modified.  Normally, the device driver will call hmm_mirror_unregister()
      when the process using the device is finished.  However, if the process
      exits uncleanly, the mm_struct can be destroyed with no warning to the
      device driver.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-4-jglisse@redhat.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: fix header file if/else/endif maze · b28b08de
      Jérôme Glisse authored
      
      
      The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.  Because of
      this, after multiple includes there were multiple definitions of both
      hmm_mm_init() and hmm_mm_destroy(), leading to a build failure if HMM was
      enabled (CONFIG_HMM set).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/hmm: documentation editorial update to HMM documentation · 76ea470c
      Ralph Campbell authored
      
      
      Update the documentation for HMM to fix minor typos and phrasing to be a
      bit more readable.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-2-jglisse@redhat.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Stephen  Bates <sbates@raithlin.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event · d51d1e64
      Steven Rostedt authored
      
      
      The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
      parameters! Seven of them are from the reclaim_stat structure.  This
      structure is currently local to mm/vmscan.c.  By moving it to the global
      vmstat.h header, we can also reference it from the vmscan tracepoints.
      Moving it brings down the overhead of passing so many arguments
      to the trace event.  In the future, we may limit the number of arguments
      that a trace event may pass (ideally just 6, but more realistically it
      may be 8).
      
      Before this patch, the code to call the trace event is this:
      
       0f 83 aa fe ff ff       jae    ffffffff811e6261 <shrink_inactive_list+0x1e1>
       48 8b 45 a0             mov    -0x60(%rbp),%rax
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       44 8b 6d d4             mov    -0x2c(%rbp),%r13d
       8b 4d d0                mov    -0x30(%rbp),%ecx
       44 8b 75 cc             mov    -0x34(%rbp),%r14d
       44 8b 7d c8             mov    -0x38(%rbp),%r15d
       48 89 45 90             mov    %rax,-0x70(%rbp)
       8b 83 b8 fe ff ff       mov    -0x148(%rbx),%eax
       8b 55 c0                mov    -0x40(%rbp),%edx
       8b 7d c4                mov    -0x3c(%rbp),%edi
       8b 75 b8                mov    -0x48(%rbp),%esi
       89 45 80                mov    %eax,-0x80(%rbp)
       65 ff 05 e4 f7 e2 7e    incl   %gs:0x7ee2f7e4(%rip)        # 15bd0 <__preempt_count>
       48 8b 05 75 5b 13 01    mov    0x1135b75(%rip),%rax        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       48 85 c0                test   %rax,%rax
       74 72                   je     ffffffff811e646a <shrink_inactive_list+0x3ea>
       48 89 c3                mov    %rax,%rbx
       4c 8b 10                mov    (%rax),%r10
       89 f8                   mov    %edi,%eax
       48 89 85 68 ff ff ff    mov    %rax,-0x98(%rbp)
       89 f0                   mov    %esi,%eax
       48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
       89 c8                   mov    %ecx,%eax
       48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
       89 d0                   mov    %edx,%eax
       48 89 85 70 ff ff ff    mov    %rax,-0x90(%rbp)
       8b 45 8c                mov    -0x74(%rbp),%eax
       48 8b 7b 08             mov    0x8(%rbx),%rdi
       48 83 c3 18             add    $0x18,%rbx
       50                      push   %rax
       41 54                   push   %r12
       41 55                   push   %r13
       ff b5 78 ff ff ff       pushq  -0x88(%rbp)
       41 56                   push   %r14
       41 57                   push   %r15
       ff b5 70 ff ff ff       pushq  -0x90(%rbp)
       4c 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%r9
       4c 8b 85 60 ff ff ff    mov    -0xa0(%rbp),%r8
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       48 8b 55 90             mov    -0x70(%rbp),%rdx
       8b 75 80                mov    -0x80(%rbp),%esi
       41 ff d2                callq  *%r10
      
      After the patch:
      
       0f 83 a8 fe ff ff       jae    ffffffff811e626d <shrink_inactive_list+0x1cd>
       8b 9b b8 fe ff ff       mov    -0x148(%rbx),%ebx
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       4c 8b 6d a0             mov    -0x60(%rbp),%r13
       65 ff 05 f5 f7 e2 7e    incl   %gs:0x7ee2f7f5(%rip)        # 15bd0 <__preempt_count>
       4c 8b 35 86 5b 13 01    mov    0x1135b86(%rip),%r14        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       4d 85 f6                test   %r14,%r14
       74 2a                   je     ffffffff811e6411 <shrink_inactive_list+0x371>
       49 8b 06                mov    (%r14),%rax
       8b 4d 8c                mov    -0x74(%rbp),%ecx
       49 8b 7e 08             mov    0x8(%r14),%rdi
       49 83 c6 18             add    $0x18,%r14
       4c 89 ea                mov    %r13,%rdx
       45 89 e1                mov    %r12d,%r9d
       4c 8d 45 b8             lea    -0x48(%rbp),%r8
       89 de                   mov    %ebx,%esi
       51                      push   %rcx
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       ff d0                   callq  *%rax
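
      The shape of the change, as a standalone sketch (the struct and function
      names below are stand-ins, not the kernel definitions): seven scalar
      arguments collapse into a single pointer.

        #include <stdio.h>

        /* Counters that used to be passed to the trace event one by one. */
        struct reclaim_stat_sketch {
                unsigned long nr_dirty, nr_congested, nr_writeback,
                              nr_immediate, nr_activate, nr_ref_keep, nr_unmap_fail;
        };

        /* One pointer argument instead of seven scalars. */
        static void trace_shrink_inactive(int nid, unsigned long nr_scanned,
                                          unsigned long nr_reclaimed,
                                          const struct reclaim_stat_sketch *stat)
        {
                printf("nid=%d scanned=%lu reclaimed=%lu dirty=%lu writeback=%lu\n",
                       nid, nr_scanned, nr_reclaimed, stat->nr_dirty, stat->nr_writeback);
        }

        int main(void)
        {
                struct reclaim_stat_sketch st = { .nr_dirty = 3, .nr_writeback = 1 };

                trace_shrink_inactive(0, 128, 90, &st);
                return 0;
        }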
      
      Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
      Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reported-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/vmscan: don't mess with pgdat->flags in memcg reclaim · e3c1ac58
      Andrey Ryabinin authored
      
      
      memcg reclaim may alter pgdat->flags based on the state of the LRU lists
      in a cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep
      in congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
      pages.  But the worst here is PGDAT_CONGESTED, since it may force all
      direct reclaims to stall in wait_iff_congested().  Note that only kswapd
      has the power to clear any of these bits.  This might just never happen
      if cgroup limits are configured that way.  So all direct reclaims will
      stall as long as we have some congested bdi in the system.
      
      Leave all pgdat->flags manipulations to kswapd.  kswapd scans the whole
      pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
      it's reasonable to leave all decisions about node state to kswapd.
      
      Why only kswapd? Why not allow global direct reclaim to change these
      flags? It is because currently only kswapd can clear these flags.  I'm
      less worried about the case when PGDAT_CONGESTED is falsely not set, and
      more worried about the case when it is falsely set.  If a direct
      reclaimer sets PGDAT_CONGESTED, do we have a guarantee that after the
      congestion problem is sorted out, kswapd will be woken up and clear the
      flag? It seems like there is no such guarantee.  E.g. direct reclaimers
      may eventually balance the pgdat and kswapd simply won't wake up (see
      wakeup_kswapd()).
      
      Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
      now loses its congestion throttling mechanism.  Add per-cgroup
      congestion state and throttle cgroup2 reclaimers if memcg is in
      congestion state.
      
      Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
      bits since they alter only kswapd behavior.
      
      The problem could be easily demonstrated by creating heavy congestion in
      one cgroup:
      
          echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
          mkdir -p /sys/fs/cgroup/congester
          echo 512M > /sys/fs/cgroup/congester/memory.max
          echo $$ > /sys/fs/cgroup/congester/cgroup.procs
          /* generate a lot of dirty data on slow HDD */
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
          ....
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
      
      and some job in another cgroup:
      
          mkdir /sys/fs/cgroup/victim
          echo 128M > /sys/fs/cgroup/victim/memory.max
      
          # time cat /dev/sda > /dev/null
          real    10m15.054s
          user    0m0.487s
          sys     1m8.505s
      
      According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
      of the time sleeping there.
      
      With the patch, cat doesn't waste time anymore:
      
          # time cat /dev/sda > /dev/null
          real    5m32.911s
          user    0m0.411s
          sys     0m56.664s
      
      [aryabinin@virtuozzo.com: congestion state should be per-node]
        Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
      [aryabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
        Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/vmscan: don't change pgdat state on base of a single LRU list state · d108c772
      Andrey Ryabinin authored
      
      
      We have a separate LRU list for each memory cgroup.  Memory reclaim
      iterates over cgroups and calls shrink_inactive_list() for every inactive
      LRU list.  Based on the state of a single LRU, shrink_inactive_list() may
      flag the whole node as dirty, congested or under writeback.  This is
      obviously wrong and hurtful.  It's especially hurtful when we have a
      possibly small congested cgroup in the system.  Then *all* direct
      reclaims waste time by sleeping in wait_iff_congested().  And the more
      memcgs we have in the system, the longer the memory allocation stall is,
      because wait_iff_congested() is called on each lru-list scan.
      
      Sum the reclaim stats across all visited LRUs on the node and flag the
      node as dirty, congested or under writeback based on that sum.  Also call
      congestion_wait() and wait_iff_congested() once per pgdat scan, instead
      of once per lru-list scan.
      
      This only fixes the problem for global reclaim case.  Per-cgroup reclaim
      may alter global pgdat flags too, which is wrong.  But that is separate
      issue and will be addressed in the next patch.
      
      This change will not have any effect on a systems with all workload
      concentrated in a single cgroup.
      
      [aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
        Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/vmscan: remove redundant current_may_throttle() check · c4fd4fa5
      Andrey Ryabinin authored
      
      
      Only kswapd can have non-zero nr_immediate, and current_may_throttle()
      is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
      enough to check stat.nr_immediate only.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm/vmscan: update stale comments · 894befec
      Andrey Ryabinin authored
      
      
      Update some comments that became stale since the transition from per-zone
      to per-node reclaim.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm: treat indirectly reclaimable memory as free in overcommit logic · d79f7aa4
      Roman Gushchin authored
      
      
      Indirectly reclaimable memory can consume a significant part of total
      memory and it's actually reclaimable (it will be released under actual
      memory pressure).
      
      So, the overcommit logic should treat it as free.
      
      Otherwise, it's possible to cause random system-wide memory allocation
      failures by consuming a significant amount of memory by indirectly
      reclaimable memory, e.g.  dentry external names.
      
      If the GUESS overcommit policy is used, this might be exploited for a
      denial of service attack under some conditions.
      
      The following program illustrates the approach.  It causes the kernel to
      allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
      that at some point the overcommit logic may start blocking large
      allocation system-wide.
      
        int main()
        {
        	char buf[256];
        	unsigned long i;
        	struct stat statbuf;
      
        	buf[0] = '/';
        	for (i = 1; i < sizeof(buf); i++)
        		buf[i] = '_';
      
        	for (i = 0; 1; i++) {
        		sprintf(&buf[248], "%8lu", i);
        		stat(buf, &statbuf);
        	}
      
        	return 0;
        }
      
      This patch in combination with related indirectly reclaimable memory
      patches closes this issue.
      
      Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • dcache: account external names as indirectly reclaimable memory · f1782c9b
      Roman Gushchin authored
      
      
      I received a report about suspicious growth of unreclaimable slabs on
      some machines.  I've found that it happens on machines with low memory
      pressure, and these unreclaimable slabs are external names attached to
      dentries.
      
      External names are allocated using generic kmalloc() function, so they
      are accounted as unreclaimable.  But they are held by dentries, which
      are reclaimable, and they will be reclaimed under the memory pressure.
      
      In particular, this breaks MemAvailable calculation, as it doesn't take
      unreclaimable slabs into account.  This leads to a silly situation, when
      a machine is almost idle, has no memory pressure and therefore has a big
      dentry cache.  And the resulting MemAvailable is too low to start a new
      workload.
      
      To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
      used to track the amount of memory, consumed by external names.  The
      counter is increased in the dentry allocation path, if an external name
      structure is allocated; and it's decreased in the dentry freeing path.
      
      To reproduce the problem I've used the following Python script:
      
        import os
      
        for iter in range (0, 10000000):
            try:
                name = ("/some_long_name_%d" % iter) + "_" * 220
                os.stat(name)
            except Exception:
                pass
      
      Without this patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7811688 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    2753052 kB
      
      With the patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7809516 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7749144 kB
      
      [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
        Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
      [guro@fb.com: fix indirectly reclaimable memory accounting]
        Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm: treat indirectly reclaimable memory as available in MemAvailable · 034ebf65
      Roman Gushchin authored
      
      
      Adjust /proc/meminfo MemAvailable calculation by adding the amount of
      indirectly reclaimable memory (rounded to the PAGE_SIZE).
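
      A quick sketch of the bytes-to-pages conversion, assuming 4 KiB pages
      and shift-based rounding (the variable name is invented for the
      example):

        #include <stdio.h>

        #define PAGE_SHIFT 12   /* 4 KiB pages */

        int main(void)
        {
                /* value of the NR_INDIRECTLY_RECLAIMABLE_BYTES counter */
                unsigned long indirectly_reclaimable_bytes = 123456789UL;

                /* contribution to MemAvailable, in whole pages and in kB */
                unsigned long pages = indirectly_reclaimable_bytes >> PAGE_SHIFT;

                printf("MemAvailable += %lu pages (%lu kB)\n",
                       pages, pages << (PAGE_SHIFT - 10));
                return 0;
        }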
      
      Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    • mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES · eb592546
      Roman Gushchin authored
      
      
      Patch series "indirectly reclaimable memory", v2.
      
      This patchset introduces the concept of indirectly reclaimable memory
      and applies it to fix an issue where a big number of dentries with
      external names can significantly affect the MemAvailable value.
      
      This patch (of 3):
      
      Introduce the concept of indirectly reclaimable memory and add the
      corresponding memory counter and /proc/vmstat item.
      
      Indirectly reclaimable memory is any sort of memory used by the kernel
      (except for reclaimable slabs) which is actually reclaimable, i.e. it
      will be released under memory pressure.
      
      The counter is in bytes, as it's not always possible to count such
      objects in pages.  The name contains BYTES by analogy to
      NR_KERNEL_STACK_KB.
      
      Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Apr 11, 2018