  1. Apr 12, 2018
    • mm: memcg: make sure memory.events is uptodate when waking pollers · e27be240
      Johannes Weiner authored
      Commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") added per-cpu drift to all memory cgroup stats
      and events shown in memory.stat and memory.events.
      
      For memory.stat this is acceptable.  But memory.events issues file
      notifications, and somebody polling the file for changes will be
      confused when the counters in it are unchanged after a wakeup.
      
      Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
      MEMCG_OOM - are sufficiently rare and high-level that we don't need
      per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
      frequent, but they're counting invocations of reclaim, which is a
      complex operation that touches many shared cachelines.
      
      This splits memory.events from the generic VM events and tracks them in
      their own, unbuffered atomic counters.  That's also cleaner, as it
      eliminates the ugly enum nesting of VM and cgroup events.
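The difference between a buffered counter and an unbuffered one can be sketched in user space as follows. This is only an illustration under stated assumptions: the type names, `NR_CPUS_SKETCH`, and `BATCH_SKETCH` are hypothetical, the "per-cpu" slots are plain array entries used from a single thread, and none of it is the kernel's actual memcg code.

```c
#include <stdatomic.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count */
#define BATCH_SKETCH   32  /* hypothetical flush threshold */

/*
 * Buffered counter: each CPU accumulates locally and folds its delta
 * into the shared total only once the delta exceeds the batch size.
 * A reader of 'total' can therefore lag behind the true count.
 */
struct buffered_counter {
    atomic_long total;
    long percpu[NR_CPUS_SKETCH];
};

static void buffered_inc(struct buffered_counter *c, int cpu)
{
    if (++c->percpu[cpu] > BATCH_SKETCH) {
        atomic_fetch_add(&c->total, c->percpu[cpu]);
        c->percpu[cpu] = 0;
    }
}

static long buffered_read(struct buffered_counter *c)
{
    return atomic_load(&c->total);
}

/*
 * Unbuffered counter: every event is visible to readers immediately,
 * at the cost of touching a shared cacheline on each increment.
 */
struct event_counter {
    atomic_long count;
};

static void event_inc(struct event_counter *c)
{
    atomic_fetch_add(&c->count, 1);
}

static long event_read(struct event_counter *c)
{
    return atomic_load(&c->count);
}
```

With the buffered variant, a poller woken after a single event may still read the old total, which is the confusion described above; the unbuffered variant always reflects the event.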
      
      [hannes@cmpxchg.org: "array subscript is above array bounds"]
        Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ksm.c: fix inconsistent accounting of zero pages · a38c015f
      Claudio Imbrenda authored
      When using KSM with use_zero_pages, we replace anonymous pages
      containing only zeroes with actual zero pages, which are not anonymous.
      We need to do proper accounting of the mm counters, otherwise we will
      get wrong values in /proc and a BUG message in dmesg when tearing down
      the mm.
      
      Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
      Fixes: e86c59b1 ("mm/ksm: improve deduplication of zero pages with colouring")
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/z3fold.c: use gfpflags_allow_blocking · 8a97ea54
      Matthew Wilcox authored
      
      
      We have a perfectly good macro to determine whether the gfp flags allow
      you to sleep or not; use it instead of trying to infer it.
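A minimal user-space sketch of the idea follows. The helper body mirrors the kernel's `gfpflags_allow_blocking()`, but the flag bit values here are stand-ins chosen for illustration (the real ones live in include/linux/gfp.h), and `inferred_can_sleep()` is a hypothetical example of an open-coded inference, not a quote of the old z3fold code.

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Stand-in bit values for illustration only. */
#define __GFP_DIRECT_RECLAIM 0x1u
#define __GFP_KSWAPD_RECLAIM 0x2u
#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

/* The caller may block only if direct reclaim is permitted. */
static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
{
    return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}

/*
 * Hypothetical open-coded inference: it additionally demands the
 * kswapd bit, so it can disagree with the helper above.
 */
static inline bool inferred_can_sleep(gfp_t gfp)
{
    return (gfp & __GFP_RECLAIM) == __GFP_RECLAIM;
}
```

Using the shared helper keeps the meaning of "may sleep" in one place instead of having each caller infer it from a different combination of bits.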
      
      Link: http://lkml.kernel.org/r/20180408062206.GC16007@bombadil.infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • z3fold: fix memory leak · 1ec6995d
      Xidong Wang authored
      
      
      In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
      released on the error path taken when pool->compact_wq, which holds the
      return value of create_singlethread_workqueue(), is NULL.  This results
      in a memory leak.
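The usual shape of such a fix is a chain of unwind labels, so that each failure releases everything allocated before it. The sketch below models that pattern in user space; `struct pool_sketch`, `pool_create()`, and the `fail_wq` switch are hypothetical, and `malloc()` merely stands in for `__alloc_percpu()` and `create_singlethread_workqueue()`.

```c
#include <stdlib.h>

struct pool_sketch {
    void *unbuddied;   /* stands in for the __alloc_percpu() result */
    void *compact_wq;  /* stands in for the workqueue pointer */
};

/* fail_wq simulates the workqueue creation returning NULL. */
static struct pool_sketch *pool_create(int fail_wq)
{
    struct pool_sketch *pool = calloc(1, sizeof(*pool));

    if (!pool)
        goto out;

    pool->unbuddied = malloc(64);
    if (!pool->unbuddied)
        goto out_pool;

    pool->compact_wq = fail_wq ? NULL : malloc(16);
    if (!pool->compact_wq)
        goto out_unbuddied;   /* the path that previously leaked */

    return pool;

out_unbuddied:
    free(pool->unbuddied);
out_pool:
    free(pool);
out:
    return NULL;
}

static void pool_destroy(struct pool_sketch *pool)
{
    free(pool->compact_wq);
    free(pool->unbuddied);
    free(pool);
}
```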
      
      [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
      Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
      Signed-off-by: Xidong Wang <wangxidong_97@163.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, thp: do not invoke oom killer on thp charges · 2a70f6a7
      Michal Hocko authored
      A THP memcg charge can trigger the oom killer since 25160354 ("mm,
      thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
      Previously we used an explicit __GFP_NORETRY, which ruled out the OOM
      killer automatically.
      
      The memcg charge path should be semantically compliant with the
      allocation path, which means that if we do not trigger the OOM killer
      for costly orders there, we should not do so in the memcg charge path
      either.  Otherwise we are forcing callers to distinguish the two and use
      different gfp masks, which is both non-intuitive and bug prone.  As soon
      as we get a costly high-order kmalloc user, we do not even have any means
      to pass a memcg-specific gfp mask to prevent OOM, because the charging
      is deep within the guts of the slab allocator.
      
      The unexpected memcg OOM on THP has already been fixed upstream by
      9d3c3354 ("mm, thp: do not cause memcg oom for thp") but this is a
      one-off fix rather than a generic solution.  Teach mem_cgroup_oom to
      bail out on costly order requests to fix the THP issue as well as any
      other costly OOM eligible allocations to be added in future.
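The bail-out rule can be sketched as a single predicate. `PAGE_ALLOC_COSTLY_ORDER` is 3 in the kernel's page allocator; the helper name `memcg_charge_may_oom()` is hypothetical and only models the decision, not the surrounding mem_cgroup_oom() logic.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper: a charge may escalate to the OOM killer only
 * when its order is not "costly", matching the allocator's policy.
 */
static bool memcg_charge_may_oom(unsigned int order)
{
    return order <= PAGE_ALLOC_COSTLY_ORDER;
}
```

On a configuration with 4 KiB base pages and 2 MiB huge pages, a THP charge is order 9, so it falls on the "no OOM" side without needing a special gfp mask.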
      
      Also revert 9d3c3354 because special gfp for THP is no longer
      needed.
      
      Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: properly preserve write attribute in special migrate entry · 07707125
      Ralph Campbell authored
      
      
      Use of pte_write(pte) is only valid for a present pte, but the common
      code which sets the migration entry can be reached for both a valid
      present pte and a special swap entry (for device memory).  Fix the code
      to use the mpfn value, which properly handles both cases.
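A rough model of the fix: record writability once, in a form valid for each kind of entry, and consult that record later instead of re-reading a bit that only means something for present ptes. Everything below is a hypothetical user-space sketch; `struct entry_sketch` and both helpers are invented for illustration, and the `MIGRATE_PFN_WRITE` bit position is a stand-in.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIGRATE_PFN_WRITE (1ULL << 3)  /* stand-in bit position */

enum entry_kind { ENTRY_PRESENT, ENTRY_DEVICE_PRIVATE_SWAP };

struct entry_sketch {
    enum entry_kind kind;
    bool pte_write_bit;  /* meaningful only when kind == ENTRY_PRESENT */
    bool swap_write;     /* meaningful only for the special swap entry */
};

/* Collection step: derive the write flag from whichever field is valid. */
static uint64_t collect_mpfn(const struct entry_sketch *e)
{
    bool writable = (e->kind == ENTRY_PRESENT) ? e->pte_write_bit
                                               : e->swap_write;

    return writable ? MIGRATE_PFN_WRITE : 0;
}

/* Installation step: decide from mpfn, never from the raw pte bit. */
static bool migration_entry_is_writable(uint64_t mpfn)
{
    return (mpfn & MIGRATE_PFN_WRITE) != 0;
}
```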
      
      On x86 this did not have any bad side effect because the pte write bit
      is below PAGE_BIT_GLOBAL and thus a special swap entry has it set to 0,
      which in turn means we were always creating a read-only special
      migration entry.
      
      So once migration finished we always write-protected the CPU page
      table entry (moreover, this is only an issue when migrating from device
      memory to system memory).  The end effect is that a CPU write access
      would fault again and restore write permission.
      
      This behaviour isn't too bad; it just burns CPU cycles by forcing the
      CPU to take a second fault on write access, i.e. double faulting on the
      same address.  There is no corruption or incorrect state (it behaves
      like a COWed page from a fork with a mapcount of 1).
      
      Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: check __highest_present_section_nr directly in memory_dev_init() · bc8755ba
      Wei Yang authored
      
      
      __highest_present_section_nr is a stricter bound than
      NR_MEM_SECTIONS, so checking __highest_present_section_nr directly is
      enough.
      
      Link: http://lkml.kernel.org/r/20180330032044.21647-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages · 09a913a7
      Mel Gorman authored
      
      
      change_pte_range is called from task work context to mark PTEs for
      receiving NUMA faulting hints.  If the marked pages are dirty then
      migration may fail.  Some filesystems cannot migrate dirty pages without
      blocking, so such pages are skipped in MIGRATE_ASYNC mode, which just
      wastes CPU.  Even when they can, it can be a waste of cycles when the
      pages are shared, forcing higher scan rates.  This patch avoids marking
      shared dirty pages for hinting faults and also skips a migration if the
      page was dirtied after the scanner updated a clean page.
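The check described above amounts to one predicate applied at two points: once when the scanner decides whether to mark a page, and again at fault time in case the page was dirtied in between. The sketch below is a simplified user-space model; `struct page_sketch` and `numa_skip_page()` are hypothetical names, and reading "shared" as "file-backed" follows the commit title rather than the kernel source.

```c
#include <stdbool.h>

struct page_sketch {
    bool file_backed;
    bool dirty;
};

/*
 * Hypothetical predicate: a dirty file-backed page is unlikely to
 * migrate cheaply, so neither mark it for a hinting fault nor try
 * to migrate it.
 */
static bool numa_skip_page(const struct page_sketch *page)
{
    return page->file_backed && page->dirty;
}
```

Calling the predicate a second time at migration time covers the case where a page was clean when marked and dirtied afterwards.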
      
      This is most noticeable running the NASA Parallel Benchmark when backed
      by btrfs, the default root filesystem for some distributions, but also
      noticeable when using XFS.
      
      The following are results from a 4-socket machine running a 4.16-rc4
      kernel with some scheduler patches that are pending for the next merge
      window.
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1
        Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
        Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
        Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
        Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
        Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)
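The percentages in parentheses appear to be the improvement relative to the baseline column, i.e. (baseline - patched) / baseline, with negative values indicating a regression. A small sketch of that arithmetic, assuming this reading of the table (the helper name `gain_percent()` is hypothetical):

```c
/* Relative gain of 'patched' over 'baseline', in percent. */
static double gain_percent(double baseline, double patched)
{
    return (baseline - patched) / baseline * 100.0;
}
```

For cg.D this gives (459.07 - 444.21) / 459.07, roughly 3.24%, and for is.D (25.55 - 27.85) / 25.55, roughly -9.00%, matching the figures shown.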
      
      is.D regresses slightly in terms of absolute time but note that that
      particular load varies quite a bit from run to run.  The more relevant
      observation is the total system CPU usage.
      
                          4.16.0-rc4         4.16.0-rc4
                   schedtip-20180309         nodirty-v1
        User                71471.91           70627.04
        System              11078.96            8256.13
        Elapsed               661.66             632.74
      
      That is a substantial drop in system CPU usage and overall the workload
      completes faster.  The NUMA balancing statistics are also interesting:
      
        NUMA base PTE updates        111407972   139848884
        NUMA huge PMD updates           206506      264869
        NUMA page range updates      217139044   275461812
        NUMA hint faults               4300924     3719784
        NUMA hint local faults         3012539     3416618
        NUMA hint local percent             70          91
        NUMA pages migrated            1517487     1358420
      
      While more PTEs are scanned due to changes in what faults are gathered,
      it's clear that a far higher percentage of faults are local as the bulk
      of the remote hits were dirty pages that, in this case with btrfs, had
      no chance of migrating.
      
      The following is a comparison when using XFS as that is a more realistic
      filesystem choice for a data partition
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1r47
        Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
        Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
        Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
        Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
        Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)
      
      That is a reasonable gain on two relatively long-lived workloads.  While
      not presented, there is also a substantial drop in system CPU usage and
      the NUMA balancing stats show similar improvements in locality as btrfs
      did.
      
      Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/vm/hmm.txt: typos and syntaxes fixes · e8eddfd2
      Jérôme Glisse authored
      
      
      This fixes typos and syntax errors; thanks to Randy Dunlap for pointing
      them out (they were all my fault).
      
      Link: http://lkml.kernel.org/r/20180409151859.4713-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: fix header file if/else/endif maze, again · 9d8a463a
      Arnd Bergmann authored
      
      
      The last fix was still wrong, as we need the inline dummy functions also
      for the case that CONFIG_HMM is enabled but CONFIG_HMM_MIRROR is not:
      
        kernel/fork.o: In function `__mmdrop':
        fork.c:(.text+0x14f6): undefined reference to `hmm_mm_destroy'
      
      This adds back the second copy of the dummy functions, hopefully
      this time in the right place.
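The general pattern is that every configuration which does not build the real function must still see an inline stub, otherwise callers fail to link as shown above. The following is a simplified sketch and not the actual layout of include/linux/hmm.h: the macros are defined by hand here (in the kernel they come from Kconfig), `struct mm_struct` is reduced to one field, and `mmdrop_sketch()` is a hypothetical caller.

```c
/* Hypothetical configuration: HMM on, HMM_MIRROR off. */
#define CONFIG_HMM 1
/* CONFIG_HMM_MIRROR deliberately left undefined. */

struct mm_struct {
    void *hmm;
};

#if defined(CONFIG_HMM) && defined(CONFIG_HMM_MIRROR)
/* Real implementation would be provided by another object file. */
void hmm_mm_destroy(struct mm_struct *mm);
#else
/* Dummy stub so callers still build and link without the mirror code. */
static inline void hmm_mm_destroy(struct mm_struct *mm)
{
    (void)mm;
}
#endif

/* A caller in the style of __mmdrop(): needs no #ifdef of its own. */
static void mmdrop_sketch(struct mm_struct *mm)
{
    hmm_mm_destroy(mm);
}
```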
      
      Link: http://lkml.kernel.org/r/20180404110236.804484-1-arnd@arndb.de
      Fixes: 8900d06a277a ("mm/hmm: fix header file if/else/endif maze")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm.c: remove superfluous RCU protection around radix tree lookup · 18be460e
      Tejun Heo authored
      
      
      hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
      actually uses the RCU protection.  The only caller is
      hmm_devmem_pages_create() which already grabs the mutex and does
      superfluous rcu_read_lock/unlock() around the function.
      
      This doesn't add anything and just adds to confusion.  Remove the RCU
      protection and open-code the radix tree lookup.  If this needs to become
      more sophisticated in the future, let's add it back when necessary.
      
      Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: use device driver encoding for HMM pfn · f88a1e90
      Jérôme Glisse authored
      
      
      Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
      a pfn shift value, allowing them to define their own encoding for the
      HMM pfns that are filled into the pfns array of the hmm_range struct.
      With this, a device driver can get pfns that match its own private
      encoding out of HMM without having to do any conversion.
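A minimal sketch of such a driver-defined encoding, assuming a layout where the flag bits sit below the pfn. The flag index names are modelled on the ones used in this series, but `struct hmm_range_sketch`, `encode_pfn()`, and `decode_pfn()` are hypothetical and much simpler than the real hmm_range.

```c
#include <stdint.h>

/* Index into the driver-provided flags table. */
enum hmm_pfn_flag_e {
    HMM_PFN_VALID = 0,
    HMM_PFN_WRITE,
    HMM_PFN_DEVICE_PRIVATE,
    HMM_PFN_FLAG_MAX
};

struct hmm_range_sketch {
    const uint64_t *flags;  /* driver's bit pattern for each flag index */
    uint8_t pfn_shift;      /* number of low bits reserved for flags */
};

/* Build an entry in the driver's own format. */
static uint64_t encode_pfn(const struct hmm_range_sketch *r,
                           uint64_t pfn, int writable)
{
    uint64_t entry = (pfn << r->pfn_shift) | r->flags[HMM_PFN_VALID];

    if (writable)
        entry |= r->flags[HMM_PFN_WRITE];
    return entry;
}

/* Recover the pfn from an encoded entry. */
static uint64_t decode_pfn(const struct hmm_range_sketch *r, uint64_t entry)
{
    return entry >> r->pfn_shift;
}
```

Because the table and shift come from the driver, the values written into the pfns array can be consumed by the device page table code as they are.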
      
      [rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
        Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
      [rcampbell@nvidia.com: clarify fault logic for device private memory]
        Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: change hmm_vma_fault() to allow write fault on page basis · 2aee09d8
      Jérôme Glisse authored
      
      
      This changes hmm_vma_fault() to not take a global write fault flag for a
      range but instead rely on the caller to populate the HMM pfns array with
      the proper fault flags, i.e. HMM_PFN_VALID if the driver wants a read
      fault for that address, or HMM_PFN_VALID and HMM_PFN_WRITE for a write.
      
      Moreover, by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
      device private memory to be migrated back to system memory through a
      page fault.
      
      This is a more flexible API and it better reflects how a device handles
      and reports faults.
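From the caller's side, the per-page request could look roughly like this. The flag values are stand-ins and `request_faults()` is a hypothetical helper; only the idea of pre-populating the pfns array with per-page flags comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag values for illustration. */
#define HMM_PFN_VALID (1ULL << 0)
#define HMM_PFN_WRITE (1ULL << 1)

/*
 * Hypothetical helper: fill the request array before faulting.
 * want_write[i] != 0 means the device needs write access to page i;
 * every other page is requested read-only.
 */
static void request_faults(uint64_t *pfns, const int *want_write,
                           size_t npages)
{
    for (size_t i = 0; i < npages; i++) {
        pfns[i] = HMM_PFN_VALID;
        if (want_write[i])
            pfns[i] |= HMM_PFN_WRITE;
    }
}
```

A device that faulted for write on only one page of a range can thus avoid forcing write faults (and the resulting COW breaks) on its neighbours.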
      
      Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd() · 53f5c3f4
      Jérôme Glisse authored
      
      
      No functional change, just create one function to handle pmd and one to
      handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: move hmm_pfns_clear() closer to where it is used · 33cd47dc
      Jérôme Glisse authored
      
      
      Move hmm_pfns_clear() closer to where it is used to make it clear it is
      not used by page table walkers.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE · b2744118
      Jérôme Glisse authored
      
      
      Make naming consistent across the code: DEVICE_PRIVATE is the name used
      outside HMM code, so use that one.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: do not differentiate between empty entry or missing directory · 5504ed29
      Jérôme Glisse authored
      
      
      There is no point in differentiating between a range for which there is
      not even a directory (and thus no entries) and an empty entry
      (pte_none() or pmd_none() returns true).
      
      Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag and merge
      the now-duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: cleanup special vma handling (VM_SPECIAL) · 855ce7d2
      Jérôme Glisse authored
      
      
      A special vma (one with any of the VM_SPECIAL flags) cannot be accessed
      by a device because there is no consistent model across device drivers
      for those vmas and their backing memory.
      
      This patch uses the hmm_range struct directly as the hmm_pfns_special()
      argument, as it always affects the whole vma and thus the whole range.
      
      It also makes behavior consistent: after this patch both hmm_vma_fault()
      and hmm_vma_get_pfns() return -EINVAL when facing such a vma.  Previously
      hmm_vma_fault() returned 0 and hmm_vma_get_pfns() returned -EINVAL, but
      both were filling the HMM pfn array with special entries.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong · ff05c0c6
      Jérôme Glisse authored
      
      
      All device drivers we care about use 64-bit page table entries.  In
      order to match this and to avoid a useless define, convert all HMM pfns
      to use uint64_t directly.  It is a first step on the road to allowing
      drivers to directly use the pfn values returned by HMM (saving the
      memory and CPU cycles used for conversion between the two).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture · 86586a41
      Jérôme Glisse authored
      
      
      Only peculiar architectures allow write without read, so assume that any
      valid pfn allows read.  Note that we do not care about write-only
      mappings, because they make little sense alongside operations such as
      atomic compare-and-exchange, or any other operation that lets you obtain
      the memory value through it.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters · 08232a45
      Jérôme Glisse authored
      
      
      Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range
      struct as a parameter and were initializing that struct from their other
      parameters.  Have the callers of those functions do this, as they are
      likely to do so already, and pass only this struct to both functions.
      This shortens the function signatures and makes it easier to add new
      parameters in the future by simply adding them to the structure.
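The "parameter struct" pattern can be sketched as follows. The struct and both helpers are hypothetical and far smaller than the real hmm_range; `PAGE_SHIFT_SKETCH` assumes 4 KiB pages.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12  /* assumes 4 KiB pages */

/* Caller-initialized description of the range to operate on. */
struct range_sketch {
    unsigned long start;
    unsigned long end;
    uint64_t *pfns;
};

/* Callee takes only the struct; new fields need no signature change. */
static int range_check(const struct range_sketch *range)
{
    if (!range->pfns || range->start >= range->end)
        return -1;
    return 0;
}

static unsigned long range_npages(const struct range_sketch *range)
{
    return (range->end - range->start) >> PAGE_SHIFT_SKETCH;
}
```

Adding, say, a flags table later means adding a field and initializing it in the callers that care, while every function signature stays the same.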
      
      Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: hmm_pfns_bad() was accessing wrong struct · c719547f
      Jérôme Glisse authored
      
      
      The private field of the mm_walk struct points to an hmm_vma_walk struct
      and not to the desired hmm_range struct.  Fix to get the proper struct
      pointer.
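This class of bug arises because a `void *` private pointer converts silently to any struct pointer type. A simplified model of the correct access path follows; all three struct types are cut-down, hypothetical stand-ins for the kernel's mm_walk, hmm_vma_walk, and hmm_range.

```c
struct hmm_range_sketch {
    unsigned long start;
    unsigned long end;
};

struct hmm_vma_walk_sketch {
    struct hmm_range_sketch *range;
    unsigned long last;
};

struct mm_walk_sketch {
    void *private;  /* points to a struct hmm_vma_walk_sketch */
};

/*
 * Correct access: go through the intermediate walk struct.
 * Assigning walk->private straight to a range pointer would compile
 * without warning but read the wrong memory layout.
 */
static struct hmm_range_sketch *walk_to_range(const struct mm_walk_sketch *walk)
{
    struct hmm_vma_walk_sketch *hmm_vma_walk = walk->private;

    return hmm_vma_walk->range;
}
```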
      
      Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: unregister mmu_notifier when last HMM client quit · c01cbba2
      Jérôme Glisse authored
      
      
      This code was lost in translation at one point.  This properly calls
      mmu_notifier_unregister_no_release() once the last user is gone.  This
      fixes the zombie mm_struct, as without this patch we do not drop the
      refcount we hold on it.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: HMM should have a callback before MM is destroyed · e1401513
      Ralph Campbell authored
      
      
      hmm_mirror_register() registers a callback for when the CPU pagetable is
      modified.  Normally, the device driver will call hmm_mirror_unregister()
      when the process using the device is finished.  However, if the process
      exits uncleanly, the mm_struct can be destroyed with no warning to the
      device driver.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-4-jglisse@redhat.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: fix header file if/else/endif maze · b28b08de
      Jérôme Glisse authored
      
      
      The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.  Because of
      this, after multiple includes there were multiple definitions of both
      hmm_mm_init() and hmm_mm_destroy(), leading to a build failure if HMM was
      enabled (CONFIG_HMM set).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: documentation editorial update to HMM documentation · 76ea470c
      Ralph Campbell authored
      
      
      Update the documentation for HMM to fix minor typos and phrasing to be a
      bit more readable.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-2-jglisse@redhat.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Stephen Bates <sbates@raithlin.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event · d51d1e64
      Steven Rostedt authored
      
      
      The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
      parameters! Seven of them are from the reclaim_stat structure.  This
      structure is currently local to mm/vmscan.c.  By moving it to the global
      vmstat.h header, we can also reference it from the vmscan tracepoints.
      Moving it also brings down the overhead of passing so many arguments
      to the trace event.  In the future, we may limit the number of arguments
      that a trace event may pass (ideally just 6, but more realistically it
      may be 8).
      
      Before this patch, the code to call the trace event is this:
      
       0f 83 aa fe ff ff       jae    ffffffff811e6261 <shrink_inactive_list+0x1e1>
       48 8b 45 a0             mov    -0x60(%rbp),%rax
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       44 8b 6d d4             mov    -0x2c(%rbp),%r13d
       8b 4d d0                mov    -0x30(%rbp),%ecx
       44 8b 75 cc             mov    -0x34(%rbp),%r14d
       44 8b 7d c8             mov    -0x38(%rbp),%r15d
       48 89 45 90             mov    %rax,-0x70(%rbp)
       8b 83 b8 fe ff ff       mov    -0x148(%rbx),%eax
       8b 55 c0                mov    -0x40(%rbp),%edx
       8b 7d c4                mov    -0x3c(%rbp),%edi
       8b 75 b8                mov    -0x48(%rbp),%esi
       89 45 80                mov    %eax,-0x80(%rbp)
       65 ff 05 e4 f7 e2 7e    incl   %gs:0x7ee2f7e4(%rip)        # 15bd0 <__preempt_count>
       48 8b 05 75 5b 13 01    mov    0x1135b75(%rip),%rax        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       48 85 c0                test   %rax,%rax
       74 72                   je     ffffffff811e646a <shrink_inactive_list+0x3ea>
       48 89 c3                mov    %rax,%rbx
       4c 8b 10                mov    (%rax),%r10
       89 f8                   mov    %edi,%eax
       48 89 85 68 ff ff ff    mov    %rax,-0x98(%rbp)
       89 f0                   mov    %esi,%eax
       48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
       89 c8                   mov    %ecx,%eax
       48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
       89 d0                   mov    %edx,%eax
       48 89 85 70 ff ff ff    mov    %rax,-0x90(%rbp)
       8b 45 8c                mov    -0x74(%rbp),%eax
       48 8b 7b 08             mov    0x8(%rbx),%rdi
       48 83 c3 18             add    $0x18,%rbx
       50                      push   %rax
       41 54                   push   %r12
       41 55                   push   %r13
       ff b5 78 ff ff ff       pushq  -0x88(%rbp)
       41 56                   push   %r14
       41 57                   push   %r15
       ff b5 70 ff ff ff       pushq  -0x90(%rbp)
       4c 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%r9
       4c 8b 85 60 ff ff ff    mov    -0xa0(%rbp),%r8
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       48 8b 55 90             mov    -0x70(%rbp),%rdx
       8b 75 80                mov    -0x80(%rbp),%esi
       41 ff d2                callq  *%r10
      
      After the patch:
      
       0f 83 a8 fe ff ff       jae    ffffffff811e626d <shrink_inactive_list+0x1cd>
       8b 9b b8 fe ff ff       mov    -0x148(%rbx),%ebx
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       4c 8b 6d a0             mov    -0x60(%rbp),%r13
       65 ff 05 f5 f7 e2 7e    incl   %gs:0x7ee2f7f5(%rip)        # 15bd0 <__preempt_count>
       4c 8b 35 86 5b 13 01    mov    0x1135b86(%rip),%r14        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       4d 85 f6                test   %r14,%r14
       74 2a                   je     ffffffff811e6411 <shrink_inactive_list+0x371>
       49 8b 06                mov    (%r14),%rax
       8b 4d 8c                mov    -0x74(%rbp),%ecx
       49 8b 7e 08             mov    0x8(%r14),%rdi
       49 83 c6 18             add    $0x18,%r14
       4c 89 ea                mov    %r13,%rdx
       45 89 e1                mov    %r12d,%r9d
       4c 8d 45 b8             lea    -0x48(%rbp),%r8
       89 de                   mov    %ebx,%esi
       51                      push   %rcx
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       ff d0                   callq  *%rax
      
      Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
      Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reported-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d51d1e64
    • Andrey Ryabinin's avatar
      mm/vmscan: don't mess with pgdat->flags in memcg reclaim · e3c1ac58
      Andrey Ryabinin authored
      
      
      memcg reclaim may alter pgdat->flags based on the state of LRU lists in
      a cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep in
      congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
      pages.  But the worst here is PGDAT_CONGESTED, since it may force all
      direct reclaims to stall in wait_iff_congested().  Note that only kswapd
      has the power to clear any of these bits.  This might just never happen if
      cgroup limits are configured that way.  So all direct reclaims will stall
      as long as we have some congested bdi in the system.
      
      Leave all pgdat->flags manipulations to kswapd.  kswapd scans the whole
      pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
      it's reasonable to leave all decisions about node state to kswapd.
      
      Why only kswapd? Why not allow global direct reclaim to change these
      flags? It is because currently only kswapd can clear these flags.  I'm
      less worried about the case when PGDAT_CONGESTED is falsely not set, and
      more worried about the case when it is falsely set.  If a direct reclaimer
      sets PGDAT_CONGESTED, do we have a guarantee that after the congestion
      problem is sorted out, kswapd will be woken up and clear the flag? It
      seems like there is no such guarantee.  E.g.  direct reclaimers may
      eventually balance pgdat and kswapd simply won't wake up (see
      wakeup_kswapd()).
      
      Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
      now loses its congestion throttling mechanism.  Add per-cgroup
      congestion state and throttle cgroup2 reclaimers if memcg is in
      congestion state.
      
      Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
      bits since they alter only kswapd behavior.
      
      The problem could be easily demonstrated by creating heavy congestion in
      one cgroup:
      
          echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
          mkdir -p /sys/fs/cgroup/congester
          echo 512M > /sys/fs/cgroup/congester/memory.max
          echo $$ > /sys/fs/cgroup/congester/cgroup.procs
          # generate a lot of dirty data on a slow HDD
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
          ....
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
      
      and some job in another cgroup:
      
          mkdir /sys/fs/cgroup/victim
          echo 128M > /sys/fs/cgroup/victim/memory.max
      
          # time cat /dev/sda > /dev/null
          real    10m15.054s
          user    0m0.487s
          sys     1m8.505s
      
      According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
      of the time sleeping there.
      
      With the patch, cat doesn't waste time anymore:
      
          # time cat /dev/sda > /dev/null
          real    5m32.911s
          user    0m0.411s
          sys     0m56.664s
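      
      As a rough check of the two timings above, the wall-clock ("real")
      values can be compared directly.  This is only an illustrative sketch;
      the helper name to_seconds() is made up here and is not part of the
      patch.

```python
def to_seconds(minutes, seconds):
    """Convert a `time`-style "XmY.YYYs" reading into seconds."""
    return minutes * 60 + seconds


before = to_seconds(10, 15.054)  # "real" time without the patch
after = to_seconds(5, 32.911)    # "real" time with the patch

# Fraction of the original wall-clock time that the patch removed.
saved_fraction = (before - after) / before
print(f"wall-clock time saved: {saved_fraction:.1%}")
```

      The saving comes out a little under half of the original run time,
      which is in line with the tracepoint observation that 'cat' spent
      about 50% of its time sleeping in wait_iff_congested().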
      
      [aryabinin@virtuozzo.com: congestion state should be per-node]
        Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
      [aryabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
        Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3c1ac58
    • Andrey Ryabinin's avatar
      mm/vmscan: don't change pgdat state on base of a single LRU list state · d108c772
      Andrey Ryabinin authored
      
      
      We have a separate LRU list for each memory cgroup.  Memory reclaim
      iterates over cgroups and calls shrink_inactive_list() for every inactive
      LRU list.  Based on the state of a single LRU, shrink_inactive_list() may
      flag the whole node as dirty, congested or under writeback.  This is
      obviously wrong and hurtful.  It's especially hurtful when we have a
      possibly small congested cgroup in the system.  Then *all* direct reclaims
      waste time by sleeping in wait_iff_congested().  And the more memcgs we
      have in the system, the longer the memory allocation stall is, because
      wait_iff_congested() is called on each lru-list scan.
      
      Sum reclaim stats across all visited LRUs on node and flag node as
      dirty, congested or under writeback based on that sum.  Also call
      congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
      once per lru-list scan.
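      
      The difference between deciding per LRU list and deciding on the sum
      can be modelled in a few lines.  This is a simplified sketch, not the
      kernel code: the ReclaimStat fields, the flag names and the exact
      conditions in node_flags() are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReclaimStat:
    """Per-scan counters; field names are illustrative."""
    taken: int = 0       # pages isolated for reclaim
    dirty: int = 0       # dirty pages seen
    congested: int = 0   # dirty pages backed by a congested device
    writeback: int = 0   # pages under writeback


def sum_stats(per_lru_stats):
    """Accumulate the counters of every visited LRU list on a node."""
    total = ReclaimStat()
    for stat in per_lru_stats:
        total.taken += stat.taken
        total.dirty += stat.dirty
        total.congested += stat.congested
        total.writeback += stat.writeback
    return total


def node_flags(stat):
    """Decide node state from one set of counters (assumed conditions)."""
    flags = set()
    if stat.writeback and stat.writeback == stat.taken:
        flags.add("writeback")
    if stat.dirty and stat.dirty == stat.congested:
        flags.add("congested")
    return flags


small_congested = ReclaimStat(taken=32, dirty=32, congested=32)
large_healthy = ReclaimStat(taken=1024, dirty=100, congested=0)

# Old behaviour: a single LRU list decides for the whole node.
print(node_flags(small_congested))
# New behaviour: the decision is based on the sum over all LRU lists.
print(node_flags(sum_stats([small_congested, large_healthy])))
```

      In this model the small congested cgroup alone would mark the node
      congested, while the summed counters do not.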
      
      This only fixes the problem for the global reclaim case.  Per-cgroup
      reclaim may alter global pgdat flags too, which is wrong.  But that is a
      separate issue and will be addressed in the next patch.
      
      This change will not have any effect on systems with all workload
      concentrated in a single cgroup.
      
      [aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
        Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d108c772
    • Andrey Ryabinin's avatar
      mm/vmscan: remove redundant current_may_throttle() check · c4fd4fa5
      Andrey Ryabinin authored
      
      
      Only kswapd can have non-zero nr_immediate, and current_may_throttle()
      is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
      enough to check stat.nr_immediate only.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4fd4fa5
    • Andrey Ryabinin's avatar
      mm/vmscan: update stale comments · 894befec
      Andrey Ryabinin authored
      
      
      Update some comments that became stale since the transition from
      per-zone to per-node reclaim.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      894befec
    • Roman Gushchin's avatar
      mm: treat indirectly reclaimable memory as free in overcommit logic · d79f7aa4
      Roman Gushchin authored
      
      
      Indirectly reclaimable memory can consume a significant part of total
      memory and it's actually reclaimable (it will be released under actual
      memory pressure).
      
      So, the overcommit logic should treat it as free.
      
      Otherwise, it's possible to cause random system-wide memory allocation
      failures by consuming a significant amount of memory by indirectly
      reclaimable memory, e.g.  dentry external names.
      
      If overcommit policy GUESS is used, it might be used for denial of
      service attack under some conditions.
      
      The following program illustrates the approach.  It causes the kernel to
      allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
      that at some point the overcommit logic may start blocking large
      allocations system-wide.
      
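        #include <stdio.h>
        #include <sys/stat.h>
      
        int main(void)
        {
        	/* 256 path bytes plus room for the terminating NUL */
        	char buf[257];
        	unsigned long i;
        	struct stat statbuf;
      
        	buf[0] = '/';
        	for (i = 1; i < sizeof(buf); i++)
        		buf[i] = '_';
      
        	for (i = 0; 1; i++) {
        		/* overwrite the last 8 name bytes with a unique counter */
        		snprintf(&buf[248], sizeof(buf) - 248, "%8lu", i);
        		stat(buf, &statbuf);
        	}
      
        	return 0;
        }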
      
      This patch in combination with related indirectly reclaimable memory
      patches closes this issue.
      
      Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d79f7aa4
    • Roman Gushchin's avatar
      dcache: account external names as indirectly reclaimable memory · f1782c9b
      Roman Gushchin authored
      
      
      I received a report about suspicious growth of unreclaimable slabs on
      some machines.  I've found that it happens on machines with low memory
      pressure, and these unreclaimable slabs are external names attached to
      dentries.
      
      External names are allocated using the generic kmalloc() function, so they
      are accounted as unreclaimable.  But they are held by dentries, which
      are reclaimable, and they will be reclaimed under memory pressure.
      
      In particular, this breaks MemAvailable calculation, as it doesn't take
      unreclaimable slabs into account.  This leads to a silly situation, when
      a machine is almost idle, has no memory pressure and therefore has a big
      dentry cache.  And the resulting MemAvailable is too low to start a new
      workload.
      
      To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
      used to track the amount of memory, consumed by external names.  The
      counter is increased in the dentry allocation path, if an external name
      structure is allocated; and it's decreased in the dentry freeing path.
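      
      The accounting described above can be pictured with a toy model.
      This is only a sketch under stated assumptions: DentryModel, the
      inline-name threshold INLINE_NAME_LEN and the "length + 1" size are
      hypothetical stand-ins, not the kernel's data structures or sizes.

```python
INLINE_NAME_LEN = 32  # assumed inline capacity, for illustration only


class DentryModel:
    """Toy model of the byte counter; not the kernel data structure."""

    def __init__(self):
        self.indirectly_reclaimable_bytes = 0

    def external_size(self, name):
        # Assumption: a name that does not fit inline needs len + 1 bytes.
        return len(name) + 1 if len(name) >= INLINE_NAME_LEN else 0

    def alloc(self, name):
        # Allocation path: account the external name, if there is one.
        self.indirectly_reclaimable_bytes += self.external_size(name)
        return name

    def free(self, name):
        # Freeing path: undo exactly what the allocation path added.
        self.indirectly_reclaimable_bytes -= self.external_size(name)


cache = DentryModel()
short = cache.alloc("etc")
long_name = cache.alloc("some_long_name_0" + "_" * 220)
print(cache.indirectly_reclaimable_bytes)
cache.free(long_name)
cache.free(short)
print(cache.indirectly_reclaimable_bytes)
```

      Short names never touch the counter in this model, and the counter
      returns to zero once every dentry with an external name is freed.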
      
      To reproduce the problem I've used the following Python script:
      
        import os
      
        for i in range(10000000):
            try:
                name = ("/some_long_name_%d" % i) + "_" * 220
                os.stat(name)
            except OSError:
                pass
      
      Without this patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7811688 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    2753052 kB
      
      With the patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7809516 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7749144 kB
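      
      The effect is easier to see as the drop in MemAvailable caused by
      one run of the script in each case.  A small sketch using only the
      numbers quoted above; the helper name drop_kb() is made up for this
      illustration.

```python
def drop_kb(before_kb, after_kb):
    """Return how much MemAvailable fell during one run, in kB."""
    return before_kb - after_kb


without_patch = drop_kb(7811688, 2753052)
with_patch = drop_kb(7809516, 7749144)

print(f"drop without the patch: {without_patch} kB")
print(f"drop with the patch:    {with_patch} kB")
```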
      
      [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
        Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
      [guro@fb.com: fix indirectly reclaimable memory accounting]
        Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1782c9b
    • Roman Gushchin's avatar
      mm: treat indirectly reclaimable memory as available in MemAvailable · 034ebf65
      Roman Gushchin authored
      
      
      Adjust /proc/meminfo MemAvailable calculation by adding the amount of
      indirectly reclaimable memory (rounded to the PAGE_SIZE).
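      
      As a sketch of the adjustment, the byte counter is converted to
      pages before it is added to the page-based estimate.  The page size
      of 4096 bytes, the rounding direction (down to whole pages) and the
      function name mem_available_pages() are assumptions made for this
      example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def mem_available_pages(available_pages, indirectly_reclaimable_bytes):
    """Add the byte counter to a page-based estimate, rounding down."""
    return available_pages + indirectly_reclaimable_bytes // PAGE_SIZE


print(mem_available_pages(1000, 10000))
```

      Rounding down means a partially filled page of indirectly
      reclaimable bytes is never reported as available.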
      
      Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      034ebf65
    • Roman Gushchin's avatar
      mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES · eb592546
      Roman Gushchin authored
      
      
      Patch series "indirectly reclaimable memory", v2.
      
      This patchset introduces the concept of indirectly reclaimable memory
      and applies it to fix the issue of when a big number of dentries with
      external names can significantly affect the MemAvailable value.
      
      This patch (of 3):
      
      Introduce the concept of indirectly reclaimable memory and add the
      corresponding memory counter and /proc/vmstat item.
      
      Indirectly reclaimable memory is any sort of memory, used by the kernel
      (except for reclaimable slabs), which is actually reclaimable, i.e.  will
      be released under memory pressure.
      
      The counter is in bytes, as it's not always possible to count such
      objects in pages.  The name contains BYTES by analogy to
      NR_KERNEL_STACK_KB.
      
      Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb592546
  2. Apr 11, 2018
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming · f77cfbe6
      Linus Torvalds authored
      Pull c6x updates from Mark Salter.
      
      * tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming:
        c6x: pass endianness info to sparse
        c6x: fix platforms/plldata.c get_coreid build error
        c6x: remove unused KTHREAD_SIZE definition
      f77cfbe6
    • Linus Torvalds's avatar
      Merge tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips · 948869fa
      Linus Torvalds authored
      Pull MIPS updates from James Hogan:
       "These are the main MIPS changes for 4.17. Rough overview:
      
         (1) generic platform: Add support for Microsemi Ocelot SoCs
      
         (2) crypto: Add CRC32 and CRC32C HW acceleration module
      
         (3) Various cleanups and misc improvements
      
        More detailed summary:
      
        Miscellaneous:
         - hang more efficiently on halt/powerdown/restart
         - pm-cps: Block system suspend when a JTAG probe is present
         - expand make help text for generic defconfigs
         - refactor handling of legacy defconfigs
         - determine the entry point from the ELF file header to fix microMIPS
           for certain toolchains
         - introduce isa-rev.h for MIPS_ISA_REV and use to simplify other code
      
        Minor cleanups:
         - DTS: boston/ci20: Unit name cleanups and correction
         - kdump: Make the default for PHYSICAL_START always 64-bit
         - constify gpio_led in Alchemy, AR7, and TXX9
         - silence a couple of W=1 warnings
         - remove duplicate includes
      
        Platform support:
        Generic platform:
         - add support for Microsemi Ocelot
         - dt-bindings: Add vendor prefix for Microsemi Corporation
         - dt-bindings: Add bindings for Microsemi SoCs
         - add ocelot SoC & PCB123 board DTS files
         - MAINTAINERS: Add entry for Microsemi MIPS SoCs
         - enable crc32-mips on r6 configs
      
        ath79:
         - fix AR724X_PLL_REG_PCIE_CONFIG offset
      
        BCM47xx:
         - firmware: Use mac_pton() for MAC address parsing
         - add Luxul XAP1500/XWR1750 WiFi LEDs
         - use standard reset button for Luxul XWR-1750
      
        BMIPS:
         - enable CONFIG_BRCMSTB_PM in bmips_stb_defconfig for build coverage
         - add STB PM, wake-up timer, watchdog DT nodes
      
        Octeon:
         - drop '.' after newlines in printk calls
      
        ralink:
         - pci-mt7621: Enable PCIe on MT7688"
      
      * tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips: (37 commits)
        MIPS: BCM47XX: Use standard reset button for Luxul XWR-1750
        MIPS: BCM47XX: Add Luxul XAP1500/XWR1750 WiFi LEDs
        MIPS: Make the default for PHYSICAL_START always 64-bit
        MIPS: Use the entry point from the ELF file header
        MAINTAINERS: Add entry for Microsemi MIPS SoCs
        MIPS: generic: Add support for Microsemi Ocelot
        MIPS: mscc: Add ocelot PCB123 device tree
        MIPS: mscc: Add ocelot dtsi
        dt-bindings: mips: Add bindings for Microsemi SoCs
        dt-bindings: Add vendor prefix for Microsemi Corporation
        MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
        MIPS: pci-mt7620: Enable PCIe on MT7688
        MIPS: pm-cps: Block system suspend when a JTAG probe is present
        MIPS: VDSO: Replace __mips_isa_rev with MIPS_ISA_REV
        MIPS: BPF: Replace __mips_isa_rev with MIPS_ISA_REV
        MIPS: cpu-features.h: Replace __mips_isa_rev with MIPS_ISA_REV
        MIPS: Introduce isa-rev.h to define MIPS_ISA_REV
        MIPS: Hang more efficiently on halt/powerdown/restart
        FIRMWARE: bcm47xx_nvram: Replace mac address parsing
        MIPS: BMIPS: Add Broadcom STB watchdog nodes
        ...
      948869fa
    • Linus Torvalds's avatar
      Merge tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 2a56bb59
      Linus Torvalds authored
      Pull tracing updates from Steven Rostedt:
       "New features:
      
         - Tom Zanussi's extended histogram work.
      
           This adds the synthetic events to have histograms from multiple
           event data.  Adds triggers "onmatch" and "onmax" to call the
           synthetic events.  Several updates to the histogram code from this.
      
         - Allow a way to nest ring buffer calls in the same context
      
         - Allow absolute time stamps in ring buffer
      
         - Rewrite of filter code parsing based on Al Viro's suggestions
      
         - Setting of trace_clock to global if TSC is unstable (on boot)
      
         - Better OOM handling when allocating large ring buffers
      
         - Added initcall tracepoints (consolidated initcall_debug code with
           them)
      
        And other various fixes and clean ups"
      
      * tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
        init: Have initcall_debug still work without CONFIG_TRACEPOINTS
        init, tracing: Have printk come through the trace events for initcall_debug
        init, tracing: instrument security and console initcall trace events
        init, tracing: Add initcall trace events
        tracing: Add rcu dereference annotation for test func that touches filter->prog
        tracing: Add rcu dereference annotation for filter->prog
        tracing: Fixup logic inversion on setting trace_global_clock defaults
        tracing: Hide global trace clock from lockdep
        ring-buffer: Add set/clear_current_oom_origin() during allocations
        ring-buffer: Check if memory is available before allocation
        lockdep: Add print_irqtrace_events() to __warn
        vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
        tracing: Uninitialized variable in create_tracing_map_fields()
        tracing: Make sure variable string fields are NULL-terminated
        tracing: Add action comparisons when testing matching hist triggers
        tracing: Don't add flag strings when displaying variable references
        tracing: Fix display of hist trigger expressions containing timestamps
        ftrace: Drop a VLA in module_exists()
        tracing: Mention trace_clock=global when warning about unstable clocks
        tracing: Default to using trace_global_clock if sched_clock is unstable
        ...
      2a56bb59
    • Linus Torvalds's avatar
      Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 9f3a0941
      Linus Torvalds authored
      Pull libnvdimm updates from Dan Williams:
       "This cycle was not something I ever want to repeat as there were
        several late changes that have only now just settled.
      
        Half of the branch up to commit d2c997c0 ("fs, dax: use
        page->mapping to warn...") has been in -next for several releases.
        The of_pmem driver and the address range scrub rework were late
        arrivals, and the dax work was scaled back at the last moment.
      
        The of_pmem driver missed a previous merge window due to an oversight.
        A sense of obligation to rectify that miss is why it is included for
        4.17. It has acks from PowerPC folks. Stephen reported a build failure
        that only occurs when merging it with your latest tree, for now I have
        fixed that up by disabling modular builds of of_pmem. A test merge
        with your tree has received a build success report from the 0day robot
        over 156 configs.
      
        An initial version of the ARS rework was submitted before the merge
        window. It is self contained to libnvdimm, a net code reduction, and
        passing all unit tests.
      
        The filesystem-dax changes are based on the wait_var_event()
        functionality from tip/sched/core. However, late review feedback
        showed that those changes regressed truncate performance to a large
        degree. The branch was rewound to drop the truncate behavior change
        and now only includes preparation patches and cleanups (with full acks
        and reviews). The finalization of this dax-dma-vs-truncate work will
        need to wait for 4.18.
      
        Summary:
      
         - A rework of the filesystem-dax implementation provides for detection
           of unmap operations (truncate / hole punch) colliding with
           in-progress device-DMA. A fix for these collisions remains a
           work-in-progress pending resolution of truncate latency and
           starvation regressions.
      
         - The of_pmem driver expands the users of libnvdimm outside of x86
           and ACPI to describe an implementation of persistent memory on
           PowerPC with Open Firmware / Device tree.
      
         - Address Range Scrub (ARS) handling is completely rewritten to
           account for the fact that ARS may run for 100s of seconds and there
           is no platform defined way to cancel it. ARS will now no longer
           block namespace initialization.
      
         - The NVDIMM Namespace Label implementation is updated to handle
           label areas as small as 1K, down from 128K.
      
         - Miscellaneous cleanups and updates to unit test infrastructure"
      
      * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
        libnvdimm, of_pmem: workaround OF_NUMA=n build error
        nfit, address-range-scrub: add module option to skip initial ars
        nfit, address-range-scrub: rework and simplify ARS state machine
        nfit, address-range-scrub: determine one platform max_ars value
        powerpc/powernv: Create platform devs for nvdimm buses
        doc/devicetree: Persistent memory region bindings
        libnvdimm: Add device-tree based driver
        libnvdimm: Add of_node to region and bus descriptors
        libnvdimm, region: quiet region probe
        libnvdimm, namespace: use a safe lookup for dimm device name
        libnvdimm, dimm: fix dpa reservation vs uninitialized label area
        libnvdimm, testing: update the default smart ctrl_temperature
        libnvdimm, testing: Add emulation for smart injection commands
        nfit, address-range-scrub: introduce nfit_spa->ars_state
        libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
        nfit, address-range-scrub: fix scrub in-progress reporting
        dax, dm: allow device-mapper to operate without dax support
        dax: introduce CONFIG_DAX_DRIVER
        fs, dax: use page->mapping to warn if truncate collides with a busy page
        ext2, dax: introduce ext2_dax_aops
        ...
      9f3a0941
    • Linus Torvalds's avatar
      Merge tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · fbe173e3
      Linus Torvalds authored
      Pull RTC updates from Alexandre Belloni:
       "This contains a few series that have been in preparation for a while
        and that will help systems with RTCs that will fail in 2038, 2069 or
        2100.
      
        Subsystem:
         - Add tracepoints
         - Rework of the RTC/nvmem API to allow drivers to discard struct
           nvmem_config after registration
         - New range API: drivers can now expose the useful range of the RTC
         - New offset API: the core is now able to add an offset to the RTC
           time, modifying the supported range.
         - Multiple rtc_time64_to_tm fixes
         - Handle time_t overflow on 32-bit platforms in the core instead of
           letting drivers do crazy things.
         - Remove rtc_control API
      
        New driver:
         - Intersil ISL12026
      
        Drivers:
         - Drivers exposing the RTC non-volatile memory have been converted to
           use nvmem
         - Removed useless time and date validation
         - Removed an indirection pattern that was a cargo cult from ancient
           drivers
         - Removed VLA usage
         - Fixed a possible race condition in probe functions
         - AB8540 support is dropped from ab8500
         - pcf85363 now has alarm support"
      
      * tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (128 commits)
        rtc: snvs: Fix usage of snvs_rtc_enable
        rtc: mt7622: fix module autoloading for OF platform drivers
        rtc: isl12022: use true and false for boolean values
        rtc: ab8500: Drop AB8540 support
        rtc: remove a warning during scripts/kernel-doc step
        rtc: 88pm860x: remove artificial limitation
        rtc: 88pm80x: remove artificial limitation
        rtc: st-lpc: remove artificial limitation
        rtc: mrst: remove artificial limitation
        rtc: mv: remove artificial limitation
        rtc: hctosys: Ensure system time doesn't overflow time_t
        parisc: time: stop validating rtc_time in .read_time
        rtc: pcf85063: fix clearing bits in pcf85063_start_clock
        rtc: at91sam9: Set name of regmap_config
        rtc: s5m: Remove VLA usage
        rtc: s5m: Move enum from rtc.h to rtc-s5m.c
        rtc: remove VLA usage
        rtc: Add useful timestamp definitions
        rtc: Add one offset seconds to expand RTC range
        rtc: Factor out the RTC range validation into rtc_valid_range()
        ...