  1. Aug 08, 2020
    • mm: mmap: merge vma after call_mmap() if possible · d70cec89
      Miaohe Lin authored
      
      
      The vm_flags may be changed after call_mmap() because drivers may set some
      flags for their own purpose.  As a result, the adjacent vmas fail to merge
      because their vm_flags differ and userspace has no way to pass in the flags
      the driver sets.  Try to merge the vma after call_mmap() to fix this issue.
      
      Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1594954065-23733-1-git-send-email-linmiaohe@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d70cec89
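      As a purely illustrative aside (not part of the patch), the VMA-merging
      behaviour the fix above restores can be observed from userspace: two
      adjacent anonymous mappings created by separate mmap() calls with
      identical protection and flags show up as a single VMA in
      /proc/<pid>/maps.  Everything below is standard libc/syscall usage.

      	/* vma_merge_demo.c - build with: cc -o vma_merge_demo vma_merge_demo.c */
      	#include <stdio.h>
      	#include <sys/mman.h>
      	#include <unistd.h>

      	int main(void)
      	{
      		size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);

      		/* Reserve an address window so both mappings land adjacently. */
      		char *base = mmap(NULL, 2 * len, PROT_NONE,
      				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      		if (base == MAP_FAILED) {
      			perror("mmap");
      			return 1;
      		}

      		/* Two separate mmap() calls with identical prot/flags ... */
      		mmap(base, len, PROT_READ | PROT_WRITE,
      		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      		mmap(base + len, len, PROT_READ | PROT_WRITE,
      		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

      		/* ... appear as one rw-p VMA covering [base, base + 2*len). */
      		printf("mappings at %p, inspect /proc/%d/maps\n",
      		       (void *)base, getpid());
      		pause();
      		return 0;
      	}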
    • arm64/mm: enable vmem_altmap support for vmemmap mappings · eee07935
      Anshuman Khandual authored
      
      
      Device memory ranges, when getting hot added into ZONE_DEVICE, might
      require their vmemmap mapping's backing memory to be allocated from their
      own range instead of consuming system memory.  This prevents large system
      memory usage for potentially large device memory ranges.  The device driver
      communicates this request via the vmem_altmap structure.  The architecture
      needs to take this request into account while creating and tearing down
      vmemmap mappings.
      
      This enables vmem_altmap support in vmemmap_populate() and vmemmap_free()
      which includes vmemmap_populate_basepages() used for ARM64_16K_PAGES and
      ARM64_64K_PAGES configs.
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jia He <justin.he@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: http://lkml.kernel.org/r/1594004178-8861-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eee07935
    • mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf() · 56993b4e
      Anshuman Khandual authored
      
      
      There are many instances where vmemmap allocation is switched between
      regular memory and device memory just based on whether an altmap is
      available or not.  vmemmap_alloc_block_buf() is used on various platforms
      to allocate vmemmap mappings.  Let's also enable it to handle altmap based
      device memory allocation along with the existing regular memory
      allocations.  This will help in avoiding the altmap based allocation
      switch in many places.  To summarize, there are two different ways to call
      vmemmap_alloc_block_buf():
      
      vmemmap_alloc_block_buf(size, node, NULL)   /* Allocate from system RAM */
      vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */
      
      This converts altmap_alloc_block_buf() into a static function, drops its
      entry from the header and updates Documentation/vm/memory-model.rst.
      
      Suggested-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jia He <justin.he@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56993b4e
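      The two calling conventions above boil down to "NULL means system RAM,
      non-NULL means carve the memory out of the device-provided pool".  A
      standalone, purely conceptual C sketch of that dispatch follows; the
      struct and function names are made up for illustration and are not the
      kernel's.

      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <string.h>

      	struct altmap_sketch {		/* hypothetical stand-in for struct vmem_altmap */
      		char *base;		/* storage carved out of the device range */
      		size_t free;		/* bytes still available in the pool */
      		size_t off;
      	};

      	static void *alloc_block_buf(size_t size, struct altmap_sketch *altmap)
      	{
      		if (!altmap)			/* NULL altmap: allocate from "system RAM" */
      			return calloc(1, size);
      		if (altmap->free < size)	/* device pool exhausted */
      			return NULL;
      		void *p = altmap->base + altmap->off;
      		altmap->off += size;
      		altmap->free -= size;
      		return memset(p, 0, size);
      	}

      	int main(void)
      	{
      		static char pool[4096];		/* pretend device-backed range */
      		struct altmap_sketch dev = { pool, sizeof(pool), 0 };

      		void *from_ram = alloc_block_buf(256, NULL);	/* "system RAM" path */
      		void *from_dev = alloc_block_buf(256, &dev);	/* "altmap" path */
      		printf("ram=%p dev=%p (device pool at %p)\n",
      		       from_ram, from_dev, (void *)pool);
      		free(from_ram);
      		return 0;
      	}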
    • mm/sparsemem: enable vmem_altmap support in vmemmap_populate_basepages() · 1d9cfee7
      Anshuman Khandual authored
      
      
      Patch series "arm64: Enable vmemmap mapping from device memory", v4.
      
      This series enables vmemmap backing memory allocation from device memory
      ranges on arm64.  But before that, it enables vmemmap_populate_basepages()
      and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
      allocation requests.
      
      This patch (of 3):
      
      vmemmap_populate_basepages() is used across platforms to allocate backing
      memory for vmemmap mappings.  It is used as a standard default choice or
      as a fallback when the intended huge page allocation fails.  It just
      creates the entire vmemmap mapping with base pages (PAGE_SIZE).
      
      On arm64 platforms, vmemmap_populate_basepages() is called instead of the
      platform specific vmemmap_populate() when ARM64_SWAPPER_USES_SECTION_MAPS
      is not enabled as in case for ARM64_16K_PAGES and ARM64_64K_PAGES configs.
      
      At present vmemmap_populate_basepages() does not support allocating from a
      driver defined struct vmem_altmap while trying to create the vmemmap
      mapping for a device memory range.  This prevents the ARM64_16K_PAGES and
      ARM64_64K_PAGES configs on arm64 from supporting device memory with a
      vmem_altmap request.
      
      This enables vmem_altmap support in vmemmap_populate_basepages(), unlocking
      device memory allocation for vmemmap mappings on arm64 platforms with 16K
      or 64K base page configs.
      
      Each architecture should evaluate and decide on subscribing to device
      memory based base page allocation through vmemmap_populate_basepages().
      Hence let's keep it disabled on all archs in order to preserve the existing
      semantics.  A subsequent patch enables it on arm64.
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jia He <justin.he@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Link: http://lkml.kernel.org/r/1594004178-8861-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r/1594004178-8861-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d9cfee7
    • mm: adjust vm_committed_as_batch according to vm overcommit policy · 56f3547b
      Feng Tang authored
      
      
      When checking a performance change for will-it-scale scalability mmap test
      [1], we found very high lock contention for spinlock of percpu counter
      'vm_committed_as':
      
          94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
          48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
          45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
      
      Actually this heavy lock contention is not always necessary.  The
      'vm_committed_as' needs to be very precise when the strict
      OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
      for the percpu counter.
      
      So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
      lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies.  Also
      add a sysctl handler to adjust it when the policy is reconfigured.
      
      A benchmark with the same testcase in [1] shows a 53% improvement on an
      8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server.  We tested with
      test platforms in 0day (server, desktop and laptop), and 80%+ of the
      platforms show improvements with that test.  Whether a platform shows
      improvements depends on whether the test mmap size is bigger than the
      computed batch number.
      
      And if the lift is 16X, 1/3 of the platforms will show improvements,
      though it should help the mmap/unmap usage generally, as Michal Hocko
      mentioned:
      
      : I believe that there are non-synthetic workloads which would benefit from
      : a larger batch.  E.g.  large in-memory databases which do large mmaps
      : during startups from multiple threads.
      
      [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
      
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
      Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
      Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56f3547b
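      The policy being keyed on above is the vm.overcommit_memory sysctl
      (0 = guess, 1 = always, 2 = never), and the counter behind vm_committed_as
      is what /proc/meminfo reports as Committed_AS.  A small C snippet for
      observing both while experimenting; it only reads standard proc files and
      is not part of the patch.

      	#include <stdio.h>
      	#include <string.h>

      	int main(void)
      	{
      		char line[256];
      		FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");

      		if (f && fgets(line, sizeof(line), f))
      			printf("overcommit_memory: %s", line);	/* 0=guess 1=always 2=never */
      		if (f)
      			fclose(f);

      		f = fopen("/proc/meminfo", "r");
      		if (!f)
      			return 1;
      		while (fgets(line, sizeof(line), f))
      			if (!strncmp(line, "Committed_AS:", 13))
      				printf("%s", line);	/* backed by vm_committed_as */
      		fclose(f);
      		return 0;
      	}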
    • percpu_counter: add percpu_counter_sync() · 0a4954a8
      Feng Tang authored
      
      
      A percpu_counter's accuracy is related to its batch size.  For a
      percpu_counter with a big batch, its deviation could be big, so when the
      counter's batch is changed at runtime to a smaller value for better
      accuracy, there can also be a requirement to reduce the accumulated
      deviation.
      
      So add a percpu-counter sync function to be run on each CPU.
      
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Link: http://lkml.kernel.org/r/1594389708-60781-4-git-send-email-feng.tang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a4954a8
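      A rough userspace model of why such a sync step is needed: each "CPU"
      keeps a local delta and only folds it into the shared count once the delta
      reaches the batch, so up to batch * nr_cpus can be sitting in the local
      deltas until something flushes them.  The type and function names below
      are hypothetical and only mimic the idea, not the kernel's percpu_counter
      implementation.

      	#include <stdio.h>
      	#include <stdlib.h>

      	#define NR_CPUS 4

      	struct counter_sketch {
      		long count;		/* shared, "accurate enough" value */
      		long local[NR_CPUS];	/* per-CPU deltas not yet folded in */
      		long batch;		/* fold threshold */
      	};

      	static void counter_add(struct counter_sketch *c, int cpu, long amount)
      	{
      		c->local[cpu] += amount;
      		if (labs(c->local[cpu]) >= c->batch) {	/* flush only at the batch */
      			c->count += c->local[cpu];
      			c->local[cpu] = 0;
      		}
      	}

      	/* Analogue of percpu_counter_sync(): fold every local delta into the
      	 * shared count, e.g. right after the batch is lowered for accuracy. */
      	static void counter_sync(struct counter_sketch *c)
      	{
      		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
      			c->count += c->local[cpu];
      			c->local[cpu] = 0;
      		}
      	}

      	int main(void)
      	{
      		struct counter_sketch c = { .batch = 64 };

      		for (int cpu = 0; cpu < NR_CPUS; cpu++)
      			counter_add(&c, cpu, 63);	/* stays below the batch */
      		printf("before sync: count=%ld (real total %d)\n",
      		       c.count, 63 * NR_CPUS);

      		c.batch = 1;		/* policy switched to the "precise" mode */
      		counter_sync(&c);	/* accumulated deviation removed */
      		printf("after  sync: count=%ld\n", c.count);
      		return 0;
      	}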
    • mm/util.c: make vm_memory_committed() more accurate · 4e2ee51e
      Feng Tang authored
      
      
      percpu_counter_sum_positive() will provide more accurate info.
      
      As with percpu_counter_read_positive(), in the worst case the deviation
      could be 'batch * nr_cpus', which is totalram_pages/256 for now, and will
      be larger when the batch gets enlarged.

      Its time cost is about 800 nanoseconds on a 2C/4T platform and 2~3
      microseconds on a 2S/36C/72T Skylake server in the normal case, and in the
      worst case, where vm_committed_as's spinlock is under severe contention,
      it costs 30~40 microseconds for the 2S/36C/72T Skylake server, which
      should be fine for its only two users: /proc/meminfo and the HyperV
      balloon driver's per-second status trace.
      
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com> # for /proc/meminfo
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/1592725000-73486-3-git-send-email-feng.tang@intel.com
      Link: http://lkml.kernel.org/r/1594389708-60781-3-git-send-email-feng.tang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e2ee51e
    • proc/meminfo: avoid open coded reading of vm_committed_as · 1455083c
      Feng Tang authored
      
      
      Patch series "make vm_committed_as_batch aware of vm overcommit policy", v6.
      
      When checking a performance change for will-it-scale scalability mmap test
      [1], we found very high lock contention for spinlock of percpu counter
      'vm_committed_as':
      
          94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
          48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
          45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
      
      Actually this heavy lock contention is not always necessary.  The
      'vm_committed_as' needs to be very precise when the strict
      OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
      for the percpu counter.
      
      So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
      enlarge it for not-so-strict OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS
      policies.
      
      A benchmark with the same testcase in [1] shows a 53% improvement on an
      8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server.  For that case,
      whether it shows improvements depends on whether the test mmap size is
      bigger than the computed batch number.

      We tested 10+ platforms in 0day (server, desktop and laptop).  If we lift
      it to 64X, 80%+ of the platforms show improvements, and for a 16X lift,
      1/3 of the platforms will show improvements.
      
      And generally it should help the mmap/unmap usage, as Michal Hocko
      mentioned:
      
      : I believe that there are non-synthetic workloads which would benefit
      : from a larger batch. E.g. large in-memory databases which do large
      : mmaps during startups from multiple threads.
      
      Note: there are some style complaints from checkpatch for patch 4, as the
      sysctl handler declaration follows the format of its sibling functions.
      
      [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
      
      This patch (of 4):
      
      Use the existing vm_memory_committed() instead, which is also convenient
      for future change.
      
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/1594389708-60781-1-git-send-email-feng.tang@intel.com
      Link: http://lkml.kernel.org/r/1594389708-60781-2-git-send-email-feng.tang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1455083c
    • mm/mmap: optimize a branch judgment in ksys_mmap_pgoff() · 7bba8f0e
      Zhen Lei authored
      
      
      Look at the pseudo code below.  It's very clear that the judgement
      "!is_file_hugepages(file)" at 3) duplicates the one at 1), so we can use
      "else if" to avoid it.  And the assignment "retval = -EINVAL" at 2) is only
      needed by branch 3), because "retval" will be overwritten at 4).

      No functional change, but it reduces the code size and is arguably clearer
      (the restructured version is sketched after this entry).
      Before:
      text    data     bss     dec     hex filename
      28733    1590       1   30324    7674 mm/mmap.o
      
      After:
      text    data     bss     dec     hex filename
      28701    1590       1   30292    7654 mm/mmap.o
      
      ====pseudo code====:
      	if (!(flags & MAP_ANONYMOUS)) {
      		...
      1)		if (is_file_hugepages(file))
      			len = ALIGN(len, huge_page_size(hstate_file(file)));
      2)		retval = -EINVAL;
      3)		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
      			goto out_fput;
      	} else if (flags & MAP_HUGETLB) {
      		...
      	}
      	...
      
      4)	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
      out_fput:
      	...
      	return retval;
      
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7bba8f0e
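      For comparison, here is a sketch of the restructured branch the changelog
      above describes, in the same pseudo code style (not necessarily the
      literal committed diff):

      	if (!(flags & MAP_ANONYMOUS)) {
      		...
      		if (is_file_hugepages(file)) {
      			len = ALIGN(len, huge_page_size(hstate_file(file)));
      		} else if (unlikely(flags & MAP_HUGETLB)) {
      			retval = -EINVAL;
      			goto out_fput;
      		}
      	} else if (flags & MAP_HUGETLB) {
      		...
      	}
      	...

      	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
      out_fput:
      	...
      	return retval;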
    • mm: move p?d_alloc_track to separate header file · 2a681cfa
      Joerg Roedel authored
      
      
      The functions are only used in two source files, so there is no need for
      them to be in the global <linux/mm.h> header.  Move them to the new
      <linux/pgalloc-track.h> header and include it only where needed.
      
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200609120533...
      2a681cfa
    • mm: move lib/ioremap.c to mm/ · ab05eabf
      Mike Rapoport authored
      
      
      The functionality in lib/ioremap.c deals with pagetables, vmalloc and
      caches, so it naturally belongs to mm/.  Moving it there will also allow
      declaring the p?d_alloc_track functions in a header file inside mm/ rather
      than having those declarations in include/linux/mm.h.
      
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-8-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab05eabf
    • asm-generic: pgalloc: provide generic pgd_free() · f9cb654c
      Mike Rapoport authored
      
      
      Most architectures define pgd_free() as a wrapper for free_page().
      
      Provide a generic version in asm-generic/pgalloc.h and enable its use for
      most architectures.
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-7-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9cb654c
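      Since most architectures' pgd_free() is just a free_page() wrapper, the
      generic fallback is presumably shaped along these lines (a sketch of the
      idea, not a quote of the asm-generic header):

      	static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
      	{
      		free_page((unsigned long)pgd);
      	}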
    • asm-generic: pgalloc: provide generic pud_alloc_one() and pud_free_one() · d9e8b929
      Mike Rapoport authored
      
      
      Several architectures define pud_alloc_one() as a wrapper for
      __get_free_page() and pud_free() as a wrapper for free_page().
      
      Provide a generic implementation in asm-generic/pgalloc.h and use it where
      appropriate.
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-6-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9e8b929
    • asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one() · 1355c31e
      Mike Rapoport authored
      
      
      For most architectures that support >2 levels of page tables,
      pmd_alloc_one() is a wrapper for __get_free_pages(), sometimes with
      __GFP_ZERO and sometimes followed by memset(0) instead.
      
      More elaborate versions on arm64 and x86 account memory for the user page
      tables and call pgtable_pmd_page_ctor() as part of the PMD page
      initialization.
      
      Move the arm64 version to include/asm-generic/pgalloc.h and use the
      generic version on several architectures.
      
      The pgtable_pmd_page_ctor() is a NOP when ARCH_ENABLE_SPLIT_PMD_PTLOCK is
      not enabled, so there is no functional change for most architectures
      except for the addition of __GFP_ACCOUNT for the allocation of user page
      tables.
      
      The pmd_free() is a wrapper for free_page() in all the cases, so no
      functional change here.
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-5-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1355c31e
    • xtensa: switch to generic version of pte allocation · 7278914c
      Mike Rapoport authored
      
      
      xtensa clears PTEs during allocation of the page tables and pte_clear()
      sets the PTE to a non-zero value.  Splitting ptes_clear() helper out of
      pte_alloc_one() and pte_alloc_one_kernel() allows reuse of base generic
      allocation methods (__pte_alloc_one() and __pte_alloc_one_kernel()) and
      the common GFP mask for page table allocations.
      
      The pte_free() and pte_free_kernel() implementations on xtensa are
      identical to the generic ones and can be dropped.
      
      [jcmvbkbc@gmail.com: xtensa: fix closing endif comment]
        Link: http://lkml.kernel.org/r/20200721024751.1257-1-jcmvbkbc@gmail.com
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-4-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7278914c
    • opeinrisc: switch to generic version of pte allocation · fc2a6b83
      Mike Rapoport authored
      
      
      Replace pte_alloc_one(), pte_free() and pte_free_kernel() with the generic
      implementation.  The only actual functional change is the addition of
      __GFP_ACCOUNT for the allocation of the user page tables.

      The pte_alloc_one_kernel() is kept back because its implementation on
      openrisc is different from the generic one.
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Stafford Horne <shorne@gmail.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-3-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc2a6b83
    • mm: remove unneeded includes of <asm/pgalloc.h> · ca15ca40
      Mike Rapoport authored
      
      
      Patch series "mm: cleanup usage of <asm/pgalloc.h>"
      
      Most architectures have very similar versions of pXd_alloc_one() and
      pXd_free_one() for intermediate levels of page table.  These patches add
      generic versions of these functions in <asm-generic/pgalloc.h> and enable
      use of the generic functions where appropriate.
      
      In addition, the functions declared and defined in the <asm/pgalloc.h>
      headers are used mostly by core mm and by early mm initialization in arch
      code, and there is no actual reason to have <asm/pgalloc.h> included all
      over the place.  The first patch in this series removes the unneeded
      includes of <asm/pgalloc.h>.
      
      In the end it didn't work out as neatly as I hoped and moving
      pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
      unnecessary changes to arches that have custom page table allocations, so
      I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
      to mm/.
      
      This patch (of 8):
      
      In most cases <asm/pgalloc.h> header is required only for allocations of
      page table memory.  Most of the .c files that include that header do not
      use symbols declared in <asm/pgalloc.h> and do not require that header.
      
      As for the other header files that used to include <asm/pgalloc.h>, it is
      possible to move that include into the .c file that actually uses symbols
      from <asm/pgalloc.h> and drop the include from the header file.
      
      The process was somewhat automated using
      
      	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
                      $(grep -L -w -f /tmp/xx \
                              $(git grep -E -l '[<"]asm/pgalloc\.h'))
      
      where /tmp/xx contains all the symbols defined in
      arch/*/include/asm/pgalloc.h.
      
      [rppt@linux.ibm.com: fix powerpc warning]
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca15ca40
    • mm/memory.c: make remap_pfn_range() reject unaligned addr · 0c4123e3
      Alex Zhang authored
      
      
      This function implicitly assumes that the addr passed in is page aligned.
      A non-page-aligned addr could ultimately cause a kernel bug in
      remap_pte_range, as the exit condition of its loop may never be satisfied.
      This patch documents the requirement and adds an explicit check for it.
      
      Signed-off-by: Alex Zhang <zhangalex@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200617233512.177519-1-zhangalex@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c4123e3
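      The added sanity check amounts to rejecting a non-page-aligned addr up
      front, before any page-granular work is done.  A tiny userspace analogue
      of the alignment test (the helper name is made up; the page size is
      obtained via sysconf):

      	#include <stdbool.h>
      	#include <stdint.h>
      	#include <stdio.h>
      	#include <unistd.h>

      	/* True when addr sits on a page boundary, i.e. its low bits are zero. */
      	static bool page_aligned(uintptr_t addr)
      	{
      		uintptr_t mask = (uintptr_t)sysconf(_SC_PAGESIZE) - 1;

      		return (addr & mask) == 0;
      	}

      	int main(void)
      	{
      		printf("0x1000: %s\n", page_aligned(0x1000) ? "ok" : "reject (-EINVAL)");
      		printf("0x1234: %s\n", page_aligned(0x1234) ? "ok" : "reject (-EINVAL)");
      		return 0;
      	}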
    • mm: remove redundant check non_swap_entry() · 463b7a17
      Ralph Campbell authored
      
      
      In zap_pte_range(), the check for non_swap_entry() and
      is_device_private_entry() is unnecessary since the latter is sufficient to
      determine if the page is a device private page.  Remove the test for
      non_swap_entry() to simplify the code and for clarity.
      
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/20200615175405.4613-1-rcampbell@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      463b7a17
    • mm/page_counter.c: fix protection usage propagation · a6f23d14
      Michal Koutný authored
      When a workload runs in cgroups that aren't directly below the root cgroup
      and their parent specifies reclaim protection, that protection may end up
      ineffective.

      The reason is that propagate_protected_usage() is not called all the way up
      the hierarchy.  All the protected usage is incorrectly accumulated in the
      workload's parent.  This means that siblings_low_usage is overestimated and
      the effective protection underestimated.  Even though it is a transitional
      phenomenon (the uncharge path does the correct propagation and fixes the
      wrong children_low_usage), it can undermine the intended protection
      unexpectedly.
      
      We have noticed this problem while seeing a swap out in a descendant of a
      protected memcg (intermediate node) while the parent was conveniently
      under its protection limit and the memory pressure was external to that
      hierarchy.  Michal has pinpointed this down to the wrong
      siblings_low_usage which led to the unwanted reclaim.
      
      The fix is simply updating children_low_usage in respective ancestors also
      in the charging path.
      
      Fixes: 23067153 ("mm: memory.low hierarchical behavior")
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6f23d14
    • mm: memcontrol: don't count limit-setting reclaim as memory pressure · e22c6ed9
      Johannes Weiner authored
      
      
      When an outside process lowers one of the memory limits of a cgroup (or
      uses the force_empty knob in cgroup1), direct reclaim is performed in the
      context of the write(), in order to directly enforce the new limit and
      have it being met by the time the write() returns.
      
      Currently, this reclaim activity is accounted as memory pressure in the
      cgroup that the writer(!) belongs to.  This is unexpected.  It
      specifically causes problems for senpai
      (https://github.com/facebookincubator/senpai), which is an agent that
      routinely adjusts the memory limits and performs associated reclaim work
      in tens or even hundreds of cgroups running on the host.  The cgroup that
      senpai is running in itself will report elevated levels of memory
      pressure, even though it itself is under no memory shortage or any sort of
      distress.
      
      Move the psi annotation from the central cgroup reclaim function to
      callsites in the allocation context, and thereby no longer count any
      limit-setting reclaim as memory pressure.  If the newly set limit pushes
      the workload inside the cgroup into direct reclaim, that of course will
      continue to count as memory pressure.
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e22c6ed9
    • mm: memcontrol: restore proper dirty throttling when memory.high changes · 19ce33ac
      Johannes Weiner authored
      Commit 8c8c383c ("mm: memcontrol: try harder to set a new
      memory.high") inadvertently removed a callback to recalculate the
      writeback cache size in light of a newly configured memory.high limit.
      
      Without letting the writeback cache know about a potentially heavily
      reduced limit, it may permit too many dirty pages, which can cause
      unnecessary reclaim latencies or even avoidable OOM situations.
      
      This was spotted while reading the code; it hasn't knowingly caused any
      problems in practice so far.
      
      Fixes: 8c8c383c ("mm: memcontrol: try harder to set a new memory.high")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200728135210.379885-1-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19ce33ac
    • memcg, oom: check memcg margin for parallel oom · 1378b37d
      Yafang Shao authored
      
      
      Memcg oom killer invocation is synchronized by the global oom_lock, and
      tasks sleep on the lock while somebody is selecting the victim, potentially
      racing with the oom_reaper releasing the victim's memory.  This can result
      in a pointless oom killer invocation because a waiter might be racing with
      the oom_reaper:
      
              P1              oom_reaper              P2
                              oom_reap_task           mutex_lock(oom_lock)
                                                      out_of_memory # no victim because we have one already
                        __oom_reap_task_mm      mutex_unlock(oom_lock)
       mutex_lock(oom_lock)
                              set MMF_OOM_SKIP
       select_bad_process
       # finds a new victim
      
      The page allocator prevents this race by trying to allocate again after the
      lock has been acquired (in __alloc_pages_may_oom), which acts as a last
      minute check.  Moreover, the page allocator doesn't block on the oom_lock
      and simply retries the whole reclaim process.
      
      Memcg oom killer should do the last minute check as well.  Call
      mem_cgroup_margin to do that.  Trylock on the oom_lock could be done as
      well but this doesn't seem to be necessary at this stage.
      
      [mhocko@kernel.org: commit log]
      
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1594735034-19190-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1378b37d
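      The shape of the last-minute check described above, sketched around the
      mem_cgroup_margin() helper the changelog names (abbreviated, not the
      literal diff):

      	mutex_lock(&oom_lock);
      	/*
      	 * Re-check the memcg margin after taking oom_lock: the previous lock
      	 * holder or the oom_reaper may already have released enough memory,
      	 * which would make another kill pointless.
      	 */
      	if (mem_cgroup_margin(memcg) >= (1 << order))
      		goto unlock;

      	... select a victim and invoke the oom killer ...

      unlock:
      	mutex_unlock(&oom_lock);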
    • mm, memcg: decouple e{low,min} state mutations from protection checks · 45c7f7e1
      Chris Down authored
      
      
      mem_cgroup_protected is currently used both to set the effective low and
      min values and to return a mem_cgroup_protection based on the result.  As a
      user, this can be a little unexpected: it appears to be a simple predicate
      function, if not for the big warning in the comment above it about the
      order in which it must be executed.
      
      This change makes it so that we separate the state mutations from the
      actual protection checks, which makes it more obvious where we need to be
      careful mutating internal state, and where we are simply checking and
      don't need to worry about that.
      
      [mhocko@suse.com - don't check protection on root memcgs]
      
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45c7f7e1
    • mm, memcg: avoid stale protection values when cgroup is above protection · 22f7496f
      Yafang Shao authored
      Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.
      
      This series contains a fix for an edge case in my earlier protection
      calculation patches, and a patch to make the area overall a little more
      robust, to hopefully help avoid this in the future.
      
      This patch (of 2):
      
      A cgroup can have both memory protection and a memory limit to isolate it
      from its siblings in both directions - for example, to prevent it from
      being shrunk below 2G under high pressure from outside, but also from
      growing beyond 4G under low pressure.
      
      Commit 9783aa99 ("mm, memcg: proportional memory.{low,min} reclaim")
      implemented proportional scan pressure so that multiple siblings in excess
      of their protection settings don't get reclaimed equally but instead in
      accordance to their unprotected portion.
      
      During limit reclaim, this proportionality shouldn't apply of course:
      there is no competition, all pressure is from within the cgroup and should
      be applied as such.  Reclaim should operate at full efficiency.
      
      However, mem_cgroup_protected() never expected anybody to look at the
      effective protection values when it indicated that the cgroup is above its
      protection.  As a result, a query during limit reclaim may return stale
      protection values that were calculated by a previous reclaim cycle in
      which the cgroup did have siblings.
      
      When this happens, reclaim is unnecessarily hesitant and potentially slow
      to meet the desired limit.  In theory this could lead to premature OOM
      kills, although it's not obvious this has occurred in practice.
      
      Work around the problem by special-casing reclaim roots in
      mem_cgroup_protection.  These memcgs never participate in the reclaim
      protection because the reclaim is internal.

      We have to ignore effective protection values for reclaim roots because
      mem_cgroup_protected might be called from racing reclaim contexts with
      different roots.  The calculation relies on a root -> leaf tree traversal,
      therefore the top-down reclaim protection invariants should hold.  The only
      exception is the reclaim root, which should have its effective protection
      set to 0, but that would be problematic for the following setup:
      
       Let's have global and A's reclaim in parallel:
        |
        A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
        |\
        | C (low = 1G, usage = 2.5G)
        B (low = 1G, usage = 0.5G)
      
       for A reclaim we have
       B.elow = B.low
       C.elow = C.low
      
       For the global reclaim
       A.elow = A.low
       B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
       C.elow = min(C.usage, C.low)
      
       With the effective values resetting we have A reclaim
       A.elow = 0
       B.elow = B.low
       C.elow = C.low
      
       and global reclaim could see the above and then
       B.elow = C.elow = 0 because children_low_usage > A.elow
      
      Which means that protected memcgs would get reclaimed.
      
      In the future we would like to make mem_cgroup_protected more robust
      against racing reclaim contexts, but that is likely a more complex solution
      than this simple workaround.
      
      [hannes@cmpxchg.org - large part of the changelog]
      [mhocko@suse.com - workaround explanation]
      [chris@chrisdown.name - retitle]
      
      Fixes: 9783aa99 ("mm, memcg: proportional memory.{low,min} reclaim")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22f7496f
    • mm, memcg: unify reclaim retry limits with page allocator · d977aa93
      Chris Down authored
      Reclaim retries have been set to 5 since the beginning of time in
      commit 66e1707b ("Memory controller: add per cgroup LRU and
      reclaim").  However, we now have a generally agreed-upon standard for
      page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later
      in commit 0a0337e0 ("mm, oom: rework oom detection").
      
      In the absence of a compelling reason to declare an OOM earlier in memcg
      context than page allocator context, it seems reasonable to supplant
      MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page
      allocator and memcg internals more similar in semantics when reclaim
      fails to produce results, avoiding premature OOMs or throttling.
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d977aa93
    • mm, memcg: reclaim more aggressively before high allocator throttling · b3ff9291
      Chris Down authored
      
      
      Patch series "mm, memcg: reclaim harder before high throttling", v2.
      
      This patch (of 2):
      
      In Facebook production, we've seen cases where cgroups have been put into
      allocator throttling even when they appear to have a lot of slack file
      caches which should be trivially reclaimable.
      
      Looking more closely, the problem is that we only try a single cgroup
      reclaim walk for each return to usermode before calculating whether or not
      we should throttle.  This single attempt doesn't produce enough pressure
      to shrink for cgroups with a rapidly growing amount of file caches prior
      to entering allocator throttling.
      
      As an example, we see that threads in an affected cgroup are stuck in
      allocator throttling:
      
          # for i in $(cat cgroup.threads); do
          >     grep over_high "/proc/$i/stack"
          > done
          [<0>] mem_cgroup_handle_over_high+0x10b/0x150
          [<0>] mem_cgroup_handle_over_high+0x10b/0x150
          [<0>] mem_cgroup_handle_over_high+0x10b/0x150
      
      ...however, there is no I/O pressure reported by PSI, despite a lot of
      slack file pages:
      
          # cat memory.pressure
          some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
          full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
          # cat io.pressure
          some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
          full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
          # grep _file memory.stat
          inactive_file 1370939392
          active_file 661635072
      
      This patch changes the behaviour to retry reclaim either until the current
      task goes below the 10ms grace period, or we are making no reclaim
      progress at all.  In the latter case, we enter reclaim throttling as
      before.
      
      To a user, there's no intuitive reason for the reclaim behaviour to differ
      from hitting memory.high as part of a new allocation, as opposed to
      hitting memory.high because someone lowered its value.  As such this also
      brings an added benefit: it unifies the reclaim behaviour between the two.
      
      There's precedent for this behaviour: we already do reclaim retries when
      writing to memory.{high,max}, in max reclaim, and in the page allocator
      itself.
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3ff9291
    • mm: memcontrol: avoid workload stalls when lowering memory.high · 536d3bf2
      Roman Gushchin authored
      
      
      The memory.high limit is implemented in a way such that the kernel
      penalizes all threads which are allocating memory over the limit.  Forcing
      all threads into synchronous reclaim and adding some artificial delays
      slows down memory consumption and potentially gives userspace oom
      handlers/resource control agents some time to react.
      
      This works nicely if memory usage is hitting the limit from below;
      however, it works sub-optimally if a user adjusts memory.high to a value
      way below the current memory usage.  It basically forces all workload
      threads (doing any memory allocations) into synchronous reclaim and sleep.
      This makes the workload completely unresponsive for a long period of time
      and can also lead to system-wide contention on lru locks.  It can happen
      even if the workload is not actually tight on memory and has, for example,
      a ton of cold pagecache.
      
      In the current implementation, writing to memory.high causes an atomic
      update of the page counter's high value, followed by an attempt to reclaim
      enough memory to fit into the new limit.  To fix the problem described
      above, all we need to do is change the order of execution: try to push the
      memory usage under the limit first, and only then set the new high limit.
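
      A compilable toy model of that reordering is sketched below; usage, high,
      set_high() and reclaim_down_to() are illustrative placeholders rather than
      the real page_counter API.

          #include <stdio.h>

          /* Placeholder "page counter" state and reclaim machinery. */
          static unsigned long usage = 2000, high = 4000;

          static void set_high(unsigned long v) { high = v; }

          static void reclaim_down_to(unsigned long target)
          {
              while (usage > target)
                  usage -= 100;       /* pretend each pass reclaims some pages */
          }

          /* Old order: publish the lower limit first, then reclaim.  Threads
           * allocating in the meantime see usage far above high and sleep. */
          static void write_high_old(unsigned long new_high)
          {
              set_high(new_high);
              reclaim_down_to(new_high);
          }

          /* New order: push usage under the target first, then publish the
           * limit, so the workload never observes a huge usage/high gap. */
          static void write_high_new(unsigned long new_high)
          {
              reclaim_down_to(new_high);
              set_high(new_high);
          }

          int main(void)
          {
              write_high_old(1500);   /* shown only for contrast */
              printf("old order: usage=%lu high=%lu\n", usage, high);
              write_high_new(500);
              printf("new order: usage=%lu high=%lu\n", usage, high);
              return 0;
          }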
      
      Reported-by: default avatarDomas Mituzas <domas@fb.com>
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Chris Down <chris@chrisdown.name>
      Link: http://lkml.kernel.org/r/20200709194718.189231-1-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      536d3bf2
    • Roman Gushchin's avatar
      mm: kmem: switch to static_branch_likely() in memcg_kmem_enabled() · eda330e5
      Roman Gushchin authored
      
      
      Currently memcg_kmem_enabled() is optimized for the case where kernel
      memory accounting is off.  It was that way for a long time, and arguably
      the reason was that kernel memory accounting was initially an opt-in
      feature.  However, it is now on by default on both cgroup v1 and cgroup
      v2, and it's on for all cgroups.  So let's switch over to
      static_branch_likely() to reflect this fact.

      It's unlikely there is a significant performance difference, as the cost
      of a memory allocation and its accounting significantly exceeds the cost
      of a jump.  However, the conversion makes the code look more logical.
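
      In abbreviated form the change is a one-line flip of the branch hint on
      the existing static key (a sketch of the memcg_kmem_enabled() helper; the
      surrounding #ifdefs and header context are omitted):

          static inline bool memcg_kmem_enabled(void)
          {
          -       return static_branch_unlikely(&memcg_kmem_enabled_key);
          +       return static_branch_likely(&memcg_kmem_enabled_key);
          }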
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200707173612.124425-3-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eda330e5
    • Roman Gushchin's avatar
      mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() · 74d555be
      Roman Gushchin authored
      
      
      charge_slab_page() and uncharge_slab_page() are not related anymore to
      memcg charging and uncharging.  In order to make their names less
      confusing, let's rename them to account_slab_page() and
      unaccount_slab_page() respectively.
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74d555be
    • Roman Gushchin's avatar
      mm: memcg/slab: remove unused argument by charge_slab_page() · 84950480
      Roman Gushchin authored
      
      
      charge_slab_page() is not using the gfp argument anymore,
      remove it.
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84950480
    • Shakeel Butt's avatar
      mm: memcontrol: account kernel stack per node · 991e7673
      Shakeel Butt authored
      
      
      Currently the kernel stack is accounted per-zone.  There is no need to do
      that.  In addition, because it is per-zone, memcg has to keep a separate
      MEMCG_KERNEL_STACK_KB.  Make the stat per-node and deprecate
      MEMCG_KERNEL_STACK_KB, as memcg_stat_item is an extension of
      node_stat_item.  Also localize the kernel stack stat updates to
      account_kernel_stack().
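
      A simplified sketch of the localized update (error handling and the
      CONFIG_VMAP_STACK case are omitted; mod_lruvec_page_state() and
      NR_KERNEL_STACK_KB are real upstream symbols, but the body here is
      abbreviated):

          static void account_kernel_stack(struct task_struct *tsk, int account)
          {
              struct page *first_page = virt_to_page(task_stack_page(tsk));

              /*
               * A single per-node (lruvec) counter now covers both the node
               * and its memcg, so no separate MEMCG_KERNEL_STACK_KB update
               * is needed.
               */
              mod_lruvec_page_state(first_page, NR_KERNEL_STACK_KB,
                                    account * (THREAD_SIZE / 1024));
          }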
      
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      991e7673
    • Roman Gushchin's avatar
      tools/cgroup: add memcg_slabinfo.py tool · fbc1ac9d
      Roman Gushchin authored
      
      
      Add a drgn-based tool to display slab information for a given memcg.  It
      can replace the cgroup v1 memory.kmem.slabinfo interface on cgroup v2, but
      in a more flexible way.

      It currently supports only the SLUB configuration, but SLAB support can be
      trivially added later.
      
      Output example:
      $ sudo ./tools/cgroup/memcg_slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
      shmem_inode_cache     92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
      eventpoll_pwq         56     56     72   56    1 : tunables    0    0    0 : slabdata      1      1      0
      eventpoll_epi         32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
      kmalloc-8              0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-96             0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-2048           0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-64           128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
      mm_struct            160    160   1024   32    8 : tunables    0    0    0 : slabdata      5      5      0
      signal_cache          96     96   1024   32    8 : tunables    0    0    0 : slabdata      3      3      0
      sighand_cache         45     45   2112   15    8 : tunables    0    0    0 : slabdata      3      3      0
      files_cache          138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
      task_delay_info      153    153     80   51    1 : tunables    0    0    0 : slabdata      3      3      0
      task_struct           27     27   3520    9    8 : tunables    0    0    0 : slabdata      3      3      0
      radix_tree_node       56     56    584   28    4 : tunables    0    0    0 : slabdata      2      2      0
      btrfs_inode          140    140   1136   28    8 : tunables    0    0    0 : slabdata      5      5      0
      kmalloc-1024          64     64   1024   32    8 : tunables    0    0    0 : slabdata      2      2      0
      kmalloc-192           84     84    192   42    2 : tunables    0    0    0 : slabdata      2      2      0
      inode_cache           54     54    600   27    4 : tunables    0    0    0 : slabdata      2      2      0
      kmalloc-128            0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
      kmalloc-512           32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
      skbuff_head_cache     32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
      sock_inode_cache      46     46    704   46    8 : tunables    0    0    0 : slabdata      1      1      0
      cred_jar             378    378    192   42    2 : tunables    0    0    0 : slabdata      9      9      0
      proc_inode_cache      96     96    672   24    4 : tunables    0    0    0 : slabdata      4      4      0
      dentry               336    336    192   42    2 : tunables    0    0    0 : slabdata      8      8      0
      filp                 697    864    256   32    2 : tunables    0    0    0 : slabdata     27     27      0
      anon_vma             644    644     88   46    1 : tunables    0    0    0 : slabdata     14     14      0
      pid                 1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
      vm_area_struct      1200   1200    200   40    2 : tunables    0    0    0 : slabdata     30     30      0
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-20-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fbc1ac9d
    • Roman Gushchin's avatar
      kselftests: cgroup: add kernel memory accounting tests · 933dc80e
      Roman Gushchin authored
      
      
      Add some tests to cover the kernel memory accounting functionality.  They
      cover some issues (and changes) we had recently.
      
      1) A test which allocates a lot of negative dentries, checks memcg slab
         statistics, creates memory pressure by setting memory.max to some low
         value and checks that some number of slabs was reclaimed (a minimal
         userspace sketch of the negative-dentry step follows this list).
      
      2) A test which covers side effects of memcg destruction: it creates
         and destroys a large number of sub-cgroups, each containing a
         multi-threaded workload which allocates and releases some kernel
         memory.  Then it checks that the charge and memory.stat values do add
         up on the parent level.
      
      3) A test which reads /proc/kpagecgroup and implicitly checks that it
         doesn't crash the system.
      
      4) A test which spawns a large number of threads and checks that the
         kernel stacks accounting works as expected.
      
      5) A test which checks that live charged slab objects do not prevent the
         memory cgroup from being released after it has been deleted by a user.
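
      For reference, the negative-dentry allocation step in test 1 needs nothing
      more exotic than stat() on names that don't exist; a minimal, hypothetical
      userspace sketch (not the selftest's actual code):

          #include <stdio.h>
          #include <sys/stat.h>

          /*
           * Each lookup of a non-existent name leaves a negative dentry behind,
           * charged to the calling cgroup's slab memory.
           */
          int main(void)
          {
              struct stat st;
              char path[64];

              for (int i = 0; i < 100000; i++) {
                  snprintf(path, sizeof(path), "/tmp/does-not-exist-%d", i);
                  stat(path, &st);    /* expected to fail with ENOENT */
              }
              return 0;
          }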
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-19-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      933dc80e
    • Roman Gushchin's avatar
      mm: memcg/slab: use a single set of kmem_caches for all allocations · 10befea9
      Roman Gushchin authored
      
      
      Instead of having two sets of kmem_caches (one for system-wide and
      non-accounted allocations, and a second shared by all accounted
      allocations), we can use just one.
      
      The idea is simple: space for obj_cgroup metadata can be allocated on
      demand and filled only for accounted allocations.
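
      A rough sketch of the lazy allocation (attach_obj_cgroups_vec() is a
      hypothetical name; the page->obj_cgroups vector and the kcalloc_node()
      pattern follow this series, but the upstream helper differs in detail):

          static int attach_obj_cgroups_vec(struct page *page,
                                            unsigned int objects, gfp_t gfp)
          {
              struct obj_cgroup **vec;

              /* One pointer per object, filled only for accounted allocations. */
              vec = kcalloc_node(objects, sizeof(*vec), gfp, page_to_nid(page));
              if (!vec)
                  return -ENOMEM;

              /* Another CPU may have installed a vector concurrently. */
              if (cmpxchg(&page->obj_cgroups, NULL, vec))
                  kfree(vec);
              return 0;
          }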
      
      This allows removing a bunch of code which is required to handle
      kmem_cache clones for accounted allocations.  There is no longer a need to
      create them, accumulate statistics, propagate attributes, etc.  It's quite
      a significant simplification.

      Also, because the total number of slab_caches is nearly halved (not all
      kmem_caches have a memcg clone), some additional memory savings are
      expected.  On my devvm it additionally saves about 3.5% of slab memory.
      
      [guro@fb.com: fix build on MIPS]
        Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com
      
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10befea9
    • Roman Gushchin's avatar
      mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() · 15999eef
      Roman Gushchin authored
      
      
      memcg_accumulate_slabinfo() is never called with a non-root kmem_cache as
      its first argument, so the is_root_cache(s) check is redundant and can be
      removed without any functional change.
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-17-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15999eef
    • Roman Gushchin's avatar
      mm: memcg/slab: deprecate slab_root_caches · c7094406
      Roman Gushchin authored
      
      
      Currently there are two lists of kmem_caches:
      1) slab_caches, which contains all kmem_caches,
      2) slab_root_caches, which contains only root kmem_caches.
      
      And there is some preprocessor magic to have a single list if
      CONFIG_MEMCG_KMEM isn't enabled.
      
      It was required earlier because the number of non-root kmem_caches was
      proportional to the number of memory cgroups and could reach really big
      values.  Now that it cannot exceed the number of root kmem_caches, there
      is really no reason to maintain two lists.
      
      We never iterate over the slab_root_caches list on any hot paths, so it's
      perfectly fine to iterate over slab_caches and filter out non-root
      kmem_caches.
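
      The replacement pattern is just a filtered walk of slab_caches (a sketch;
      slab_mutex, the slab_caches list and is_root_cache() exist upstream, but
      the printing is illustrative):

          static void print_root_caches(void)
          {
              struct kmem_cache *s;

              mutex_lock(&slab_mutex);
              list_for_each_entry(s, &slab_caches, list) {
                  if (!is_root_cache(s))
                      continue;       /* skip memcg (non-root) kmem_caches */
                  pr_info("%s\n", s->name);
              }
              mutex_unlock(&slab_mutex);
          }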
      
      This allows removing a lot of config-dependent code and two pointers from
      the kmem_cache structure.
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7094406
    • Roman Gushchin's avatar
      mm: memcg/slab: remove memcg_kmem_get_cache() · 272911a4
      Roman Gushchin authored
      
      
      The memcg_kmem_get_cache() function became really trivial, so let's just
      inline it into its single call site: memcg_slab_pre_alloc_hook().

      It will make the code less bulky and can also help the compiler generate
      better code.
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      272911a4
    • Roman Gushchin's avatar
      mm: memcg/slab: simplify memcg cache creation · d797b7d0
      Roman Gushchin authored
      
      
      Because the number of non-root kmem_caches no longer depends on the number
      of memory cgroups and is generally not very big, there is no longer a need
      for a dedicated workqueue.

      Also, as there is no longer a need to pass any arguments to
      memcg_create_kmem_cache() except the root kmem_cache, it's possible to
      just embed the work structure into the kmem_cache and avoid dynamically
      allocating it.
      
      This will also simplify the synchronization: for each root kmem_cache
      there is only one work.  So there will be no more concurrent attempts to
      create a non-root kmem_cache for a root kmem_cache: the second and all
      following attempts to queue the work will fail.
      
      On the kmem_cache destruction path there is no longer a need to call the
      expensive flush_workqueue() and wait for all pending work items to finish.
      Instead, cancel_work_sync() can be used to cancel/wait for just one work
      item.
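
      The mechanics boil down to embedding one work item per root cache and
      cancelling just that item on destruction.  A sketch with hypothetical
      structure and function names; only INIT_WORK(), queue_work() and
      cancel_work_sync() are the real workqueue API:

          struct my_root_cache {
              /* ...the usual kmem_cache fields... */
              struct work_struct memcg_work;  /* embedded: no dynamic allocation */
          };

          static void memcg_cache_create_func(struct work_struct *work)
          {
              struct my_root_cache *root =
                      container_of(work, struct my_root_cache, memcg_work);
              /* create the single non-root clone for 'root' here */
              (void)root;
          }

          static void init_root_cache(struct my_root_cache *root)
          {
              INIT_WORK(&root->memcg_work, memcg_cache_create_func);
          }

          static void request_memcg_cache(struct my_root_cache *root)
          {
              /* queue_work() returns false if this work is already pending, so
               * concurrent creation attempts for the same root simply no-op. */
              queue_work(system_wq, &root->memcg_work);
          }

          static void shutdown_root_cache(struct my_root_cache *root)
          {
              /* No flush_workqueue(): wait only for this root's own work item. */
              cancel_work_sync(&root->memcg_work);
          }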
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d797b7d0
    • Roman Gushchin's avatar
      mm: memcg/slab: use a single set of kmem_caches for all accounted allocations · 9855609b
      Roman Gushchin authored
      
      
      This is a fairly big but mostly "red" (deletion-heavy) patch, which makes
      all accounted slab allocations use a single set of kmem_caches instead of
      creating a separate set for each memory cgroup.
      
      Because the number of non-root kmem_caches is now capped by the number of
      root kmem_caches, there is no need to shrink or destroy them prematurely.
      They can simply be destroyed together with their root counterparts.  This
      allows dramatically simplifying the management of non-root kmem_caches and
      deleting a ton of code.
      
      This patch performs the following changes:
      1) introduces memcg_params.memcg_cache pointer to represent the
         kmem_cache which will be used for all non-root allocations
      2) reuses the existing memcg kmem_cache creation mechanism
         to create memcg kmem_cache on the first allocation attempt
      3) memcg kmem_caches are named <kmemcache_name>-memcg,
         e.g. dentry-memcg
      4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
         or schedule its creation and return the root cache
      5) removes almost all non-root kmem_cache management code
         (separate refcounter, reparenting, shrinking, etc)
      6) makes slab debugfs display the root_mem_cgroup css id and never
         show the :dead and :deact flags in the memcg_slabinfo attribute.
      
      Following patches in the series will simplify the kmem_cache creation.
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9855609b