  1. May 01, 2021
    • mm/sparse: add the missing sparse_buffer_fini() in error branch · 2284f47f
      Wang Wensheng authored
      sparse_buffer_init() and sparse_buffer_fini() should appear in pairs, or a
      WARN is triggered the next time sparse_buffer_init() runs.
      
      Add the missing sparse_buffer_fini() in error branch.
      
      Link: https://lkml.kernel.org/r/20210325113155.118574-1-wangwensheng4@huawei.com
      Fixes: 85c77f79 ("mm/sparse: add new sparse_init_nid() and sparse_init()")
      Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/dmapool: switch from strlcpy to strscpy · 943f229e
      Zhiyuan Dai authored
      
      
      strlcpy is marked as deprecated in Documentation/process/deprecated.rst,
      and there is no functional difference when the caller expects truncation
      (when not checking the return value). strscpy is relatively better as it
      also avoids scanning the whole source string.
      
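      A minimal sketch of this kind of conversion (illustrative only, not the
      actual dmapool hunk; the struct and function names below are made up, and
      strscpy()/strlcpy() come from <linux/string.h>):

        struct example_pool {
            char name[32];                  /* fixed-size destination buffer */
        };

        static void pool_set_name(struct example_pool *pool, const char *name)
        {
            /* Before: strlcpy() always walks the whole source string to
             * compute its return value, even when the copy is truncated. */
            /* strlcpy(pool->name, name, sizeof(pool->name)); */

            /* After: strscpy() NUL-terminates, stops at the destination
             * size, and returns -E2BIG on truncation. */
            strscpy(pool->name, name, sizeof(pool->name));
        }
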
      Link: https://lkml.kernel.org/r/1613962050-14188-1-git-send-email-daizhiyuan@phytium.com.cn
      Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests: add a MREMAP_DONTUNMAP selftest for shmem · 85931004
      Brian Geffon authored
      
      
      This test extends the current mremap tests to validate that the
      MREMAP_DONTUNMAP operation can be performed on shmem mappings.
      
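      A minimal userspace sketch of the behaviour the selftest exercises (this
      is not the selftest code itself; MREMAP_DONTUNMAP is defined as a
      fallback in case the libc headers are too old to provide it):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef MREMAP_DONTUNMAP
        #define MREMAP_DONTUNMAP 4    /* uapi value; fallback for old headers */
        #endif

        int main(void)
        {
            size_t len = 5 * 4096;
            int fd = memfd_create("shmem-dontunmap", 0);

            if (fd < 0 || ftruncate(fd, len))
                return 1;

            char *old = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (old == MAP_FAILED)
                return 1;
            memset(old, 0xab, len);

            /* Move the mapping but leave the old range mapped; on kernels
             * without this series the call fails with EINVAL for shmem. */
            char *new = mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
            if (new == MAP_FAILED) {
                perror("mremap(MREMAP_DONTUNMAP)");
                return 1;
            }
            printf("moved to %p, data intact: %d\n", (void *)new,
                   new[0] == (char)0xab);
            return 0;
        }
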
      Link: https://lkml.kernel.org/r/20210323182520.2712101-3-bgeffon@google.com
      Signed-off-by: Brian Geffon <bgeffon@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Michael S . Tsirkin" <mst@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Alejandro Colomar <alx.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" · 14d07113
      Brian Geffon authored
      This reverts commit cd544fd1.
      
      As discussed in [1] this commit was a no-op because the mapping type was
      checked in vma_to_resize before move_vma is ever called.  This meant that
      vm_ops->mremap() would never be called on such mappings.  Furthermore,
      we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
      mappings, and these special mappings are still protected by the existing
      check of !VM_DONTEXPAND and !VM_PFNMAP, which results in -EINVAL.
      
      1. https://lkml.org/lkml/2020/12/28/2340
      
      Link: https://lkml.kernel.org/r/20210323182520.2712101-2-bgeffon@google.com
      Signed-off-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Alejandro Colomar <alx.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Michael S . Tsirkin" <mst@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: extend MREMAP_DONTUNMAP to non-anonymous mappings · a4609387
      Brian Geffon authored
      
      
      Patch series "mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings", v5.
      
      This patch (of 3):
      
      Currently MREMAP_DONTUNMAP only accepts private anonymous mappings.  This
      restriction was placed initially for simplicity and not because there
      exists a technical reason to do so.
      
      This change will widen the support to include any mappings which are not
      VM_DONTEXPAND or VM_PFNMAP.  The primary use case is to support
      MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
      This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL if
      VM_DONTEXPAND or VM_PFNMAP mappings are specified.
      
      Lokesh Gidra who works on the Android JVM, provided an explanation of how
      such a feature will improve Android JVM garbage collection: "Android is
      developing a new garbage collector (GC), based on userfaultfd.  The
      garbage collector will use userfaultfd (uffd) on the java heap during
      compaction.  On accessing any uncompacted page, the application threads
      will find it missing, at which point the thread will create the compacted
      page and then use UFFDIO_COPY ioctl to get it mapped and then resume
      execution.  Before starting this compaction, in a stop-the-world pause the
      heap will be mremap(MREMAP_DONTUNMAP) so that the java heap is ready to
      receive UFFD_EVENT_PAGEFAULT events after resuming execution.
      
      To speedup mremap operations, pagetable movement was optimized by moving
      PUD entries instead of PTE entries [1].  It was necessary as mremap of
      even modest sized memory ranges also took several milliseconds, and
      stopping the application for that long isn't acceptable in response-time
      sensitive cases.
      
      With UFFDIO_CONTINUE feature [2], it will be even more efficient to
      implement this GC, particularly the 'non-moveable' portions of the heap.
      It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
      However, for this to work, the java heap has to be on a 'shared' vma.
      Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
      patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
      compaction."
      
      [1] https://lore.kernel.org/linux-mm/20201215030730.NC3CU98e4%25akpm@linux-foundation.org/
      [2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmussen@google.com/
      
      Link: https://lkml.kernel.org/r/20210323182520.2712101-1-bgeffon@google.com
      Signed-off-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Tested-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Alejandro Colomar <alx.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Michael S . Tsirkin" <mst@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • NUMA balancing: reduce TLB flush via delaying mapping on hint page fault · b99a342d
      Huang Ying authored
      
      
      With NUMA balancing, in the hint page fault handler the faulting page is
      migrated to the accessing node if necessary.  During the migration, the
      TLB is shot down on all CPUs that the process has run on recently, because
      the hint page fault handler makes the PTE accessible before the migration
      is attempted.  The overhead of the TLB shootdown can be high, so it is
      better avoided if possible.  In fact, if we delay mapping the page until
      after the migration, it can be avoided.  That is what this patch does.
      
      For the multiple threads applications, it's possible that a page is
      accessed by multiple threads almost at the same time.  In the original
      implementation, because the first thread will install the accessible PTE
      before migrating the page, the other threads may access the page directly
      before the page is made inaccessible again during migration.  With the
      patch, the second thread will go through the page fault handler too.  And
      because of the PageLRU() check in the following code path,
      
        migrate_misplaced_page()
          numamigrate_isolate_page()
            isolate_lru_page()
      
      the migrate_misplaced_page() will return 0, and the PTE will be made
      accessible in the second thread.
      
      This will introduce a little more overhead.  But we think the possibility
      of a page being accessed by multiple threads at the same time is low, and
      the overhead difference isn't too large.  If this becomes a problem in
      some workloads, we need to consider how to reduce the overhead.
      
      To test the patch, we run a test case as follows on a 2-socket Intel
      server (1 NUMA node per socket) with 128GB DRAM (64GB per socket).
      
      1. Run a memory eater on NUMA node 1 to use 40GB memory before running
         pmbench.
      
      2. Run pmbench (normal accessing pattern) with 8 processes, and 8
         threads per process, so there are 64 threads in total.  The
         working-set size of each process is 8960MB, so the total working-set
         size is 8 * 8960MB = 70GB.  The CPU of all pmbench processes is bound
         to node 1.  The pmbench processes will access some DRAM on node 0.
      
      3. After the pmbench processes run for 10 seconds, kill the memory
         eater.  Now, some pages will be migrated from node 0 to node 1 via
         NUMA balancing.
      
      Test results show that, with the patch, the pmbench throughput (page
      accesses/s) increases by 5.5%.  The number of TLB shootdown interrupts
      drops by 98% (from ~4.7e7 to ~9.7e5) with about 9.2e6 pages (35.8GB)
      migrated.  From the perf profile, the CPU cycles spent by try_to_unmap()
      and its callees drop from 6.02% to 0.47%.  That is, the CPU cycles spent
      on TLB shootdown decrease greatly.
      
      Link: https://lkml.kernel.org/r/20210408132236.1175607-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Matthew Wilcox" <willy@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • i915: fix remap_io_sg to verify the pgprot · b12d691e
      Christoph Hellwig authored
      
      
      remap_io_sg claims that the pgprot is pre-verified using an io_mapping,
      but actually does not get passed an io_mapping and just uses the pgprot in
      the VMA.  Remove the apply_to_page_range abuse and just loop over
      remap_pfn_range for each segment.
      
      Note: this could use io_mapping_map_user by passing an iomap to
      remap_io_sg if the maintainers can verify that the pgprot in the iomap in
      the only caller is indeed the desired one here.
      
      Link: https://lkml.kernel.org/r/20210326055505.1424432-5-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • i915: use io_mapping_map_user · b739f125
      Christoph Hellwig authored
      
      
      Replace the home-grown remap_io_mapping that abuses apply_to_page_range
      with the proper io_mapping_map_user interface.
      
      Link: https://lkml.kernel.org/r/20210326055505.1424432-4-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add a io_mapping_map_user helper · 1fbaf8fc
      Christoph Hellwig authored
      
      
      Add a helper that calls remap_pfn_range for a struct io_mapping, relying
      on the pgprot pre-validation done when creating the mapping instead of
      doing it at runtime.
      
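      A rough sketch of the shape of such a helper (simplified: the exact
      pgprot handling in the real helper may differ, and
      remap_pfn_range_notrack() is the companion interface added earlier in
      this series):

        /* Sketch only: remap an io_mapping's PFN range into userspace,
         * relying on the pgprot validated when the io_mapping was created
         * rather than re-checking it at remap time. */
        int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma,
                                unsigned long addr, unsigned long pfn, unsigned long size)
        {
            vm_flags_t expected = VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

            if (WARN_ON_ONCE((vma->vm_flags & expected) != expected))
                return -EINVAL;

            /* No track_pfn_remap() here: the io_mapping was prevalidated. */
            return remap_pfn_range_notrack(vma, addr, pfn, size, iomap->prot);
        }
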
      Link: https://lkml.kernel.org/r/20210326055505.1424432-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add remap_pfn_range_notrack · 74ffa5a3
      Christoph Hellwig authored
      
      
      Patch series "add remap_pfn_range_notrack instead of reinventing it in i915", v2.
      
      i915 has some reason to want to avoid the track_pfn_remap overhead in
      remap_pfn_range.  Add a function to the core VM to do just that rather
      than reinventing the functionality poorly in the driver.
      
      Note that the remap_io_sg path does get exercised when using Xorg on my
      Thinkpad X1, so this should be considered lightly tested; I've not managed
      to hit the remap_io_mapping path at all.
      
      This patch (of 4):
      
      Add a version of remap_pfn_range that does not call track_pfn_range.  This
      will be used to fix horrible abuses of VM internals in the i915 driver.
      
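      A sketch of the resulting split (assuming the existing PAT helpers
      track_pfn_remap()/untrack_pfn() keep their current signatures):

        /* remap_pfn_range() stays the tracked front end, while
         * remap_pfn_range_notrack() holds the actual page-table population
         * and can be called directly by prevalidated users. */
        int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
                            unsigned long pfn, unsigned long size, pgprot_t prot)
        {
            int err;

            err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
            if (err)
                return -EINVAL;

            err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
            if (err)
                untrack_pfn(vma, pfn, PAGE_ALIGN(size));
            return err;
        }
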
      Link: https://lkml.kernel.org/r/20210326055505.1424432-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20210326055505.1424432-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, tracing: improve rss_stat tracepoint message · f9001107
      Ovidiu Panait authored
      
      
      Adjust the rss_stat tracepoint to print the name of the resident page type
      that got updated (e.g. MM_ANONPAGES/MM_FILEPAGES), rather than the numeric
      index corresponding to it (the __entry->member value):
      
      Before this patch:
      ------------------
        rss_stat: mm_id=1216113068 curr=0 member=1 size=28672B
        rss_stat: mm_id=1216113068 curr=0 member=1 size=0B
        rss_stat: mm_id=534402304 curr=1 member=0 size=188416B
        rss_stat: mm_id=534402304 curr=1 member=1 size=40960B
      
      After this patch:
      -----------------
        rss_stat: mm_id=1726253524 curr=1 type=MM_ANONPAGES size=40960B
        rss_stat: mm_id=1726253524 curr=1 type=MM_FILEPAGES size=663552B
        rss_stat: mm_id=1726253524 curr=1 type=MM_ANONPAGES size=65536B
        rss_stat: mm_id=1726253524 curr=1 type=MM_FILEPAGES size=647168B
      
      Use TRACE_DEFINE_ENUM()/__print_symbolic() logic to map the enum values to
      the strings they represent, so that userspace tools can also parse the raw
      data correctly.
      
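      A condensed sketch of the TP_printk() side of the change (the real trace
      event builds the value/name pairs from a shared macro list; they are
      written out literally here):

        /* Export the enum values so tools parsing the raw data can resolve
         * them, and print the symbolic name in the text output. */
        TRACE_DEFINE_ENUM(MM_FILEPAGES);
        TRACE_DEFINE_ENUM(MM_ANONPAGES);
        TRACE_DEFINE_ENUM(MM_SWAPENTS);
        TRACE_DEFINE_ENUM(MM_SHMEMPAGES);

        TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
                  __entry->mm_id, __entry->curr,
                  __print_symbolic(__entry->member,
                                   { MM_FILEPAGES,  "MM_FILEPAGES"  },
                                   { MM_ANONPAGES,  "MM_ANONPAGES"  },
                                   { MM_SWAPENTS,   "MM_SWAPENTS"   },
                                   { MM_SHMEMPAGES, "MM_SHMEMPAGES" }),
                  __entry->size)
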
      Link: https://lkml.kernel.org/r/20210310162305.4862-1-ovidiu.panait@windriver.com
      Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
      Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/vmemmap: optimize for consecutive sections in partial populated PMDs · faf1c000
      Oscar Salvador authored
      
      
      We can optimize in the case we are adding consecutive sections, so no
      memset(PAGE_UNUSED) is needed.
      
      In that case, let us keep track where the unused range of the previous
      memory range begins, so we can compare it with start of the range to be
      added.  If they are equal, we know sections are added consecutively.
      
      For that purpose, let us introduce 'unused_pmd_start', which always holds
      the beginning of the unused memory range.
      
      If a section does not contiguously follow the previous one, we know we can
      memset [unused_pmd_start, PMD_BOUNDARY) with PAGE_UNUSED.
      
      This patch is based on a similar patch by David Hildenbrand:
      
      https://lore.kernel.org/linux-mm/20200722094558.9828-10-david@redhat.com/
      
      Link: https://lkml.kernel.org/r/20210309214050.4674-5-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/vmemmap: handle unpopulated sub-pmd ranges · 8d400913
      Oscar Salvador authored
      
      
      When sizeof(struct page) is not a power of 2, sections do not span a PMD
      anymore and so when populating them some parts of the PMD will remain
      unused.
      
      Because of this, PMDs will be left behind when depopulating sections since
      remove_pmd_table() thinks that those unused parts are still in use.
      
      Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv()
      will do the right thing and will let us free the PMD when the last user of
      it is gone.
      
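      A conceptual sketch of the marking/checking scheme (the helper names here
      are illustrative, not the exact functions added by the patch):

        /* Mark an unpopulated sub-PMD portion of the vmemmap so it can be
         * told apart from memory that is actually in use. */
        static void vmemmap_mark_sub_pmd_unused(unsigned long start, unsigned long end)
        {
            memset((void *)start, PAGE_UNUSED, end - start);
        }

        /* The PMD backing the vmemmap may only be freed once nothing but
         * PAGE_UNUSED bytes remain in its range. */
        static bool vmemmap_pmd_range_unused(unsigned long addr)
        {
            void *base = (void *)ALIGN_DOWN(addr, PMD_SIZE);

            return !memchr_inv(base, PAGE_UNUSED, PMD_SIZE);
        }
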
      This patch is based on a similar patch by David Hildenbrand:
      
      https://lore.kernel.org/linux-mm/20200722094558.9828-9-david@redhat.com/
      
      [osalvador@suse.de: go back to the ifdef version]
        Link: https://lkml.kernel.org/r/YGy++mSft7K4u+88@localhost.localdomain
      
      Link: https://lkml.kernel.org/r/20210309214050.4674-4-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/vmemmap: drop handling of 1GB vmemmap ranges · 69ccfe74
      Oscar Salvador authored
      
      
      There is no code to allocate 1GB pages when mapping the vmemmap range, as
      this might waste some memory and requires more complexity which is not
      really worth it.
      
      Drop the dead code both for the aligned and unaligned cases and leave only
      the direct map handling.
      
      Link: https://lkml.kernel.org/r/20210309214050.4674-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/vmemmap: drop handling of 4K unaligned vmemmap range · 8e2df191
      Oscar Salvador authored
      
      
      Patch series "Cleanup and fixups for vmemmap handling", v6.
      
      This series contains cleanups to remove dead code that handles unaligned
      cases for 4K and 1GB pages (patch#1 and patch#2) when removing the vmemmap
      range, and a fix (patch#3) to handle the case when two vmemmap ranges
      intersect the same PMD.
      
      This patch (of 4):
      
      remove_pte_table() is prepared to handle the case where either the start
      or the end of the range is not PAGE aligned.  This cannot actually happen:
      
      __populate_section_memmap enforces the range to be PMD aligned, so as long
      as the size of the struct page remains multiple of 8, the vmemmap range
      will be aligned to PAGE_SIZE.
      
      Drop the dead code and place a VM_BUG_ON in vmemmap_{populate,free} to
      catch nasty cases.  Note that the VM_BUG_ON is placed there because
      vmemmap_{populate,free} are the gates of all the page-table removing and
      freeing logic.
      
      Link: https://lkml.kernel.org/r/20210309214050.4674-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20210309214050.4674-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/interval_tree: add comments to improve code readability · 0c1dcb05
      Zhiyuan Dai authored
      
      
      Add a comment explaining the value of the ISSTATIC parameter, informing
      the reader that this is not a coding style issue.
      
      Link: https://lkml.kernel.org/r/1613964695-17614-1-git-send-email-daizhiyuan@phytium.com.cn
      Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c: do_numa_page(): delete bool "migrated" · bf90ac19
      Wang Qing authored
      
      
      Smatch gives the warning:
      
        do_numa_page() warn: assigning (-11) to unsigned variable 'migrated'
      
      Link: https://lkml.kernel.org/r/1614603421-2681-1-git-send-email-wangqing@vivo.com
      Signed-off-by: Wang Qing <wangqing@vivo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_counter: mitigate consequences of a page_counter underflow · 9317d0ff
      Johannes Weiner authored
      
      
      When the unsigned page_counter underflows, even just by a few pages, a
      cgroup will not be able to run anything afterwards and trigger the OOM
      killer in a loop.
      
      Underflows shouldn't happen, but when they do in practice, we may just be
      off by a small amount that doesn't interfere with the normal operation -
      consequences don't need to be that dire.
      
      Reset the page_counter to 0 upon underflow.  We'll issue a warning that
      the accounting will be off and then try to keep limping along.
      
      [ We used to do this with the original res_counter, where it was a
        more straight-forward correction inside the spinlock section. I
        didn't carry it forward into the lockless page counters for
        simplicity, but it turns out this is quite useful in practice. ]
      
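      A sketch of the shape of the mitigation in the uncharge path (simplified
      from the page_counter code):

        /* Subtract nr_pages; if that would drive usage negative, warn once
         * and clamp to zero instead of leaving the counter underflowed. */
        static void page_counter_cancel(struct page_counter *counter,
                                        unsigned long nr_pages)
        {
            long new;

            new = atomic_long_sub_return(nr_pages, &counter->usage);
            if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
                          new, nr_pages)) {
                new = 0;
                atomic_long_set(&counter->usage, new);
            }
            propagate_protected_usage(counter, new);
        }
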
      Link: https://lkml.kernel.org/r/20210408143155.2679744-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisd...>
    • linux/memcontrol.h: remove duplicate struct declaration · a10e9957
      Wan Jiabing authored
      
      
      struct mem_cgroup is declared twice.  One declaration is the earlier
      forward struct declaration.  Remove the duplicate.
      
      Link: https://lkml.kernel.org/r/20210330020246.2265371-1-wanjiabing@vivo.com
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: move PageMemcgKmem to the scope of CONFIG_MEMCG_KMEM · bd290e1e
      Muchun Song authored
      
      
      A page can only be marked as kmem when CONFIG_MEMCG_KMEM is enabled, so
      move PageMemcgKmem() into the scope of CONFIG_MEMCG_KMEM.

      As a bonus, some code can be compiled out on !CONFIG_MEMCG_KMEM builds.
      
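      The resulting structure, roughly (a sketch using the MEMCG_DATA_KMEM flag
      that the memcg_data scheme already defines):

        #ifdef CONFIG_MEMCG_KMEM
        /* Only meaningful when kmem accounting is compiled in. */
        static inline bool PageMemcgKmem(struct page *page)
        {
            return page->memcg_data & MEMCG_DATA_KMEM;
        }
        #else
        static inline bool PageMemcgKmem(struct page *page)
        {
            return false;
        }
        #endif
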
      Link: https://lkml.kernel.org/r/20210319163821.20704-8-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: inline __memcg_kmem_{un}charge() into obj_cgroup_{un}charge_pages() · f1286fae
      Muchun Song authored
      
      
      There is only one user of __memcg_kmem_charge(), so manually inline
      __memcg_kmem_charge() into obj_cgroup_charge_pages().  Similarly, manually
      inline __memcg_kmem_uncharge() into obj_cgroup_uncharge_pages() and call
      obj_cgroup_uncharge_pages() in obj_cgroup_release().

      This is just code cleanup without any functional changes.
      
      Link: https://lkml.kernel.org/r/20210319163821.20704-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: use obj_cgroup APIs to charge kmem pages · b4e0b68f
      Muchun Song authored
      
      
      Since Roman's series "The new cgroup slab memory controller" was applied,
      all slab objects are charged via the new obj_cgroup APIs.  The new APIs
      introduce a struct obj_cgroup to charge slab objects.  It prevents
      long-living objects from pinning the original memory cgroup in memory.
      But there are still some corner-case objects (e.g.  allocations larger
      than an order-1 page on SLUB) which are not charged via the new APIs.
      Those objects (including the pages which are allocated from the buddy
      allocator directly) are charged as kmem pages, which still hold a
      reference to the memory cgroup.

      We want to reuse the obj_cgroup APIs to charge the kmem pages.  If we do
      that, we should store an object cgroup pointer in page->memcg_data for
      the kmem pages.
      
      Finally, page->memcg_data will have 3 different meanings.
      
        1) For the slab pages, page->memcg_data points to an object cgroups
           vector.
      
        2) For the kmem pages (exclude the slab pages), page->memcg_data
           points to an object cgroup.
      
        3) For the user pages (e.g. the LRU pages), page->memcg_data points
           to a memory cgroup.
      
      We do not change the behavior of page_memcg() and page_memcg_rcu().  They
      are also suitable for LRU pages and kmem pages.  Why?
      
      Because the problem of memory allocations pinning memcgs for a long time
      exists at a larger scale and is causing recurring problems in the real
      world: page cache doesn't get reclaimed for a long time, or is used by the
      second, third, fourth, ...  instance of the same job that was restarted
      into a new cgroup every time.  Unreclaimable dying cgroups pile up, waste
      memory, and make page reclaim very inefficient.
      
      We can convert LRU pages and most other raw memcg pins to the objcg
      direction to fix this problem, and then the page->memcg will always point
      to an object cgroup pointer.  At that time, LRU pages and kmem pages will
      be treated the same.  The implementation of page_memcg() will remove the
      kmem page check.
      
      This patch aims to charge the kmem pages by using the new obj_cgroup
      APIs.  Finally, the page->memcg_data of a kmem page points to an object
      cgroup.  We can use __page_objcg() to get the object cgroup associated
      with a kmem page.  Or we can use page_memcg() to get the memory cgroup
      associated with a kmem page, but the caller must ensure that the returned
      memcg won't be released (e.g.  acquire the rcu_read_lock or
      css_set_lock).
      
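      A sketch of the kmem-page accessor described above (assuming the existing
      memcg_data flag and mask names):

        /* For kmem pages, page->memcg_data carries an obj_cgroup pointer
         * with the MEMCG_DATA_KMEM flag set in the low bits. */
        static inline struct obj_cgroup *__page_objcg(struct page *page)
        {
            unsigned long memcg_data = READ_ONCE(page->memcg_data);

            VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
            VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);

            return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
        }
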
        Link: https://lkml.kernel.org/r/20210401030141.37061-1-songmuchun@bytedance.com
      
      Link: https://lkml.kernel.org/r/20210319163821.20704-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      [songmuchun@bytedance.com: fix forget to obtain the ref to objcg in split_page_memcg]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: change ug->dummy_page only if memcg changed · 7ab345a8
      Muchun Song authored
      
      
      Just like the assignment to ug->memcg, we only need to update
      ug->dummy_page if the memcg changed.  So move it there.  This is a very
      small optimization.
      
      Link: https://lkml.kernel.org/r/20210319163821.20704-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: directly access page->memcg_data in mm/page_alloc.c · 48060834
      Muchun Song authored
      
      
      page_memcg() is not suitable for use by page_expected_state() and
      page_bad_reason(), because it can BUG_ON() for slab pages when
      CONFIG_DEBUG_VM is enabled.  As neither lru, nor kmem, nor slab pages
      should have anything left in there by the time the page is freed, what we
      care about is whether the value of page->memcg_data is 0.  So just
      directly access page->memcg_data here.
      
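      A sketch of the intended check at free time (the surrounding
      page_bad_reason() code is elided):

        /* Nothing cgroup-related should remain at free time, so test the raw
         * field instead of going through page_memcg(), which would BUG_ON()
         * for slab pages under CONFIG_DEBUG_VM. */
        #ifdef CONFIG_MEMCG
            if (unlikely(page->memcg_data))
                bad_reason = "page still charged to cgroup";
        #endif
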
      Link: https://lkml.kernel.org/r/20210319163821.20704-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: introduce obj_cgroup_{un}charge_pages · e74d2259
      Muchun Song authored
      
      
      We know that the unit of slab object charging is bytes, while the unit of
      kmem page charging is PAGE_SIZE.  If we want to reuse the obj_cgroup APIs
      to charge the kmem pages, we should pass PAGE_SIZE (as the third
      parameter) to obj_cgroup_charge().  Because the size is already PAGE_SIZE,
      we can skip touching the objcg stock.  So obj_cgroup_{un}charge_pages()
      are introduced to charge in units of pages.

      In a later patch, we can also reuse those two helpers to charge or
      uncharge a number of kernel pages to an object cgroup.  This is just a
      code movement without any functional changes.
      
      Link: https://lkml.kernel.org/r/20210319163821.20704-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: slab: fix obtain a reference to a freeing memcg · 9f38f03a
      Muchun Song authored
      Patch series "Use obj_cgroup APIs to charge kmem pages", v5.
      
      Since Roman's series "The new cgroup slab memory controller" was applied,
      all slab objects are charged with the new obj_cgroup APIs.  The new APIs
      introduce a struct obj_cgroup to charge slab objects.  It prevents
      long-living objects from pinning the original memory cgroup in memory.
      But there are still some corner-case objects (e.g.  allocations larger
      than an order-1 page on SLUB) which are not charged with the new APIs.
      Those objects (including the pages which are allocated from the buddy
      allocator directly) are charged as kmem pages, which still hold a
      reference to the memory cgroup.
      
      For example, we know that the kernel stack is charged as kmem pages
      because the size of the kernel stack can be greater than 2 pages (e.g.
      16KB on x86_64 or arm64).  Suppose we create a thread (whose stack is
      charged to memory cgroup A) and then move it from memory cgroup A to
      memory cgroup B.  Because the kernel stack of the thread holds a reference
      to memory cgroup A, the thread can pin memory cgroup A in memory even
      after we remove cgroup A.  This scenario can be observed with the
      following script, after which the system has accumulated 500 dying
      cgroups (this is not a real-world issue, just a script to show that large
      kmallocs are charged as kmem pages which can pin the memory cgroup in
      memory).
      
      	#!/bin/bash
      
      	cat /proc/cgroups | grep memory
      
      	cd /sys/fs/cgroup/memory
      	echo 1 > memory.move_charge_at_immigrate
      
      	for i in range{1..500}
      	do
      		mkdir kmem_test
      		echo $$ > kmem_test/cgroup.procs
      		sleep 3600 &
      		echo $$ > cgroup.procs
      		echo `cat kmem_test/cgroup.procs` > cgroup.procs
      		rmdir kmem_test
      	done
      
      	cat /proc/cgroups | grep memory
      
      This patchset makes those kmem pages drop their reference to the memory
      cgroup by using the obj_cgroup APIs.  With it applied, the number of dying
      cgroups does not increase when running the above test script.
      
      This patch (of 7):
      
      rcu_read_lock/unlock can only guarantee that the memcg will not be freed;
      it cannot guarantee that the css_get() on the memcg (done in
      refill_stock() when the cached memcg changes) succeeds.
      
        rcu_read_lock()
        memcg = obj_cgroup_memcg(old)
        __memcg_kmem_uncharge(memcg)
            refill_stock(memcg)
                if (stock->cached != memcg)
                    // css_get can change the ref counter from 0 back to 1.
                    css_get(&memcg->css)
        rcu_read_unlock()
      
      This fix is very similar to commit eefbfa7f ("mm: memcg/slab: fix use
      after free in obj_cgroup_charge").

      Fix this by holding a reference to the memcg which is passed to
      __memcg_kmem_uncharge() before calling __memcg_kmem_uncharge().
      
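      A sketch of the shape of the fix (the helper name follows the
      get_mem_cgroup_from_*() pattern used elsewhere in memcontrol.c and is an
      assumption here):

        /* Pin the memcg before the uncharge so refill_stock()'s css_get()
         * never runs against a zero reference count. */
        static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
        {
            struct mem_cgroup *memcg;

            rcu_read_lock();
        retry:
            memcg = obj_cgroup_memcg(objcg);
            if (unlikely(!css_tryget(&memcg->css)))
                goto retry;
            rcu_read_unlock();

            return memcg;
        }

        /* The uncharge path then becomes, roughly:
         *     memcg = get_mem_cgroup_from_objcg(old);
         *     __memcg_kmem_uncharge(memcg, nr_pages);
         *     css_put(&memcg->css);
         */
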
      Link: https://lkml.kernel.org/r/20210319163821.20704-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210319163821.20704-2-songmuchun@bytedance.com
      Fixes: 3de7d4f2 ("mm: memcg/slab: optimize objcg stock draining")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: charge before adding to swapcache on swapin · 0add0c77
      Shakeel Butt authored
      
      
      Currently the kernel adds the page allocated for swapin to the swapcache
      before charging the page.  This is fine, but now we want a per-memcg
      swapcache stat, which is essential for folks who want to transparently
      migrate from cgroup v1's memsw to cgroup v2's memory and swap counters.
      In addition, charging a page before exposing it to other parts of the
      kernel is a step in the right direction.

      To correctly maintain the per-memcg swapcache stat, this patch charges the
      page before adding it to the swapcache.  One challenge in this approach is
      the failure case of add_to_swap_cache(), where we need to undo the
      mem_cgroup_charge().  Specifically, undoing mem_cgroup_uncharge_swap() is
      not simple.

      To resolve the issue, this patch decouples the charging for swapin pages
      from mem_cgroup_charge().  Two new functions are introduced:
      mem_cgroup_swapin_charge_page() for just charging the swapin page, and
      mem_cgroup_swapin_uncharge_swap() for uncharging the swap slot once the
      page has been successfully added to the swapcache.
      
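      A simplified sketch of the reordered swapin path (locking and error
      handling elided; argument lists abbreviated):

        /* Charge to the memcg first, then make the page visible to the rest
         * of the kernel via the swapcache, and only afterwards release the
         * swap-slot charge. */
        page = alloc_page_vma(gfp_mask, vma, addr);
        if (mem_cgroup_swapin_charge_page(page, vma->vm_mm, gfp_mask, entry))
            goto out_free;
        if (add_to_swap_cache(page, entry, gfp_mask, &shadow))
            goto out_free;
        mem_cgroup_swapin_uncharge_swap(entry);
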
      [shakeelb@google.com: set page->private before calling swap_readpage]
        Link: https://lkml.kernel.org/r/20210318015959.2986837-1-shakeelb@google.com
      
      Link: https://lkml.kernel.org/r/20210305212639.775498-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Tested-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kselftests: cgroup: update kmem test for new vmstat implementation · 4bbcc5a4
      Johannes Weiner authored
      
      
      With memcg having switched to rstat, memory.stat output is precise.
      Update the cgroup selftest to reflect the expectations and error
      tolerances of the new implementation.
      
      Also add newly tracked types of memory to the memory.stat side of the
      equation, since they're included in memory.current and could throw false
      positives.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-9-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: consolidate lruvec stat flushing · 2cd21c89
      Johannes Weiner authored
      
      
      There are two functions to flush the per-cpu data of an lruvec into the
      rest of the cgroup tree: when the cgroup is being freed, and when a CPU
      disappears during hotplug.  The difference is whether all CPUs or just
      one is being collected, but the rest of the flushing code is the same.
      Merge them into one function and share the common code.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: switch to rstat · 2d146aa3
      Johannes Weiner authored
      
      
      Replace the memory controller's custom hierarchical stats code with the
      generic rstat infrastructure provided by the cgroup core.
      
      The current implementation does batched upward propagation from the
      write side (i.e.  as stats change).  The per-cpu batches introduce an
      error, which is multiplied by the number of subgroups in a tree.  In
      systems with many CPUs and sizable cgroup trees, the error can be large
      enough to confuse users (e.g.  32 batch pages * 32 CPUs * 32 subgroups
      results in an error of up to 128M per stat item).  This can entirely
      swallow allocation bursts inside a workload that the user is expecting
      to see reflected in the statistics.
      
      In the past, we've done read-side aggregation, where a memory.stat read
      would have to walk the entire subtree and add up per-cpu counts.  This
      became problematic with lazily-freed cgroups: we could have large
      subtrees where most cgroups were entirely idle.  Hence the switch to
      change-driven upward propagation.  Unfortunately, it needed to trade
      accuracy for speed due to the write side being so hot.
      
      Rstat combines the best of both worlds: from the write side, it cheaply
      maintains a queue of cgroups that have pending changes, so that the read
      side can do selective tree aggregation.  This way the reported stats
      will always be precise and recent as can be, while the aggregation can
      skip over potentially large numbers of idle cgroups.
      
      The way rstat works is that it implements a tree for tracking cgroups
      with pending local changes, as well as a flush function that walks the
      tree upwards.  The controller then drives this by 1) telling rstat when
      a local cgroup stat changes (e.g.  mod_memcg_state) and 2) when a flush
      is required to get uptodate hierarchy stats for a given subtree (e.g.
      when memory.stat is read).  The controller also provides a flush
      callback that is called during the rstat flush walk for each cgroup and
      aggregates its local per-cpu counters and propagates them upwards.
      
      This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
      NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
      aggregation.  It removes 3 words from the per-cpu data.  It eliminates
      memcg_exact_page_state(), since memcg_page_state() is now exact.
      
      [akpm@linux-foundation.org: merge fix]
      [hannes@cmpxchg.org: fix a sleep in atomic section problem]
        Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroup: rstat: punt root-level optimization to individual controllers · dc26532a
      Johannes Weiner authored
      
      
      Current users of the rstat code can source root-level statistics from
      the native counters of their respective subsystem, allowing them to
      forego aggregation at the root level.  This optimization is currently
      implemented inside the generic rstat code, which doesn't track the root
      cgroup and doesn't invoke the subsystem flush callbacks on it.
      
      However, the memory controller cannot do this optimization, because
      cgroup1 breaks out memory specifically for the local level, including at
      the root level.  In preparation for the memory controller switching to
      rstat, move the optimization from rstat core to the controllers.
      
      Afterwards, rstat will always track the root cgroup for changes and
      invoke the subsystem callbacks on it; and it's up to the subsystem to
      special-case and skip aggregation of the root cgroup if it can source
      this information through other, cheaper means.
      
      This is the case for the io controller and the cgroup base stats.  In
      their respective flush callbacks, check whether the parent is the root
      cgroup, and if so, skip the unnecessary upward propagation.
      
      The extra cost of tracking the root cgroup is negligible: on stat
      changes, we actually remove a branch that checks for the root.  The
      queueing for a flush touches only per-cpu data, and only the first stat
      change since a flush requires a (per-cpu) lock.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroup: rstat: support cgroup1 · a7df69b8
      Johannes Weiner authored
      
      
      Rstat currently only supports the default hierarchy in cgroup2.  In
      order to replace memcg's private stats infrastructure - used in both
      cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
      
      The initialization and destruction callbacks for regular cgroups are
      already in place.  Remove the cgroup_on_dfl() guards to handle cgroup1.
      
      The initialization of the root cgroup is currently hardcoded to only
      handle cgrp_dfl_root.cgrp.  Move those callbacks to cgroup_setup_root()
      and cgroup_destroy_root() to handle the default root as well as the
      various cgroup1 roots we may set up during mounting.
      
      The linking of css to cgroups happens in code shared between cgroup1 and
      cgroup2 as well.  Simply remove the cgroup_on_dfl() guard.
      
      Linkage of the root css to the root cgroup is a bit trickier: per
      default, the root css of a subsystem controller belongs to the default
      hierarchy (i.e.  the cgroup2 root).  When a controller is mounted in its
      cgroup1 version, the root css is stolen and moved to the cgroup1 root;
      on unmount, the css moves back to the default hierarchy.  Annotate
      rebind_subsystems() to move the root css linkage along between roots.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: privatize memcg_page_state query functions · a18e6e6e
      Johannes Weiner authored
      
      
      There are no users outside of the memory controller itself. The rest
      of the kernel cares either about node or lruvec stats.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: kill mem_cgroup_nodeinfo() · a3747b53
      Johannes Weiner authored
      
      
      No need to encapsulate a simple struct member access.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix cpuhotplug statistics flushing · a3d4c05a
      Johannes Weiner authored
      Patch series "mm: memcontrol: switch to rstat", v3.
      
      This series converts memcg stats tracking to the streamlined rstat
      infrastructure provided by the cgroup core code.  rstat is already used by
      the CPU controller and the IO controller.  This change is motivated by
      recent accuracy problems in memcg's custom stats code, as well as the
      benefits of sharing common infra with other controllers.
      
      The current memcg implementation does batched tree aggregation on the
      write side: local stat changes are cached in per-cpu counters, which are
      then propagated upward in batches when a threshold (32 pages) is exceeded.
      This is cheap, but the error introduced by the lazy upward propagation
      adds up: 32 pages times CPUs times cgroups in the subtree.  We've had
      complaints from service owners that the stats do not reliably track and
      react to allocation behavior as expected, sometimes swallowing the results
      of entire test applications.
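
      A toy model of that write-side batching and the error it introduces -
      a simplified userspace sketch with invented names, not the kernel code:

          #include <stdio.h>
          #include <stdlib.h>

          #define STAT_THRESHOLD 32       /* "32 pages" from the changelog */
          #define NR_CPUS 4

          static long shared_count;        /* what a memory.stat read sees */
          static long pcpu_delta[NR_CPUS]; /* buffered, not yet visible */

          static void mod_stat(int cpu, long val)
          {
                  pcpu_delta[cpu] += val;
                  if (labs(pcpu_delta[cpu]) > STAT_THRESHOLD) {
                          shared_count += pcpu_delta[cpu]; /* batched upward flush */
                          pcpu_delta[cpu] = 0;
                  }
          }

          int main(void)
          {
                  for (int cpu = 0; cpu < NR_CPUS; cpu++)
                          mod_stat(cpu, 30); /* below threshold: stays buffered */

                  /* 120 pages were charged, yet the aggregate still reads 0 */
                  printf("shared_count = %ld\n", shared_count);
                  return 0;
          }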
      
      The original memcg stat implementation used to do tree aggregation
      exclusively on the read side: local stats would only ever be tracked in
      per-cpu counters, and a memory.stat read would iterate the entire subtree
      and sum those counters up.  This didn't keep up with the times:
      
       - Cgroup trees are much bigger now. We switched to lazily-freed
         cgroups, where deleted groups would hang around until their remaining
         page cache has been reclaimed. This can result in large subtrees that
         are expensive to walk, while most of the groups are idle and their
         statistics don't change much anymore.
      
       - Automated monitoring increased. With the proliferation of userspace
         oom killing, proactive reclaim, and higher-resolution logging of
         workload trends in general, top-level stat files are polled at least
         once a second in many deployments.
      
       - The lifetime of cgroups got shorter. Where most cgroup setups in the
         past would have a few large policy-oriented cgroups for everything
         running on the system, newer cgroup deployments tend to create one
         group per application - which gets deleted again as the processes
         exit. An aggregation scheme that doesn't retain child data inside the
         parents loses event history of the subtree.
      
      Rstat addresses all three of those concerns through intelligent,
      persistent read-side aggregation.  As statistics change at the local
      level, rstat tracks - on a per-cpu basis - only those parts of a subtree
      that have changes pending and require aggregation.  The actual
      aggregation occurs on the colder read side - which can now skip over
      (potentially large) numbers of recently idle cgroups.
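
      A flattened toy model of that read-side approach - a userspace sketch
      with invented names that ignores the per-cpu and parent-propagation
      details of the real kernel/cgroup/rstat.c:

          #include <stdbool.h>
          #include <stdio.h>

          #define NR_GROUPS 1000

          struct group {
                  long pcpu_delta;        /* pending local updates */
                  long aggregated;        /* value maintained by the reader */
                  bool on_updated_list;
                  struct group *next_updated;
          };

          static struct group groups[NR_GROUPS];
          static struct group *updated_head; /* per-cpu in the real thing */

          static void group_stat_add(struct group *g, long val)
          {
                  g->pcpu_delta += val;
                  if (!g->on_updated_list) { /* O(1) marking on the write side */
                          g->on_updated_list = true;
                          g->next_updated = updated_head;
                          updated_head = g;
                  }
          }

          static void flush_updated(void)
          {
                  int visited = 0;

                  while (updated_head) {   /* reader walks only dirty groups */
                          struct group *g = updated_head;

                          updated_head = g->next_updated;
                          g->aggregated += g->pcpu_delta;
                          g->pcpu_delta = 0;
                          g->on_updated_list = false;
                          visited++;
                  }
                  printf("flushed %d of %d groups\n", visited, NR_GROUPS);
          }

          int main(void)
          {
                  group_stat_add(&groups[3], 10); /* only two groups ever change */
                  group_stat_add(&groups[42], 5);
                  flush_updated();                /* visits 2, skips 998 idle ones */
                  return 0;
          }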
      
      ===
      
      The test_kmem cgroup selftest is currently failing due to excessive
      cumulative vmstat drift from 100 subgroups:
      
          ok 1 test_kmem_basic
          memory.current = 8810496
          slab + anon + file + kernel_stack = 17074568
          slab = 6101384
          anon = 946176
          file = 0
          kernel_stack = 10027008
          not ok 2 test_kmem_memcg_deletion
          ok 3 test_kmem_proc_kpagecgroup
          ok 4 test_kmem_kernel_stacks
          ok 5 test_kmem_dead_cgroups
          ok 6 test_percpu_basic
      
      As you can see, memory.stat items far exceed memory.current.  The kernel
      stack alone is bigger than all of charged memory.  That's because the
      memory of the test has been uncharged from memory.current, but the
      negative vmstat deltas are still sitting in the percpu caches.
      
      The test isn't even counting percpu, pagetables etc. yet, which would
      further contribute to the error.  The last patch in the series updates
      the test to include them, and also tightens the vmstat tolerances in
      general to only expect page_counter batching.
      
      With all patches applied, the (now more stringent) test succeeds:
      
          ok 1 test_kmem_basic
          ok 2 test_kmem_memcg_deletion
          ok 3 test_kmem_proc_kpagecgroup
          ok 4 test_kmem_kernel_stacks
          ok 5 test_kmem_dead_cgroups
          ok 6 test_percpu_basic
      
      ===
      
      A kernel build test confirms that overhead is comparable.  Two kernels are
      built simultaneously in a nested tree with several idle siblings:
      
      root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                                   `- build-b (defconfig, make -j16)
                                                   `- idle-1
                                                   `- ...
                                                   `- idle-9
      
      During the builds, kernelbuild/memory.stat is read once a second.
      
      A perf diff shows that the change in cycle distribution is
      minimal.  Top 10 kernel symbols:
      
           0.09%     +0.08%  [kernel.kallsyms]                       [k] __mod_memcg_lruvec_state
           0.00%     +0.06%  [kernel.kallsyms]                       [k] cgroup_rstat_updated
           0.08%     -0.05%  [kernel.kallsyms]                       [k] __mod_memcg_state.part.0
           0.16%     -0.04%  [kernel.kallsyms]                       [k] release_pages
           0.00%     +0.03%  [kernel.kallsyms]                       [k] __count_memcg_events
           0.01%     +0.03%  [kernel.kallsyms]                       [k] mem_cgroup_charge_statistics.constprop.0
           0.10%     -0.02%  [kernel.kallsyms]                       [k] get_mem_cgroup_from_mm
           0.05%     -0.02%  [kernel.kallsyms]                       [k] mem_cgroup_update_lru_size
           0.57%     +0.01%  [kernel.kallsyms]                       [k] asm_exc_page_fault
      
      ===
      
      The on-demand aggregated stats are now fully accurate:
      
      $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
        grep -e inactive_file /sys/fs/cgroup/memory.stat
      
      vanilla:                              patched:
      nr_inactive_file 1574105088           nr_inactive_file 1027801088
         inactive_file 1577410560              inactive_file 1027801088
      
      ===
      
      This patch (of 8):
      
      The memcg hotunplug callback erroneously flushes counts on the local CPU,
      not the counts of the CPU going away; those counts will be lost.
      
      Flush the CPU that is actually going away.
      
      Also simplify the code a bit by using mod_memcg_state() and
      count_memcg_events() instead of open-coding the upward flush - this is
      comparable to how vmstat.c handles hotunplug flushing.
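
      A toy model of the bug and the fix - simplified userspace code with
      invented names, not the actual memcontrol.c hotplug callback:

          #include <stdio.h>

          #define NR_CPUS 4

          static long pcpu_delta[NR_CPUS]; /* per-cpu buffered stat deltas */
          static long total;               /* aggregated counter */

          /* plays the role of mod_memcg_state(): fold a delta upward */
          static void mod_state(long val)
          {
                  total += val;
          }

          static int hotplug_cpu_dead(unsigned int dead_cpu, unsigned int this_cpu)
          {
                  /*
                   * The buggy version flushed pcpu_delta[this_cpu]; the
                   * departing CPU's delta was simply lost.  Flush dead_cpu.
                   */
                  long delta = pcpu_delta[dead_cpu];

                  (void)this_cpu;          /* the callback runs on a survivor */
                  pcpu_delta[dead_cpu] = 0;
                  if (delta)
                          mod_state(delta);
                  return 0;
          }

          int main(void)
          {
                  pcpu_delta[2] = 17;      /* CPU 2 has unflushed updates */
                  hotplug_cpu_dead(2, 0);  /* callback runs on CPU 0 */
                  printf("total = %ld\n", total);
                  return 0;
          }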
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org
      Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3d4c05a
    • Shakeel Butt's avatar
      memcg: enable memcg oom-kill for __GFP_NOFAIL · 3d0cbb98
      Shakeel Butt authored
      In the era of the async memcg oom-killer, commit a0d8b00a ("mm: memcg:
      do not declare OOM from __GFP_NOFAIL allocations") added code to skip
      the memcg oom-killer for __GFP_NOFAIL allocations.  The reason was that
      __GFP_NOFAIL callers would not enter the async oom synchronization path
      and would keep the task marked as in memcg oom.  At that time, tasks
      marked as in memcg oom could bypass the memcg limits, and the oom
      synchronization would only have happened later, in a subsequent
      userspace-triggered page fault, thus letting a task marked as under
      memcg oom bypass the memcg limit for an arbitrary amount of time.
      
      With the synchronous memcg oom-killer (commit 29ef680a ("memcg, oom:
      move out_of_memory back to the charge path")), and with tasks marked
      under memcg oom no longer allowed to bypass the memcg limits (commit
      1f14c1ac ("mm: memcg: do not allow task about to OOM kill to bypass the
      limit")), we can again allow __GFP_NOFAIL allocations to trigger memcg
      oom-kill.  This makes memcg oom behavior closer to page allocator oom
      behavior.
      
      Link: https://lkml.kernel.org/r/20210223204337.2785120-1-shakeelb@google.com
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d0cbb98
    • Shakeel Butt's avatar
      memcg: cleanup root memcg checks · a4792030
      Shakeel Butt authored
      
      
      Replace the implicit root memcg check, i.e.  !css->parent, with the
      explicit mem_cgroup_is_root() helper.
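
      A minimal sketch of the pattern (simplified types; the real helper
      lives in the memcg headers and may be defined differently):

          #include <stdbool.h>
          #include <stddef.h>
          #include <stdio.h>

          struct cgroup_subsys_state { struct cgroup_subsys_state *parent; };
          struct mem_cgroup { struct cgroup_subsys_state css; };

          /* a named predicate instead of an open-coded "!css->parent" test */
          static bool mem_cgroup_is_root(const struct mem_cgroup *memcg)
          {
                  return memcg->css.parent == NULL;
          }

          int main(void)
          {
                  struct mem_cgroup root = { .css = { .parent = NULL } };
                  struct mem_cgroup child = { .css = { .parent = &root.css } };

                  printf("%d %d\n", mem_cgroup_is_root(&root),
                         mem_cgroup_is_root(&child));
                  return 0;
          }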
      
      Link: https://lkml.kernel.org/r/20210223205625.2792891-1-shakeelb@google.com
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4792030
    • Muchun Song's avatar
      mm: memcontrol: fix kernel stack account · 27faca83
      Muchun Song authored
      For simplification, commit 991e7673 ("mm: memcontrol: account kernel
      stack per node") changed the accounting of vmalloc-backed stack pages
      from per zone to per node.

      By doing that we lost a certain amount of precision, because those
      pages might live on different NUMA nodes.  As a result, the
      NR_KERNEL_STACK_KB counter exported to userspace might be overestimated
      on some nodes while underestimated on others.  This is not a real-world
      problem, just an issue found by reading the code, so there is no actual
      data showing how much impact it has on users.

      It doesn't affect the correctness of kernel behavior, as the counter is
      not used for any internal processing, but it can be confusing to
      userspace.
      
      Address the problem by accounting each vmalloc backing page to its own
      node.
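
      A toy model of the difference - a userspace sketch with invented names,
      not kernel/fork.c:

          #include <stdio.h>

          #define NR_NODES 2
          #define STACK_PAGES 4
          #define PAGE_KB 4

          static long nr_kernel_stack_kb[NR_NODES];

          /* before: lump the whole stack onto the node of the first page */
          static void account_stack_coarse(const int page_node[STACK_PAGES])
          {
                  nr_kernel_stack_kb[page_node[0]] += STACK_PAGES * PAGE_KB;
          }

          /* after: attribute each backing page to the node it lives on */
          static void account_stack_per_page(const int page_node[STACK_PAGES])
          {
                  for (int i = 0; i < STACK_PAGES; i++)
                          nr_kernel_stack_kb[page_node[i]] += PAGE_KB;
          }

          int main(void)
          {
                  int page_node[STACK_PAGES] = { 0, 0, 1, 1 }; /* split across nodes */

                  account_stack_coarse(page_node);
                  printf("coarse:   node0=%ldKB node1=%ldKB\n",
                         nr_kernel_stack_kb[0], nr_kernel_stack_kb[1]);

                  nr_kernel_stack_kb[0] = nr_kernel_stack_kb[1] = 0;
                  account_stack_per_page(page_node);
                  printf("per-page: node0=%ldKB node1=%ldKB\n",
                         nr_kernel_stack_kb[0], nr_kernel_stack_kb[1]);
                  return 0;
          }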
      
      Link: https://lkml.kernel.org/r/20210303151843.81156-1-songmuchun@bytedance.com
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27faca83
    • Zhiyuan Dai's avatar
      mm/memremap.c: fix improper SPDX comment style · 2840d498
      Zhiyuan Dai authored
      
      
      Replace the /* */ comment with // to fix the SPDX comment style.
      
      see: Documentation/process/license-rules.rst
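
      For reference, the convention being applied is roughly the following
      (GPL-2.0 shown only as an example license expression):

          // SPDX-License-Identifier: GPL-2.0
          /* first line of a C source file (.c): C99 // comment style */

          /* SPDX-License-Identifier: GPL-2.0 */
          /* first line of a C header (.h): traditional comment style */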
      
      Link: https://lkml.kernel.org/r/1614223348-15516-1-git-send-email-daizhiyuan@phytium.com.cn
      Signed-off-by: default avatarZhiyuan Dai <daizhiyuan@phytium.com.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2840d498
    • Yang Shi's avatar
      mm: gup: remove FOLL_SPLIT · 4066c119
      Yang Shi authored
      Since commit 5a52c9df ("uprobe: use FOLL_SPLIT_PMD instead of
      FOLL_SPLIT") and commit ba925fa3 ("s390/gmap: improve THP splitting"),
      FOLL_SPLIT has no users anymore.  Remove the dead code.
      
      Link: https://lkml.kernel.org/r/20210330203900.9222-1-shy828301@gmail.com
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4066c119