  1. Nov 07, 2021
    • vmalloc: back off when the current task is OOM-killed · dd544141
      Vasily Averin authored
      
      
      A huge vmalloc allocation on a heavily loaded node can lead to a global
      memory shortage.  The task that called vmalloc can have the worst badness
      and be selected by the OOM killer, but the fatal signal it receives does
      not interrupt the allocation cycle.  Vmalloc repeats the page allocations
      again and again, exacerbating the crisis and consuming the memory freed
      up by other killed tasks.
      
      After a successful completion of the allocation procedure, the fatal
      signal will be processed and the task will finally be destroyed.  However,
      this may not release the consumed memory, since the allocated object may
      have a lifetime unrelated to the completed task.  In the worst case this
      can lead to a host panic due to "Out of memory and no killable
      processes..."
      
      This patch allows the OOM killer to break the vmalloc cycle, making OOM
      handling more effective and avoiding the host panic.  It does not check
      the OOM condition directly; instead, it breaks the page allocation cycle
      when a fatal signal has been received.
      
      This may expose some hidden problems, when a caller does not handle
      vmalloc failures, or when a rollback after a failed vmalloc performs its
      own vmalloc calls.  However, all of these scenarios are incorrect: vmalloc
      does not guarantee successful allocation, it has never supported
      __GFP_NOFAIL, and therefore it either should not be used for any rollbacks
      or should handle such errors correctly and not lead to critical failures.
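      
      A minimal kernel-context sketch of the kind of bail-out being described,
      assuming a page-by-page allocation loop (the loop shape and names here
      are illustrative, not the exact mm/vmalloc.c code):
      
          /* illustrative: stop a long allocation loop once a fatal signal
           * (e.g. from the OOM killer) is pending for the current task */
          for (i = 0; i < nr_pages; i++) {
                  if (fatal_signal_pending(current))
                          goto fail;              /* caller sees a NULL return */
                  pages[i] = alloc_page(gfp_mask);
                  if (!pages[i])
                          goto fail;
          }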
      
      Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd544141
    • mm/vmalloc: check various alignments when debugging · 066fed59
      Uladzislau Rezki (Sony) authored
      
      
      Previously we did not guarantee a free block with the lowest start address
      for allocations with alignment >= PAGE_SIZE, because the alignment
      overhead was included in the search length:
      
           length = size + align - 1;
      
      Doing so made sure that a bigger block would fit after applying the
      alignment adjustment.  Now there is no such limitation, i.e.  whatever
      alignment the user wants to apply, the returned free area has the lowest
      possible start address.
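      
      A small stand-alone illustration (plain C, not kernel code) of why
      including the alignment overhead in the search length can reject an
      exactly sized free block even though the block fits once its start is
      aligned:
      
          #include <stdbool.h>
          #include <stdio.h>
          
          /* old check: demand room for the size plus worst-case alignment slack */
          static bool fits_old(unsigned long block, unsigned long size,
                               unsigned long align)
          {
                  return block >= size + align - 1;
          }
          
          /* new check: align the candidate start first, then test the size */
          static bool fits_new(unsigned long start, unsigned long block,
                               unsigned long size, unsigned long align)
          {
                  unsigned long aligned = (start + align - 1) & ~(align - 1);
          
                  return aligned + size <= start + block;
          }
          
          int main(void)
          {
                  /* a 16 MiB free block whose start is already 16 MiB aligned */
                  unsigned long start = 16UL << 20, block = 16UL << 20;
                  unsigned long size  = 16UL << 20, align = 16UL << 20;
          
                  /* prints "old: 0, new: 1" */
                  printf("old: %d, new: %d\n",
                         (int)fits_old(block, size, align),
                         (int)fits_new(start, block, size, align));
                  return 0;
          }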
      
      Link: https://lkml.kernel.org/r/20211004142829.22222-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Ping Fang <pifang@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      066fed59
    • mm/vmalloc: do not adjust the search size for alignment overhead · 9f531973
      Uladzislau Rezki (Sony) authored
      We used to include the alignment overhead in the search length; in that
      case we guaranteed that a found area would definitely fit after applying
      the specific alignment that the user specifies.  On the other hand we did
      not guarantee that the area had the lowest address if the alignment was
      >= PAGE_SIZE.
      
      It means that when a user specifies a special alignment together with a
      range that corresponds exactly to the requested size, the allocation will
      fail.  This is what happens to KASAN: it wants a free block that exactly
      matches a specified range while onlining memory banks:
      
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
          vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
          bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
          CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
          Call Trace:
           dump_stack+0x8e/0xd0
           warn_alloc.cold.90+0x8a/0x1b2
           ? zone_watermark_ok_safe+0x300/0x300
           ? slab_free_freelist_hook+0x85/0x1a0
           ? __get_vm_area_node+0x240/0x2c0
           ? kfree+0xdd/0x570
           ? kmem_cache_alloc_node_trace+0x157/0x230
           ? notifier_call_chain+0x90/0x160
           __vmalloc_node_range+0x465/0x840
           ? mark_held_locks+0xb7/0x120
      
      Fix it by making sure that find_vmap_lowest_match() returns the lowest
      start address for any given alignment value, i.e.  for alignments bigger
      than PAGE_SIZE the algorithm rolls back toward parent nodes, checking the
      right sub-trees, if the leftmost free block did not fit due to the
      alignment overhead.
      
      Link: https://lkml.kernel.org/r/20211004142829.22222-1-urezki@gmail.com
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reported-by: Ping Fang <pifang@redhat.com>
      Tested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f531973
    • mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo · 7cc7913e
      Eric Dumazet authored
      If the last va found in vmap_area_list does not have a vm pointer,
      vmallocinfo's s_show() returns 0 and show_purge_info() is not called as
      it should be.
      
      Link: https://lkml.kernel.org/r/20211001170815.73321-1-eric.dumazet@gmail.com
      Fixes: dd3b8353 ("mm/vmalloc: do not keep unpurged areas in the busy tree")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Pengfei Li <lpf.vector@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cc7913e
    • mm/vmalloc: make show_numa_info() aware of hugepage mappings · 51e50b3a
      Eric Dumazet authored
      
      
      show_numa_info() can be made slightly faster by skipping over hugepages
      directly.
      
      Link: https://lkml.kernel.org/r/20211001172725.105824-1-eric.dumazet@gmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51e50b3a
    • mm/vmalloc: don't allow VM_NO_GUARD on vmap() · bd1a8fb2
      Peter Zijlstra authored
      
      
      The vmalloc guard pages are added on top of each allocation, thereby
      isolating any two allocations from one another.  The top guard of the
      lower allocation is the bottom guard of the higher allocation, and so on.
      
      Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
      isolating separate allocations.
      
      There are only two in-tree users of this flag, neither of which uses it
      through the exported interface.  Ensure it stays this way.
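      
      A minimal kernel-context sketch of the kind of check this describes,
      assuming it is a simple flag test at the top of vmap() (illustrative
      fragment, not the exact diff):
      
          /* illustrative: refuse VM_NO_GUARD coming in through the exported API */
          if (WARN_ON_ONCE(flags & VM_NO_GUARD))
                  return NULL;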
      
      Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd1a8fb2
    • mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node() · 228f778e
      Vasily Averin authored
      Commit f255935b ("mm: cleanup the gfp_mask handling in
      __vmalloc_area_node") added __GFP_NOWARN to gfp_mask unconditionally;
      however, that disabled all output from the warn_alloc() calls.  This patch
      saves the original gfp_mask and provides it to all warn_alloc() calls.
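      
      A minimal kernel-context sketch of the idea, assuming the caller's mask
      is kept in a separate variable before __GFP_NOWARN is OR-ed in (the
      function and failure path here are illustrative):
      
          static void *example_alloc(unsigned long size, gfp_t gfp_mask)
          {
                  const gfp_t orig_gfp_mask = gfp_mask;  /* what the caller asked for */
                  void *area;
          
                  gfp_mask |= __GFP_NOWARN;              /* keep the page allocator quiet */
                  area = __vmalloc(size, gfp_mask);
                  if (!area)
                          warn_alloc(orig_gfp_mask, NULL,
                                     "vmalloc error: size %lu, allocation failed", size);
                  return area;
          }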
      
      Link: https://lkml.kernel.org/r/f4f3187b-9684-e426-565d-827c2a9bbb0e@virtuozzo.com
      Fixes: f255935b
      
       ("mm: cleanup the gfp_mask handling in __vmalloc_area_node")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      228f778e
    • mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN · 627ae828
      Gang Li authored
      
      
      By using DECLARE_EVENT_CLASS and DEFINE_EVENT_FN, we can save a lot of
      space that would otherwise be taken up by duplicated code.
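      
      A minimal kernel-context sketch of the pattern, assuming one event class
      shared by several events with registration callbacks (the field layout
      and names here are illustrative, not the exact mmap_lock trace header):
      
          DECLARE_EVENT_CLASS(mmap_lock_class,
                  TP_PROTO(struct mm_struct *mm, bool write),
                  TP_ARGS(mm, write),
                  TP_STRUCT__entry(
                          __field(struct mm_struct *, mm)
                          __field(bool, write)
                  ),
                  TP_fast_assign(
                          __entry->mm = mm;
                          __entry->write = write;
                  ),
                  TP_printk("mm=%p write=%s",
                            __entry->mm, __entry->write ? "true" : "false")
          );
          
          /* each concrete event reuses the class instead of repeating it */
          DEFINE_EVENT_FN(mmap_lock_class, mmap_lock_start_locking,
                  TP_PROTO(struct mm_struct *mm, bool write),
                  TP_ARGS(mm, write),
                  trace_mmap_lock_reg, trace_mmap_lock_unreg
          );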
      
      Link: https://lkml.kernel.org/r/20211009071243.70286-1-ligang.bdlg@bytedance.com
      Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      627ae828
    • mm: mmap_lock: remove redundant newline in TP_printk · f595e341
      Gang Li authored
      
      
      The ftrace core adds a newline automatically when printing, so also
      emitting one in TP_printk creates a blank line.
      
      Link: https://lkml.kernel.org/r/20211009071105.69544-1-ligang.bdlg@bytedance.com
      Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f595e341
    • include/linux/io-mapping.h: remove fallback for writecombine · 2e86f78b
      Lucas De Marchi authored
      The fallback was introduced in commit 80c33624 ("io-mapping: Fixup
      for different names of writecombine") to fix the build on microblaze.
      
      Five years later, it seems all archs now provide pgprot_writecombine(),
      so just remove the remaining fallbacks.  For microblaze,
      pgprot_writecombine() is available since commit 97ccedd7 ("microblaze:
      Provide pgprot_device/writecombine macros for nommu").
      
      This is build-tested on microblaze with a hack to always build
      mm/io-mapping.o and without hand-rolling the x86-only macro
      (_PAGE_CACHE_MASK).
      
      Link: https://lkml.kernel.org/r/20211020204838.1142908-1-lucas.demarchi@intel.com
      Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e86f78b
    • mm/mremap: don't account pages in vma_to_resize() · fdbef614
      Dmitry Safonov authored
      All this vm_unacct_memory(charged) dance seems to complicate life without
      a good reason.  Furthermore, it does not always seem to be done correctly
      on the error paths in mremap_to().  And worse than that: the `charged'
      difference is sometimes double-accounted for growing MREMAP_DONTUNMAP
      mremap()s in move_vma():
      
      	if (security_vm_enough_memory_mm(mm, new_len >> PAGE_SHIFT))
      
      Let's not do this.  Account memory in the mremap() fast path for growing
      VMAs, or in move_vma() for actually moving things, in the same simple way
      as vm_stat_account() does it, but with the difference that
      security_vm_enough_memory_mm() is called before copying/adjusting the VMA.
      
      Originally noticed by Chen Wandun:
      https://lkml.kernel.org/r/20210717101942.120607-1-chenwandun@huawei.com
      
      Link: https://lkml.kernel.org/r/20210721131320.522061-1-dima@arista.com
      Fixes: e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yongjun <weiyongjun1@huawei.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdbef614
    • mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey() · 6af5fa0d
      Liu Song authored
      
      
      After adjustment, the repeated assignment of "prev" is avoided, and the
      readability of the code is improved.
      
      Link: https://lkml.kernel.org/r/20211012152444.4127-1-fishland@aliyun.com
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Liu Song <liu.song11@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6af5fa0d
    • memory: remove unused CONFIG_MEM_BLOCK_SIZE · e26e0cc3
      Lukas Bulwahn authored
      Commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
      functions") defines CONFIG_MEM_BLOCK_SIZE, but this has never been
      utilized anywhere.
      
      It is a good practice to keep the CONFIG_* defines exclusively for the
      Kbuild system.  So, drop this unused definition.
      
      This issue was noticed due to running ./scripts/checkkconfigsymbols.py.
      
      Link: https://lkml.kernel.org/r/20211006120354.7468-1-lukas.bulwahn@gmail.com
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e26e0cc3
    • Documentation: update pagemap with shmem exceptions · cbbb69d3
      Tiberiu A Georgescu authored
      
      
      This patch follows the discussions on previous documentation patch
      threads [1][2].  It presents the exception case of shared memory
      management from the pagemap's point of view.  It briefly describes what
      is missing, why it is missing and alternatives to the pagemap for page
      info retrieval in user space.
      
      In short, the kernel does not keep track of PTEs for swapped out shared
      pages within the processes that reference them.  Thus, the
      /proc/pid/pagemap tool cannot print the swap destination of the shared
      memory pages, instead setting the pagemap entry to zero for both
      non-allocated and swapped out pages.  This can create confusion for
      users who need information on swapped out pages.
      
      The reasons why maintaining the PTEs of all swapped out shared pages
      among all processes, while keeping similar performance, is neither a
      trivial task nor a desirable change have been discussed extensively
      [1][3][4][5].  There are also arguments for why this arguably missing
      information should eventually be exposed to the user, either in a future
      pagemap patch or by an alternative tool.
      
      [1]: https://marc.info/?m=162878395426774
      [2]: https://lore.kernel.org/lkml/20210920164931.175411-1-tiberiu.georgescu@nutanix.com/
      [3]: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
      [4]: https://lore.kernel.org/lkml/20210807032521.7591-1-peterx@redhat.com/
      [5]: https://lore.kernel.org/lkml/20210715201651.212134-1-peterx@redhat.com/
      
      Mention the current missing information in the pagemap and alternatives
      on how to retrieve it, in case someone stumbles upon unexpected
      behaviour.
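      
      For context, a minimal user-space sketch of how a single pagemap entry
      can be read for one virtual address; the 64-bit entry layout (bit 63 =
      present, bit 62 = swapped) is the one described in
      Documentation/admin-guide/mm/pagemap.rst, and the rest of the program is
      illustrative:
      
          #include <fcntl.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <unistd.h>
          
          int main(void)
          {
                  static char page[4096];
                  long psize = sysconf(_SC_PAGESIZE);
                  uint64_t entry = 0;
                  int fd;
          
                  page[0] = 1;            /* touch it so the page is actually allocated */
                  fd = open("/proc/self/pagemap", O_RDONLY);
                  if (fd < 0)
                          return 1;
                  /* one 64-bit entry per virtual page */
                  pread(fd, &entry, sizeof(entry),
                        ((uintptr_t)page / psize) * sizeof(entry));
                  close(fd);
                  printf("present=%llu swapped=%llu\n",
                         (unsigned long long)(entry >> 63) & 1,
                         (unsigned long long)(entry >> 62) & 1);
                  return 0;
          }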
      
      Link: https://lkml.kernel.org/r/20210923064618.157046-1-tiberiu.georgescu@nutanix.com
      Link: https://lkml.kernel.org/r/20210923064618.157046-2-tiberiu.georgescu@nutanix.com
      Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
      Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
      Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
      Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
      Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbbb69d3
    • mm: remove redundant smp_wmb() · ed33b5a6
      Qi Zheng authored
      
      
      The smp_wmb() in __pte_alloc() is used to ensure that all pte setup is
      visible before the pte is made visible to other CPUs by being put into
      page tables.  We only need this when the pte is actually populated, so
      move it to pmd_install().  __pte_alloc_kernel(), __p4d_alloc(),
      __pud_alloc() and __pmd_alloc() are similar cases.
      
      We can also defer smp_wmb() to the place where the pmd entry is really
      populated by the preallocated pte.  There are two kinds of user of the
      preallocated pte: one is filemap & finish_fault(), the other is THP.  The
      former does not need another smp_wmb() because the barrier has already
      been issued by pmd_install().  Fortunately, the latter also does not need
      another smp_wmb() because there is already an smp_wmb() before populating
      the new pte when THP uses a preallocated pte to split a huge pmd.
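      
      A minimal kernel-context sketch of where the barrier ends up, assuming
      pmd_install() roughly follows the shape described above (illustrative,
      not the exact mm/memory.c code):
      
          void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
          {
                  spinlock_t *ptl = pmd_lock(mm, pmd);
          
                  if (likely(pmd_none(*pmd))) {   /* nobody populated it under us */
                          mm_inc_nr_ptes(mm);
                          smp_wmb();              /* pte table contents before pmd entry */
                          pmd_populate(mm, pmd, *pte);
                          *pte = NULL;
                  }
                  spin_unlock(ptl);
          }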
      
      Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mika Penttila <mika.penttila@nextfour.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed33b5a6
    • mm: introduce pmd_install() helper · 03c4f204
      Qi Zheng authored
      
      
      Patch series "Do some code cleanups related to mm", v3.
      
      This patch (of 2):
      
      Currently we have the same few lines repeated three times in the code.
      Deduplicate them with the newly introduced pmd_install() helper.
      
      Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Mika Penttila <mika.penttila@nextfour.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03c4f204
    • mm: add zap_skip_check_mapping() helper · 91b61ef3
      Peter Xu authored
      
      
      Use the helper for the checks.  Rename "check_mapping" to "zap_mapping"
      because "check_mapping" looks like a bool but in fact it stores the
      mapping itself.  When it is set, we check the mapping (it must be
      non-NULL).  When it is cleared we skip the check, which works like the
      old way.
      
      Move the duplicated comments to the helper too.
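      
      A minimal kernel-context sketch of what such a helper can look like,
      assuming the semantics described above (illustrative, not necessarily
      the exact mm code):
      
          /* When details->zap_mapping is set, only pages of that mapping are zapped. */
          static inline bool
          zap_skip_check_mapping(struct zap_details *details, struct page *page)
          {
                  if (!details || !details->zap_mapping)
                          return false;           /* no filtering requested */
          
                  return details->zap_mapping != page_rmapping(page);
          }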
      
      Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91b61ef3
    • mm: drop first_index/last_index in zap_details · 232a6a1c
      Peter Xu authored
      
      
      The first_index/last_index parameters in zap_details are actually only
      used in unmap_mapping_range_tree().  Meanwhile, that function is only
      called once, by unmap_mapping_pages().
      
      Instead of passing these two variables through the whole stack of page
      zapping code, remove them from zap_details and let them simply be
      parameters of unmap_mapping_range_tree(), which is inlined.
      
      Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      232a6a1c
    • mm: clear vmf->pte after pte_unmap_same() returns · 2ca99358
      Peter Xu authored
      
      
      pte_unmap_same() always unmaps the pte pointer.  After the unmap,
      vmf->pte is no longer valid, so we should clear it.
      
      It was safe only because no one accesses vmf->pte after pte_unmap_same()
      returns, since the only caller of pte_unmap_same() (so far) is
      do_swap_page(), where vmf->pte will in most cases be overwritten very
      soon.
      
      Pass vmf directly into pte_unmap_same(), which also lets us avoid the
      long parameter list, a nice cleanup.
      
      Link: https://lkml.kernel.org/r/20210915181533.11188-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ca99358
    • mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte · 9ae0f87d
      Peter Xu authored
      
      
      Patch series "mm: A few cleanup patches around zap, shmem and uffd", v4.
      
      IMHO all of them are already very nice cleanups to the existing code;
      they're all small and self-contained.  They'll be needed by the upcoming
      uffd-wp series.
      
      This patch (of 4):
      
      Previously this was done conditionally, as there is one shmem special
      case where SetPageDirty() is used instead.  However, that is not
      necessary, and it is easier and cleaner to do it unconditionally in
      mfill_atomic_install_pte().
      
      The most recent discussion about this is here, where Hugh explained the
      history of SetPageDirty() and why it's possible that it's not required
      at all:
      
      https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
      
      Currently mfill_atomic_install_pte() has three callers:
      
              1. shmem_mfill_atomic_pte
              2. mcopy_atomic_pte
              3. mcontinue_atomic_pte
      
      After the change: case (1) has its SetPageDirty() replaced by the dirty
      bit on the pte (so we finally unify them), case (2) has no functional
      change at all as it has page_in_cache==false, and case (3) may add a
      dirty bit to the pte.  However, since case (3) is UFFDIO_CONTINUE for
      shmem, the page is nearly 100% sure to be dirty anyway, because
      UFFDIO_CONTINUE normally requires another process to modify the page
      cache and kick the faulted thread, so it should not make a real
      difference either.
      
      This should make it much easier to follow which cases set the dirty bit
      for uffd, as we now simply set it for all uffd-related ioctls.
      Meanwhile, there is no special handling of SetPageDirty() when there is
      no need for it.
      
      Link: https://lkml.kernel.org/r/20210915181456.10739-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210915181456.10739-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ae0f87d
    • mm/memory.c: avoid unnecessary kernel/user pointer conversion · b063e374
      Amit Daniel Kachhap authored
      
      
      Annotating a pointer from __user to kernel and then back again might
      confuse sparse.  In copy_huge_page_from_user() it can be avoided by
      removing the intermediate variable since it is never used.
      
      Link: https://lkml.kernel.org/r/20210914150820.19326-1-amit.kachhap@arm.com
      Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b063e374
    • mm: use __pfn_to_section() instead of open coding it · f1dc0db2
      Rolf Eike Beer authored
      
      
      It is defined in the same file just a few lines above.
      
      Link: https://lkml.kernel.org/r/4598487.Rc0NezkW7i@mobilepool36.emlix.com
      Signed-off-by: Rolf Eike Beer <eb@emlix.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1dc0db2
    • mm/mmap.c: fix a data race of mm->total_vm · 7866076b
      Peng Liu authored
      
      
      The variable mm->total_vm can be accessed concurrently during mmap and
      system accounting, as noticed by KCSAN:
      
        BUG: KCSAN: data-race in __acct_update_integrals / mmap_region
      
        read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3:
         mmap_region+0x6dc/0x1400
         do_mmap+0x794/0xca0
         vm_mmap_pgoff+0xdf/0x150
         ksys_mmap_pgoff+0xe1/0x380
         do_syscall_64+0x37/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2:
         __acct_update_integrals+0x187/0x1d0
         acct_account_cputime+0x3c/0x40
         update_process_times+0x5c/0x150
         tick_sched_timer+0x184/0x210
         __run_hrtimer+0x119/0x3b0
         hrtimer_interrupt+0x350/0xaa0
         __sysvec_apic_timer_interrupt+0x7b/0x220
         asm_call_irq_on_stack+0x12/0x20
         sysvec_apic_timer_interrupt+0x4d/0x80
         asm_sysvec_apic_timer_interrupt+0x12/0x20
         smp_call_function_single+0x192/0x2b0
         perf_install_in_context+0x29b/0x4a0
         __se_sys_perf_event_open+0x1a98/0x2550
         __x64_sys_perf_event_open+0x63/0x70
         do_syscall_64+0x37/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Reported by Kernel Concurrency Sanitizer on:
        CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        Ubuntu-1.8.2-1ubuntu1 04/01/2014
      
      vm_stat_account(), which is called by mmap_region(), increases total_vm,
      and __acct_update_integrals() may read total_vm at the same time.  This
      is a data race that can lead to undefined behaviour.  To avoid a
      potentially bad read/write, volatile semantics and a compiler barrier are
      used for these accesses.
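      
      A minimal kernel-context sketch of the usual way such a race is handled,
      assuming the fix marks the plain accesses with the ONCE accessors (the
      exact lines in the patch may differ):
      
          /* writer side (mmap path, under mmap_lock) */
          WRITE_ONCE(mm->total_vm, mm->total_vm + npages);
          
          /* lockless reader side (e.g. accounting from a timer interrupt) */
          unsigned long vm = READ_ONCE(mm->total_vm);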
      
      Link: https://lkml.kernel.org/r/20210913105550.1569419-1-liupeng256@huawei.com
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7866076b
    • memcg: prohibit unconditional exceeding the limit of dying tasks · a4ebf1b6
      Vasily Averin authored
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It is assumed that the amount of memory charged by those tasks is
      bounded and that most of the memory will get released while the task is
      exiting.  This resembles the heuristic for the global OOM situation when
      tasks get access to memory reserves.  There is no global memory shortage
      at the memcg level, so the memcg heuristic is more relaxed.
      
      The above assumption is overly optimistic though.  E.g.  vmalloc can
      scale to really large requests and the heuristic would allow that.  We
      used to have an early break in the vmalloc allocator for killed tasks,
      but this has been reverted by commit b8c8a338 ("Revert "vmalloc: back off
      when the current task is killed"").  There are likely other similar code
      paths which do not check for fatal signals in an allocation & charge
      loop.  Also, some kernel objects charged to a memcg are not bound to a
      process lifetime.
      
      It has been observed that it is not really hard to trigger these
      bypasses and cause global OOM situation.
      
      One potential way to address these runaways would be to limit the amount
      of excess (similar to the global OOM with limited oom reserves).  This
      is certainly possible but it is not really clear how much of an excess
      is desirable and still protects from global OOMs as that would have to
      consider the overall memcg configuration.
      
      This patch is addressing the problem by removing the heuristic
      altogether.  Bypass is only allowed for requests which either cannot
      fail or where the failure is not desirable while excess should be still
      limited (e.g.  atomic requests).  Implementation wise a killed or dying
      task fails to charge if it has passed the OOM killer stage.  That should
      give all forms of reclaim chance to restore the limit before the failure
      (ENOMEM) and tell the caller to back off.
      
      In addition, this patch renames the should_force_charge() helper to
      task_is_dying(), because its use is no longer associated with forced
      charging.
      
      This patch depends on pagefault_out_of_memory() to not trigger
      out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
      and cause a global OOM killer.
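      
      A minimal kernel-context sketch of what the renamed helper can look like,
      assuming it simply bundles the "this task is on its way out" conditions
      (illustrative, not necessarily the exact mm/memcontrol.c code):
      
          static bool task_is_dying(void)
          {
                  return tsk_is_oom_victim(current) ||
                         fatal_signal_pending(current) ||
                         (current->flags & PF_EXITING);
          }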
      
      Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4ebf1b6
    • mm, oom: do not trigger out_of_memory from the #PF · 60e2793d
      Michal Hocko authored
      
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM,
      which in turn results in pagefault_out_of_memory().  This can happen for
      two different reasons: a) memcg is out of memory and we rely on
      mem_cgroup_oom_synchronize() to perform the memcg OOM handling, or b) a
      normal allocation fails.
      
      The latter is quite problematic because allocation paths already trigger
      out_of_memory and the page allocator tries really hard to not fail
      allocations.  Anyway, if the OOM killer has been already invoked there
      is no reason to invoke it again from the #PF path.  Especially when the
      OOM condition might be gone by that time and we have no way to find out
      other than allocate.
      
      Moreover, if the allocation failed and the OOM killer hasn't been invoked,
      then we are unlikely to do the right thing from the #PF context, because
      we have already lost the allocation context and restrictions and might
      therefore OOM-kill a task from a different NUMA domain.
      
      This all suggests that there is no legitimate reason to trigger
      out_of_memory from pagefault_out_of_memory so drop it.  Just to be sure
      that no #PF path returns with VM_FAULT_OOM without allocation print a
      warning that this is happening before we restart the #PF.
      
      [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
      This is a local problem related to memcg, however, it causes unnecessary
      global OOM kills that are repeated over and over again and escalate into a
      real disaster.  This has been broken since kmem accounting has been
      introduced for cgroup v1 (3.8).  There was no kmem specific reclaim for
      the separate limit so the only way to handle kmem hard limit was to return
      with ENOMEM.  In upstream the problem will be fixed by removing the
      outdated kmem limit, however stable and LTS kernels cannot do it and are
      still affected.  This patch fixes the problem and should be backported
      into stable/LTS.]
      
      Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60e2793d
    • mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks · 0b28179a
      Vasily Averin authored
      
      
      Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
      
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  This can be misused to trigger a global OOM from inside a
      memcg-limited container.  On the other hand, if memcg fails an allocation
      called from inside the #PF handler, it triggers a global OOM from inside
      pagefault_out_of_memory().
      
      To prevent these problems this patchset:
       (a) removes the execution of out_of_memory() from
           pagefault_out_of_memory(), because nobody can explain why it is
           necessary;
       (b) allows memcg to fail allocations of dying/killed tasks.
      
      This patch (of 3):
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM,
      which in turn results in pagefault_out_of_memory(), which in turn executes
      out_of_memory() and can kill a random task.
      
      An allocation might fail when the current task is the oom victim and
      there are no memory reserves left.  The OOM killer is already handled at
      the page allocator level for the global OOM and at the charging level
      for the memcg one.  Both have much more information about the scope of
      allocation/charge request.  This means that either the OOM killer has
      been invoked properly and didn't lead to the allocation success or it
      has been skipped because it couldn't have been invoked.  In both cases
      triggering it from here is pointless and even harmful.
      
      It makes much more sense to let the killed task die rather than to wake
      up an eternally hungry oom-killer and send him to choose a fatter victim
      for breakfast.
      
      Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b28179a
    • mm: list_lru: only add memcg-aware lrus to the global lru list · 3eef1127
      Muchun Song authored
      
      
      Non-memcg-aware lrus are always skipped when traversing the global lru
      list, which is not efficient.  Instead, we can add only the memcg-aware
      lrus to the global lru list, to make the traversal more efficient.
      
      Link: https://lkml.kernel.org/r/20211025124353.55781-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3eef1127
    • mm: memcontrol: remove the kmem states · e80216d9
      Muchun Song authored
      
      
      Now the kmem state is only used to indicate whether the kmem is offline.
      However, we can set ->kmemcg_id to -1 to indicate that instead.  So the
      kmem states can be removed to simplify the code.
      
      Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e80216d9
    • mm: memcontrol: remove kmemcg_id reparenting · 64268868
      Muchun Song authored
      
      
      Since slab objects and kmem pages are charged to the object cgroup
      instead of the memory cgroup, memcg_reparent_objcgs() will reparent this
      cgroup and all its descendants to its parent cgroup.  This already makes
      further list_lru_add() calls add elements to the parent's list.  So it is
      unnecessary to change the kmemcg_id of an offline cgroup to its parent's
      id; it just wastes CPU cycles.  Just remove the redundant code.
      
      Link: https://lkml.kernel.org/r/20211025125102.56533-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64268868
    • mm: list_lru: fix the return value of list_lru_count_one() · 41d17431
      Muchun Song authored
      Since commit 2788cf0c ("memcg: reparent list_lrus and free kmemcg_id on
      css offline"), ->nr_items can be negative during memory cgroup
      reparenting.  In this case, list_lru_count_one() will return an unusual
      and huge value, which can surprise users.  At least for now it hasn't
      affected any users.  But it is better to let list_lru_count_one() return
      zero when ->nr_items is negative.
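      
      A minimal sketch of the behaviour being described, assuming the count is
      simply clamped before being returned (the helper name here is
      illustrative, not from the kernel):
      
          /* illustrative: never report a huge bogus value for a transiently
           * negative item count */
          static unsigned long sanitize_lru_count(long nr_items)
          {
                  return nr_items < 0 ? 0 : nr_items;
          }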
      
      Link: https://lkml.kernel.org/r/20211025124910.56433-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41d17431
    • mm: list_lru: remove holding lru lock · 60ec6a48
      Muchun Song authored
      Since commit e5bc3af7 ("rcu: Consolidate PREEMPT and !PREEMPT
      synchronize_rcu()"), the critical section of a spin lock can serve as an
      RCU read-side critical section, which already allows readers that hold
      nlru->lock to avoid taking the RCU read lock.  So just remove the
      redundant lock holding.
      
      Link: https://lkml.kernel.org/r/20211025124534.56345-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60ec6a48
    • memcg, kmem: further deprecate kmem.limit_in_bytes · 58056f77
      Shakeel Butt authored
      The deprecation process of kmem.limit_in_bytes started with commit
      0158115f ("memcg, kmem: deprecate kmem.limit_in_bytes"), which also
      explains in detail the motivation behind the deprecation.  To summarize,
      it is the unexpected behavior on hitting the kmem limit.  This patch
      moves the deprecation process to the next stage by disallowing setting
      the kmem limit.  In the future we might just remove the
      kmem.limit_in_bytes file completely.
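      
      A minimal kernel-context sketch of the behaviour change, assuming the
      write handler for the v1 kmem limit now simply refuses the request
      (illustrative fragment of a switch on the cgroup file being written):
      
          case _KMEM:
                  /* kmem.limit_in_bytes is deprecated, writing it is no longer supported */
                  ret = -EOPNOTSUPP;
                  break;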
      
      [akpm@linux-foundation.org: s/ENOTSUPP/EOPNOTSUPP/]
      [arnd@arndb.de: mark cancel_charge() inline]
        Link: https://lkml.kernel.org/r/20211022070542.679839-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20211019153408.2916808-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58056f77
    • mm/list_lru.c: prefer struct_size over open coded arithmetic · 16f6bf26
      Len Baker authored
      
      
      As noted in the "Deprecated Interfaces, Language Features, Attributes,
      and Conventions" documentation [1], size calculations (especially
      multiplication) should not be performed in memory allocator (or similar)
      function arguments due to the risk of them overflowing.
      
      This could lead to values wrapping around and a smaller allocation being
      made than the caller was expecting.  Using those allocations could lead
      to linear overflows of heap memory and other misbehaviors.
      
      So, use the struct_size() helper to do the arithmetic instead of the
      argument "size + count * size" in the kvmalloc() functions.
      
      Also, take the opportunity to refactor the memcpy() call to use the
      flex_array_size() helper.
      
      This code was detected with the help of Coccinelle and audited and fixed
      manually.
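      
      A minimal kernel-context sketch of the transformation, assuming a struct
      with a trailing flexible array (the struct and variable names here are
      illustrative, not the ones in mm/list_lru.c):
      
          struct lru_table {
                  unsigned long nr;
                  struct list_head node[];        /* flexible array member */
          };
          
          /* before: open-coded arithmetic, the multiplication may overflow */
          table = kvmalloc(sizeof(*table) + nr * sizeof(struct list_head), GFP_KERNEL);
          
          /* after: overflow-checked size calculation */
          table = kvmalloc(struct_size(table, node, nr), GFP_KERNEL);
          memcpy(table->node, old->node, flex_array_size(old, node, nr));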
      
      [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
      
      Link: https://lkml.kernel.org/r/20211017105929.9284-1-len.baker@gmx.com
      Signed-off-by: Len Baker <len.baker@gmx.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16f6bf26
    • mm/memcg: remove obsolete memcg_free_kmem() · 38d4ef44
      Waiman Long authored
      Since commit d648bcc7 ("mm: kmem: make memcg_kmem_enabled()
      irreversible"), the only thing memcg_free_kmem() does is call
      memcg_offline_kmem() when the memcg is still online, which can happen
      when online_css() fails due to -ENOMEM.
      
      However, the name memcg_free_kmem() is confusing, and it is clearer and
      more straightforward to call memcg_offline_kmem() directly from
      mem_cgroup_css_free().
      
      Link: https://lkml.kernel.org/r/20211005202450.11775-1-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38d4ef44
    • memcg: unify memcg stat flushing · fd25a9e0
      Shakeel Butt authored
      The memcg stats can be flushed in multiple contexts, and potentially in
      parallel too.  For example, multiple parallel user-space readers of memcg
      stats will contend on the rstat locks with each other.  There is no need
      for that; we just need one flusher and everyone else can benefit.
      
      In addition, after commit aa48e47e ("memcg: infrastructure to flush memcg
      stats") the kernel periodically flushes the memcg stats from the root,
      so the other flushers will potentially have much less work to do.
      
      Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd25a9e0
    • memcg: flush stats only if updated · 11192d9c
      Shakeel Butt authored
      At the moment, the kernel flushes the memcg stats on every refault and
      also on every reclaim iteration.  Although rstat maintains a per-cpu
      update tree, on flush the kernel still has to go through every cpu's
      rstat update tree to check whether there is anything to flush.  This
      patch adds tracking on the stats update side to make the flush side more
      clever, by skipping the flush if there is no update.
      
      The stats update codepath is very performance sensitive for many
      workloads and benchmarks.  So, we can not follow what commit aa48e47e
      ("memcg: infrastructure to flush memcg stats") did, which was triggering
      an async flush through queue_work() and caused a lot of performance
      regression reports.  That got reverted by commit 1f828223 ("memcg: flush
      lruvec stats in the refault").
      
      In this patch we keep the stats update codepath very minimal and let the
      stats reader side flush the stats only when the updates are over a
      specific threshold.  For now the threshold is (nr_cpus * CHARGE_BATCH).
      
      To evaluate the impact of this patch, an 8 GiB tmpfs file is created on
      a system with swap-on-zram and the file was pushed to swap through
      memory.force_empty interface.  On reading the whole file, the memcg stat
      flush in the refault code path is triggered.  With this patch, we
      observed 63% reduction in the read time of 8 GiB file.
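      
      A minimal kernel-context sketch of the idea, assuming a shared counter is
      bumped on the update side and checked on the read/flush side (all names
      here are illustrative, not the exact mm/memcontrol.c implementation):
      
          static atomic_t stats_updates;          /* illustrative update counter */
          
          static inline void memcg_note_stat_update(void)
          {
                  atomic_inc(&stats_updates);
          }
          
          static void memcg_flush_stats_if_needed(void)
          {
                  /* only pay for a full rstat walk once enough has changed;
                   * the threshold mirrors the (nr_cpus * CHARGE_BATCH) above */
                  if (atomic_read(&stats_updates) > num_online_cpus() * MEMCG_CHARGE_BATCH) {
                          atomic_set(&stats_updates, 0);
                          cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
                  }
          }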
      
      Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: "Michal Koutný" <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11192d9c
    • mm/memcg: drop swp_entry_t* in mc_handle_file_pte() · 48384b0b
      Peter Xu authored
      It is unused after the rework of commit f5df8635 ("mm: use
      find_get_incore_page in memcontrol").
      
      Link: https://lkml.kernel.org/r/20210916193014.80129-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48384b0b
    • mm: optimise put_pages_list() · 988c69f1
      Matthew Wilcox (Oracle) authored
      
      
      Instead of calling put_page() one page at a time, pop pages off the list
      if their refcount was too high and pass the remainder to
      put_unref_page_list().  This should be a speed improvement, but I have
      no measurements to support that.  Current callers do not care about
      performance, but I hope to add some which do.
      
      Link: https://lkml.kernel.org/r/20211007192138.561673-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      988c69f1
    • mm/swapfile: fix an integer overflow in swap_show() · 642929a2
      Rafael Aquini authored
      
      
      This one is just a minor nuisance for people going through /proc/swaps
      if any of their swap areas is bigger than, or equal to, 1073741824 pages
      (4 TB).
      
      The seq_printf() format string casts the pages-to-KB conversion to uint,
      and that will overflow in the aforementioned case.
      
      Although it is almost unthinkable that someone would actually set up such
      a big single swap area, there is a ticket recently filed against RHEL:
      https://bugzilla.redhat.com/show_bug.cgi?id=2008812
      
      Given that all other code sites that use format strings for the same swap
      pages-to-KB conversion cast it as ulong, this patch just follows suit.
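      
      A small stand-alone illustration (plain C on a 64-bit system, not the
      kernel code) of why the uint cast wraps for a 4 TB swap area made of
      4 KiB pages:
      
          #include <stdio.h>
          
          int main(void)
          {
                  unsigned long pages = 1073741824UL;     /* 4 TB worth of 4 KiB pages */
          
                  /* "pages << 2" converts 4 KiB pages to KB, as the kernel does
                   * with << (PAGE_SHIFT - 10) */
                  printf("as uint:  %u KB\n", (unsigned int)(pages << 2)); /* wraps to 0 */
                  printf("as ulong: %lu KB\n", pages << 2);                /* 4294967296 KB */
                  return 0;
          }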
      
      Link: https://lkml.kernel.org/r/20211006184011.2579054-1-aquini@redhat.com
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      642929a2
    • mm/swapfile: remove needless request_queue NULL pointer check · 363dc512
      vulab authored
      
      
      The request_queue pointer returned from bdev_get_queue() shall never be
      NULL, so the NULL check is unnecessary; just remove it.
      
      Link: https://lkml.kernel.org/r/20210917082111.33923-1-vulab@iscas.ac.cn
      Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      363dc512