  1. Nov 07, 2021
• mm/page_alloc: use accumulated load when building node fallback list · 54d032ce
      Krupa Ramakrishnan authored
      In build_zonelists(), when the fallback list is built for the nodes, the
      node load gets reinitialized during each iteration.  This results in
      nodes with same distances occupying the same slot in different node
fallback lists rather than appearing in the intended round-robin
manner.  This results in one node getting picked for allocation more
often than other nodes at the same distance.
      
      As an example, consider a 4 node system with the following distance
      matrix.
      
        Node 0  1  2  3
        ----------------
        0    10 12 32 32
        1    12 10 32 32
        2    32 32 10 12
        3    32 32 12 10
      
      For this case, the node fallback list gets built like this:
      
        Node  Fallback list
        ---------------------
        0     0 1 2 3
        1     1 0 3 2
        2     2 3 0 1
        3     3 2 0 1 <-- Unexpected fallback order
      
      In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
      same order which results in more allocations getting satisfied from node
      0 compared to node 1.
      
      The effect of this on remote memory bandwidth as seen by stream
      benchmark is shown below:
      
        Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
      	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
        Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
      	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	57479.6		110791.8
           SCALE	55372.9		105685.9
             ADD	50460.6		96734.2
          TRIADD	50397.6		97119.1
        ----------------------------------------
      
      The bandwidth drop in Case 1 occurs because most of the allocations get
      satisfied by node 0 as it appears first in the fallback order for both
      nodes 2 and 3.
      
This can be fixed by accumulating the node load in build_zonelists()
rather than reinitializing it during each iteration.  With this, nodes
at the same distance are correctly assigned in a round-robin manner.
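
A minimal sketch of the idea (condensed for illustration, not the exact
kernel diff; find_next_best_node(), node_distance() and node_load[] are
the existing page_alloc.c symbols):

  while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
      /*
       * Penalize only the first node of each distance group so that
       * nodes at the same distance rotate round-robin between the
       * fallback lists of different local nodes.
       */
      if (node_distance(local_node, node) !=
          node_distance(local_node, prev_node))
          node_load[node] += load;   /* accumulate, don't overwrite */
      node_order[nr_nodes++] = node;
      prev_node = node;
      load--;
  }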
      
      In fact this was how it was originally until commit f0c0b2b8
      ("change zonelist order: zonelist order selection logic") dropped the
      load accumulation and resorted to initializing the load during each
      iteration.
      
      While zonelist ordering was removed by commit c9bff3ee ("mm,
      page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
      accumulation in build_zonelists() remained.  So essentially this patch
      reverts back to the accumulated node load logic.
      
      After this fix, the fallback order gets built like this:
      
        Node Fallback list
        ------------------
        0    0 1 2 3
        1    1 0 3 2
        2    2 3 0 1
        3    3 2 1 0 <-- Note the change here
      
      The bandwidth in Case 1 improves and matches Case 2 as shown below.
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	110438.9	110107.2
           SCALE	105930.5	105817.5
             ADD	97005.1		96159.8
          TRIADD	97441.5		96757.1
        ----------------------------------------
      
The correctness of the fallback list generation has been verified for
the above node configuration, where node 3 starts as a memory-less node
and comes online only during memory hotplug.
      
      [bharata@amd.com: Added changelog, review, test validation]
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
Fixes: f0c0b2b8 ("change zonelist order: zonelist order selection logic")
Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Bharata B Rao <bharata@amd.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54d032ce
• mm/page_alloc: print node fallback order · 6cf25392
      Bharata B Rao authored
      
      
      Patch series "Fix NUMA nodes fallback list ordering".
      
For a NUMA system that has multiple nodes at the same distance from
other nodes, the fallback list generation prefers the same node order
for them instead of round-robin, thereby penalizing one node over the
others.  This series fixes it.
      
      More description of the problem and the fix is present in the patch
      description.
      
      This patch (of 2):
      
      Print information message about the allocation fallback order for each
      NUMA node during boot.
      
      No functional changes here.  This makes it easier to illustrate the
      problem in the node fallback list generation, which the next patch
      fixes.
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
      Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
Signed-off-by: Bharata B Rao <bharata@amd.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cf25392
• mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid] · ba7f1b9e
      Miaohe Lin authored
      
      
Don't use alloc_pages_exact[_nid] with __GFP_HIGHMEM, because
page_address() cannot represent highmem pages without kmap().  Newly
allocated pages would leak, as page_address() returns NULL for highmem
pages here.  It only works today because no caller currently passes
__GFP_HIGHMEM.
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba7f1b9e
• mm/page_alloc.c: use helper function zone_spans_pfn() · 86fb05b9
      Miaohe Lin authored
      
      
Use the helper function zone_spans_pfn() to check whether a pfn is
within a zone; this simplifies the code slightly.
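
For reference, zone_spans_pfn() (include/linux/mmzone.h) wraps the
open-coded bounds check; the simplification looks roughly like this
(illustrative):

  /* before: open-coded range check */
  if (zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone))
      do_something();

  /* after: use the helper */
  if (zone_spans_pfn(zone, pfn))
      do_something();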
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86fb05b9
• mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk() · 7cba630b
      Miaohe Lin authored
      
      
The second two paragraphs, about "all pages pinned" and pages_scanned,
are obsolete.  Also, there are now PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP
orders in the pcp lists, so the same-order assumption no longer holds.
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cba630b
• mm/page_alloc.c: simplify the code by using macro K() · ff7ed9e4
      Miaohe Lin authored
      
      
Use the helper macro K() to convert page counts to the corresponding
size in kilobytes.  Minor readability improvement.
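
For context, K() in mm/page_alloc.c converts a page count to kilobytes:

  #define K(x) ((x) << (PAGE_SHIFT - 10))

so a report line can be written as, e.g. (illustrative):

  pr_info("managed: %lukB\n", K(zone_managed_pages(zone)));

instead of open-coding the "<< (PAGE_SHIFT - 10)" shift at every call site.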
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff7ed9e4
• mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order() · ea808b4e
      Miaohe Lin authored
      
      
      Patch series "Cleanups and fixup for page_alloc", v2.
      
This series contains cleanups to remove a meaningless VM_BUG_ON(), use
helpers to simplify the code, and remove an obsolete comment.  We also
avoid allocating highmem pages via alloc_pages_exact[_nid].  More details
can be found in the respective changelogs.
      
      This patch (of 5):
      
      It's meaningless to VM_BUG_ON() order != pageblock_order just after
      setting order to pageblock_order.  Remove it.
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210902121242.41607-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea808b4e
• mm/large system hash: avoid possible NULL deref in alloc_large_system_hash · 084f7e23
      Eric Dumazet authored
If __vmalloc() returns NULL, is_vm_area_hugepages(NULL) will fault when
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y.
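
A hedged sketch of the defensive pattern described (illustrative, not
necessarily the exact hunk in alloc_large_system_hash()):

  table = __vmalloc(size, gfp_flags);
  huge = false;
  if (table)    /* only query a mapping that actually exists */
      huge = is_vm_area_hugepages(table);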
      
      Link: https://lkml.kernel.org/r/20210915212530.2321545-1-eric.dumazet@gmail.com
Fixes: 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      084f7e23
• lib/test_vmalloc.c: use swap() to make code cleaner · 34b46efd
      Changcheng Deng authored
      
      
Use swap() to make the code cleaner.  Issue found by coccinelle.
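
The typical shape of such a cleanup (illustrative; the actual variables
in lib/test_vmalloc.c may differ):

  /* before: open-coded three-statement swap via a temporary */
  tmp = arr[i];
  arr[i] = arr[j];
  arr[j] = tmp;

  /* after: the kernel's swap() helper */
  swap(arr[i], arr[j]);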
      
      Link: https://lkml.kernel.org/r/20211028111443.15744-1-deng.changcheng@zte.com.cn
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34b46efd
• mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation · c00b6b96
      Chen Wandun authored
Commit ffb29b1c ("mm/vmalloc: fix numa spreading for large hash
tables") can cause significant performance regressions in some
situations, as Andrew mentioned in [1].  The main situation is vmalloc:
vmalloc allocates pages with NUMA_NO_NODE by default, which results in
allocating pages one by one.
      
      In order to solve this, __alloc_pages_bulk and mempolicy should be
      considered at the same time.
      
1) If a node is specified in the memory allocation request, all pages
   are allocated with __alloc_pages_bulk.
      
2) For interleaved allocation, it calculates how many pages should be
   allocated on each node, and uses __alloc_pages_bulk to allocate the
   pages on each node (see the sketch below).
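
A minimal sketch of the per-node split for the interleaved case
(hypothetical variable names, simplified from the description above; the
real code feeds these counts into __alloc_pages_bulk per node):

  unsigned long per_node = nr_pages / nr_nodes;
  unsigned long delta    = nr_pages % nr_nodes;

  for (i = 0; i < nr_nodes; i++) {
      /* the first 'delta' nodes get one extra page */
      unsigned long want = per_node + (i < delta ? 1 : 0);

      /* bulk-allocate 'want' pages from the i-th interleave node */
  }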
      
      [1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: make two functions static]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      
      Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c00b6b96
• mm/vmalloc: be more explicit about supported gfp flags · b7d90e7a
      Michal Hocko authored
      
      
The core of the vmalloc allocator, __vmalloc_area_node, doesn't say
anything about its gfp mask argument.  Not all gfp flags are supported,
though.  Be more explicit about the constraints.
      
      Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7d90e7a
• kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC · 3252b1d8
      Kefeng Wang authored
      
      
      With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:
      
        Unable to handle kernel paging request at virtual address ffff7000028f2000
        ...
        swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
        [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
        Internal error: Oops: 96000007 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
        Hardware name: linux,dummy-virt (DT)
        pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
        pc : kasan_check_range+0x90/0x1a0
        lr : memcpy+0x88/0xf4
        sp : ffff80001378fe20
        ...
        Call trace:
         kasan_check_range+0x90/0x1a0
         pcpu_page_first_chunk+0x3f0/0x568
         setup_per_cpu_areas+0xb8/0x184
         start_kernel+0x8c/0x328
      
The vm area used in vm_area_register_early() has no kasan shadow memory.
Add a new kasan_populate_early_vm_area_shadow() function to populate the
vm area shadow memory and fix the issue.
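
A hedged sketch of the hook's shape (the function name comes from this
changelog; the weak default and call site below are illustrative):

  /* Generic no-op unless the architecture provides an implementation. */
  void __init __weak kasan_populate_early_vm_area_shadow(void *start,
                                                         unsigned long size)
  {
  }

  /* Called from vm_area_register_early(), so early vm areas such as the
   * percpu first chunk get shadow memory before they are used. */
  kasan_populate_early_vm_area_shadow(vm->addr, vm->size);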
      
      [wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
        Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com
      
      Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Marco Elver <elver@google.com>		[KASAN]
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>	[KASAN]
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3252b1d8
• arm64: support page mapping percpu first chunk allocator · 09cea619
      Kefeng Wang authored
      
      
The percpu embedded first chunk allocator is the first option, but it
could fail on ARM64, e.g.,
      
        percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
        percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
        percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000
      
      then we could get
      
        WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838
      
      and the system could not boot successfully.
      
      Let's implement page mapping percpu first chunk allocator as a fallback
      to the embedding allocator to increase the robustness of the system.
      
      Link: https://lkml.kernel.org/r/20210910053354.26721-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cea619
• vmalloc: choose a better start address in vm_area_register_early() · 0eb68437
      Kefeng Wang authored
      
      
The percpu embedded first chunk allocator is the first option, but it
could fail on ARM64, e.g.,
      
        percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
        percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
        percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000
      
      then we could get to
      
        WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838
      
      and the system cannot boot successfully.
      
      Let's implement page mapping percpu first chunk allocator as a fallback
      to the embedding allocator to increase the robustness of the system.
      
Also fix a crash when both NEED_PER_CPU_PAGE_FIRST_CHUNK and
KASAN_VMALLOC are enabled.
      
      Tested on ARM64 qemu with cmdline "percpu_alloc=page".
      
      This patch (of 3):
      
There are some fixed locations in the vmalloc area that are reserved on
ARM (see iotable_init()) and ARM64 (see map_kernel()).  For
pcpu_page_first_chunk(), vm_area_register_early() is called and chooses
VMALLOC_START as the start address of the vmap area, which can conflict
with the addresses above and then trigger a BUG_ON in
vm_area_add_early().

Let's choose a suitable start address by traversing the vmlist.
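
A simplified sketch of the traversal (assuming the vmlist is sorted by
address, as vm_area_add_early() keeps it; not the exact patch):

  unsigned long addr = ALIGN(VMALLOC_START, align);
  struct vm_struct *cur;

  for (cur = vmlist; cur; cur = cur->next) {
      /* stop at the first gap big enough for this area */
      if ((unsigned long)cur->addr - addr >= vm->size)
          break;
      /* otherwise continue searching after the reserved area */
      addr = ALIGN((unsigned long)cur->addr + cur->size, align);
  }
  vm->addr = (void *)addr;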
      
      Link: https://lkml.kernel.org/r/20210910053354.26721-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20210910053354.26721-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0eb68437
• vmalloc: back off when the current task is OOM-killed · dd544141
      Vasily Averin authored
      
      
A huge vmalloc allocation on a heavily loaded node can lead to a global
memory shortage.  The task that called vmalloc can have the worst badness
and be selected by the OOM killer, but the fatal signal it receives does
not interrupt the allocation cycle.  Vmalloc repeats the page allocations
again and again, exacerbating the crisis and consuming the memory freed
up by other killed tasks.
      
After a successful completion of the allocation procedure, the fatal
signal will be processed and the task finally destroyed.  However, this
may not release the consumed memory, since the allocated object may have
a lifetime unrelated to the completed task.  In the worst case, this can
lead to the host panicking due to "Out of memory and no killable
processes...".
      
This patch allows the OOM killer to break the vmalloc cycle, which makes
OOM handling more effective and avoids a host panic.  It does not check
the OOM condition directly; instead, it breaks the page allocation cycle
when a fatal signal has been received.
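
A simplified sketch of the back-off (illustrative; the real check sits in
vmalloc's page allocation loop):

  while (nr_allocated < nr_pages) {
      if (fatal_signal_pending(current))
          break;    /* OOM-killed: stop retrying and let vmalloc fail */

      page = alloc_pages_node(nid, gfp, 0);
      if (!page)
          break;
      pages[nr_allocated++] = page;
  }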
      
This may trigger some hidden problems when a caller does not handle
vmalloc failures, or when a rollback after a failed vmalloc calls its own
vmallocs inside.  However, all of these scenarios are incorrect: vmalloc
does not guarantee successful allocation, it has never been called with
__GFP_NOFAIL, and therefore it either should not be used for any
rollbacks or such errors should be handled correctly and not lead to
critical failures.
      
      Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd544141
• mm/vmalloc: check various alignments when debugging · 066fed59
      Uladzislau Rezki (Sony) authored
      
      
Previously, we did not guarantee a free block with the lowest start
address for allocations with alignment >= PAGE_SIZE, because the
alignment overhead was included in the search length like below:

     length = size + align - 1;

Doing so made sure that a bigger block would fit after applying the
alignment adjustment.  Now there is no such limitation, i.e.  any
alignment that the user wants to apply will result in the lowest address
of the returned free area.
      
      Link: https://lkml.kernel.org/r/20211004142829.22222-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Ping Fang <pifang@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      066fed59
• mm/vmalloc: do not adjust the search size for alignment overhead · 9f531973
      Uladzislau Rezki (Sony) authored
We used to include the alignment overhead in the search length; in that
case we guaranteed that a found area would definitely fit after applying
the specific alignment that the user specifies.  On the other hand, we
did not guarantee that an area has the lowest address if the alignment
is >= PAGE_SIZE.
      
It means that when a user specifies a special alignment together with a
range that corresponds to an exact requested size, the allocation will
fail.  This is what happens to KASAN: it wants a free block that exactly
matches a specified range during the onlining of memory banks:
      
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
          [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
          vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
          bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
          CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
          Call Trace:
           dump_stack+0x8e/0xd0
           warn_alloc.cold.90+0x8a/0x1b2
           ? zone_watermark_ok_safe+0x300/0x300
           ? slab_free_freelist_hook+0x85/0x1a0
           ? __get_vm_area_node+0x240/0x2c0
           ? kfree+0xdd/0x570
           ? kmem_cache_alloc_node_trace+0x157/0x230
           ? notifier_call_chain+0x90/0x160
           __vmalloc_node_range+0x465/0x840
           ? mark_held_locks+0xb7/0x120
      
Fix it by making sure that find_vmap_lowest_match() returns the lowest
start address for any given alignment value, i.e.  for alignments bigger
than PAGE_SIZE the algorithm rolls back toward parent nodes, checking
right sub-trees if the leftmost free block did not fit due to the
alignment overhead.
      
      Link: https://lkml.kernel.org/r/20211004142829.22222-1-urezki@gmail.com
Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reported-by: Ping Fang <pifang@redhat.com>
Tested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f531973
• mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo · 7cc7913e
      Eric Dumazet authored
If the last va found in vmap_area_list does not have a vm pointer,
vmallocinfo.s_show() returns 0, and show_purge_info() is not called as
it should be.
      
      Link: https://lkml.kernel.org/r/20211001170815.73321-1-eric.dumazet@gmail.com
Fixes: dd3b8353 ("mm/vmalloc: do not keep unpurged areas in the busy tree")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Pengfei Li <lpf.vector@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cc7913e
• mm/vmalloc: make show_numa_info() aware of hugepage mappings · 51e50b3a
      Eric Dumazet authored
      
      
      show_numa_info() can be slightly faster, by skipping over hugepages
      directly.
      
      Link: https://lkml.kernel.org/r/20211001172725.105824-1-eric.dumazet@gmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51e50b3a
• mm/vmalloc: don't allow VM_NO_GUARD on vmap() · bd1a8fb2
      Peter Zijlstra authored
      
      
The vmalloc guard pages are added on top of each allocation, thereby
isolating any two allocations from one another.  The top guard of the
lower allocation is the bottom guard of the higher allocation, etc.
      
      Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
      isolating separate allocations.
      
      There are only two in-tree users of this flag, neither of which use it
      through the exported interface.  Ensure it stays this way.
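
One way to enforce this at the exported entry point, sketched (whether
the actual patch strips the flag or fails the call is not spelled out
above, so treat this purely as an illustration):

  void *vmap(struct page **pages, unsigned int count,
             unsigned long flags, pgprot_t prot)
  {
      /* vmap() is exported: never honour VM_NO_GUARD from outside */
      if (WARN_ON_ONCE(flags & VM_NO_GUARD))
          flags &= ~VM_NO_GUARD;

      /* ... rest of vmap() unchanged ... */
  }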
      
      Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd1a8fb2
• mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node() · 228f778e
      Vasily Averin authored
Commit f255935b ("mm: cleanup the gfp_mask handling in
__vmalloc_area_node") added __GFP_NOWARN to gfp_mask unconditionally;
however, this disabled all output inside the warn_alloc() call.  This
patch saves the original gfp_mask and provides it to all warn_alloc()
calls.
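
A condensed sketch of that fix (illustrative):

  const gfp_t orig_gfp_mask = gfp_mask;  /* remember the caller's flags */

  gfp_mask |= __GFP_NOWARN;              /* keep page-level failures silent */

  /* ... allocation attempts ... */

  warn_alloc(orig_gfp_mask, NULL,        /* but warn with the original mask */
             "vmalloc error: failed to allocate %u pages", area->nr_pages);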
      
      Link: https://lkml.kernel.org/r/f4f3187b-9684-e426-565d-827c2a9bbb0e@virtuozzo.com
Fixes: f255935b ("mm: cleanup the gfp_mask handling in __vmalloc_area_node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      228f778e
• mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN · 627ae828
      Gang Li authored
      
      
By using DECLARE_EVENT_CLASS and TRACE_EVENT_FN, we can save a lot of
space by avoiding duplicated code.
      
      Link: https://lkml.kernel.org/r/20211009071243.70286-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      627ae828
• mm: mmap_lock: remove redundant newline in TP_printk · f595e341
      Gang Li authored
      
      
The ftrace core adds a newline automatically when printing, so using one
in TP_printk creates a blank line.
      
      Link: https://lkml.kernel.org/r/20211009071105.69544-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f595e341
• include/linux/io-mapping.h: remove fallback for writecombine · 2e86f78b
      Lucas De Marchi authored
The fallback was introduced in commit 80c33624 ("io-mapping: Fixup
for different names of writecombine") to fix the build on microblaze.

Five years later, it seems all archs now provide a pgprot_writecombine(),
so just remove the other possible fallbacks.  For microblaze,
pgprot_writecombine() is available since commit 97ccedd7
("microblaze: Provide pgprot_device/writecombine macros for nommu").
      
This is build-tested on microblaze with a hack to always build
mm/io-mapping.o and without DIYing an x86-only macro
(_PAGE_CACHE_MASK).
      
      Link: https://lkml.kernel.org/r/20211020204838.1142908-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e86f78b
• mm/mremap: don't account pages in vma_to_resize() · fdbef614
      Dmitry Safonov authored
All this vm_unacct_memory(charged) dance seems to complicate life
without a good reason.  Furthermore, it does not always seem to be done
right on the error paths in mremap_to().  And worse than that: this
`charged' difference is sometimes double-accounted for growing
MREMAP_DONTUNMAP mremap()s in move_vma():
      
      	if (security_vm_enough_memory_mm(mm, new_len >> PAGE_SHIFT))
      
Let's not do this.  Account memory in the mremap() fast path for growing
VMAs, or in move_vma() for actually moving things.  This is the same,
simpler way it is done by vm_stat_account(), but with the difference that
security_vm_enough_memory_mm() is called before copying/adjusting the VMA.
      
      Originally noticed by Chen Wandun:
      https://lkml.kernel.org/r/20210717101942.120607-1-chenwandun@huawei.com
      
      Link: https://lkml.kernel.org/r/20210721131320.522061-1-dima@arista.com
Fixes: e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Brian Geffon <bgeffon@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yongjun <weiyongjun1@huawei.com>
      Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdbef614
• mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey() · 6af5fa0d
      Liu Song authored
      
      
      After adjustment, the repeated assignment of "prev" is avoided, and the
      readability of the code is improved.
      
      Link: https://lkml.kernel.org/r/20211012152444.4127-1-fishland@aliyun.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6af5fa0d
• memory: remove unused CONFIG_MEM_BLOCK_SIZE · e26e0cc3
      Lukas Bulwahn authored
Commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
functions") defines CONFIG_MEM_BLOCK_SIZE, but this has never been
utilized anywhere.
      
      It is a good practice to keep the CONFIG_* defines exclusively for the
      Kbuild system.  So, drop this unused definition.
      
      This issue was noticed due to running ./scripts/checkkconfigsymbols.py.
      
      Link: https://lkml.kernel.org/r/20211006120354.7468-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e26e0cc3
• Documentation: update pagemap with shmem exceptions · cbbb69d3
      Tiberiu A Georgescu authored
      
      
      This patch follows the discussions on previous documentation patch
      threads [1][2].  It presents the exception case of shared memory
      management from the pagemap's point of view.  It briefly describes what
      is missing, why it is missing and alternatives to the pagemap for page
      info retrieval in user space.
      
      In short, the kernel does not keep track of PTEs for swapped out shared
pages within the processes that reference them.  Thus, the
      proc/pid/pagemap tool cannot print the swap destination of the shared
      memory pages, instead setting the pagemap entry to zero for both
      non-allocated and swapped out pages.  This can create confusion for
      users who need information on swapped out pages.
      
      The reasons why maintaining the PTEs of all swapped out shared pages
      among all processes while maintaining similar performance is not a
      trivial task, or a desirable change, have been discussed extensively
      [1][3][4][5].  There are also arguments for why this arguably missing
      information should eventually be exposed to the user in either a future
      pagemap patch, or by an alternative tool.
      
      [1]: https://marc.info/?m=162878395426774
      [2]: https://lore.kernel.org/lkml/20210920164931.175411-1-tiberiu.georgescu@nutanix.com/
      [3]: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
      [4]: https://lore.kernel.org/lkml/20210807032521.7591-1-peterx@redhat.com/
      [5]: https://lore.kernel.org/lkml/20210715201651.212134-1-peterx@redhat.com/
      
      Mention the current missing information in the pagemap and alternatives
      on how to retrieve it, in case someone stumbles upon unexpected
      behaviour.
      
      Link: https://lkml.kernel.org/r/20210923064618.157046-1-tiberiu.georgescu@nutanix.com
      Link: https://lkml.kernel.org/r/20210923064618.157046-2-tiberiu.georgescu@nutanix.com
Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbbb69d3
• mm: remove redundant smp_wmb() · ed33b5a6
      Qi Zheng authored
      
      
The smp_wmb() in __pte_alloc() is used to ensure that all pte setup is
visible before the pte is made visible to other CPUs by being put into
page tables.  We only need this when the pte is actually populated, so
move it to pmd_install().  __pte_alloc_kernel(), __p4d_alloc(),
__pud_alloc() and __pmd_alloc() are similar cases.
      
We can also defer smp_wmb() to the place where the pmd entry is really
populated by the preallocated pte.  There are two kinds of users of the
preallocated pte: one is filemap & finish_fault(), the other is THP.  The
former does not need another smp_wmb() because the smp_wmb() has already
been done by pmd_install().  Fortunately, the latter also does not need
another smp_wmb() because there is already a smp_wmb() before populating
the new pte when THP uses a preallocated pte to split a huge pmd.
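
A condensed sketch of pmd_install() with the barrier at the populate site
(simplified; see mm/memory.c for the real function):

  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
  {
      spinlock_t *ptl = pmd_lock(mm, pmd);

      if (likely(pmd_none(*pmd))) {    /* nobody populated it meanwhile */
          mm_inc_nr_ptes(mm);
          /*
           * Make the zeroed pte page visible before it is published
           * through the pmd entry.
           */
          smp_wmb();
          pmd_populate(mm, pmd, *pte);
          *pte = NULL;
      }
      spin_unlock(ptl);
  }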
      
      Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mika Penttila <mika.penttila@nextfour.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed33b5a6
• mm: introduce pmd_install() helper · 03c4f204
      Qi Zheng authored
      
      
      Patch series "Do some code cleanups related to mm", v3.
      
      This patch (of 2):
      
Currently we have the same few lines repeated three times in the code.
Deduplicate them with the newly introduced pmd_install() helper.
      
      Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mika Penttila <mika.penttila@nextfour.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03c4f204
• mm: add zap_skip_check_mapping() helper · 91b61ef3
      Peter Xu authored
      
      
Use the helper for the checks.  Rename "check_mapping" to
"zap_mapping" because "check_mapping" looks like a bool but in fact it
      stores the mapping itself.  When it's set, we check the mapping (it must
      be non-NULL).  When it's cleared we skip the check, which works like the
      old way.
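
A sketch of the helper's shape (simplified from the description above):

  /* Skip the page if a mapping filter is set and this page does not
   * belong to it; with no filter, nothing is skipped. */
  static inline bool zap_skip_check_mapping(struct zap_details *details,
                                            struct page *page)
  {
      if (!details || !page)
          return false;

      return details->zap_mapping &&
             details->zap_mapping != page_rmapping(page);
  }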
      
      Move the duplicated comments to the helper too.
      
      Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91b61ef3
• mm: drop first_index/last_index in zap_details · 232a6a1c
      Peter Xu authored
      
      
      The first_index/last_index parameters in zap_details are actually only
used in unmap_mapping_range_tree().  Meanwhile, this function is
      only called by unmap_mapping_pages() once.
      
      Instead of passing these two variables through the whole stack of page
      zapping code, remove them from zap_details and let them simply be
      parameters of unmap_mapping_range_tree(), which is inlined.
      
      Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      232a6a1c
• mm: clear vmf->pte after pte_unmap_same() returns · 2ca99358
      Peter Xu authored
      
      
      pte_unmap_same() will always unmap the pte pointer.  After the unmap,
vmf->pte will not be valid any more, so we should clear it.
      
      It was safe only because no one is accessing vmf->pte after
      pte_unmap_same() returns, since the only caller of pte_unmap_same() (so
      far) is do_swap_page(), where vmf->pte will in most cases be overwritten
      very soon.
      
      Directly pass in vmf into pte_unmap_same() and then we can also avoid
      the long parameter list too, which should be a nice cleanup.
      
      Link: https://lkml.kernel.org/r/20210915181533.11188-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ca99358
• mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte · 9ae0f87d
      Peter Xu authored
      
      
      Patch series "mm: A few cleanup patches around zap, shmem and uffd", v4.
      
      IMHO all of them are very nice cleanups to existing code already,
they're all small and self-contained.  They'll be needed by the coming
uffd-wp series.
      
      This patch (of 4):
      
It was done conditionally before, as there's one shmem special case in
which we use SetPageDirty() instead.  However, that's not necessary and
it should be easier and cleaner to do it unconditionally in
mfill_atomic_install_pte().
      
      The most recent discussion about this is here, where Hugh explained the
      history of SetPageDirty() and why it's possible that it's not required
      at all:
      
      https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
      
      Currently mfill_atomic_install_pte() has three callers:
      
              1. shmem_mfill_atomic_pte
              2. mcopy_atomic_pte
              3. mcontinue_atomic_pte
      
      After the change: case (1) should have its SetPageDirty replaced by the
      dirty bit on pte (so we unify them together, finally), case (2) should
      have no functional change at all as it has page_in_cache==false, case
      (3) may add a dirty bit to the pte.  However since case (3) is
UFFDIO_CONTINUE for shmem, it's nearly 100% sure the page is dirty after
      all because UFFDIO_CONTINUE normally requires another process to modify
      the page cache and kick the faulted thread, so should not make a real
      difference either.
      
This should make it much easier to follow which cases will set dirty
for uffd, as we'll simply set it now for all uffd-related ioctls.
Meanwhile, there is no special handling of SetPageDirty() if there's
no need.
      
      Link: https://lkml.kernel.org/r/20210915181456.10739-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210915181456.10739-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ae0f87d
• mm/memory.c: avoid unnecessary kernel/user pointer conversion · b063e374
      Amit Daniel Kachhap authored
      
      
      Annotating a pointer from __user to kernel and then back again might
      confuse sparse.  In copy_huge_page_from_user() it can be avoided by
      removing the intermediate variable since it is never used.
      
      Link: https://lkml.kernel.org/r/20210914150820.19326-1-amit.kachhap@arm.com
Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b063e374
• mm: use __pfn_to_section() instead of open coding it · f1dc0db2
      Rolf Eike Beer authored
      
      
      It is defined in the same file just a few lines above.
      
      Link: https://lkml.kernel.org/r/4598487.Rc0NezkW7i@mobilepool36.emlix.com
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1dc0db2
• mm/mmap.c: fix a data race of mm->total_vm · 7866076b
      Peng Liu authored
      
      
The variable mm->total_vm can be accessed concurrently during mmapping
and system accounting, as noticed by KCSAN,
      
        BUG: KCSAN: data-race in __acct_update_integrals / mmap_region
      
        read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3:
         mmap_region+0x6dc/0x1400
         do_mmap+0x794/0xca0
         vm_mmap_pgoff+0xdf/0x150
         ksys_mmap_pgoff+0xe1/0x380
         do_syscall_64+0x37/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2:
         __acct_update_integrals+0x187/0x1d0
         acct_account_cputime+0x3c/0x40
         update_process_times+0x5c/0x150
         tick_sched_timer+0x184/0x210
         __run_hrtimer+0x119/0x3b0
         hrtimer_interrupt+0x350/0xaa0
         __sysvec_apic_timer_interrupt+0x7b/0x220
         asm_call_irq_on_stack+0x12/0x20
         sysvec_apic_timer_interrupt+0x4d/0x80
         asm_sysvec_apic_timer_interrupt+0x12/0x20
         smp_call_function_single+0x192/0x2b0
         perf_install_in_context+0x29b/0x4a0
         __se_sys_perf_event_open+0x1a98/0x2550
         __x64_sys_perf_event_open+0x63/0x70
         do_syscall_64+0x37/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Reported by Kernel Concurrency Sanitizer on:
        CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        Ubuntu-1.8.2-1ubuntu1 04/01/2014
      
      vm_stat_account(), which is called by mmap_region(), increases total_vm,
      and __acct_update_integrals() may read total_vm at the same time.  This
      is a data race which can lead to undefined behaviour.  To avoid the
      potentially bad read/write, a volatile access and a barrier are used on
      both sides.
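
      A runnable user-space sketch of the marked-access idea: the two macros
      below are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE()
      (the real ones do more), and struct my_mm with its total_vm field merely
      models the racing update and read:

        #include <pthread.h>
        #include <stdio.h>

        #define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
        #define READ_ONCE(x)       (*(volatile typeof(x) *)&(x))

        struct my_mm { unsigned long total_vm; };
        static struct my_mm mm;

        static void *mapper(void *arg)        /* models mmap_region() */
        {
                (void)arg;
                for (int i = 0; i < 100000; i++)
                        WRITE_ONCE(mm.total_vm, mm.total_vm + 1);
                return NULL;
        }

        static void *accounter(void *arg)     /* models __acct_update_integrals() */
        {
                (void)arg;
                unsigned long seen = READ_ONCE(mm.total_vm);

                printf("sampled total_vm = %lu\n", seen);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, mapper, NULL);
                pthread_create(&b, NULL, accounter, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                printf("final total_vm = %lu\n", mm.total_vm);
                return 0;
        }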
      
      Link: https://lkml.kernel.org/r/20210913105550.1569419-1-liupeng256@huawei.com
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7866076b
    • Vasily Averin's avatar
      memcg: prohibit unconditional exceeding the limit of dying tasks · a4ebf1b6
      Vasily Averin authored
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It is assumed that the amount of memory charged by those tasks
      is bounded and that most of it will be released while the task is
      exiting.  This resembles the heuristic for the global OOM situation,
      where such tasks get access to memory reserves.  There is no global
      memory shortage at the memcg level, so the memcg heuristic can be more
      relaxed.
      
      The above assumption is overly optimistic though.  E.g.  vmalloc can
      scale to really large requests and the heuristic would allow that.  We
      used to have an early break in the vmalloc allocator for killed tasks,
      but this was reverted by commit b8c8a338 ("Revert "vmalloc: back off
      when the current task is killed"").  There are likely other similar code
      paths which do not check for fatal signals in an allocation & charge
      loop.  Also, some kernel objects charged to a memcg are not bound to a
      process lifetime.
      
      It has been observed that it is not really hard to trigger these
      bypasses and cause a global OOM situation.
      
      One potential way to address these runaways would be to limit the amount
      of excess (similar to the global OOM with its limited OOM reserves).
      This is certainly possible, but it is not really clear how much excess
      is desirable while still protecting from global OOMs, as that would have
      to consider the overall memcg configuration.
      
      This patch addresses the problem by removing the heuristic altogether.
      Bypass is only allowed for requests which either cannot fail or where
      the failure is not desirable while the excess should still be limited
      (e.g.  atomic requests).  Implementation-wise, a killed or dying task
      fails to charge once it has passed the OOM killer stage.  That should
      give all forms of reclaim a chance to restore the limit before the
      failure (ENOMEM) tells the caller to back off.

      In addition, this patch renames the should_force_charge() helper to
      task_is_dying(), because its use is no longer associated with forced
      charging.
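
      A minimal user-space model of the charge decision described above (not
      the kernel's actual try_charge path); the struct fields, the
      task_is_dying() stand-in and try_charge_model() are all illustrative:

        #include <stdbool.h>
        #include <stdio.h>

        struct task_model {
                bool oom_victim;      /* already selected by the OOM killer */
                bool fatal_signal;    /* killed */
                bool exiting;         /* in the exit path */
        };

        /* roughly the condition the renamed helper is described as covering */
        static bool task_is_dying(const struct task_model *t)
        {
                return t->oom_victim || t->fatal_signal || t->exiting;
        }

        enum charge_result { CHARGED, BACK_OFF_ENOMEM, TRY_MEMCG_OOM };

        static enum charge_result try_charge_model(const struct task_model *t,
                                                   bool fits_under_limit,
                                                   bool reclaim_made_room)
        {
                if (fits_under_limit || reclaim_made_room)
                        return CHARGED;
                /* old behaviour: a dying task bypassed the limit here;
                 * new behaviour: once the OOM killer stage is reached it
                 * simply fails, so the caller sees ENOMEM and backs off */
                if (task_is_dying(t))
                        return BACK_OFF_ENOMEM;
                return TRY_MEMCG_OOM;
        }

        int main(void)
        {
                struct task_model killed = { .fatal_signal = true };

                printf("killed task over the limit -> %s\n",
                       try_charge_model(&killed, false, false) == BACK_OFF_ENOMEM
                       ? "-ENOMEM" : "charged");
                return 0;
        }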
      
      This patch depends on pagefault_out_of_memory() to not trigger
      out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
      and cause a global OOM killer.
      
      Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4ebf1b6
    • Michal Hocko's avatar
      mm, oom: do not trigger out_of_memory from the #PF · 60e2793d
      Michal Hocko authored
      
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM,
      which in turn results in pagefault_out_of_memory().  This can happen for
      two different reasons: (a) memcg is out of memory and we rely on
      mem_cgroup_oom_synchronize() to perform the memcg OOM handling, or (b) a
      normal allocation fails.
      
      The latter is quite problematic because the allocation paths already
      trigger out_of_memory() and the page allocator tries really hard not to
      fail allocations.  Anyway, if the OOM killer has already been invoked
      there is no reason to invoke it again from the #PF path, especially when
      the OOM condition might be gone by that time and we have no way to find
      out other than by allocating.
      
      Moreover, if the allocation failed and the OOM killer hasn't been
      invoked, then we are unlikely to do the right thing from the #PF context,
      because we have already lost the allocation context and restrictions and
      therefore might OOM-kill a task from a different NUMA domain.
      
      This all suggests that there is no legitimate reason to trigger
      out_of_memory() from pagefault_out_of_memory(), so drop it.  Just to be
      sure that no #PF path returns with VM_FAULT_OOM without an allocation,
      print a warning that this is happening before we restart the #PF.
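
      A user-space model of the resulting control flow (the function below is
      an illustrative stand-in, not the kernel's pagefault_out_of_memory()):

        #include <stdbool.h>
        #include <stdio.h>

        static void pagefault_oom_model(bool memcg_oom_pending)
        {
                /* memcg OOM is still synchronized and handled here */
                if (memcg_oom_pending) {
                        puts("memcg OOM handled, restart the #PF");
                        return;
                }
                /* no call into the global OOM killer any more; only warn that
                 * a VM_FAULT_OOM leaked out of the allocation paths, then let
                 * the #PF be restarted */
                puts("warning: VM_FAULT_OOM leaked to the #PF handler, retrying");
        }

        int main(void)
        {
                pagefault_oom_model(false);
                return 0;
        }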
      
      [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem
      controller.  This is a local problem related to memcg; however, it
      causes unnecessary global OOM kills that are repeated over and over
      again and escalate into a real disaster.  This has been broken since
      kmem accounting was introduced for cgroup v1 (3.8).  There was no
      kmem-specific reclaim for the separate limit, so the only way to handle
      the kmem hard limit was to return ENOMEM.  Upstream, the problem will be
      fixed by removing the outdated kmem limit; however, stable and LTS
      kernels cannot do that and are still affected.  This patch fixes the
      problem and should be backported into stable/LTS.]
      
      Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60e2793d
    • Vasily Averin's avatar
      mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks · 0b28179a
      Vasily Averin authored
      
      
      Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
      
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It can be misused to trigger a global OOM from inside a
      memcg-limited container.  On the other hand, if memcg fails an
      allocation called from inside the #PF handler, it triggers a global OOM
      from inside pagefault_out_of_memory().

      To prevent these problems this patchset:
       (a) removes the execution of out_of_memory() from
           pagefault_out_of_memory(), because nobody can explain why it is
           necessary;
       (b) allows memcg to fail allocations of dying/killed tasks.
      
      This patch (of 3):
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM,
      which in turn results in pagefault_out_of_memory(), which in turn
      executes out_of_memory() and can kill a random task.
      
      An allocation might fail when the current task is the OOM victim and
      there are no memory reserves left.  The OOM killer is already handled at
      the page allocator level for the global OOM and at the charging level
      for the memcg one.  Both have much more information about the scope of
      the allocation/charge request.  This means that either the OOM killer
      has been invoked properly and didn't lead to allocation success, or it
      has been skipped because it couldn't have been invoked.  In both cases,
      triggering it from here is pointless and even harmful.
      
      It makes much more sense to let the killed task die rather than to wake
      up an eternally hungry oom-killer and send him to choose a fatter victim
      for breakfast.
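
      As a sketch of the early-bail idea under the same modelling assumptions
      as above (an illustrative stand-in, not the kernel's code):

        #include <stdbool.h>
        #include <stdio.h>

        static void pagefault_oom_model(bool fatal_signal_pending)
        {
                /* the current task has already been killed: let it die and
                 * release its memory instead of waking the OOM killer to pick
                 * yet another victim */
                if (fatal_signal_pending) {
                        puts("dying task: just return and let it exit");
                        return;
                }
                puts("fall through to the remaining #PF OOM handling");
        }

        int main(void)
        {
                pagefault_oom_model(true);
                return 0;
        }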
      
      Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b28179a