  1. Apr 19, 2023
    • mm: kfence: improve the performance of __kfence_alloc() and __kfence_free() · 1ba3cbf3
      Peng Zhang authored
      
      
      In __kfence_alloc() and __kfence_free(), we set and check the canary.
      If the object size is close to 0, nearly 4k memory accesses are
      required because the canary is set and checked byte by byte.
      
      The canary is currently defined like this:
      KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))
      
      Observe that the canary depends only on the lower three bits of the
      address, so every 8 bytes of canary are identical.  We can therefore
      access the canary 8 bytes at a time instead of byte by byte, reducing
      the nearly 4k memory accesses to 4k/8.
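      
      As a rough illustration of the idea (a sketch only, not the actual
      mm/kfence code; the helper names below are made up), the per-byte
      pattern can be expanded once into a 64-bit word and then compared
      8 bytes at a time:
      
      #include <linux/types.h>
      
      /* Per-byte pattern, as defined above. */
      #define KFENCE_CANARY_PATTERN_U8(addr) \
              ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))
      
      /*
       * Because the pattern depends only on addr & 0x7, any 8-byte aligned
       * run of canary bytes forms the same 64-bit word, which can be built
       * once and then compared word by word.
       */
      static u64 kfence_canary_word(void)
      {
              u64 word = 0;
              int i;
      
              for (i = 0; i < 8; i++)
                      word |= (u64)KFENCE_CANARY_PATTERN_U8(i) << (i * 8);
              return word;
      }
      
      static bool kfence_check_canary_words(const u64 *start, size_t nwords)
      {
              const u64 expect = kfence_canary_word();
              size_t i;
      
              for (i = 0; i < nwords; i++)
                      if (start[i] != expect)
                              return false;   /* canary corrupted */
              return true;
      }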
      
      The bcc tool funclatency was used to measure the latency of
      __kfence_alloc() and __kfence_free(); the numbers (with the latency
      distribution omitted) are posted below.  Although different object
      sizes affect the measurement, we ignore that for now and assume the
      average object size is roughly equal across the runs.
      
      Before patching:
      __kfence_alloc:
      avg = 5055 nsecs, total: 5515252 nsecs, count: 1091
      __kfence_free:
      avg = 5319 nsecs, total: 9735130 nsecs, count: 1830
      
      After patching:
      __kfence_alloc:
      avg = 3597 nsecs, total: 6428491 nsecs, count: 1787
      __kfence_free:
      avg = 3046 nsecs, total: 3415390 nsecs, count: 1121
      
      The numbers indicate a ~30% - ~40% performance improvement.
      
      Link: https://lkml.kernel.org/r/20230403122738.6006-1-zhangpeng.00@bytedance.com
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1ba3cbf3
    • mm/zswap: delay the initialization of zswap · 141fdeec
      Liu Shixin authored
      
      
      Since some users may not use zswap, the memory allocated for zswap_pool
      is wasted.  Save this memory by delaying the initialization of zswap
      until zswap is enabled.
      
      [liushixin2@huawei.com: fix some pattern problem suggested by Christoph]
        Link: https://lkml.kernel.org/r/20230411093632.822290-4-liushixin2@huawei.com
      Link: https://lkml.kernel.org/r/20230403121318.1876082-4-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      141fdeec
    • mm/zswap: replace zswap_init_{started/failed} with zswap_init_state · 9021ccec
      Liu Shixin authored
      
      
      The zswap_init_started variable name is a bit confusing.  There are
      actually three states: uninitialized, initialization failed, and
      initialization succeeded.  Add a new variable, zswap_init_state, to
      replace zswap_init_{started/failed}.
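      
      A minimal sketch of what the three-state variable could look like
      (illustrative; see the patch itself for the exact definition):
      
      /* Three states instead of the two zswap_init_{started,failed} flags. */
      enum zswap_init_type {
              ZSWAP_UNINIT,           /* zswap has not been initialized yet */
              ZSWAP_INIT_SUCCEED,     /* initialization completed successfully */
              ZSWAP_INIT_FAILED,      /* initialization was attempted and failed */
      };
      
      static enum zswap_init_type zswap_init_state = ZSWAP_UNINIT;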
      
      Link: https://lkml.kernel.org/r/20230403121318.1876082-3-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9021ccec
    • mm/zswap: remove zswap_entry_cache_{create,destroy} helper function · b7919122
      Liu Shixin authored
      
      
      Patch series "Delay the initialization of zswap", v9.
      
      During the initialization of zswap, about 18MB of memory is allocated
      for zswap_pool.  Since some users may not use zswap, this memory is
      wasted.  Save it by delaying the initialization of zswap until zswap is
      enabled.
      
      
      This patch (of 3):
      
      Remove zswap_entry_cache_create and zswap_entry_cache_destroy and use
      the kmem_cache_* functions directly.
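      
      A hedged sketch of the resulting call site (the setup function and its
      name are illustrative, not the exact mm/zswap.c hunk):
      
      #include <linux/slab.h>
      
      static struct kmem_cache *zswap_entry_cache;
      
      static int __init zswap_entry_cache_setup(void)  /* illustrative name */
      {
              /* Previously wrapped in zswap_entry_cache_create(). */
              zswap_entry_cache = KMEM_CACHE(zswap_entry, 0);
              return zswap_entry_cache ? 0 : -ENOMEM;
      }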
      
      Link: https://lkml.kernel.org/r/20230411093632.822290-1-liushixin2@huawei.com
      Link: https://lkml.kernel.org/r/20230403121318.1876082-1-liushixin2@huawei.com
      Link: https://lkml.kernel.org/r/20230403121318.1876082-2-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b7919122
    • mm: vmalloc: rename addr_to_vb_xarray() function · fa1c77c1
      Uladzislau Rezki (Sony) authored
      
      
      Shorten the name of the addr_to_vb_xarray() function to addr_to_vb_xa().
      This aligns with other internal function abbreviations.
      
      Link: https://lkml.kernel.org/r/20230331073727.6968-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Suggested-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa1c77c1
    • kmemleak-test: fix kmemleak_test.c build logic · 27d9a0fd
      Hao Ge authored
      kmemleak-test.c was moved to the samples directory in 1abbef4f
      ("mm,kmemleak-test.c: move kmemleak-test.c to samples dir").
      
      If CONFIG_DEBUG_KMEMLEAK_TEST=m and CONFIG_SAMPLES is unset,
      kmemleak-test.c will be unnecessarily compiled.
      
      So move the entry for CONFIG_DEBUG_KMEMLEAK_TEST from mm/Kconfig and add a
      new CONFIG_SAMPLE_KMEMLEAK in samples/ to control whether kmemleak-test.c
      is built or not.
      
      Link: https://lkml.kernel.org/r/20230330060904.292975-1-gehao@kylinos.cn
      Fixes: 1abbef4f ("mm,kmemleak-test.c: move kmemleak-test.c to samples dir")
      Signed-off-by: Hao Ge <gehao@kylinos.cn>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Finn Behrens <me@kloenk.dev>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Tony Krowiak <akrowiak@linux.ibm.com>
      Cc: Ye Xingchen <ye.xingchen@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      27d9a0fd
    • lib/test_vmalloc.c: add vm_map_ram()/vm_unmap_ram() test case · 869cb29a
      Uladzislau Rezki (Sony) authored
      
      
      Add a vm_map_ram()/vm_unmap_ram() test case to our stress test suite.
      
      [akpm@linux-foundation.org: fix whitespace, per Lorenzo]
      Link: https://lkml.kernel.org/r/20230330190639.431589-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      869cb29a
    • mm: vmalloc: remove a global vmap_blocks xarray · 062eacf5
      Uladzislau Rezki (Sony) authored
      
      
      The global vmap_blocks xarray can be contended under heavy use of the
      vm_map_ram()/vm_unmap_ram() APIs.  lock_stat shows that the
      "vmap_blocks.xa_lock" lock is second in the top list of contentions:
      
      <snip>
      ----------------------------------------
      class name con-bounces contentions ...
      ----------------------------------------
      vmap_area_lock:         2554079 2554276 ...
        --------------
        vmap_area_lock        1297948  [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
        vmap_area_lock        1256330  [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
        vmap_area_lock              1  [<00000000c95c05a7>] find_vm_area+0x16/0x70
        --------------
        vmap_area_lock        1738590  [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
        vmap_area_lock         815688  [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
        vmap_area_lock              1  [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170
      
      vmap_blocks.xa_lock:    862689  862698 ...
        -------------------
        vmap_blocks.xa_lock   378418    [<00000000625a5626>] vm_map_ram+0x359/0x4a0
        vmap_blocks.xa_lock   484280    [<00000000caa2ef03>] xa_erase+0xe/0x30
        -------------------
        vmap_blocks.xa_lock   576226    [<00000000caa2ef03>] xa_erase+0xe/0x30
        vmap_blocks.xa_lock   286472    [<00000000625a5626>] vm_map_ram+0x359/0x4a0
      ...
      <snip>
      
      This is the result of running vm_map_ram()/vm_unmap_ram() in a loop.
      The test creates 64 threads (on a 64-CPU system) and each one
      maps/unmaps 1 page.
      
      After this change the "xa_lock" can be considered noise under the same
      test conditions:
      
      <snip>
      ...
      &xa->xa_lock#1:         10333 10394 ...
        --------------
        &xa->xa_lock#1        5349      [<00000000bbbc9751>] xa_erase+0xe/0x30
        &xa->xa_lock#1        5045      [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
        --------------
        &xa->xa_lock#1        7326      [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
        &xa->xa_lock#1        3068      [<00000000bbbc9751>] xa_erase+0xe/0x30
      ...
      <snip>
      
      Running test_vmalloc.sh with run_test_mask=1024 nr_threads=64 nr_pages=5
      shows around an ~8 percent throughput improvement for the vm_map_ram()
      and vm_unmap_ram() APIs.
      
      This patch does not fix the vmap_area_lock/free_vmap_area_lock and
      purge_vmap_area_lock bottlenecks; that is a separate rework.
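      
      A rough sketch of the replacement idea (sizing and hashing here are
      illustrative only; the addr_to_vb_xa() helper mentioned in the rename
      commit above is the lookup point): split the single global xarray into
      several, selected by address, so that concurrent users contend on
      different xa_locks.
      
      #include <linux/xarray.h>
      
      #define NR_VB_XA 64     /* illustrative: e.g. one xarray per CPU */
      
      static struct xarray vb_xa[NR_VB_XA];
      
      /* Pick one of several xarrays based on the vmap block address. */
      static struct xarray *addr_to_vb_xa(unsigned long addr)
      {
              return &vb_xa[(addr >> PAGE_SHIFT) % NR_VB_XA];
      }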
      
      Link: https://lkml.kernel.org/r/20230330190639.431589-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      062eacf5
    • mm: move free_area_empty() to mm/internal.h · 62f31bd4
      Mike Rapoport (IBM) authored
      
      
      The free_area_empty() helper is only used inside mm/ so move it there to
      reduce noise in include/linux/mmzone.h.
      
      Link: https://lkml.kernel.org/r/20230326160215.2674531-1-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      62f31bd4
    • kmsan: fix a stale comment in kmsan_save_stack_with_flags() · e961cc56
      Zhen Lei authored
      After commit 446ec838 ("mm/page_alloc: use might_alloc()") and commit
      84172f4b ("mm/page_alloc: combine __alloc_pages and
      __alloc_pages_nodemask"), the comment is no longer accurate.  Flag
      '__GFP_DIRECT_RECLAIM' is clear enough on its own, so remove the comment
      rather than update it.
      
      Link: https://lkml.kernel.org/r/20230327034149.942-1-thunder.leizhen@huawei.com
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e961cc56
    • hugetlb: remove PageHeadHuge() · 957ebbdf
      Matthew Wilcox (Oracle) authored
      
      
      Sidhartha Kumar removed the last caller of PageHeadHuge(), so we can now
      remove it and make folio_test_hugetlb() the real implementation.  Add
      kernel-doc for folio_test_hugetlb().
      
      Link: https://lkml.kernel.org/r/20230327151050.1787744-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      957ebbdf
    • xtensa: reword ARCH_FORCE_MAX_ORDER prompt and help text · 4519a254
      Mike Rapoport (IBM) authored
      
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-15-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4519a254
    • sparc: reword ARCH_FORCE_MAX_ORDER prompt and help text · 8def4c05
      Mike Rapoport (IBM) authored
      
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-14-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8def4c05
    • sh: drop ranges for definition of ARCH_FORCE_MAX_ORDER · 04954082
      Mike Rapoport (IBM) authored
      
      
      sh defines insane ranges for ARCH_FORCE_MAX_ORDER, allowing MAX_ORDER up
      to 63, which implies a maximal contiguous allocation size of 2^63 pages.
      
      Drop bogus definitions of ranges for ARCH_FORCE_MAX_ORDER and leave it a
      simple integer with sensible defaults.
      
      Users that *really* need to change the value of ARCH_FORCE_MAX_ORDER will
      be able to do so, but they won't be misled by the bogus ranges.
      
      [rppt@kernel.org: untweak ARCH_FORCE_MAX_ORDER's `range']
        Link: https://lkml.kernel.org/r/20230325060828.2662773-13-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20230324052233.2654090-13-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      04954082
    • sh: reword ARCH_FORCE_MAX_ORDER prompt and help text · b2a37fb2
      Mike Rapoport (IBM) authored
      
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      [rppt@kernel.org: tweak ARCH_FORCE_MAX_ORDER's `range']
        Link: https://lkml.kernel.org/r/20230325060828.2662773-12-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20230324052233.2654090-12-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b2a37fb2
    • powerpc: drop ranges for definition of ARCH_FORCE_MAX_ORDER · 1e8fed87
      Mike Rapoport (IBM) authored
      
      
      PowerPC defines ranges for ARCH_FORCE_MAX_ORDER, some of which insanely
      allow MAX_ORDER up to 63, which implies a maximal contiguous allocation
      size of 2^63 pages.
      
      Drop bogus definitions of ranges for ARCH_FORCE_MAX_ORDER and leave it a
      simple integer with sensible defaults.
      
      Users that *really* need to change the value of ARCH_FORCE_MAX_ORDER will
      be able to do so, but they won't be misled by the bogus ranges.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-11-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1e8fed87
    • powerpc: reword ARCH_FORCE_MAX_ORDER prompt and help text · 6fc54303
      Mike Rapoport (IBM) authored
      
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-10-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6fc54303
    • nios2: drop ranges for definition of ARCH_FORCE_MAX_ORDER · 482f7b76
      Mike Rapoport (IBM) authored
      
      
      nios2 defines a range for ARCH_FORCE_MAX_ORDER allowing MAX_ORDER up to
      19, which implies a maximal contiguous allocation size of 2^19 pages,
      or 2GiB.
      
      Drop bogus definition of ranges for ARCH_FORCE_MAX_ORDER and leave it a
      simple integer with sensible default.
      
      Users that *really* need to change the value of ARCH_FORCE_MAX_ORDER will
      be able to do so, but they won't be misled by the bogus ranges.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-9-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      482f7b76
    • nios2: reword ARCH_FORCE_MAX_ORDER prompt and help text · 5646e83d
      Mike Rapoport (IBM) authored
      
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-8-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5646e83d
    • m68k: reword ARCH_FORCE_MAX_ORDER prompt and help text · 7a5b272e
      Mike Rapoport (IBM) authored
      
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7a5b272e
    • ia64: don't allow users to override ARCH_FORCE_MAX_ORDER · 9d0f7a57
      Mike Rapoport (IBM) authored
      
      
      It is enough to keep the default values for base and huge pages without
      letting users override ARCH_FORCE_MAX_ORDER.
      
      Drop the prompt to make the option invisible in *config.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9d0f7a57
    • csky: drop ARCH_FORCE_MAX_ORDER · 4e7c8655
      Mike Rapoport (IBM) authored
      
      
      The default value of ARCH_FORCE_MAX_ORDER matches the generic default
      defined in the MM code, and the architecture does not support huge
      pages, so there is no need to keep the ARCH_FORCE_MAX_ORDER option
      available.
      
      Drop it.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4e7c8655
    • arm64: reword ARCH_FORCE_MAX_ORDER prompt and help text · 4632cb22
      Mike Rapoport (IBM) authored
      
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      [rppt@kernel.org: change ARCH_FORCE_MAX_ORDER dependencies]
        Link: https://lkml.kernel.org/r/20230325060828.2662773-4-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20230324052233.2654090-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4632cb22
    • arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER · 34affcd7
      Mike Rapoport (IBM) authored
      
      
      It is not a good idea to change fundamental parameters of core memory
      management.  Having predefined ranges suggests that the values within
      those ranges are sensible, but one has to *really* understand the
      implications of changing MAX_ORDER before actually amending it, and
      ranges don't help here.
      
      Drop the ranges in the definition of ARCH_FORCE_MAX_ORDER and make its
      prompt visible only if EXPERT=y.
      
      Link: https://lkml.kernel.org/r/20230324052233.2654090-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      34affcd7
    • arm: reword ARCH_FORCE_MAX_ORDER prompt and help text · 8c907785
      Mike Rapoport (IBM) authored
      
      
      Patch series "arch,mm: cleanup Kconfig entries for ARCH_FORCE_MAX_ORDER",
      v3.
      
      Several architectures have ARCH_FORCE_MAX_ORDER in their Kconfig and
      they all have wrong and misleading prompt and help text for this option.
      
      Besides, some define insane limits for possible values of
      ARCH_FORCE_MAX_ORDER, some carefully define ranges only for a subset of
      possible configurations, some make this option configurable by users for no
      good reason.
      
      This set updates the prompt and help text everywhere and does its best to
      update actual definitions of ranges where applicable.
      
      kbuild generated a bunch of false positives because it assigns -1 to
      ARCH_FORCE_MAX_ORDER; hopefully this will be fixed soon.
      
      
      This patch (of 14):
      
      The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
      describing this configuration option.
      
      Update both to actually describe what this option does.
      
      Link: https://lkml.kernel.org/r/20230325060828.2662773-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20230324052233.2654090-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20230324052233.2654090-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8c907785
    • memcg: do not drain charge pcp caches on remote isolated cpus · 6a792697
      Michal Hocko authored
      
      
      Leonardo Bras has noticed that pcp charge cache draining might be
      disruptive on workloads relying on 'isolated cpus', a feature commonly
      used on workloads that are sensitive to interruption and context switching
      such as vRAN and Industrial Control Systems.
      
      There are essentially two ways to approach the issue.  We can either
      allow the pcp cache to be drained on a different cpu rather than the
      local one, or avoid remote flushing on isolated cpus.
      
      The current pcp charge cache is really optimized for high performance
      and it relies on always sticking with its cpu.  That means it only
      requires a local_lock (preempt_disable on !RT) and draining is handed
      over to the pcp WQ to drain locally again.
      
      The former solution (remote draining) would require adding additional
      locking to prevent local charges from racing with the draining.  This
      adds an atomic operation to an otherwise simple arithmetic fast path in
      the try_charge path.  Another concern is that the remote draining can
      cause lock contention for the isolated workloads and therefore
      interfere with them indirectly via user space interfaces.
      
      Another option is to avoid scheduling the draining on isolated cpus
      altogether.  That means those remote cpus would keep their charges even
      after drain_all_stock returns.  This is certainly not optimal either,
      but it shouldn't really cause any major problems.  In the worst case
      (many isolated cpus with charges - each of them with MEMCG_CHARGE_BATCH,
      i.e. 64 pages) the memory consumption of a memcg would be artificially
      higher than what can be immediately used from other cpus.
      
      Theoretically, a memcg OOM killer could be triggered prematurely.
      Currently it is not really clear whether this is a practical problem
      though.  A tight memcg limit would be really counterproductive to
      cpu-isolated workloads pretty much by definition, because any memory
      reclaim induced by the memcg limit could break user space timing
      expectations, as those workloads usually expect to execute in userspace
      most of the time.
      
      Also charges could be left behind on memcg removal.  Any future charge on
      those isolated cpus will drain that pcp cache so this won't be a permanent
      leak.
      
      Considering the pros and cons of both approaches, this patch implements
      the second option and simply does not schedule remote draining if the
      target cpu is isolated.  This solution is much simpler.  It doesn't add
      any new locking and it is more predictable from the user space POV.
      Should the premature memcg OOM become a real-life problem, we can
      revisit this decision.
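      
      In sketch form, the chosen approach boils down to the following
      (illustrative, not the exact drain_all_stock() hunk), using the
      cpu_is_isolated() helper added by the companion sched/isolation patch:
      
      #include <linux/sched/isolation.h>
      #include <linux/workqueue.h>
      
      /* Schedule the per-cpu drain worker, but leave isolated CPUs alone. */
      static void schedule_stock_drain(int cpu, struct work_struct *work)
      {
              if (cpu_is_isolated(cpu))
                      return; /* keep cached charges; drained on next local use */
              schedule_work_on(cpu, work);
      }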
      
      [akpm@linux-foundation.org: memcontrol.c needs sched/isolation.h]
        Link: https://lore.kernel.org/oe-kbuild-all/202303180617.7E3aIlHf-lkp@intel.com/
      Link: https://lkml.kernel.org/r/20230317134448.11082-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reported-by: Leonardo Bras <leobras@redhat.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6a792697
    • sched/isolation: add cpu_is_isolated() API · a85c2257
      Frederic Weisbecker authored
      
      
      Patch series "memcg, cpuisol: do not interfere pcp cache charges draining
      with cpuisol workloads".
      
      Leonardo has reported [1] that pcp memcg charge draining can interfere
      with cpu isolated workloads.  The draining is done from a WQ context
      with a pcp worker scheduled on each CPU which holds any cached charges
      for a specific memcg hierarchy.  This is not really a common operation
      [2].  It can be triggered from userspace though, so some care is
      definitely due.
      
      Leonardo has tried to address the issue by allowing remote charge
      draining [3].  This approach requires additional locking to synchronize
      pcp cache draining from a remote cpu with local pcp consumers.  Even
      though the proposed lock was per-cpu, there is still potential for
      contention and less predictable behavior.
      
      This patchset addresses the issue from a different angle.  Rather than
      dealing with a potential synchronization, cpus which are isolated are
      simply never scheduled to be drained.  This means that a small amount of
      charges could be lying around, waiting for a later use, or they are
      flushed when a different memcg is charged from the same cpu.  More
      details are in patch 2.  The first patch from Frederic implements an
      abstraction to tell whether a specific cpu has been isolated and
      therefore requires special treatment.
      
      
      This patch (of 2):
      
      Provide a new API to check whether a CPU has been isolated, either
      through the isolcpus= or the nohz_full= kernel parameter.
      
      It aims at avoiding kernel load that can safely be spared on CPUs
      running sensitive workloads which can't bear any disturbance, such as
      pcp cache draining.
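      
      A sketch of what such a check can look like on top of the existing
      housekeeping flags (assumed composition, for illustration only):
      
      #include <linux/sched/isolation.h>
      
      static inline bool cpu_is_isolated(int cpu)
      {
              /* Isolated via either isolcpus= (DOMAIN) or nohz_full= (TICK). */
              return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
                     !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
      }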
      
      Link: https://lkml.kernel.org/r/20230317134448.11082-1-mhocko@kernel.org
      Link: https://lkml.kernel.org/r/20230317134448.11082-2-mhocko@kernel.org
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Leonardo Bras <leobras@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a85c2257
    • mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file() · 2ce0bdfe
      Ivan Orlov authored
      
      
      Syzkaller reported the following issue:
      
      kernel BUG at mm/khugepaged.c:1823!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 5097 Comm: syz-executor220 Not tainted 6.2.0-syzkaller-13154-g857f1268a591 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/16/2023
      RIP: 0010:collapse_file mm/khugepaged.c:1823 [inline]
      RIP: 0010:hpage_collapse_scan_file+0x67c8/0x7580 mm/khugepaged.c:2233
      Code: 00 00 89 de e8 c9 66 a3 ff 31 ff 89 de e8 c0 66 a3 ff 45 84 f6 0f 85 28 0d 00 00 e8 22 64 a3 ff e9 dc f7 ff ff e8 18 64 a3 ff <0f> 0b f3 0f 1e fa e8 0d 64 a3 ff e9 93 f6 ff ff f3 0f 1e fa 4c 89
      RSP: 0018:ffffc90003dff4e0 EFLAGS: 00010093
      RAX: ffffffff81e95988 RBX: 00000000000001c1 RCX: ffff8880205b3a80
      RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00000000000001c1
      RBP: ffffc90003dff830 R08: ffffffff81e90e67 R09: fffffbfff1a433c3
      R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000000
      R13: ffffc90003dff6c0 R14: 00000000000001c0 R15: 0000000000000000
      FS:  00007fdbae5ee700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fdbae6901e0 CR3: 000000007b2dd000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       madvise_collapse+0x721/0xf50 mm/khugepaged.c:2693
       madvise_vma_behavior mm/madvise.c:1086 [inline]
       madvise_walk_vmas mm/madvise.c:1260 [inline]
       do_madvise+0x9e5/0x4680 mm/madvise.c:1439
       __do_sys_madvise mm/madvise.c:1452 [inline]
       __se_sys_madvise mm/madvise.c:1450 [inline]
       __x64_sys_madvise+0xa5/0xb0 mm/madvise.c:1450
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The xas_store() call during page cache scanning can potentially put
      'xas' into an error state (with the reproducer provided by syzkaller,
      the error code is -ENOMEM).  However, there are no further checks after
      'xas_store', and the next call to 'xas_next' at the start of the
      scanning cycle doesn't increase xa_index, so the issue occurs.
      
      This patch adds xarray state error checking after the xas_store() and a
      corresponding result error code.
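      
      A minimal sketch of the added check (the helper and the status name are
      illustrative; the actual patch wires the new code into SCAN_STATUS as
      noted below):
      
      static int collapse_file_store(struct xa_state *xas, struct page *hpage)
      {
              xas_store(xas, hpage);
              /* xas_store() may leave the xa_state in an error state (-ENOMEM). */
              if (xas_error(xas))
                      return SCAN_STORE_FAILED;       /* illustrative status */
              return SCAN_SUCCEED;
      }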
      
      Tested via syzbot.
      
      [akpm@linux-foundation.org: update include/trace/events/huge_memory.h's SCAN_STATUS]
      Link: https://lkml.kernel.org/r/20230329145330.23191-1-ivan.orlov0322@gmail.com
      Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc
      Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
      Reported-by: <syzbot+9578faa5475acb35fa50@syzkaller.appspotmail.com>
      Tested-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Himadri Pandya <himadrispandya@gmail.com>
      Cc: Ivan Orlov <ivan.orlov0322@gmail.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2ce0bdfe
    • kasan: remove hwasan-kernel-mem-intrinsic-prefix=1 for clang-14 · 90fd8336
      Arnd Bergmann authored
      Some unknown -mllvm options (i.e.  those starting with the letter "h")
      don't cause an error to be returned by clang, so the cc-option helper adds
      the unknown hwasan-kernel-mem-intrinsic-prefix=1 flag to CFLAGS with
      compilers that are new enough for hwasan but too old for this option.
      
      This causes a rather unreadable build failure:
      
      fixdep: error opening file: scripts/mod/.empty.o.d: No such file or directory
      make[4]: *** [/home/arnd/arm-soc/scripts/Makefile.build:252: scripts/mod/empty.o] Error 2
      fixdep: error opening file: scripts/mod/.devicetable-offsets.s.d: No such file or directory
      make[4]: *** [/home/arnd/arm-soc/scripts/Makefile.build:114: scripts/mod/devicetable-offsets.s] Error 2
      
      Add a version check to only allow this option with clang-15, gcc-13
      or later versions.
      
      Link: https://lkml.kernel.org/r/20230418122350.1646391-1-arnd@kernel.org
      Fixes: 51287dcb ("kasan: emit different calls for instrumentable memintrinsics")
      Link: https://lore.kernel.org/all/CANpmjNMwYosrvqh4ogDO8rgn+SeDHM2b-shD21wTypm_6MMe=g@mail.gmail.com/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Nicolas Schier <nicolas@fjasle.eu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tom Rix <trix@redhat.com>
      Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      90fd8336
    • zsmalloc: reset compaction source zspage pointer after putback_zspage() · f7ddb612
      Sergey Senozhatsky authored
      The current implementation of the compaction loop fails to set the source
      zspage pointer to NULL in all cases, leading to a potential issue where
      __zs_compact() could use a stale zspage pointer.  This pointer could even
      point to a previously freed zspage, causing unexpected behavior in the
      putback_zspage() and migrate_write_unlock() functions after returning from
      the compaction loop.
      
      Address the issue by ensuring that the source zspage pointer is always set
      to NULL when it should be.
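      
      The essence of the fix, in sketch form (not necessarily the exact hunk),
      is to clear the local pointer as soon as the source zspage is put back:
      
      putback_zspage(class, src_zspage);
      migrate_write_unlock(src_zspage);
      src_zspage = NULL;      /* never reuse a stale or freed zspage pointer */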
      
      Link: https://lkml.kernel.org/r/20230417130850.1784777-1-senozhatsky@chromium.org
      Fixes: 5a845e9f ("zsmalloc: rework compaction algorithm")
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Reported-by: Yu Zhao <yuzhao@google.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f7ddb612
    • mm: make arch_has_descending_max_zone_pfns() static · 5f300fd5
      Arnd Bergmann authored
      clang produces a build failure on x86 for some randconfig builds after a
      change that moves around code to mm/mm_init.c:
      
      Cannot find symbol for section 2: .text.
      mm/mm_init.o: failed
      
      I have not been able to figure out why this happens, but the __weak
      annotation on arch_has_descending_max_zone_pfns() is the trigger here.
      
      Removing the weak function in favor of an open-coded Kconfig option
      check avoids the problem, is clearer, and is easier for the compiler to
      optimize.
      
      [arnd@arndb.de: fix logic bug]
        Link: https://lkml.kernel.org/r/20230415081904.969049-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20230414080418.110236-1-arnd@kernel.org
      Fixes: 9420f89d ("mm: move most of core MM initialization to mm/mm_init.c")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: SeongJae Park <sj@kernel.org>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5f300fd5
    • mm: avoid passing 0 to __ffs() · 59f876fb
      Kirill A. Shutemov authored
      Commit 23baf831 ("mm, treewide: redefine MAX_ORDER sanely") results in
      various boot failures (hang) on arm targets.  Debug messages reveal the
      reason.
      
      ########### MAX_ORDER=10 start=0 __ffs(start)=-1 min()=10 min_t=-1
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      If start==0, __ffs(start) returns 0xfffffff or (as int) -1, which min_t()
      interprets as such, while min() apparently uses the returned unsigned long
      value. Obviously a negative order isn't received well by the rest of the
      code.
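      
      A sketch of the kind of guard that avoids the problem (the helper is
      illustrative; the actual fix adjusts the affected call site):
      
      #include <linux/bitops.h>
      #include <linux/minmax.h>
      #include <linux/mmzone.h>
      
      /* __ffs() is undefined for 0, so never let min_t() see its result then. */
      static inline unsigned int max_init_order(unsigned long start)
      {
              if (!start)
                      return MAX_ORDER;
              return min_t(unsigned int, MAX_ORDER, __ffs(start));
      }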
      
      [akpm@linux-foundation.org: fix comment, per Mike]
        Link: https://lkml.kernel.org/r/ZDBa7HWZK69dKKzH@kernel.org
      Link: https://lkml.kernel.org/r/20230406072529.vupqyrzqnhyozeyh@box.shutemov.name
      Fixes: 23baf831 ("mm, treewide: redefine MAX_ORDER sanely")
      Signed-off-by: default avatar"Kirill A. Shutemov" <kirill@shutemov.name>
      Reported-by: default avatarGuenter Roeck <linux@roeck-us.net>
        Link: https://lkml.kernel.org/r/9460377a-38aa-4f39-ad57-fb73725f92db@roeck-us.net
      Reviewed-by: default avatarMike Rapoport (IBM) <rppt@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      59f876fb
    • nilfs2: initialize unused bytes in segment summary blocks · ef832747
      Ryusuke Konishi authored
      Syzbot still reports uninit-value in nilfs_add_checksums_on_logs() for
      KMSAN enabled kernels after applying commit 73970316 ("nilfs2:
      initialize "struct nilfs_binfo_dat"->bi_pad field").
      
      This is because the unused bytes at the end of each block in segment
      summaries are not initialized.  So this fixes the issue by padding the
      unused bytes with null bytes.
      
      Link: https://lkml.kernel.org/r/20230417173513.12598-1-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: <syzbot+048585f3f4227bb2b49b@syzkaller.appspotmail.com>
        Link: https://syzkaller.appspot.com/bug?extid=048585f3f4227bb2b49b
      Cc: Alexander Potapenko <glider@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef832747
    • mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages · 4d73ba5f
      Mel Gorman authored
      A bug was reported by Yuanxi Liu where allocating 1G pages at runtime
      takes an excessive amount of time for large amounts of memory.  Further
      testing of allocating huge pages showed that the cost is linear, i.e.
      if allocating 1G pages in batches of 10, then the time to allocate
      nr_hugepages from 10->20->30->etc increases linearly even though 10
      pages are allocated at each step.  Profiles indicated that much of the
      time is spent checking the validity within already existing huge pages
      and then attempting a migration that fails after isolating the range,
      draining pages and a whole lot of other useless work.
      
      Commit eb14d4ee ("mm,page_alloc: drop unnecessary checks from
      pfn_range_valid_contig") removed two checks, one which ignored huge pages
      for contiguous allocations as huge pages can sometimes migrate.  While
      there may be value in migrating a 2M page to satisfy a 1G allocation, it's
      potentially expensive if the 1G allocation fails and it's pointless to try
      moving a 1G page for a new 1G allocation or scan the tail pages for valid
      PFNs.
      
      Reintroduce the PageHuge check and assume any contiguous region with
      hugetlbfs pages is unsuitable for a new 1G allocation.
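      
      In sketch form, the reintroduced check amounts to something like this
      inside the range validation (placement and helper name illustrative):
      
      static bool pfn_suitable_for_contig_alloc(struct page *page)
      {
              /* A hugetlbfs page anywhere in the candidate range makes the
               * whole region unsuitable for a new 1G allocation. */
              if (PageHuge(page))
                      return false;
              return true;
      }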
      
      The hpagealloc test allocates huge pages in batches and reports the
      average latency per page over time.  This test happens just after boot
      when fragmentation is not an issue.  Units are in milliseconds.
      
      hpagealloc
                                     6.3.0-rc6              6.3.0-rc6              6.3.0-rc6
                                       vanilla   hugeallocrevert-v1r1   hugeallocsimple-v1r2
      Min       Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
      1st-qrtle Latency      356.61 (   0.00%)        5.34 (  98.50%)       19.85 (  94.43%)
      2nd-qrtle Latency      697.26 (   0.00%)        5.47 (  99.22%)       20.44 (  97.07%)
      3rd-qrtle Latency      972.94 (   0.00%)        5.50 (  99.43%)       20.81 (  97.86%)
      Max-1     Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
      Max-5     Latency       82.14 (   0.00%)        5.11 (  93.78%)       19.31 (  76.49%)
      Max-10    Latency      150.54 (   0.00%)        5.20 (  96.55%)       19.43 (  87.09%)
      Max-90    Latency     1164.45 (   0.00%)        5.53 (  99.52%)       20.97 (  98.20%)
      Max-95    Latency     1223.06 (   0.00%)        5.55 (  99.55%)       21.06 (  98.28%)
      Max-99    Latency     1278.67 (   0.00%)        5.57 (  99.56%)       22.56 (  98.24%)
      Max       Latency     1310.90 (   0.00%)        8.06 (  99.39%)       26.62 (  97.97%)
      Amean     Latency      678.36 (   0.00%)        5.44 *  99.20%*       20.44 *  96.99%*
      
                         6.3.0-rc6   6.3.0-rc6   6.3.0-rc6
                           vanilla   revert-v1   hugeallocfix-v2
      Duration User           0.28        0.27        0.30
      Duration System       808.66       17.77       35.99
      Duration Elapsed      830.87       18.08       36.33
      
      The vanilla kernel is poor, taking up to 1.3 seconds to allocate a
      huge page and almost 10 minutes in total to run the test.  Reverting
      the problematic commit reduces the worst case to 8ms and with this
      patch it is 26ms.  This patch fixes the main issue by skipping ranges
      with huge pages but leaves the page_count() check out because a page
      with an elevated count can potentially migrate.
      
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
      Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
      Fixes: eb14d4ee ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Yuanxi Liu <y.liu@naruida.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d73ba5f
    • Liam R. Howlett's avatar
      mm/mmap: regression fix for unmapped_area{_topdown} · 58c5d0d6
      Liam R. Howlett authored
      The maple tree limits the gap returned to a window that specifically
      fits what was asked.  This may not be optimal in the case of switching
      search directions or a gap that does not satisfy the requested space
      for other reasons.  Fix the search by retrying the operation and
      limiting the search window on the rare occasion that a conflict
      occurs.
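
      A hedged sketch of the retry pattern described above, using
      hypothetical helpers rather than the actual unmapped_area() code:

      	/* Sketch: if the gap that comes back cannot be used (a rare
      	 * conflict, e.g. after alignment adjustment), limit the search
      	 * window so the conflicting area is excluded and try again. */
      	for (;;) {
      		if (find_gap(low_limit, high_limit, length, &gap))
      			return -ENOMEM;		/* no gap available at all */
      		if (!gap_conflicts(gap, length))
      			return gap;		/* common case */
      		low_limit = gap + length;	/* retry with a limited window */
      	}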
      
      Link: https://lkml.kernel.org/r/20230414185919.4175572-1-Liam.Howlett@oracle.com
      Fixes: 3499a131 ("mm/mmap: use maple tree for unmapped_area{_topdown}")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      58c5d0d6
    • Liam R. Howlett's avatar
      maple_tree: fix mas_empty_area() search · 06e8fd99
      Liam R. Howlett authored
      The internal function mas_awalk() was incorrectly skipping the last
      entry in a node, which could potentially be NULL.  This is only a
      problem for the left-most node in the tree - otherwise that NULL would
      not exist.

      Fix mas_awalk() by using the metadata to obtain the end of the node
      for the loop and the logical pivot as opposed to the raw pivot value.
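
      The distinction can be illustrated with a hypothetical node layout
      (not the actual maple tree structures):

      	struct demo_node {
      		unsigned long pivot[16];	/* raw pivots */
      		unsigned long max;		/* logical upper bound of the node */
      		unsigned char end;		/* last used slot, from metadata */
      	};

      	/* The last used slot may store a raw pivot of 0; its logical
      	 * value is the node's maximum, so the walk must run up to the
      	 * metadata end and substitute the logical pivot instead of
      	 * treating the entry as empty and skipping it. */
      	static unsigned long logical_pivot(const struct demo_node *n,
      					   unsigned char i)
      	{
      		return (i == n->end && !n->pivot[i]) ? n->max : n->pivot[i];
      	}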
      
      Link: https://lkml.kernel.org/r/20230414145728.4067069-2-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      06e8fd99
    • Liam R. Howlett's avatar
      maple_tree: make maple state reusable after mas_empty_area_rev() · fad8e429
      Liam R. Howlett authored
      Stop using the maple state min/max for the range; instead pass those
      values through pointers.  This allows the maple state to be reused
      without resetting.

      Also add some logic to fail out early when searching with invalid
      arguments.
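
      A hypothetical plain-C sketch of the resulting calling convention (not
      the real maple tree code): the discovered range is reported back
      through the min/max pointers, the caller's own state never needs to be
      reset for a retry, and clearly invalid requests fail up front:

      	struct gap { unsigned long start, end; };	/* free ranges */

      	static int empty_area_rev(const struct gap *gaps, int nr,
      				  unsigned long *min, unsigned long *max,
      				  unsigned long size)
      	{
      		int i;

      		if (size == 0 || *min > *max || size - 1 > *max - *min)
      			return -EINVAL;	/* fail early on invalid arguments */

      		for (i = nr - 1; i >= 0; i--) {	/* highest gap first */
      			unsigned long lo = gaps[i].start > *min ? gaps[i].start : *min;
      			unsigned long hi = gaps[i].end < *max ? gaps[i].end : *max;

      			if (hi >= lo && hi - lo >= size - 1) {
      				*min = hi - size + 1;	/* result via pointers */
      				*max = hi;
      				return 0;
      			}
      		}
      		return -EBUSY;
      	}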
      
      Link: https://lkml.kernel.org/r/20230414145728.4067069-1-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fad8e429
    • Alexander Potapenko's avatar
      mm: kmsan: handle alloc failures in kmsan_ioremap_page_range() · fdea03e1
      Alexander Potapenko authored
      Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range()
      must also properly handle allocation/mapping failures.  When one
      occurs, it must clean up the already created metadata mappings and
      return an error code so that the error can be propagated to
      ioremap_page_range().  Without doing so, KMSAN may silently fail to
      bring the metadata for the page range into a consistent state, which
      will result in user-visible crashes when that metadata is later
      accessed.
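
      A hedged sketch of the cleanup-on-failure pattern described above; the
      helper names are illustrative, not the actual KMSAN functions:

      	/* Sketch: map shadow metadata, then origin metadata; if the
      	 * second step fails, undo the first and report the error so
      	 * ioremap_page_range() can see it. */
      	err = map_metadata_range(shadow_start, size);
      	if (err)
      		return err;
      	err = map_metadata_range(origin_start, size);
      	if (err) {
      		unmap_metadata_range(shadow_start, size);	/* clean up */
      		return err;
      	}
      	return 0;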
      
      Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com
      Fixes: b073d7f8 ("mm: kmsan: maintain KMSAN metadata for page operations")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
        Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fdea03e1
    • Alexander Potapenko's avatar
      mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush() · 47ebd031
      Alexander Potapenko authored
      As reported by Dipanjan Das, when KMSAN is used together with kernel fault
      injection (or, generally, even without the latter), calls to kcalloc() or
      __vmap_pages_range_noflush() may fail, leaving the metadata mappings for
      the virtual mapping in an inconsistent state.  When these metadata
      mappings are accessed later, the kernel crashes.
      
      To address the problem, we return a non-zero error code from
      kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping
      failure inside it, and make vmap_pages_range_noflush() return an error if
      KMSAN fails to allocate the metadata.
      
      This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(),
      as these allocation failures are not fatal anymore.
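
      A simplified sketch of the resulting call pattern (argument lists and
      ordering follow the upstream code only approximately):

      	/* Sketch: a KMSAN metadata allocation failure is now propagated
      	 * instead of being warned about and ignored. */
      	ret = kmsan_vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
      	if (ret)
      		return ret;	/* e.g. -ENOMEM from the metadata allocation */
      	return __vmap_pages_range_noflush(addr, end, prot, pages, page_shift);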
      
      Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com
      Fixes: b073d7f8 ("mm: kmsan: maintain KMSAN metadata for page operations")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
        Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      47ebd031