  Mar 23, 2022
    • mm/damon/dbgfs/init_regions: use target index instead of target id · 144760f8
      SeongJae Park authored

      Patch series "Remove the type-unclear target id concept".
      
      DAMON asks each monitoring target ('struct damon_target') to have one
      'unsigned long' integer called 'id', which should be unique among the
      targets of the same monitoring context.  The meaning of the id, however,
      is totally up to the monitoring primitives registered to the monitoring
      context.  For example, the virtual address space monitoring primitives
      treat the id as a 'struct pid' pointer.
      
      This makes the code flexible but ugly, not well-documented, and
      type-unsafe[1].  Also, identification of each target can be done via its
      index.  For that reason, this patchset removes the concept and uses a
      clear type definition.
      
      [1] https://lore.kernel.org/linux-mm/20211013154535.4aaeaaf9d0182922e405dd1e@linux-foundation.org/
      
      This patch (of 4):
      
      The target id is an 'unsigned long' value that can be interpreted
      differently by each set of monitoring primitives.  For example, it means
      'struct pid *' for virtual address space monitoring, while it means
      nothing but an integer to be displayed to debugfs interface users for
      physical address space monitoring.  It's flexible but makes the code
      ugly and type-unsafe[1].
      
      To be prepared for eventual removal of the concept, this commit removes a
      use case of the concept in 'init_regions' debugfs file handling.  In
      detail, this commit replaces use of the id with the index of each target
      in the context's targets list.
      
      [1] https://lore.kernel.org/linux-mm/20211013154535.4aaeaaf9d0182922e405dd1e@linux-foundation.org/
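
      For illustration, a minimal sketch of resolving a user-supplied index
      into a target, assuming the existing damon_for_each_target() iterator;
      the helper name target_at() is hypothetical:

        /* Hypothetical helper: look up the idx-th target of a context. */
        static struct damon_target *target_at(struct damon_ctx *ctx, int idx)
        {
            struct damon_target *t;
            int i = 0;

            damon_for_each_target(t, ctx) {
                if (i++ == idx)
                    return t;   /* the requested index */
            }
            return NULL;        /* index out of range */
        }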
      
      Link: https://lkml.kernel.org/r/20211230100723.2238-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211230100723.2238-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm.c: remove unneeded local variable ret · d0977efa
      Miaohe Lin authored

      The local variable ret is always 0.  Remove it to tighten the code.
      
      Link: https://lkml.kernel.org/r/20220125124833.39718-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kfence: allow use of a deferrable timer · 737b6a10
      Marco Elver authored
      Allow the use of a deferrable timer, which does not force CPU wake-ups
      when the system is idle.  A consequence is that the sample interval
      becomes very unpredictable, to the point that it is not guaranteed that
      the KFENCE KUnit test still passes.
      
      Nevertheless, on power-constrained systems this may be preferable, so
      let's give the user the option should they accept the above trade-off.
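
      As a sketch of the mechanism (not the exact patch): a timer armed with
      TIMER_DEFERRABLE fires only when the CPU is awake for another reason;
      the function and variable names below are illustrative:

        #include <linux/timer.h>

        static struct timer_list sample_timer;  /* illustrative name */

        static void sample_timer_fn(struct timer_list *t)
        {
            /* ... open the KFENCE allocation gate here ... */
            mod_timer(&sample_timer, jiffies + msecs_to_jiffies(100));
        }

        static void sample_timer_init(void)
        {
            /* TIMER_DEFERRABLE: don't wake an idle CPU just for this. */
            timer_setup(&sample_timer, sample_timer_fn, TIMER_DEFERRABLE);
            mod_timer(&sample_timer, jiffies + msecs_to_jiffies(100));
        }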
      
      Link: https://lkml.kernel.org/r/20220308141415.3168078-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kfence: test: try to avoid test_gfpzero trigger rcu_stall · 3cb1c962
      Peng Liu authored

      When CONFIG_KFENCE_NUM_OBJECTS is set to a big number, the kfence
      kunit test case test_gfpzero eats up nearly all of the CPU's resources,
      and an rcu_stall is reported, as in the following log cut from a
      physical server.
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 	68-....: (14422 ticks this GP) idle=6ce/1/0x4000000000000002
        softirq=592/592 fqs=7500 (t=15004 jiffies g=10677 q=20019)
        Task dump for CPU 68:
        task:kunit_try_catch state:R  running task
        stack:    0 pid: 9728 ppid:     2 flags:0x0000020a
        Call trace:
         dump_backtrace+0x0/0x1e4
         show_stack+0x20/0x2c
         sched_show_task+0x148/0x170
         ...
         rcu_sched_clock_irq+0x70/0x180
         update_process_times+0x68/0xb0
         tick_sched_handle+0x38/0x74
         ...
         gic_handle_irq+0x78/0x2c0
         el1_irq+0xb8/0x140
         kfree+0xd8/0x53c
         test_alloc+0x264/0x310 [kfence_test]
         test_gfpzero+0xf4/0x840 [kfence_test]
         kunit_try_run_case+0x48/0x20c
         kunit_generic_run_threadfn_adapter+0x28/0x34
         kthread+0x108/0x13c
         ret_from_fork+0x10/0x18
      
      To avoid rcu_stall and unacceptable latency, a schedule point is
      added to test_gfpzero.
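
      As a sketch of the fix's shape (test_alloc() and ALLOCATE_ANY are the
      test's own helpers; details abbreviated):

        static void test_gfpzero_like_loop(struct kunit *test)  /* sketch */
        {
            char *buf;
            int i;

            for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
                buf = test_alloc(test, 32, GFP_KERNEL | __GFP_ZERO,
                                 ALLOCATE_ANY);
                kfree(buf);
                cond_resched();     /* the added schedule point */
            }
        }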
      
      Link: https://lkml.kernel.org/r/20220309083753.1561921-4-liupeng256@huawei.com
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Tested-by: Brendan Higgins <brendanhiggins@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Wang Kefeng <wangkefeng.wang@huawei.com>
      Cc: Daniel Latypov <dlatypov@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kunit: make kunit_test_timeout compatible with comment · bdd015f7
      Peng Liu authored
      In the function kunit_test_timeout, "300 * MSEC_PER_SEC" is declared to
      represent 5 minutes.  However, the value is wrong on arm64, whose
      default HZ = 250, and in some other situations.  Use msecs_to_jiffies
      to fix this, so that kunit_test_timeout works as desired.
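
      Roughly, the fix (the returned value feeds a jiffies-based wait, so it
      must be converted):

        /* before: return 300 * MSEC_PER_SEC;  -- milliseconds used as jiffies */
        static unsigned long kunit_test_timeout(void)
        {
            return msecs_to_jiffies(300 * MSEC_PER_SEC);  /* 5 min at any HZ */
        }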
      
      Link: https://lkml.kernel.org/r/20220309083753.1561921-3-liupeng256@huawei.com
      Fixes: 5f3e0620 ("kunit: test: add support for test abort")
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Daniel Latypov <dlatypov@google.com>
      Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
      Tested-by: Brendan Higgins <brendanhiggins@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Wang Kefeng <wangkefeng.wang@huawei.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kunit: fix UAF when run kfence test case test_gfpzero · adf50545
      Peng Liu authored

      Patch series "kunit: fix a UAF bug and do some optimization", v2.
      
      This series is to fix UAF (use after free) when running kfence test case
      test_gfpzero, which is time costly.  This UAF bug can be easily triggered
      by setting CONFIG_KFENCE_NUM_OBJECTS = 65535.  Furthermore, some
      optimization for kunit tests has been done.
      
      This patch (of 3):
      
      Kunit creates a new thread to run the actual test case, and the main
      process waits for the completion of the actual test thread until it
      times out.  The variable "struct kunit test" is local to the function
      kunit_try_catch_run and is used in the test case thread.
      kunit_try_catch_run frees "struct kunit test" when kunit times out,
      but the actual test case is still running, so a UAF bug is triggered.

      The above problem has been observed both on a physical machine and on
      a qemu platform when running the kfence kunit tests.  The problem can
      be triggered by setting CONFIG_KFENCE_NUM_OBJECTS = 65535: under this
      setting, the test case test_gfpzero costs hours and kunit times out.
      The panic log follows.
      
        BUG: unable to handle page fault for address: ffffffff82d882e9
      
        Call Trace:
         kunit_log_append+0x58/0xd0
         ...
         test_alloc.constprop.0.cold+0x6b/0x8a [kfence_test]
         test_gfpzero.cold+0x61/0x8ab [kfence_test]
         kunit_try_run_case+0x4c/0x70
         kunit_generic_run_threadfn_adapter+0x11/0x20
         kthread+0x166/0x190
         ret_from_fork+0x22/0x30
        Kernel panic - not syncing: Fatal exception
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        Ubuntu-1.8.2-1ubuntu1 04/01/2014
      
      To solve this problem, the test case thread should be stopped when the
      kunit framework times out.  The stop signal is sent in the function
      kunit_try_catch_run, and test_gfpzero handles it.
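
      A sketch of that shape (context: kunit_try_catch_run(); the
      kthread_stop()/kthread_should_stop() pairing follows this description):

        time_remaining = wait_for_completion_timeout(&try_completion,
                                                     kunit_test_timeout());
        if (time_remaining == 0) {
            try_catch->try_result = -ETIMEDOUT;
            kthread_stop(task_struct);  /* stop the test-case thread */
        }
        /* long-running test loops must poll kthread_should_stop() and bail */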
      
      Link: https://lkml.kernel.org/r/20220309083753.1561921-1-liupeng256@huawei.com
      Link: https://lkml.kernel.org/r/20220309083753.1561921-2-liupeng256@huawei.com
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
      Tested-by: Brendan Higgins <brendanhiggins@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Wang Kefeng <wangkefeng.wang@huawei.com>
      Cc: Daniel Latypov <dlatypov@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kfence: alloc kfence_pool after system startup · b33f778b
      Tianchen Ding authored

      Allow enabling KFENCE after system startup by allocating its pool via the
      page allocator. This provides the flexibility to enable KFENCE even if it
      wasn't enabled at boot time.
      
      Link: https://lkml.kernel.org/r/20220307074516.6920-3-dtcccc@linux.alibaba.com
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Tested-by: Peng Liu <liupeng256@huawei.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kfence: allow re-enabling KFENCE after system startup · 698361bc
      Tianchen Ding authored

      Patch series "provide the flexibility to enable KFENCE", v3.
      
      If CONFIG_CONTIG_ALLOC is not supported, we fall back to
      alloc_pages_exact().  Allocating pages this way is limited by MAX_ORDER
      (default 11), so we will not support allocating the kfence pool after
      system startup with a large KFENCE_NUM_OBJECTS.

      When handling failures in kfence_init_pool_late(), we pair
      free_pages_exact() with alloc_pages_exact() for compatibility, though
      it actually does the same as free_contig_range().
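
      A condensed sketch of the late pool allocation described above (error
      handling omitted; the helper name is illustrative):

        static char *kfence_alloc_pool_late(void)   /* illustrative name */
        {
        #ifdef CONFIG_CONTIG_ALLOC
            struct page *pages;

            pages = alloc_contig_pages(KFENCE_POOL_SIZE / PAGE_SIZE,
                                       GFP_KERNEL, first_online_node, NULL);
            return pages ? (char *)page_address(pages) : NULL;
        #else
            /* Bounded by MAX_ORDER, hence the KFENCE_NUM_OBJECTS limit. */
            return alloc_pages_exact(KFENCE_POOL_SIZE, GFP_KERNEL);
        #endif
        }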
      
      This patch (of 2):
      
      Once KFENCE is disabled by:

        echo 0 > /sys/module/kfence/parameters/sample_interval

      it can never be re-enabled until the next reboot.

      Allow re-enabling it by writing a positive number to sample_interval.
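
      A sketch of the added branch in the sample_interval param handler;
      kfence_enable_late() stands in for whatever re-arms KFENCE and is an
      assumed name here:

        static int param_set_sample_interval(const char *val,
                                             const struct kernel_param *kp)
        {
            unsigned long num;
            int ret = kstrtoul(val, 0, &num);

            if (ret < 0)
                return ret;
            *((unsigned long *)kp->arg) = num;
            if (num && !READ_ONCE(kfence_enabled) &&
                system_state == SYSTEM_RUNNING)
                return kfence_enable_late();    /* assumed helper */
            return 0;
        }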
      
      Link: https://lkml.kernel.org/r/20220307074516.6920-1-dtcccc@linux.alibaba.com
      Link: https://lkml.kernel.org/r/20220307074516.6920-2-dtcccc@linux.alibaba.com
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/kfence: remove unnecessary CONFIG_KFENCE option · 56eb8e94
      tangmeng authored

      mm/Makefile has:

        obj-$(CONFIG_KFENCE) += kfence/

      so we don't need 'obj-$(CONFIG_KFENCE) :=' in mm/kfence/Makefile;
      delete it.
      
      Link: https://lkml.kernel.org/r/20220221065525.21344-1-tangmeng@uniontech.com
      Signed-off-by: tangmeng <tangmeng@uniontech.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_table_check.c: use strtobool for param parsing · 597da28e
      Dr. David Alan Gilbert authored

      Use strtobool rather than open coding "on" and "off" parsing.
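
      For example, a boot-option parser built on strtobool(); treat the flag
      name and exact structure as a sketch:

        static bool __page_table_check_enabled __initdata;  /* sketch */

        static int __init setup_page_table_check(char *str)
        {
            bool enabled;

            if (strtobool(str, &enabled))
                return -EINVAL;     /* not "on"/"off"/"1"/"0"/"y"/"n" */
            __page_table_check_enabled = enabled;
            return 0;
        }
        early_param("page_table_check", setup_page_table_check);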
      
      Link: https://lkml.kernel.org/r/20220227181038.126926-1-linux@treblig.org
      Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/highmem: remove unnecessary done label · 7a3f2263
      Miaohe Lin authored

      Remove unnecessary done label to simplify the code.
      
      Link: https://lkml.kernel.org/r/20220126092542.64659-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • highmem: document kunmap_local() · d7ca25c5
      Ira Weiny authored

      Some users of kmap() add an offset to the kmap() address to be used
      during the mapping.
      
      When converting to kmap_local_page() the base address does not need to
      be stored, because any address within the page can be used in
      kunmap_local().  However, this was not clear from the documentation and
      caused some questions.[1]
      
      Document that any address in the page can be used in kunmap_local() to
      clarify this for future users.
      
      [1] https://lore.kernel.org/lkml/20211213154543.GM3538886@iweiny-DESK2.sc.intel.com/
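
      An example of the now-documented usage: the pointer passed to
      kunmap_local() may carry an offset into the page:

        char *addr = kmap_local_page(page);

        memcpy(buffer, addr + offset, len); /* use an offset address */
        kunmap_local(addr + offset);        /* any address in the page works */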
      
      [ira.weiny@intel.com: updates per Christoph]
        Link: https://lkml.kernel.org/r/20220124182138.816693-1-ira.weiny@intel.com
      
      Link: https://lkml.kernel.org/r/20220124013045.806718-1-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/early_ioremap: declare early_memremap_pgprot_adjust() · be4893d9
      Vlastimil Babka authored

      The mm/ directory can almost fully be built with W=1, which would help
      in local development.  One remaining issue is a missing prototype for
      early_memremap_pgprot_adjust().
      
      Thus add a declaration for this function.  Use mm/internal.h instead of
      asm/early_ioremap.h to avoid missing type definitions and unnecessary
      exposure.
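
      Roughly, the added declaration (the definition lives in
      mm/early_ioremap.c):

        /* mm/internal.h */
        pgprot_t early_memremap_pgprot_adjust(resource_size_t phys_addr,
                                              unsigned long size, pgprot_t prot);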
      
      Link: https://lkml.kernel.org/r/20220314165724.16071-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/usercopy: return 1 from hardened_usercopy __setup() handler · 05fe3c10
      Randy Dunlap authored
      __setup() handlers should return 1 if the command line option is handled
      and 0 if not (or maybe never return 0; it just pollutes init's
      environment).  This prevents:
      
        Unknown kernel command line parameters \
        "BOOT_IMAGE=/boot/bzImage-517rc5 hardened_usercopy=off", will be \
        passed to user space.
      
        Run /sbin/init as init process
         with arguments:
           /sbin/init
         with environment:
           HOME=/
           TERM=linux
           BOOT_IMAGE=/boot/bzImage-517rc5
           hardened_usercopy=off
      or
           hardened_usercopy=on
      but when "hardened_usercopy=foo" is used, there is no "Unknown kernel
      command line parameters" warning.
      
      Return 1 to indicate that the boot option has been handled.
      Print a warning if strtobool() returns an error on the option string,
      but do not mark this as an unknown command line option and do not cause
      init's environment to be polluted with this string.
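
      The resulting handler then looks roughly like this (flag name
      illustrative):

        static bool disable_checks __initdata;  /* illustrative flag */

        static int __init parse_hardened_usercopy(char *str)
        {
            if (strtobool(str, &disable_checks))
                pr_warn("Invalid option string for hardened_usercopy: '%s'\n",
                        str);
            return 1;   /* always handled: keep it out of init's env */
        }
        __setup("hardened_usercopy=", parse_hardened_usercopy);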
      
      Link: https://lkml.kernel.org/r/20220222034249.14795-1-rdunlap@infradead.org
      Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
      Fixes: b5cb15d9 ("usercopy: Allow boot cmdline disabling of hardening")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
      Acked-by: Chris von Recklinghausen <crecklin@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: uninline copy_overflow() · ad7489d5
      Christophe Leroy authored

      While building a small config with CONFIG_CC_OPTIMISE_FOR_SIZE, I ended
      up with more than 50 copies of the following function in vmlinux,
      because GCC doesn't honor the 'inline' keyword:
      
      	c00243bc <copy_overflow>:
      	c00243bc:	94 21 ff f0 	stwu    r1,-16(r1)
      	c00243c0:	7c 85 23 78 	mr      r5,r4
      	c00243c4:	7c 64 1b 78 	mr      r4,r3
      	c00243c8:	3c 60 c0 62 	lis     r3,-16286
      	c00243cc:	7c 08 02 a6 	mflr    r0
      	c00243d0:	38 63 5e e5 	addi    r3,r3,24293
      	c00243d4:	90 01 00 14 	stw     r0,20(r1)
      	c00243d8:	4b ff 82 45 	bl      c001c61c <__warn_printk>
      	c00243dc:	0f e0 00 00 	twui    r0,0
      	c00243e0:	80 01 00 14 	lwz     r0,20(r1)
      	c00243e4:	38 21 00 10 	addi    r1,r1,16
      	c00243e8:	7c 08 03 a6 	mtlr    r0
      	c00243ec:	4e 80 00 20 	blr
      
      With -Winline, GCC reports:
      
      	/include/linux/thread_info.h:212:20: warning: inlining failed in call to 'copy_overflow': call is unlikely and code size would grow [-Winline]
      
      copy_overflow() is an unconditional warning called by check_copy_size()
      on an error path.

      check_copy_size() has to remain inlined in order to benefit from
      constant folding, but copy_overflow() is not worth inlining.
      
      Uninline the warning when CONFIG_BUG is selected.
      
      When CONFIG_BUG is not selected, WARN() does nothing so skip it.
      
      This reduces the size of vmlinux by almost 4 kbytes.
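
      As a sketch, the resulting shape: a declaration in the header, with the
      WARN() body moved out of line when CONFIG_BUG is selected:

        /* include/linux/thread_info.h */
        #ifdef CONFIG_BUG
        void copy_overflow(int size, unsigned long count);
        #else
        static inline void copy_overflow(int size, unsigned long count) { }
        #endif

        /* out-of-line definition (a single copy in vmlinux) */
        void copy_overflow(int size, unsigned long count)
        {
            WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
        }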
      
      Link: https://lkml.kernel.org/r/e1723b9cfa924bcefcd41f69d0025b38e4c9364e.1644819985.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove usercopy_warn() · 6eada26f
      Christophe Leroy authored
      Users of usercopy_warn() were removed by commit 53944f17 ("mm: remove
      HARDENED_USERCOPY_FALLBACK").

      Remove it.
      
      Link: https://lkml.kernel.org/r/5f26643fc70b05f8455b60b99c30c17d635fa640.1644231910.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Stephen Kitt <steve@sk2.org>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: allow handling just same-value filled pages · cb325ddd
      Maciej S. Szmigiero authored

      Zswap has an ability to efficiently store same-value filled pages, which
      can be turned on and off using the "same_filled_pages_enabled"
      parameter.
      
      However, there is currently no way to enable just this (lightweight)
      functionality, while not making use of the whole compressed page storage
      machinery.
      
      Add a "non_same_filled_pages_enabled" parameter which allows disabling
      handling of pages that aren't same-value filled.  This way zswap can be
      run in such lightweight same-value filled pages only mode.
      
      Link: https://lkml.kernel.org/r/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: ClearPageDoubleMap in first page_add_file_rmap() · bd55b0c2
      Hugh Dickins authored
      PageDoubleMap is maintained differently for anon and for shmem+file: the
      shmem+file one was never cleared, because a safe place to do so could
      not be found; so it would blight future use of the cached hugepage until
      evicted.
      
      See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linux.alibaba.com/
      
      But page_add_file_rmap() does provide a safe place to do so (though later
      than one might wish): allowing testing to return to an initial state
      without a damaging drop_caches.
      
      Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
      Fixes: 9a73f61b ("thp, mlock: do not mlock PTE-mapped file huge pages")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: only re-generate demotion targets when a numa node changes its N_CPU state · 734c1570
      Oscar Salvador authored
      Abhishek reported that after patch [1], hotplug operations are taking
      roughly double the expected time.  [2]
      
      The reason is that the CPU callbacks that migrate_on_reclaim_init()
      sets up always call set_migration_target_nodes() whenever a CPU is
      brought up/down.

      But we only care about numa nodes going from having cpus to becoming
      cpuless, and vice versa, as that influences the demotion_target order.

      We already have two CPU callbacks (vmstat_cpu_online() and
      vmstat_cpu_dead()) that check exactly that, so get rid of the CPU
      callbacks in migrate_on_reclaim_init() and only call
      set_migration_target_nodes() from vmstat_cpu_{dead,online}() whenever a
      numa node changes its N_CPU state.
      
      [1] https://lore.kernel.org/linux-mm/20210721063926.3024591-2-ying.huang@intel.com/
      [2] https://lore.kernel.org/linux-mm/eb438ddd-2919-73d4-bd9f-b7eecdd9577a@linux.vnet.ibm.com/
      
      [osalvador@suse.de: add feedback from Huang Ying]
        Link: https://lkml.kernel.org/r/20220314150945.12694-1-osalvador@suse.de
      
      Link: https://lkml.kernel.org/r/20220310120749.23077-1-osalvador@suse.de
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reported-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory: clarify adding and removing of memory blocks · 2aa065f7
      David Hildenbrand authored

      Let's make it clearer at which places we actually add and remove memory
      blocks -- streamlining the terminology -- and highlight which memory
      blocks start out online and which start out offline.
      
       * rename add_memory_block -> add_boot_memory_block
       * rename init_memory_block -> add_memory_block
       * rename unregister_memory -> remove_memory_block
       * rename register_memory -> __add_memory_block
       * add add_hotplug_memory_block
       * mark add_boot_memory_block with __init (suggested by Oscar)
      
      __add_memory_block() is a pure helper for add_memory_block(), so remove
      the somewhat obvious comment.
      
      Link: https://lkml.kernel.org/r/20220221154531.11382-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory: determine and store zone for single-zone memory blocks · 395f6081
      David Hildenbrand authored

      test_pages_in_a_zone() is just another nasty PFN walker that can easily
      stumble over ZONE_DEVICE memory ranges falling into the same memory block
      as ordinary system RAM: the memmap of parts of these ranges might possibly
      be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:
      
        UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
        index 7 is out of range for type 'zone [5]'
        CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
        Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
        Call Trace:
         dump_stack+0x9a/0xf0
         ubsan_epilogue+0x9/0x7a
         __ubsan_handle_out_of_bounds+0x13a/0x181
         test_pages_in_a_zone+0x3c4/0x500
         show_valid_zones+0x1fa/0x380
         dev_attr_show+0x43/0xb0
         sysfs_kf_seq_show+0x1c5/0x440
         seq_read+0x49d/0x1190
         vfs_read+0xff/0x300
         ksys_read+0xb8/0x170
         do_syscall_64+0xa5/0x4b0
         entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        RIP: 0033:0x7f01f4439b52
      
      We seem to stumble over a memmap that contains a garbage zone id.  While
      we could try inserting pfn_to_online_page() calls, it will just make
      memory offlining slower, because we use test_pages_in_a_zone() to make
      sure we're offlining pages that all belong to the same zone.
      
      Let's just get rid of this PFN walker and determine the single zone of a
      memory block -- if any -- for early memory blocks during boot.  For memory
      onlining, we know the single zone already.  Let's avoid any additional
      memmap scanning and just rely on the zone information available during
      boot.
      
      For memory hot(un)plug, we only really care about memory blocks that:
      * span a single zone (and, thereby, a single node)
      * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
      If one of these conditions is not met, we reject memory offlining.
      Hotplugged memory blocks (starting out offline) always meet both
      conditions.
      
      There are three scenarios to handle:
      
      (1) Memory hot(un)plug
      
      A memory block with zone == NULL cannot be offlined, corresponding to
      our previous test_pages_in_a_zone() check.
      
      After successful memory onlining/offlining, we simply set the zone
      accordingly.
      * Memory onlining: set the zone we just used for onlining
      * Memory offlining: set zone = NULL
      
      So a hotplugged memory block starts with zone = NULL. Once memory
      onlining is done, we set the proper zone.
      
      (2) Boot memory with !CONFIG_NUMA
      
      We know that there is just a single pgdat, so we simply scan all zones
      of that pgdat for an intersection with our memory block PFN range when
      adding the memory block. If more than one zone intersects (e.g., DMA and
      DMA32 on x86 for the first memory block) we set zone = NULL and
      consequently mimic what test_pages_in_a_zone() used to do.
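
      A sketch of that scan (helper name hypothetical):

        static struct zone *single_zone_for_range(unsigned long start_pfn,
                                                  unsigned long nr_pages)
        {
            struct pglist_data *pgdat = NODE_DATA(0);   /* !CONFIG_NUMA */
            struct zone *zone, *matching = NULL;
            int i;

            for (i = 0; i < MAX_NR_ZONES; i++) {
                zone = pgdat->node_zones + i;
                if (!zone_intersects(zone, start_pfn, nr_pages))
                    continue;
                if (matching)
                    return NULL;    /* spans multiple zones */
                matching = zone;
            }
            return matching;
        }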
      
      (3) Boot memory with CONFIG_NUMA
      
      At the point in time we create the memory block devices during boot, we
      don't know yet which nodes *actually* span a memory block. While we could
      scan all zones of all nodes for intersections, overlapping nodes complicate
      the situation and scanning all nodes is possibly expensive. But that
      problem has already been solved by the code that sets the node of a memory
      block and creates the link in the sysfs --
      do_register_memory_block_under_node().
      
      So, we hook into the code that sets the node id for a memory block. If
      we already have a different node id set for the memory block, we know
      that multiple nodes *actually* have PFNs falling into our memory block:
      we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
      to do. If there is no node id set, we do the same as (2) for the given
      node.
      
      Note that the call order in driver_init() is:
      -> memory_dev_init(): create memory block devices
      -> node_dev_init(): link memory block devices to the node and set the
      		    node id
      
      So in summary, we detect if there is a single zone responsible for this
      memory block and we consequently store the zone in that case in the
      memory block, updating it during memory onlining/offlining.
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Rafael Parra <rparrazo@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/node: rename link_mem_sections() to register_memory_block_under_node() · cc651559
      David Hildenbrand authored

      Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.
      
      I remember talking to Michal in the past about removing
      test_pages_in_a_zone(), which we use for:
      * verifying that a memory block we intend to offline is really only managed
        by a single zone. We don't support offlining of memory blocks that are
        managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
      * exposing that zone to user space via
        /sys/devices/system/memory/memory*/valid_zones
      
      Now that I have identified some more cases where test_pages_in_a_zone()
      might go wrong, and we received a UBSAN report (see patch #3), let's
      get rid of this PFN walker.
      
      So instead of detecting the zone at runtime with test_pages_in_a_zone() by
      scanning the memmap, let's determine and remember for each memory block if
      it's managed by a single zone.  The stored zone can then be used for the
      above two cases, avoiding a manual lookup using test_pages_in_a_zone().
      
      This avoids eventually stumbling over uninitialized memmaps in corner
      cases, especially when ZONE_DEVICE ranges partly fall into memory
      blocks (which are responsible for managing System RAM).
      
      Handling memory onlining is easy, because we online to exactly one zone.
      Handling boot memory is more tricky, because we want to avoid scanning all
      zones of all nodes to detect possible zones that overlap with the physical
      memory region of interest.  Fortunately, we already have code that
      determines the applicable nodes for a memory block, to create sysfs links
      -- we'll hook into that.
      
      Patch #1 is a simple cleanup I had lying around for a longer time.
      Patch #2 contains the main logic to remove test_pages_in_a_zone() and
      further details.
      
      [1] https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
      
      This patch (of 2):
      
      Let's adjust the stale terminology, making it match
      unregister_memory_block_under_nodes() and
      do_register_memory_block_under_node().  We're dealing with memory block
      devices, which span 1..X memory sections.
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220210184359.235565-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: fix misplaced comment in offline_pages · 36ba30bc
      Miaohe Lin authored
      It's misplaced since commit 79605093 ("mm, memory_hotplug: print reason
      for the offlining failure").  Move it to the right place.
      
      Link: https://lkml.kernel.org/r/20220207133643.23427-5-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: clean up try_offline_node · b27340a5
      Miaohe Lin authored

      We can use the helper macro node_spanned_pages to check whether a node
      spans pages.  And we can change the parameter of check_cpu_on_node to
      nid, as that's what it really cares about.  Thus we can further get rid
      of the local variable pgdat and improve readability a bit.
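
      For instance, the spanned-pages check in try_offline_node() becomes,
      sketched:

        if (node_spanned_pages(nid))
            return;     /* the node still spans memory */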
      
      Link: https://lkml.kernel.org/r/20220207133643.23427-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: avoid calling zone_intersects() for ZONE_NORMAL · d6aad201
      Miaohe Lin authored

      If zid reaches ZONE_NORMAL, the caller will always get the NORMAL zone
      no matter what zone_intersects() returns.  So we can save some cpu
      cycles by avoiding the zone_intersects() call for ZONE_NORMAL.
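
      Sketch of the changed loop bound in default_kernel_zone_for_pfn():

        for (zid = 0; zid < ZONE_NORMAL; zid++) {   /* was: zid <= ZONE_NORMAL */
            struct zone *zone = &pgdat->node_zones[zid];

            if (zone_intersects(zone, start_pfn, nr_pages))
                return zone;
        }
        return &pgdat->node_zones[ZONE_NORMAL];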
      
      Link: https://lkml.kernel.org/r/20220207133643.23427-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove obsolete comment of __add_pages · 2b6bf15f
      Miaohe Lin authored
      Patch series "A few cleanup patches around memory_hotplug".
      
      This series contains a few patches to fix obsolete and misplaced comments,
      clean up the try_offline_node function and so on.
      
      This patch (of 4):
      
      Since commit f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded
      memory to zones until online"), there is no need to pass in the zone.
      
      [akpm@linux-foundation.org: remove the comment altogether, per David]
      
      Link: https://lkml.kernel.org/r/20220207133643.23427-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220207133643.23427-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/node: consolidate node device subsystem initialization in node_dev_init() · 2848a28b
      David Hildenbrand authored

      ...  and call node_dev_init() after memory_dev_init() from driver_init(),
      so before any of the existing arch/subsys calls.  All online nodes should
      be known at that point: early during boot, arch code determines node and
      zone ranges and sets the relevant nodes online; usually this happens in
      setup_arch().
      
      This is in line with memory_dev_init(), which initializes the memory
      device subsystem and creates all memory block devices.
      
      Similar to memory_dev_init(), panic() if anything goes wrong; we don't
      want to continue with such basic initialization errors.
      
      The important part is that node_dev_init() gets called after
      memory_dev_init() and after cpu_dev_init(), but before any of the relevant
      archs call register_cpu() to register the new cpu device under the node
      device.  The latter should be the case for the current users of
      topology_init().
      
      Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Anatoly Pugachev <matorola@gmail.com> (sparc64)
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory: add memory block to memory group after registration succeeded · 7ea0d2d7
      David Hildenbrand authored
      If register_memory() fails, we free the memory block but have already
      added it to the group list, which is not good.  Let's defer adding the
      block to the memory group until after registering the memory block
      device.
      
      We do handle it properly during unregister_memory(), but that's not
      called when the registration fails.
      
      Link: https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
      Fixes: 028fc57a ("drivers/base/memory: introduce "memory groups" to logically group memory blocks")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: do not tweak node in alloc_mem_cgroup_per_node_info · 8c9bb398
      Wei Yang authored

      alloc_mem_cgroup_per_node_info is allocated for each possible node, and
      this used to be a problem because !node_online nodes didn't have the
      appropriate data structure allocated.  This has changed with "mm:
      handle uninitialized numa nodes gracefully", so we can drop the special
      casing here.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-7-mhocko@kernel.org
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Rafael Aquini <raquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make free_area_init_node aware of memory less nodes · 7c30daac
      Michal Hocko authored

      free_area_init_node is also called from the memoryless node
      initialization path (free_area_init_memoryless_node).  It doesn't
      really make much sense to display the physical memory range for those
      nodes:

        Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]

      Instead, be explicit that the node is memoryless:

        Initmem setup node XX as memoryless
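
      Sketch of the message selection in free_area_init_node() (condition
      approximated):

        if (start_pfn != end_pfn)   /* node actually spans memory */
            pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
                    (u64)start_pfn << PAGE_SHIFT,
                    end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
        else
            pr_info("Initmem setup node %d as memoryless\n", nid);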
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: reorganize new pgdat initialization · 70b5b46a
      Michal Hocko authored

      When a !node_online node is brought up it needs hotplug-specific
      initialization, because the node could either be uninitialized yet or
      have been recycled after a previous hotremove.  hotadd_init_pgdat is
      responsible for that.
      
      Internal pgdat state is initialized at two places currently
      	- hotadd_init_pgdat
      	- free_area_init_core_hotplug
      
      There is no real clear cut of what should go where, but this patch
      chooses to move the whole internal state initialization into
      free_area_init_core_hotplug.  hotadd_init_pgdat is still responsible
      for pulling all the parts together - most notably for initializing
      zonelists, because those depend on the overall topology.
      
      This patch doesn't introduce any functional change.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop arch_free_nodedata · 390511e1
      Michal Hocko authored

      Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug
      used to allocate pgdat when memory has been added to a node
      (hotadd_init_pgdat) arch_free_nodedata has been only used in the failure
      path because once the pgdat is exported (to be visible by NODA_DATA(nid))
      it cannot really be freed because there is no synchronization available
      for that.
      
      pgdat is allocated for each possible nodes now so the memory hotplug
      doesn't need to do the ever use arch_free_nodedata so drop it.
      
      This patch doesn't introduce any functional change.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: handle uninitialized numa nodes gracefully · 09f49dca
      Michal Hocko authored

      We have had several reports [1][2][3] that page allocator blows up when an
      allocation from a possible node is requested.  The underlying reason is
      that NODE_DATA for the specific node is not allocated.
      
      NUMA specific initialization is arch specific and it can vary a lot.  E.g.
      x86 tries to initialize all nodes that have some cpu affinity (see
      init_cpu_to_node) but this can be insufficient because the node might be
      cpuless for example.
      
      One way to address this problem would be to check for !node_online nodes
      when trying to get a zonelist and silently fall back to another node.
      That is unfortunately adding a branch into allocator hot path and it
      doesn't handle any other potential NODE_DATA users.
      
      This patch takes a different approach (following a lead of [3]) and
      preallocates the pgdat for all possible nodes in arch-independent code
      - free_area_init.  All uninitialized nodes are treated as memoryless
      nodes.  The node_state of the node is not changed, because that would
      lead to other side effects - e.g. the sysfs representation of such a
      node - and from past discussions [4] it is known that some tools might
      have problems digesting that.
      
      Newly allocated pgdat only gets a minimal initialization and the rest of
      the work is expected to be done by the memory hotplug - hotadd_new_pgdat
      (renamed to hotadd_init_pgdat).
      
      generic_alloc_nodedata is changed to use the memblock allocator because
      neither page nor slab allocators are available at the stage when all
      pgdats are allocated.  Hotplug doesn't allocate pgdat anymore so we can
      use the early boot allocator.  The only arch specific implementation is
      ia64 and that is changed to use the early allocator as well.
      
      [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
      [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
      [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
      [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
      
      [akpm@linux-foundation.org: replace comment, per Mike]
      
      Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
      Reported-by: Alexey Makhalov <amakhalov@vmware.com>
      Tested-by: Alexey Makhalov <amakhalov@vmware.com>
      Reported-by: Nico Pache <npache@redhat.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Tested-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG · e930d999
      Michal Hocko authored

      Patch series "mm, memory_hotplug: handle unitialized numa node gracefully".
      
      The core of the fix is patch 2 which also links existing bug reports.  The
      high level goal is to have all possible numa nodes have their pgdat
      allocated and initialized so
      
      	for_each_possible_node(nid)
      		NODE_DATA(nid)
      
      will never return garbage.  This has proven to be a problem in several
      places when an offline numa node is used for an allocation, just to
      realize that node_data and therefore the allocation fallback zonelists
      are not initialized and such an allocation request blows up.
      
      There were attempts to address that by checking node_online in several
      places including the page allocator.  This patchset approaches the problem
      from a different perspective and instead of special casing, which just
      adds a runtime overhead, it allocates pglist_data for each possible node.
      This can add some memory overhead for platforms with a high number of
      possible nodes if they do not contain any memory.  This should be a
      rather rare configuration though.
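
      The core idea, sketched (mirroring this description; see the core patch
      for the real code):

        /* in free_area_init(): make NODE_DATA(nid) valid for every node */
        for_each_node(nid) {
            pg_data_t *pgdat;

            if (node_online(nid))
                continue;
            pgdat = arch_alloc_nodedata(nid);   /* memblock-backed now */
            if (!pgdat)
                panic("Cannot allocate %zuB for node %d.\n",
                      sizeof(*pgdat), nid);
            arch_refresh_nodedata(nid, pgdat);
            free_area_init_memoryless_node(nid);
            /* node_states are deliberately left untouched */
        }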
      
      How to test this?  David has provided an excellent howto:
      http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com
      
      Patches 1 and 3-6 are mostly cleanups.  The patchset has been reviewed by
      Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to
      both).  David has tested as per instructions above and hasn't found any
      fallouts in the memory hotplug scenarios.
      
      This patch (of 6):
      
      This is a preparatory patch and it doesn't introduce any functional
      change.  It merely pulls out arch_alloc_nodedata (and co) outside of
      CONFIG_MEMORY_HOTPLUG because the following patch will need to call this
      from the generic MM code.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org
      Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: madvise: skip unmapped vma holes passed to process_madvise · 08095d63
      Charan Teja Kalla authored
      The process_madvise() system call is expected to skip holes in the vma
      range passed through the 'struct iovec' vector list.  But do_madvise,
      which process_madvise() calls for each vma, returns ENOMEM in case of
      unmapped holes, even though the VMA is processed.

      Thus process_madvise() should treat ENOMEM as expected, consider the
      VMA passed in as processed, and continue processing the other vma's in
      the vector list.  If -ENOMEM is returned to the user despite the VMA
      having been processed, the user is unable to figure out where to start
      the next madvise.
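
      A sketch of the resulting iovec loop in process_madvise() (simplified):

        while (iov_iter_count(&iter)) {
            struct iovec iovec = iov_iter_iovec(&iter);

            ret = do_madvise(mm, (unsigned long)iovec.iov_base,
                             iovec.iov_len, behavior);
            /* an unmapped hole yields -ENOMEM: treat the VMA as done */
            if (ret < 0 && ret != -ENOMEM)
                break;
            iov_iter_advance(&iter, iovec.iov_len);
        }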
      
      Link: https://lkml.kernel.org/r/4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com
      Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: madvise: return correct bytes advised with process_madvise · 5bd009c7
      Charan Teja Kalla authored
      Patch series "mm: madvise: return correct bytes processed with
      process_madvise", v2.  With the process_madvise(), always choose to return
      non zero processed bytes over an error.  This can help the user to know on
      which VMA, passed in the 'struct iovec' vector list, is failed to advise
      thus can take the decission of retrying/skipping on that VMA.
      
      This patch (of 2):
      
      The process_madvise() system call returns an error even after
      processing some of the VMA's passed in the 'struct iovec' vector list,
      which leaves the user confused about where to restart the advise next.
      It is also against this syscall's man page[1] documentation, which
      mentions that the "return value may be less than the total number of
      requested bytes, if an error occurred after some iovec elements were
      already processed.".

      Consider a user passed 10 VMA's in the 'struct iovec' vector list, of
      which 9 are processed but one fails.  Then it just returns the error
      caused by that failed VMA, despite the first 9 VMA's having been
      processed, leaving the user confused about which VMA failed.
      Returning the number of bytes processed here can help the user know
      which VMA failed and thus retry/skip the advise on that VMA.
      
      [1] https://man7.org/linux/man-pages/man2/process_madvise.2.html
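
      The return-value selection then becomes, roughly:

        /* prefer the number of bytes actually advised over a late error */
        ret = (total_len - iov_iter_count(&iter)) ? : ret;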
      
      Link: https://lkml.kernel.org/r/cover.1647008754.git.quic_charante@quicinc.com
      Link: https://lkml.kernel.org/r/125b61a0edcee5c2db8658aed9d06a43a19ccafc.1647008754.git.quic_charante@quicinc.com
      Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/madvise: use vma_lookup() instead of find_vma() · 531037a0
      Miaohe Lin authored

      Using vma_lookup() verifies that the start address is contained in the
      found vma.  This makes the code easier to read.
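
      The change, in miniature:

        /* before: find_vma() could return a VMA *after* start */
        vma = find_vma(mm, start);
        if (!vma || start < vma->vm_start)
            return -ENOMEM;

        /* after: vma_lookup() returns NULL unless start is inside the VMA */
        vma = vma_lookup(mm, start);
        if (!vma)
            return -ENOMEM;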
      
      Link: https://lkml.kernel.org/r/20220311082731.63513-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: check the subpage, not the head page · da358d5c
      Matthew Wilcox (Oracle) authored

      Hardware poison is tracked on a per-page basis, not on the head page.
      
      Link: https://lkml.kernel.org/r/20220130013042.1906881-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ksm: use helper macro __ATTR_RW · 1bad2e5c
      Miaohe Lin authored

      Use the helper macro __ATTR_RW to define KSM_ATTR, making the code
      clearer.  Minor readability improvement.
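
      The macro change, approximately:

        /* before */
        #define KSM_ATTR(_name) \
            static struct kobj_attribute _name##_attr = \
                __ATTR(_name, 0644, _name##_show, _name##_store)

        /* after: __ATTR_RW() implies 0644 and the _show/_store pair */
        #define KSM_ATTR(_name) \
            static struct kobj_attribute _name##_attr = __ATTR_RW(_name)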
      
      Link: https://lkml.kernel.org/r/20220221115809.26381-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmstat: add event for ksm swapping in copy · 4d45c3af
      Yang Yang authored

      When what used to be a KSM page faults in from swap, and that page had
      been swapped in before, the system has to make a copy, and leaves
      remerging the pages to a later pass of ksmd.

      That is not good for performance; we'd better reduce this kind of copy.
      There are some ways to reduce it, for example lessening swappiness or
      madvise(, , MADV_MERGEABLE) on the range.  So add this event to support
      doing such tuning, just like the patch "mm, THP, swap: add THP swapping
      out fallback counting".
      
      Link: https://lkml.kernel.org/r/20220113023839.758845-1-yang.yang29@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Saravanan D <saravanand@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>