  1. Oct 26, 2023
• mm: kmemleak: add __find_and_remove_object() · 858a195b
      Liu Shixin authored
      
      
Add a new __find_and_remove_object() helper that does not take
kmemleak_lock itself, in preparation for the next patch.
      
      Link: https://lkml.kernel.org/r/20231018102952.3339837-7-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      858a195b
• mm: kmemleak: use mem_pool_free() to free object · 2e1d4738
      Liu Shixin authored
The kmemleak object is allocated by mem_pool_alloc(), which may take it
either from the slab cache or from mem_pool[], so it is not appropriate
to free it with __kmem_cache_free(); use __mem_pool_free() instead.
      
      Link: https://lkml.kernel.org/r/20231018102952.3339837-6-liushixin2@huawei.com
Fixes: 0647398a ("mm: kmemleak: simple memory allocation pool for kmemleak objects")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2e1d4738
• mm: kmemleak: split __create_object into two functions · 0edd7b58
      Liu Shixin authored
      
      
__create_object() consists of two parts: the first allocates a kmemleak
object and initializes it, and the second inserts it into the object
tree.  The function takes kmemleak_lock, but only the second part
actually needs the lock.

Split it into two functions: __alloc_object(), which only allocates a
kmemleak object, and __link_object(), which initializes the object and
inserts it into the object tree.  Use kmemleak_lock to protect only
__link_object().  A minimal sketch of the resulting structure is shown
below.
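A minimal sketch of the resulting call structure, assuming the helper
names described above (illustrative only; the exact kernel signatures
may differ):

	static void __create_object(unsigned long ptr, size_t size,
				    int min_count, gfp_t gfp)
	{
		struct kmemleak_object *object;
		unsigned long flags;

		/* Allocation needs no locking. */
		object = __alloc_object(gfp);
		if (!object)
			return;

		/* Initialization + object-tree insertion run under the lock. */
		raw_spin_lock_irqsave(&kmemleak_lock, flags);
		__link_object(object, ptr, size, min_count);
		raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
	}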
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Link: https://lkml.kernel.org/r/20231018102952.3339837-5-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0edd7b58
• mm/kmemleak: fix print format of pointer in pr_debug() · 62047e0f
      Liu Shixin authored
      
      
With 0x%p, the pointer is hashed and printed as (____ptrval____) instead.
With 0x%pa, the pointer is printed successfully but with a duplicated
prefix, which looks like:
      
       kmemleak: kmemleak_free(0x(____ptrval____))
       kmemleak: kmemleak_free_percpu(0x(____ptrval____))
       kmemleak: kmemleak_free_part_phys(0x0x0000000a1af86000)
      
Use 0x%px instead of 0x%p or 0x%pa to print the pointer.  The output
then looks like:
      
       kmemleak: kmemleak_free(0xffff9111c145b020)
       kmemleak: kmemleak_free_percpu(0x00000000000333b0)
       kmemleak: kmemleak_free_part_phys(0x0000000a1af80000)
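A minimal sketch of the corrected call (illustrative; each call site
passes its own pointer):

	pr_debug("%s(0x%px)\n", __func__, ptr);

%px emits the raw pointer value, so it is neither hashed like %p nor
given a second "0x" prefix like %pa.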
      
      Link: https://lkml.kernel.org/r/20231018102952.3339837-4-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      62047e0f
• bootmem: use kmemleak_free_part_phys in free_bootmem_page · 80203f1c
      Liu Shixin authored
Since kmemleak_alloc_phys() rather than kmemleak_alloc() was called from
memblock_alloc_range_nid(), kmemleak_free_part_phys() should be used to
delete the kmemleak object in free_bootmem_page().  In debug mode, the
following warning is seen:
      
       kmemleak: Partially freeing unknown object at 0xffff97345aff7000 (size 4096)
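A minimal before/after sketch of the change in free_bootmem_page()
(illustrative; the surrounding code is omitted):

	/* before: tried to delete the object by virtual address */
	kmemleak_free_part(page_to_virt(page), PAGE_SIZE);

	/* after: delete by physical address, matching kmemleak_alloc_phys() */
	kmemleak_free_part_phys(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);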
      
      Link: https://lkml.kernel.org/r/20231018102952.3339837-3-liushixin2@huawei.com
Fixes: 028725e7 ("bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      80203f1c
• bootmem: use kmemleak_free_part_phys in put_page_bootmem · 6d4e2cda
      Liu Shixin authored
      Patch series "Some bugfix about kmemleak", v3.
      
Some bugfixes for kmemleak and for the information it prints in debug mode.
      
      
      This patch (of 7):
      
Since kmemleak_alloc_phys() rather than kmemleak_alloc() was called from
memblock_alloc_range_nid(), kmemleak_free_part_phys() should be used to
delete the kmemleak object in put_page_bootmem().  In debug mode, the
following warning is seen:
      
       kmemleak: Partially freeing unknown object at 0xffff97345aff7000 (size 4096)
      
      Link: https://lkml.kernel.org/r/20231018102952.3339837-1-liushixin2@huawei.com
      Link: https://lkml.kernel.org/r/20231018102952.3339837-2-liushixin2@huawei.com
Fixes: dd0ff4d1 ("bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6d4e2cda
• mm: remove page_cpupid_xchg_last() · 8f0f4788
      Kefeng Wang authored
      
      
Since all callers now use folio_xchg_last_cpupid(), remove
page_cpupid_xchg_last().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-20-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8f0f4788
• mm: use folio_xchg_last_cpupid() in wp_page_reuse() · c2c3b514
      Kefeng Wang authored
      
      
Convert wp_page_reuse() to use folio_xchg_last_cpupid() and remove the
page variable.  Since only normal and PMD-mapped pages are currently
handled by NUMA balancing, it is enough to update only the folio's last
cpupid.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-19-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c2c3b514
• mm: convert wp_page_reuse() and finish_mkwrite_fault() to take a folio · a86bc96b
      Kefeng Wang authored
      
      
This saves one compound_head() call and is also in preparation for the
page_cpupid_xchg_last() conversion.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-18-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a86bc96b
• mm: make finish_mkwrite_fault() static · c08b7e38
      Kefeng Wang authored
      
      
      Make finish_mkwrite_fault static since it is not used outside of
      memory.c.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-17-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c08b7e38
• mm: huge_memory: use folio_xchg_last_cpupid() in __split_huge_page_tail() · c8253011
      Kefeng Wang authored
      
      
      Convert to use folio_xchg_last_cpupid() in __split_huge_page_tail().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-16-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c8253011
• mm: migrate: use folio_xchg_last_cpupid() in folio_migrate_flags() · 4e694fe4
      Kefeng Wang authored
      
      
      Convert to use folio_xchg_last_cpupid() in folio_migrate_flags(), also
      directly use folio_nid() instead of page_to_nid(&folio->page).
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-15-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4e694fe4
• sched/fair: use folio_xchg_last_cpupid() in should_numa_migrate_memory() · 1b143cc7
      Kefeng Wang authored
      
      
      Convert to use folio_xchg_last_cpupid() in should_numa_migrate_memory().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1b143cc7
• mm: add folio_xchg_last_cpupid() · 136d0b47
      Kefeng Wang authored
      
      
Add a folio_xchg_last_cpupid() wrapper, which is required to convert
page_cpupid_xchg_last() to the folio version later in the series.
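A minimal sketch of the wrapper, assuming it simply forwards to the
existing page helper (illustrative):

	static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
	{
		return page_cpupid_xchg_last(&folio->page, cpupid);
	}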
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-13-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      136d0b47
• mm: remove xchg_page_access_time() · f3930843
      Kefeng Wang authored
      
      
Since all callers now use folio_xchg_access_time(), remove
xchg_page_access_time().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-12-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f3930843
• mm: huge_memory: use a folio in change_huge_pmd() · d986ba2b
      Kefeng Wang authored
      
      
Use a folio in change_huge_pmd(), which helps to remove the last
xchg_page_access_time() caller.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-11-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d986ba2b
• mm: mprotect: use a folio in change_pte_range() · ec177880
      Kefeng Wang authored
      
      
Use a folio in change_pte_range() to save three compound_head() calls.
Since only normal and PMD-mapped pages are currently handled by NUMA
balancing, it is enough to update only the folio's access time.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-10-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec177880
• sched/fair: use folio_xchg_access_time() in numa_hint_fault_latency() · 0b201c36
      Kefeng Wang authored
      
      
      Convert to use folio_xchg_access_time() in numa_hint_fault_latency().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b201c36
• mm: add folio_xchg_access_time() · 55c19938
      Kefeng Wang authored
      
      
Add a folio_xchg_access_time() wrapper, which is required to convert
xchg_page_access_time() to the folio version later in the series.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-8-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55c19938
• mm: remove page_cpupid_last() · f39eac30
      Kefeng Wang authored
      
      
Since all callers now use folio_last_cpupid(), remove page_cpupid_last().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f39eac30
• mm: huge_memory: use folio_last_cpupid() in __split_huge_page_tail() · 19c1ac02
      Kefeng Wang authored
      
      
      Convert to use folio_last_cpupid() in __split_huge_page_tail().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-6-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      19c1ac02
• mm: huge_memory: use folio_last_cpupid() in do_huge_pmd_numa_page() · c4a8d2fa
      Kefeng Wang authored
      
      
      Convert to use folio_last_cpupid() in do_huge_pmd_numa_page().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c4a8d2fa
• mm: memory: use folio_last_cpupid() in do_numa_page() · 67b33e3f
      Kefeng Wang authored
      
      
      Convert to use folio_last_cpupid() in do_numa_page().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      67b33e3f
• mm: add folio_last_cpupid() · 155c98cf
      Kefeng Wang authored
      
      
Add a folio_last_cpupid() wrapper, which is required to convert
page_cpupid_last() to the folio version later in the series.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      155c98cf
• mm_types: add virtual and _last_cpupid into struct folio · 1d44f2e6
      Kefeng Wang authored
      
      
      Patch series "mm: convert page cpupid functions to folios", v3.
      
The cpupid (or access time) used by NUMA balancing is stored in a page's
flags or in its _last_cpupid field (if LAST_CPUPID_NOT_IN_PAGE_FLAGS).
This series converts page cpupid to folio cpupid: a new _last_cpupid
field is added to struct folio so that folio->_last_cpupid can be used
directly, and the page cpupid functions are converted to their folio
counterparts:
      
        page_cpupid_last()		-> folio_last_cpupid()
        xchg_page_access_time()	-> folio_xchg_access_time()
        page_cpupid_xchg_last()	-> folio_xchg_last_cpupid()
      
      
      This patch (of 19):
      
If WANT_PAGE_VIRTUAL and LAST_CPUPID_NOT_IN_PAGE_FLAGS are defined, the
'virtual' and '_last_cpupid' fields are present in struct page.  Since
_last_cpupid is used by the NUMA balancing feature, it is better to move
it ahead of the KMSAN metadata in struct page, and to add both fields to
struct folio so that they can be accessed directly from a folio.  A
minimal sketch of the added fields is shown below.
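An illustrative sketch of the added fields (simplified; struct folio's
existing members and its union-with-struct-page layout are omitted):

	struct folio {
		/* ... existing members mirroring struct page ... */
	#if defined(WANT_PAGE_VIRTUAL)
		void *virtual;		/* kernel virtual address, if kmapped */
	#endif
	#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
		int _last_cpupid;	/* used directly by NUMA balancing */
	#endif
	};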
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20231018140806.2783514-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1d44f2e6
• mm/swap: avoid a xa load for swapout path · e5b306a0
      Kairui Song authored
A variable is never used on the swapout path (shadowp is NULL), and the
compiler is unable to optimize out the unneeded load since it is a
function call.

This was introduced by commit 3852f676 ("mm/swapcache: support to handle
the shadow entries").
      
      Link: https://lkml.kernel.org/r/20231017011728.37508-1-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e5b306a0
• mm: kmem: reimplement get_obj_cgroup_from_current() · e56808fe
      Roman Gushchin authored
      
      
      Reimplement get_obj_cgroup_from_current() using current_obj_cgroup(). 
      get_obj_cgroup_from_current() and current_obj_cgroup() share 80% of the
      code, so the new implementation is almost trivial.
      
      get_obj_cgroup_from_current() is a convenient function used by the
      bpf subsystem, so there is no reason to get rid of it completely.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-7-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e56808fe
• percpu: scoped objcg protection · c63b835d
      Roman Gushchin authored
      
      
Similar to slab and kmem, switch to a scope-based protection of the
objcg pointer to avoid extra operations with the objcg reference
counter.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-6-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c63b835d
• mm: kmem: scoped objcg protection · e86828e5
      Roman Gushchin authored
      
      
Switch to a scope-based protection of the objcg pointer on slab/kmem
allocation paths.  Instead of using the get_() semantics in the
pre-allocation hook and putting the reference afterwards, rely on the
fact that the objcg is pinned by the scope, as the sketch after the list
below illustrates.
      
      It's possible because:
      1) if the objcg is received from the current task struct, the task is
         keeping a reference to the objcg.
      2) if the objcg is received from an active memcg (remote charging),
         the memcg is pinned by the scope and has a reference to the
         corresponding objcg.
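A minimal sketch of the resulting allocation-path usage (illustrative;
the exact hook code in the kernel differs):

	struct obj_cgroup *objcg = current_obj_cgroup();  /* no reference taken */

	if (objcg && obj_cgroup_charge(objcg, flags, size))
		return false;                             /* charge failed */
	/* ... account the allocation against objcg ... */
	/* no obj_cgroup_put(): the objcg is pinned by the scope */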
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-5-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e86828e5
• mm: kmem: make memcg keep a reference to the original objcg · 675d6c9b
      Roman Gushchin authored
      
      
      Keep a reference to the original objcg object for the entire life of a
      memcg structure.
      
This allows the synchronization on the kernel memory allocation paths to
be simplified: pinning a (live) memcg will also pin the corresponding
objcg.
      
      The memory overhead of this change is minimal because object cgroups
      usually outlive their corresponding memory cgroups even without this
      change, so it's only an additional pointer per memcg.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      675d6c9b
• mm: kmem: add direct objcg pointer to task_struct · 1aacbd35
      Roman Gushchin authored
      
      
To charge a freshly allocated kernel object to a memory cgroup, the
kernel needs to obtain an objcg pointer.  Currently it does so
indirectly, by obtaining the memcg pointer first and then calling
__get_obj_cgroup_from_memcg().
      
      Usually tasks spend their entire life belonging to the same object cgroup.
      So it makes sense to save the objcg pointer on task_struct directly, so
      it can be obtained faster.  It requires some work on fork, exit and cgroup
      migrate paths, but these paths are way colder.
      
      To avoid any costly synchronization the following rules are applied:
1) A task sets its objcg pointer itself.
      
      2) If a task is being migrated to another cgroup, the least
         significant bit of the objcg pointer is set atomically.
      
3) On the allocation path the objcg pointer is obtained locklessly
   using the READ_ONCE() macro and the least significant bit is
   checked. If it's set, the following procedure is used to update
   it locklessly (see the sketch after this list):
       - task->objcg is zeroed using cmpxchg
       - the new objcg pointer is obtained
       - task->objcg is updated using try_cmpxchg
       - the operation is repeated if try_cmpxchg fails
   It guarantees that no updates will be lost if task migration
   is racing against objcg pointer update. It also allows keeping
   both read and write paths fully lockless.
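A minimal sketch of the update procedure in 3), simplified from the
description above (illustrative; kthread and root-memcg handling are
omitted and the exact kernel code differs):

	#define CURRENT_OBJCG_UPDATE_FLAG	0x1UL

	static struct obj_cgroup *current_objcg_update(void)
	{
		struct obj_cgroup *old, *objcg;

		do {
			/* Zero task->objcg, clearing the "update pending" bit. */
			old = xchg(&current->objcg, NULL);
			if (old) {
				old = (struct obj_cgroup *)((unsigned long)old &
						~CURRENT_OBJCG_UPDATE_FLAG);
				if (old)	/* drop the stale reference */
					obj_cgroup_put(old);
				old = NULL;
			}

			/* Obtain the new objcg from the task's current memcg. */
			rcu_read_lock();
			objcg = rcu_dereference(mem_cgroup_from_task(current)->objcg);
			if (objcg && !obj_cgroup_tryget(objcg))
				objcg = NULL;
			rcu_read_unlock();

			/*
			 * Publish it.  If a migration raced with us and set the
			 * flag again, try_cmpxchg() fails and the sequence is
			 * repeated, so no update is ever lost.
			 */
		} while (!try_cmpxchg(&current->objcg, &old, objcg));

		return objcg;
	}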
      
      Because the task is keeping a reference to the objcg, it can't go away
      while the task is alive.
      
      This commit doesn't change the way the remote memcg charging works.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-3-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1aacbd35
• mm: kmem: optimize get_obj_cgroup_from_current() · 7d0715d0
      Roman Gushchin authored
      
      
      Patch series "mm: improve performance of accounted kernel memory
      allocations", v5.
      
      This patchset improves the performance of accounted kernel memory
      allocations by ~30% as measured by a micro-benchmark [1].  The benchmark
      is very straightforward: 1M of 64 bytes-large kmalloc() allocations.
      
Below are results with kernel memory accounting disabled, with the
original code, and with this patchset applied.
      
      |             | Kmem disabled | Original | Patched |  Delta |
      |-------------+---------------+----------+---------+--------|
      | User cgroup |         29764 |    84548 |   59078 | -30.0% |
      | Root cgroup |         29742 |    48342 |   31501 | -34.8% |
      
      As we can see, the patchset removes the majority of the overhead when
      there is no actual accounting (a task belongs to the root memory cgroup)
      and almost halves the accounting overhead otherwise.
      
      The main idea is to get rid of unnecessary memcg to objcg conversions and
      switch to a scope-based protection of objcgs, which eliminates extra
      operations with objcg reference counters under a rcu read lock.  More
      details are provided in individual commit descriptions.
      
      
      This patch (of 5):
      
      Manually inline memcg_kmem_bypass() and active_memcg() to speed up
      get_obj_cgroup_from_current() by avoiding duplicate in_task() checks and
      active_memcg() readings.
      
      Also add a likely() macro to __get_obj_cgroup_from_memcg():
      obj_cgroup_tryget() should succeed at almost all times except a very
      unlikely race with the memcg deletion path.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-1-roman.gushchin@linux.dev
      Link: https://lkml.kernel.org/r/20231019225346.1822282-2-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d0715d0
• mm, pcp: reduce detecting time of consecutive high order page freeing · 6ccdcb6d
      Huang Ying authored
      
      
In the current PCP auto-tuning design, if the number of pages allocated
is much larger than the number of pages freed on a CPU, the PCP high may
reach the maximal value even if the allocating/freeing depth is small,
for example in the sender of network workloads.  If a CPU that was
originally used as a sender is then used as a receiver after a context
switch, the whole PCP has to be filled up to the maximal high before PCP
draining is triggered for consecutive high-order freeing.  This hurts
the performance of some network workloads.
      
To solve the issue, this patch tracks consecutive page freeing with a
counter instead of relying on PCP draining, so consecutive page freeing
can be detected much earlier.
      
On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes.  With the patch, the network bandwidth improves by 5.0%. 
This restores the performance drop caused by PCP auto-tuning.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-10-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6ccdcb6d
• mm, pcp: decrease PCP high if free pages < high watermark · 57c0419c
      Huang Ying authored
      
      
One target of PCP is to minimize pages in PCP if the system's free pages
are too few.  To reach that target, when page reclaiming is active for the
      zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in allocating
      path, decrease PCP high and free some pages in freeing path.  But this may
      be too late because the background page reclaiming may introduce latency
      for some workloads.  So, in this patch, during page allocation we will
      detect whether the number of free pages of the zone is below high
      watermark.  If so, we will stop increasing PCP high in allocating path,
decrease PCP high and free some pages in the freeing path.  With this,
we can reduce the possibility of premature background page reclaiming
caused by a too-large PCP.
      
The high watermark check is done in the allocating path to reduce the
overhead in the hotter freeing path.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-9-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      57c0419c
• mm: tune PCP high automatically · 51a755c5
      Huang Ying authored
      
      
The targets of tuning PCP high automatically are as follows:
      
      - Minimize allocation/freeing from/to shared zone
      
      - Minimize idle pages in PCP
      
- Minimize pages in PCP if the system's free pages are too few
      
To reach these targets, the following tuning algorithm is designed:
      
- When we refill PCP via allocating from the zone, increase PCP high,
  because with a larger PCP we could have avoided allocating from the
  zone.
      
      - In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
        decrease PCP high to try to free possible idle PCP pages.
      
      - When page reclaiming is active for the zone, stop increasing PCP
        high in allocating path, decrease PCP high and free some pages in
        freeing path.
      
      So, the PCP high can be tuned to the page allocating/freeing depth of
      workloads eventually.
      
      One issue of the algorithm is that if the number of pages allocated is
      much more than that of pages freed on a CPU, the PCP high may become the
      maximal value even if the allocating/freeing depth is small.  But this
      isn't a severe issue, because there are no idle pages in this case.
      
      One alternative choice is to increase PCP high when we drain PCP via
      trying to free pages to the zone, but don't increase PCP high during PCP
      refilling.  This can avoid the issue above.  But if the number of pages
      allocated is much less than that of pages freed on a CPU, there will be
      many idle pages in PCP and it is hard to free these idle pages.
      
PCP high will be decreased by 1/8 (>> 3) periodically.  The value 1/8 is
somewhat arbitrary, just to make sure that the idle PCP pages will be
freed eventually; a minimal sketch of this decay step is shown below.
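A minimal sketch of that decay step, assuming the pcp->high and
pcp->high_min fields added in this series (illustrative):

	int high = READ_ONCE(pcp->high);

	/* Shrink by 1/8 toward high_min so idle PCP pages get freed. */
	high = max(high - (high >> 3), READ_ONCE(pcp->high_min));
	WRITE_ONCE(pcp->high, high);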
      
On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server that is used by the 0-Day kbuild service. 
With the patch, the build time decreases by 3.5%.  The cycles% of the
spinlock contention (mostly for the zone lock) decreases from 11.0% to
0.5%.  The number of PCP drainings for high-order page freeing
(free_high) decreases by 65.6%.  The number of pages allocated from the
zone (instead of from PCP) decreases by 83.9%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      51a755c5
• mm: add framework for PCP high auto-tuning · 90b41691
      Huang Ying authored
      
      
      The page allocation performance requirements of different workloads are
      usually different.  So, we need to tune PCP (per-CPU pageset) high to
      optimize the workload page allocation performance.  Now, we have a system
      wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand.
      But, it's hard to find out the best value by hand.  And one global
      configuration may not work best for the different workloads that run on
      the same system.  One solution to these issues is to tune PCP high of each
      CPU automatically.
      
      This patch adds the framework for PCP high auto-tuning.  With it,
      pcp->high of each CPU will be changed automatically by tuning algorithm at
      runtime.  The minimal high (pcp->high_min) is the original PCP high value
      calculated based on the low watermark pages.  While the maximal high
      (pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction
      sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION.  That is, the
      maximal pcp->high that can be set via sysctl knob by hand.
      
      It's possible that PCP high auto-tuning doesn't work well for some
      workloads.  So, when PCP high is tuned by hand via the sysctl knob, the
      auto-tuning will be disabled.  The PCP high set by hand will be used
      instead.
      
      This patch only adds the framework, so pcp->high will be set to
      pcp->high_min (original default) always.  We will add actual auto-tuning
      algorithm in the following patches in the series.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      90b41691
• mm, page_alloc: scale the number of pages that are batch allocated · c0a24239
      Huang Ying authored
      
      
When a task is allocating a large number of order-0 pages, it may
acquire the zone->lock multiple times, allocating pages in batches. 
This may unnecessarily contend on the zone lock when allocating a very
large number of pages.  This patch adapts the size of the batch based on
the recent allocation pattern to scale the batch size for subsequent
allocations.
      
On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server that is used by the 0-Day kbuild service. 
With the patch, the cycles% of the spinlock contention (mostly for the
zone lock) decreases from 12.6% to 11.0% (with PCP size == 367).
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c0a24239
• mm: restrict the pcp batch scale factor to avoid too long latency · 52166607
      Huang Ying authored
In the page allocator, PCP (Per-CPU Pageset) is refilled and drained in
batches to increase page allocation throughput, reduce page
allocation/freeing latency per page, and reduce zone lock contention. 
But a too-large batch size will cause too long a maximal
allocation/freeing latency, which may punish arbitrary users.  So the
default batch size is chosen carefully (in zone_batchsize(), the value
is 63 for zones larger than 1GB) to avoid that.
      
In commit 3b12e7e9 ("mm/page_alloc: scale the number of pages that are
batch freed"), the batch size is scaled up when a large number of pages
are freed, to improve page freeing performance and reduce zone lock
contention.  A similar optimization can be used for allocating a large
number of pages too.
      
      To find out a suitable max batch scale factor (that is, max effective
      batch size), some tests and measurement on some machines were done as
      follows.
      
      A set of debug patches are implemented as follows,
      
      - Set PCP high to be 2 * batch to reduce the effect of PCP high
      
      - Disable free batch size scaling to get the raw performance.
      
      - The code with zone lock held is extracted from rmqueue_bulk() and
        free_pcppages_bulk() to 2 separate functions to make it easy to
        measure the function run time with ftrace function_graph tracer.
      
      - The batch size is hard coded to be 63 (default), 127, 255, 511,
        1023, 2047, 4095.
      
      Then will-it-scale/page_fault1 is used to generate the page
      allocation/freeing workload.  The page allocation/freeing throughput
      (page/s) is measured via will-it-scale.  The page allocation/freeing
      average latency (alloc/free latency avg, in us) and allocation/freeing
      latency at 99 percentile (alloc/free latency 99%, in us) are measured with
      ftrace function_graph tracer.
      
      The test results are as follows,
      
      Sapphire Rapids Server
      ======================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	513633.4	 2.33		 3.57		 2.67		  6.83
       127	517616.7	 4.35		 6.65		 4.22		 13.03
       255	520822.8	 8.29		13.32		 7.52		 25.24
       511	524122.0	15.79		23.42		14.02		 49.35
      1023	525980.5	30.25		44.19		25.36		 94.88
      2047	526793.6	59.39		84.50		45.22		140.81
      
      Ice Lake Server
      ===============
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	620210.3	 2.21		 3.68		 2.02		 4.35
       127	627003.0	 4.09		 6.86		 3.51		 8.28
       255	630777.5	 7.70		13.50		 6.17		15.97
       511	633651.5	14.85		22.62		11.66		31.08
      1023	637071.1	28.55		42.02		20.81		54.36
      2047	638089.7	56.54		84.06		39.28		91.68
      
      Cascade Lake Server
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	404706.7	 3.29		  5.03		 3.53		  4.75
       127	422475.2	 6.12		  9.09		 6.36		  8.76
       255	411522.2	11.68		 16.97		10.90		 16.39
       511	428124.1	22.54		 31.28		19.86		 32.25
      1023	414718.4	43.39		 62.52		40.00		 66.33
      2047	429848.7	86.64		120.34		71.14		106.08
      
Comet Lake Desktop
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
      
        63	795183.13	 2.18		 3.55		 2.03		 3.05
       127	803067.85	 3.91		 6.56		 3.85		 5.52
       255	812771.10	 7.35		10.80		 7.14		10.20
       511	817723.48	14.17		27.54		13.43		30.31
      1023	818870.19	27.72		40.10		27.89		46.28
      
      Coffee Lake Desktop
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	510542.8	 3.13		  4.40		 2.48		 3.43
       127	514288.6	 5.97		  7.89		 4.65		 6.04
       255	516889.7	11.86		 15.58		 8.96		12.55
       511	519802.4	23.10		 28.81		16.95		26.19
      1023	520802.7	45.30		 52.51		33.19		45.95
      2047	519997.1	90.63		104.00		65.26		81.74
      
From the above data, to restrict the allocation/freeing latency to less
than 100 us in most cases, the max batch scale factor needs to be less
than or equal to 5.
      
Although it is reasonable to use 5 as the max batch scale factor for the
systems tested, there are also slower systems, where a smaller value
should be used to constrain the page allocation/freeing latency.
      
So, in this patch, a new kconfig option (PCP_BATCH_SCALE_MAX) is added
to set the max batch scale factor.  Its default value is 5, and users
can reduce it when necessary.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      52166607
• mm, pcp: reduce lock contention for draining high-order pages · 362d37a1
      Huang Ying authored
In commit f26b3fa0 ("mm/page_alloc: limit number of high-order pages on
PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when PCP is
mostly used for high-order page freeing, to improve the reuse of
cache-hot pages between the page allocating and freeing CPUs.
      
On a system with a small per-CPU data cache slice, pages shouldn't be
cached before draining, to guarantee that they stay cache-hot.  But on a
system with a large per-CPU data cache slice, some pages can be cached
before draining to reduce zone lock contention.
      
      So, in this patch, instead of draining without any caching, "pcp->batch"
      pages will be cached in PCP before draining if the size of the per-CPU
      data cache slice is more than "3 * batch".
      
      In theory, if the size of per-CPU data cache slice is more than "2 *
      batch", we can reuse cache-hot pages between CPUs.  But considering the
      other usage of cache (code, other data accessing, etc.), "3 * batch" is
      used.
      
      Note: "3 * batch" is chosen to make sure the optimization works on recent
      x86_64 server CPUs.  If you want to increase it, please check whether it
      breaks the optimization.
      
On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases by 70.5%.  The cycles% of the
spinlock contention (mostly for the zone lock) decreases from 46.1% to
21.3%.  The number of PCP drainings for high-order page freeing
(free_high) decreases by 89.9%.  The cache miss rate stays at 0.2%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      362d37a1
• cacheinfo: calculate size of per-CPU data cache slice · 94a3bfe4
      Huang Ying authored
      
      
This can be used to estimate the size of the data cache slice that can
be used by one CPU under ideal circumstances.  Both DATA caches and
UNIFIED caches are used in the calculation, so users need to consider
the impact of code cache usage.
      
      Because the cache inclusive/non-inclusive information isn't available now,
      we just use the size of the per-CPU slice of LLC to make the result more
      predictable across architectures.  This may be improved when more cache
      information is available in the future.
      
A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline callback. 
A minimal sketch of the estimate is shown below.
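A minimal sketch of the estimate using the generic cacheinfo API
(illustrative; the helper name is hypothetical and the real code differs
in details):

	static unsigned int per_cpu_llc_data_slice(unsigned int cpu)
	{
		struct cpu_cacheinfo *cci = get_cpu_cacheinfo(cpu);
		struct cacheinfo *llc;
		unsigned int nr_shared;

		if (!cci || !cci->num_leaves)
			return 0;

		/* The last leaf describes the last-level cache. */
		llc = &cci->info_list[cci->num_leaves - 1];
		if (llc->type != CACHE_TYPE_DATA && llc->type != CACHE_TYPE_UNIFIED)
			return 0;

		/* Divide the LLC among the online CPUs that share it. */
		nr_shared = cpumask_weight_and(&llc->shared_cpu_map, cpu_online_mask);
		return nr_shared ? llc->size / nr_shared : 0;
	}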
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      94a3bfe4