  1. Mar 29, 2023
    • mm: shrinkers: make count and scan in shrinker debugfs lockless · 20cd1892
      Qi Zheng authored
      
      
      Like global and memcg slab shrink, also use SRCU to make count and scan
      operations in memory shrinker debugfs lockless.
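
      A minimal sketch of the pattern, assuming the shrinker_srcu SRCU domain
      introduced by this series; the handler name and signature below are
      illustrative, not the literal debugfs diff:

      ```
      /* debugfs "count" path: enter the SRCU read section instead of taking
       * shrinker_rwsem, so it can never block shrinker (un)registration. */
      static unsigned long shrinker_debugfs_count_sketch(struct shrinker *shrinker,
                                                         struct shrink_control *sc)
      {
              int srcu_idx = srcu_read_lock(&shrinker_srcu);
              unsigned long count = shrinker->count_objects(shrinker, sc);

              srcu_read_unlock(&shrinker_srcu, srcu_idx);
              return count;
      }
      ```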
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-6-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      20cd1892
    • mm: vmscan: add shrinker_srcu_generation · 475733dd
      Kirill Tkhai authored
      
      
      After we make slab shrink lockless with SRCU, the longest sleep in
      unregister_shrinker() will be the wait for all in-flight do_shrink_slab()
      calls to finish.
      
      To avoid a long, unbreakable wait in unregister_shrinker(), add
      shrinker_srcu_generation to restore a check similar to the
      rwsem_is_contended() check that we had before.
      
      For memcg slab shrink, we unlock SRCU and continue the iteration from the
      next shrinker id.
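
      A hedged sketch of the reader-side generation check described above;
      apart from the SRCU API and do_shrink_slab(), the surrounding code is
      simplified and illustrative:

      ```
      /* bumped by unregister_shrinker() before it waits */
      static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);

      /* reader side, inside srcu_read_lock(&shrinker_srcu): */
      unsigned long freed = 0;
      int gen = atomic_read(&shrinker_srcu_generation);

      list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                               srcu_read_lock_held(&shrinker_srcu)) {
              freed += do_shrink_slab(&sc, shrinker, priority);
              if (atomic_read(&shrinker_srcu_generation) != gen)
                      break;  /* a writer is waiting, stop early */
      }
      ```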
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-5-zhengqi.arch@bytedance.com
      Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      475733dd
    • mm: vmscan: make memcg slab shrink lockless · caa05325
      Qi Zheng authored
      
      
      Like global slab shrink, this commit also uses SRCU to make memcg slab
      shrink lockless.
      
      We can reproduce the down_read_trylock() hotspot through the
      following script:
      
      ```
      #!/bin/bash
      
      DIR="/root/shrinker/memcg/mnt"
      
      do_create()
      {
          mkdir -p /sys/fs/cgroup/memory/test
          mkdir -p /sys/fs/cgroup/perf_event/test
          echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          for i in `seq 0 $1`;
          do
              mkdir -p /sys/fs/cgroup/memory/test/$i;
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
              mkdir -p $DIR/$i;
          done
      }
      
      do_mount()
      {
          for i in `seq $1 $2`;
          do
              mount -t tmpfs $i $DIR/$i;
          done
      }
      
      do_touch()
      {
          for i in `seq $1 $2`;
          do
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
                  dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
          done
      }
      
      case "$1" in
        touch)
          do_touch $2 $3
          ;;
        test)
            do_create 4000
          do_mount 0 4000
          do_touch 0 3000
          ;;
        *)
          exit 1
          ;;
      esac
      ```
      
      Save the above script, then run test and touch commands.
      Then we can use the following perf command to view hotspots:
      
      perf top -U -F 999
      
      1) Before applying this patchset:
      
        32.31%  [kernel]           [k] down_read_trylock
        19.40%  [kernel]           [k] pv_native_safe_halt
        16.24%  [kernel]           [k] up_read
        15.70%  [kernel]           [k] shrink_slab
         4.69%  [kernel]           [k] _find_next_bit
         2.62%  [kernel]           [k] shrink_node
         1.78%  [kernel]           [k] shrink_lruvec
         0.76%  [kernel]           [k] do_shrink_slab
      
      2) After applying this patchset:
      
        27.83%  [kernel]           [k] _find_next_bit
        16.97%  [kernel]           [k] shrink_slab
        15.82%  [kernel]           [k] pv_native_safe_halt
         9.58%  [kernel]           [k] shrink_node
         8.31%  [kernel]           [k] shrink_lruvec
         5.64%  [kernel]           [k] do_shrink_slab
         3.88%  [kernel]           [k] mem_cgroup_iter
      
      At the same time, we use the following perf command to capture
      IPC information:
      
      perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
      
      1) Before applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            454187219766      cycles                    test                    ( +-  1.84% )
             78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
      
              10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
      
      2) After applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
            527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
      
                10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
      
      We can see that IPC drops significantly when down_read_trylock() is
      called at high frequency.  After switching to SRCU, the IPC returns to
      a normal level.
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-4-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      caa05325
    • mm: vmscan: make global slab shrink lockless · f95bdb70
      Qi Zheng authored
      
      
      The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
      which protects most operations such as slab shrink, registration and
      unregistration of shrinkers, etc.  This can easily cause problems in the
      following cases.
      
      1) When the memory pressure is high and there are many
         filesystems mounted or unmounted at the same time,
         slab shrink will be affected (down_read_trylock()
         failed).
      
         Such as the real workload mentioned by Kirill Tkhai:
      
         ```
         One of the real workloads from my experience is start
         of an overcommitted node containing many starting
         containers after node crash (or many resuming containers
         after reboot for kernel update). In these cases memory
         pressure is huge, and the node goes round in long reclaim.
         ```
      
      2) If a shrinker is blocked (such as the case mentioned
         in [1]) and a writer comes in (such as mount a fs),
         then this writer will be blocked and cause all
         subsequent shrinker-related operations to be blocked.
      
      Even if there is no competitor when shrinking slab, there may still be a
      problem.  If we have a long shrinker list and we do not reclaim enough
      memory with each shrinker, then the down_read_trylock() may be called with
      high frequency.  Because of the poor multicore scalability of atomic
      operations, this can lead to a significant drop in IPC (instructions per
      cycle).
      
      So many times in history ([2],[3],[4],[5]), some people wanted to replace
      shrinker_rwsem trylock with SRCU in the slab shrink, but all these patches
      were abandoned because SRCU was not unconditionally enabled.
      
      But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"), the SRCU
      is unconditionally enabled.  So it's time to use SRCU to protect readers
      who previously held shrinker_rwsem.
      
      This commit uses SRCU to make global slab shrink lockless;
      the memcg slab shrink is handled in the subsequent patch.
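
      A hedged sketch of the before/after locking pattern this describes
      (simplified, not the exact mm/vmscan.c diff):

      ```
      /* before: slab shrink gives up when shrinker_rwsem is contended */
      if (!down_read_trylock(&shrinker_rwsem))
              goto out;
      /* ... walk shrinker_list ... */
      up_read(&shrinker_rwsem);

      /* after: readers always make progress under SRCU */
      DEFINE_SRCU(shrinker_srcu);

      idx = srcu_read_lock(&shrinker_srcu);
      list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                               srcu_read_lock_held(&shrinker_srcu))
              freed += do_shrink_slab(&sc, shrinker, priority);
      srcu_read_unlock(&shrinker_srcu, idx);

      /* unregister_shrinker() side: */
      list_del_rcu(&shrinker->list);
      synchronize_srcu(&shrinker_srcu);       /* wait for in-flight readers */
      ```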
      
      [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
      [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
      [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
      [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
      [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f95bdb70
    • mm: vmscan: add a map_nr_max field to shrinker_info · 42c9db39
      Qi Zheng authored
      
      
      Patch series "make slab shrink lockless", v5.
      
      This patch series aims to make slab shrink lockless.
      
      1. Background
      =============
      
      On our servers, we often find the following system cpu hotspots:
      
        52.22% [kernel]        [k] down_read_trylock
        19.60% [kernel]        [k] up_read
         8.86% [kernel]        [k] shrink_slab
         2.44% [kernel]        [k] idr_find
         1.25% [kernel]        [k] count_shadow_nodes
         1.18% [kernel]        [k] shrink_lruvec
         0.71% [kernel]        [k] mem_cgroup_iter
         0.71% [kernel]        [k] shrink_node
         0.55% [kernel]        [k] find_next_bit
      
      And we used bpftrace to capture its calltrace as follows:
      
      @[
          down_read_trylock+1
          shrink_slab+128
          shrink_node+371
          do_try_to_free_pages+232
          try_to_free_pages+243
          __alloc_pages_slowpath+771
          __alloc_pages_nodemask+702
          pagecache_get_page+255
          filemap_fault+1361
          ext4_filemap_fault+44
          __do_fault+76
          handle_mm_fault+3543
          do_user_addr_fault+442
          do_page_fault+48
          page_fault+62
      ]: 1161690
      @[
          down_read_trylock+1
          shrink_slab+128
          shrink_node+371
          balance_pgdat+690
          kswapd+389
          kthread+246
          ret_from_fork+31
      ]: 8424884
      @[
          down_read_trylock+1
          shrink_slab+128
          shrink_node+371
          do_try_to_free_pages+232
          try_to_free_pages+243
          __alloc_pages_slowpath+771
          __alloc_pages_nodemask+702
          __do_page_cache_readahead+244
          filemap_fault+1674
          ext4_filemap_fault+44
          __do_fault+76
          handle_mm_fault+3543
          do_user_addr_fault+442
          do_page_fault+48
          page_fault+62
      ]: 20917631
      
      We can see that down_read_trylock() of shrinker_rwsem is being called with
      high frequency at that time.  Because of the poor multicore scalability of
      atomic operations, this can lead to a significant drop in IPC
      (instructions per cycle).
      
      And more, the shrinker_rwsem is a global read-write lock in shrinkers
      subsystem, which protects most operations such as slab shrink,
      registration and unregistration of shrinkers, etc.  This can easily cause
      problems in the following cases.
      
      1) When the memory pressure is high and there are many filesystems
         mounted or unmounted at the same time, slab shrink will be affected
         (down_read_trylock() failed).
      
         Such as the real workload mentioned by Kirill Tkhai:
      
         ```
         One of the real workloads from my experience is start of an
         overcommitted node containing many starting containers after node crash
         (or many resuming containers after reboot for kernel update).  In these
         cases memory pressure is huge, and the node goes round in long reclaim.
         ```
      
      2) If a shrinker is blocked (such as the case mentioned in [1]) and a
         writer comes in (such as mount a fs), then this writer will be blocked
         and cause all subsequent shrinker-related operations to be blocked.
      
      [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
      
      All the above cases can be solved by replacing the shrinker_rwsem trylocks
      with SRCU.
      
      2. Survey
      =========
      
      Before doing the code implementation, I found that there were many similar
      submissions in the community:
      
      a. Davidlohr Bueso submitted a patch in 2015.
         Subject: [PATCH -next v2] mm: srcu-ify shrinkers
         Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
         Result: It was finally merged into the linux-next branch,
                 but failed on arm allnoconfig (without CONFIG_SRCU)
      
      b. Tetsuo Handa submitted a patchset in 2017.
         Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
         Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
         Result: Finally chose to use the current simple way (break
          when rwsem_is_contended()).  And Christoph Hellwig suggested
                 using SRCU, but SRCU was not unconditionally enabled at the
                 time.
      
      c. Kirill Tkhai submitted a patchset in 2018.
         Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
         Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
         Result: At that time, SRCU was not unconditionally enabled,
                 and there were some objections to enabling SRCU.  Later,
          because Kirill's focus moved to other things, this patchset
          was not updated further.
      
      d. Sultan Alsawaf submitted a patch in 2021.
         Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
         Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
         Result: Rejected because SRCU was not unconditionally enabled.
      
      We can find that almost all these historical commits were abandoned
      because SRCU was not unconditionally enabled.  But now SRCU has been
      unconditionally enabled by Paul E. McKenney in 2023 [2], so it's time to
      replace shrinker_rwsem trylocks with SRCU.
      
      [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
      
      3. Reproduction and testing
      ===========================
      
      We can reproduce the down_read_trylock() hotspot through the following script:
      
      ```
      #!/bin/bash
      
      DIR="/root/shrinker/memcg/mnt"
      
      do_create()
      {
          mkdir -p /sys/fs/cgroup/memory/test
          mkdir -p /sys/fs/cgroup/perf_event/test
          echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          for i in `seq 0 $1`;
          do
              mkdir -p /sys/fs/cgroup/memory/test/$i;
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
              mkdir -p $DIR/$i;
          done
      }
      
      do_mount()
      {
          for i in `seq $1 $2`;
          do
              mount -t tmpfs $i $DIR/$i;
          done
      }
      
      do_touch()
      {
          for i in `seq $1 $2`;
          do
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
                  dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
          done
      }
      
      case "$1" in
        touch)
          do_touch $2 $3
          ;;
        test)
            do_create 4000
          do_mount 0 4000
          do_touch 0 3000
          ;;
        *)
          exit 1
          ;;
      esac
      ```
      
      Save the above script, then run test and touch commands.  Then we can use
      the following perf command to view hotspots:
      
      perf top -U -F 999
      
      1) Before applying this patchset:
      
        32.31%  [kernel]           [k] down_read_trylock
        19.40%  [kernel]           [k] pv_native_safe_halt
        16.24%  [kernel]           [k] up_read
        15.70%  [kernel]           [k] shrink_slab
         4.69%  [kernel]           [k] _find_next_bit
         2.62%  [kernel]           [k] shrink_node
         1.78%  [kernel]           [k] shrink_lruvec
         0.76%  [kernel]           [k] do_shrink_slab
      
      2) After applying this patchset:
      
        27.83%  [kernel]           [k] _find_next_bit
        16.97%  [kernel]           [k] shrink_slab
        15.82%  [kernel]           [k] pv_native_safe_halt
         9.58%  [kernel]           [k] shrink_node
         8.31%  [kernel]           [k] shrink_lruvec
         5.64%  [kernel]           [k] do_shrink_slab
         3.88%  [kernel]           [k] mem_cgroup_iter
      
      At the same time, we use the following perf command to capture IPC
      information:
      
      perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
      
      1) Before applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            454187219766      cycles                    test                    ( +-  1.84% )
             78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
      
              10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
      
      2) After applying this patchset:
      
       Performance counter stats for 'system wide' (5 runs):
      
            841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
            527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
      
                10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
      
      We can see that IPC drops significantly when down_read_trylock() is called
      at high frequency.  After switching to SRCU, the IPC returns to a normal level.
      
      
      This patch (of 8):
      
      To prepare for the subsequent lockless memcg slab shrink, add a map_nr_max
      field to struct shrinker_info to record its own real shrinker_nr_max.
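
      A hedged sketch of the resulting structure; the exact field order is
      illustrative:

      ```
      struct shrinker_info {
              struct rcu_head rcu;
              atomic_long_t *nr_deferred;
              unsigned long *map;
              int map_nr_max; /* new: real bound of this memcg's shrinker bitmap */
      };
      ```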
      
      Link: https://lkml.kernel.org/r/20230313112819.38938-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20230313112819.38938-2-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Suggested-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <tkhai@ya.ru>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      42c9db39
    • mm: prefer xxx_page() alloc/free functions for order-0 pages · dcc1be11
      Lorenzo Stoakes authored
      
      
      Update instances of alloc_pages(..., 0), __get_free_pages(..., 0) and
      __free_pages(..., 0) to use alloc_page(), __get_free_page() and
      __free_page() respectively in core code.
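
      The conversion is mechanical; a few hedged before/after examples of the
      affected call patterns:

      ```
      page = alloc_pages(GFP_KERNEL, 0);        /* before */
      page = alloc_page(GFP_KERNEL);            /* after  */

      addr = __get_free_pages(GFP_KERNEL, 0);   /* before */
      addr = __get_free_page(GFP_KERNEL);       /* after  */

      __free_pages(page, 0);                    /* before */
      __free_page(page);                        /* after  */
      ```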
      
      Link: https://lkml.kernel.org/r/50c48ca4789f1da2a65795f2346f5ae3eff7d665.1678710232.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dcc1be11
    • kasan: remove PG_skip_kasan_poison flag · 0a54864f
      Peter Collingbourne authored
      
      
      Code inspection reveals that PG_skip_kasan_poison is redundant with
      kasantag, because the former is intended to be set iff the latter is the
      match-all tag.  It can also be observed that it's basically pointless to
      poison pages which have kasantag=0, because any pages with this tag would
      have been pointed to by pointers with match-all tags, so poisoning the
      pages would have little to no effect in terms of bug detection. 
      Therefore, change the condition in should_skip_kasan_poison() to check
      kasantag instead, and remove PG_skip_kasan_poison and associated flags.
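
      A hedged sketch of the resulting condition; treat the helper and
      constant names as assumptions rather than the exact upstream diff:

      ```
      /* Poisoning is pointless for pages whose KASAN tag is the match-all
       * value, since such pages are only reachable via match-all pointers. */
      static bool should_skip_kasan_poison_sketch(struct page *page)
      {
              return page_kasan_tag(page) == KASAN_TAG_KERNEL; /* 0xff match-all */
      }
      ```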
      
      Link: https://lkml.kernel.org/r/20230310042914.3805818-3-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I57f825f2eaeaf7e8389d6cf4597c8a5821359838
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0a54864f
    • io-mapping: don't disable preempt on RT in io_mapping_map_atomic_wc(). · 7eb16f23
      Sebastian Andrzej Siewior authored
      
      
      io_mapping_map_atomic_wc() disables preemption and pagefaults for
      historical reasons.  The conversion to io_mapping_map_local_wc(), which
      only disables migration, cannot be done wholesale because quite a few call
      sites need to be updated to accommodate the changed semantics.
      
      On PREEMPT_RT enabled kernels the io_mapping_map_atomic_wc() semantics are
      problematic due to the implicit disabling of preemption which makes it
      impossible to acquire 'sleeping' spinlocks within the mapped atomic
      sections.
      
      PREEMPT_RT has replaced the preempt_disable() with a migrate_disable() for
      more than a decade.  It could be argued that this is a justification to do
      this unconditionally, but PREEMPT_RT covers only a limited number of
      architectures and it disables some functionality which limits the coverage
      further.
      
      Limit the replacement to PREEMPT_RT for now.  This is also done for
      kmap_atomic().
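
      A hedged sketch of the PREEMPT_RT-only substitution, mirroring the
      kmap_atomic() approach; this is illustrative, not the literal
      io-mapping.h diff:

      ```
      if (IS_ENABLED(CONFIG_PREEMPT_RT))
              migrate_disable();      /* RT: pin to this CPU, keep preemption on */
      else
              preempt_disable();      /* !RT: historical atomic semantics */
      pagefault_disable();
      ```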
      
      Link: https://lkml.kernel.org/r/20230310162905.O57Pj7hh@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
        Link: https://lore.kernel.org/CAFLxGvw0WMxaMqYqJ5WgvVSbKHq2D2xcXTOgMCpgq9nDC-MWTQ@mail.gmail.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7eb16f23
    • shmem: add support to ignore swap · 2c6efe9c
      Luis Chamberlain authored
      
      
      When experimenting with shmem, having the option to avoid swap
      becomes a useful mechanism.  One of the *raves* about brd over shmem is
      you can avoid swap, but that's not really a good reason to use brd if we
      can instead use shmem.  Using brd has its own good reasons to exist, but
      just because "tmpfs" doesn't let you do that is not a great reason to
      avoid it if we can easily add support for it.
      
      I don't add support for reconfiguring incompatible options, but if we
      really wanted to we can add support for that.
      
      To avoid swap we use mapping_set_unevictable() upon inode creation, and
      put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.
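
      A hedged sketch of the two mechanisms named above; the "noswap" flag
      and the surrounding call sites are illustrative:

      ```
      /* at inode creation: keep noswap tmpfs pages off the reclaim path */
      if (sbinfo->noswap)
              mapping_set_unevictable(inode->i_mapping);

      /* stop-gap in the writepage path: reclaim should never get this far */
      if (WARN_ON_ONCE(sbinfo->noswap))
              goto redirty;
      ```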
      
      Link: https://lkml.kernel.org/r/20230309230545.2930737-7-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Tested-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Adam Manzanares <a.manzanares@samsung.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2c6efe9c
    • shmem: update documentation · d0f5a854
      Luis Chamberlain authored
      
      
      Update the docs to reflect a bit better why some folks prefer tmpfs over
      ramfs, and clarify a bit more the difference from brd ramdisks.
      
      While at it, add THP docs for tmpfs, both the mount options and the sysfs
      file.
      
      Link: https://lkml.kernel.org/r/20230309230545.2930737-6-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Adam Manzanares <a.manzanares@samsung.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d0f5a854
    • shmem: skip page split if we're not reclaiming · 9a976f0c
      Luis Chamberlain authored
      
      
      In theory, when info->flags & VM_LOCKED we should not be getting
      shmem_writepage() called, so we should verify this with a
      WARN_ON_ONCE().  Since we should not be swapping, it is also best to ensure
      we don't do the folio split earlier.  So just move the check early
      to avoid folio splits in case it's a dubious call.
      
      We also have a similar early bail when !total_swap_pages so just move that
      earlier to avoid the possible folio split in the same situation.
      
      Link: https://lkml.kernel.org/r/20230309230545.2930737-5-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Tested-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Adam Manzanares <a.manzanares@samsung.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9a976f0c
    • shmem: move reclaim check early on writepages() · cf7992bf
      Luis Chamberlain authored
      
      
      i915_gem requires huge folios to be split when swapping.  However, we only
      check later that writepages() is used for swap purposes.  Avoid the splits
      if we're not being called for reclaim, even if they should in theory not
      happen.
      
      This makes the conditions easier to follow in shmem_writepage().
      
      Link: https://lkml.kernel.org/r/20230309230545.2930737-4-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Tested-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Adam Manzanares <a.manzanares@samsung.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cf7992bf
    • shmem: set shmem_writepage() variables early · 8ccee8c1
      Luis Chamberlain authored
      
      
      shmem_writepage() sets up variables typically used *after* a possible huge
      page split.  However even if that does happen the address space mapping
      should not change, and the inode does not change either.  So it should be
      safe to set that from the very beginning.
      
      This commit makes no functional changes.
      
      Link: https://lkml.kernel.org/r/20230309230545.2930737-3-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Tested-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Adam Manzanares <a.manzanares@samsung.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ccee8c1
    • shmem: remove check for folio lock on writepage() · 1f514bee
      Luis Chamberlain authored
      
      
      Patch series "tmpfs: add the option to disable swap", v2.
      
      I'm doing this work as part of future experimentation with tmpfs and the
      page cache, but given that a common complaint about tmpfs is the
      inability to work without the page cache, I figured this might be useful
      to others.  It turns out it is -- at least Christian Brauner indicates
      systemd uses ramfs for a few use-cases because they don't want to use swap
      and so having this option would let them move over to using tmpfs for
      those small use cases, see systemd-creds(1).
      
      To see if you hit swap:
      
      mkswap /dev/nvme2n1
      swapon /dev/nvme2n1
      free -h
      
      With swap - what we see today
      =============================
      mount -t tmpfs            -o size=5G           tmpfs /data-tmpfs/
      dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
      free -h
                     total        used        free      shared  buff/cache   available
      Mem:           3.7Gi       2.6Gi       1.2Gi       2.2Gi       2.2Gi       1.2Gi
      Swap:           99Gi       2.8Gi        97Gi
      
      
      Without swap
      =============
      
      free -h
                     total        used        free      shared  buff/cache   available
      Mem:           3.7Gi       387Mi       3.4Gi       2.1Mi        57Mi       3.3Gi
      Swap:           99Gi          0B        99Gi
      mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
      dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
      free -h
                     total        used        free      shared  buff/cache   available
      Mem:           3.7Gi       2.6Gi       1.2Gi       2.3Gi       2.3Gi       1.1Gi
      Swap:           99Gi        21Mi        99Gi
      
      The mix and match remount testing
      =================================
      
      # Cannot disable swap after it was first enabled:
      mount -t tmpfs            -o size=5G           tmpfs /data-tmpfs/
      mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
      mount: /data-tmpfs: mount point not mounted or bad option.
             dmesg(1) may have more information after failed mount system call.
      dmesg -c
      tmpfs: Cannot disable swap on remount
      
      # Remount with the same noswap option is OK:
      mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
      mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
      dmesg -c
      
      # Trying to enable swap with a remount after it first disabled:
      mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
      mount -t tmpfs -o remount -o size=5G           tmpfs /data-tmpfs/
      mount: /data-tmpfs: mount point not mounted or bad option.
             dmesg(1) may have more information after failed mount system call.
      dmesg -c
      tmpfs: Cannot enable swap on remount if it was disabled on first mount
      
      
      This patch (of 6):
      
      Matthew notes we should not need to check the folio lock on the
      writepage() callback so remove it.  This sanity check has been lingering
      since linux-history days.  We remove this as we tidy up the writepage()
      callback to make things a bit clearer.
      
      Link: https://lkml.kernel.org/r/20230309230545.2930737-1-mcgrof@kernel.org
      Link: https://lkml.kernel.org/r/20230309230545.2930737-2-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Tested-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Adam Manzanares <a.manzanares@samsung.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1f514bee
    • mm/gup.c: fix typo in comments · 5da1a868
      Jingyu Wang authored
      
      
      Link: https://lkml.kernel.org/r/20230309104813.170309-1-jingyuwang_vip@163.com
      Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5da1a868
    • maple_tree: export symbol mas_preallocate() · 5c63a7c3
      Danilo Krummrich authored
      
      
      Fix missing EXPORT_SYMBOL_GPL() statement for mas_preallocate().
      
      It isn't actually used by anything yet, but mas_preallocate() is part of
      the maple tree's 'Advanced API'.  All other functions of this API are
      exported already.
      
      Link: https://lkml.kernel.org/r/20230302011035.4928-1-dakr@redhat.com
      Signed-off-by: Danilo Krummrich <dakr@redhat.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5c63a7c3
    • mm,jfs: move write_one_page/folio_write_one to jfs · 452a8f40
      Christoph Hellwig authored
      
      
      The last remaining user of folio_write_one through the write_one_page
      wrapper is jfs, so move the functionality there and hard code the call to
      metapage_writepage.
      
      Note that the use of the pagecache by the JFS 'metapage' buffer cache is a
      bit odd, and we could probably do without VM-level dirty tracking at all,
      but that's a change for another time.
      
      Link: https://lkml.kernel.org/r/20230307143125.27778-4-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Gang He <ghe@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      452a8f40
    • ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page · a0d50b11
      Christoph Hellwig authored
      
      
      Use filemap_write_and_wait_range to write back the range of the dirty page
      instead of write_one_page in preparation of removing write_one_page and
      eventually ->writepage.
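
      A hedged sketch of the replacement call; the offsets and variable names
      are illustrative:

      ```
      /* write back only the range covered by the dirty page instead of
       * calling write_one_page() on it */
      loff_t pos = (loff_t)page->index << PAGE_SHIFT;
      int ret;

      ret = filemap_write_and_wait_range(page->mapping, pos,
                                         pos + PAGE_SIZE - 1);
      ```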
      
      Link: https://lkml.kernel.org/r/20230307143125.27778-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Gang He <ghe@suse.com>
      Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a0d50b11
    • ufs: don't flush page immediately for DIRSYNC directories · 8b8d9a2d
      Christoph Hellwig authored
      
      
      Patch series "remove most callers of write_one_page", v4.
      
      This series removes most users of the write_one_page API.  These helpers
      internally call ->writepage which we are gradually removing from the
      kernel.
      
      
      This patch (of 3):
      
      We do not need to writeout modified directory blocks immediately when
      modifying them while the page is locked.  It is enough to do the flush
      somewhat later which has the added benefit that inode times can be flushed
      as well.  It also allows us to stop depending on write_one_page()
      function.
      
      Ported from an ext2 patch by Jan Kara.
      
      Link: https://lkml.kernel.org/r/20230307143125.27778-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20230307143125.27778-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8b8d9a2d
    • kmsan: add test_stackdepot_roundtrip · 6204c9ab
      Alexander Potapenko authored
      
      
      Ensure that KMSAN does not report false positives in instrumented callers
      of stack_depot_save(), stack_depot_print(), and stack_depot_fetch().
      
      Link: https://lkml.kernel.org/r/20230306111322.205724-2-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6204c9ab
    • lib/stackdepot: kmsan: mark API outputs as initialized · 8e00b2df
      Alexander Potapenko authored
      
      
      KMSAN does not instrument stackdepot and may treat memory allocated by it
      as uninitialized.  This is not a problem for KMSAN itself, because its
      functions calling stackdepot API are also not instrumented.  But other
      kernel features (e.g.  netdev tracker) may access stack depot from
      instrumented code, which will lead to false positives, unless we
      explicitly mark stackdepot outputs as initialized.
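
      A hedged sketch of the idea using the kernel's kmsan-checks API; the
      call site (an existing depot_stack_handle_t named "handle") is
      illustrative:

      ```
      unsigned long *entries;
      unsigned int nr_entries = stack_depot_fetch(handle, &entries);

      /* tell KMSAN the fetched trace is initialized data, so instrumented
       * callers (e.g. the netdev tracker) don't see false positives */
      kmsan_unpoison_memory(entries, nr_entries * sizeof(*entries));
      ```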
      
      Link: https://lkml.kernel.org/r/20230306111322.205724-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Suggested-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8e00b2df
    • mm, memcg: Prevent memory.soft_limit_in_bytes load/store tearing · 2178e20c
      Yue Zhao authored
      
      
      The knob for cgroup v1 memory controller: memory.soft_limit_in_bytes is
      not protected by any locking so it can be modified while it is used.  This
      is not an actual problem because races are unlikely.  But it is better to
      use [READ|WRITE]_ONCE to prevent the compiler from doing anything funky.
      
      The access of memcg->soft_limit is lockless, so it can be concurrently set
      at the same time as we are trying to read it.  All occurrences of
      memcg->soft_limit are updated with [READ|WRITE]_ONCE.
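
      A hedged illustration of the pattern applied to memcg->soft_limit; the
      call sites are simplified:

      ```
      /* writer: the memory.soft_limit_in_bytes knob */
      WRITE_ONCE(memcg->soft_limit, nr_pages);

      /* reader: soft limit reclaim computing how far a memcg is over its limit */
      unsigned long usage = page_counter_read(&memcg->memory);
      unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
      unsigned long excess = usage > soft_limit ? usage - soft_limit : 0;
      ```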
      
      [findns94@gmail.com: v3]
        Link: https://lkml.kernel.org/r/20230308162555.14195-5-findns94@gmail.com
      Link: https://lkml.kernel.org/r/20230306154138.3775-5-findns94@gmail.com
      Signed-off-by: Yue Zhao <findns94@gmail.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Tang Yizhou <tangyeechou@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2178e20c
    • mm, memcg: Prevent memory.oom_control load/store tearing · 17c56de6
      Yue Zhao authored
      
      
      The knob for cgroup v1 memory controller: memory.oom_control is not
      protected by any locking so it can be modified while it is used.  This is
      not an actual problem because races are unlikely.  But it is better to use
      [READ|WRITE]_ONCE to prevent the compiler from doing anything funky.
      
      The access of memcg->oom_kill_disable is lockless, so it can be
      concurrently set at the same time as we are trying to read it.  All
      occurrences of memcg->oom_kill_disable are updated with [READ|WRITE]_ONCE.
      
      [findns94@gmail.com: v3]
        Link: https://lkml.kernel.org/r/20230308162555.14195-4-findns94@gmail.com
      Link: https://lkml.kernel.org/r/20230306154138.377-4-findns94@gmail.com
      Signed-off-by: Yue Zhao <findns94@gmail.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Tang Yizhou <tangyeechou@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      17c56de6
    • mm, memcg: Prevent memory.swappiness load/store tearing · 82b3aa26
      Yue Zhao authored
      
      
      The knob for cgroup v1 memory controller: memory.swappiness is not
      protected by any locking so it can be modified while it is used.  This is
      not an actual problem because races are unlikely.  But it is better to use
      [READ|WRITE]_ONCE to prevent the compiler from doing anything funky.
      
      The access of memcg->swappiness and vm_swappiness is lockless, so both of
      them can be concurrently set at the same time as we are trying to read
      them.  All occurrences of memcg->swappiness and vm_swappiness are updated
      with [READ|WRITE]_ONCE.
      
      [findns94@gmail.com: v3]
        Link: https://lkml.kernel.org/r/20230308162555.14195-3-findns94@gmail.com
      Link: https://lkml.kernel.org/r/20230306154138.3775-3-findns94@gmail.com
      Signed-off-by: Yue Zhao <findns94@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Tang Yizhou <tangyeechou@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      82b3aa26
    • mm, memcg: Prevent memory.oom.group load/store tearing · eaf7b66b
      Yue Zhao authored
      
      
      Patch series "mm, memcg: cgroup v1 and v2 tunable load/store tearing
      fixes", v2.
      
      This patch series helps to prevent load/store tearing in
      several cgroup knobs.
      
      As kindly pointed out by Michal Hocko and Roman Gushchin, the changelog
      has been rephrased.
      
      Besides, more knobs were checked, according to kind suggestions
      from Shakeel Butt and Muchun Song.
      
      
      This patch (of 4):
      
      The knob for cgroup v2 memory controller: memory.oom.group
      is not protected by any locking so it can be modified while it is used.
      This is not an actual problem because races are unlikely (the knob is
      usually configured long before any workload hits actual memcg oom),
      but it is better to use READ_ONCE/WRITE_ONCE to prevent the compiler from
      doing anything funky.
      
      The access of memcg->oom_group is lockless, so it can be
      concurrently set at the same time as we are trying to read it.
      
      Link: https://lkml.kernel.org/r/20230306154138.3775-1-findns94@gmail.com
      Link: https://lkml.kernel.org/r/20230306154138.3775-2-findns94@gmail.com
      Signed-off-by: Yue Zhao <findns94@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Tang Yizhou <tangyeechou@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eaf7b66b
    • selftests/mm: fix split huge page tests · dd63bd7d
      Zi Yan authored
      Fix two inputs to check_anon_huge() and one if condition, so the tests
      work as expected.
      
      Link: https://lkml.kernel.org/r/20230306160907.16804-1-zi.yan@sent.com
      Fixes: c07c343c ("selftests/vm: dedup THP helpers")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Tested-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dd63bd7d
    • mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault() · 99c29133
      Gerald Schaefer authored
      
      
      s390 can do more fine-grained handling of spurious TLB protection faults
      when the PTE pointer is also available.
      
      Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an
      additional parameter.
      
      This will add no functional change to other architectures, but those with
      private flush_tlb_fix_spurious_fault() implementations need to be made
      aware of the new parameter.
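
      A hedged sketch of what the generic fallback looks like with the extra
      parameter; architectures without a private implementation simply ignore
      the PTE pointer:

      ```
      #ifndef flush_tlb_fix_spurious_fault
      #define flush_tlb_fix_spurious_fault(vma, address, ptep) \
              flush_tlb_page(vma, address)
      #endif
      ```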
      
      Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      99c29133
    • zsmalloc: show per fullness group class stats · e1807d5d
      Sergey Senozhatsky authored
      
      
      We keep the old fullness (3/4 threshold) reporting in
      zs_stats_size_show().  Switch from almost full/empty stats to fine-grained
      per-inuse-ratio (fullness group) reporting, which gives significantly more
      data on class fragmentation.
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-5-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e1807d5d
    • zsmalloc: rework compaction algorithm · 5a845e9f
      Sergey Senozhatsky authored
      
      
      The zsmalloc compaction algorithm has the potential to waste some CPU
      cycles, particularly when compacting pages within the same fullness group.
      This is due to the way it selects the head page of the fullness list for
      source and destination pages, and how it reinserts those pages during each
      iteration.  The algorithm may first use a page as a migration destination
      and then as a migration source, leading to an unnecessary back-and-forth
      movement of objects.
      
      Consider the following fullness list:
      
      PageA PageB PageC PageD PageE
      
      During the first iteration, the compaction algorithm will select PageA as
      the source and PageB as the destination.  All of PageA's objects will be
      moved to PageB, and then PageA will be released while PageB is reinserted
      into the fullness list.
      
      PageB PageC PageD PageE
      
      During the next iteration, the compaction algorithm will again select the
      head of the list as the source and destination, meaning that PageB will
      now serve as the source and PageC as the destination.  This will result in
      the objects being moved away from PageB, the same objects that were just
      moved to PageB in the previous iteration.
      
      To prevent this avalanche effect, the compaction algorithm should not
      reinsert the destination page between iterations.  By doing so, the most
      optimal page will continue to be used and its usage ratio will increase,
      reducing internal fragmentation.  The destination page should only be
      reinserted into the fullness list if:
      - It becomes full
      - No source page is available.
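
      A hedged pseudocode sketch of the reworked loop described above; the
      helper names are illustrative, not the exact mm/zsmalloc.c functions:

      ```
      dst = NULL;
      while ((src = isolate_src_zspage(class)) != NULL) {
              if (!dst)
                      dst = isolate_dst_zspage(class);
              if (!dst)
                      break;                  /* nothing left to migrate into */

              migrate_objects(src, dst);      /* move objects src -> dst */
              free_or_putback(class, src);

              if (zspage_full(dst)) {
                      putback_zspage(class, dst);     /* reinsert only when full */
                      dst = NULL;
              }
      }
      if (dst)
              putback_zspage(class, dst);             /* ...or when no source is left */
      ```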
      
      TEST
      ====
      
      It's very challenging to reliably test this series.  I ended up developing
      my own synthetic test that has 100% reproducibility.  The test generates
      significant fragmentation (for each size class) and then performs
      compaction for each class individually and tracks the number of memcpy()
      calls in zs_object_copy(), so that we can compare the amount of work
      compaction does on a per-class basis.
      
      Total amount of work (zram mm_stat objs_moved)
      ----------------------------------------------
      
      Old fullness grouping, old compaction algorithm:
      323977 memcpy() in zs_object_copy().
      
      Old fullness grouping, new compaction algorithm:
      262944 memcpy() in zs_object_copy().
      
      New fullness grouping, new compaction algorithm:
      213978 memcpy() in zs_object_copy().
      
      Per-class compaction memcpy() comparison (T-test)
      -------------------------------------------------
      
      x Old fullness grouping, old compaction algorithm
      + Old fullness grouping, new compaction algorithm
      
          N           Min           Max        Median           Avg        Stddev
      x 140           349          3513          2461     2314.1214     806.03271
      + 140           289          2778          2006     1878.1714     641.02073
      Difference at 95.0% confidence
              -435.95 +/- 170.595
              -18.8387% +/- 7.37193%
              (Student's t, pooled s = 728.216)
      
      x Old fullness grouping, old compaction algorithm
      + New fullness grouping, new compaction algorithm
      
          N           Min           Max        Median           Avg        Stddev
      x 140           349          3513          2461     2314.1214     806.03271
      + 140           226          2279          1644     1528.4143     524.85268
      Difference at 95.0% confidence
              -785.707 +/- 159.331
              -33.9527% +/- 6.88516%
              (Student's t, pooled s = 680.132)
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-4-senozhatsky@chromium.org
      Signed-off-by: default avatarSergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5a845e9f
    • Sergey Senozhatsky's avatar
      zsmalloc: fine-grained inuse ratio based fullness grouping · 4c7ac972
      Sergey Senozhatsky authored
      
      
      Each zspage maintains ->inuse counter which keeps track of the number of
      objects stored in the zspage.  The ->inuse counter also determines the
      zspage's "fullness group" which is calculated as the ratio of the "inuse"
      objects to the total number of objects the zspage can hold
      (objs_per_zspage).  The closer the ->inuse counter is to objs_per_zspage,
      the better.
      
      Each size class maintains several fullness lists that keep track of
      zspages of a particular "fullness".  Pages within each fullness list are
      stored in random order with regard to the ->inuse counter.  This is
      because sorting the zspages by ->inuse counter each time obj_malloc() or
      obj_free() is called would be too expensive.  However, the ->inuse counter
      is still a crucial factor in many situations.
      
      For the two major zsmalloc operations, zs_malloc() and zs_compact(), we
      typically select the head zspage from the corresponding fullness list as
      the best candidate zspage.  However, this assumption is not always
      accurate.
      
      For the zs_malloc() operation, the optimal candidate zspage should have
      the highest ->inuse counter.  This is because the goal is to maximize the
      number of ZS_FULL zspages and make full use of all allocated memory.
      
      For the zs_compact() operation, the optimal source zspage should have the
      lowest ->inuse counter.  This is because compaction needs to move objects
      in use to another page before it can release the zspage and return its
      physical pages to the buddy allocator.  The fewer objects in use, the
      quicker compaction can release the zspage.  Additionally, compaction is
      measured by the number of pages it releases.
      
      This patch reworks the fullness grouping mechanism.  Instead of having two
      groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and ZS_ALMOST_FULL (usage
      ratio above 3/4) - that result in too many zspages being included in the
      ALMOST_EMPTY group for specific classes, size classes maintain a larger
      number of fullness lists that give strict guarantees on the minimum and
      maximum ->inuse values within each group.  Each group represents a 10%
      change in the ->inuse ratio compared to neighboring groups.  In essence,
      there are groups for zspages with 0%, 10%, 20% usage ratios, and so on, up
      to 100%.
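
      A minimal sketch of how such an inuse-ratio bucket can be computed
      (the names and exact bucket layout here are invented for illustration
      and may differ from the in-tree helpers; the mapping is consistent
      with the class-768 printout shown further down, where each group label
      is roughly the upper bound of its 10% range and the top partial group
      is shown as "99%"):

      ```
      /*
       * Illustrative sketch only.  Buckets: 0 = empty, 1..10 = partially
       * used in roughly 10% steps, 11 = completely full.
       */
      #define ZS_EMPTY  0
      #define ZS_FULL  11

      static inline int fullness_group(int inuse, int objs_per_zspage)
      {
              int ratio;

              if (inuse == 0)
                      return ZS_EMPTY;
              if (inuse == objs_per_zspage)
                      return ZS_FULL;

              ratio = 100 * inuse / objs_per_zspage;  /* 0..99 for partial */
              return ratio / 10 + 1;                  /* buckets 1..10     */
      }
      ```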
      
      This enhances the selection of candidate zspages for both zs_malloc() and
      zs_compact().  A printout of the ->inuse counters of the first 7 zspages
      per (random) class fullness group:
      
       class-768 objs_per_zspage 16:
         fullness 100%:  empty
         fullness  99%:  empty
         fullness  90%:  empty
         fullness  80%:  empty
         fullness  70%:  empty
         fullness  60%:  8  8  9  9  8  8  8
         fullness  50%:  empty
         fullness  40%:  5  5  6  5  5  5  5
         fullness  30%:  4  4  4  4  4  4  4
         fullness  20%:  2  3  2  3  3  2  2
         fullness  10%:  1  1  1  1  1  1  1
         fullness   0%:  empty
      
      The zs_malloc() function searches through the groups of pages starting
      with the one having the highest usage ratio.  This means that it always
      selects a zspage from the group with the least internal fragmentation
      (highest usage ratio) and makes it even less fragmented by increasing its
      usage ratio.
      
      The zs_compact() function, on the other hand, begins by scanning the group
      with the highest fragmentation (lowest usage ratio) to locate the source
      page.  The first available zspage is selected, and then the function moves
      downward to find a destination zspage in the group with the lowest
      internal fragmentation (highest usage ratio).
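
      The two scan directions can be sketched as follows (again purely
      illustrative and self-contained, reusing the invented bucket indices
      from the sketch above; first_in[] stands in for the per-class fullness
      lists and is not a real zsmalloc structure):

      ```
      #include <stdio.h>

      #define ZS_EMPTY  0
      #define ZS_FULL  11

      /* first_in[fg]: "inuse" count of the head zspage in bucket fg,
       * or -1 if the bucket is empty. */
      static int pick_alloc_bucket(const int first_in[])
      {
              /* zs_malloc(): highest usage ratio first, skipping full pages */
              for (int fg = ZS_FULL - 1; fg > ZS_EMPTY; fg--)
                      if (first_in[fg] != -1)
                              return fg;
              return ZS_EMPTY;        /* nothing partial: take a fresh zspage */
      }

      static int pick_compact_source(const int first_in[])
      {
              /* zs_compact(): lowest usage ratio first */
              for (int fg = ZS_EMPTY + 1; fg < ZS_FULL; fg++)
                      if (first_in[fg] != -1)
                              return fg;
              return -1;              /* nothing worth compacting */
      }

      int main(void)
      {
              /* Populated buckets mirror the class-768 printout above. */
              int first_in[ZS_FULL + 1] = {
                      -1, 1, 2, 4, 5, -1, 8, -1, -1, -1, -1, -1,
              };

              printf("allocate from bucket %d, compact source from bucket %d\n",
                     pick_alloc_bucket(first_in), pick_compact_source(first_in));
              return 0;
      }
      ```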
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-3-senozhatsky@chromium.org
      Signed-off-by: default avatarSergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4c7ac972
    • Sergey Senozhatsky's avatar
      zsmalloc: remove insert_zspage() ->inuse optimization · a40a71e8
      Sergey Senozhatsky authored
      
      
      Patch series "zsmalloc: fine-grained fullness and new compaction
      algorithm", v4.
      
      Existing zsmalloc page fullness grouping leads to suboptimal page
      selection for both zs_malloc() and zs_compact().  This patchset reworks
      zsmalloc fullness grouping/classification.
      
      Additionally, it implements a new compaction algorithm that is expected
      to use fewer CPU cycles (as it potentially does fewer memcpy() calls in
      zs_object_copy()).
      
      Test (synthetic) results can be seen in patch 0003.
      
      
      This patch (of 4):
      
      This optimization has no effect.  It only ensured that, when a zspage
      was added to its corresponding fullness list, it was placed before or
      after the current head depending on whether its "inuse" counter was
      higher or lower than the head's.  The intention was to keep busy zspages
      at the head, so they could be filled up and moved to the ZS_FULL fullness
      group more quickly.  However, this doesn't work, as the "inuse" counter
      of a zspage can be modified by obj_free() while the zspage still belongs
      to the same fullness list, in which case fix_fullness_group() won't
      change the zspage's position relative to the head's "inuse" counter.
      This leads to a largely random order of zspages within the fullness list.
      
      For instance, consider a printout of the "inuse" counters of the first
      few zspages in a class that holds 93 objects per zspage:
      
       ZS_ALMOST_EMPTY:  36  67  68  64  35  54  63  52
      
      As we can see, the zspage with the lowest "inuse" counter is actually at
      the head of the fullness list.
      
      Remove this pointless "optimisation".
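
      For reference, the head-comparison being dropped looked roughly like
      this (a paraphrased sketch of the removed logic, not the exact hunk):

      ```
      /* Paraphrased sketch of the removed insertion logic. */
      head = list_first_entry_or_null(&class->fullness_list[fullness],
                                      struct zspage, list);
      if (head && get_zspage_inuse(zspage) < get_zspage_inuse(head))
              list_add(&zspage->list, &head->list);   /* behind the head */
      else
              list_add(&zspage->list, &class->fullness_list[fullness]);
      /*
       * Once ->inuse later changes via obj_free() without the zspage
       * leaving this fullness group, nothing re-checks this ordering,
       * so the list ends up effectively unsorted anyway.
       */
      ```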
      
      Link: https://lkml.kernel.org/r/20230304034835.2082479-1-senozhatsky@chromium.org
      Link: https://lkml.kernel.org/r/20230304034835.2082479-2-senozhatsky@chromium.org
      Signed-off-by: default avatarSergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a40a71e8
    • Jaewon Kim's avatar
      dma-buf: system_heap: avoid reclaim for order 4 · 3ccefdea
      Jaewon Kim authored
      
      
      Using order 4 pages would be helpful for IOMMU mapping, but trying to
      get order 4 pages can spend quite a lot of time in the page allocator.
      From the perspective of responsiveness, deterministic memory allocation
      speed is, I think, quite important.
      
      An order 4 allocation with __GFP_RECLAIM may spend much time in reclaim
      and compaction logic, and __GFP_NORETRY may also have an effect.  These
      cause unpredictable delays.
      
      To get reasonable allocation speed from the dma-buf system heap, use
      HIGH_ORDER_GFP for order 4 to avoid reclaim, and remove the meaningless
      __GFP_COMP for order 0.
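
      A sketch of what the heap's per-order GFP table looks like with this
      change applied (approximate and from memory; see
      drivers/dma-buf/heaps/system_heap.c for the authoritative version):

      ```
      /*
       * Order 8 and order 4 both avoid direct reclaim now; order 0 keeps
       * the normal reclaiming flags but no longer asks for __GFP_COMP.
       */
      #define LOW_ORDER_GFP   (GFP_HIGHUSER | __GFP_ZERO)
      #define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
                                 | __GFP_NORETRY) & ~__GFP_RECLAIM) | __GFP_COMP)

      static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
      static const unsigned int orders[] = {8, 4, 0};
      ```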
      
      According to my tests, order 4 with MID_ORDER_GFP could get a larger
      number of order 4 pages, but the elapsed times could be very slow.
      
               time	order 8	order 4	order 0
           584 usec	0	160	0
        28,428 usec	0	160	0
       100,701 usec	0	160	0
        76,645 usec	0	160	0
        25,522 usec	0	160	0
        38,798 usec	0	160	0
        89,012 usec	0	160	0
        23,015 usec	0	160	0
        73,360 usec	0	160	0
        76,953 usec	0	160	0
        31,492 usec	0	160	0
        75,889 usec	0	160	0
        84,551 usec	0	160	0
        84,352 usec	0	160	0
        57,103 usec	0	160	0
        93,452 usec	0	160	0
      
      If HIGH_ORDER_GFP is used for order 4, the number of order 4 could be
      decreased but the elapsed time results were quite stable and fast enough.
      
               time	order 8	order 4	order 0
         1,356 usec	0	155	80
         1,901 usec	0	11	2384
         1,912 usec	0	0	2560
         1,911 usec	0	0	2560
         1,884 usec	0	0	2560
         1,577 usec	0	0	2560
         1,366 usec	0	0	2560
         1,711 usec	0	0	2560
         1,635 usec	0	28	2112
           544 usec	10	0	0
           633 usec	2	128	0
           848 usec	0	160	0
           729 usec	0	160	0
         1,000 usec	0	160	0
         1,358 usec	0	160	0
         2,638 usec	0	31	2064
      
      Link: https://lkml.kernel.org/r/20230303050332.10138-1-jaewon31.kim@samsung.com
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Reviewed-by: default avatarJohn Stultz <jstultz@google.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: T.J. Mercier <tjmercier@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3ccefdea
    • Alexander Potapenko's avatar
      kmsan: add memsetXX tests · 78c74aee
      Alexander Potapenko authored
      
      
      Add tests ensuring that memset16()/memset32()/memset64() are instrumented
      by KMSAN and correctly initialize the memory.
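
      Roughly, such a test boils down to the following shape (a hedged
      sketch in the style of mm/kmsan/kmsan_test.c; EXPECTATION_NO_REPORT()
      and report_matches() are that file's local helpers, and the macros
      used by the real tests may differ):

      ```
      /* Sketch of one such test; memset32()/memset64() look the same. */
      static void test_memset16(struct kunit *test)
      {
              EXPECTATION_NO_REPORT(expect);
              volatile u16 uninit[4];

              kunit_info(test, "memset16() initializes memory (no report)\n");
              memset16((u16 *)uninit, 0xabcd, ARRAY_SIZE(uninit));
              /* Reading the buffer must not trigger a KMSAN report now. */
              kmsan_check_memory((void *)uninit, sizeof(uninit));
              KUNIT_EXPECT_TRUE(test, report_matches(&expect));
      }
      ```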
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-4-glider@google.com
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      78c74aee
    • Alexander Potapenko's avatar
      x86: kmsan: use C versions of memset16/memset32/memset64 · 27f644dc
      Alexander Potapenko authored
      
      
      KMSAN must see as many memory accesses as possible to prevent false
      positive reports.  Fall back to versions of
      memset16()/memset32()/memset64() implemented in lib/string.c instead of
      those written in assembly.
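
      Mechanically this amounts to not advertising the arch-specific helpers
      when KMSAN instruments the file, so the generic C fallbacks in
      lib/string.c get picked up instead; a sketch of the shape of the change
      in arch/x86/include/asm/string_64.h (the exact preprocessor guard in
      the tree may differ):

      ```
      /* When KMSAN instruments a file, do not advertise the x86 assembly
       * helpers, so the generic C versions in lib/string.c (which KMSAN
       * can see) are used instead. */
      #if !defined(__SANITIZE_MEMORY__)
      #define __HAVE_ARCH_MEMSET16
      #define __HAVE_ARCH_MEMSET32
      #define __HAVE_ARCH_MEMSET64
      /* ... the inline rep-stos implementations stay under this guard ... */
      #endif
      ```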
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-3-glider@google.com
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Suggested-by: default avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      27f644dc
    • Alexander Potapenko's avatar
      kmsan: another take at fixing memcpy tests · d3402925
      Alexander Potapenko authored
      Commit 5478afc5 ("kmsan: fix memcpy tests") uses OPTIMIZER_HIDE_VAR()
      to hide the uninitialized var from the compiler optimizations.
      
      However OPTIMIZER_HIDE_VAR(uninit) enforces an immediate check of @uninit,
      so memcpy tests did not actually check the behavior of memcpy(), because
      they always contained a KMSAN report.
      
      Replace OPTIMIZER_HIDE_VAR() with a file-local macro that just clobbers
      the memory with a barrier(), and add a test case for memcpy() that does
      not expect an error report.
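
      One way such a file-local replacement can look (a sketch under the
      assumption that a compiler barrier taking the variable's address is
      sufficient; the macro name here is invented):

      ```
      /*
       * Sketch only: keep "var" alive across optimizations by publishing
       * its address to a memory-clobbering barrier, without forcing an
       * (uninitialized) read the way OPTIMIZER_HIDE_VAR() does.
       */
      #define DO_NOT_OPTIMIZE(var)    barrier_data(&(var))
      ```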
      
      Also reflow kmsan_test.c with clang-format.
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-2-glider@google.com
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d3402925
    • Alexander Potapenko's avatar
      x86: kmsan: don't rename memintrinsics in uninstrumented files · 6dc4bd4e
      Alexander Potapenko authored
      
      
      clang -fsanitize=kernel-memory already replaces calls to
      memset/memcpy/memmove and their __builtin_ versions with
      __msan_memset/__msan_memcpy/__msan_memmove in instrumented files, so
      there is no need to override them.
      
      In non-instrumented files we are now required to leave memset() and
      friends intact, so we cannot replace them with __msan_XXX() functions.
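
      To illustrate the mechanism (conceptual only, not kernel source): in a
      translation unit built with -fsanitize=kernel-memory, the rewrite is
      done by the compiler itself, so no header-level renaming is needed:

      ```
      /* What the source says in an instrumented file: */
      memset(p, 0, n);
      /* What clang actually emits a call to: */
      __msan_memset(p, 0, n);
      /* Uninstrumented files keep calling the real memset() and must not
       * be redirected to __msan_memset(). */
      ```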
      
      Link: https://lkml.kernel.org/r/20230303141433.3422671-1-glider@google.com
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Suggested-by: default avatarMarco Elver <elver@google.com>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6dc4bd4e
    • Peter Xu's avatar
      mm/khugepaged: cleanup memcg uncharge for failure path · 7cb1d7ef
      Peter Xu authored
      
      
      Explicit memcg uncharging is not needed when the memcg accounting has
      the same lifespan as the page/folio.  That becomes the case for
      khugepaged after Yang & Zach's recent rework, under which the hpage is
      allocated for each collapse rather than being cached.
      
      Clean up the explicit memcg uncharge in the khugepaged failure path and
      leave that to put_page().
      
      Link: https://lkml.kernel.org/r/20230303151218.311015-1-peterx@redhat.com
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Suggested-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: David Stevens <stevensd@chromium.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7cb1d7ef
    • Anshuman Khandual's avatar
      mm/debug_vm_pgtable: replace pte_mkhuge() with arch_make_huge_pte() · 9dabf6e1
      Anshuman Khandual authored
      Since commit 16785bd7 ("mm: merge pte_mkhuge() call into
      arch_make_huge_pte()"), arch_make_huge_pte() should be used directly in
      the generic memory subsystem as a platform-provided page table helper,
      instead of pte_mkhuge().  Change hugetlb_basic_tests() to call
      arch_make_huge_pte() directly, and update its relevant documentation
      entry as required.
      
      Link: https://lkml.kernel.org/r/20230302114845.421674-1-anshuman.khandual@arm.com
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: default avatarChristophe Leroy <christophe.leroy@csgroup.eu>
        Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9dabf6e1
    • Anshuman Khandual's avatar
      mm/migrate: drop pte_mkhuge() in remove_migration_pte() · 1da28f1b
      Anshuman Khandual authored
      Since commit 16785bd7 ("mm: merge pte_mkhuge() call into
      arch_make_huge_pte()"), arch_make_huge_pte() should be used directly in
      the generic memory subsystem as a platform-provided page table helper,
      instead of pte_mkhuge().  This just drops pte_mkhuge() from
      remove_migration_pte(), which has now become redundant.
      
      Link: https://lkml.kernel.org/r/20230302025349.358341-1-anshuman.khandual@arm.com
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: default avatarChristophe Leroy <christophe.leroy@csgroup.eu>
        Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1da28f1b
    • Kefeng Wang's avatar
      mm: swap: remove unneeded cgroup_throttle_swaprate() · 3e4fb13a
      Kefeng Wang authored
      
      
      All the callers of cgroup_throttle_swaprate() are converted to
      folio_throttle_swaprate(), so make __cgroup_throttle_swaprate() take a
      folio and rename it to __folio_throttle_swaprate(); also rename gfp_mask
      to gfp and drop the redundant extern keyword.  Finally, drop the unused
      cgroup_throttle_swaprate().
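
      In terms of the interface, the rename boils down to the following (a
      sketch of the before/after prototypes derived from the description
      above; the surrounding config guards are unchanged):

      ```
      /* Before: page-based interface, with a redundant extern. */
      extern void __cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask);

      /* After: folio-based, shorter parameter name, no extern keyword. */
      void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp);
      ```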
      
      Link: https://lkml.kernel.org/r/20230302115835.105364-8-wangkefeng.wang@huawei.com
      Signed-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3e4fb13a