  1. Dec 10, 2022
• mm/gup: fix gup_pud_range() for dax · fcd0ccd8
      John Starks authored
For a dax pud, pud_huge() returns true on x86, so the function works as
long as hugetlb is configured.  However, dax doesn't depend on hugetlb.
Commit 414fd080 ("mm/gup: fix gup_pmd_range() for dax") fixed
devmap-backed huge PMDs but missed devmap-backed huge PUDs.  Fix this as
well.
      
      This fixes the below kernel panic:
      
      general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP
      	< snip >
      Call Trace:
      <TASK>
      get_user_pages_fast+0x1f/0x40
      iov_iter_get_pages+0xc6/0x3b0
      ? mempool_alloc+0x5d/0x170
      bio_iov_iter_get_pages+0x82/0x4e0
      ? bvec_alloc+0x91/0xc0
      ? bio_alloc_bioset+0x19a/0x2a0
      blkdev_direct_IO+0x282/0x480
      ? __io_complete_rw_common+0xc0/0xc0
      ? filemap_range_has_page+0x82/0xc0
      generic_file_direct_write+0x9d/0x1a0
      ? inode_update_time+0x24/0x30
      __generic_file_write_iter+0xbd/0x1e0
      blkdev_write_iter+0xb4/0x150
      ? io_import_iovec+0x8d/0x340
      io_write+0xf9/0x300
      io_issue_sqe+0x3c3/0x1d30
      ? sysvec_reschedule_ipi+0x6c/0x80
      __io_queue_sqe+0x33/0x240
      ? fget+0x76/0xa0
      io_submit_sqes+0xe6a/0x18d0
      ? __fget_light+0xd1/0x100
      __x64_sys_io_uring_enter+0x199/0x880
      ? __context_tracking_enter+0x1f/0x70
      ? irqentry_exit_to_user_mode+0x24/0x30
      ? irqentry_exit+0x1d/0x30
      ? __context_tracking_exit+0xe/0x70
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x61/0xcb
      RIP: 0033:0x7fc97c11a7be
      	< snip >
      </TASK>
      ---[ end trace 48b2e0e67debcaeb ]---
      RIP: 0010:internal_get_user_pages_fast+0x340/0x990
      	< snip >
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      
      Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com
Fixes: 414fd080 ("mm/gup: fix gup_pmd_range() for dax")
Signed-off-by: John Starks <jostarks@microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mmap: fix do_brk_flags() modifying obviously incorrect VMAs · 6c28ca64
      Liam Howlett authored
      Add more sanity checks to the VMA that do_brk_flags() will expand.  Ensure
      the VMA matches basic merge requirements within the function before
      calling can_vma_merge_after().
      
      Drop the duplicate checks from vm_brk_flags() since they will be enforced
      later.
      
      The old code would expand file VMAs on brk(), which is functionally
      wrong and also dangerous in terms of locking because the brk() path
      isn't designed for file VMAs and therefore doesn't lock the file
      mapping.  Checking can_vma_merge_after() ensures that new anonymous
      VMAs can't be merged into file VMAs.
      
      See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221205192304.1957418-1-Liam.Howlett@oracle.com
Fixes: 2e7ce7d3 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Jann Horn <jannh@google.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit · 630dc25e
      David Hildenbrand authored
      We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
      store a physical address.
      
      On a 64bit system, both are 64bit wide.  However, on a 32bit system, the
      latter might be 64bit wide.  This is, for example, the case on x86 with
      PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans
      32bit.
      
The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
that case, and assumes that the maximum PFN is limited by a 32bit
phys_addr_t.  This implies that SWP_PFN_BITS will currently only be able
to cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong.
      
      Let's rely on the number of bits in phys_addr_t instead, but make sure to
      not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in
      is_pfn_swap_entry() unhappy.  Note that swp_entry_t is effectively an
      unsigned long and the maximum swap offset shares that value with the swap
      type.
      
      For example, on an 8 GiB x86 PAE system with a kernel config based on
      Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently
      fail removing migration entries (remove_migration_ptes()), because
      mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as
      swp_offset_pfn() wrongly masks off PFN bits.  For example,
      split_huge_page_to_list()->...->remap_page() will leave migration entries
      in place and continue to unlock the page.
      
      Later, when we stumble over these migration entries (e.g., via
      /proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these
      migration entries shouldn't exist anymore and the page was unlocked.
      
      [   33.067591] kernel BUG at include/linux/swapops.h:497!
      [   33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      [   33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G            E      6.1.0-rc8+ #16
      [   33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
      [   33.067606] EIP: pagemap_pmd_range+0x644/0x650
      [   33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
      [   33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
      [   33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
      [   33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
      [   33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
      [   33.067625] Call Trace:
      [   33.067628]  ? madvise_free_pte_range+0x720/0x720
      [   33.067632]  ? smaps_pte_range+0x4b0/0x4b0
      [   33.067634]  walk_pgd_range+0x325/0x720
      [   33.067637]  ? mt_find+0x1d6/0x3a0
      [   33.067641]  ? mt_find+0x1d6/0x3a0
      [   33.067643]  __walk_page_range+0x164/0x170
      [   33.067646]  walk_page_range+0xf9/0x170
      [   33.067648]  ? __kmem_cache_alloc_node+0x2a8/0x340
      [   33.067653]  pagemap_read+0x124/0x280
      [   33.067658]  ? default_llseek+0x101/0x160
      [   33.067662]  ? smaps_account+0x1d0/0x1d0
      [   33.067664]  vfs_read+0x90/0x290
      [   33.067667]  ? do_madvise.part.0+0x24b/0x390
      [   33.067669]  ? debug_smp_processor_id+0x12/0x20
      [   33.067673]  ksys_pread64+0x58/0x90
      [   33.067675]  __ia32_sys_ia32_pread64+0x1b/0x20
      [   33.067680]  __do_fast_syscall_32+0x4c/0xc0
      [   33.067683]  do_fast_syscall_32+0x29/0x60
      [   33.067686]  do_SYSENTER_32+0x15/0x20
      [   33.067689]  entry_SYSENTER_32+0x98/0xf1
      
      Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it
      readable and consistent.
      
      [david@redhat.com: rely on sizeof(phys_addr_t) and min_t() instead]
        Link: https://lkml.kernel.org/r/20221206105737.69478-1-david@redhat.com
      [david@redhat.com: use "int" for comparison, as we're only comparing numbers < 64]
        Link: https://lkml.kernel.org/r/1f157500-2676-7cef-a84e-9224ed64e540@redhat.com
      Link: https://lkml.kernel.org/r/20221205150857.167583-1-david@redhat.com
Fixes: 0d206b5d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• tmpfs: fix data loss from failed fallocate · 44bcabd7
      Hugh Dickins authored
      Fix tmpfs data loss when the fallocate system call is interrupted by a
      signal, or fails for some other reason.  The partial folio handling in
      shmem_undo_range() forgot to consider this unfalloc case, and was liable
      to erase or truncate out data which had already been committed earlier.
      
It turns out that none of the partial folio handling there is appropriate
for the unfalloc case, which just wants to proceed to removal of whole
folios: find_get_entries() provides those, even when they are partially
covered.
      
      Original patch by Rui Wang.
      
      Link: https://lore.kernel.org/linux-mm/33b85d82.7764.1842e9ab207.Coremail.chenguoqic@163.com/
      Link: https://lkml.kernel.org/r/a5dac112-cf4b-7af-a33-f386e347fd38@google.com
Fixes: b9a8a419 ("truncate,shmem: Handle truncates that split large folios")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Guoqi Chen <chenguoqic@163.com>
        Link: https://lore.kernel.org/all/20221101032248.819360-1-kernel@hev.cc/
      Cc: Rui Wang <kernel@hev.cc>
      Cc: Huacai Chen <chenhuacai@loongson.cn>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• kselftests: cgroup: update kmem test precision tolerance · de16d6e4
      Michal Hocko authored
Commit 1813e51e ("memcg: increase MEMCG_CHARGE_BATCH to 64") changed the
batch size while this test case was left behind.  This has led to a test
failure reported by the kernel test robot:

not ok 2 selftests: cgroup: test_kmem # exit=1

Update the tolerance for the pcp charges to reflect the
MEMCG_CHARGE_BATCH change and fix this.
      
      [akpm@linux-foundation.org: update comments, per Roman]
      Link: https://lkml.kernel.org/r/Y4m8Unt6FhWKC6IH@dhcp22.suse.cz
Fixes: 1813e51e ("memcg: increase MEMCG_CHARGE_BATCH to 64")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: kernel test robot <yujie.liu@intel.com>
  Link: https://lore.kernel.org/oe-lkp/202212010958.c1053bd3-yujie.liu@intel.com
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Tested-by: Yujie Liu <yujie.liu@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm: do not BUG_ON missing brk mapping, because userspace can unmap it · f5ad5083
      Jason A. Donenfeld authored
      The following program will trigger the BUG_ON that this patch removes,
      because the user can munmap() mm->brk:
      
        #include <sys/syscall.h>
        #include <sys/mman.h>
        #include <assert.h>
        #include <unistd.h>
      
        static void *brk_now(void)
        {
          return (void *)syscall(SYS_brk, 0);
        }
      
        static void brk_set(void *b)
        {
          assert(syscall(SYS_brk, b) != -1);
        }
      
        int main(int argc, char *argv[])
        {
          void *b = brk_now();
          brk_set(b + 4096);
          assert(munmap(b - 4096, 4096 * 2) == 0);
          brk_set(b);
          return 0;
        }
      
      Compile that with musl, since glibc actually uses brk(), and then
      execute it, and it'll hit this splat:
      
        kernel BUG at mm/mmap.c:229!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 12 PID: 1379 Comm: a.out Tainted: G S   U             6.1.0-rc7+ #419
        RIP: 0010:__do_sys_brk+0x2fc/0x340
        Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20>
        RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000
        RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08
        RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000
        R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000
        R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000
        FS:  0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0
        PKRU: 55555554
        Call Trace:
         <TASK>
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        RIP: 0033:0x400678
        Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b>
        RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c
        RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678
        RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000
        RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000
        R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978
        R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000
         </TASK>
      
      Instead, just return the old brk value if the original mapping has been
      removed.
      
      [akpm@linux-foundation.org: fix changelog, per Liam]
      Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com
Fixes: 2e7ce7d3 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mailmap: update Matti Vaittinen's email address · 38f1d4ae
      Matti Vaittinen authored
The email backend used by ROHM keeps labeling patches as spam, which can
result in patches being missed.

Switch my mail address from a company mail to a personal one.
      
      Link: https://lkml.kernel.org/r/8f4498b66fedcbded37b3b87e0c516e659f8f583.1669912977.git.mazziesaccount@gmail.com
Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
Suggested-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Anup Patel <anup@brainfault.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Atish Patra <atishp@atishpatra.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Ben Widawsky <bwidawsk@kernel.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Ian King <colin.i.king@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Vasily Averin <vasily.averin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Dec 01, 2022
• revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible" · 1d351f18
      Andrew Morton authored
      It causes build failures with unusual CC/HOSTCC combinations.
      
      Quoting
      https://lkml.kernel.org/r/A222B1E6-69B8-4085-AD1B-27BDB72CA971@goldelico.com:
      
        HOSTCC  scripts/mod/modpost.o - due to target missing
      In file included from include/linux/string.h:5,
                       from scripts/mod/../../include/linux/license.h:5,
                       from scripts/mod/modpost.c:24:
      include/linux/compiler.h:246:10: fatal error: asm/rwonce.h: No such file or directory
        246 | #include <asm/rwonce.h>
            |          ^~~~~~~~~~~~~~
      compilation terminated.
      
      ...
      
      The problem is that HOSTCC is not necessarily the same compiler or even
      architecture as CC and pulling in <linux/compiler.h> or <asm/rwonce.h>
      files indirectly isn't a good idea then.
      
      My toolchain is providing HOSTCC = gcc (MacPorts) and CC = arm-linux-gnueabihf
      (built from gcc source) and all running on Darwin.
      
      If I change the include to <string.h> I can then "HOSTCC scripts/mod/modpost.c"
      but then it fails for "CC kernel/module/main.c" not finding <string.h>:
      
        CC      kernel/module/main.o - due to target missing
      In file included from kernel/module/main.c:43:0:
      ./include/linux/license.h:5:20: fatal error: string.h: No such file or directory
       #include <string.h>
                          ^
      compilation terminated.
      
Reported-by: "H. Nikolaus Schaller" <hns@goldelico.com>
      Cc: Sam James <sam@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled · 152fe65f
      Lee Jones authored
When enabled, KASAN enlarges functions' stack frames, pushing quite a few
over the current threshold.  This can mainly be seen on 32-bit
architectures, where the present limit (when !GCC) is a lowly 1024 bytes.
      
      Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org
Signed-off-by: Lee Jones <lee@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@gmail.com>
      Cc: Harry Wentland <harry.wentland@amd.com>
      Cc: Leo Li <sunpeng.li@amd.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
      Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame · 6f6cb171
      Lee Jones authored
      Patch series "Fix a bunch of allmodconfig errors", v2.
      
Since b339ec9c ("kbuild: Only default to -Werror if COMPILE_TEST"),
WERROR now defaults to COMPILE_TEST, meaning it's enabled for
      allmodconfig builds.  This leads to some interesting build failures when
      using Clang, each resolved in this set.
      
      With this set applied, I am able to obtain a successful allmodconfig Arm
      build.
      
      
      This patch (of 2):
      
      calculate_bandwidth() is presently broken on all !(X86_64 || SPARC64 ||
      ARM64) architectures built with Clang (all released versions), whereby the
      stack frame gets blown up to well over 5k.  This would cause an immediate
      kernel panic on most architectures.  We'll revert this when the following
      bug report has been resolved:
      https://github.com/llvm/llvm-project/issues/41896.
      
      Link: https://lkml.kernel.org/r/20221125120750.3537134-1-lee@kernel.org
      Link: https://lkml.kernel.org/r/20221125120750.3537134-2-lee@kernel.org
Signed-off-by: Lee Jones <lee@kernel.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@gmail.com>
      Cc: Harry Wentland <harry.wentland@amd.com>
      Cc: Lee Jones <lee@kernel.org>
      Cc: Leo Li <sunpeng.li@amd.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
      Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths · f268f6cf
      Jann Horn authored
      Any codepath that zaps page table entries must invoke MMU notifiers to
      ensure that secondary MMUs (like KVM) don't keep accessing pages which
      aren't mapped anymore.  Secondary MMUs don't hold their own references to
      pages that are mirrored over, so failing to notify them can lead to page
      use-after-free.
      
      I'm marking this as addressing an issue introduced in commit f3f0e1d2
      ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
      the security impact of this only came in commit 27e1f827 ("khugepaged:
      enable collapse pmd for pte-mapped THP"), which actually omitted flushes
      for the removal of present PTEs, not just for the removal of empty page
      tables.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm/khugepaged: fix GUP-fast interaction by sending IPI · 2ba99c5e
      Jann Horn authored
      Since commit 70cbc3cc ("mm: gup: fix the fast GUP race against THP
      collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to
      ensure that the page table was not removed by khugepaged in between.
      
      However, lockless_pages_from_mm() still requires that the page table is
      not concurrently freed.  Fix it by sending IPIs (if the architecture uses
      semi-RCU-style page table freeing) before freeing/reusing page tables.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com
Fixes: ba76149f ("thp: khugepaged")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm/khugepaged: take the right locks for page table retraction · 8d3c106e
      Jann Horn authored
Page table walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space.  Only one of these needs to be held, and it does not
need to be held in exclusive mode.
      
      Under those circumstances, the rules for concurrent access to page table
      entries are:
      
       - Terminal page table entries (entries that don't point to another page
         table) can be arbitrarily changed under the page table lock, with the
         exception that they always need to be consistent for
         hardware page table walks and lockless_pages_from_mm().
         This includes that they can be changed into non-terminal entries.
       - Non-terminal page table entries (which point to another page table)
         can not be modified; readers are allowed to READ_ONCE() an entry, verify
         that it is non-terminal, and then assume that its value will stay as-is.
      
      Retracting a page table involves modifying a non-terminal entry, so
      page-table-level locks are insufficient to protect against concurrent page
      table traversal; it requires taking all the higher-level locks under which
      it is possible to start a page walk in the relevant range in exclusive
      mode.
      
      The collapse_huge_page() path for anonymous THP already follows this rule,
      but the shmem/file THP path was getting it wrong, making it possible for
      concurrent rmap-based operations to cause corruption.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f827 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm: migrate: fix THP's mapcount on isolation · 829ae0f8
      Gavin Shan authored
The issue is reported when removing memory through a virtio_mem device.  A
transparent huge page which experienced a copy-on-write fault is wrongly
regarded as pinned, so it escapes isolation in
isolate_migratepages_block().  The transparent huge page can't be migrated
and the corresponding memory block can't be put into the offline state.

Fix it by replacing page_mapcount() with total_mapcount().  With this, the
transparent huge page can be isolated and migrated, and the memory block
can be put into the offline state.  Besides, the page's refcount is
increased a bit earlier to avoid the page being released while the check
is executed.
      
      Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com
Fixes: 1da2f328 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Tested-by: Zhenyu Zhang <zhenyzha@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>	[5.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm: introduce arch_has_hw_nonleaf_pmd_young() · 4aaf269c
      Juergen Gross authored
When running as a Xen PV guest, commit eed9a328 ("mm: x86: add
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in
pmdp_test_and_clear_young():
      
       BUG: unable to handle page fault for address: ffff8880083374d0
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0003) - permissions violation
       PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065
       Oops: 0003 [#1] PREEMPT SMP NOPTI
       CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1
       RIP: e030:pmdp_test_and_clear_young+0x25/0x40
      
      This happens because the Xen hypervisor can't emulate direct writes to
      page table entries other than PTEs.
      
      This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young()
      similar to arch_has_hw_pte_young() and test that instead of
      CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG.
      
      Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com
Fixes: eed9a328 ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Acked-by: Yu Zhao <yuzhao@google.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
      Acked-by: David Hildenbrand <david@redhat.com>	[core changes]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mm: add dummy pmd_young() for architectures not having it · 6617da8f
      Juergen Gross authored
In order to avoid #ifdeffery, add a dummy pmd_young() implementation as a
fallback.  This is required for the later patch "mm: introduce
arch_has_hw_nonleaf_pmd_young()".
      
      Link: https://lkml.kernel.org/r/fd3ac3cd-7349-6bbd-890a-71a9454ca0b3@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Yu Zhao <yuzhao@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes() · 95bc35f9
      SeongJae Park authored
      Commit da878780 ("mm/damon/sysfs: support online inputs update") made
      'damon_sysfs_set_schemes()' be called for a running DAMON context, which
      could already have schemes.  In that case, the DAMON sysfs interface is
      supposed to update, remove, or add schemes to reflect the sysfs files.
      However, the code assumes the DAMON context has no schemes at all, and
      therefore always creates and adds new schemes.  As a result, the code
      doesn't work as intended for online schemes tuning and can have a larger
      than expected memory footprint.  The schemes all remain in the DAMON
      context, though, so the memory is not leaked.
      
      Remove the wrong assumption (that the DAMON context has no schemes) in
      'damon_sysfs_set_schemes()' to fix the bug.
      
      Link: https://lkml.kernel.org/r/20221122194831.3472-1-sj@kernel.org
      Fixes: da878780 ("mm/damon/sysfs: support online inputs update")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep" · a435874b
      Tiezhu Yang authored
      
      
      The latest version of grep claims egrep is now obsolescent, so the build
      now emits warnings that look like:
      
      	egrep: warning: egrep is obsolescent; using grep -E
      
      Fix this up by converting the affected file to use "grep -E" instead:
      
        sed -i "s/egrep/grep -E/g" `grep egrep -rwl tools/vm`
      
      Here are the steps to install the latest grep:
      
        wget http://ftp.gnu.org/gnu/grep/grep-3.8.tar.gz
        tar xf grep-3.8.tar.gz
        cd grep-3.8 && ./configure && make
        sudo make install
        export PATH=/usr/local/bin:$PATH
      
      Link: https://lkml.kernel.org/r/1668825419-30584-1-git-send-email-yangtiezhu@loongson.cn
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry() · f0a0ccda
      ZhangPeng authored
      
      
      Syzbot reported a null-ptr-deref bug:
      
       NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP
       frequency < 30 seconds
       general protection fault, probably for non-canonical address
       0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN
       KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
       CPU: 1 PID: 3603 Comm: segctord Not tainted
       6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0
       Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google
       10/11/2022
       RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0
       fs/nilfs2/alloc.c:608
       Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00
       00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02
       00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7
       RSP: 0018:ffffc90003dff830 EFLAGS: 00010212
       RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d
       RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010
       RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f
       R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158
       R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004
       FS:  0000000000000000(0000) GS:ffff8880b9b00000(0000)
       knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0
       Call Trace:
        <TASK>
        nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline]
        nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193
        nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236
        nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940
        nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline]
        nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline]
        nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088
        nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337
        nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568
        nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018
        nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067
        nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline]
        nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline]
        nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045
        nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379
        nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline]
        nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570
        kthread+0x2e4/0x3a0 kernel/kthread.c:376
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
        </TASK>
       ...
      
      If the DAT metadata file is corrupted on disk, there is a case where
      req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during
      a b-tree operation that cascades updates to the b-tree's ancestor nodes.
      This happens because nilfs_dat_commit_alloc() for a lower level block can
      initialize the blocknr on the same DAT entry between
      nilfs_dat_prepare_end() and nilfs_dat_commit_end().
      
      If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free()
      without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and
      causes the NULL pointer dereference above in
      nilfs_palloc_commit_free_entry() function, which leads to a crash.
      
      Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh
      before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free().
      
      This also calls nilfs_error() in that case to notify that there is a fatal
      flaw in the filesystem metadata and prevent further operations.
      
      Link: https://lkml.kernel.org/r/00000000000097c20205ebaea3d6@google.com
      Link: https://lkml.kernel.org/r/20221114040441.1649940-1-zhangpeng362@huawei.com
      Link: https://lkml.kernel.org/r/20221119120542.17204-1-konishi.ryusuke@gmail.com
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: <syzbot+ebe05ee8e98f755f61d0@syzkaller.appspotmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing · 04ada095
      Mike Kravetz authored
      madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
      tables associated with the address range.  For hugetlb vmas,
      zap_page_range will call __unmap_hugepage_range_final.  However,
      __unmap_hugepage_range_final assumes the passed vma is about to be removed
      and deletes the vma_lock to prevent pmd sharing as the vma is on the way
      out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
      missing vma_lock prevents pmd sharing and could potentially lead to issues
      with truncation/fault races.
      
      This issue was originally reported here [1] as a BUG triggered in
      page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
      vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
      prevent pmd sharing.  Subsequent faults on this vma were confused as
      VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
      not set in new pages added to the page table.  This resulted in pages that
      appeared anonymous in a VM_SHARED vma and triggered the BUG.
      
      Address the issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an
      unmap call from unmap_vmas().  This is used to indicate the 'final'
      unmapping of a hugetlb vma.  When called via MADV_DONTNEED, this flag is
      not set and the vma_lock is not deleted.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise: use zap_page_range_single for madvise dontneed · 21b85b09
      Mike Kravetz authored
      This series addresses the issue first reported in [1], and fully described
      in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
      for stable backports.
      
      While exploring solutions to this issue, related problems with mmu
      notification calls were discovered.  This is addressed in the patch
      "hugetlb: remove duplicate mmu notifications".  Since there are no user
      visible effects, this third patch is not tagged for stable backports.
      
      Previous discussions suggested further cleanup by removing the
      routine zap_page_range.  This is possible because zap_page_range_single
      is now exported, and all callers of zap_page_range pass ranges entirely
      within a single vma.  This work will be done in a later patch so as not
      to distract from this bug fix.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      
      This patch (of 2):
      
      Expose the routine zap_page_range_single to zap a range within a single
      vma.  The madvise routine madvise_dontneed_single_vma can use this routine
      as it explicitly operates on a single vma.  Also, update the mmu
      notification range in zap_page_range_single to take hugetlb pmd sharing
      into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE · dec1d352
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
      include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
      6.1.0-rc1-syzkaller-00454-ga70385240892 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
      ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
      96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
      f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      It is because khugepaged allocates pages with __GFP_THISNODE, but the
      preferred node is bogus.  The previous patch fixed the khugepaged code to
      avoid allocating pages from a non-existing node.  But it is still racy
      against memory hotremove.  There is no synchronization with memory
      hotplug, so it is possible that memory goes offline during a
      longer-running scan.
      
      So this warning still seems not quite helpful because:
        * There is no guarantee the node is online for __GFP_THISNODE context
          for all the callsites.
        * The kernel just fails the allocation regardless of the warning, and
          it looks like all callsites handle the allocation failure gracefully.
      
      Although the warning has helped identify buggy code, it is not safe in
      general, and it could panic the system with a panic-on-warn configuration,
      which tends to be used surprisingly often.  So replace VM_WARN_ON with
      pr_warn().  The new warning is emitted only if __GFP_NOWARN is set, since
      the allocator already prints its own warning when __GFP_NOWARN is not set.
      
      [shy828301@gmail.com: rename nid to this_node and gfp to warn_gfp]
        Link: https://lkml.kernel.org/r/20221123193014.153983-1-shy828301@gmail.com
      [akpm@linux-foundation.org: fix whitespace]
      [akpm@linux-foundation.org: print gfp_mask instead of warn_gfp, per Michel]
      Link: https://lkml.kernel.org/r/20221108184357.55614-3-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. Nov 23, 2022
    • test_kprobes: fix implicit declaration error of test_kprobes · de3db3f8
      Li Hua authored
      If KPROBES_SANITY_TEST and ARCH_CORRECT_STACKTRACE_ON_KRETPROBE are
      enabled but STACKTRACE is not set, the build fails as below:
      
      lib/test_kprobes.c: In function `stacktrace_return_handler':
      lib/test_kprobes.c:228:8: error: implicit declaration of function `stack_trace_save'; did you mean `stacktrace_driver'? [-Werror=implicit-function-declaration]
        ret = stack_trace_save(stack_buf, STACK_BUF_SIZE, 0);
              ^~~~~~~~~~~~~~~~
              stacktrace_driver
      cc1: all warnings being treated as errors
      scripts/Makefile.build:250: recipe for target 'lib/test_kprobes.o' failed
      make[2]: *** [lib/test_kprobes.o] Error 1
      
      To fix this error, select STACKTRACE if ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled.
      
      Link: https://lkml.kernel.org/r/20221121030620.63181-1-hucool.lihua@huawei.com
      Fixes: 1f6d3a8f ("kprobes: Add a test case for stacktrace from kretprobe handler")
      Signed-off-by: Li Hua <hucool.lihua@huawei.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty · 512c5ca0
      Chen Zhongjin authored
      When extending segments, nilfs_sufile_alloc() is called to get an
      unassigned segment, then mark it as dirty to avoid accidentally allocating
      the same segment in the future.
      
      But for some special cases such as a corrupted image it can be unreliable.
      If such corruption of the dirty state of the segment occurs, nilfs2 may
      reallocate a segment that is in use and pick the same segment for writing
      twice at the same time.
      
      This will cause the problem reported by syzkaller:
      https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24
      
      This case started with segbuf1.segnum = 3, nextnum = 4 when constructed.
      It assumed segment 4 had already been allocated and marked as dirty.
      
      However, the dirty state was corrupted, and segment 4's usage was not
      dirty.  The first time through nilfs_segctor_extend_segments(), segment 4
      was allocated again, which left segbuf2 and the next segbuf3 with the
      same segment 4.
      
      sb_getblk() then returns the same bh for segbuf2 and segbuf3, and this bh
      is added to the buffer lists of both segbufs.  That breaks the lists and
      causes a NULL pointer dereference.
      
      Fix the problem by setting usage as dirty every time in
      nilfs_sufile_mark_dirty(), which is called during constructing current
      segment to be written out and before allocating next segment.
      
      [chenzhongjin@huawei.com: add lock protection per Ryusuke]
        Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com
      Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com
      Fixes: 9ff05123 ("nilfs2: segment constructor")
      Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
      Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com>
      Reported-by: Liu Shixin <liushixin2@huawei.com>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1 · 81a70c21
      Aneesh Kumar K.V authored
      balance_dirty_pages doesn't do the required dirty throttling on cgroupv1. 
      See commit 9badce00 ("cgroup, writeback: don't enable cgroup writeback
      on traditional hierarchies").  Instead, the kernel depends on writeback
      throttling in shrink_folio_list to achieve the same goal.  With large
      memory systems, the flusher may not be able to writeback quickly enough
      such that we will start finding pages in the shrink_folio_list already in
      writeback.  Hence for cgroupv1 let's do a reclaim throttle after waking up
      the flusher.
      
      The below test, which used to fail on a 256GB system, now runs until the
      file system is full with this change.
      
      root@lp2:/sys/fs/cgroup/memory# mkdir test
      root@lp2:/sys/fs/cgroup/memory# cd test/
      root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
      root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
      root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
      Killed
      
      Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: zefan li <lizefan.x@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: fix unexpected changes to {failslab|fail_page_alloc}.attr · ea4452de
      Qi Zheng authored
      When we specify __GFP_NOWARN, we only expect that no warnings will be
      issued for the current caller.  But in __should_failslab() and
      __should_fail_alloc_page(), the local GFP flags alter the global
      {failslab|fail_page_alloc}.attr, which is persistent and shared by all
      tasks.  This is not what we expected; let's fix it.
      
      [akpm@linux-foundation.org: unexport should_fail_ex()]
      Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
      Fixes: 3f913fc5 ("mm: fix missing handler for __GFP_NOWARN")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • swapfile: fix soft lockup in scan_swap_map_slots · de1ccfb6
      Chen Wandun authored
      A softlockup occurs while scanning for free swap slots under huge memory
      pressure.  The test scenario is: 64 CPU cores, 64GB memory, and 28 zram
      devices, where the disksize of each zram device is 50MB.
      
      LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(),
      but the real number of loop iterations can exceed LATENCY_LIMIT because
      "goto checks" and "goto scan" repeat without decreasing the latency
      limit.
      
      In order to fix it, decrease latency_ration in advance.
      
      There is also a suspicious place that may cause softlockups in
      get_swap_pages().  In this function, the "goto start_over" may result in
      continuous scanning of the swap partition.  If there is no cond_resched()
      in scan_swap_map_slots(), it would cause a softlockup (I am not sure
      about this).
      
      WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
      CPU: 11 PID: 466 Comm: kswapd0 Kdump: loaded Tainted: G
      dump_backtrace+0x0/0x1e4
      show_stack+0x20/0x2c
      dump_stack+0xd8/0x140
      watchdog_print_info+0x48/0x54
      watchdog_process_before_softlockup+0x98/0xa0
      watchdog_timer_fn+0x1ac/0x2d0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      scan_swap_map_slots+0x678/0x890
      get_swap_pages+0x29c/0x440
      get_swap_page+0x120/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      shrink_inactive_list+0x16c/0x500
      shrink_lruvec+0x270/0x304
      
      WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
      watchdog_timer_fn+0x1ac/0x2d0
      __run_hrtimer+0x98/0x2a0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      get_swap_pages+0x1e8/0x440
      get_swap_page+0x1c8/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      reclaim_pages+0x160/0x310
      madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
      walk_pmd_range.isra.0+0xac/0x22c
      walk_pud_range+0xfc/0x1c0
      walk_pgd_range+0x158/0x1b0
      __walk_page_range+0x64/0x100
      walk_page_range+0x104/0x150
      
      Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
      Fixes: 048c27fd ("[PATCH] swap: scan_swap_map latency breaks")
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: <xialonglong1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: fix __prep_compound_gigantic_page page flag setting · 7fb0728a
      Mike Kravetz authored
      Commit 2b21624f ("hugetlb: freeze allocated pages before creating
      hugetlb pages") changed the order page flags were cleared and set in the
      head page.  It moved the __ClearPageReserved after __SetPageHead. 
      However, there is a check to make sure __ClearPageReserved is never done
      on a head page.  If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
      will be hit when creating a hugetlb gigantic page:
      
          page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
          ------------[ cut here ]------------
          kernel BUG at include/linux/page-flags.h:500!
          Call Trace will differ depending on whether hugetlb page is created
          at boot time or run time.
      
      Make sure to __ClearPageReserved BEFORE __SetPageHead.
      
      Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com
      Fixes: 2b21624f ("hugetlb: freeze allocated pages before creating hugetlb pages")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Tested-by: Tarun Sahu <tsahu@linux.ibm.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kfence: fix stack trace pruning · 747c0f35
      Marco Elver authored
      Commit b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      refactored large parts of the kmalloc subsystem, resulting in the stack
      trace pruning logic done by KFENCE to no longer work.
      
      While b1405135 attempted to fix the situation by including
      '__kmem_cache_free' in the list of functions KFENCE should skip through,
      this only works when the compiler actually optimized the tail call from
      kfree() to __kmem_cache_free() into a jump (and thus kfree() _not_
      appearing in the full stack trace to begin with).
      
      In some configurations, the compiler no longer optimizes the tail call
      into a jump, and __kmem_cache_free() appears in the stack trace.  This
      means that the pruned stack trace shown by KFENCE would include kfree()
      which is not intended - for example:
      
       | BUG: KFENCE: invalid free in kfree+0x7c/0x120
       |
       | Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
       |  kfree+0x7c/0x120
       |  test_double_free+0x116/0x1a9
       |  kunit_try_run_case+0x90/0xd0
       | [...]
      
      Fix it by moving __kmem_cache_free() to the list of functions that may be
      tail called by an allocator entry function, making the pruning logic work
      in both the optimized and unoptimized tail call cases.
      
      Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com
      Fixes: b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • proc/meminfo: fix spacing in SecPageTables · f850c849
      Yosry Ahmed authored
      SecPageTables has a tab after it instead of a space, this can break
      fragile parsers that depend on spaces after the stat names.
      
      Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com
      Fixes: ebc97a52 ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.")
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f850c849
    • mm: multi-gen LRU: retry folios written back while isolated · 359a5e14
      Yu Zhao authored
      The page reclaim isolates a batch of folios from the tail of one of the
      LRU lists and works on those folios one by one.  For a suitable
      swap-backed folio, if the swap device is async, it queues that folio for
      writeback.  After the page reclaim finishes an entire batch, it puts back
      the folios it queued for writeback to the head of the original LRU list.
      
      In the meantime, the page writeback flushes the queued folios also by
      batches.  Its batching logic is independent from that of the page reclaim.
      For each of the folios it writes back, the page writeback calls
      folio_rotate_reclaimable() which tries to rotate a folio to the tail.
      
      folio_rotate_reclaimable() only works for a folio after the page reclaim
      has put it back.  If an async swap device is fast enough, the page
      writeback can finish with that folio while the page reclaim is still
      working on the rest of the batch containing it.  In this case, that folio
      will remain at the head and the page reclaim will not retry it before
      reaching there.
      
      This patch adds a retry to evict_folios().  After evict_folios() has
      finished an entire batch and before it puts back folios it cannot free
      immediately, it retries those that may have missed the rotation.
      
      Before this patch, ~60% of folios swapped to an Intel Optane missed
      folio_rotate_reclaimable().  After this patch, ~99% of missed folios were
      reclaimed upon retry.
      
      This problem affects relatively slow async swap devices like Samsung 980
      Pro much less and does not affect sync swap devices like zram or zswap at
      all.
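
      The race and the retry can be sketched as a toy model in a few lines
      of Python (none of these names are kernel APIs): folios whose
      writeback completed while the batch was still isolated missed
      folio_rotate_reclaimable(), and a second pass picks them up.

```python
def evict_folios(batch, finished_while_isolated, retry=True):
    """Toy model of the reclaim batch described above.

    Folios queued for writeback whose I/O completed while the batch was
    still isolated missed folio_rotate_reclaimable(); without a retry
    they are put back at the LRU head and linger there.
    """
    freed, putback = [], []
    for folio in batch:
        if folio in finished_while_isolated:
            putback.append(folio)   # clean now, but rotation was missed
        else:
            freed.append(folio)     # reclaimed normally (toy model)
    if retry:
        freed += putback            # second pass frees the clean folios
        putback = []
    return freed, putback
```

      In this caricature the retry pass recovers exactly the folios that
      would otherwise be put back at the head, mirroring the ~99% retry
      success the patch reports.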
      
      Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
      Fixes: ac35a490 ("mm: multi-gen LRU: minimal implementation")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      359a5e14
    • mailmap: update email address for Satya Priya · 47123d7f
      Satya Priya authored
      
      
      Add and update my email address; skakit@codeaurora.org is no longer
      active.
      
      Link: https://lkml.kernel.org/r/20221116105017.3018971-1-quic_c_skakit@quicinc.com
      Signed-off-by: Satya Priya <quic_c_skakit@quicinc.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      47123d7f
    • mm/migrate_device: return number of migrating pages in args->cpages · 44af0b45
      Alistair Popple authored
      migrate_vma->cpages originally contained a count of the number of pages
      migrating including non-present pages which can be populated directly on
      the target.
      
      Commit 241f6885 ("mm/migrate_device.c: refactor migrate_vma and
      migrate_device_coherent_page()") inadvertently changed this to contain
      just the number of pages that were unmapped.  Usage of migrate_vma->cpages
      isn't documented, but most drivers use it to see if all the requested
      addresses can be migrated, so restore the original behaviour.
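
      The counting difference can be sketched as follows (a toy model with
      hypothetical fields, not the kernel structures): cpages is meant to
      count every migrating entry, including non-present ones that can be
      populated directly on the target, whereas the refactor counted only
      unmapped present pages.

```python
def count_cpages(entries):
    """Restored behaviour: every migrating entry counts, including
    non-present ones populated directly on the target."""
    return sum(1 for e in entries if e["migrating"])

def count_unmapped(entries):
    """What the refactor inadvertently counted instead."""
    return sum(1 for e in entries if e["migrating"] and e["present"])

entries = [
    {"migrating": True,  "present": True},   # normal page being migrated
    {"migrating": True,  "present": False},  # non-present, populate on target
    {"migrating": False, "present": True},   # cannot migrate
]
# Drivers typically compare cpages against the number of requested pages
# to decide whether every address can be migrated.
```

      With the toy data above, the restored count sees both migrating
      entries while the regressed count misses the non-present one.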
      
      Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
      Fixes: 241f6885 ("mm/migrate_device.c: refactor migrate_vma and migrate_device_coherent_page()")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reported-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      44af0b45
    • kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible · 50c69721
      Sam James authored
      
      
      Add missing <linux/string.h> include for strcmp.
      
      Clang 16 makes -Wimplicit-function-declaration an error by default.
      Unfortunately, out-of-tree modules may use this in configure scripts,
      which means the failure might cause silent miscompilation or
      misconfiguration.
      
      For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
      or the (new) c-std-porting mailing list [3].
      
      [0] https://lwn.net/Articles/913505/
      [1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
      [2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
      [3] hosted at lists.linux.dev.
      
      [akpm@linux-foundation.org: remember "linux/"]
      Link: https://lkml.kernel.org/r/20221116182634.2823136-1-sam@gentoo.org
      Signed-off-by: Sam James <sam@gentoo.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      50c69721
    • MAINTAINERS: update Alex Hung's email address · db8e0998
      Alex Hung authored
      
      
      Use my personal email address.
      
      Link: https://lkml.kernel.org/r/20221114001302.671897-2-alex.hung@amd.com
      Signed-off-by: Alex Hung <alexhung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      db8e0998
    • mailmap: update Alex Hung's email address · d39e2ad6
      Alex Hung authored
      
      
      I am no longer at Canonical; add an entry for my personal email
      address.
      
      Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.com
      Signed-off-by: Alex Hung <alexhung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d39e2ad6
    • mm: mmap: fix documentation for vma_mas_szero · 4a423440
      Ian Cowan authored
      
      
      When the struct mm_struct input, mm, was changed to a struct ma_state,
      mas, the documentation for the function was never updated.  This
      updates that documentation reference.
      
      Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero
      Signed-off-by: Ian Cowan <ian@linux.cowan.aero>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4a423440
    • mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed · 8468b486
      SeongJae Park authored
      A DAMON sysfs interface user can start DAMON with a scheme, remove the
      sysfs directory for the scheme, and then ask for an update of the
      scheme's stats.  Because the scheme stats update logic isn't aware of
      this situation, it results in an invalid memory access.  Fix the bug by
      checking whether the scheme sysfs directory exists.
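
      The shape of the fix can be sketched like this (hypothetical names; the
      real code walks kernel sysfs objects under kdamond_lock):

```python
def update_scheme_stats(sysfs_scheme_dirs, scheme_id, new_stats):
    """Toy model: skip the stats update when the scheme's sysfs
    directory is gone, instead of writing through a stale pointer."""
    scheme_dir = sysfs_scheme_dirs.get(scheme_id)
    if scheme_dir is None:      # directory removed by the user; skip
        return False
    scheme_dir["stats"] = new_stats
    return True
```

      The existence check turns what was an invalid access into a harmless
      no-op for removed schemes.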
      
      Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
      Fixes: 0ac32b8a ("mm/damon/sysfs: support DAMOS stats")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[v5.18]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8468b486
    • mm/memory: return vm_fault_t result from migrate_to_ram() callback · 4a955bed
      Alistair Popple authored
      The migrate_to_ram() callback should always succeed, but in rare cases
      it can fail, usually returning VM_FAULT_SIGBUS.  Commit 16ce101d
      ("mm/memory.c: fix race when faulting a device private page")
      incorrectly stopped passing the return code up the stack.  Fix this by
      setting the ret variable, restoring the previous behaviour on
      migrate_to_ram() failure.
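
      The control-flow change amounts to capturing the callback's result
      instead of discarding it; roughly (illustrative names, not the kernel
      source):

```python
VM_FAULT_SIGBUS = 0x0002  # matches the kernel's vm_fault_reason bit value

def do_device_private_fault(migrate_to_ram):
    # Before the fix the callback's result was dropped and success was
    # reported even when migration failed; the fix keeps it in ret.
    ret = migrate_to_ram()
    return ret
```

      A failing callback's VM_FAULT_SIGBUS now reaches the fault handler's
      caller instead of being silently swallowed.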
      
      Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
      Fixes: 16ce101d ("mm/memory.c: fix race when faulting a device private page")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4a955bed
    • mm: correctly charge compressed memory to its memcg · cd08d80e
      Li Liguang authored
      Kswapd will reclaim memory when memory pressure is high; anonymous
      memory will be compressed and stored in the zpool if zswap is enabled.
      The memcg_kmem_bypass() check in get_obj_cgroup_from_page() bypasses
      kernel threads and causes the compressed memory not to be charged to
      its memory cgroup.
      
      Remove the memcg_kmem_bypass() call and properly charge compressed memory
      to its corresponding memory cgroup.
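
      A toy model of the accounting change (hypothetical names): the charge
      target is the memcg of the page being compressed, so the bypass for
      kernel threads such as kswapd must not apply here.

```python
def get_obj_cgroup_from_page(page, current_is_kthread, bypass_kthreads):
    """Toy model: pick the cgroup to charge for a page's compressed copy."""
    if bypass_kthreads and current_is_kthread:
        return None                 # old behaviour: kswapd's charge skipped
    return page["memcg"]            # fixed: charge the page's own memcg

page = {"memcg": "workload-A"}
# Old: compression triggered from kswapd went uncharged.
# New: it is charged to workload-A regardless of who runs reclaim.
```

      The key point is that the ownership of the memory, not the identity of
      the reclaiming thread, determines where the charge lands.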
      
      Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
      Fixes: f4840ccf ("zswap: memcg accounting")
      Signed-off-by: Li Liguang <liliguang@baidu.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cd08d80e