  1. Apr 26, 2024
    • mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() · 52ccdde1
      Miaohe Lin authored
      When I did memory failure tests recently, the below warning occurred:
      
      DEBUG_LOCKS_WARN_ON(1)
      WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
      Modules linked in: mce_inject hwpoison_inject
      CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      RIP: 0010:__lock_acquire+0xccb/0x1ca0
      RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
      RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
      RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
      RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
      R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
      R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
      FS:  00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       lock_acquire+0xbe/0x2d0
       _raw_spin_lock_irqsave+0x3a/0x60
       hugepage_subpool_put_pages.part.0+0xe/0xc0
       free_huge_folio+0x253/0x3f0
       dissolve_free_huge_page+0x147/0x210
       __page_handle_poison+0x9/0x70
       memory_failure+0x4e6/0x8c0
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff9f3114887
      RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
      RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
      RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
       </TASK>
      Kernel panic - not syncing: kernel: panic_on_warn set ...
      CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       panic+0x326/0x350
       check_panic_on_warn+0x4f/0x50
       __warn+0x98/0x190
       report_bug+0x18e/0x1a0
       handle_bug+0x3d/0x70
       exc_invalid_op+0x18/0x70
       asm_exc_invalid_op+0x1a/0x20
      RIP: 0010:__lock_acquire+0xccb/0x1ca0
      RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
      RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
      RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
      RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
      R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
      R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
       lock_acquire+0xbe/0x2d0
       _raw_spin_lock_irqsave+0x3a/0x60
       hugepage_subpool_put_pages.part.0+0xe/0xc0
       free_huge_folio+0x253/0x3f0
       dissolve_free_huge_page+0x147/0x210
       __page_handle_poison+0x9/0x70
       memory_failure+0x4e6/0x8c0
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff9f3114887
      RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
      RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
      RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
       </TASK>
      
      After git bisecting and digging into the code, I believe the root cause is
      that the _deferred_list field of folio is unioned with the
      _hugetlb_subpool field.  In __update_and_free_hugetlb_folio(),
      folio->_deferred_list is initialized, corrupting folio->_hugetlb_subpool
      when the folio is hugetlb.  Later, free_huge_folio() will use
      _hugetlb_subpool and the above warning happens.
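      
      Because the two fields share storage, initializing one clobbers the
      other.  A minimal userspace sketch of the overlap (illustrative only,
      not the real struct folio layout):
      
      #include <stdio.h>
      
      struct folio_sketch {
      	union {
      		struct { void *next, *prev; } deferred_list; /* _deferred_list */
      		void *hugetlb_subpool;                       /* _hugetlb_subpool */
      	};
      };
      
      int main(void)
      {
      	struct folio_sketch f = { .hugetlb_subpool = (void *)0x1234 };
      
      	/* Initializing the list member corrupts the subpool pointer: */
      	f.deferred_list.next = &f;
      	f.deferred_list.prev = &f;
      	printf("subpool is now %p\n", f.hugetlb_subpool); /* not 0x1234 */
      	return 0;
      }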
      
      But it is assumed that the hugetlb flag must have been cleared by the
      time folio_put() is called in update_and_free_hugetlb_folio().  This
      assumption is broken by the below race:
      
      CPU1					CPU2
      dissolve_free_huge_page			update_and_free_pages_bulk
       update_and_free_hugetlb_folio		 hugetlb_vmemmap_restore_folios
      					  folio_clear_hugetlb_vmemmap_optimized
        clear_flag = folio_test_hugetlb_vmemmap_optimized
        if (clear_flag) <-- False, it's already cleared.
         __folio_clear_hugetlb(folio) <-- Hugetlb is not cleared.
        folio_put
         free_huge_folio <-- free_the_page is expected.
      					 list_for_each_entry()
      					  __folio_clear_hugetlb <-- Too late.
      
      Fix this issue by checking whether folio is hugetlb directly instead of
      checking clear_flag to close the race window.
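      
      The shape of the fix in __update_and_free_hugetlb_folio() is roughly the
      following (a sketch of the described check, not the exact diff):
      
      /* Test the folio state directly instead of a stale local flag: */
      if (folio_test_hugetlb(folio)) {
      	spin_lock_irq(&hugetlb_lock);
      	__folio_clear_hugetlb(folio);
      	spin_unlock_irq(&hugetlb_lock);
      }
      folio_put(folio);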
      
      Link: https://lkml.kernel.org/r/20240419085819.1901645-1-linmiaohe@huawei.com
      Fixes: 32c87719 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: protection_keys: save/restore nr_hugepages value from launch script · ed74abcd
      Muhammad Usama Anjum authored
      The save/restore of nr_hugepages was added to the test itself by using
      the atexit() functionality.  But it is broken, as the parent exits right
      after creating the child, which calls the atexit() handler too early.
      Nor is that all: each child in turn exits after creating its own child,
      and so on.
      
      The parent cannot wait for the termination status of its children, as it
      would keep holding its resources until the new pkey allocation fails,
      and it is impossible to wait for the exits of all the grandchildren and
      great-grandchildren.  Hence restoring the nr_hugepages value from the
      parent is wrong.
      
      Let's save/restore the nr_hugepages settings in the launch script
      instead of doing it in the test.
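      
      A minimal sketch of why atexit() misfires here (illustrative only, not
      the test code; the restore step is hypothetical):
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      static void restore_nr_hugepages(void)
      {
      	/* Hypothetical restore step; runs in every exiting process. */
      	printf("restoring from pid %d\n", getpid());
      }
      
      int main(void)
      {
      	atexit(restore_nr_hugepages);
      
      	if (fork() == 0) {
      		/* The child would fork its own child here, and so on. */
      		exit(0);	/* handler fires here as well */
      	}
      	return 0;	/* parent exits without waiting: handler fires too early */
      }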
      
      Link: https://lkml.kernel.org/r/20240419115027.3848958-1-usama.anjum@collabora.com
      Fixes: c52eb6db ("selftests: mm: restore settings from only parent process")
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Reported-by: Joey Gouly <joey.gouly@arm.com>
      Closes: https://lore.kernel.org/all/20240418125250.GA2941398@e124191.cambridge.arm.com
      Cc: Joey Gouly <joey.gouly@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Apr 17, 2024
    • nilfs2: fix OOB in nilfs_set_de_type · c4a7dc95
      Jeongjun Park authored
      The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is
      defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function,
      which uses this array, specifies the index to read from the array in the
      same way as "(mode & S_IFMT) >> S_SHIFT".
      
      static void nilfs_set_de_type(struct nilfs_dir_entry *de,
      				struct inode *inode)
      {
      	umode_t mode = inode->i_mode;
      
      	de->file_type = nilfs_type_by_mode[(mode & S_IFMT) >> S_SHIFT]; /* OOB */
      }
      
      However, when the index is computed this way, an out-of-bounds (OOB)
      access occurs whenever the condition "(mode & S_IFMT) == S_IFMT" is
      satisfied: the index then equals the array size, i.e. one past the last
      valid element.  Therefore, a patch resizing the nilfs_type_by_mode array
      should be applied to prevent OOB accesses.
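      
      Concretely, with S_SHIFT = 12 the array has S_IFMT >> S_SHIFT = 15
      entries (indices 0 to 14), while the computed index can reach 15.  A
      sketch of the resize (initializers abbreviated):
      
      static unsigned char nilfs_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
      	[S_IFREG >> S_SHIFT]	= NILFS_FT_REG_FILE,
      	[S_IFDIR >> S_SHIFT]	= NILFS_FT_DIR,
      	[S_IFCHR >> S_SHIFT]	= NILFS_FT_CHRDEV,
      	/* ... */
      };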
      
      Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com
      Reported-by: <syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8
      Fixes: 2ba466d7 ("nilfs2: directory entry operations")
      Signed-off-by: Jeongjun Park <aha310510@gmail.com>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: update Naoya Horiguchi's email address · 8247bf1d
      Naoya Horiguchi authored
      My old NEC address has been removed, so update MAINTAINERS and .mailmap to
      map it to my gmail address.
      
      Link: https://lkml.kernel.org/r/20240412181720.18452-1-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fork: defer linking file vma until vma is fully initialized · 35e35178
      Miaohe Lin authored
      Thorvald reported a WARNING [1], and the root cause is the below race:
      
       CPU 1					CPU 2
       fork					hugetlbfs_fallocate
        dup_mmap				 hugetlbfs_punch_hole
         i_mmap_lock_write(mapping);
         vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
         i_mmap_unlock_write(mapping);
         hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
      					 i_mmap_lock_write(mapping);
         					 hugetlb_vmdelete_list
      					  vma_interval_tree_foreach
      					   hugetlb_vma_trylock_write -- Vma_lock is cleared.
         tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
      					   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
      					 i_mmap_unlock_write(mapping);
      
      hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside
      the i_mmap_rwsem lock while the vma lock can be used at the same time.
      Fix this by deferring the linking of the file vma until the vma is fully
      initialized.  Those vmas should be initialized first before they can be
      used.
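      
      A sketch of the reordering in dup_mmap() (simplified, using the function
      names from the diagram above; not the exact diff):
      
      /* Fully initialize the copied vma first... */
      if (is_vm_hugetlb_page(tmp))
      	hugetlb_dup_vma_private(tmp);
      if (tmp->vm_ops && tmp->vm_ops->open)
      	tmp->vm_ops->open(tmp);
      
      /* ...and only then make it reachable through the i_mmap tree. */
      if (file) {
      	i_mmap_lock_write(mapping);
      	vma_interval_tree_insert_after(tmp, mpnt, &mapping->i_mmap);
      	i_mmap_unlock_write(mapping);
      }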
      
      Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reported-by: Thorvald Natvig <thorvald@google.com>
      Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peng Zhang <zhangpeng.00@bytedance.com>
      Cc: Tycho Andersen <tandersen@netflix.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: inline shmem_is_huge() for disabled transparent hugepages · 1f737846
      Sumanth Korikkar authored
      In order to minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y), the
      compiler might choose to make a regular (out-of-line) function call for
      shmem_is_huge() instead of inlining it.  When transparent hugepages are
      disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), this can cause a compilation
      error:
      
      mm/shmem.c: In function `shmem_getattr':
      ./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG'
        383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
            |                           ^~~~~~~~~
      mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE'
       1148 |                 stat->blksize = HPAGE_PMD_SIZE;
      
      To prevent the possible error, always inline shmem_is_huge() when
      transparent hugepages are disabled.
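      
      A sketch of the idea (signature abbreviated and assumed): with
      __always_inline the stub is guaranteed to constant-fold to false even at
      -Os, so the HPAGE_PMD_SIZE (BUILD_BUG) branch is discarded:
      
      #ifdef CONFIG_TRANSPARENT_HUGEPAGE
      bool shmem_is_huge(struct inode *inode, pgoff_t index /* , ... */);
      #else
      static __always_inline bool shmem_is_huge(struct inode *inode,
      					  pgoff_t index /* , ... */)
      {
      	return false;
      }
      #endif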
      
      Link: https://lkml.kernel.org/r/20240409155407.2322714-1-sumanthk@linux.ibm.com
      Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: defer enablement of static branch · 0b2cf0a4
      Oscar Salvador authored
      Kefeng Wang reported that he was seeing some memory leaks with kmemleak
      with page_owner enabled.
      
      The reason is that we enable the page_owner_inited static branch and
      then proceed with the linking of the stack_list struct to dummy_stack,
      which means that there exists a race window between these two steps:
      pages can already be allocated, calling add_stack_record_to_list(),
      allocating objects and linking them to stack_list, but then we set
      stack_list pointing to dummy_stack in init_page_owner().  This means
      that the objects allocated during that time window are unreferenced and
      lost.
      
      Fix this by deferring the enablement of the branch until we have properly
      set up the list.
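      
      The shape of the fix in init_page_owner() (a sketch; only the ordering
      matters, everything else is omitted):
      
      /* Publish the list head first... */
      stack_list = &dummy_stack;
      /*
       * ...and only then let allocations enter the page_owner path and
       * call add_stack_record_to_list().
       */
      static_branch_enable(&page_owner_inited);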
      
      Link: https://lkml.kernel.org/r/20240409131715.13632-1-osalvador@suse.de
      Fixes: 4bedfb31 ("mm,page_owner: maintain own list of stack_records structs")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Closes: https://lore.kernel.org/linux-mm/74b147b0-718d-4d50-be75-d6afc801cd24@huawei.com/
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Squashfs: check the inode number is not the invalid value of zero · 9253c54e
      Phillip Lougher authored
      Syskiller has produced an out of bounds access in fill_meta_index().
      
      That out of bounds access is ultimately caused because the inode
      has an inode number with the invalid value of zero, which was not checked.
      
      The reason this causes the out of bounds access is due to the following
      sequence of events:
      
      1. fill_meta_index() is called to allocate (via empty_meta_index())
         and fill a metadata index.  It however suffers a data read error
         and aborts, invalidating the newly returned empty metadata index.
         It does this by setting the inode number of the index to zero,
         which means unused (zero is not a valid inode number).
      
      2. When fill_meta_index() is subsequently called again on another
         read operation, locate_meta_index() returns the previous index
         because it matches the inode number of 0.  Because this index
         has been returned it is expected to have been filled, and because
         it hasn't been, an out of bounds access is performed.
      
      This patch adds a sanity check which checks that the inode number
      is not zero when the inode is created and returns -EINVAL if it is.
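      
      The described check has roughly this shape (where exactly it sits in the
      squashfs inode setup path, and the variable name, are assumptions):
      
      /* Inode number 0 means "unused" in the meta-index cache, so it
       * must never appear on a real inode. */
      if (ino_number == 0)
      	return -EINVAL;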
      
      [phillip@squashfs.org.uk: whitespace fix]
        Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk
      Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com>
      Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,swapops: update check in is_pfn_swap_entry for hwpoison entries · 07a57a33
      Oscar Salvador authored
      Tony reported that machine check recovery was broken in v6.9-rc1, as he
      was hitting a VM_BUG_ON when injecting uncorrectable memory errors into
      DRAM.
      
      After some more digging and debugging on his side, he realized that this
      went back to v6.1, with the introduction of commit 0d206b5d
      ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry").  That
      commit, among other things, introduced swp_offset_pfn(), which replaced
      hwpoison_entry_to_pfn().
      
      The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but
      is_pfn_swap_entry() never got updated to cover hwpoison entries, which
      means that we would hit the VM_BUG_ON whenever we would call
      swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM
      set.  Fix this by updating the check to cover hwpoison entries as well,
      and update the comment while we are at it.
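      
      The widened predicate then looks roughly like this (a sketch matching
      the description above; the surrounding sanity checks are omitted):
      
      static inline bool is_pfn_swap_entry(swp_entry_t entry)
      {
      	return is_migration_entry(entry) ||
      	       is_device_private_entry(entry) ||
      	       is_device_exclusive_entry(entry) ||
      	       is_hwpoison_entry(entry);
      }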
      
      Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de
      Fixes: 0d206b5d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reported-by: Tony Luck <tony.luck@intel.com>
      Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
      Tested-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: <stable@vger.kernel.org>	[6.1.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled · 1983184c
      Miaohe Lin authored
      When I did hard offline tests with hugetlb pages, the below deadlock occurred:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.8.0-11409-gf6cef5f8c37f #1 Not tainted
      ------------------------------------------------------
      bash/46904 is trying to acquire lock:
      ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60
      
      but task is already holding lock:
      ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
             __mutex_lock+0x6c/0x770
             page_alloc_cpu_online+0x3c/0x70
             cpuhp_invoke_callback+0x397/0x5f0
             __cpuhp_invoke_callback_range+0x71/0xe0
             _cpu_up+0xeb/0x210
             cpu_up+0x91/0xe0
             cpuhp_bringup_mask+0x49/0xb0
             bringup_nonboot_cpus+0xb7/0xe0
             smp_init+0x25/0xa0
             kernel_init_freeable+0x15f/0x3e0
             kernel_init+0x15/0x1b0
             ret_from_fork+0x2f/0x50
             ret_from_fork_asm+0x1a/0x30
      
      -> #0 (cpu_hotplug_lock){++++}-{0:0}:
             __lock_acquire+0x1298/0x1cd0
             lock_acquire+0xc0/0x2b0
             cpus_read_lock+0x2a/0xc0
             static_key_slow_dec+0x16/0x60
             __hugetlb_vmemmap_restore_folio+0x1b9/0x200
             dissolve_free_huge_page+0x211/0x260
             __page_handle_poison+0x45/0xc0
             memory_failure+0x65e/0xc70
             hard_offline_page_store+0x55/0xa0
             kernfs_fop_write_iter+0x12c/0x1d0
             vfs_write+0x387/0x550
             ksys_write+0x64/0xe0
             do_syscall_64+0xca/0x1e0
             entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcp_batch_high_lock);
                                     lock(cpu_hotplug_lock);
                                     lock(pcp_batch_high_lock);
        rlock(cpu_hotplug_lock);
      
       *** DEADLOCK ***
      
      5 locks held by bash/46904:
       #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
       #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
       #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
       #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
       #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      stack backtrace:
      CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x68/0xa0
       check_noncircular+0x129/0x140
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7fc862314887
      Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
      RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
      RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00
      
      In short, the below scenario breaks the lock dependency chain:
      
       memory_failure
        __page_handle_poison
         zone_pcp_disable -- lock(pcp_batch_high_lock)
         dissolve_free_huge_page
          __hugetlb_vmemmap_restore_folio
           static_key_slow_dec
            cpus_read_lock -- rlock(cpu_hotplug_lock)
      
      Fix this by calling drain_all_pages() instead.
      
      This issue could not occur before commit a6b40850 ("mm: hugetlb:
      replace hugetlb_free_vmemmap_enabled with a static_key"), which
      introduced rlock(cpu_hotplug_lock) into the dissolve_free_huge_page()
      code path while lock(pcp_batch_high_lock) is already held in
      __page_handle_poison().
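      
      A sketch of the reworked __page_handle_poison() (simplified from the
      description above):
      
      static bool __page_handle_poison(struct page *page)
      {
      	int ret;
      
      	/*
      	 * zone_pcp_disable() cannot be used here: it would hold
      	 * pcp_batch_high_lock while dissolve_free_huge_page() may take
      	 * cpu_hotplug_lock via static_key_slow_dec().
      	 */
      	ret = dissolve_free_huge_page(page);
      	if (!ret) {
      		drain_all_pages(page_zone(page));
      		ret = take_page_off_buddy(page);
      	}
      
      	return ret > 0;
      }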
      
      [linmiaohe@huawei.com: extend comment per Oscar]
      [akpm@linux-foundation.org: reflow block comment]
      Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
      Fixes: a6b40850 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/userfaultfd: allow hugetlb change protection upon poison entry · c5977c95
      Peter Xu authored
      After UFFDIO_POISON, there can be two kinds of hugetlb pte markers,
      either the POISON one or the UFFD_WP one.
      
      Allow change protection to run on a poisoned marker just like the
      !hugetlb cases, ignoring the marker irrespective of the permission.
      
      Here the two bits are mutually exclusive.  For example, when installing
      a poisoned entry, it must not already be UFFD_WP (this is checked via
      pte_none() before such an install).  It also means that if UFFD_WP is
      set, there must be no POISON bit set.  That makes sense because UFFD_WP
      is a bit that reflects permission, and permissions do not apply if the
      pte is poisoned and destined to SIGBUS.
      
      So here we simply check whether the uffd_wp bit is set first, and do
      nothing otherwise.
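      
      In the hugetlb change-protection loop, the marker handling then has
      roughly this shape (an illustrative sketch, not the exact diff):
      
      else if (unlikely(is_pte_marker(pte))) {
      	/*
      	 * Only the uffd-wp bit is permission-relevant; a POISON
      	 * marker is left in place, as in the !hugetlb case.
      	 */
      	if (pte_marker_uffd_wp(pte) && uffd_wp_resolve)
      		huge_pte_clear(mm, address, ptep, psize);
      }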
      
      Attach the Fixes tag to the UFFDIO_POISON work, as before that it should
      not be possible to have a poison entry for hugetlb (e.g., hugetlb
      doesn't do swap, so no chance of swapin errors).
      
      Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com
      Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com
      Fixes: fc71884a ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: <syzbot+b07c8ac8eee3d4d8440f@syzkaller.appspotmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: fix printing of stack records · 74017458
      Oscar Salvador authored
      When the seq_* code sees that its buffer overflowed, it re-allocates a
      bigger one and calls the seq_operations->start() callback again.
      stack_start() naively thought that if it got called again, the old
      record must already have been printed, so it returned the next object;
      but that is not true.
      
      The consequence of that is that every time stack_stop() ->
      stack_start() are called because we needed a bigger buffer,
      stack_start() will skip entries, and those will not be printed.
      
      Fix it by not advancing to the next object in stack_start().
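      
      A sketch of the corrected callback (simplified): on re-entry it must
      return the current element and leave advancing to the ->next() hook:
      
      static void *stack_start(struct seq_file *m, loff_t *ppos)
      {
      	struct stack *stack;
      
      	if (!*ppos) {
      		/* First call: start from the head of the list. */
      		stack = smp_load_acquire(&stack_list);
      		m->private = stack;
      	} else {
      		/* Re-entered after a buffer resize: do not advance. */
      		stack = m->private;
      	}
      
      	return stack;
      }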
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-5-osalvador@suse.de
      Fixes: 765973a0 ("mm,page_owner: display all stacks and their count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: fix accounting of pages when migrating · 718b1f33
      Oscar Salvador authored
      Upon migration, newly allocated pages are given the handle of the old
      pages.  This is problematic because it means that for the stack which
      allocated the old page, we will be subtracting both the old page and the
      new one when that page is freed, creating an accounting imbalance.
      
      There is an interest in keeping it that way, as otherwise the output
      would be biased towards migration stacks should those operations occur
      often, but that is not really helpful.
      
      The link from the new page to the old stack is performed by calling
      __update_page_owner_handle() in __folio_copy_owner().  The only thing
      left is to link the migrate stack to the old page, so the old page will
      be subtracted from the migrate stack, thereby avoiding any possible
      imbalance.
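      
      Sketched against the helpers introduced earlier in this series (heavily
      simplified; the extra metadata arguments are elided):
      
      void __folio_copy_owner(struct folio *newfolio, struct folio *old)
      {
      	/* ... */
      	/* The new page inherits the old allocation handle... */
      	__update_page_owner_handle(new_ext, old_handle /* , ... */);
      	/* ...and the old page is accounted as freed by the migrate
      	 * stack, keeping both stacks balanced. */
      	__update_page_owner_free_handle(old_ext, migrate_handle /* , ... */);
      }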
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-4-osalvador@suse.de
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: fix refcount imbalance · f5c12105
      Oscar Salvador authored
      Current code does not contemplate scenarios where an allocation and a
      free operation on the same pages do not handle them in the same amounts
      at once.  To give an example, page_alloc_exact() will allocate a page of
      a high enough order to satisfy the size request, but will free the
      remainder right away.
      
      In the above example, we will increment the stack_record refcount only
      once, but we will decrease it as many times as the number of unused
      pages we have to free.  This will lead to a warning because of the
      refcount imbalance.
      
      Fix this by recording the number of base pages in the refcount field.
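      
      A sketch of the base-page accounting (helper names assumed): each
      allocation adds its number of base pages and each free subtracts the
      pages actually freed, so partial frees stay balanced:
      
      static void inc_stack_record_count(struct stack_record *stack,
      				   int nr_base_pages)
      {
      	refcount_add(nr_base_pages, &stack->count);
      }
      
      static void dec_stack_record_count(struct stack_record *stack,
      				   int nr_base_pages)
      {
      	if (refcount_sub_and_test(nr_base_pages, &stack->count))
      		pr_warn("%s: refcount went to 0\n", __func__);
      }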
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-3-osalvador@suse.de
      Reported-by: <syzbot+41bbfdb8d41003d12c0f@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/linux-mm/00000000000090e8ff0613eda0e5@google.com
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: update metadata for tail pages · ea4b5b33
      Oscar Salvador authored
      Patch series "page_owner: Fix refcount imbalance and print fixup", v4.
      
      This series consists of a refactoring/correctness fix for updating the
      metadata of tail pages, a couple of fixups for the refcounting part, and
      a fixup for the stack_start() function.
      
      From this series on, instead of counting the stacks, we count the
      outstanding nr_base_pages each stack has, which gives us a much better
      memory overview.  The other fixup is for the migration part.
      
      A more detailed explanation can be found in the changelog of the
      respective patches.

      This patch (of 4):
      
      __set_page_owner_handle() and __reset_page_owner() update the metadata
      of all pages when the page is of a higher order, but we fail to do the
      same when the pages are migrated.  __folio_copy_owner() only updates the
      metadata of the head page, meaning that the information stored in the
      first page and in the tail pages will not match.
      
      Strictly speaking that is not a big problem because 1) we do not print
      tail pages and 2) upon splitting all tail pages will inherit the metadata
      of the head page, but it is better to have all metadata in check should
      there be any problem, so it can ease debugging.
      
      For that purpose, a couple of helpers are created:
      __update_page_owner_handle(), which updates the metadata on allocation,
      and __update_page_owner_free_handle(), which does the same when the page
      is freed.
      
      __folio_copy_owner() will make use of both as it needs to entirely replace
      the page_owner metadata for the new page.
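      
      A sketch of the per-page walk such a helper centralizes (simplified; the
      exact signature and field set are assumed):
      
      static void __update_page_owner_handle(struct page_ext *page_ext,
      				       depot_stack_handle_t handle,
      				       unsigned short order)
      {
      	struct page_owner *page_owner;
      	int i;
      
      	/* Walk head and tail pages so their metadata stays in sync. */
      	for (i = 0; i < (1 << order); i++) {
      		page_owner = get_page_owner(page_ext);
      		page_owner->handle = handle;
      		page_owner->order = order;
      		page_ext = page_ext_next(page_ext);
      	}
      }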
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20240404070702.2744-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE · c0205eaf
      Lokesh Gidra authored
      Commit d7a08838 ("mm: userfaultfd: fix unexpected change to src_folio
      when UFFDIO_MOVE fails") moved the changing of src_folio->{mapping,
      index} to after clearing the page table and ensuring that the folio is
      not pinned.  This avoids failure of swapout+migration and possibly
      memory corruption.
      
      However, the commit missed fixing it in the huge-page case.
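      
      The huge-page path needs the same ordering (a sketch, not the exact
      diff; names follow the pte-level fix):
      
      src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
      if (folio_maybe_dma_pinned(src_folio)) {
      	err = -EBUSY;
      	goto unlock_ptls;
      }
      /* Only now is it safe to touch src_folio->{mapping, index}: */
      folio_move_anon_rmap(src_folio, dst_vma);
      src_folio->index = linear_page_index(dst_vma, dst_addr);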
      
      Link: https://lkml.kernel.org/r/20240404171726.2302435-1-lokeshgidra@google.com
      Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI")
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly · 631426ba
      David Hildenbrand authored
      Darrick reports that in some cases where pread() would fail with -EIO and
      mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
      MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
      
      While the madvise() call can be interrupted by a signal, this is not the
      desired behavior.  MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
      like page faults in that case: fail and not retry forever.
      
      A reproducer can be found at [1].
      
      The reason is that __get_user_pages(), as called by
      faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
      it will simply return 0 when VM_FAULT_RETRY happened, making
      madvise_populate()->faultin_vma_page_range() retry again and again, never
      setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().
      
      __get_user_pages_locked() does what we want, but duplicating that logic in
      faultin_vma_page_range() feels wrong.
      
      So let's use __get_user_pages_locked() instead, that will detect
      VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
      return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
      from faultin_page() to __get_user_pages(), all the way to
      madvise_populate().
      
      But, there is an issue: __get_user_pages_locked() will end up re-taking
      the MM lock and then __get_user_pages() will do another VMA lookup.  In
      the meantime, the VMA layout could have changed and we'd fail with
      different error codes than we'd want to.
      
      As __get_user_pages() will currently do a new VMA lookup either way, let
      it do the VMA handling in a different way, controlled by a new
      FOLL_MADV_POPULATE flag, effectively moving these checks from
      madvise_populate() + faultin_page_range() in there.
      
      With this change, Darrick's reproducer properly fails with -EFAULT, as
      documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.
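      
      A sketch of the resulting call in faultin_vma_page_range() (simplified;
      the exact flag set is abbreviated):
      
      long faultin_vma_page_range(struct vm_area_struct *vma,
      			    unsigned long start, unsigned long end,
      			    bool write, int *locked)
      {
      	unsigned long nr_pages = (end - start) / PAGE_SIZE;
      	unsigned int gup_flags = FOLL_TOUCH | FOLL_UNLOCKABLE |
      				 FOLL_MADV_POPULATE;
      
      	if (write)
      		gup_flags |= FOLL_WRITE;
      
      	/*
      	 * __get_user_pages_locked() detects VM_FAULT_RETRY and sets
      	 * FOLL_TRIED on retry, so a persistent fault eventually
      	 * propagates -EFAULT instead of looping forever.
      	 */
      	return __get_user_pages_locked(vma->vm_mm, start, nr_pages,
      				       NULL, locked, gup_flags);
      }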
      
      [1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/
      
      Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
      Fixes: 4ca9b385 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>