  1. Jun 25, 2021
    • mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array · b08e50dd
      Rasmus Villemoes authored
      If somebody called this with an already fully populated page_array,
      the last loop iteration would access beyond the end of page_array.

      It's of course extremely unlikely that would ever be done, but this
      triggers my internal static analyzer.  Also, if it really is not
      supposed to be invoked this way (i.e., with no NULL entries in
      page_array), the nr_populated < nr_pages check could simply be removed
      instead.
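
      A minimal sketch of the reordered loop condition implied by the title
      (surrounding code in mm/page_alloc.c abbreviated):

        /* before: page_array[nr_populated] was read before the bounds check */
        while (page_array && page_array[nr_populated] && nr_populated < nr_pages)
                nr_populated++;

        /* after: check the index first, so a fully populated array
         * terminates the loop without touching page_array[nr_pages] */
        while (page_array && nr_populated < nr_pages && page_array[nr_populated])
                nr_populated++;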
      
      Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk
      Fixes: 0f87d9d3 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: do not lock page again when me_huge_page() successfully recovers · ea6d0630
      Naoya Horiguchi authored
      Currently me_huge_page() temporarily unlocks the page to perform some
      actions, then locks it again later.  My testcase (which hard-offlines
      a tail page of a hugetlb page, then accesses an address in the hugetlb
      range) showed that the page allocation code detects this page lock on
      a buddy page and prints a "BUG: Bad page state" message.

      check_new_page_bad() does not consider a page with __PG_HWPOISON a bad
      page, so this flag works as a kind of filter, but the filtering doesn't
      work in this case because the "bad page" is not the actual hwpoisoned
      page.  So stop locking the page again.  The actions to be taken depend
      on the page type of the error, so page unlocking should be done by the
      ->action() callbacks.  Make that the assumed convention and change all
      existing callbacks accordingly.
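
      An illustrative sketch of the new convention (not the verbatim diff;
      me_pagecache_clean() stands in for any error_states ->action()
      handler):

        static int me_pagecache_clean(struct page *p, unsigned long pfn)
        {
                int ret = MF_RECOVERED;

                /* recovery work happens here, with the page still locked */

                unlock_page(p); /* each ->action() now drops the lock itself */
                return ret;
        }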
      
      Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
      Fixes: 78bb9203 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned · 47af12ba
      Aili Yao authored
      
      
      When memory_failure() is called with MF_ACTION_REQUIRED on a page that
      has already been hwpoisoned, memory_failure() can fail to send SIGBUS
      to the affected process, which results in an infinite loop of MCEs.

      Currently memory_failure() returns 0 when called for an already
      hwpoisoned page, so the caller, kill_me_maybe(), can return without
      sending SIGBUS to the current process.  An Action Required MCE is
      raised when the current process accesses the broken memory, so no
      SIGBUS means that the current process continues to run, accesses the
      error page again soon, and so runs into an MCE loop.

      This issue can arise, for example, in the following scenarios:

       - Two or more threads access the poisoned page concurrently.  If
         local MCE is enabled, the MCE handler handles each event
         independently, so there is a race among the MCE events, and the
         second and later threads fall into the situation in question.

       - If there was a preceding memory error event and memory_failure()
         for that event failed to unmap the error page for some reason, a
         subsequent memory access to the error page triggers the MCE loop.

      To fix the issue, make memory_failure() return an error code when the
      error page has already been hwpoisoned.  This allows the memory error
      handler to control how it sends signals to userspace.  Also make sure
      that any process touching a hwpoisoned page gets a SIGBUS even in the
      "already hwpoisoned" path of memory_failure(), as is done in the page
      fault path.
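
      A hedged sketch of the resulting control flow (the exact handler code
      differs in detail; user_addr stands in for the faulting user address):

        /* mm/memory-failure.c: "already poisoned" path returns an error */
        if (TestSetPageHWPoison(p)) {
                pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
                return -EHWPOISON;      /* was: return 0 */
        }

        /* x86 kill_me_maybe() (sketch): still signal the current task so
         * it cannot re-access the page and loop on MCEs */
        ret = memory_failure(pfn, flags);
        if (ret == -EHWPOISON)
                force_sig_mceerr(BUS_MCEERR_AR, user_addr, PAGE_SHIFT);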
      
      Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com
      Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jue Wang <juew@google.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure: use a mutex to avoid memory_failure() races · 171936dd
      Tony Luck authored
      
      
      Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.
      
      I wrote this patchset to materialize what I think is the currently
      allowable solution mentioned in the previous discussion [1].  I simply
      borrowed Tony's mutex patch and Aili's return code patch, then queued
      another one to find the error virtual address in a best-effort manner.
      I know that this is not a perfect solution, but it should work for the
      typical cases.
      
      [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
      
      This patch (of 2):
      
      There can be races when multiple CPUs consume poison from the same page.
      The first into memory_failure() atomically sets the HWPoison page flag
      and begins hunting for tasks that map this page.  Eventually it
      invalidates those mappings and may send a SIGBUS to the affected tasks.
      
      But while all that work is going on, other CPUs see a "success" return
      code from memory_failure() and so they believe the error has been
      handled and continue executing.
      
      Fix by wrapping most of the internal parts of memory_failure() in a
      mutex.
      
      [akpm@linux-foundation.org: make mf_mutex local to memory_failure()]
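
      A sketch of the resulting shape (memory_failure_body() is a
      hypothetical name for the pre-existing body of the function):

        int memory_failure(unsigned long pfn, int flags)
        {
                static DEFINE_MUTEX(mf_mutex);
                int res;

                mutex_lock(&mf_mutex);
                res = memory_failure_body(pfn, flags);
                mutex_unlock(&mf_mutex);

                return res;
        }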
      
      Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
      Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jue Wang <juew@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, futex: fix shared futex pgoff on shmem huge page · fe19bd3d
      Hugh Dickins authored
      If more than one futex is placed on a shmem huge page, it can happen
      that waking the second wakes the first instead, and leaves the second
      waiting: the key's shared.pgoff is wrong.
      
      When 3.11 commit 13d60f4b ("futex: Take hugepages into account when
      generating futex_key") was written, the only shared huge pages came
      from hugetlbfs, and the code added to deal with its exceptional
      page->index was put into the hugetlb source.  Then that was missed
      when 4.8 added shmem huge pages.

      page_to_pgoff() is what others use for this nowadays: except that, as
      currently written, it gives the right answer on a hugetlbfs head page,
      but nonsense on hugetlbfs tails.  Fix that by calling the
      hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well
      as on heads.
      
      Yes, it's unconventional to declare hugetlb_basepage_index() there in
      pagemap.h, rather than in hugetlb.h; but I do not expect anything but
      page_to_pgoff() ever to need it.
      
      [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
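
      A sketch of the fixed helper in include/linux/pagemap.h (PageHuge() is
      true for hugetlbfs heads and tails alike, so tails now take the
      hugetlb-specific path too):

        static inline pgoff_t page_to_pgoff(struct page *page)
        {
                if (unlikely(PageHuge(page)))
                        return hugetlb_basepage_index(page);

                return page_to_index(page);
        }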
      
      Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Reported-by: Neel Natu <neelnatu@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Zhang Yi <wetpzy@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() · 5fa54346
      Petr Mladek authored
      
      The system might hang with the following backtrace:
      
      	schedule+0x80/0x100
      	schedule_timeout+0x48/0x138
      	wait_for_common+0xa4/0x134
      	wait_for_completion+0x1c/0x2c
      	kthread_flush_work+0x114/0x1cc
      	kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
      	kthread_cancel_delayed_work_sync+0x18/0x2c
      	xxxx_pm_notify+0xb0/0xd8
      	blocking_notifier_call_chain_robust+0x80/0x194
      	pm_notifier_call_chain_robust+0x28/0x4c
      	suspend_prepare+0x40/0x260
      	enter_state+0x80/0x3f4
      	pm_suspend+0x60/0xdc
      	state_store+0x108/0x144
      	kobj_attr_store+0x38/0x88
      	sysfs_kf_write+0x64/0xc0
      	kernfs_fop_write_iter+0x108/0x1d0
      	vfs_write+0x2f4/0x368
      	ksys_write+0x7c/0xec
      
      It is caused by the following race between kthread_mod_delayed_work()
      and kthread_cancel_delayed_work_sync():
      
      CPU0				CPU1
      
      Context: Thread A		Context: Thread B
      
      kthread_mod_delayed_work()
        spin_lock()
        __kthread_cancel_work()
           spin_unlock()
           del_timer_sync()
      				kthread_cancel_delayed_work_sync()
      				  spin_lock()
      				  __kthread_cancel_work()
      				    spin_unlock()
      				    del_timer_sync()
      				    spin_lock()
      
      				  work->canceling++
      				  spin_unlock
           spin_lock()
         queue_delayed_work()
           // dwork is put into the worker->delayed_work_list
      
         spin_unlock()
      
      				  kthread_flush_work()
      				    // flush_work is put at the tail of the dwork
      
      				    wait_for_completion()
      
      Context: IRQ
      
        kthread_delayed_work_timer_fn()
          spin_lock()
          list_del_init(&work->node);
          spin_unlock()
      
      BANG: flush_work is no longer linked and will never get processed.

      The problem is that kthread_mod_delayed_work() checks the
      work->canceling flag before canceling the timer.

      A simple solution is to (re)check work->canceling after
      __kthread_cancel_work().  But then it is not clear what should be
      returned when __kthread_cancel_work() has removed the work from the
      queue (list) and it can't queue it again with the new @delay.

      The return value might be used for reference counting.  The caller has
      to know whether a new work has been queued or an existing one was
      replaced.

      The proper solution is for kthread_mod_delayed_work() to remove the
      work from the queue (list) _only_ when work->canceling is not set.  The
      flag must be checked after the timer is stopped, and the remaining
      operations can then be done under worker->lock.

      Note that kthread_mod_delayed_work() could remove the timer and then
      bail out.  That is fine.  The other canceling caller needs to cancel
      the timer as well.  The important thing is that the queue (list)
      manipulation is done atomically under worker->lock.
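
      An abbreviated sketch of the fixed core of kthread_mod_delayed_work()
      (never-queued and WARN paths omitted;
      kthread_cancel_delayed_work_timer() is the helper split out by the
      companion patch in this series):

        raw_spin_lock_irqsave(&worker->lock, flags);

        /* stop the timer; this may temporarily drop worker->lock */
        kthread_cancel_delayed_work_timer(work, &flags);

        /* re-check under worker->lock: if another caller is canceling,
         * leave the queue (list) alone */
        if (work->canceling) {
                ret = true;     /* the number of queued works is unchanged */
                goto out;
        }

        ret = __kthread_cancel_work(work);
        __kthread_queue_delayed_work(worker, dwork, delay);
        out:
        raw_spin_unlock_irqrestore(&worker->lock, flags);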
      
      Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com
      Fixes: 9a6b06c8 ("kthread: allow to modify delayed kthread work")
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Reported-by: Martin Liu <liumartin@google.com>
      Cc: <jenhaochen@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kthread_worker: split code for canceling the delayed work timer · 34b3d534
      Petr Mladek authored
      
      
      Patch series "kthread_worker: Fix race between kthread_mod_delayed_work()
      and kthread_cancel_delayed_work_sync()".
      
      This patchset fixes the race between kthread_mod_delayed_work() and
      kthread_cancel_delayed_work_sync() including proper return value
      handling.
      
      This patch (of 2):
      
      Simple code refactoring as a preparation step for fixing a race between
      kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().
      
      It does not modify the existing behavior.
      
      Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: <jenhaochen@google.com>
      Cc: Martin Liu <liumartin@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: unbreak kasan vmalloc support · 7ca3027b
      Daniel Axtens authored
      In commit 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings"),
      __vmalloc_node_range was changed such that __get_vm_area_node was no
      longer called with the requested/real size of the vmalloc allocation,
      but rather with a rounded-up size.
      
      This means that __get_vm_area_node called kasan_unpoison_vmalloc()
      with a rounded-up size rather than the real size.  This led to it
      allowing access to too much memory, so it missed vmalloc
      out-of-bounds accesses and failed the KASAN KUnit tests.

      Pass the real size and the desired shift into __get_vm_area_node.
      This allows it to round up the size for the underlying allocators
      while still unpoisoning the correct quantity of shadow memory.
      
      Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.
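
      A sketch of the resulting flow inside __get_vm_area_node()
      (abbreviated to the size handling; addr stands for the start of the
      reserved area):

        static struct vm_struct *__get_vm_area_node(unsigned long size,
                        unsigned long align, unsigned long shift,  /* new */
                        unsigned long flags, unsigned long start,
                        unsigned long end, int node, gfp_t gfp_mask,
                        const void *caller)
        {
                unsigned long requested_size = size;    /* real size, saved */

                size = ALIGN(size, 1ul << shift);  /* rounded for the allocator */
                /* ... reserve the vmap area using the rounded size ... */

                /* unpoison only what the caller actually asked for */
                kasan_unpoison_vmalloc((void *)addr, requested_size);
                /* ... */
        }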
      
      Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335
      Fixes: 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings")
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Tested-by: David Gow <davidgow@google.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Tested-by: Andrey Konovalov <andreyknvl@gmail.com>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • KVM: s390: prepare for hugepage vmalloc · 185cca24
      Claudio Imbrenda authored
      The Create Secure Configuration Ultravisor Call does not support using
      large pages for the virtual memory area.  This is a hardware limitation.
      
      This patch replaces the vzalloc call with an almost equivalent call to
      the newly introduced vmalloc_no_huge function, which guarantees that
      only small pages will be used for the backing.
      
      The new call will not clear the allocated memory, but that has never
      been an actual requirement.
      
      Link: https://lkml.kernel.org/r/20210614132357.10202-3-imbrenda@linux.ibm.com
      Fixes: 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings")
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: add vmalloc_no_huge · 15a64f5a
      Claudio Imbrenda authored
      Patch series "mm: add vmalloc_no_huge and use it", v4.
      
      Add vmalloc_no_huge() and export it, so modules can allocate memory with
      small pages.
      
      Use the newly added vmalloc_no_huge() in KVM on s390 to get around a
      hardware limitation.
      
      This patch (of 2):
      
      Commit 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings") added
      support for hugepage vmalloc mappings; it also added the flag
      VM_NO_HUGE_VMAP for __vmalloc_node_range to request that the
      allocation be performed with 0-order, non-huge pages.

      This flag is not accessible when calling vmalloc; the only option is
      to call __vmalloc_node_range directly, which is not exported.

      This means that a module can't vmalloc memory with small pages.
      
      Case in point: KVM on s390x needs to vmalloc a large area, and it needs
      to be mapped with non-huge pages, because of a hardware limitation.
      
      This patch adds the function vmalloc_no_huge, which works like
      vmalloc but is guaranteed to always back the mapping with small
      pages.  The new function is exported, so it is usable by modules.
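
      The new helper is essentially a thin wrapper (sketch, per the
      description above):

        void *vmalloc_no_huge(unsigned long size)
        {
                return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
                                            GFP_KERNEL, PAGE_KERNEL, VM_NO_HUGE_VMAP,
                                            NUMA_NO_NODE, __builtin_return_address(0));
        }
        EXPORT_SYMBOL(vmalloc_no_huge);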
      
      [akpm@linux-foundation.org: whitespace fixes, per Christoph]
      
      Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com
      Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com
      Fixes: 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings")
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nilfs2: fix memory leak in nilfs_sysfs_delete_device_group · 8fd0c1b0
      Pavel Skripkin authored
      My local syzbot instance hit a memory leak in nilfs2.  The problem was
      a missing kobject_put() in nilfs_sysfs_delete_device_group().

      kobject_del() does not call kobject_cleanup() for the passed kobject,
      which leads to leaking the duped kobject name if kobject_put() is not
      called.
      
      Fail log:
      
        BUG: memory leak
        unreferenced object 0xffff8880596171e0 (size 8):
        comm "syz-executor379", pid 8381, jiffies 4294980258 (age 21.100s)
        hex dump (first 8 bytes):
          6c 6f 6f 70 30 00 00 00                          loop0...
        backtrace:
           kstrdup+0x36/0x70 mm/util.c:60
           kstrdup_const+0x53/0x80 mm/util.c:83
           kvasprintf_const+0x108/0x190 lib/kasprintf.c:48
           kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
           kobject_add_varg lib/kobject.c:384 [inline]
           kobject_init_and_add+0xc9/0x160 lib/kobject.c:473
           nilfs_sysfs_create_device_group+0x150/0x800 fs/nilfs2/sysfs.c:999
           init_nilfs+0xe26/0x12b0 fs/nilfs2/the_nilfs.c:637
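
      The fix, sketched (teardown of the child sysfs groups abbreviated):

        void nilfs_sysfs_delete_device_group(struct the_nilfs *nilfs)
        {
                /* ... delete the child groups first ... */
                kobject_del(&nilfs->ns_dev_kobj);
                /* drop the reference so kobject_cleanup() frees the duped name */
                kobject_put(&nilfs->ns_dev_kobj);
        }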
      
      Link: https://lkml.kernel.org/r/20210612140559.20022-1-paskripkin@gmail.com
      Fixes: da7141fb ("nilfs2: add /sys/fs/nilfs2/<device> group")
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Michael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() · a7a69d8b
      Hugh Dickins authored
      Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
      ptlock in the PVMW_SYNC case? That too might have been responsible for
      BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
      I've never seen any.
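
      The added check, roughly (a sketch of the hunk in the pte_none() scan;
      mm is the walk's mm_struct):

        if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
                pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
                spin_lock(pvmw->ptl);
        }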
      
      Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com
      Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      Fixes: ace71a19 ("mm: introduce page_vma_mapped_walk()")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes · a9a7504d
      Hugh Dickins authored
      Running certain tests with a DEBUG_VM kernel would crash within hours,
      on the total_mapcount BUG() in split_huge_page_to_list(), while trying
      to free up some memory by punching a hole in a shmem huge page: split's
      try_to_unmap() was unable to find all the mappings of the page (which,
      on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
      
      Crash dumps showed two tail pages of a shmem huge page remained mapped
      by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
      of a long unmapped range; and no page table had yet been allocated for
      the head of the huge page to be mapped into.
      
      Although designed to handle these odd misaligned huge-page-mapped-by-pte
      cases, page_vma_mapped_walk() falls short by returning false prematurely
      when !pmd_present or !pud_present or !p4d_present or !pgd_present:
      there are cases when a huge page may span the boundary, with ptes
      present in the next page table.
      
      Restructure page_vma_mapped_walk() as a loop to continue in these cases,
      while keeping its layout much as before.  Add a step_forward() helper to
      advance pvmw->address across those boundaries: originally I tried to use
      mm's standard p?d_addr_end() macros, but hit the same crash 512 times
      less often: because of the way redundant levels are folded together, but
      folded differently in different configurations, it was just too
      difficult to use them correctly; and step_forward() is simpler anyway.
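
      The helper, roughly as described (a sketch; the ULONG_MAX clamp guards
      against wrap-around at the top of the address space):

        static void step_forward(struct page_vma_mapped_walk *pvmw,
                                 unsigned long size)
        {
                /* round up to the next size-aligned boundary */
                pvmw->address = (pvmw->address + size) & ~(size - 1);
                if (!pvmw->address)
                        pvmw->address = ULONG_MAX;
        }

      Callers advance with, e.g., step_forward(pvmw, PMD_SIZE) when a whole
      pmd turns out not to be present.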
      
      Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com
      Fixes: ace71a19 ("mm: introduce page_vma_mapped_walk()")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): get vma_address_end() earlier · a765c417
      Hugh Dickins authored
      
      
      page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the
      start, rather than later at next_pte.
      
      It's a little unnecessary overhead on the first call, but makes for a
      simpler loop in the following commit.
      
      Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): use goto instead of while (1) · 47446630
      Hugh Dickins authored
      
      
      page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
      and use "goto this_pte", in place of the "while (1)" loop at the end.
      
      Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): add a level of indentation · b3807a91
      Hugh Dickins authored
      
      
      page_vma_mapped_walk() cleanup: add a level of indentation to much of
      the body, making no functional change in this commit, but reducing the
      later diff when this is all converted to a loop.
      
      [hughd@google.com: page_vma_mapped_walk(): add a level of indentation fix]
        Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com
      
      Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): crossing page table boundary · 44828248
      Hugh Dickins authored
      
      
      page_vma_mapped_walk() cleanup: adjust the test for crossing page table
      boundary - I believe pvmw->address is always page-aligned, but nothing
      else here assumed that; and remember to reset pvmw->pte to NULL after
      unmapping the page table, though I never saw any bug from that.
      
      Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block · e2e1d407
      Hugh Dickins authored
      
      
      page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
      follow the same "return not_found, return not_found, return true"
      pattern as the block above it (note: returning not_found there is never
      premature, since existence or prior existence of huge pmd guarantees
      good alignment).
      
      Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd · 3306d311
      Hugh Dickins authored
      
      
      page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
      use it in subsequent tests, instead of repeatedly dereferencing pointer.
      
      Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): settle PageHuge on entry · 6d0fd598
      Hugh Dickins authored
      
      
      page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of
      the way at the start, so no need to worry about it later.
      
      Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_vma_mapped_walk(): use page for pvmw->page · f003c03b
      Hugh Dickins authored
      
      
      Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes".
      
      I've marked all of these for stable: many are merely cleanups, but I
      think they are much better before the main fix than after.
      
      This patch (of 11):
      
      page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page
      was used, sometimes pvmw->page itself: use the local copy "page"
      throughout.
      
      Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com
      Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'mmc-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · 4a09d388
      Linus Torvalds authored
      Pull MMC fix from Ulf Hansson:
       "Use memcpy_to/fromio for dram-access-quirk in the meson-gx host
        driver"
      
      * tag 'mmc-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: meson-gx: use memcpy_to/fromio for dram-access-quirk
    • Merge tag 'core-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7749b033
      Linus Torvalds authored
      Pull sigqueue cache fix from Ingo Molnar:
       "Fix a memory leak in the recently introduced sigqueue cache"
      
      * tag 'core-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        signal: Prevent sigqueue caching after task got released
  2. Jun 24, 2021
    • Merge tag 'sched-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 66675170
      Linus Torvalds authored
      Pull scheduler fix from Ingo Molnar:
       "A last minute cgroup bandwidth scheduling fix for a recently
        introduced logic fail which triggered a kernel warning by LTP's
        cfs_bandwidth01 test"
      
      * tag 'sched-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/fair: Ensure that the CFS parent is added after unthrottling
    • Merge tag 'perf-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · df501100
      Linus Torvalds authored
      Pull x86 perf fix from Ingo Molnar:
       "An LBR buffer fix for code that probably only worked accidentally"
      
      * tag 'perf-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel/lbr: Zero the xstate buffer on allocation
    • Merge tag 'objtool-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c0e45785
      Linus Torvalds authored
      Pull objtool fixes from Ingo Molnar:
       "Address a number of objtool warnings that got reported.
      
        No change in behavior intended, but code generation might be impacted
        by commit 1f008d46 ("x86: Always inline task_size_max()")"
      
      * tag 'objtool-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/lockdep: Improve noinstr vs errors
        x86: Always inline task_size_max()
        x86/xen: Fix noinstr fail in exc_xen_unknown_trap()
        x86/xen: Fix noinstr fail in xen_pv_evtchn_do_upcall()
        x86/entry: Fix noinstr fail in __do_fast_syscall_32()
        objtool/x86: Ignore __x86_indirect_alt_* symbols
    • perf/x86/intel/lbr: Zero the xstate buffer on allocation · 7f049fbd
      Thomas Gleixner authored
      XRSTORS requires a valid xstate buffer to work correctly. XSAVES does not
      guarantee to write a fully valid buffer according to the SDM:
      
        "XSAVES does not write to any parts of the XSAVE header other than the
         XSTATE_BV and XCOMP_BV fields."
      
      XRSTORS triggers a #GP:
      
        "If bytes 63:16 of the XSAVE header are not all zero."
      
      It's dubious at best how this can work at all when the buffer is not zeroed
      before use.
      
      Allocate the buffers with __GFP_ZERO to prevent XRSTORS failure.
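
      The shape of the fix, sketched (names follow the upstream allocation
      site in reserve_lbr_buffers(), abbreviated):

        cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache,
                                                GFP_KERNEL | __GFP_ZERO,
                                                cpu_to_node(cpu));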
      
      Fixes: ce711ea3 ("perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/87wnr0wo2z.ffs@nanos.tec.linutronix.de
    • Merge tag 'spi-fix-v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · 7426cedc
      Linus Torvalds authored
      Pull spi fixes from Mark Brown:
       "A couple of small, driver specific fixes that arrived in the past few
        weeks"
      
      * tag 'spi-fix-v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: spi-nxp-fspi: move the register operation after the clock enable
        spi: tegra20-slink: Ensure SPI controller reset is deasserted
    • Merge tag 'pm-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 7266f203
      Linus Torvalds authored
      Pull power management fix from Rafael Wysocki:
       "Revert a recent PCI power management commit that causes initialization
        issues to appear on some systems"
      
      * tag 'pm-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"
    • Merge branch 'stable/for-linus-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb · 8fd2ed1c
      Linus Torvalds authored
      Pull swiotlb fix from Konrad Rzeszutek Wilk:
       "A fix for the regression for the DMA operations where the offset was
        ignored and corruption would appear.

        Going forward there will be cleanups to make the offset and
        alignment logic clearer, and better test cases to help with this"
      
      * 'stable/for-linus-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
        swiotlb: manipulate orig_addr when tlb_addr has offset