  1. Oct 17, 2020
    • mm/page_alloc: drop stale pageblock comment in memmap_init_zone*() · 4eb29bd9
      David Hildenbrand authored
      Commit ac5d2539 ("mm: meminit: reduce number of times pageblocks are
      set during struct page init") moved the actual zone range check, leaving
      only the alignment check for pageblocks.
      
      Let's drop the stale comment and make the pageblock check easier to read.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-9-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: simplify page onlining · aac65321
      David Hildenbrand authored
      
      
      We don't allow offlining memory with holes, all boot memory is online,
      and hotplugged memory cannot have holes.
      
      We can now simplify onlining of pages.  As we only support
      onlining/offlining full sections and sections always span full
      MAX_ORDER_NR_PAGES, we can just process pages in chunks of order
      MAX_ORDER - 1 without further special handling.
      
      The number of onlined pages simply corresponds to the number of pages we
      were requested to online.
      
      While at it, refine the comment regarding the callback not exposing all
      pages to the buddy.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_isolation: simplify return value of start_isolate_page_range() · 3fa0c7c7
      David Hildenbrand authored
      
      
      Callers no longer need the number of isolated pageblocks.  Let's simplify.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages() · ea15153c
      David Hildenbrand authored
      
      
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages(), and we only support onlining/offlining full sections.
      Both sections and pageblocks are a power of two in size, and sections
      always span full pageblocks.
      
      We can directly calculate the number of isolated pageblocks from nr_pages.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-6-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: simplify __offline_isolated_pages() · 257bea71
      David Hildenbrand authored
      
      
      offline_pages() is the only user.  __offline_isolated_pages() never gets
      called with ranges that contain memory holes and we no longer care about
      the return value.  Drop the return value handling and all pfn_valid()
      checks.
      
      Update the documentation.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: simplify page offlining · 0a1a9a00
      David Hildenbrand authored
      
      
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages().  We no longer need walk_system_ram_range() and can
      call test_pages_isolated() and __offline_isolated_pages() directly.
      
      offlined_pages always corresponds to nr_pages, so we can simplify that.
      
      [akpm@linux-foundation.org: patch conflict resolution]
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: enforce section granularity when onlining/offlining · 4986fac1
      David Hildenbrand authored
      
      
      Already two people (including me) tried to offline subsections, because
      the function looks like it can deal with them.  But we really can only
      online/offline full sections that are properly aligned (e.g., we can only
      mark full sections online/offline via SECTION_IS_ONLINE).
      
      Add a simple safety net to document the restriction now.  Current users
      (core and powernv/memtrace) respect these restrictions.
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: inline __offline_pages() into offline_pages() · 73a11c96
      David Hildenbrand authored
      
      
      Patch series "mm/memory_hotplug: online_pages()/offline_pages() cleanups", v2.
      
      These are a bunch of cleanups for online_pages()/offline_pages() and
      related code, mostly getting rid of memory hole handling that is no longer
      necessary.  There is only a single walk_system_ram_range() call left in
      offline_pages(), to make sure we don't have any memory holes.  I had some
      of these patches lying around for a longer time but didn't have time to
      polish them.
      
      In addition, the last patch marks all pageblocks of memory to get onlined
      MIGRATE_ISOLATE, so pages that have just been exposed to the buddy cannot
      get allocated before onlining is complete.  Once heavy lifting is done,
      the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
      possible.
      
      I played with DIMMs and virtio-mem on x86-64 and didn't spot any
      surprises.  I verified that the number of isolated pageblocks is correctly
      handled when onlining/offlining.
      
      This patch (of 10):
      
      There is only a single user, offline_pages().  Let's inline it, to make
      it look more similar to online_pages().
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Link: https://lkml.kernel.org/r/20200819175957.28465-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20200819175957.28465-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: fix mmget() assert in __mmu_interval_notifier_insert · c9682d10
      Jann Horn authored
      The comment talks about having to hold mmget() (which means mm_users), but
      the actual check is on mm_count (which would be mmgrab()).
      
      Given that MMU notifiers are torn down in mmput() -> __mmput() ->
      exit_mmap() -> mmu_notifier_release(), I believe that the comment is
      correct and the check should be on mm->mm_users.  Fix it up accordingly.
      
      Fixes: 99cb252f ("mm/mmu_notifier: add an interval tree notifier")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christian König <christian.koenig@amd.com>
      Link: https://lkml.kernel.org/r/20200901000143.207585-1-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/util.c: update the kerneldoc for kstrdup_const() · 295a1730
      Bartosz Golaszewski authored
      
      
      Memory allocated with kstrdup_const() must not be passed to regular
      krealloc() as it is not aware of the possibility of the chunk residing in
      .rodata.  Since there are no potential users of krealloc_const() at the
      moment, let's just update the doc to make it explicit.
      
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200817173927.23389-1-brgl@bgdev.pl
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmstat.c: use helper macro abs() · 40610076
      Miaohe Lin authored
      
      
      Use helper macro abs() to simplify the "x > t || x < -t" cmp.
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20200905084008.15748-1-linmiaohe@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_poison.c: replace bool variable with static key · 11c9c7ed
      Mateusz Nosek authored
      
      
      Variable 'want_page_poisoning' is a switch deciding if page poisoning
      should be enabled.  This patch changes it to be a static key.
      
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Link: https://lkml.kernel.org/r/20200921152931.938-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: try to narrow window race for free pages · b94e0282
      Oscar Salvador authored
      
      
      Aristeu Rozanski reported that a customer test case started to report
      -EBUSY after the hwpoison rework patchset.
      
      There is a race window between spotting a free page and taking it off its
      buddy freelist, so it might be that by the time we try to take it off, the
      page has been already allocated.
      
      This patch handles such a race window by retrying with the page's new
      type if the page was allocated under us.
      
      Reported-by: Aristeu Rozanski <aris@ruivo.org>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Aristeu Rozanski <aris@ruivo.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-15-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: double-check page count in __get_any_page() · 1f2481dd
      Naoya Horiguchi authored
      
      
      Soft offlining could fail with EIO due to a race condition with hugepage
      migration.  This issue became visible due to the change in the previous
      patch that makes the soft offline handler take the page refcount on its
      own.  We have no way to directly pin a zero-refcount page, and a page
      considered to have zero refcount could be allocated just after the first
      check.
      
      This patch adds a second check to detect the race and gives us a chance
      to handle it more reliably.
      
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-14-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: introduce MF_MSG_UNSPLIT_THP · 5d1fd5dc
      Naoya Horiguchi authored
      
      
      memory_failure() is supposed to call action_result() when it handles a
      memory error event, but there's one missing case.  So let's add it.
      
      I found that include/ras/ras_event.h leaves some other MF_MSG_* values
      undefined, so this patch adds them as well.
      
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-13-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: return 0 if the page is already poisoned in soft-offline · 5a2ffca3
      Oscar Salvador authored
      
      
      Currently, there is an inconsistency when calling soft-offline from
      different paths on a page that is already poisoned.
      
      1) madvise:
      
              madvise_inject_error skips any poisoned page and continues
              the loop.
              If that was the only page to madvise, it returns 0.
      
      2) /sys/devices/system/memory/:
      
              When calling soft_offline_page_store()->soft_offline_page(),
              we return -EBUSY in case the page is already poisoned.
              This is inconsistent with a) the above example and b)
              memory_failure, where we return 0 if the page was poisoned.
      
      Fix this by dropping the PageHWPoison() check in madvise_inject_error, and
      let soft_offline_page return 0 if it finds the page already poisoned.
      
      Please note that this represents a user-API change, since the error
      returned when calling soft_offline_page_store()->soft_offline_page() will
      now be different.
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page · 6b9a217e
      Oscar Salvador authored
      
      
      Merging soft_offline_huge_page and __soft_offline_page lets us get rid of
      quite a bit of duplicated code, and makes the code much easier to follow.
      
      Now, __soft_offline_page will handle both normal and hugetlb pages.
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-11-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: rework soft offline for in-use pages · 79f5f8fa
      Oscar Salvador authored
      
      
      This patch changes the way we set and handle in-use poisoned pages.  Until
      now, poisoned pages were released to the buddy allocator, trusting that
      the checks that take place at allocation time would act as a safety net
      and would skip that page.
      
      This has proved to be wrong, as there are pfn walkers out there, like
      compaction, that only care whether the page sits on a buddy freelist.
      
      Although this might not be the only user, having poisoned pages in the
      buddy allocator seems a bad idea as we should only have free pages that
      are ready and meant to be used as such.
      
      Before explaining the taken approach, let us break down the kind of pages
      we can soft offline.
      
      - Anonymous THP (after the split, they end up being 4K pages)
      - Hugetlb
      - Order-0 pages (that can be either migrated or invalidated)
      
      * Normal pages (order-0 and anon-THP)
      
        - If they are clean and unmapped page cache pages, we invalidate
          them by means of invalidate_inode_page().
        - If they are mapped/dirty, we do the isolate-and-migrate dance.
      
      Either way, we do not call put_page directly from those paths.  Instead,
      we keep the page and send it to page_handle_poison to perform the right
      handling.
      
      page_handle_poison sets the HWPoison flag and does the last put_page.
      
      Down the chain, we placed a check for HWPoison page in
      free_pages_prepare, that just skips any poisoned page, so those pages
      do not end up in any pcplist/freelist.
      
      After that, we set the refcount on the page to 1 and we increment
      the poisoned pages counter.
      
      If we see that the check in free_pages_prepare creates trouble, we can
      always do what we do for free pages:
      
        - wait until the page hits buddy's freelists
        - take it off, and flag it
      
      The downside of the above approach is that we could race with an
      allocation, so by the time we want to take the page off the buddy, the
      page has already been allocated and we cannot soft offline it.
      But the user can always retry it.
      
      * Hugetlb pages
      
        - We isolate-and-migrate them
      
      After the migration has been successful, we call dissolve_free_huge_page
      and set HWPoison on the page if that succeeds.
      Hugetlb has slightly different handling, though.
      
      While for non-hugetlb pages we cared about closing the race with an
      allocation, doing so for hugetlb pages requires quite some additional
      and intrusive code (we would need to hook into free_huge_page and some
      other places).  So I decided not to make the code overly complicated and
      to just fail normally if the page was allocated in the meantime.
      
      We can always build on top of this.
      
      As a bonus, because of the way we now handle in-use pages, we no longer
      need the put-as-isolation-migratetype dance that was guarding against
      poisoned pages ending up in pcplists.
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: rework soft offline for free pages · 06be6ff3
      Oscar Salvador authored
      
      
      When trying to soft-offline a free page, we need to first take it off the
      buddy allocator.  Once we know it is out of reach, we can safely flag it
      as poisoned.
      
      take_page_off_buddy will be used to take a page meant to be poisoned off
      the buddy allocator.  take_page_off_buddy calls break_down_buddy_pages,
      which splits a higher-order page in case our page belongs to one.
      
      Once the page is under our control, we call page_handle_poison to set it
      as poisoned and grab a refcount on it.
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: unify THP handling for hard and soft offline · 694bf0b0
      Oscar Salvador authored
      
      
      Place the THP's page handling in a helper and use it from both hard and
      soft-offline machinery, so we get rid of some duplicated code.
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-8-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: kill put_hwpoison_page · dd6e2402
      Oscar Salvador authored
      After commit 4e41a30c ("mm: hwpoison: adjust for new thp refcounting"),
      put_hwpoison_page got reduced to a put_page.  Let us just use put_page
      instead.
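      The cleanup can be sketched in plain userspace C; the struct and both
      helpers below are illustrative stand-ins, not the kernel's real
      implementation:

```c
#include <assert.h>

/* Toy stand-in for struct page: just a reference count. */
struct page { int refcount; };

/* Stand-in for the kernel's put_page(): drop one reference. */
static void put_page(struct page *p)
{
    p->refcount--;
}

/* After commit 4e41a30c, put_hwpoison_page() did nothing beyond
 * calling put_page(), so every caller can use put_page() directly. */
static void put_hwpoison_page(struct page *p)
{
    put_page(p);
}
```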
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-7-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: refactor madvise_inject_error · dc7560b4
      Oscar Salvador authored
      
      
      Make a proper if-else condition for {hard,soft}-offline.
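      As a rough sketch (hypothetical enum values and simplified handler
      signatures, not the kernel's real ones), the refactor gives each
      injection mode exactly one branch:

```c
#include <assert.h>

/* Hypothetical stand-ins for the two madvise injection modes. */
enum injection_mode { INJECT_HARD_OFFLINE, INJECT_SOFT_OFFLINE };

/* Simplified handler stubs; the return values are just markers. */
static int soft_offline_page(unsigned long pfn) { (void)pfn; return 1; }
static int memory_failure(unsigned long pfn)    { (void)pfn; return 2; }

/* A proper if-else: exactly one of the two paths runs per call. */
static int madvise_inject_error(enum injection_mode mode, unsigned long pfn)
{
    if (mode == INJECT_SOFT_OFFLINE)
        return soft_offline_page(pfn);
    else
        return memory_failure(pfn);
}
```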
      
      Signed-off-by: Oscar Salvador <osalvador@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Link: https://lkml.kernel.org/r/20200908075626.11976-3-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: unexport get_hwpoison_page and make it static · 7e27f22c
      Oscar Salvador authored
      
      
      Since get_hwpoison_page is only used in memory-failure code now, let us
      un-export it and make it private to that code.
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-5-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison-inject: don't pin for hwpoison_filter · fd476720
      Naoya Horiguchi authored
      
      
      Another memory error injection interface, debugfs:hwpoison/corrupt-pfn,
      also takes a bogus refcount for hwpoison_filter().  It's justified
      because this does a coarse filter, expecting that memory_failure()
      redoes the check for sure.
      
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-4-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundatio...>
    • mm, hwpoison: remove recalculating hpage · 1b473bec
      Naoya Horiguchi authored
      
      
      hpage is never used after try_to_split_thp_page() in memory_failure(),
      so we don't have to update it.  Let's not recalculate/use hpage.
      
      Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-3-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: cleanup unused PageHuge() check · 7d9d46ac
      Naoya Horiguchi authored
      
      
      Patch series "HWPOISON: soft offline rework", v7.
      
      This patchset fixes a couple of issues that the patchset Naoya sent [1]
      contained due to rebasing problems and a misunderstanding.
      
      Main focus of this series is to stabilize soft offline.  Historically
      soft offlined pages have suffered from racy conditions because
      PageHWPoison is used a little too aggressively, which (directly or
      indirectly) invades other mm code which cares little about hwpoison.
      This results in unexpected behavior or kernel panics, which is very far
      from soft offline's "do not disturb userspace or other kernel
      components" policy.  An example of this can be found here [2].
      
      Along with several cleanups, this code refactors and changes the way
      soft offline works.  The main point of this change set is to contain
      the target page either via the buddy allocator or in the migration
      path.  For the former, we first free the target page as we do for
      normal pages, and once it has reached buddy and has been taken off the
      freelists, we flag it as HWpoison.  For the latter, we never get to
      release the page in unmap_and_move, so the page is under our control
      and we can handle it in hwpoison code.
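      The "contain via buddy allocator" idea above can be modelled in
      userspace C; everything here (freelist flag, helper names) is a toy
      model of the cover letter's description, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: whether it sits on a freelist, plus the poison flag. */
struct page { bool on_freelist; bool hwpoison; };

/* Free the page as we would any normal page. */
static void free_to_buddy(struct page *p)
{
    p->on_freelist = true;
}

/* Isolate the page so no allocator path can hand it out again. */
static bool take_off_freelist(struct page *p)
{
    if (!p->on_freelist)
        return false;
    p->on_freelist = false;
    return true;
}

/* Soft offline of a free page: free it, and only once it has been
 * taken off the freelist, flag it as HWPoison. */
static bool soft_offline_free_page(struct page *p)
{
    free_to_buddy(p);
    if (!take_off_freelist(p))
        return false;
    p->hwpoison = true;
    return true;
}
```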
      
      [1] https://patchwork.kernel.org/cover/11704083/
      [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
      
      This patch (of 14):
      
      Drop the PageHuge check, which is dead code since memory_failure()
      forks into memory_failure_hugetlb() for hugetlb pages.
      
      memory_failure() and memory_failure_hugetlb() share some functions like
      hwpoison_user_mappings() and identify_page_state(), so they should
      properly handle 4kB pages, THP, and hugetlb.
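      The dispatch that makes the dropped check dead code can be sketched as
      follows (toy types and marker return values; only the shape of the fork
      matters):

```c
#include <assert.h>
#include <stdbool.h>

struct page { bool is_hugetlb; };   /* stand-in for PageHuge() state */

/* Marker return value distinguishes the hugetlb path in the test. */
static int memory_failure_hugetlb(struct page *p) { (void)p; return 1; }

static int memory_failure(struct page *p)
{
    /* hugetlb pages fork off up front ... */
    if (p->is_hugetlb)
        return memory_failure_hugetlb(p);
    /* ... so a PageHuge() check on this path can never fire. */
    return 0;
}
```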
      
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Oscar Salvador <osalvador@suse.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20200922135650.1634-2-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead: pass a file_ra_state into force_page_cache_ra · b1647dc0
      David Howells authored
      
      
      The file_ra_state being passed into page_cache_sync_readahead() was being
      ignored in favour of using the one embedded in the struct file.  The only
      caller for which this makes a difference is the fsverity code if the file
      has been marked as POSIX_FADV_RANDOM, but it's confusing and worth fixing.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-10-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap: fold ra_submit into do_sync_mmap_readahead · db660d46
      David Howells authored
      
      
      Fold ra_submit() into its last remaining user and pass the
      readahead_control struct to both do_page_cache_ra() and
      page_cache_sync_ra().
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-9-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead: add page_cache_sync_ra and page_cache_async_ra · fefa7c47
      Matthew Wilcox (Oracle) authored
      
      
      Reimplement page_cache_sync_readahead() and page_cache_async_readahead()
      as wrappers around versions of the function which take a readahead_control
      in preparation for making do_sync_mmap_readahead() pass down an RAC
      struct.
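      The wrapper pattern can be illustrated with simplified stand-in types
      (the real readahead_control and page_cache_sync_ra() signatures differ):

```c
#include <assert.h>
#include <stddef.h>

struct file;            /* opaque stand-ins */
struct address_space;

/* Simplified readahead_control: bundles the arguments the old API
 * passed individually. */
struct readahead_control {
    struct file *file;
    struct address_space *mapping;
    unsigned long index;
};

/* New-style entry point taking the control struct; the return value
 * is just a marker for this sketch. */
static unsigned long page_cache_sync_ra(struct readahead_control *rac,
                                        unsigned long req_count)
{
    return rac->index + req_count;
}

/* Old entry point, reimplemented as a thin wrapper. */
static unsigned long page_cache_sync_readahead(struct address_space *mapping,
                                               struct file *file,
                                               unsigned long index,
                                               unsigned long req_count)
{
    struct readahead_control rac = {
        .file = file,
        .mapping = mapping,
        .index = index,
    };
    return page_cache_sync_ra(&rac, req_count);
}
```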
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-8-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead: pass readahead_control to force_page_cache_ra · 7b3df3b9
      David Howells authored
      
      
      Reimplement force_page_cache_readahead() as a wrapper around
      force_page_cache_ra().  Pass the existing readahead_control from
      page_cache_sync_readahead().
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-7-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead: make ondemand_readahead take a readahead_control · 6e4af69a
      David Howells authored
      
      
      Make ondemand_readahead() take a readahead_control struct in preparation
      for making do_sync_mmap_readahead() pass down an RAC struct.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead: make do_page_cache_ra take a readahead_control · 8238287e
      Matthew Wilcox (Oracle) authored
      
      
      Rename __do_page_cache_readahead() to do_page_cache_ra() and call it
      directly from ondemand_readahead() instead of indirecting via ra_submit().
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-5-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead: make page_cache_ra_unbounded take a readahead_control · 73bb49da
      Matthew Wilcox (Oracle) authored
      
      
      Define it in the callers instead of in page_cache_ra_unbounded().
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-4-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead: add DEFINE_READAHEAD · 1aa83cfa
      Matthew Wilcox (Oracle) authored
      
      
      Patch series "Readahead patches for 5.9/5.10".
      
      These are infrastructure for both the THP patchset and for the fscache
      rewrite.
      
      For both pieces of infrastructure being built on top of this patchset,
      we want the ractl to be available higher in the call-stack.
      
      For David's work, he wants to add the 'critical page' to the ractl so that
      he knows which page NEEDS to be brought in from storage, and which ones
      are nice-to-have.  We might want something similar in block storage too.
      It used to be simple -- the first page was the critical one, but then mmap
      added fault-around and so for that usecase, the middle page is the
      critical one.  Anyway, I don't have any code to show that yet, we just
      know that the lowest point in the callchain where we have that information
      is do_sync_mmap_readahead() and so the ractl needs to start its life
      there.
      
      For THP, we have the code that needs it.  It's actually the apex patch
      of the series; the one which finally starts to allocate THPs and
      present them to consenting filesystems:
      http://git.infradead.org/users/willy/pagecache.git/commitdiff/798bcf30ab2eff278caad03a9edca74d2f8ae760
      
      This patch (of 8):
      
      Allow for a more concise definition of a struct readahead_control.
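      A minimal sketch of such a macro, using simplified stand-in fields (the
      kernel's struct readahead_control has more members):

```c
#include <assert.h>
#include <stddef.h>

struct file;
struct address_space;

/* Simplified readahead_control for the sketch. */
struct readahead_control {
    struct file *file;
    struct address_space *mapping;
    unsigned long _index;
};

/* Declare and initialise a readahead_control in a single line. */
#define DEFINE_READAHEAD(ractl, f, m, i)            \
    struct readahead_control ractl = {              \
        .file = (f),                                \
        .mapping = (m),                             \
        ._index = (i),                              \
    }
```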
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20200903140844.14194-3-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix a race during THP splitting · c4f9c701
      Huang Ying authored
      It is reported that the following bug is triggered if the HDD is used as
      swap device,
      
      [ 5758.157556] BUG: kernel NULL pointer dereference, address: 0000000000000007
      [ 5758.165331] #PF: supervisor write access in kernel mode
      [ 5758.171161] #PF: error_code(0x0002) - not-present page
      [ 5758.176894] PGD 0 P4D 0
      [ 5758.179721] Oops: 0002 [#1] SMP PTI
      [ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S               --------- ---  5.9.0-0.rc3.1.tst.el8.x86_64 #1
      [ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
      [ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
      [ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
      [ 5758.234478] RSP: 0018:ffffb147442d7af0 EFLAGS: 00010246
      [ 5758.240309] RAX: 0000000000000000 RBX: 000000000014b217 RCX: ffffb14779fd9000
      [ 5758.248281] RDX: 000000000014b217 RSI: ffff9c52f2ab1400 RDI: 000000000014b217
      [ 5758.256246] RBP: ffffe00c51168080 R08: ffffe00c5116fe08 R09: ffff9c52fffd3000
      [ 5758.264208] R10: ffffe00c511537c8 R11: ffff9c52fffd3c90 R12: 0000000000000000
      [ 5758.272172] R13: ffffe00c51170000 R14: ffffe00c51170000 R15: ffffe00c51168040
      [ 5758.280134] FS:  0000000000000000(0000) GS:ffff9c52f2a80000(0000) knlGS:0000000000000000
      [ 5758.289163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5758.295575] CR2: 0000000000000007 CR3: 0000000022a0e003 CR4: 00000000000606e0
      [ 5758.303538] Call Trace:
      [ 5758.306273]  split_huge_page_to_list+0x88b/0x950
      [ 5758.311433]  deferred_split_scan+0x1ca/0x310
      [ 5758.316202]  do_shrink_slab+0x12c/0x2a0
      [ 5758.320491]  shrink_slab+0x20f/0x2c0
      [ 5758.324482]  shrink_node+0x240/0x6c0
      [ 5758.328469]  balance_pgdat+0x2d1/0x550
      [ 5758.332652]  kswapd+0x201/0x3c0
      [ 5758.336157]  ? finish_wait+0x80/0x80
      [ 5758.340147]  ? balance_pgdat+0x550/0x550
      [ 5758.344525]  kthread+0x114/0x130
      [ 5758.348126]  ? kthread_park+0x80/0x80
      [ 5758.352214]  ret_from_fork+0x22/0x30
      [ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
      [ 5758.412673] CR2: 0000000000000007
      [    0.000000] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 (mockbuild@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020
      
      After further digging it's found that the following race condition exists in the
      original implementation,
      
      CPU1                                                             CPU2
      ----                                                             ----
      deferred_split_scan()
        split_huge_page(page) /* page isn't compound head */
          split_huge_page_to_list(page, NULL)
            __split_huge_page(page, )
              ClearPageCompound(head)
              /* unlock all subpages except page (not head) */
                                                                       add_to_swap(head)  /* not THP */
                                                                         get_swap_page(head)
                                                                         add_to_swap_cache(head, )
                                                                           SetPageSwapCache(head)
           if PageSwapCache(head)
             split_swap_cluster(/* swap entry of head */)
               /* Deref sis->cluster_info: NULL accessing! */
      
      So, in split_huge_page_to_list(), PageSwapCache() is called for the
      already split and unlocked "head", which may be added to the swap cache
      on another CPU.  As a result, split_swap_cluster() may be called
      wrongly.
      
      To fix the race, the call to split_swap_cluster() is moved to
      __split_huge_page() before all subpages are unlocked, so that
      PageSwapCache() is stable.
      
      Fixes: 59807685 ("mm, THP, swap: support splitting THP for THP swap out")
      Reported-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: https://lkml.kernel.org/r/20201009073647.1531083-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: do not update nr_thps for mappings which support THPs · 6f4d2f97
      Matthew Wilcox (Oracle) authored
      
      
      The nr_thps counter is to support THPs in the page cache when the
      filesystem doesn't understand THPs.  Eventually it will be removed, but we
      should still support filesystems which do not understand THPs yet.  Move
      the nr_thp manipulation functions to filemap.h since they're page-cache
      specific.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Link: https://lkml.kernel.org/r/20200916032717.22917-2-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: add a filesystem flag for THPs · 01c70267
      Matthew Wilcox (Oracle) authored
      
      
      The page cache needs to know whether the filesystem supports THPs so that
      it doesn't send THPs to filesystems which can't handle them.  Dave Chinner
      points out that getting from the page mapping to the filesystem type is
      too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
      information in the address space flags.
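      The caching idea can be sketched as a bit in the mapping's flags word
      (the bit position and simplified types here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define AS_THP_SUPPORT (1u << 6)   /* illustrative bit position */

/* Toy address_space: just the flags word. */
struct address_space { unsigned int flags; };

/* One flag test instead of chasing
 * mapping->host->i_sb->s_type->fs_flags on every lookup. */
static bool mapping_thp_support(const struct address_space *mapping)
{
    return (mapping->flags & AS_THP_SUPPORT) != 0;
}
```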
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: allow arbitrary sized pages to be paged out · 3efe62e4
      Matthew Wilcox (Oracle) authored
      
      
      Remove the assumption that a compound page has HPAGE_PMD_NR pins from the
      page cache.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: "Huang, Ying" <ying.huang@intel.com>
      Link: https://lkml.kernel.org/r/20200908195539.25896-12-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback: support tail pages in wait_for_stable_page · 8854a6a7
      Matthew Wilcox (Oracle) authored
      
      
      page->mapping is undefined for tail pages, so operate exclusively on the
      head page.
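      The head-page idiom can be modelled in userspace C (toy structs; in the
      kernel, compound_head() decodes page->compound_head instead of
      following a plain pointer):

```c
#include <assert.h>
#include <stddef.h>

struct address_space { int dummy; };

/* Toy page: tail pages point at their head; ->mapping is only
 * meaningful on the head page. */
struct page {
    struct page *head;              /* NULL for a head page here */
    struct address_space *mapping;  /* undefined (NULL) on tails */
};

static struct page *compound_head(struct page *p)
{
    return p->head ? p->head : p;
}

/* Operate exclusively on the head page when reading ->mapping. */
static struct address_space *stable_page_mapping(struct page *p)
{
    return compound_head(p)->mapping;
}
```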
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Link: https://lkml.kernel.org/r/20200908195539.25896-11-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/truncate: fix truncation for pages of arbitrary size · fc3a5ac5
      Matthew Wilcox (Oracle) authored
      
      
      Remove the assumption that a compound page is HPAGE_PMD_SIZE, and the
      assumption that any page is PAGE_SIZE.
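      The fix amounts to computing a page's byte size from its compound order
      instead of hard-coding either constant; a sketch with illustrative
      constants and a toy struct page:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL             /* illustrative 4 KiB base page */

struct page { unsigned int order; }; /* 0 = base page, 9 = 2 MiB THP */

/* Size in bytes of a (possibly compound) page. */
static unsigned long page_size(const struct page *p)
{
    return PAGE_SIZE << p->order;
}
```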
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Link: https://lkml.kernel.org/r/20200908195539.25896-10-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>