  1. Dec 29, 2018
    • kmemleak: add config to select auto scan · d53ce042
      Sri Krishna chowdary authored
      
      
      The kmemleak scan can be CPU intensive and can stall user tasks at times.
      To prevent this, add the config option DEBUG_KMEMLEAK_AUTO_SCAN to
      enable/disable the automatic scan on boot up.  Also protect first_run with
      DEBUG_KMEMLEAK_AUTO_SCAN as it is only meant for the first automatic scan.
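
      With the automatic scan disabled, a scan can still be triggered on demand
      through the kmemleak debugfs interface.  A minimal userspace sketch
      (assuming debugfs is mounted at /sys/kernel/debug and the kernel was
      built with kmemleak support):

      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
              /* Request an on-demand kmemleak scan; requires root. */
              int fd = open("/sys/kernel/debug/kmemleak", O_WRONLY);

              if (fd < 0) {
                      perror("open /sys/kernel/debug/kmemleak");
                      return 1;
              }
              if (write(fd, "scan", strlen("scan")) < 0)
                      perror("write scan");
              close(fd);
              return 0;
      }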
      
      Link: http://lkml.kernel.org/r/1540231723-7087-1-git-send-email-prpatel@nvidia.com
      Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com>
      Signed-off-by: Sachin Nikam <snikam@nvidia.com>
      Signed-off-by: Prateek <prpatel@nvidia.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d53ce042
    • mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init · 3c0c12cc
      Waiman Long authored
      
      
      When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
      pages initialization can take a long time.  Below were the reported init
      times on an 8-socket 96-core 4TB IvyBridge system.
      
        1) Non-debug kernel without CONFIG_KASAN
           [    8.764222] node 1 initialised, 132086516 pages in 7027ms
      
        2) Debug kernel with CONFIG_KASAN
           [  146.288115] node 1 initialised, 132075466 pages in 143052ms
      
      So the page init time in a debug kernel was 20X that of the non-debug
      kernel.  The long init time can be problematic as the page initialization
      is done with interrupts disabled.  In this particular case, it caused the
      following warning messages to appear, as well as NMI backtraces of all the
      cores that were doing the initialization.
      
      [   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
      [   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
      [   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
      [   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
      [   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
      [   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
      [   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
      [   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)
      
      The long init time was mainly caused by the call to kasan_free_pages() to
      poison the newly initialized pages.  On a 4TB system, we are talking about
      almost 500GB of memory probably on the same node.
      
      In reality, we may not need to poison the newly initialized pages before
      they are ever allocated.  So KASAN poisoning of freed pages before the
      completion of deferred memory initialization is now disabled.  Those pages
      will be properly poisoned when they are allocated or freed after deferred
      pages initialization is done.
      
      With this change, the new page initialization time became:
      
      [   21.948010] node 1 initialised, 132075466 pages in 18702ms
      
      This was still about double the non-debug kernel time, but was much
      better than before.
      
      Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c0c12cc
    • userfaultfd: clear flag if remap event not enabled · 3cfd22be
      Peter Xu authored
      
      
      When the process being tracked does mremap() without
      UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
      we should not generate the remap event, and at the same time we should
      clear all the uffd flags on the new VMA.  Without this patch, we can still
      have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even though the
      fault handling process does not even know about the existence of the VMA.
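
      For reference, the remap event is opt-in: the monitoring process requests
      UFFD_FEATURE_EVENT_REMAP during the UFFDIO_API handshake.  A minimal
      sketch of that handshake (illustrative only, not the reproducer from the
      report; error handling trimmed):

      #include <fcntl.h>
      #include <linux/userfaultfd.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main(void)
      {
              /* No glibc wrapper for userfaultfd(2), so use syscall(). */
              int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
              struct uffdio_api api = {
                      .api = UFFD_API,
                      /* Ask for mremap() notifications; without this flag the
                       * kernel must strip uffd state from remapped VMAs. */
                      .features = UFFD_FEATURE_EVENT_REMAP,
              };

              if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0) {
                      perror("userfaultfd/UFFDIO_API");
                      return 1;
              }
              printf("negotiated features: 0x%llx\n",
                     (unsigned long long)api.features);
              close(uffd);
              return 0;
      }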
      
      Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Pravin Shedge <pravin.shedge4linux@gmail.com>
      Signed-off-by: Andrew...
      3cfd22be
    • mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES · 125b860b
      Pingfan Liu authored
      
      
      Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated in code.
      If someone adds an extra migrate type, they may forget to enlarge
      NR_PAGEBLOCK_BITS, so some way of catching this is needed.

      NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but these macros live in two
      different .h files with a reverse dependency, so it is a little hard to
      refer to MIGRATE_TYPES in pageblock-flags.h.  This patch tries to enforce
      that relation at compile time (the general technique is sketched below).
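
      The general technique is a compile-time assertion that ties the bit width
      to the enum.  A standalone illustration with hypothetical names, using
      C11 _Static_assert rather than the kernel's BUILD_BUG_ON machinery:

      #include <stdio.h>

      /* Hypothetical migrate types; adding an entry grows NR_TYPES. */
      enum migrate_type { TYPE_MOVABLE, TYPE_UNMOVABLE, TYPE_RECLAIMABLE, NR_TYPES };

      /* Bits reserved per block to store a migrate type. */
      #define TYPE_BITS 2

      /* Fails the build if NR_TYPES no longer fits in TYPE_BITS. */
      _Static_assert(NR_TYPES <= (1 << TYPE_BITS),
                     "TYPE_BITS is too small to hold every migrate type");

      int main(void)
      {
              printf("%d types fit in %d bits\n", NR_TYPES, TYPE_BITS);
              return 0;
      }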
      
      Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      125b860b
    • ksm: react on changing "sleep_millisecs" parameter faster · fcf9a0ef
      Kirill Tkhai authored
      
      
      ksm thread unconditionally sleeps in ksm_scan_thread() after each
      iteration:
      
      	schedule_timeout_interruptible(
      		msecs_to_jiffies(ksm_thread_sleep_millisecs))
      
      The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.
      
      If the user writes a big value by mistake, and the thread enters
      schedule_timeout_interruptible(), it's not possible to cancel the sleep by
      writing a new smaller value; the thread just sleeps until the timeout
      expires.
      
      The patch fixes the problem by waking the thread each time after the value
      is updated.
      
      This also may be useful for debugging purposes, and for userspace daemons
      which change the sleep_millisecs value depending on system load.
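
      For illustration, tuning the knob is a plain sysfs write; with this patch
      a pending sleep is interrupted so the new value takes effect immediately.
      A minimal sketch (root privileges and CONFIG_KSM assumed):

      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
              const char *val = "20";   /* new scan interval in milliseconds */
              int fd = open("/sys/kernel/mm/ksm/sleep_millisecs", O_WRONLY);

              if (fd < 0) {
                      perror("open sleep_millisecs");
                      return 1;
              }
              /* The write wakes ksmd, so a previously configured huge
               * timeout no longer has to expire first. */
              if (write(fd, val, strlen(val)) < 0)
                      perror("write");
              close(fd);
              return 0;
      }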
      
      Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcf9a0ef
    • mm, fault_around: do not take a reference to a locked page · e0975b2a
      Michal Hocko authored
      
      
      filemap_map_pages takes a speculative reference to each page in the range
      before it tries to lock that page.  While this is correct, it can also
      influence page migration, which will bail out when seeing an elevated
      reference count.  The faultaround code would bail on seeing a locked page
      anyway, so we can proactively check the PageLocked bit before
      page_cache_get_speculative and prevent pointless reference count churn.
      
      Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0975b2a
    • mm, memory_hotplug: deobfuscate migration part of offlining · bb8965bd
      Michal Hocko authored
      
      
      Memory migration might fail during offlining and we keep retrying in that
      case.  This is currently obfuscated by a goto retry loop.  The code is
      hard to follow and as a result it is even suboptimal, because each retry
      round scans the full range from start_pfn even though we have already
      successfully scanned/migrated the [start_pfn, pfn] range.  This is all
      only because a check_pages_isolated failure has to rescan the full range
      again.

      De-obfuscate the migration retry loop by promoting it to a real for loop.
      In fact, remove the goto altogether by making it a proper double loop
      (yeah, gotos are nasty in this specific case).  In the end we get slightly
      more optimal code which is more readable.
      
      [akpm@linux-foundation.org: reflow comments to 80 cols]
      Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb8965bd
    • mm, memory_hotplug: try to migrate full pfn range · a85009c3
      Michal Hocko authored
      
      
      Patch series "few memory offlining enhancements".
      
      I have been chasing memory offlining not making progress recently.  On the
      way I have noticed a few weird decisions in the code.  The migration
      itself is restricted without a reasonable justification and the retry loop
      around the migration is quite messy.  This is addressed by patch 1 and
      patch 2.

      Patch 3 targets the faultaround code, which has been a hot candidate for
      the initial issue reported upstream [2] and which I am debugging
      internally.  It turned out not to be the main contributor in the end but I
      believe we should address it regardless.  See the patch description for
      more details.
      
      [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv
      
      This patch (of 3):
      
      do_migrate_range has been limiting the number of pages to migrate to 256
      for some reason which is not documented.  Even if the limit made some
      sense back when it was introduced, it doesn't really serve a good purpose
      these days.  If the range contains huge pages then we break out of the
      loop too early and go through LRU and pcp cache draining, and
      scan_movable_pages is quite suboptimal.

      The only reason to limit the number of pages I can think of is to reduce
      the potential time to react to a fatal signal.  But even then the number
      of pages is a questionable metric because even a single page migration
      might block in a non-killable state (e.g.  __unmap_and_move).

      Remove the limit and offline the full requested range (this is one
      memblock worth of pages with the current code).  Should we ever get a
      report that offlining takes too long to react to a fatal signal, then we
      should rather fix the core migration to use killable waits and bail out
      on a signal.
      
      Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85009c3
    • mm, proc: report PR_SET_THP_DISABLE in proc · a1400af7
      Michal Hocko authored
      David Rientjes has reported that commit 18600332 ("mm: make
      PR_SET_THP_DISABLE immediately active") has changed the way we report
      THPable VMAs to userspace.  Their monitoring tool triggers false alarms
      on PR_SET_THP_DISABLE tasks because it interprets insufficient THP usage
      as a memory fragmentation or memory pressure issue.
      
      Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
      flag and that got exposed to the userspace via /proc/<pid>/smaps file.
      This implementation had its downsides as explained in the commit message
      but it is true that the userspace doesn't have any means to query for
      the process wide THP enabled/disabled status.
      
      PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to
      export it in the process wide context rather than per-vma.  Introduce a
      new field in /proc/<pid>/status which exports this status.  If
      PR_SET_THP_DISABLE is used then it reports false, the same as when THP is
      not compiled in.  It doesn't consider the global THP status because we
      already export that information via sysfs.
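
      One way to observe the process wide state from userspace is to flip the
      prctl and read /proc/<pid>/status.  A hedged sketch (it simply prints any
      THP related status lines; the exact field name is defined by this patch):

      #include <stdio.h>
      #include <string.h>
      #include <sys/prctl.h>

      int main(void)
      {
              char line[256];
              FILE *f;

              /* Disable THP for the whole process, not per-VMA. */
              if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                      perror("prctl(PR_SET_THP_DISABLE)");

              f = fopen("/proc/self/status", "r");
              if (!f) {
                      perror("fopen /proc/self/status");
                      return 1;
              }
              /* Print the THP related line(s) exported by the kernel. */
              while (fgets(line, sizeof(line), f))
                      if (strstr(line, "THP"))
                              fputs(line, stdout);
              fclose(f);
              return 0;
      }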
      
      Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org
      Fixes: 18600332 ("mm: make PR_SET_THP_DISABLE immediately active")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1400af7
    • mm, thp, proc: report THP eligibility for each vma · 7635d9cb
      Michal Hocko authored
      
      
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that; see
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com :
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for the hg resp. nh flags and
      confront that state with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover, the eligibility of the vma depends on the type of
      VMA as well.  In the past we supported only anonymous memory VMAs, but
      things have changed and shmem based vmas are supported as well these
      days, and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just uses the new transparent_hugepage_enabled, which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
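
      Once the field is exported, checking a mapping's eligibility amounts to
      scanning its smaps entry.  A hedged userspace sketch that prints each VMA
      header together with any THP related line (the exact field name comes
      from this patch):

      #include <ctype.h>
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              char line[512], vma[512] = "";
              FILE *f = fopen("/proc/self/smaps", "r");

              if (!f) {
                      perror("fopen /proc/self/smaps");
                      return 1;
              }
              while (fgets(line, sizeof(line), f)) {
                      /* Lines like "7f3d...-7f3e... r-xp ..." start a new
                       * VMA entry; remember the header. */
                      if (isxdigit((unsigned char)line[0]) && strchr(line, '-'))
                              strcpy(vma, line);
                      else if (strstr(line, "THP"))
                              /* Print the eligibility line next to its VMA. */
                              printf("%s%s", vma, line);
              }
              fclose(f);
              return 0;
      }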
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7635d9cb
    • mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps · 7550c607
      Michal Hocko authored
      Patch series "THP eligibility reporting via proc".
      
      This series of three patches aims at making THP eligibility reporting much
      more robust and long term sustainable.  The trigger for the change is a
      regression report [2] and the long follow up discussion.  In short the
      specific application didn't have good API to query whether a particular
      mapping can be backed by THP so it has used VMA flags to workaround that.
      These flags represent a deep internal state of VMAs and as such they
      should be used by userspace with a great deal of caution.
      
      A similar thing happened with [3], when users complained that VM_MIXEDMAP
      is no longer set on DAX mappings.  Again, a lack of a proper API led to an
      abuse.

      The first patch in the series tries to emphasise that the semantics of
      these flags might change and any application consuming them should be
      really careful.
      
      The remaining two patches provide a more suitable interface to address [2]
      and provide a consistent API to query the THP status both for each VMA and
      process wide as well.

      [1] http://lkml.kernel.org/r/20181120103515.25280-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
      [3] http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
      
      This patch (of 3):
      
      Even though the vma flags exported via /proc/<pid>/smaps are explicitly
      documented as not guaranteed for future compatibility, the warning doesn't
      go far enough because it doesn't mention semantic changes to those flags.
      And those are important as well, because these flags are a deep
      implementation detail internal to the MM code and their semantics might
      change at any time.
      
      Let's consider two recent examples:
      http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
      : commit e1fb4a08 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
      : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
      : mean time certain customer of ours started poking into /proc/<pid>/smaps
      : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
      : flags, the application just fails to start complaining that DAX support is
      : missing in the kernel.
      
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
      : Commit 18600332 ("mm: make PR_SET_THP_DISABLE immediately active")
      : introduced a regression in that userspace cannot always determine the set
      : of vmas where thp is ineligible.
      : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
      : to determine if a vma is eligible to be backed by hugepages.
      : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
      : be disabled and emit "nh" as a flag for the corresponding vmas as part of
      : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
      : flag and "nh" is not emitted.
      : This causes smaps parsing libraries to assume a vma is eligible for thp
      : and ends up puzzling the user on why its memory is not backed by thp.
      
      In both cases userspace was relying on the semantics of a specific VMA
      flag.  The primary reason this happened is the lack of a proper interface.
      While this has been worked on and will be fixed properly, it seems that
      our wording could use some refinement and be more vocal about the semantic
      aspects of these flags as well.
      
      Link: http://lkml.kernel.org/r/20181211143641.3503-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7550c607
    • include/linux/memory_hotplug.h: remove duplicate declaration of offline_pages() · 0614ce97
      Wei Yang authored
      
      
      offline_pages() is already declared in this file.
      
      Just remove the duplicated one.
      
      Link: http://lkml.kernel.org/r/20181205031357.24769-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0614ce97
    • mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 · ac46d4f3
      Jérôme Glisse authored
      
      
      To avoid having to change many call sites every time we want to add a
      parameter, use a structure to group all parameters for the mmu_notifier
      invalidate_range_start/end calls.  No functional changes with this patch.
      
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      From: Jérôme Glisse <jglisse@redhat.com>
      Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
      
      fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
      
      Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac46d4f3
    • mm/mmu_notifier: use structure for invalidate_range_start/end callback · 5d6527a7
      Jérôme Glisse authored
      
      
      Patch series "mmu notifier contextual informations", v2.
      
      This patchset adds contextual information, i.e. why an invalidation is
      happening, to the mmu notifier callbacks.  This is necessary for users of
      mmu notifiers that wish to maintain their own data structures without
      having to add new fields to struct vm_area_struct (vma).

      For instance, a device can have its own page table that mirrors the
      process address space.  When a vma is unmapped (munmap() syscall) the
      device driver can free the device page table for the range.

      Today we do not have any information on why an mmu notifier callback is
      happening, and thus device drivers have to assume that it is always an
      munmap().  This is inefficient as it means that the driver needs to
      re-allocate the device page table on the next page fault and rebuild the
      whole device driver data structure for the range.

      Other use cases besides munmap() also exist; for instance it is pointless
      for a device driver to invalidate the device page table when the
      invalidation is for soft dirtiness tracking.  Or a device driver can
      optimize away an mprotect() that changes the page table permission access
      for the range.

      This patchset enables all these optimizations for device drivers.  I do
      not include any of those in this series, but another patchset I am
      posting will leverage this.

      The patchset is pretty simple from a code point of view.  The first two
      patches consolidate all mmu notifier arguments into a struct so that it
      is easier to add/change arguments.  The last patch adds the contextual
      information (munmap, protection, soft dirty, clear, ...).

      This patch (of 3):

      To avoid having to change many callback definitions every time we want to
      add a parameter, use a structure to group all parameters for the
      mmu_notifier invalidate_range_start/end callbacks.  No functional changes
      with this patch.
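
      The consolidation follows a common C refactoring pattern: callbacks
      receive one range/context structure instead of a growing list of scalar
      arguments, so adding a field later does not touch every callback
      signature.  A generic, standalone illustration with hypothetical names
      (not the kernel's actual structure layout):

      #include <stdio.h>

      /* Hypothetical stand-in for a notifier range descriptor: new fields
       * can be added here without changing every callback signature. */
      struct inval_range {
              unsigned long start;
              unsigned long end;
              int event;              /* e.g. unmap vs. protection change */
      };

      /* Callbacks take the whole descriptor rather than individual scalars. */
      typedef void (*inval_cb)(const struct inval_range *range);

      static void print_cb(const struct inval_range *range)
      {
              printf("invalidate [%#lx, %#lx) event=%d\n",
                     range->start, range->end, range->event);
      }

      int main(void)
      {
              struct inval_range range = {
                      .start = 0x1000, .end = 0x2000, .event = 1,
              };
              inval_cb cb = print_cb;

              cb(&range);
              return 0;
      }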
      
      [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
      Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d6527a7
    • hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined · b15c8726
      Michal Hocko authored
      
      
      We have received a bug report that an injected MCE about faulty memory
      prevents memory offlining from succeeding on a 4.4 base kernel.  The
      underlying reason was that the HWPoison page has an elevated reference
      count and the migration keeps failing.  There are two problems with that.
      First of all it is dubious to migrate the poisoned page because we know
      that accessing that memory may fail.  Secondly, it doesn't make any sense
      to migrate potentially broken content and preserve the memory corruption
      over to a new location.

      Oscar has found out that 4.4 and the current upstream kernels behave
      slightly differently with his simple testcase
      
      ===
      
      /* Build requirements for this userspace reproducer; PAGE_ALIGN assumes
       * 4 KiB pages. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define PAGE_ALIGN(addr) (((addr) + 4095UL) & ~4095UL)

      int main(void)
      {
              int ret;
              int i;
              int fd;
              char *array = malloc(4096);
              char *array_locked = malloc(4096);
      
              fd = open("/tmp/data", O_RDONLY);
              read(fd, array, 4095);
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
              if (ret)
                      perror("mlock");
      
              sleep (20);
      
              ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
              if (ret)
                      perror("madvise");
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              return 0;
      }
      ===
      
      + offline this memory.
      
      In 4.4 kernels he saw the hwpoisoned page being returned back to the LRU
      list:
      kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
      kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
      kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
      kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
      kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
      kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
      kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
      kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
      kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
      kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
      kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
      kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
      kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
      kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
      kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
      kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
      kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
      kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
      kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
      kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80
      
      And the latter confuses the hotremove path because an LRU page is
      attempted to be migrated and that fails due to an elevated reference
      count.  It is quite possible that the reuse of the HWPoisoned page is some
      kind of fixed race condition but I am not really sure about that.
      
      With the upstream kernel the failure is slightly different.  The page
      doesn't seem to have LRU bit set but isolate_movable_page simply fails and
      do_migrate_range simply puts all the isolated pages back to LRU and
      therefore no progress is made and scan_movable_pages finds same set of
      pages over and over again.
      
      Fix both cases by explicitly checking for HWPoisoned pages before we even
      try to get a reference on the page, and try to unmap it if it is still
      mapped.  As explained by Naoya:
      
      : Hwpoison code never unmapped those for no big reason because
      : Ksm pages never dominate memory, so we simply didn't have strong
      : motivation to save the pages.
      
      Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
      HWPoison pages which shouldn't happen but I couldn't convince myself about
      that.  Naoya has noted the following:
      
      : Theoretically no such guarantee, because try_to_unmap() doesn't have a
      : guarantee of success and then memory_failure() returns immediately
      : when hwpoison_user_mappings fails.
      : Or the following code (comes after the hwpoison_user_mappings block)
      : also implies that the target page can still have the PageLRU flag.
      :
      :         /*
      :          * Torn down by someone else?
      :          */
      :         if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
      :                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
      :                 res = -EBUSY;
      :                 goto out;
      :         }
      :
      : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
      : current version of your patch.
      
      Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.com>
      Debugged-by: Oscar Salvador <osalvador@suse.com>
      Tested-by: Oscar Salvador <osalvador@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b15c8726
    • mm, kmemleak: little optimization while scanning · 9f1eb38e
      Oscar Salvador authored
      
      
      kmemleak_scan() goes through all online nodes and tries to scan all used
      pages.
      
      We can do better and use pfn_to_online_page(), so in case we have
      CONFIG_MEMORY_HOTPLUG, offlined pages will be skipped automatically.  For
      boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page()
      will fall back to pfn_valid().
      
      Another little optimization is to check if the page belongs to the node we
      are currently checking, so in case we have nodes interleaved we will not
      check the same pfn multiple times.
      
      I ran some tests:
      
      Add some memory to node1 and node2 making it interleaved:
      
      (qemu) object_add memory-backend-ram,id=ram0,size=1G
      (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
      (qemu) object_add memory-backend-ram,id=ram1,size=1G
      (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2
      (qemu) object_add memory-backend-ram,id=ram2,size=1G
      (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1
      
      Then, we offline that memory:
       # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
       # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
       # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done
      
      And we run kmemleak_scan:
      
       # echo "scan" > /sys/kernel/debug/kmemleak
      
      before the patch:
      
      kmemleak: time spend: 41596 us
      
      after the patch:
      
      kmemleak: time spend: 34899 us
      
      [akpm@linux-foundation.org: remove stray newline, per Oscar]
      Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f1eb38e
    • lib/ioremap: ensure break-before-make is used for huge p4d mappings · 8e2d4340
      Will Deacon authored
      
      
      Whilst no architectures actually enable support for huge p4d mappings in
      the vmap area, the code that is implemented should be using
      break-before-make, as we do for pud and pmd huge entries.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-6-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e2d4340
    • lib/ioremap: ensure phys_addr actually corresponds to a physical address · 36ddc5a7
      Will Deacon authored
      
      
      The current ioremap() code uses a phys_addr variable at each level of page
      table, which is confusingly offset by subtracting the base virtual address
      being mapped so that adding the current virtual address back on when
      iterating through the page table entries gives back the corresponding
      physical address.
      
      This is fairly confusing and results in all users of phys_addr having to
      add the current virtual address back on.  Instead, this patch just updates
      phys_addr when iterating over the page table entries, ensuring that it's
      always up-to-date and doesn't require explicit offsetting.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-5-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36ddc5a7
    • x86/pgtable: drop pXd_none() checks from pXd_free_pYd_table() · 48e178ab
      Will Deacon authored
      
      
      The core code already has a check for pXd_none(), so remove it from the
      architecture implementation.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-4-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48e178ab
    • arm64: mmu: drop pXd_present() checks from pXd_free_pYd_table() · 9c006972
      Will Deacon authored
      
      
      The core code already has a check for pXd_none(), so remove it from the
      architecture implementation.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-3-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c006972
    • ioremap: rework pXd_free_pYd_page() API · d239865a
      Will Deacon authored
      
      
      The recently merged API for ensuring break-before-make on page-table
      entries when installing huge mappings in the vmalloc/ioremap region is
      fairly counter-intuitive, resulting in the arch freeing functions (e.g.
      pmd_free_pte_page()) being called even on entries that aren't present.
      This resulted in a minor bug in the arm64 implementation, giving rise to
      spurious VM_WARN messages.
      
      This patch moves the pXd_present() checks out into the core code,
      refactoring the callsites at the same time so that we avoid the complex
      conjunctions when determining whether or not we can put down a huge
      mapping.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-2-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d239865a
    • mm/filemap.c: remove useless check in pagecache_get_page() · c16eb000
      Kirill Tkhai authored
      
      
      The page is never NULL here, so we can remove this useless check.
      
      Link: http://lkml.kernel.org/r/154419752044.18559.2452963074922917720.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c16eb000
    • /proc/kpagecount: return 0 for special pages that are never mapped · 144552ff
      Anthony Yznaga authored
      
      
      Certain pages that are never mapped to userspace have a type indicated in
      the page_type field of their struct page (e.g.  PG_buddy).  page_type
      overlaps with _mapcount, so return a count of 0 and avoid calling
      page_mapcount() for these pages.
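
      For context, /proc/kpagecount exposes one 64-bit map count per page
      frame, indexed by PFN.  A minimal reader sketch (root required; the PFN
      below is an arbitrary example value):

      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
              uint64_t pfn = 0x1000;  /* arbitrary example page frame number */
              uint64_t count;
              int fd = open("/proc/kpagecount", O_RDONLY);

              if (fd < 0) {
                      perror("open /proc/kpagecount");
                      return 1;
              }
              /* One 8-byte map count per PFN; special, never-mapped pages
               * such as PG_buddy pages now report 0 here. */
              if (pread(fd, &count, sizeof(count), pfn * sizeof(count)) !=
                  (ssize_t)sizeof(count)) {
                      perror("pread");
                      close(fd);
                      return 1;
              }
              printf("pfn %#llx mapcount %llu\n",
                     (unsigned long long)pfn, (unsigned long long)count);
              close(fd);
              return 0;
      }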
      
      [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox]
        Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com
      Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com
      Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      144552ff
    • tools/vm/page-types.c: fix "kpagecount returned fewer pages than expected" failures · b6fb87b8
      Anthony Yznaga authored
      Because kpagecount_read() fakes success if map counts are not being
      collected, clamp the page count passed to it by walk_pfn() to the pages
      value returned by the preceding call to kpageflags_read().
      
      Link: http://lkml.kernel.org/r/1543962269-26116-1-git-send-email-anthony.yznaga@oracle.com
      Fixes: 7f1d23e6 ("tools/vm/page-types.c: include shared map counts")
      Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6fb87b8
    • mm/page_alloc.c: drop uneeded __meminit and __meminitdata · bbe5d993
      Oscar Salvador authored
      Since commit 03e85f9d ("mm/page_alloc: Introduce
      free_area_init_core_hotplug"), some functions changed to only be called
      during system initialization.  Concretely, free_area_init_node() and the
      functions that hang from it.
      
      Also, some variables are no longer used after the system has gone
      through initialization.  So this could be considered as a late clean-up
      for that patch.
      
      This patch changes the functions from __meminit to __init, and the
      variables from __meminitdata to __initdata.
      
      In return, we get some KBs back:
      
      Before:
        Freeing unused kernel image memory: 2472K
      
      After:
        Freeing unused kernel image memory: 2480K
      
      Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbe5d993
    • mm/page-writeback.c: don't break integrity writeback on ->writepage() error · 3fa750dc
      Brian Foster authored
      
      
      write_cache_pages() is used in both background and integrity writeback
      scenarios by various filesystems.  Background writeback is mostly
      concerned with cleaning a certain number of dirty pages based on various
      mm heuristics.  It may not write the full set of dirty pages or wait for
      I/O to complete.  Integrity writeback is responsible for persisting a set
      of dirty pages before the writeback job completes.  For example, an
      fsync() call must perform integrity writeback to ensure data is on disk
      before the call returns.
      
      write_cache_pages() unconditionally breaks out of its processing loop in
      the event of a ->writepage() error.  This is fine for background
      writeback, which had no strict requirements and will eventually come
      around again.  This can cause problems for integrity writeback on
      filesystems that might need to clean up state associated with failed page
      writeouts.  For example, XFS performs internal delayed allocation
      accounting before returning a ->writepage() error, where applicable.  If
      the current writeback happens to be associated with an unmount and
      write_cache_pages() completes the writeback prematurely due to error, the
      filesystem is unmounted in an inconsistent state if dirty+delalloc pages
      still exist.
      
      To handle this problem, update write_cache_pages() to always process the
      full set of pages for integrity writeback regardless of ->writepage()
      errors.  Save the first encountered error and return it to the caller once
      complete.  This facilitates XFS (or any other fs that expects integrity
      writeback to process the entire set of dirty pages) to clean up its
      internal state completely in the event of persistent mapping errors.
      Background writeback continues to exit on the first error encountered.
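
      The fix boils down to a common error handling pattern: in integrity
      (sync) mode, remember the first failure, keep going, and return that
      first error at the end; in background mode, bail out immediately.  A
      standalone illustration of the pattern with hypothetical names (not the
      write_cache_pages() code itself):

      #include <stdio.h>

      /* Hypothetical per-page writeout that can fail. */
      static int writepage(int page)
      {
              return (page == 3) ? -5 /* pretend -EIO */ : 0;
      }

      /* Integrity mode visits every page and reports the first error;
       * background mode stops at the first failure. */
      static int write_pages(int npages, int integrity)
      {
              int first_error = 0;
              int page;

              for (page = 0; page < npages; page++) {
                      int err = writepage(page);

                      if (!err)
                              continue;
                      if (!integrity)
                              return err;             /* background: give up early */
                      if (!first_error)
                              first_error = err;      /* integrity: remember it */
              }
              return first_error;
      }

      int main(void)
      {
              printf("background: %d, integrity: %d\n",
                     write_pages(8, 0), write_pages(8, 1));
              return 0;
      }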
      
      [akpm@linux-foundation.org: fix typo in comment]
      Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fa750dc
    • lib/show_mem.c: drop pgdat_resize_lock in show_mem() · c3a5c77a
      Wei Yang authored
      
      
      Function show_mem() is used to print the system memory status when the
      user requests it or when a memory allocation fails.  Generally, this is
      best-effort information, so any races with memory hotplug (or, very
      theoretically, an early initialization) should be tolerable and the worst
      that could happen is to print an imprecise node state.

      Drop the resize lock because this is the only place which might hold the
      lock from interrupt context, and so all other callers might use a simple
      spinlock.  Even though this doesn't solve any real issue, it makes the
      code easier to follow and a tiny bit more effective.
      
      Link: http://lkml.kernel.org/r/20181129235532.9328-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3a5c77a
    • mm/hmm.c: remove set but not used variable 'devmem' · 0ecea993
      YueHaibing authored
      
      
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      mm/hmm.c: In function 'hmm_devmem_ref_kill':
      mm/hmm.c:995:21: warning:
       variable 'devmem' set but not used [-Wunused-but-set-variable]
      
      It is not used any more since 35d39f953d4e ("mm, hmm: replace
      hmm_devmem_pages_create() with devm_memremap_pages()").
      
      Link: http://lkml.kernel.org/r/1543629971-128377-1-git-send-email-yuehaibing@huawei.com
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ecea993
    • mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection · fa004ab7
      Wei Yang authored
      
      
      During online_pages phase, pgdat->nr_zones will be updated in case this
      zone is empty.
      
      Currently the online_pages phase is protected by the global locks
      (device_hotplug_lock and mem_hotplug_lock), which ensures there is no
      contention during the update of nr_zones.

      These global locks introduce scalability issues (especially the second
      one), which slow down code relying on get_online_mems().  This is also a
      preparation for not having to rely on get_online_mems() but instead on
      some more fine grained locks.
      
      The patch moves init_currently_empty_zone under both zone_span_writelock
      and pgdat_resize_lock because both the pgdat state (nr_zones) and the
      zone's start_pfn are changed.  Also this patch changes the documentation
      of node_size_lock to include the protection of nr_zones.
      
      Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa004ab7
    • mm, sparse: pass nid instead of pgdat to sparse_add_one_section() · 4e0d2e7e
      Wei Yang authored
      
      
      Since the only information needed in sparse_add_one_section() to allocate
      proper memory is the node id, it is not necessary to pass its pgdat.

      This patch changes the prototype of sparse_add_one_section() to pass the
      node id directly.  This is intended to avoid the misleading impression
      that sparse_add_one_section() would touch pgdat.
      
      Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e0d2e7e
    • mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section() · 83af6588
      Wei Yang authored
      
      
      pgdat_resize_lock is used to protect pgdat's memory region information
      like node_start_pfn, node_present_pages, etc., while in
      sparse_add/remove_one_section() pgdat_resize_lock is used to protect the
      initialization/release of one mem_section.  This does not look proper.

      These code paths are protected by mem_hotplug_lock currently, but should
      there ever be any reason for locking at the sparse layer, a dedicated
      lock should be introduced.
      
      Following is the current call trace of sparse_add/remove_one_section()
      
          mem_hotplug_begin()
          arch_add_memory()
             add_pages()
                 __add_pages()
                     __add_section()
                         sparse_add_one_section()
          mem_hotplug_done()
      
          mem_hotplug_begin()
          arch_remove_memory()
              __remove_pages()
                  __remove_section()
                      sparse_remove_one_section()
          mem_hotplug_done()
      
      The comment above pgdat_resize_lock also mentions "Holding this will also
      guarantee that any pfn_valid() stays that way.", but the current
      implementation does not actually provide that guarantee: there aren't any
      pfn walkers taking the lock, so this looks like a relic from the past.
      This patch also removes that comment.
      
      [richard.weiyang@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com
      [mhocko@suse.com: changelog suggestion]
      Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83af6588
    • Yu Zhao's avatar
      mm: remove pte_lock_deinit() · 9e247bab
      Yu Zhao authored
      
      
      A page table page doesn't touch page->mapping or use any field that
      overlaps with it, so there is no need to clear the mapping in the dtor.
      In fact, doing so might mask problems that would otherwise be detected by
      bad_page().
      
      Link: http://lkml.kernel.org/r/20181128235525.58780-1-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Reviewed-by: default avatarMatthew Wilcox <willy@infradead.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e247bab
    • Minchan Kim's avatar
      zram: writeback throttle · bb416d18
      Minchan Kim authored
      
      
      Heavy write IO to a flash device can wear out the storage.  To overcome
      this problem, the admin needs to impose a write limit that guarantees
      flash health for the entire product life.
      
      This patch creates a new knob "writeback_limit" for zram.
      
      writeback_limit's default value is 0, so it doesn't limit any writeback.
      If the admin wants to measure the writeback count over a certain period,
      it can be read from the 3rd column of /sys/block/zram0/bd_stat.
      
      If the admin wants to limit writeback to 400MB per day, it can be done
      like below (the limit is given in units of 4K pages):
      
      	MB_SHIFT=20
      	PAGE_SHIFT=12
      	echo $((400<<MB_SHIFT>>PAGE_SHIFT)) > \
      		/sys/block/zram0/writeback_limit
      
      If the admin wants to allow further writeback again, it can be done like below:
      
      	echo 0 > /sys/block/zram0/writeback_limit
      
      If the admin wants to see the remaining writeback budget:
      
      	cat /sys/block/zram0/writeback_limit
      
      The writeback_limit count resets whenever zram is reset (e.g., system
      reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how much
      writeback happened before the reset, in order to allocate extra writeback
      budget in the next setting, is the user's job.
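      
      As a minimal sketch of that bookkeeping (a hypothetical daily re-arm of
      the budget that carries any unused budget forward; the zram0 paths and the
      4K-page unit are taken from the examples above):
      
      	DAILY=$((400<<20>>12))                        # 400MB in 4K pages
      	LEFT=$(cat /sys/block/zram0/writeback_limit)  # unused budget so far
      	# re-arm the budget, adding whatever was left over
      	echo $((DAILY + LEFT)) > /sys/block/zram0/writeback_limit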
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb416d18
    • Minchan Kim's avatar
      zram: add bd_stat statistics · 23eddf39
      Minchan Kim authored
      
      
      bd_stat represents things that happened in the backing device.  Currently
      it supports bd_counts, bd_reads and bd_writes, which are helpful for
      understanding flash wearout and memory savings.
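      
      A minimal sketch of reading it (assuming the stat file sits at
      /sys/block/zram0/bd_stat and reports the three counters above as
      space-separated columns, in units of 4K pages):
      
      	# assumed column order: bd_counts bd_reads bd_writes
      	cat /sys/block/zram0/bd_stat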
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      23eddf39
    • Minchan Kim's avatar
      zram: support idle/huge page writeback · a939888e
      Minchan Kim authored
      
      
      Add a new feature "zram idle/huge page writeback".  In the zram-swap use
      case, zram usually has many idle/huge swap pages.  It's pointless to keep
      them in memory (i.e., in zram).
      
      To solve this problem, this feature introduces idle/huge page writeback to
      the backing device; the goal is to save more memory space on embedded
      systems.
      
      The normal sequence for using the idle/huge page writeback feature is as
      follows:
      
      while (1) {
              # mark all allocated zram slots as idle
              echo all > /sys/block/zram0/idle
              # leave the system working for several hours;
              # blocks on zram that see no access in that time
              # remain marked as IDLE pages
      
              echo "idle" > /sys/block/zram0/writeback
              # and/or
              echo "huge" > /sys/block/zram0/writeback
              # write the IDLE and/or huge marked slots to the backing device
              # and free the memory
      }
      
      Per the discussion at
      https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
      this patch removes the direct incompressible page writeback feature
      (d2afd25114f4 ("zram: write incompressible pages to backing device")).
      
      Below are the concerns from Sergey:
      == &< ==
      
      "IDLE writeback" is superior to "incompressible writeback".
      
      "incompressible writeback" is completely unpredictable and uncontrollable;
      it depens on data patterns and compression algorithms.  While "IDLE
      writeback" is predictable.
      
      I even suspect that, *ideally*, we can remove "incompressible writeback".
      "IDLE pages" are a superset which also includes "incompressible" pages.
      So, technically, we can still do "incompressible writeback" from the "IDLE
      writeback" path, but a much more reasonable one, based on a page idling
      period.
      
      I understand that you want to keep "direct incompressible writeback"
      around.  ZRAM is especially popular on devices which do suffer from flash
      wearout, so I can see the "incompressible writeback" path becoming dead
      code in the long term.
      
      == &< ==
      
      Below are the concerns from Minchan:
      == &< ==
      
      My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this
      implementation, both hugepage and idlepage writeback will turn on.
      However, some users want to enable only idlepage writeback, so we would
      need to introduce an on/off knob for hugepage, or a new
      CONFIG_ZRAM_IDLEPAGE_WRITEBACK for that use case.  I don't want to make it
      complicated *if possible*.
      
      Long term, I imagine we need to make the VM aware of a new swap hierarchy,
      a little different from what we have today.  For example, the first,
      high-priority swap device could return -EIO or -ENOCOMP, and swap would
      try to fall back to the next lower-priority swap device.  With that,
      hugepage writeback would work transparently.
      
      So we could regard it as a regression because incompressible pages no
      longer go to backing storage automatically.  Instead, the user should do
      it manually via "echo huge > /sys/block/zram/writeback".
      
      == &< ==
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarJoey Pabalinas <joeypabalinas@gmail.com>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a939888e
    • Minchan Kim's avatar
      zram: introduce ZRAM_IDLE flag · e82592c4
      Minchan Kim authored
      
      
      To support idle page writeback with upcoming patches, this patch
      introduces a new ZRAM_IDLE flag.
      
      Userspace can mark zram slots as "idle" via
      	"echo all > /sys/block/zramX/idle"
      which marks every allocated zram slot as ZRAM_IDLE.
      Users can see it via /sys/kernel/debug/zram/zram0/block_state.
      
                300    75.033841 ...i
                301    63.806904 s..i
                302    63.806919 ..hi
      
      Once there is IO for a slot, the mark disappears:
      
                300    75.033841 ...
                301    63.806904 s..i
                302    63.806919 ..hi
      
      In the first listing above, the 300th block is an idle zpage.  With this
      feature, the user can see how many idle pages zram holds, which are a
      waste of memory.
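      
      A minimal sketch for counting the idle slots from that debugfs file
      (assuming the three-column format shown above, where an 'i' in the flags
      column marks ZRAM_IDLE):
      
      	# count slots whose flags column carries the 'i' (idle) mark
      	awk '$3 ~ /i/ { n++ } END { print n+0, "idle slots" }' \
      		/sys/kernel/debug/zram/zram0/block_state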
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: default avatarJoey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e82592c4
    • Minchan Kim's avatar
      zram: refactor flags and writeback stuff · 7e529283
      Minchan Kim authored
      
      
      Rename some variables and restructure some code for better readability in
      writeback and zs_free_page.
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-4-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: default avatarJoey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e529283
    • Minchan Kim's avatar
      zram: fix double free backing device · 5547932d
      Minchan Kim authored
      
      
      If blkdev_get() fails, we shouldn't call blkdev_put().  Otherwise, the
      kernel emits the log below.  This patch fixes it.
      
        WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120
        Modules linked in:
        CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        RIP: 0010:blkdev_put+0x105/0x120
        Call Trace:
          __x64_sys_swapoff+0x46d/0x490
          do_syscall_64+0x5a/0x190
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
        irq event stamp: 4466
        hardirqs last  enabled at (4465):  __free_pages_ok+0x1e3/0x490
        hardirqs last disabled at (4466):  trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (3420):  __do_softirq+0x333/0x446
        softirqs last disabled at (3407):  irq_exit+0xd1/0xe0
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-3-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: default avatarJoey Pabalinas <joeypabalinas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5547932d
    • Minchan Kim's avatar
      zram: fix lockdep warning of free block handling · 3c9959e0
      Minchan Kim authored
      
      
      Patch series "zram idle page writeback", v3.
      
      Inherently, a swap device has many idle pages which are rarely touched
      after they are allocated.  That is never a problem if we use a storage
      device as swap.  However, it's just a waste of memory for zram-swap.
      
      This patchset supports zram idle page writeback feature.
      
      * Admin can define what an idle page is: "no access since X time ago"
      * Admin can define when zram should write them back
      * Admin can define when zram should stop writeback to prevent wearout
      
      Details are in each patch's description.
      
      This patch (of 7):
      
        ================================
        WARNING: inconsistent lock state
        4.19.0+ #390 Not tainted
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
        00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
        {SOFTIRQ-ON-W} state was registered at:
          _raw_spin_lock+0x2c/0x40
          zram_make_request+0x755/0xdc9
          generic_make_request+0x373/0x6a0
          submit_bio+0x6c/0x140
          __swap_writepage+0x3a8/0x480
          shrink_page_list+0x1102/0x1a60
          shrink_inactive_list+0x21b/0x3f0
          shrink_node_memcg.constprop.99+0x4f8/0x7e0
          shrink_node+0x7d/0x2f0
          do_try_to_free_pages+0xe0/0x300
          try_to_free_pages+0x116/0x2b0
          __alloc_pages_slowpath+0x3f4/0xf80
          __alloc_pages_nodemask+0x2a2/0x2f0
          __handle_mm_fault+0x42e/0xb50
          handle_mm_fault+0x55/0xb0
          __do_page_fault+0x235/0x4b0
          page_fault+0x1e/0x30
        irq event stamp: 228412
        hardirqs last  enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
        hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
        softirqs last  enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
        softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&(&zram->bitmap_lock)->rlock);
          <Interrupt>
            lock(&(&zram->bitmap_lock)->rlock);
      
         *** DEADLOCK ***
      
        no locks held by zram_verify/2095.
      
        stack backtrace:
        CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x67/0x9b
         print_usage_bug+0x1bd/0x1d3
         mark_lock+0x4aa/0x540
         __lock_acquire+0x51d/0x1300
         lock_acquire+0x90/0x180
         _raw_spin_lock+0x2c/0x40
         put_entry_bdev+0x1e/0x50
         zram_free_page+0xf6/0x110
         zram_slot_free_notify+0x42/0xa0
         end_swap_bio_read+0x5b/0x170
         blk_update_request+0x8f/0x340
         scsi_end_request+0x2c/0x1e0
         scsi_io_completion+0x98/0x650
         blk_done_softirq+0x9e/0xd0
         __do_softirq+0xcc/0x427
         irq_exit+0xd1/0xe0
         do_IRQ+0x93/0x120
         common_interrupt+0xf/0xf
         </IRQ>
      
      With the writeback feature, zram_slot_free_notify could be called in a
      softirq context by end_swap_bio_read.  However, bitmap_lock is not aware
      of that, so lockdep yells out:
      
        get_entry_bdev
        spin_lock(bitmap->lock);
        irq
        softirq
        end_swap_bio_read
        zram_slot_free_notify
        zram_slot_lock <-- deadlock prone
        zram_free_page
        put_entry_bdev
        spin_lock(bitmap->lock); <-- deadlock prone
      
      With akpm's suggestion (i.e. bitmap operations are already atomic), we
      can remove the bitmap lock.  It might fail to find an empty slot if
      serious contention happens.  However, that's not a severe problem because
      huge page writeback already has the possibility to fail under severe
      memory pressure.  The worst case is just keeping the incompressible page
      in memory rather than on storage.
      
      The other problem is zram_slot_lock in zram_slot_free_notify.  To make it
      safe, this patch introduces zram_slot_trylock and uses it in
      zram_slot_free_notify.  Although contention there is rare, this patch adds
      a new debug stat "miss_free" to keep monitoring how often it happens.
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: default avatarJoey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c9959e0
    • Qian Cai's avatar
      mm/memblock.c: skip kmemleak for kasan_init() · fed84c78
      Qian Cai authored
      
      
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
      log buffer usage went from something like 280 to 260000, which caused
      kmemleak to be disabled and crash dump memory reservation to fail.  The
      multitude of kmemleak_alloc() calls comes from nested loops while KASAN is
      setting up full memory mappings, so let early kmemleak allocations skip
      those memblock_alloc_internal() calls that come from kasan_init(), given
      that those early KASAN memory mappings should not reference other memory.
      Hence, no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
      Signed-off-by: default avatarQian Cai <cai@gmx.us>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fed84c78