  1. Feb 01, 2018
• powerpc/mm: update pmdp_invalidate to return old pmd value · 8cc931e0
      Aneesh Kumar K.V authored
      
      
Returning the old pmd value from pmdp_invalidate() is required to avoid
losing the dirty and accessed bits.
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-7-kirill.shutemov@linux.intel.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cc931e0
• mips: use generic_pmdp_establish as pmdp_establish · b6b34b2d
      Kirill A. Shutemov authored
      
      
      MIPS doesn't support hardware dirty/accessed bits.
      generic_pmdp_establish() is suitable in this case.
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-6-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6b34b2d
• arm64: provide pmdp_establish() helper · 1d78a62c
      Catalin Marinas authored
      
      
We need an atomic way to set up a pmd page table entry, avoiding races
with the CPU setting dirty/accessed bits.  This is required to implement
a pmdp_invalidate() that doesn't lose these bits.
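On arm64 this can be done with a plain exchange of the pmd value.  A
minimal sketch of such a helper (not necessarily the exact code that was
merged) looks like:

  static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
                  unsigned long address, pmd_t *pmdp, pmd_t pmd)
  {
          /* atomically install the new entry and hand back the old one */
          return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
  }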
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-5-kirill.shutemov@linux.intel.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d78a62c
• arm/mm: provide pmdp_establish() helper · ef298cc5
      Kirill A. Shutemov authored
      
      
      ARM LPAE doesn't have hardware dirty/accessed bits.
      
      generic_pmdp_establish() is the right implementation of pmdp_establish
      for this case.
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef298cc5
• arc: use generic_pmdp_establish as pmdp_establish · 5c8aa7ea
      Kirill A. Shutemov authored
      
      
      ARC doesn't support hardware dirty/accessed bits.
      generic_pmdp_establish() is suitable in this case.
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c8aa7ea
• asm-generic: provide generic_pmdp_establish() · c58f0bb7
      Kirill A. Shutemov authored
      
      
      Patch series "Do not lose dirty bit on THP pages", v4.
      
      Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
      dirty and access bits if CPU sets them after pmdp dereference, but
      before set_pmd_at().
      
The bug can lead to data loss, but the race window is tiny and I haven't
seen any reports suggesting that it happens in reality, so I don't think
it is worth sending to stable.
      
      Unfortunately, there's no way to address the issue in a generic way.  We
      need to fix all architectures that support THP one-by-one.
      
All architectures that support THP have to provide an atomic
pmdp_invalidate() that returns the previous value.

If the generic implementation of pmdp_invalidate() is used, the
architecture needs to provide an atomic pmdp_establish().

pmdp_establish() is not used outside the generic implementation of
pmdp_invalidate() so far, but I think this can change in the future.
      
      This patch (of 12):
      
This is an implementation of pmdp_establish() that is only suitable for
an architecture that doesn't have hardware dirty/accessed bits.  In this
case we can't race with a CPU setting these bits, so a non-atomic
approach is fine.
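A minimal sketch of such a helper (this is essentially what the generic
version looks like):

  static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
                  unsigned long address, pmd_t *pmdp, pmd_t pmd)
  {
          /* no hardware dirty/accessed bits, so read-then-write is safe */
          pmd_t old_pmd = *pmdp;

          set_pmd_at(vma->vm_mm, address, pmdp, pmd);
          return old_pmd;
  }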
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c58f0bb7
• mm: get 7% more pages in a pagevec · 146500e9
      Matthew Wilcox authored
      
      
      We don't have to use an entire 'long' for the number of elements in the
      pagevec; we know it's a number between 0 and 14 (now 15).  So we can
      store it in a char, and then the bool packs next to it and we still have
      two or six bytes of padding for more elements in the header.  That gives
      us space to cram in an extra page.
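Schematically, the layout change amounts to the following (the name of
the bool field here is illustrative):

  struct pagevec {
          unsigned char nr;                   /* 0..15 fits in a char */
          bool drained;                       /* the existing bool packs next to it */
          struct page *pages[PAGEVEC_SIZE];   /* PAGEVEC_SIZE bumped from 14 to 15 */
  };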
      
      Link: http://lkml.kernel.org/r/20171206022521.GM26021@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      146500e9
• mm: add unmap_mapping_pages() · 977fbdcd
      Matthew Wilcox authored
      
      
      Several users of unmap_mapping_range() would prefer to express their
range in pages rather than bytes.  Unfortunately, on a 32-bit kernel, you
      have to remember to cast your page number to a 64-bit type before
      shifting it, and four places in the current tree didn't remember to do
      that.  That's a sign of a bad interface.
      
      Conveniently, unmap_mapping_range() actually converts from bytes into
      pages, so hoist the guts of unmap_mapping_range() into a new function
      unmap_mapping_pages() and convert the callers which want to use pages.
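The resulting interface is roughly (a sketch of the signature and
semantics):

  /*
   * Unmap all userspace mappings of pages in the range [start, start + nr)
   * of this mapping; with even_cows set, private COW copies are unmapped too.
   */
  void unmap_mapping_pages(struct address_space *mapping,
                           pgoff_t start, pgoff_t nr, bool even_cows);

unmap_mapping_range() then becomes a thin wrapper that converts its
byte-based holebegin/holelen arguments into a (start, nr) page range.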
      
      Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      977fbdcd
• mm, userfaultfd, THP: avoid waiting when PMD under THP migration · a365ac09
      Huang Ying authored
      
      
      If THP migration is enabled, for a VMA handled by userfaultfd, consider
      the following situation,
      
        do_page_fault()
          __do_huge_pmd_anonymous_page()
           handle_userfault()
             userfault_msg()
               /* a huge page is allocated and mapped at fault address */
               /* the huge page is under migration, leaves migration entry
                  in page table */
             userfaultfd_must_wait()
               /* return true because !pmd_present() */
             /* may wait in loop until fatal signal */
      
That is, it is possible for userfaultfd_must_wait() to encounter a PMD
entry which is !pmd_none() && !pmd_present().  In the current
      implementation, we will wait for such PMD entries, which may cause
      unnecessary waiting, and potential soft lockup.
      
This is fixed by not waiting when !pmd_none() && !pmd_present(); we only
wait when pmd_none().
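Conceptually, the PMD handling in userfaultfd_must_wait() becomes
something like this sketch (not the literal diff):

  _pmd = READ_ONCE(*pmd);
  if (pmd_none(_pmd)) {
          ret = true;             /* nothing mapped yet: wait for a userfault */
          goto out;
  }
  if (!pmd_present(_pmd))
          goto out;               /* e.g. THP migration entry: do not wait */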
      
This may not be a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked.  mremap() will write-lock
mm->mmap_sem.  And UFFDIO_COPY doesn't support copying THP mappings.  But
the change still makes the code more correct, and makes the PMD and PTE
code more consistent.
      
      Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a365ac09
• mm/huge_memory.c: fix comment in __split_huge_pmd_locked · 9bebc09f
      Yisheng Xie authored
      
      
pmd_trans_splitting() was removed after the THP refcounting redesign,
therefore the related comment should be updated.
      
      Link: http://lkml.kernel.org/r/1512625745-59451-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bebc09f
• mm: memory_hotplug: remove second __nr_to_section in register_page_bootmem_info_section() · 9ac9322d
      Oscar Salvador authored
      
      
      In register_page_bootmem_info_section() we call __nr_to_section() in
      order to get the mem_section struct at the beginning of the function.
      Since we already got it, there is no need for a second call to
      __nr_to_section().
      
      Link: http://lkml.kernel.org/r/20171207102914.GA12396@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ac9322d
• fs/proc/task_mmu.c: do not show VmExe bigger than total executable virtual memory · 8526d84f
      Konstantin Khlebnikov authored
      
      
      If start_code / end_code pointers are screwed then "VmExe" could be
      bigger than total executable virtual memory and "VmLib" becomes
      negative:
      
        VmExe:	  294320 kB
        VmLib:	18446744073709327564 kB
      
VmExe and VmLib are documented as the text segment and shared library
code sizes.

Now their sum will always be equal to mm->exec_vm, which sums the sizes
of the executable, non-writable, non-stack areas.

I've seen this for a huge (>2Gb) statically linked binary which has the
whole world inside.  For it, the start_code .. end_code range also covers
one of the rodata sections.  Probably this is a bug in a customized
linker, the elf loader, or both.

Anyway, CONFIG_CHECKPOINT_RESTORE allows changing these pointers, thus we
cannot trust them without validation.
      
      Link: http://lkml.kernel.org/r/150728955451.743749.11276392315459539583.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8526d84f
• mm: update comment describing tlb_gather_mmu · ef549e13
      Mike Rapoport authored
      
      
      The comment describes @fullmm argument, but the function has no such
      parameter.
      
      Update the comment to match the code and convert it to kernel-doc
      markup.
      
      Link: http://lkml.kernel.org/r/1512394531-2264-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef549e13
• mm/memory_hotplug.c: remove unnecesary check from register_page_bootmem_info_section() · dc88c889
      Oscar Salvador authored
      
      
      When we call register_page_bootmem_info_section() having
      CONFIG_SPARSEMEM_VMEMMAP enabled, we check if the pfn is valid.
      
      This check is redundant as we already checked this in
      register_page_bootmem_info_node() before calling
      register_page_bootmem_info_section(), so let's get rid of it.
      
      Link: http://lkml.kernel.org/r/20171205143422.GA31458@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc88c889
• mm, hugetlb: remove hugepages_treat_as_movable sysctl · d6cb41cc
      Michal Hocko authored
hugepages_treat_as_movable was introduced by commit 396faf03 ("Allow
huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
allocations from ZONE_MOVABLE even when hugetlb pages were not
migratable.  The purpose of the movable zone was different at the time.
It aimed at reducing memory fragmentation, and hugetlb pages, being
long-lived and large, were not contributing to the fragmentation, so it
was acceptable to use the zone back then.

Things have changed, though, and the primary purpose of the zone became
the migratability guarantee.  If we allow non-migratable hugetlb pages in
ZONE_MOVABLE, memory hotplug might fail to offline the memory.
      
      Remove the knob and only rely on hugepage_migration_supported to allow
      movable zones.
      
      Mel said:
      
      : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
      : the ability to grow it again.  The use case was for batched jobs, some of
      : which needed huge pages and others that did not but didn't want the memory
: uselessly pinned in the huge pages pool.
      :
      : I suspect that more users rely on THP than hugetlbfs for flexible use of
      : huge pages with fallback options so I think that removing the option
      : should be ok.
      
      Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6cb41cc
• mm: remove unused pgdat_reclaimable_pages() · a4ef8768
      Jan Kara authored
      
      
      Remove unused function pgdat_reclaimable_pages() and
      node_page_state_snapshot() which becomes unused as well.
      
      Link: http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4ef8768
• mm/interval_tree.c: use vma_pages() helper · e025f059
      Vasyl Gomonovych authored
      
      
Use the vma_pages() helper on the vma instead of computing the size
explicitly.
      
        mm/interval_tree.c:21:27-33: WARNING: Consider using vma_pages helper
      
      Generated by: scripts/coccinelle/api/vma_pages.cocci
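For reference, vma_pages() is essentially:

  static inline unsigned long vma_pages(struct vm_area_struct *vma)
  {
          return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
  }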
      
      Link: http://lkml.kernel.org/r/1511364410-13499-1-git-send-email-gomonovych@gmail.com
Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e025f059
• selftests/vm: move 128TB mmap boundary test to generic directory · 235266b8
      Aneesh Kumar K.V authored
      
      
Architectures like PPC64 support mmap hint address based large address
space selection.  This test can be run on those architectures too.  Move
the test from the x86 selftests to selftests/vm so that other
architectures can use it too.

We also add a few new test scenarios in this patch.  We test a few
boundary conditions before we do a high address mmap.  PPC64 uses the
address limit to validate the address in the fault path.  We had bugs in
this area w.r.t. SLB fault handling before we updated the address limit.
      
      We also touch the allocated space to make sure we don't have any bugs in
      the fault handling path.
      
      [akpm@linux-foundation.org: restore tools/testing/selftests/vm/Makefile alpha ordering]
      Link: http://lkml.kernel.org/r/20171123165226.32582-1-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      235266b8
• mm: do not stall register_shrinker() · e496612c
      Minchan Kim authored
      
      
      Shakeel Butt reported he has observed in production systems that the job
      loader gets stuck for 10s of seconds while doing a mount operation.  It
      turns out that it was stuck in register_shrinker() because some
      unrelated job was under memory pressure and was spending time in
      shrink_slab().  Machines have a lot of shrinkers registered and jobs
      under memory pressure have to traverse all of those memcg-aware
      shrinkers and affect unrelated jobs which want to register their own
      shrinkers.
      
To solve the issue, this patch simply bails out of slab shrinking if it
is found that someone wants to register a shrinker in parallel.  A
downside is that it could cause unfair shrinking between shrinkers.
However, it should be rare and we can add more complicated logic if we
find it's not enough.
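The bail-out boils down to checking for a contended shrinker_rwsem
inside the shrinker walk; a sketch of the idea:

  list_for_each_entry(shrinker, &shrinker_list, list) {
          ...
          /*
           * Bail out if someone wants to register a new shrinker, to
           * prevent the registration from being stalled for long periods
           * by parallel ongoing shrinking.
           */
          if (rwsem_is_contended(&shrinker_rwsem)) {
                  freed = freed ? : 1;    /* tell callers we made progress */
                  break;
          }
          ...
  }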
      
      [akpm@linux-foundation.org: tweak code comment]
      Link: http://lkml.kernel.org/r/20171115005602.GB23810@bbox
      Link: http://lkml.kernel.org/r/1511481899-20335-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Shakeel Butt <shakeelb@google.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e496612c
• mm/page_alloc.c: fix comment in __get_free_pages() · 48128397
      Jiankang Chen authored
      
      
      __get_free_pages() will return a virtual address, but it is not just a
      32-bit address, for example on a 64-bit system.  And this comment really
      confuses new readers of mm.
      
      Link: http://lkml.kernel.org/r/1511780964-64864-1-git-send-email-chenjiankang1@huawei.com
Signed-off-by: Jiankang Chen <chenjiankang1@huawei.com>
Reported-by: Hanjun Guo <guohanjun@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48128397
• mm/page_owner.c: use PTR_ERR_OR_ZERO() · 8e33771c
      Vasyl Gomonovych authored
      
      
      Fix ptr_ret.cocci warnings:
      
        mm/page_owner.c:639:1-3: WARNING: PTR_ERR_OR_ZERO can be used
      
      Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
      
      Generated by: scripts/coccinelle/api/ptr_ret.cocci
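Schematically, the transformation is (the variable name here is
illustrative):

  /* before */
  if (IS_ERR(dentry))
          return PTR_ERR(dentry);
  return 0;

  /* after */
  return PTR_ERR_OR_ZERO(dentry);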
      
      Link: http://lkml.kernel.org/r/1511824101-9597-1-git-send-email-gomonovych@gmail.com
Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e33771c
• mm: memcontrol: fix excessive complexity in memory.stat reporting · a983b5eb
      Johannes Weiner authored
      
      
      We've seen memory.stat reads in top-level cgroups take up to fourteen
      seconds during a userspace bug that created tens of thousands of ghost
      cgroups pinned by lingering page cache.
      
      Even with a more reasonable number of cgroups, aggregating memory.stat
      is unnecessarily heavy.  The complexity is this:
      
      	nr_cgroups * nr_stat_items * nr_possible_cpus
      
      where the stat items are ~70 at this point.  With 128 cgroups and 128
      CPUs - decent, not enormous setups - reading the top-level memory.stat
      has to aggregate over a million per-cpu counters.  This doesn't scale.
      
      Instead of spreading the source of truth across all CPUs, use the
      per-cpu counters merely to batch updates to shared atomic counters.
      
      This is the same as the per-cpu stocks we use for charging memory to the
      shared atomic page_counters, and also the way the global vmstat counters
      are implemented.
      
      Vmstat has elaborate spilling thresholds that depend on the number of
      CPUs, amount of memory, and memory pressure - carefully balancing the
      cost of counter updates with the amount of per-cpu error.  That's
      because the vmstat counters are system-wide, but also used for decisions
      inside the kernel (e.g.  NR_FREE_PAGES in the allocator).  Neither is
      true for the memory controller.
      
      Use the same static batch size we already use for page_counter updates
      during charging.  The per-cpu error in the stats will be 128k, which is
      an acceptable ratio of cores to memory accounting granularity.
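The batching pattern is the usual one; a simplified sketch (not the
exact memcg code):

  static inline void mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
  {
          long x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);

          if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
                  /* spill the accumulated per-cpu delta into the shared atomic */
                  atomic_long_add(x, &memcg->stat[idx]);
                  x = 0;
          }
          __this_cpu_write(memcg->stat_cpu->count[idx], x);
  }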
      
      [hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls]
        Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
      Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a983b5eb
• mm: memcontrol: implement lruvec stat functions on top of each other · 28454265
      Johannes Weiner authored
      
      
      The implementation of the lruvec stat functions and their variants for
      accounting through a page, or accounting from a preemptible context, are
      mostly identical and needlessly repetitive.
      
      Implement the lruvec_page functions by looking up the page's lruvec and
      then using the lruvec function.
      
      Implement the functions for preemptible contexts by disabling preemption
      before calling the atomic context functions.
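In sketch form (helper names abbreviated, not the exact code), the
layering looks like:

  static inline void __mod_lruvec_page_state(struct page *page,
                                             enum node_stat_item idx, int val)
  {
          /* the page variant just resolves the page's lruvec ... */
          struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

          /* ... and reuses the lruvec function */
          __mod_lruvec_state(lruvec, idx, val);
  }

  static inline void mod_lruvec_state(struct lruvec *lruvec,
                                      enum node_stat_item idx, int val)
  {
          /* the preemptible variant wraps the atomic-context one */
          preempt_disable();
          __mod_lruvec_state(lruvec, idx, val);
          preempt_enable();
  }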
      
      Link: http://lkml.kernel.org/r/20171103153336.24044-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28454265
• mm: memcontrol: eliminate raw access to stat and event counters · c9019e9b
      Johannes Weiner authored
      
      
      Replace all raw 'this_cpu_' modifications of the stat and event per-cpu
      counters with API functions such as mod_memcg_state().
      
      This makes the code easier to read, but is also in preparation for the
      next patch, which changes the per-cpu implementation of those counters.
      
      Link: http://lkml.kernel.org/r/20171103153336.24044-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9019e9b
• mm/filemap.c: remove include of hardirq.h · 2b9fceb3
      Yang Shi authored
      
      
in_atomic() has been moved to include/linux/preempt.h, and filemap.c
doesn't use in_atomic() directly at all, so there is no need to include
hardirq.h.
      
      Link: http://lkml.kernel.org/r/1509985319-38633-1-git-send-email-yang.s@alibaba-inc.com
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b9fceb3
• mm: split deferred_init_range into initializing and freeing parts · 80b1f41c
      Pavel Tatashin authored
      
      
In deferred_init_range() we initialize struct pages and also free them
to the buddy allocator.  We do it in separate loops, because the buddy
page is computed ahead of time, so we do not want to access a struct page
that has not been initialized yet.

There is still, however, a corner case where it is potentially possible
to access an uninitialized struct page: when the buddy page is from the
next memblock range.
      
      This patch fixes this problem by splitting deferred_init_range() into
      two functions: one to initialize struct pages, and another to free them.
      
      In addition, this patch brings the following improvements:
 - Gets rid of the __def_free() helper function, and simplifies the loop
   logic by adding a new pfn validity check function, deferred_pfn_valid().
 - Reduces the number of variables that we track, so there is a higher
   chance that we will avoid using the stack to store/load variables
   inside hot loops.
 - Enables future multi-threading of these functions: do the
   initialization in multiple threads, wait for all threads to finish,
   then do the freeing part in multiple threads.
      
      Tested on x86 with 1T of memory to make sure no regressions are
      introduced.
      
      [akpm@linux-foundation.org: fix spello in comment]
      Link: http://lkml.kernel.org/r/20171107150446.32055-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80b1f41c
• mm: use sc->priority for slab shrink targets · 9092c71b
      Josef Bacik authored
      
      
      Previously we were using the ratio of the number of lru pages scanned to
      the number of eligible lru pages to determine the number of slab objects
      to scan.  The problem with this is that these two things have nothing to
      do with each other, so in slab heavy work loads where there is little to
      no page cache we can end up with the pages scanned being a very low
      number.  This means that we reclaim next to no slab pages and waste a
      lot of time reclaiming small amounts of space.
      
      Consider the following scenario, where we have the following values and
      the rest of the memory usage is in slab
      
        Active:            58840 kB
        Inactive:          46860 kB
      
      Every time we do a get_scan_count() we do this
      
        scan = size >> sc->priority
      
      where sc->priority starts at DEF_PRIORITY, which is 12.  The first loop
      through reclaim would result in a scan target of 2 pages to 11715 total
      inactive pages, and 3 pages to 14710 total active pages.  This is a
      really really small target for a system that is entirely slab pages.
      And this is super optimistic, this assumes we even get to scan these
      pages.  We don't increment sc->nr_scanned unless we 1) isolate the page,
      which assumes it's not in use, and 2) can lock the page.  Under pressure
      these numbers could probably go down, I'm sure there's some random pages
      from daemons that aren't actually in use, so the targets get even
      smaller.
      
      Instead use sc->priority in the same way we use it to determine scan
      amounts for the lru's.  This generally equates to pages.  Consider the
      following
      
        slab_pages = (nr_objects * object_size) / PAGE_SIZE
      
      What we would like to do is
      
        scan = slab_pages >> sc->priority
      
      but we don't know the number of slab pages each shrinker controls, only
      the objects.  However say that theoretically we knew how many pages a
      shrinker controlled, we'd still have to convert this to objects, which
      would look like the following
      
        scan = shrinker_pages >> sc->priority
        scan_objects = (PAGE_SIZE / object_size) * scan
      
      or written another way
      
        scan_objects = (shrinker_pages >> sc->priority) *
      		 (PAGE_SIZE / object_size)
      
      which can thus be written
      
        scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
      		 sc->priority
      
      which is just
      
        scan_objects = nr_objects >> sc->priority
      
We don't need to know exactly how many pages each shrinker represents;
its objects are all the information we need.  Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
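In do_shrink_slab() the scan target then becomes, roughly:

  freeable = shrinker->count_objects(shrinker, shrinkctl);

  delta = freeable >> priority;    /* priority-scaled, like the LRU scan counts */
  delta *= 4;
  do_div(delta, shrinker->seeks);  /* account for shrinker->seeks (DEFAULT_SEEKS == 2) */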
      
      Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Dave Chinner <david@fromorbit.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9092c71b
• mm: show total hugetlb memory consumption in /proc/meminfo · fcb2b0c5
      Roman Gushchin authored
      
      
      Currently we display some hugepage statistics (total, free, etc) in
      /proc/meminfo, but only for default hugepage size (e.g.  2Mb).
      
If hugepages of different sizes are used (like 2Mb and 1Gb on x86-64),
/proc/meminfo output can be confusing, as non-default sized hugepages are
not reflected at all, and there is no sign that they exist and are
consuming system memory.

To solve this problem, let's display the total amount of memory consumed
by hugetlb pages of all sizes (both free and used).  Let's call it
"Hugetlb", and display the size in kB to match the generic /proc/meminfo
style.
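The reported value is just a sum over all hstates; roughly:

  unsigned long total = 0;
  struct hstate *h;

  for_each_hstate(h)
          total += h->nr_huge_pages * pages_per_huge_page(h);

  seq_printf(m, "Hugetlb:        %8lu kB\n", total * (PAGE_SIZE / 1024));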
      
      For example, (1024 2Mb pages and 2 1Gb pages are pre-allocated):
        $ cat /proc/meminfo
        MemTotal:        8168984 kB
        MemFree:         3789276 kB
        <...>
        CmaFree:               0 kB
        HugePages_Total:    1024
        HugePages_Free:     1024
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:         4194304 kB
        DirectMap4k:       32632 kB
        DirectMap2M:     4161536 kB
        DirectMap1G:     6291456 kB
      
      Also, this patch updates corresponding docs to reflect Hugetlb entry
      meaning and difference between Hugetlb and HugePages_Total * Hugepagesize.
      
      Link: http://lkml.kernel.org/r/20171115231409.12131-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcb2b0c5
• mm: drop hotplug lock from lru_add_drain_all() · 9852a721
      Michal Hocko authored
      Pulling cpu hotplug locks inside the mm core function like
      lru_add_drain_all just asks for problems and the recent lockdep splat
      [1] just proves this.  While the usage in that particular case might be
      wrong we should avoid the locking as lru_add_drain_all() is used in many
      places.  It seems that this is not all that hard to achieve actually.
      
We have done the same thing for drain_all_pages(), which is analogous,
in commit a459eeb7 ("mm, page_alloc: do not depend on cpu hotplug locks
inside the allocator").  All we have to care about is to handle:
      
      - the work item might be executed on a different cpu, in a worker
        from an unbound pool, so it doesn't run pinned on the cpu
      
            - we have to make sure that we do not race with page_alloc_cpu_dead
              calling lru_add_drain_cpu
      
The first part is already handled because the worker calls lru_add_drain,
which disables preemption when calling lru_add_drain_cpu on the local cpu
it is draining.  The latter is true because page_alloc_cpu_dead is called
on the controlling CPU after the hotplugged CPU has vanished completely.
      
      [1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com
      
      [add a cpu hotplug locking interaction as per tglx]
      Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9852a721
• mm/mempolicy: add nodes_empty check in SYSC_migrate_pages · 0486a38b
      Yisheng Xie authored
      
      
As the manpage of migrate_pages states, errno should be set to EINVAL
when none of the node IDs specified by new_nodes are on-line and allowed
by the process's current cpuset context, or none of the specified nodes
contain memory.  However, when testing the following case:
      
      	new_nodes = 0;
      	old_nodes = 0xf;
      	ret = migrate_pages(pid, old_nodes, new_nodes, MAX);
      
The ret will be 0 and no errno is set.  As new_nodes is empty, we should
expect EINVAL as documented.

To fix cases like the above, this patch checks whether the target nodes
AND the current task_nodes is empty, and then checks whether the result
AND node_states[N_MEMORY] is empty.
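In sketch form, the added checks look like (error handling elided):

  /* intersect the requested target nodes with the task's allowed nodes */
  nodes_and(*new, *new, task_nodes);
  if (nodes_empty(*new))
          goto out_put;           /* -EINVAL: nothing usable was requested */

  /* ... and with the set of nodes that actually have memory */
  nodes_and(*new, *new, node_states[N_MEMORY]);
  if (nodes_empty(*new))
          goto out_put;           /* -EINVAL: no memory on any requested node */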
      
      Link: http://lkml.kernel.org/r/1510882624-44342-4-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0486a38b
• mm/mempolicy: fix the check of nodemask from user · 56521e7a
      Yisheng Xie authored
      
      
As Xiaojun reported, the LTP test migrate_pages01 will fail on an arm64
system which has 4 nodes [0...3], all with memory, and CONFIG_NODES_SHIFT=2:
      
        migrate_pages01    0  TINFO  :  test_invalid_nodes
        migrate_pages01   14  TFAIL  :  migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
        migrate_pages01   15  TFAIL  :  migrate_pages_common.c:55: call succeeded unexpectedly
      
In this case, test_invalid_nodes of migrate_pages01 will call
SYSC_migrate_pages as:
      
        migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0
      
The new nodes argument specifies one or more node IDs that are greater
than the maximum supported node ID; however, errno is not set to EINVAL
as expected.

As the man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
mention, when nodemask specifies one or more node IDs that are greater
than the maximum supported node ID, errno should be set to EINVAL.
However, get_nodes() only checks whether the bits in
[BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) are zero or not,
and leaves [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES))
unchecked.

This patch checks the bits in [MAX_NUMNODES, maxnode) in get_nodes() so
that migrate_pages() sets errno to EINVAL when nodemask specifies one or
more node IDs that are greater than the maximum supported node ID,
following the man pages' guidance.
      
      [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
      [2] http://man7.org/linux/man-pages/man2/mbind.2.html
      [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html
      
      Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reported-by: Tan Xiaojun <tanxiaojun@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56521e7a
• mm/mempolicy: remove redundant check in get_nodes · 66f308ed
      Yisheng Xie authored
      
      
      We have already checked whether maxnode is a page worth of bits, by:
          maxnode > PAGE_SIZE*BITS_PER_BYTE
      
      So no need to check it once more.
      
      Link: http://lkml.kernel.org/r/1510882624-44342-2-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66f308ed
• mm: relax deferred struct page requirements · 2e3ca40f
      Pavel Tatashin authored
      
      
      There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT, as all
      the page initialization code is in common code.
      
      Also, there is no need to depend on MEMORY_HOTPLUG, as initialization
      code does not really use hotplug memory functionality.  So, we can
      remove this requirement as well.
      
This patch allows deferred struct page initialization to be used on all
platforms with the memblock allocator.
      
      Tested on x86, arm64, and sparc.  Also, verified that code compiles on
      PPC with CONFIG_MEMORY_HOTPLUG disabled.
      
      Link: http://lkml.kernel.org/r/20171117014601.31606-1-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e3ca40f
• zswap: same-filled pages handling · a85f878b
      Srividya Desireddy authored
      
      
      Zswap is a cache which compresses the pages that are being swapped out
      and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. the contents of the page are all the same value),
but these pages are handled as normal pages by compressing them and
allocating memory in the pool.
      
This patch adds a check in zswap_frontswap_store() to identify a
same-filled page before compressing it.  If the page is same-filled, set
zswap_entry.length to zero, save the same-filled value, and skip the
compression of the page and the allocation of memory in the zpool.  In
zswap_frontswap_load(), check if the value of zswap_entry.length for the
page to be loaded is zero.  If it is, fill the page with the saved
same-filled value.  This saves the decompression time during load.
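The same-filled test itself is a simple word-by-word scan of the page,
along these lines (a sketch; the load side then just uses memset_l()
with the saved value):

  static bool zswap_is_page_same_filled(void *ptr, unsigned long *value)
  {
          unsigned long *page = ptr;
          int pos;

          for (pos = 1; pos < PAGE_SIZE / sizeof(*page); pos++) {
                  if (page[pos] != page[0])
                          return false;
          }
          *value = page[0];       /* remember the fill value for the load path */
          return true;
  }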
      
On an ARM quad-core 32-bit device with 1.5GB RAM, launching and
relaunching different applications, out of ~64000 pages stored in zswap
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.

An average of 17% of pages (including zero-filled pages) in zswap are
same-value filled pages and 14% of pages are zero-filled pages.  An
average of 3% of pages are same-filled non-zero pages.
      
      The below table shows the execution time profiling with the patch.
      
                                  Baseline    With patch  % Improvement
        -----------------------------------------------------------------
        *Zswap Store Time           26.5ms       18ms          32%
         (of same value pages)
        *Zswap Load Time
         (of same value pages)      25.5ms       13ms          49%
        -----------------------------------------------------------------
      
On an Ubuntu PC with 2GB RAM, while executing kernel builds and other
test scripts and running multimedia applications, out of 360000 pages
stored in zswap 78000 (~22%) pages were found to be same-value filled
pages (including zero-filled pages) and 64000 (~17%) were zero-filled
pages.  So an average of 5% of pages are same-filled non-zero pages.
      
      The below table shows the execution time profiling with the patch.
      
                                  Baseline    With patch  % Improvement
        -----------------------------------------------------------------
        *Zswap Store Time           91ms        74ms           19%
         (of same value pages)
        *Zswap Load Time            50ms        7.5ms          85%
         (of same value pages)
        -----------------------------------------------------------------
      
      *The execution times may vary with test device used.
      
      Dan said:
      
      : I did test this patch out this week, and I added some instrumentation to
      : check the performance impact, and tested with a small program to try to
      : check the best and worst cases.
      :
      : When doing a lot of swap where all (or almost all) pages are same-value, I
      : found this patch does save both time and space, significantly.  The exact
      : improvement in time and space depends on which compressor is being used,
      : but roughly agrees with the numbers you listed.
      :
      : In the worst case situation, where all (or almost all) pages have the
      : same-value *except* the final long (meaning, zswap will check each long on
      : the entire page but then still have to pass the page to the compressor),
      : the same-value check is around 10-15% of the total time spent in
      : zswap_frontswap_store().  That's a not-insignificant amount of time, but
      : it's not huge.  Considering that most systems will probably be swapping
      : pages that aren't similar to the worst case (although I don't have any
      : data to know that), I'd say the improvement is worth the possible
      : worst-case performance impact.
      
      [srividya.dr@samsung.com: add memset_l instead of for loop]
      Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
      Cc: SHARAN ALLUR <sharan.allur@samsung.com>
      Cc: RAJIB BASU <rajib.basu@samsung.com>
      Cc: JUHUN KIM <juhunkim@samsung.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Timofey Titovets <nefelim4ag@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85f878b
• mm: kmemleak: remove unused hardirq.h · 4a01768e
      Yang Shi authored
      
      
      Preempt counter APIs have been split out, currently, hardirq.h just
      includes irq_enter/exit APIs which are not used by kmemleak at all.
      
      So, remove the unused hardirq.h.
      
      Link: http://lkml.kernel.org/r/1510959741-31109-1-git-send-email-yang.s@alibaba-inc.com
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a01768e
• include/linux/sched/mm.h: uninline mmdrop_async(), etc · d70f2a14
      Andrew Morton authored
      
      
      mmdrop_async() is only used in fork.c.  Move that and its support
      functions into fork.c, uninline it all.
      
      Quite a lot of code gets moved around to avoid forward declarations.
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d70f2a14
• slub: remove obsolete comments of put_cpu_partial() · 0d2d5d40
      Miles Chen authored
Commit d6e0b7fa ("slub: make dead caches discard free slabs immediately")
makes put_cpu_partial() run with preemption disabled and interrupts
disabled when calling unfreeze_partials().
      
      The comment: "put_cpu_partial() is done without interrupts disabled and
      without preemption disabled" looks obsolete, so remove it.
      
      Link: http://lkml.kernel.org/r/1516968550-1520-1-git-send-email-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d2d5d40
• mm/slub.c: fix wrong address during slab padding restoration · 5d682681
      Balasubramani Vivekanandan authored
      
      
The start address calculated for slab padding restoration was wrong.
The wrong address would point to some section before the padding and
could cause corruption.
      
      Link: http://lkml.kernel.org/r/1516604578-4577-1-git-send-email-balasubramani_vivekanandan@mentor.com
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d682681
• mm/slab.c: remove redundant assignments for slab_state · 84ebb582
      Oscar Salvador authored
      
      
      slab_state is being set to "UP" in create_kmalloc_caches(), and later on
      we set it again in kmem_cache_init_late(), but slab_state does not
      change in the meantime.
      
      Remove the redundant assignment from kmem_cache_init_late().
      
      And unless I overlooked anything, the same goes for "slab_state = FULL".
      slab_state is set to "FULL" in kmem_cache_init_late(), but it is later
      being set again in cpucache_init(), which gets called from
      do_initcall_level().  So remove the assignment from cpucache_init() as
      well.
      
      Link: http://lkml.kernel.org/r/20171215134452.GA1920@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84ebb582
• mm/slab_common.c: make calculate_alignment() static · 692ae74a
      Byongho Lee authored
      
      
The calculate_alignment() function is only used inside slab_common.c, so
make it static and let the compiler do more optimizations.
      
      After this patch there's a small improvement in text and data size.
      
        $ gcc --version
          gcc (GCC) 7.2.1 20171128
      
      Before:
        text	   data	    bss	    dec	     hex	filename
        9890457  3828702  1212364 14931523 e3d643	vmlinux
      
      After:
        text	   data	    bss	    dec	     hex	filename
        9890437  3828670  1212364 14931471 e3d60f	vmlinux
      
      Also I fixed a style problem reported by checkpatch.
      
        WARNING: Missing a blank line after declarations
        #53: FILE: mm/slab_common.c:286:
        +		unsigned long ralign = cache_line_size();
        +		while (size <= ralign / 2)
      
      Link: http://lkml.kernel.org/r/20171210080132.406-1-bhlee.kernel@gmail.com
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      692ae74a