  1. Nov 16, 2017
• mm: define memblock_virt_alloc_try_nid_raw · ea1f5f37
      Pavel Tatashin authored
      
      
      * A new variant of memblock_virt_alloc_* allocations:
      memblock_virt_alloc_try_nid_raw()
          - Does not zero the allocated memory
          - Does not panic if request cannot be satisfied
      
      * optimize early system hash allocations
      
Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
specify that the memory allocated for the system hash needs to be
zeroed; otherwise the memory does not need to be zeroed and the client
will initialize it.
      
If the memory does not need to be zeroed, call the new
memblock_virt_alloc_raw() interface and thus improve boot performance.
      
* debug for raw allocator
      
When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.
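
For illustration, a caller choosing between the two paths might look roughly
like this (a sketch with hypothetical variable names, not the patch's actual
diff; memblock_virt_alloc_raw() and HASH_ZERO are the interfaces described
above):

        /* caller initializes every entry itself, so skip zeroing */
        table = memblock_virt_alloc_raw(tbl_size, SMP_CACHE_BYTES);

        /* a hash table that must start zeroed asks for HASH_ZERO instead */
        my_hash = alloc_large_system_hash("my-cache",
                        sizeof(struct hlist_head), my_entries,
                        14, HASH_ZERO, &my_shift, &my_mask, 0, 0);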
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea1f5f37
• sparc64: simplify vmemmap_populate · df8ee578
      Pavel Tatashin authored
      
      
Remove duplicated code by using the common functions
vmemmap_pud_populate() and vmemmap_pgd_populate().
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-5-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df8ee578
• sparc64/mm: set fields in deferred pages · 2a20aa17
      Pavel Tatashin authored
      
      
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to their
first initialization in __init_single_page().
      
      With deferred struct page feature enabled there is a case where we set
      some fields prior to initializing:
      
      mem_init() {
           register_page_bootmem_info();
           free_all_bootmem();
           ...
      }
      
When register_page_bootmem_info() is called, only non-deferred struct
pages are initialized.  But this function goes through some reserved
pages which might be part of the deferred range, and thus are not yet
initialized.
      
      mem_init
      register_page_bootmem_info
      register_page_bootmem_info_node
       get_page_bootmem
        .. setting fields here ..
        such as: page->freelist = (void *)type;
      
      free_all_bootmem()
      free_low_memory_core_early()
       for_each_reserved_mem_region()
        reserve_bootmem_region()
         init_reserved_page() <- Only if this is deferred reserved page
          __init_single_pfn()
           __init_single_page()
     memset(0) <-- Lose the set fields here
      
We end up with a similar issue as in the previous patch: currently we do
not observe a problem because memory is zeroed.  But if flag asserts are
changed we can start hitting issues.
      
      Also, because in this patch series we will stop zeroing struct page
      memory during allocation, we must make sure that struct pages are
      properly initialized prior to using them.
      
      The deferred-reserved pages are initialized in free_all_bootmem().
      Therefore, the fix is to switch the above calls.
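
Illustrative sketch of the reordering described above (not the exact diff):

        mem_init() {
             free_all_bootmem();            /* initializes deferred-reserved pages */
             register_page_bootmem_info();  /* now safe to set fields */
             ...
        }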
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a20aa17
• x86/mm: set fields in deferred pages · 353b1e7b
      Pavel Tatashin authored
      
      
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to their
first initialization in __init_single_page().
      
With the deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info() that are subsequently clobbered right
afterwards in free_all_bootmem():
      
              mem_init() {
                      register_page_bootmem_info();
                      free_all_bootmem();
                      ...
              }
      
When register_page_bootmem_info() is called, only non-deferred struct
pages are initialized.  But this function goes through some reserved
pages which might be part of the deferred range, and thus are not yet
initialized.
      
        mem_init
         register_page_bootmem_info
          register_page_bootmem_info_node
           get_page_bootmem
            .. setting fields here ..
            such as: page->freelist = (void *)type;
      
        free_all_bootmem()
         free_low_memory_core_early()
          for_each_reserved_mem_region()
           reserve_bootmem_region()
            init_reserved_page() <- Only if this is deferred reserved page
             __init_single_pfn()
              __init_single_page()
            memset(0) <-- Lose the set fields here
      
We end up with an issue where, currently, we do not observe a problem
because memory is explicitly zeroed.  But if flag asserts are changed we
can start hitting issues.
      
      Also, because in this patch series we will stop zeroing struct page
      memory during allocation, we must make sure that struct pages are
      properly initialized prior to using them.
      
      The deferred-reserved pages are initialized in free_all_bootmem().
      Therefore, the fix is to switch the above calls.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      353b1e7b
• mm: deferred_init_memmap improvements · 2f47a91f
      Pavel Tatashin authored
      
      
      Patch series "complete deferred page initialization", v12.
      
      SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
      option, which defers initializing struct pages until all cpus have been
      started so it can be done in parallel.
      
      However, this feature is sub-optimal, because the deferred page
      initialization code expects that the struct pages have already been
      zeroed, and the zeroing is done early in boot with a single thread only.
      Also, we access that memory and set flags before struct pages are
      initialized.  All of this is fixed in this patchset.
      
      In this work we do the following:
       - Never read access struct page until it was initialized
       - Never set any fields in struct pages before they are initialized
       - Zero struct page at the beginning of struct page initialization
      
      ==========================================================================
      Performance improvements on x86 machine with 8 nodes:
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                              TIME          SPEED UP
      base no deferred:       95.796233s
      fix no deferred:        79.978956s    19.77%
      
      base deferred:          77.254713s
      fix deferred:           55.050509s    40.34%
      ==========================================================================
      SPARC M6 3600 MHz with 15T of memory
                              TIME          SPEED UP
      base no deferred:       358.335727s
      fix no deferred:        302.320936s   18.52%
      
      base deferred:          237.534603s
      fix deferred:           182.103003s   30.44%
      ==========================================================================
      Raw dmesg output with timestamps:
      x86 base no deferred:    https://hastebin.com/ofunepurit.scala
      x86 base deferred:       https://hastebin.com/ifazegeyas.scala
      x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
      x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
      sparc base no deferred:  https://hastebin.com/ibobeteken.go
      sparc base deferred:     https://hastebin.com/fariqimiyu.go
      sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
      sparc fix deferred:      https://hastebin.com/xadinobutu.go
      
      This patch (of 11):
      
      deferred_init_memmap() is called when struct pages are initialized later
      in boot by slave CPUs.  This patch simplifies and optimizes this
      function, and also fixes a couple issues (described below).
      
      The main change is that now we are iterating through free memblock areas
      instead of all configured memory.  Thus, we do not have to check if the
      struct page has already been initialized.
      
      =====
      In deferred_init_memmap() where all deferred struct pages are
      initialized we have a check like this:
      
        if (page->flags) {
      	VM_BUG_ON(page_zone(page) != zone);
      	goto free_range;
        }
      
This way we are checking whether the current deferred page has already
been initialized.  It works because memory for struct pages has been
zeroed, and the only way flags can be non-zero is if the page went
through __init_single_page() before.  But once we change the current
behavior and no longer zero the memory in the memblock allocator, we
cannot trust anything inside struct pages until they are initialized.
This patch fixes this.
      
      The deferred_init_memmap() is re-written to loop through only free
      memory ranges provided by memblock.
      
Note, this first issue becomes relevant only once the later change in
this series, which stops zeroing struct page memory in the memblock
allocator, is merged.
      
      =====
This patch fixes another existing issue on systems that have holes in
zones, i.e. CONFIG_HOLES_IN_ZONE is defined.
      
      In for_each_mem_pfn_range() we have code like this:
      
  if (!pfn_valid_within(pfn))
	goto free_range;
      
Note: 'page' is not set to NULL and is not incremented, but 'pfn'
advances.  This means that if deferred struct pages are enabled on
systems with this kind of holes, Linux would get memory corruption.  I
have fixed this issue by defining a new macro that performs all the
necessary operations when we free the current set of pages.
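
A rough sketch of the new iteration shape (illustrative only;
deferred_init_range() is a hypothetical stand-in for the per-range work, and
the real patch differs in detail):

        u64 i;
        phys_addr_t spa, epa;
        unsigned long nr_pages = 0;

        /* Walk only ranges memblock reports as free, so every pfn we visit
         * still needs __init_single_page() and holes are never touched. */
        for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
                unsigned long spfn = PFN_UP(spa);
                unsigned long epfn = PFN_DOWN(epa);

                nr_pages += deferred_init_range(nid, zid, spfn, epfn);
        }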
      
      [pasha.tatashin@oracle.com: buddy page accessed before initialized]
        Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f47a91f
• mm/swap_state.c: declare a few variables as __read_mostly · 783cb68e
      Changbin Du authored
      
      
      These global variables are only set during initialization or rarely
      change, so declare them as __read_mostly.
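
For reference, the annotation is just a section attribute on the declaration,
e.g. (hypothetical variable, not one from the patch):

        /* set once during initialization, read frequently afterwards */
        static unsigned long my_cached_limit __read_mostly;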
      
      Link: http://lkml.kernel.org/r/1507802349-5554-1-git-send-email-changbin.du@intel.com
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      783cb68e
• kmemcheck: rip it out · 4675ff05
      Levin, Alexander (Sasha Levin) authored
      
      
      Fix up makefiles, remove references, and git rm kmemcheck.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4675ff05
• kmemcheck: remove whats left of NOTRACK flags · d8be7566
      Levin, Alexander (Sasha Levin) authored
      
      
      Now that kmemcheck is gone, we don't need the NOTRACK flags.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8be7566
• kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK · 75f296d9
      Levin, Alexander (Sasha Levin) authored
      
      
      Convert all allocations that used a NOTRACK flag to stop using it.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75f296d9
• kmemcheck: remove annotations · 49502766
      Levin, Alexander (Sasha Levin) authored
      
      
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitation of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49502766
• mm/rmap.c: remove redundant variable cend · cdb07bde
      Colin Ian King authored
      Variable cend is set but never read, hence it is redundant and can be
      removed.
      
      Cleans up clang build warning: Value stored to 'cend' is never read
      
      Link: http://lkml.kernel.org/r/20171011174942.1372-1-colin.king@canonical.com
Fixes: 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdb07bde
• fs, mm: account filp cache to kmemcg · f3f7c093
      Shakeel Butt authored
      
      
The allocations from the filp cache can be directly triggered by
userspace applications.  A buggy application can consume a significant
amount of unaccounted system memory.  Though we have not noticed such
buggy applications in our production environment, upon close inspection
we found that a lot of machines spend a very significant amount of
memory on these caches.
      
One way to limit allocations from the filp cache is to set a
system-level limit on the maximum number of open files.  However, this
limit is shared between different users on the system and one user can
hog this resource.  To address that, we can charge filp to kmemcg, set
the maximum limit very high, and let the memory limit of each user bound
the number of files they can open, indirectly limiting their allocations
from the filp cache.
      
One side effect of this change is that it will allow _sysctl() to return
ENOMEM, and the man page of _sysctl() does not specify that.  However,
the man page also discourages using _sysctl() at all.
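
The mechanism boils down to creating the filp cache with kmemcg accounting
enabled, roughly (a sketch of the idea rather than the exact diff):

        filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
                        SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);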
      
      Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3f7c093
• mm: consolidate page table accounting · af5b0f6a
      Kirill A. Shutemov authored
      
      
      Currently, we account page tables separately for each page table level,
      but that's redundant -- we only make use of total memory allocated to
      page tables for oom_badness calculation.  We also provide the
      information to userspace, but it has dubious value there too.
      
This patch switches page table accounting to a single counter.
      
      mm->pgtables_bytes is now used to account all page table levels.  We use
      bytes, because page table size for different levels of page table tree
      may be different.
      
      The change has user-visible effect: we don't have VmPMD and VmPUD
      reported in /proc/[pid]/status.  Not sure if anybody uses them.  (As
      alternative, we can always report 0 kB for them.)
      
      OOM-killer report is also slightly changed: we now report pgtables_bytes
      instead of nr_ptes, nr_pmd, nr_puds.
      
      Apart from reducing number of counters per-mm, the benefit is that we
      now calculate oom_badness() more correctly for machines which have
      different size of page tables depending on level or where page tables
      are less than a page in size.
      
      The only downside can be debuggability because we do not know which page
      table level could leak.  But I do not remember many bugs that would be
      caught by separate counters so I wouldn't lose sleep over this.
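
Conceptually, the per-level counters collapse into one byte-based counter
with small helpers along these lines (a sketch; the real helpers live in
include/linux/mm.h):

        static inline unsigned long mm_pgtables_bytes(const struct mm_struct *mm)
        {
                return atomic_long_read(&mm->pgtables_bytes);
        }

        static inline void mm_inc_nr_ptes(struct mm_struct *mm)
        {
                atomic_long_add(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes);
        }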
      
      [akpm@linux-foundation.org: fix mm/huge_memory.c]
      Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      [kirill.shutemov@linux.intel.com: fix build]
        Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af5b0f6a
• mm: introduce wrappers to access mm->nr_ptes · c4812909
      Kirill A. Shutemov authored
      
      
      Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
      and nr_pud.
      
      The patch also makes nr_ptes accounting dependent onto CONFIG_MMU.  Page
      table accounting doesn't make sense if you don't have page tables.
      
      It's preparation for consolidation of page-table counters in mm_struct.
      
      Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4812909
• mm: account pud page tables · b4e98d9a
      Kirill A. Shutemov authored
On a machine with 5-level paging support a process can allocate a
significant amount of memory and stay unnoticed by the oom-killer and
memory cgroup.  The trick is to allocate a lot of PUD page tables.  We
don't account PUD page tables, only PMD and PTE.
      
We already addressed the same issue for PMD page tables, see commit
dc6c9a35 ("mm: account pmd page tables to the process").
      Introduction of 5-level paging brings the same issue for PUD page
      tables.
      
      The patch expands accounting to PUD level.
      
      [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
        Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
      [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
        Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
      Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4e98d9a
• kmemleak: change /sys/kernel/debug/kmemleak permissions from 0444 to 0644 · 7d6c4dfa
      Konstantin Khlebnikov authored
      
      
Kmemleak can be tweaked at runtime by writing commands into its debugfs
file.  Root can use it anyway, but without the write bit this interface
isn't obvious.
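
The change itself is just the mode bits in the debugfs registration,
approximately (sketch of the relevant call in mm/kmemleak.c):

        debugfs_create_file("kmemleak", 0644, NULL, NULL, &kmemleak_fops);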
      
      Link: http://lkml.kernel.org/r/150728996582.744328.11541332857988399411.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d6c4dfa
• cifs: use find_get_pages_range_tag() · 9c19a9cb
      Jan Kara authored
      
      
      wdata_alloc_and_fillpages() needlessly iterates calls to
      find_get_pages_tag().  Also it wants only pages from given range.  Make
      it use find_get_pages_range_tag().
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-17-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c19a9cb
• afs: use find_get_pages_range_tag() · aef6e415
      Jan Kara authored
      
      
      Use find_get_pages_range_tag() in afs_writepages_region() as we are
      interested only in pages from given range.  Remove unnecessary code
      after this conversion.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-16-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aef6e415
• mm: remove nr_pages argument from pagevec_lookup_{,range}_tag() · 67fd707f
      Jan Kara authored
      
      
All users of pagevec_lookup_tag() and pagevec_lookup_range_tag() now
pass PAGEVEC_SIZE as the desired number of pages.  Just drop the argument.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67fd707f
• ceph: use pagevec_lookup_range_nr_tag() · 4be90299
      Jan Kara authored
      
      
      Use new function for looking up pages since nr_pages argument from
      pagevec_lookup_range_tag() is going away.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-14-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4be90299
• mm: add variant of pagevec_lookup_range_tag() taking number of pages · 93d3b714
      Jan Kara authored
      
      
Currently pagevec_lookup_range_tag() takes the number of pages to look
up, but most users don't need this.  Create a new function
pagevec_lookup_range_nr_tag() that takes a maximum number of pages to
look up for Ceph, which wants this functionality, so that we can drop
the nr_pages argument from pagevec_lookup_range_tag().
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93d3b714
• mm: use pagevec_lookup_range_tag() in write_cache_pages() · 2b9775ae
      Jan Kara authored
      
      
      Use pagevec_lookup_range_tag() in write_cache_pages() as it is
      interested only in pages from given range.  Remove unnecessary code
      resulting from this.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-12-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b9775ae
• mm: use pagevec_lookup_range_tag() in __filemap_fdatawait_range() · 312e9d2f
      Jan Kara authored
      
      
      Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is
      interested only in pages from given range.  Remove unnecessary code
      resulting from this.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      312e9d2f
• nilfs2: use pagevec_lookup_range_tag() · 40f9c513
      Jan Kara authored
      
      
      We want only pages from given range in nilfs_lookup_dirty_data_buffers().
      Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and
      remove unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-10-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40f9c513
• gfs2: use pagevec_lookup_range_tag() · d2bc5b3c
      Jan Kara authored
      
      
      We want only pages from given range in gfs2_write_cache_jdata().  Use
      pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
      unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2bc5b3c
• f2fs: use find_get_pages_tag() for looking up single page · 8faab642
      Jan Kara authored
      
      
      __get_first_dirty_index() wants to lookup only the first dirty page
      after given index.  There's no point in using pagevec_lookup_tag() for
      that.  Just use find_get_pages_tag() directly.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8faab642
• f2fs: simplify page iteration loops · 028a63a6
      Jan Kara authored
      
      
In several places we want to iterate over all tagged pages in a mapping.
However, the code was apparently copied from places that iterate only
over a limited range, and thus it checks for index <= end and optimizes
the case where we are coming close to the range end, which is all
pointless when end == ULONG_MAX.  So just remove this dead code.
      
      [akpm@linux-foundation.org: fix warnings]
      Link: http://lkml.kernel.org/r/20171009151359.31984-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      028a63a6
• f2fs: use pagevec_lookup_range_tag() · 69c4f35d
      Jan Kara authored
      
      
      We want only pages from given range in f2fs_write_cache_pages().  Use
      pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
      unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69c4f35d
• ext4: use pagevec_lookup_range_tag() · dc7f3e86
      Jan Kara authored
      
      
      We want only pages from given range in ext4_writepages().  Use
      pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
      unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc7f3e86
• ceph: use pagevec_lookup_range_tag() · 0ed75fc8
      Jan Kara authored
      
      
      We want only pages from given range in ceph_writepages_start().  Use
      pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
      unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ed75fc8
• btrfs: use pagevec_lookup_range_tag() · 4006f437
      Jan Kara authored
      
      
      We want only pages from given range in btree_write_cache_pages() and
      extent_write_cache_pages().  Use pagevec_lookup_range_tag() instead of
      pagevec_lookup_tag() and remove unnecessary code.
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4006f437
• mm: implement find_get_pages_range_tag() · 72b045ae
      Jan Kara authored
      
      
      Patch series "Ranged pagevec tagged lookup", v3.
      
      In this series I provide a ranged variant of pagevec_lookup_tag() and
      use it in places where it makes sense.  This series removes some common
      code and it also has a potential for speeding up some operations
      similarly as for pagevec_lookup_range() (but for now I can think of only
      artificial cases where this happens).
      
      This patch (of 16):
      
      Implement a variant of find_get_pages_tag() that stops iterating at
      given index.  Lots of users of this function (through pagevec_lookup())
      actually want a range lookup and all of them are currently open-coding
      this.
      
      Also create corresponding pagevec_lookup_range_tag() function.
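
A typical converted call site then looks roughly like this schematic loop (not
any particular filesystem's code; write_one_page_somehow() is a hypothetical
placeholder, and the pagevec_lookup_range_tag() form shown is the one the
series converges on):

        struct pagevec pvec;
        pgoff_t index = start;

        pagevec_init(&pvec, 0);
        while (index <= end) {
                unsigned i, nr;

                nr = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
                                PAGECACHE_TAG_DIRTY);
                if (!nr)
                        break;          /* no more dirty pages in the range */
                for (i = 0; i < nr; i++)
                        write_one_page_somehow(pvec.pages[i]);
                pagevec_release(&pvec);
        }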
      
      Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Steve French <sfrench@samba.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b045ae
• mm/page_owner.c: reduce page_owner structure size · 6b4c54e3
      Ayush Mittal authored
      
      
Maximum page order can be at most 10, which can be accommodated in a
short data type (2 bytes).  last_migrate_reason is defined as an enum
type whose values can also be accommodated in a short data type (2 bytes).

Total structure size is currently 16 bytes, but after this change it
goes down to 12 bytes.
      
      Vlastimil said:
       "Looks like it works, so why not.
        Before:
        [    0.001000] allocated 50331648 bytes of page_ext
        After:
        [    0.001000] allocated 41943040 bytes of page_ext"
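
The shrunken layout is roughly the following (a sketch; field names follow
mm/page_owner.c):

        struct page_owner {
                unsigned short order;           /* max order fits in 2 bytes */
                short last_migrate_reason;      /* enum values fit as well */
                gfp_t gfp_mask;
                depot_stack_handle_t handle;
        };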
      
      Link: http://lkml.kernel.org/r/1507623917-37991-1-git-send-email-ayush.m@samsung.com
Signed-off-by: Ayush Mittal <ayush.m@samsung.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b4c54e3
• mm/cma.c: change pr_info to pr_err for cma_alloc fail log · 5984af10
      Pintu Agarwal authored
      
      
It was observed that the cma_alloc failure log used pr_info instead of
pr_err.  This will lead to problems if the printk log level is set below
7: in that case the cma_alloc failure log will not be captured in the
log and it will be difficult to debug.
      
      Simply replace the pr_info with pr_err to capture failure log.
      
      Link: http://lkml.kernel.org/r/1507650633-4430-1-git-send-email-pintu.ping@gmail.com
Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jaewon Kim <jaewon31.kim@samsung.com>
      Cc: Doug Berger <opendmb@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5984af10
• mm, arch: remove empty_bad_page* · 8745808f
      Michal Hocko authored
      
      
empty_bad_page() and empty_bad_pte_table() seem to be relics from the
old days which have not been used by any code for a long time.  I have
tried to find out when exactly, but this is not really all that
straightforward due to many code movements - the traces disappear around
the 2.4 era.

Anyway, no code references either empty_bad_page or empty_bad_pte_table.
We only allocate the storage, which is not used by anybody, so remove
them.
      
      Link: http://lkml.kernel.org/r/20171004150045.30755-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Ralf Baechle <ralf@linus-mips.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8745808f
• mm/swap_slots.c: fix race conditions in swap_slots cache init · a2e16731
      Tim Chen authored
      
      
      Memory allocations can happen before the swap_slots cache initialization
      is completed during cpu bring up.  If we are low on memory, we could
      call get_swap_page() and access swap_slots_cache before it is fully
      initialized.
      
Add a check in get_swap_page() for an initialized swap_slots_cache to
prevent this condition.  A similar check already exists in
free_swap_slot().  Also annotate the checks to indicate the likely
condition.
      
We also added a memory barrier to make sure that the lock initialization
is done before the assignment of the cache->slots and cache->slots_ret
pointers.  This ensures the assumption that it is safe to acquire the
slots cache locks and use the slots cache when the corresponding
cache->slots or cache->slots_ret pointers are non-NULL.
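
The publication pattern is essentially the classic initialize-then-publish
idiom, as a simplified sketch (local names are illustrative, not the exact
diff):

        /* writer: finish all initialization before exposing the pointers */
        mutex_init(&cache->alloc_lock);
        spin_lock_init(&cache->free_lock);
        cache->lock_initialized = true;
        smp_wmb();                      /* order init before publication */
        cache->slots = slots;
        cache->slots_ret = slots_ret;

        /* reader (get_swap_page): only trust the cache once published */
        if (likely(check_cache_active() && cache->slots))
                /* ... safe to take cache->alloc_lock and use the cache ... */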
      
      [akpm@linux-foundation.org: tidy up comment]
      [akpm@linux-foundation.org: fix spello in comment]
      Link: http://lkml.kernel.org/r/65a9d0f133f63e66bba37b53b2fd0464b7cae771.1500677066.git.tim.c.chen@linux.intel.com
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Acked-by: Ying Huang <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2e16731
• mm: remove unused pgdat->inactive_ratio · 3a50d14d
      Andrey Ryabinin authored
Since commit 59dc76b0 ("mm: vmscan: reduce size of inactive file list"),
'pgdat->inactive_ratio' is not used, except for printing
"node_inactive_ratio: 0" in /proc/zoneinfo output.
      
      Remove it.
      
      Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a50d14d
• mm/mmu_notifier: avoid call to invalidate_range() in range_end() · 4645b9fe
      Jérôme Glisse authored
      
      
This is an optimization patch that only affects mmu_notifier users which
rely on the invalidate_range() callback.  This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end().
      
      Existing pattern (before this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_end()
              mmu_notifier_invalidate_range()
      
      New pattern (after this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_only_end()
      
      We call the invalidate_range callback after clearing the page table
      under the page table lock and we skip the call to invalidate_range
      inside the __mmu_notifier_invalidate_range_end() function.
      
      Idea from Andrea Arcangeli
      
      Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4645b9fe
• mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse authored
      
      
      This patch only affects users of mmu_notifier->invalidate_range callback
      which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
      and it is an optimization for those users.  Everyone else is unaffected
      by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (notify version of *_clear_flush helpers do call the
      mmu_notifier_invalidate_range).  But that notification is not necessary
      in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in depth analysis of why this is fine to do this:
      
For a secondary TLB (non-CPU TLB), like an IOMMU TLB or a device TLB
(when the device uses something like ATS/PASID to get the IOMMU to walk
the CPU page table to access a process virtual address space), there are
only 2 cases when you need to notify those secondary TLBs while holding
the page table lock when clearing a pte/pmd:
      
        A) page backing address is free before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
Case A is obvious: you do not want to take the risk of the device writing
to a page that might now be used by something completely different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before setting
      the new pte/pmd value then you can break memory model like C11 or C++11 for
      the device.
      
Consider the following scenario (the device uses a feature similar to
ATS/PASID):
      
Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we
assume they are write protected for COW (other cases of B apply too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
So here, because at time N+2 the page table entry clear was not paired
with a notification to invalidate the secondary TLB, the device sees the
new value for addrB before seeing the new value for addrA.  This breaks
total memory ordering for the device.
      
When changing a pte to write protect or to point to a new write
protected page with the same content (KSM), it is OK to delay the
invalidate_range callback to mmu_notifier_invalidate_range_end() outside
the page table lock.  This is true even if the thread doing the page
table update is preempted right after releasing the page table lock,
before calling mmu_notifier_invalidate_range_end().
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f10851e
• zsmalloc: calling zs_map_object() from irq is a bug · 1aedcafb
      Sergey Senozhatsky authored
      
      
Use BUG_ON(in_interrupt()) in zs_map_object().  This is not a new
BUG_ON(), it's always been there, but was recently changed to
VM_BUG_ON().  There are several problems there.  First, we use per-CPU
mappings both in zsmalloc and in zram, and an interrupt may easily
corrupt those buffers.  Second, and more importantly, we believe it's
possible to start leaking sensitive information.  Consider the following
case:
      
      -> process P
      	swap out
      	 zram
      	  per-cpu mapping CPU1
      	   compress page A
      -> IRQ
      
      	swap out
      	 zram
      	  per-cpu mapping CPU1
      	   compress page B
      	    write page from per-cpu mapping CPU1 to zsmalloc pool
      	iret
      
      -> process P
      	    write page from per-cpu mapping CPU1 to zsmalloc pool  [*]
      	return
      
      * so we store overwritten data that actually belongs to another
        page (task) and potentially contains sensitive data. And when
        process P will page fault it's going to read (swap in) that
        other task's data.
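
In code terms, the check at the top of the mapping path is a hard BUG_ON
again (a sketch; the rest of the function body is elided):

        void *zs_map_object(struct zs_pool *pool, unsigned long handle,
                            enum zs_mapmode mm)
        {
                /*
                 * zs_map_object() uses per-cpu buffers; running it from IRQ
                 * context could corrupt them or leak another task's data.
                 */
                BUG_ON(in_interrupt());
                ...
        }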
      
      Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1aedcafb