  1. Aug 18, 2018
    • mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range() · 4fbce633
      Oscar Salvador authored
      
      
      link_mem_sections() and walk_memory_range() share most of their code, so
      we can convert link_mem_sections() into a thin wrapper that calls
      walk_memory_range() with register_mem_sect_under_node() as its callback.
      
      This patch converts register_mem_sect_under_node() to match
      walk_memory_range()'s callback signature, getting rid of the check_nid
      argument and instead checking whether the system is still booting, since
      we only have to check the nid while the system is in that state.
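      
      A minimal sketch of the resulting shape (not the exact upstream diff; the
      sysfs link call and the booting check are simplified):
      
      	static int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
      	{
      		int nid = *(int *)arg;
      
      		/* Only worth validating the nid while the system is still booting. */
      		if (system_state == SYSTEM_BOOTING) {
      			/* per-section nid sanity checks would go here */
      		}
      
      		return sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
      						&mem_blk->dev.kobj,
      						kobject_name(&mem_blk->dev.kobj));
      	}
      
      	int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
      	{
      		return walk_memory_range(start_pfn, start_pfn + nr_pages, (void *)&nid,
      					 register_mem_sect_under_node);
      	}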
      
      Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.net
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4fbce633
    • mm/memory_hotplug.c: call register_mem_sect_under_node() · d5b6f6a3
      Oscar Salvador authored
      
      
      When hotplugging memory, it is possible that two calls are being made to
      register_mem_sect_under_node().
      
      One comes from __add_section()->hotplug_memory_register() and the other
      from add_memory_resource()->link_mem_sections() if we had to register a
      new node.
      
      In case we had to register a new node, hotplug_memory_register() will
      only handle/allocate the memory_blocks, since
      register_mem_sect_under_node() will return right away because the node
      is not online yet.
      
      I think it is better if we leave hotplug_memory_register() to
      handle/allocate only memory_blocks and have link_mem_sections() call
      register_mem_sect_under_node().
      
      So this patch removes the call to register_mem_sect_under_node() from
      hotplug_memory_register(), and moves the call to link_mem_sections() out
      of the condition, so it will always be called.  In this way we only have
      one place where the memory sections are registered.
      
      Link: http://lkml.kernel.org/r/20180622111839.10071-3-osalvador@techadventures.net
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5b6f6a3
    • mm/memory_hotplug.c: make add_memory_resource use __try_online_node · b9ff0360
      Oscar Salvador authored
      
      
      This is a small cleanup for the memhotplug code.  A lot more could be
      done, but it is better to start somewhere.  I tried to unify/remove
      duplicated code.
      
      The following is what this patchset does:
      
      1) add_memory_resource() has code to allocate a node in case it was
         offline.  Since try_online_node() has some code for that as well, I
         just made add_memory_resource() use it so we can remove the duplicated
         code.  This is better explained in patch 1/4.
      
      2) register_mem_sect_under_node() will be called only from
         link_mem_sections()
      
      3) Make register_mem_sect_under_node() a callback of
         walk_memory_range()
      
      4) Drop unnecessary checks from register_mem_sect_under_node()
      
      I have done some tests and I could not see anything broken because of
      this patchset.
      
      add_memory_resource() contains code to allocate a new node in case it is
      necessary.  Since try_online_node() also has some code for this purpose,
      let us make use of that and remove duplicate code.
      
      This introduces __try_online_node(), which is called by
      add_memory_resource() and try_online_node().  __try_online_node() has
      two new parameters: the start address of the node, and whether the node
      should be onlined and registered right away.  The latter is always
      wanted when calling from do_cpu_up(), but not when calling from the
      memhotplug code.  Nothing changes from the point of view of the users of
      try_online_node(), since try_online_node() passes start_addr=0 and
      online_node=true to __try_online_node().
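      
      A sketch of the resulting split (signatures paraphrased from the
      description above, not copied from the final diff):
      
      	static int __try_online_node(int nid, u64 start, bool set_node_online);
      
      	int try_online_node(int nid)
      	{
      		/* Unchanged behaviour for do_cpu_up(): online and register right away. */
      		return __try_online_node(nid, 0, true);
      	}
      
      	int add_memory_resource(int nid, struct resource *res, bool online)
      	{
      		int ret = __try_online_node(nid, res->start, false);
      
      		/* ... arch_add_memory(), link_mem_sections(), node onlining, etc. ... */
      		return ret;
      	}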
      
      Link: http://lkml.kernel.org/r/20180622111839.10071-2-osalvador@techadventures.net
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9ff0360
    • mm/list_lru.c: fold __list_lru_count_one() into its caller · 930eaac5
      Andrew Morton authored
      
      
      __list_lru_count_one() has a single callsite.
      
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      930eaac5
    • mm: workingset: make shadow_lru_isolate() use locking suffix · 6ca342d0
      Sebastian Andrzej Siewior authored
      
      
      shadow_lru_isolate() disables interrupts and acquires a lock.  It could
      use spin_lock_irq() instead.  It also uses local_irq_enable() while it
      could use spin_unlock_irq()/xa_unlock_irq().
      
      Use the proper _irq suffix for lock/unlock so that interrupts are
      disabled/enabled as part of acquiring/releasing the lock.
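      
      Illustrative before/after for the change (a sketch assuming the shadow
      entries sit under the mapping's i_pages lock, as the xa_unlock_irq()
      mention above suggests):
      
      	/* Before: open-coded interrupt handling around the lock. */
      	local_irq_disable();
      	xa_lock(&mapping->i_pages);
      	/* ... isolate the shadow entries ... */
      	xa_unlock(&mapping->i_pages);
      	local_irq_enable();
      
      	/* After: the _irq suffix folds interrupt disabling into the lock. */
      	xa_lock_irq(&mapping->i_pages);
      	/* ... isolate the shadow entries ... */
      	xa_unlock_irq(&mapping->i_pages);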
      
      Link: http://lkml.kernel.org/r/20180622151221.28167-3-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ca342d0
    • mm: workingset: remove local_irq_disable() from count_shadow_nodes() · ae1e16da
      Sebastian Andrzej Siewior authored
      Patch series "mm: use irq locking suffix instead local_irq_disable()".
      
      A small series which avoids using local_irq_disable()/local_irq_enable()
      and instead does spin_lock_irq()/spin_unlock_irq(), so the interrupt
      handling stays within the locking primitive it belongs to.  Patch #1 is a
      cleanup where local_irq_*() calls remained after the lock was removed.
      
      This patch (of 2):
      
      In commit 0c7c1bed ("mm: make counting of list_lru_one::nr_items
      lockless") the
      
      	spin_lock(&nlru->lock);
      
      statement was replaced with
      
      	rcu_read_lock();
      
      in __list_lru_count_one().  The comment in count_shadow_nodes() says
      that the local_irq_disable() is required because the lock must be
      acquired with disabled interrupts and spin_lock() does not do so.
      Since the lock is replaced with rcu_read_lock(), the local_irq_disable()
      is no longer needed.  The code path is
      
        list_lru_shrink_count()
          -> list_lru_count_one()
            -> __list_lru_count_one()
              -> rcu_read_lock()
              -> list_lru_from_memcg_idx()
              -> rcu_read_unlock()
      
      Remove the local_irq_disable() statement.
      
      Link: http://lkml.kernel.org/r/20180622151221.28167-2-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae1e16da
    • mm: drop VM_BUG_ON from __get_free_pages · 9ea9a680
      Michal Hocko authored
      
      
      There is no real reason to blow up just because the caller doesn't know
      that __get_free_pages cannot return highmem pages.  Simply fix that up
      silently.  Even if we have some confused users, such a fixup will not be
      harmful.
      
      [akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
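      
      The resulting behaviour, roughly (a sketch; the change is equivalent to
      masking the flag off instead of asserting on it):
      
      	unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
      	{
      		struct page *page;
      
      		/* Silently drop __GFP_HIGHMEM: we return a kernel virtual address,
      		 * so a highmem page could never be handed back anyway. */
      		page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
      		if (!page)
      			return 0;
      		return (unsigned long) page_address(page);
      	}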
      Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jiankang Chen <chenjiankang1@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ea9a680
    • mm, hugetlbfs: pass fault address to cow handler · 974e6d66
      Huang Ying authored
      
      
      This is to take better advantage of the general huge page copying
      optimization, where the target subpage is copied last so that its cache
      lines are not evicted while the other subpages are being copied.  This
      works better if the address of the target subpage is available when
      copying the huge page, so the hugetlbfs page fault handlers are changed
      to pass that information to hugetlb_cow().  This will benefit workloads
      which don't access the beginning of the hugetlbfs huge page after the
      page fault, under heavy cache contention.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-5-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      974e6d66
    • mm, hugetlbfs: rename address to haddr in hugetlb_cow() · 5b7a1d40
      Huang Ying authored
      
      
      To take better advantage of the general huge page copying optimization,
      the target subpage address will be passed to hugetlb_cow(), then to
      copy_user_huge_page().  So we will use both the target subpage address
      and the huge-page-size-aligned address in hugetlb_cow().  To distinguish
      between them, "haddr" is used for the huge-page-size-aligned address, to
      be consistent with the Transparent Huge Page naming convention.
      
      Now only the huge-page-size-aligned address is used in hugetlb_cow(), so
      "address" is renamed to "haddr" in hugetlb_cow() in this patch.  The
      next patch will use the target subpage address in hugetlb_cow() too.
      
      The patch is just a code cleanup without any functional changes.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b7a1d40
    • mm, huge page: copy target sub-page last when copy huge page · c9f4cd71
      Huang Ying authored
      Huge pages help to reduce the TLB miss rate, but they have a higher
      cache footprint, and sometimes this may cause problems.  For example,
      when copying a huge page on an x86_64 platform, the cache footprint is
      4M.  But a Xeon E5 v3 2699 CPU has 18 cores, 36 threads, and only 45M of
      LLC (last level cache).  That is, on average, there is 2.5M of LLC for
      each core and 1.25M for each thread.
      
      If cache contention is heavy when copying the huge page, and we copy the
      huge page from beginning to end, it is possible that the beginning of
      the huge page has been evicted from the cache by the time we finish
      copying its end.  And it is possible for the application to access the
      beginning of the huge page right after the copy.
      
      In commit c79b57e4 ("mm: hugetlb: clear target sub-page last when
      clearing huge page"), to keep the cache lines of the target subpage hot,
      the order in which the subpages of a huge page are cleared in
      clear_huge_page() was changed so that the subpage furthest from the
      target subpage is cleared first and the target subpage is cleared last.
      The same reordering helps huge page copying too, and is implemented in
      this patch.  Because we have put the ordering algorithm into a separate
      function, the implementation is quite simple.
      
      The patch is a generic optimization which should benefit quite a few
      workloads, rather than a specific use case.  To demonstrate its
      performance benefit, we tested it with vm-scalability running on
      transparent huge pages.
      
      With this patch, throughput increases ~16.6% in the vm-scalability
      anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case sets
      /sys/kernel/mm/transparent_hugepage/enabled to "always", mmap()s a big
      anonymous memory area and populates it, then forks 36 child processes,
      each of which writes to the anonymous memory area from beginning to end,
      causing copy-on-write.  For each child process, the other child
      processes can be seen as other workloads which generate heavy cache
      pressure.  At the same time, the IPC (instructions per cycle) increased
      from 0.63 to 0.78, and the time spent in user space was reduced ~7.2%.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9f4cd71
    • mm, clear_huge_page: move order algorithm into a separate function · c6ddfb6c
      Huang Ying authored
      Patch series "mm, huge page: Copy target sub-page last when copy huge
      page", v2.
      
      Huge pages help to reduce the TLB miss rate, but they have a higher
      cache footprint, and sometimes this may cause problems.  For example,
      when copying a huge page on an x86_64 platform, the cache footprint is
      4M.  But a Xeon E5 v3 2699 CPU has 18 cores, 36 threads, and only 45M of
      LLC (last level cache).  That is, on average, there is 2.5M of LLC for
      each core and 1.25M for each thread.
      
      If cache contention is heavy when copying the huge page, and we copy the
      huge page from beginning to end, it is possible that the beginning of
      the huge page has been evicted from the cache by the time we finish
      copying its end.  And it is possible for the application to access the
      beginning of the huge page right after the copy.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage firstly, and the target subpage last.  The similar order
      changing helps huge page copying too.  That is implemented in this
      patchset.
      
      The patchset is a generic optimization which should benefit quite a few
      workloads, rather than a specific use case.  To demonstrate its
      performance benefit, we have tested it with vm-scalability running on
      transparent huge pages.
      
      With this patchset, throughput increases ~16.6% in the vm-scalability
      anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case sets
      /sys/kernel/mm/transparent_hugepage/enabled to "always", mmap()s a big
      anonymous memory area and populates it, then forks 36 child processes,
      each of which writes to the anonymous memory area from beginning to end,
      causing copy-on-write.  For each child process, the other child
      processes can be seen as other workloads which generate heavy cache
      pressure.  At the same time, the IPC (instructions per cycle) increased
      from 0.63 to 0.78, and the time spent in user space was reduced ~7.2%.
      
      This patch (of 4):
      
      In commit c79b57e4 ("mm: hugetlb: clear target sub-page last when
      clearing huge page"), to keep the cache lines of the target subpage hot,
      the order in which the subpages of a huge page are cleared in
      clear_huge_page() was changed so that the subpage furthest from the
      target subpage is cleared first and the target subpage is cleared last.
      This optimization can be applied to huge page copying with the same
      ordering algorithm.  To avoid code duplication and reduce maintenance
      overhead, this patch moves the ordering algorithm out of
      clear_huge_page() into a separate function, process_huge_page(), so that
      it can be reused for huge page copying.
      
      This changes the direct calls to clear_user_highpage() into indirect
      calls.  But with proper inlining support in the compilers, the indirect
      calls are optimized back into direct calls.  Our tests show no
      performance change with this patch.
      
      This patch is a code cleanup without functional change.
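      
      A simplified sketch of the refactoring (the loop structure differs from
      the upstream implementation, but the visiting order is the same idea:
      furthest subpages first, target subpage last):
      
      	static void process_huge_page(unsigned long addr_hint,
      				      unsigned int pages_per_huge_page,
      				      void (*process_subpage)(unsigned long addr, int idx, void *arg),
      				      void *arg)
      	{
      		unsigned long addr = addr_hint &
      			~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
      		int target = (addr_hint - addr) >> PAGE_SHIFT;
      		int dist, idx;
      
      		for (dist = pages_per_huge_page - 1; dist > 0; dist--) {
      			idx = target - dist;	/* 'dist' subpages below the target */
      			if (idx >= 0)
      				process_subpage(addr + idx * PAGE_SIZE, idx, arg);
      			idx = target + dist;	/* 'dist' subpages above the target */
      			if (idx < (int)pages_per_huge_page)
      				process_subpage(addr + idx * PAGE_SIZE, idx, arg);
      		}
      		/* The target subpage is processed last so its cache lines stay hot. */
      		process_subpage(addr + target * PAGE_SIZE, target, arg);
      	}
      
      	/* clear_huge_page() and copy_user_huge_page() then only supply a callback: */
      	static void clear_subpage(unsigned long addr, int idx, void *arg)
      	{
      		struct page *page = arg;
      
      		clear_user_highpage(page + idx, addr);
      	}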
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6ddfb6c
    • ext4: readpages() should submit IO as read-ahead · ac22b46a
      Jens Axboe authored
      
      
      a_ops->readpages() is only ever used for read-ahead.  Ensure that we
      pass this information down to the block layer.
      
      Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac22b46a
    • btrfs: readpages() should submit IO as read-ahead · 5e9d3982
      Jens Axboe authored
      
      
      a_ops->readpages() is only ever used for read-ahead.  Ensure that we
      pass this information down to the block layer.
      
      Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e9d3982
    • mpage: mpage_readpages() should submit IO as read-ahead · 74c8164e
      Jens Axboe authored
      
      
      a_ops->readpages() is only ever used for read-ahead, yet we don't flag
      the IO being submitted as such.  Fix that up.  Any file system that uses
      mpage_readpages() as its ->readpages() implementation will now get this
      right.
      
      Since we're passing in whether the IO is read-ahead or not, we don't
      need to pass in the 'gfp' separately, as it is dependent on the IO being
      read-ahead.  Kill off that member.
      
      Add some documentation notes on ->readpages() being purely for
      read-ahead.
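      
      The core of the change is just tagging the bio when the request came
      from ->readpages() (a sketch; the is_readahead flag comes from the args
      structure introduced in the prep patch):
      
      	unsigned int op_flags = args->is_readahead ? REQ_RAHEAD : 0;
      
      	bio_set_op_attrs(bio, REQ_OP_READ, op_flags);
      	submit_bio(bio);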
      
      Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74c8164e
    • mpage: add argument structure for do_mpage_readpage() · 357c1206
      Jens Axboe authored
      
      
      Patch series "Submit ->readpages() IO as read-ahead", v4.
      
      The only caller of ->readpages() is from read-ahead, yet we don't submit
      IO flagged with REQ_RAHEAD.  This means we don't see it in blktrace, for
      instance, which is a shame.  Additionally, it's preventing further
      functional changes in the block layer for dealing with read-ahead more
      intelligently.  We already make assumptions about ->readpages() just
      being for read-ahead in the mpage implementation, using
      readahead_gfp_mask(mapping) as our GFP mask of choice.
      
      This small series fixes up mpage_readpages() to submit with REQ_RAHEAD,
      which takes care of file systems using mpage_readpages().  The first
      patch is a prep patch, that makes do_mpage_readpage() take an argument
      structure.
      
      This patch (of 4):
      
      We're currently passing 8 arguments to this function; clean it up a bit
      by packing the arguments into an args structure that we pass to it.
      
      No intentional functional changes in this patch.
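      
      A sketch of the argument structure (field names are assumptions based on
      the parameters described above):
      
      	struct mpage_readpage_args {
      		struct bio *bio;
      		struct page *page;
      		unsigned int nr_pages;
      		sector_t last_block_in_bio;
      		struct buffer_head map_bh;
      		unsigned long first_logical_block;
      		get_block_t *get_block;
      		gfp_t gfp;	/* replaced by an is_readahead flag later in the series */
      	};
      
      	static struct bio *do_mpage_readpage(struct mpage_readpage_args *args);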
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dk
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      357c1206
    • mm: thp: inc counter for collapsed shmem THP · 87aa7529
      Yang Shi authored
      
      
      /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used
      to record the number of collapsed THPs, but it only gets incremented on
      the anonymous THP collapse path.  Do the same for shmem THP collapse.
      
      Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87aa7529
    • mm: thp: register mm for khugepaged when merging vma for shmem · c2231020
      Yang Shi authored
      
      
      When merging an anonymous page vma, if the size of the vma can fit at
      least one hugepage, the mm will be registered with khugepaged for
      collapsing THPs in the future.
      
      But this currently skips shmem vmas.  Do the same for shmem (though not
      for file-private mappings) when merging a vma, in order to increase the
      odds of collapsing a hugepage via khugepaged.
      
      hugepage_vma_check() sounds like a good fit for doing the check.  Also
      move its definition before khugepaged_enter_vma_merge() to avoid a build
      error.
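      
      Roughly, the merge path then becomes (a sketch, not the exact diff):
      
      	int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
      				       unsigned long vm_flags)
      	{
      		unsigned long hstart, hend;
      
      		/* Same predicate khugepaged itself uses; now also true for shmem. */
      		if (!hugepage_vma_check(vma, vm_flags))
      			return 0;
      
      		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
      		hend = vma->vm_end & HPAGE_PMD_MASK;
      		if (hstart < hend)
      			return khugepaged_enter(vma, vm_flags);
      		return 0;
      	}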
      
      Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2231020
    • mm/mempool.c: remove unused argument in kasan_unpoison_element() and remove_element() · 8cded866
      Jia-Ju Bai authored
      
      
      The argument "gfp_t flags" is not used in kasan_unpoison_element() and
      remove_element(), so remove it.
      
      Link: http://lkml.kernel.org/r/20180621070332.16633-1-baijiaju1990@gmail.com
      Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cded866
    • mm/vmscan.c: condense scan_control · bb451fdf
      Greg Thelen authored
      Use smaller scan_control fields for order, priority, and reclaim_idx.
      Convert fields from int => s8.  All easily fit within a byte:
      
       - allocation order range: 0..MAX_ORDER(64?)
       - priority range:         0..12(DEF_PRIORITY)
       - reclaim_idx range:      0..6(__MAX_NR_ZONES)
      
      Since commit 6538b8ea ("x86_64: expand kernel stack to 16K"), x86_64
      stack overflows are not an issue.  But it's inefficient to use ints.
      
      Use s8 (signed byte) rather than u8 to allow for loops like:
      	do {
      		...
      	} while (--sc.priority >= 0);
      
      Add BUILD_BUG_ON to verify that s8 is capable of storing max values.
      
      This reduces sizeof(struct scan_control):
       - 96 => 80 bytes (x86_64)
       - 68 => 56 bytes (i386)
      
      scan_control structure field order is changed to utilize padding.  After
      this patch there is 1 bit of scan_control padding.
      
      akpm: makes my vmscan.o's .text 572 bytes smaller as well.
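      
      Sketch of the condensed fields plus the guard (only the relevant members
      are shown; the helper wrapping BUILD_BUG_ON is illustrative):
      
      	struct scan_control {
      		/* ... larger fields first so the s8 members pack together ... */
      		s8 order;		/* allocation order, 0..MAX_ORDER */
      		s8 priority;		/* scan priority, 0..DEF_PRIORITY */
      		s8 reclaim_idx;		/* highest zone to reclaim from */
      	};
      
      	static inline void scan_control_sanity_check(void)
      	{
      		/* Make sure the ranges can never outgrow a signed byte. */
      		BUILD_BUG_ON(MAX_ORDER > S8_MAX);
      		BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
      		BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
      	}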
      
      Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb451fdf
    • mm/page_ext.c: constify lookup_page_ext() argument · 10ed6341
      Kirill A. Shutemov authored
      
      
      lookup_page_ext() finds 'struct page_ext' for a given page.  It requires
      only read access to the given struct page.
      
      The current implementation takes 'struct page *' as an argument.  That
      makes the compiler complain when a 'const struct page *' is passed.
      
      Change the argument to 'const struct page *'.
      
      Link: http://lkml.kernel.org/r/20180531135457.20167-3-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10ed6341
    • include/linux/page_ext.h: drop definition of unused PAGE_EXT_DEBUG_POISON · b3a23696
      Kirill A. Shutemov authored
      After commit bd33ef36 ("mm: enable page poisoning early at boot"),
      PAGE_EXT_DEBUG_POISON is no longer used.  Remove it.
      
      Link: http://lkml.kernel.org/r/20180531135457.20167-2-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3a23696
    • shmem: use monotonic time for i_generation · 46c9a946
      Arnd Bergmann authored
      
      
      get_seconds() is deprecated because it will lead to a 32-bit overflow in
      2038 or 2106.  We don't need the i_generation to be strictly monotonic
      anyway, and other file systems like ext4 and xfs just use prandom_u32(),
      so let's use the same one here.
      
      If this is considered too slow, we could also use ktime_get_seconds() or
      ktime_get_real_seconds() to keep the previous behavior.  Both of these
      return a time64_t and are not deprecated, but only return a unique value
      once per second, and are predictable.
      
      Link: http://lkml.kernel.org/r/20180620082556.581543-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46c9a946
    • mm, page_alloc: actually ignore mempolicies for high priority allocations · d6a24df0
      Vlastimil Babka authored
      __alloc_pages_slowpath() has for a long time contained code to ignore
      node restrictions from memory policies for high priority allocations.
      The current code that resets the zonelist iterator however does
      effectively nothing after commit 7810e678 ("mm, page_alloc: do not
      break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
      Even before that commit, mempolicy restrictions were still not ignored,
      as they are passed in ac->nodemask which is untouched by the code.
      
      We can either remove the code, or make it work as intended.  Since
      ac->nodemask can be set from task's mempolicy via alloc_pages_current()
      and thus also alloc_pages(), it may indeed affect kernel allocations,
      and it makes sense to ignore it to allow progress for high priority
      allocations.
      
      Thus, this patch resets ac->nodemask to NULL in such cases.  This
      assumes all callers can handle it (i.e.  there are no guarantees as in
      the case of __GFP_THISNODE) which seems to be the case.  The same
      assumption is already present in check_retry_cpuset() for some time.
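      
      In __alloc_pages_slowpath() the effective change looks roughly like this
      (a sketch; can_ignore_mempolicy stands in for the real high-priority /
      reserve-access condition):
      
      	if (can_ignore_mempolicy) {
      		ac->nodemask = NULL;
      		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
      						ac->high_zoneidx, ac->nodemask);
      	}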
      
      The expected effect is that high priority kernel allocations in the
      context of userspace tasks (e.g.  OOM victims) restricted by mempolicies
      will have higher chance to succeed if they are restricted to nodes with
      depleted memory, while there are other nodes with free memory left.
      
      It's not a new intention, but for the first time the code will match the
      intention, AFAICS.  It was intended by commit 183f6371 ("mm: ignore
      mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
      really worked, as mempolicy restriction was already encoded in nodemask,
      not zonelist, at that time.
      
      So originally that was for ALLOC_NO_WATERMARK only.  Then it was
      adjusted by commit e46e7b77 ("mm, page_alloc: recalculate the preferred
      zoneref if the context can ignore memory policies") and commit cd04ae1e
      ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
      current state.  So even GFP_ATOMIC would now ignore mempolicies after
      the initial attempts fail - if the code worked as people thought it
      does.
      
      Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6a24df0
    • tools/vm/page-types.c: add support for idle page tracking · 59ae96ff
      Christian Hansen authored
      
      
      Add a flag which causes page-types to use the kernel's idle page
      tracking to mark pages idle.  As the tool already prints the idle flag
      if set, subsequent runs will show which pages have been accessed since
      the last run.
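      
      The mechanism behind the new flag, in reduced form (a sketch; error
      handling and the page-types plumbing are omitted): each bit written to
      /sys/kernel/mm/page_idle/bitmap marks the corresponding PFN idle, and
      the file is accessed in 64-bit chunks.
      
      	#include <fcntl.h>
      	#include <stdint.h>
      	#include <unistd.h>
      
      	static int mark_page_idle(int bitmap_fd, uint64_t pfn)
      	{
      		uint64_t chunk = 1ULL << (pfn % 64);
      		off_t offset = (pfn / 64) * sizeof(chunk);
      
      		return pwrite(bitmap_fd, &chunk, sizeof(chunk), offset) == sizeof(chunk) ? 0 : -1;
      	}
      
      	/* bitmap_fd would come from open("/sys/kernel/mm/page_idle/bitmap", O_WRONLY) */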
      
      [akpm@linux-foundation.org: simplify mark_page_idle()]
      [chansen3@cisco.com: reorganize mark_page_idle() logic, add docs]
        Link: http://lkml.kernel.org/r/20180706172237.21691-1-chansen3@cisco.com
      Link: http://lkml.kernel.org/r/20180612153223.13174-1-chansen3@cisco.com
      Signed-off-by: Christian Hansen <chansen3@cisco.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59ae96ff
    • tools/vm/page-types.c: include shared map counts · 7f1d23e6
      Christian Hansen authored
      
      
      Add a new flag that will read kpagecount for each PFN and print out the
      number of times the page is mapped along with the flags in the listing
      view.
      
      This information is useful in understanding and optimizing memory usage.
      Identifying pages which are not shared allows us to focus on adjusting
      the memory layout or access patterns for the sole owning process.
      Knowing the number of processes that share a page tells us how many
      other times we must make the same adjustments or how many processes to
      potentially disable.
      
      Truncated sample output:
      
        voffset map-cnt offset  len     flags
        561a3591e       1       15fe8   1       ___U_lA____Ma_b___________________________
        561a3591f       1       2b103   1       ___U_lA____Ma_b___________________________
        561a36ca4       1       2cc78   1       ___U_lA____Ma_b___________________________
        7f588bb4e       14      2273c   1       __RU_lA____M______________________________
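      
      Reading the count itself is straightforward (a sketch without error
      handling): /proc/kpagecount holds one 64-bit map count per PFN.
      
      	#include <fcntl.h>
      	#include <stdint.h>
      	#include <unistd.h>
      
      	static uint64_t read_kpagecount(int kpagecount_fd, uint64_t pfn)
      	{
      		uint64_t count = 0;
      
      		pread(kpagecount_fd, &count, sizeof(count), pfn * sizeof(count));
      		return count;
      	}
      
      	/* kpagecount_fd would come from open("/proc/kpagecount", O_RDONLY) */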
      
      [akpm@linux-foundation.org: coding-style fixes]
      [chansen3@cisco.com: add documentation, tweak whitespace]
        Link: http://lkml.kernel.org/r/20180705181204.5529-1-chansen3@cisco.com
      Link: http://lkml.kernel.org/r/20180612153205.12879-1-chansen3@cisco.com
      Signed-off-by: Christian Hansen <chansen3@cisco.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f1d23e6
    • thp: use mm_file_counter to determine update which rss counter · fadae295
      Yang Shi authored
      Since commit eca56ff9 ("mm, shmem: add internal shmem resident
      memory accounting"), MM_SHMEMPAGES has been used to separate the shmem
      accounting from regular files.  So all shmem pages should be accounted
      to MM_SHMEMPAGES instead of MM_FILEPAGES.
      
      Normal 4K shmem pages are already accounted to MM_SHMEMPAGES, so shmem
      THP pages should not be treated differently.  Account them to
      MM_SHMEMPAGES via mm_counter_file(), which picks MM_SHMEMPAGES because
      shmem pages are swap backed, keeping them consistent with normal 4K
      shmem pages.
      
      This will not change the rss counter of processes since shmem pages are
      still a part of it.
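      
      The accounting call then boils down to something like the following
      (sketch): mm_counter_file() returns MM_SHMEMPAGES for swap-backed
      (shmem) pages and MM_FILEPAGES otherwise.
      
      	add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);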
      
      The /proc/pid/status and /proc/pid/statm counters will however be more
      accurate wrt shmem usage, as originally intended.  And as commit
      eca56ff9 ("mm, shmem: add internal shmem resident memory accounting")
      mentioned, the oom killer could also report a more accurate "shmem-rss".
      
      Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fadae295
    • mm: skip invalid pages block at a time in zero_resv_unresv() · 720e14eb
      Pavel Tatashin authored
      
      
      The role of zero_resv_unavail() is to make sure that every struct page
      that is allocated but is not backed by memory accessible by the kernel
      is zeroed and not left in some uninitialized state.
      
      Since struct pages are allocated in blocks (2M pages in the x86 case),
      we can skip pageblock_nr_pages at a time when the first page in a block
      is found to be invalid.
      
      This optimization may help since now on x86 every hole in e820 maps is
      marked as reserved in memblock, and thus will go through this function.
      
      This function is called before sched_clock() is initialized, so I used
      my x86 early boot clock patches to measure the performance improvement.
      
      With 1T hole on i7-8700 currently we would take 0.606918s of boot time,
      but with this optimization 0.001103s.
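      
      The loop change is essentially (a sketch; variable names simplified):
      
      	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
      		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
      			/* Whole block is invalid: jump to its last pfn and continue. */
      			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) + pageblock_nr_pages - 1;
      			continue;
      		}
      		mm_zero_struct_page(pfn_to_page(pfn));
      		zeroed++;
      	}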
      
      Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      720e14eb
    • mm: convert return type of handle_mm_fault() caller to vm_fault_t · 50a7ca3c
      Souptick Joarder authored
      Use new return type vm_fault_t for fault handler.  For now, this is just
      documenting that the function returns a VM_FAULT value rather than an
      errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref: commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      In this patch, all the callers of handle_mm_fault() are changed to
      return the vm_fault_t type.
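      
      On the caller side (an arch fault handler, simplified) the change is
      just the type of the local that holds the result:
      
      	vm_fault_t fault;
      
      	fault = handle_mm_fault(vma, address, flags);
      	if (fault & VM_FAULT_ERROR) {
      		/* OOM / SIGBUS / SIGSEGV handling as before */
      	}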
      
      Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50a7ca3c
    • mm, slub: restore the original intention of prefetch_freepointer() · 0882ff91
      Vlastimil Babka authored
      In SLUB, prefetch_freepointer() is used when allocating an object from
      cache's freelist, to make sure the next object in the list is cache-hot,
      since it's probable it will be allocated soon.
      
      Commit 2482ddec ("mm: add SLUB free list pointer obfuscation") has
      unintentionally changed the prefetch so that it became a real fetch,
      with only the next->next pointer actually being prefetched.  When there
      is not a stream of allocations that would benefit from prefetching, the
      extra real fetch can add a useless cache miss to the allocation.
      Restore the previous behavior.
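      
      The restored helper is tiny (a sketch matching the description above):
      prefetch the location of the freelist pointer inside the object instead
      of dereferencing it first.
      
      	static void prefetch_freepointer(const struct kmem_cache *s, void *object)
      	{
      		prefetch(object + s->offset);
      	}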
      
      Link: http://lkml.kernel.org/r/20180809085245.22448-1-vbabka@suse.cz
      Fixes: 2482ddec ("mm: add SLUB free list pointer obfuscation")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0882ff91
    • fs/seq_file.c: simplify seq_file iteration code and interface · 1f4aace6
      NeilBrown authored
      
      
      The documentation for seq_file suggests that it is necessary to be able
      to move the iterator to a given offset; however, that is not the case.
      If the iterator is stored in the private data and is stable from one
      read() syscall to the next, it is only necessary to support first/next
      interactions.  Implementing this in a client is a little clumsy.
      
       - if ->start() is given a pos of zero, it should go to start of
         sequence.
      
       - if ->start() is given the same pos that was given to the most recent
         next() or start(), it should restore the iterator to the state just
         before that last call
      
       - if ->start is given another number, it should set the iterator one
         beyond the start just before the last ->start or ->next call.
      
      Also, the documentation says that the implementation can interpret the
      pos however it likes (other than zero meaning start), but seq_file
      increments the pos sometimes which does impose on the implementation.
      
      This patch simplifies the interface for first/next iteration and
      simplifies the code, while maintaining complete backward compatibility.
      Now:
      
       - if ->start() is given a pos of zero, it should return an iterator
         placed at the start of the sequence
      
       - if ->start() is given a non-zero pos, it should return the iterator
         in the same state it was after the last ->start or ->next.
      
      This is particularly useful for iterators which walk the multiple
      chains in a hash table, e.g.  using rhashtable_walk*.  See
      fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c
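      
      Under the simplified contract, a client iterator can look like this (the
      example_iter type and the *_iter helpers are hypothetical):
      
      	static void *example_seq_start(struct seq_file *m, loff_t *pos)
      	{
      		struct example_iter *it = m->private;
      
      		if (*pos == 0)
      			example_iter_rewind(it);	/* hypothetical helper */
      		return example_iter_current(it);	/* hypothetical; NULL at end */
      	}
      
      	static void *example_seq_next(struct seq_file *m, void *v, loff_t *pos)
      	{
      		struct example_iter *it = m->private;
      
      		++*pos;
      		return example_iter_advance(it);	/* hypothetical helper */
      	}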
      
      A large part of achieving this is to *always* call ->next after ->show
      has successfully stored all of an entry in the buffer.  Never just
      increment the index instead.  Also:
      
       - always pass &m->index to ->start() and ->next(), never a temp
         variable
      
       - don't clear ->from when ->count is zero, as ->from is dead when
         ->count is zero.
      
      Some ->next functions do not increment *pos when they return NULL.  To
      maintain compatibility with this, we still need to increment m->index in
      one place, if ->next didn't increment it.  Note that such ->next
      functions are buggy and should be fixed.  A simple demonstration is
      
         dd if=/proc/swaps bs=1000 skip=1
      
      Choose any block size larger than the size of /proc/swaps.  This will
      always show the whole last line of /proc/swaps.
      
      This patch doesn't work around buggy next() functions for this case.
      
      [neilb@suse.com: ensure ->from is valid]
        Link: http://lkml.kernel.org/r/87601ryb8a.fsf@notabene.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.com>
      Acked-by: Jonathan Corbet <corbet@lwn.net>	[docs]
      Tested-by: Jann Horn <jannh@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f4aace6
    • vfs: discard ATTR_ATTR_FLAG · 4cdfffc8
      NeilBrown authored
      
      
      This flag was introduced in 2.1.37pre1 and the only place it was tested
      was removed in 2.1.43pre1.  The flag was never set.
      
      Let's discard it properly.
      
      Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cdfffc8
    • fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot() · 6cd00a01
      Tetsuo Handa authored
      
      
      Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
      are initialized at __d_alloc(), we can't copy the whole size
      unconditionally.
      
       WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
       636f6e66696766732e746d70000000000010000000000000020000000188ffff
        i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
                                        ^
       RIP: 0010:take_dentry_name_snapshot+0x28/0x50
       RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
       RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
       RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
       RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
       R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
       R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
       FS:  00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
        take_dentry_name_snapshot+0x28/0x50
        vfs_rename+0x128/0x870
        SyS_rename+0x3b2/0x3d0
        entry_SYSCALL_64_fastpath+0x1a/0xa4
        0xffffffffffffffff
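      
      The fix amounts to copying only the initialized bytes (a sketch of the
      change):
      
      	/* was: memcpy(name->inline_name, dentry->d_iname, DNAME_INLINE_LEN); */
      	memcpy(name->inline_name, dentry->d_iname, dentry->d_name.len + 1);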
      
      Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cd00a01
    • ocfs2: make several functions and variables static (and some const) · 480bd564
      Colin Ian King authored
      
      
      There are a variety of functions and variables that are local to the
      source and do not need to be in global scope, so make them static.  Also
      make a couple of char arrays static const.
      
      Cleans up sparse warnings:
        symbol 'o2hb_heartbeat_mode_desc' was not declared. Should it be static?
        symbol 'o2hb_heartbeat_mode' was not declared. Should it be static?
        symbol 'o2hb_dependent_users' was not declared. Should it be static?
        symbol 'o2hb_region_dec_user' was not declared. Should it be static?
        symbol 'o2nm_fence_method_desc' was not declared. Should it be static?
        symbol 'lockdep_keys' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20180628131659.12133-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      480bd564
    • ocfs2: clean up some unnecessary code · 229ba1f8
      wangyan authored
      
      
      Several functions contain some unnecessary code; clean it up.
      
      Link: http://lkml.kernel.org/r/5B14DF72.5020800@huawei.com
      Signed-off-by: Yan Wang <wangyan122@huawei.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      229ba1f8
    • ocfs2: return -EROFS when filesystem becomes read-only · 93f5920d
      Jun Piao authored
      
      
      We should return -EROFS rather than other errno if filesystem becomes
      read-only.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/5B191B26.9010501@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Acked-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93f5920d
    • sh: prefer _THIS_IP_ to current_text_addr · 8d00d0c0
      Nick Desaulniers authored
      
      
      As part of the effort to reduce the code duplication between _THIS_IP_
      and current_text_addr(), let's consolidate callers of
      current_text_addr() to use _THIS_IP_.
      
      Link: http://lkml.kernel.org/r/20180801185331.39535-1-ndesaulniers@google.com
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d00d0c0
    • sh: make use of for_each_node_by_type() · 82f7c510
      Dmitry Torokhov authored
      
      
      Instead of open-coding the loop, let's use the canned macro.
      
      Also make sure we are not leaking the "cpus" node reference.
      
      Link: http://lkml.kernel.org/r/20180624224252.GA220395@dtor-ws
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82f7c510
    • ntfs: mft: remove VLA usage · ab62ef82
      Kees Cook authored
      
      
      In the quest to remove all stack VLA usage from the kernel[1], this
      allocates the maximum size stack buffer.  Existing checks already
      require that blocksize >= NTFS_BLOCK_SIZE and mft_record_size <=
      PAGE_SIZE, so max_bhs can be at most PAGE_SIZE / NTFS_BLOCK_SIZE.
      Sanity checks are added for robustness.
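      
      The pattern used here, roughly (a sketch; the MAX_BHS name and the exact
      check mirror the constraints described above rather than the final diff):
      
      	#define MAX_BHS	(PAGE_SIZE / NTFS_BLOCK_SIZE)
      
      		struct buffer_head *bhs[MAX_BHS];	/* previously a VLA sized by max_bhs */
      		int max_bhs = vol->mft_record_size / blocksize;
      
      		if (WARN_ON(max_bhs > MAX_BHS))
      			return -EINVAL;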
      
      [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
      
      Link: http://lkml.kernel.org/r/20180626172909.41453-4-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab62ef82
    • ntfs: decompress: remove VLA usage · 2c27ce91
      Kees Cook authored
      
      
      In the quest to remove all stack VLA usage from the kernel[1], this
      moves the stack buffer used during decompression to be allocated
      externally.
      
      The existing "dest_max_index" used in the VLA is bounded by cb_max_page.
      cb_max_page is bounded by max_page, and max_page is bounded by nr_pages.
      Since nr_pages is used for the "pages" allocation, it can similarly be
      used for the "completed_pages" allocation and passed into the
      decompression function.  The error paths are updated to free the new
      allocation.
      
      [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
      
      Link: http://lkml.kernel.org/r/20180626172909.41453-3-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c27ce91
    • ntfs: aops: remove VLA usage · ac4ecf96
      Kees Cook authored
      
      
      In the quest to remove all stack VLA usage from the kernel[1], this uses
      the maximum size needed on the stack and adds a sanity check for
      robustness: index.block_size cannot be larger than PAGE_SIZE nor less
      than NTFS_BLOCK_SIZE.
      
      [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
      
      Link: http://lkml.kernel.org/r/20180626172909.41453-2-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac4ecf96