  Jun 08, 2018
    • mm/ksm: move [set_]page_stable_node from ksm.h to ksm.c · 88484826
      Mike Rapoport authored
      
      
      page_stable_node() and set_page_stable_node() are only used in mm/ksm.c,
      so there is no point in keeping them in include/linux/ksm.h.
      
      [akpm@linux-foundation.org: fix SYSFS=n build]
      Link: http://lkml.kernel.org/r/1524552106-7356-3-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ksm: remove unused page_referenced_ksm declaration · 48f49e1f
      Mike Rapoport authored
      Commit 9f32624b ("mm/rmap: use rmap_walk() in page_referenced()")
      removed the declaration of page_referenced_ksm for the case
      CONFIG_KSM=y, but left one for CONFIG_KSM=n.
      
      Remove the unused leftover.
      
      Link: http://lkml.kernel.org/r/1524552106-7356-2-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockdep: fix fs_reclaim annotation · 93781325
      Omar Sandoval authored
      While revisiting my Btrfs swapfile series [1], I introduced a situation
      in which reclaim would lock i_rwsem, and even though the swapon() path
      clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
      complaints from lockdep.  It turns out that the rework of the fs_reclaim
      annotation was broken: if the current task has PF_MEMALLOC set, we don't
      acquire the dummy fs_reclaim lock, but when reclaiming we always check
      this _after_ we've just set the PF_MEMALLOC flag.  In most cases, we can
      fix this by moving the fs_reclaim_{acquire,release}() outside of the
      memalloc_noreclaim_{save,restore}(), although kswapd is slightly
      different.  After applying this, I got the expected lockdep splats.
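
      The ordering bug can be illustrated with a small model (plain Python,
      not kernel code): fs_reclaim_acquire() is skipped when PF_MEMALLOC is
      already set, so taking the annotation after setting the flag silences
      lockdep entirely.

```python
# Toy model of the fs_reclaim annotation ordering bug (not kernel code).
# fs_reclaim_acquire() is skipped when PF_MEMALLOC is already set, so the
# order of memalloc_noreclaim_save() vs. fs_reclaim_acquire() matters.

PF_MEMALLOC = 0x1

def fs_reclaim_acquire(task_flags, held):
    """Take the dummy lockdep lock unless PF_MEMALLOC is set."""
    if not (task_flags & PF_MEMALLOC):
        held.append("fs_reclaim")

def broken_reclaim_path():
    held = []
    flags = 0
    flags |= PF_MEMALLOC              # memalloc_noreclaim_save() first ...
    fs_reclaim_acquire(flags, held)   # ... so the annotation is skipped
    return held

def fixed_reclaim_path():
    held = []
    flags = 0
    fs_reclaim_acquire(flags, held)   # annotate BEFORE setting PF_MEMALLOC
    flags |= PF_MEMALLOC              # memalloc_noreclaim_save()
    return held

print(broken_reclaim_path())  # [] - lockdep sees nothing
print(fixed_reclaim_path())   # ['fs_reclaim'] - lockdep can now complain
```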
      
      1: https://lwn.net/Articles/625412/
      
      Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
      Fixes: d92a8cfc ("locking/lockdep: Rework FS_RECLAIM annotation")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shmem: make stat.st_blksize return huge page size if THP is on · 89fdcd26
      Yang Shi authored
      
      
      Since tmpfs THP support was added in 4.8, hugetlbfs is no longer the
      only filesystem with huge page support.  tmpfs can use huge pages via
      THP when mounted with the "huge=" mount option.

      When applications use huge pages on hugetlbfs, they just need to check
      the filesystem magic number, but that is not enough for tmpfs.  Make
      stat.st_blksize return the huge page size if the filesystem is mounted
      with an appropriate "huge=" option, to give applications a hint to
      optimize their behavior with THP.
      
      Some applications may not behave wisely with THP.  For example, QEMU
      may mmap a file at a non-huge-page-aligned hint address with MAP_FIXED,
      which results in no pages being PMD mapped even though THP is used.
      Some applications may mmap a file with a non-huge-page-aligned offset.
      Both behaviors make THP pointless.
      
      statfs.f_bsize still returns 4KB for tmpfs since THP could be split,
      and tmpfs also may silently fall back to 4KB pages if there are not
      enough huge pages.  Furthermore, a different f_bsize makes the
      max_blocks and free_blocks calculations harder without much benefit.
      Returning the huge page size via stat.st_blksize sounds good enough.

      Since PUD-sized huge pages are not yet supported for THP, it just
      returns HPAGE_PMD_SIZE for now.
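
      A minimal sketch of that decision (hypothetical helper name, not the
      actual shmem.c code; which "huge=" values qualify follows Hugh's note
      below, i.e. "force" and "always", and is an assumption here):

```python
# Simplified model of the st_blksize decision (hypothetical helper name,
# not the actual shmem.c code).
PAGE_SIZE = 4096
HPAGE_PMD_SIZE = 2 * 1024 * 1024  # PMD-sized huge page on x86-64

def shmem_stat_blksize(huge_mount_option):
    """Return the huge page size only when the mount guarantees huge pages.

    Assumption: only "always" and "force" count, per Hugh's remark
    ("force or always, okay; within_size or advise, not so much").
    """
    if huge_mount_option in ("always", "force"):
        return HPAGE_PMD_SIZE
    return PAGE_SIZE
```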
      
      Hugh said:
      
      : Sorry, I have no enthusiasm for this patch; but do I feel strongly
      : enough to override you and everyone else to NAK it?  No, I don't feel
      : that strongly, maybe st_blksize isn't worth arguing over.
      :
      : We did look at struct stat when designing huge tmpfs, to see if there
      : were any fields that should be adjusted for it; but concluded none.
      : Yes, it would sometimes be nice to have a quickly accessible indicator
      : for when tmpfs has been mounted huge (scanning /proc/mounts for options
      : can be tiresome, agreed); but since tmpfs tries to supply huge (or not)
      : pages transparently, no difference seemed right.
      :
      : So, because st_blksize is a not very useful field of struct stat, with
      : "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead
      : of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge"
      : options (force or always, okay; within_size or advise, not so much).
      : Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize
      : for file system I/O" than PAGE_SIZE was.
      :
      : Which we can expect to speed up some applications and disadvantage
      : others, depending on how they interpret st_blksize: just like if we
      : changed it in the same way on non-huge tmpfs.  (Did I actually try
      : changing st_blksize early on, and find it broke something?  If so, I've
      : now forgotten what, and a search through commit messages didn't find
      : it; but I guess we'll find out soon enough.)
      :
      : If there were an mstat() syscall, returning a field "preferred
      : alignment", then we could certainly agree to put HPAGE_PMD_SIZE in
      : there; but in stat()'s st_blksize?  And what happens when (in future)
      : mm maps this or that hard-disk filesystem's blocks with a pmd mapping -
      : should that filesystem then advertise a bigger st_blksize, despite the
      : same disk layout as before?  What happens with DAX?
      :
      : And this change is not going to help the QEMU suboptimality that
      : brought you here (or does QEMU align mmaps according to st_blksize?).
      : QEMU ought to work well with kernels without this change, and kernels
      : with this change; and I hope it can easily deal with both by avoiding
      : that use of MAP_FIXED which prevented the kernel's intended alignment.
      
      [akpm@linux-foundation.org: remove unneeded `else']
      Link: http://lkml.kernel.org/r/1524665633-83806-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmalloc: pass proper vm_start into debugobjects · 05e3ff95
      Chintan Pandya authored
      
      
      A client can call vunmap() with some intermediate 'addr' which may not
      be the start of the VM area.  The entire unmap code works with
      vm->vm_start, which is correct, but the debug object API is called
      with 'addr'.  This could be a problem within debug objects.

      Pass the proper start address into the debug object API.
      
      [akpm@linux-foundation.org: fix warning]
      Link: http://lkml.kernel.org/r/1523961828-9485-3-git-send-email-cpandya@codeaurora.org
      Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmalloc: avoid racy handling of debugobjects in vunmap · f3c01d2f
      Chintan Pandya authored
      
      
      Currently, the __vunmap() flow is:
       1) Release the VM area
       2) Free the debug objects corresponding to that vm area.

      This leaves a race window open:
       1) Release the VM area
       1.5) Some other client gets the same vm area
       1.6) This client allocates new debug objects on the same
            vm area
       2) Free the debug objects corresponding to this vm area.

      Here, we actually free the 'other' client's debug objects.

      Fix this by freeing the debug objects first and then releasing the VM
      area.
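
      The race and the fix can be sketched with a toy model (plain Python,
      not kernel code; the registry and helper names are invented for
      illustration):

```python
# Toy model of the vunmap/debugobjects race (not kernel code).  Debug
# objects are tracked per address; the fix frees them BEFORE the VM area
# is released and can be reused by another client.

debug_objects = {}      # addr -> owner of the registered debug object
free_areas = set()

def release_vm_area(addr):
    free_areas.add(addr)

def alloc_vm_area(addr, owner):
    free_areas.discard(addr)
    debug_objects[addr] = owner

def debug_objects_free(addr):
    debug_objects.pop(addr, None)

def racy_vunmap(addr, racing_client=None):
    release_vm_area(addr)                   # step 1: area up for grabs
    if racing_client:
        alloc_vm_area(addr, racing_client)  # steps 1.5/1.6: someone else wins
    debug_objects_free(addr)                # step 2: frees the OTHER objects!

def fixed_vunmap(addr, racing_client=None):
    debug_objects_free(addr)                # free our objects first
    release_vm_area(addr)                   # only then publish the area
    if racing_client:
        alloc_vm_area(addr, racing_client)

alloc_vm_area(0x1000, "us")
racy_vunmap(0x1000, racing_client="other")
print(0x1000 in debug_objects)   # False: "other"'s objects were freed

alloc_vm_area(0x2000, "us")
fixed_vunmap(0x2000, racing_client="other")
print(debug_objects[0x2000])     # "other": the racing client's objects survive
```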
      
      Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org
      Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmalloc: clean up vunmap to avoid pgtable ops twice · 82a2e924
      Chintan Pandya authored
      
      
      vunmap does page table clear operations twice when
      DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled.

      So, clean up the code, as that is unintended.

      As a perf gain, we save a few microseconds.  The ftrace data below was
      obtained while doing 1 MB of vmalloc/vfree on an ARM64 based SoC
      *without* this patch applied.  After this patch, we save ~3 us (on 1
      extra vunmap_page_range).
      
        CPU  DURATION                  FUNCTION CALLS
        |     |   |                     |   |   |   |
       6)               |  __vunmap() {
       6)               |    vmap_debug_free_range() {
       6)   3.281 us    |      vunmap_page_range();
       6) + 45.468 us   |    }
       6)   2.760 us    |    vunmap_page_range();
       6) ! 505.105 us  |  }
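
      The redundancy boils down to counting page-table clears per __vunmap
      call, which a toy model (plain Python, not kernel code) makes plain:

```python
# Toy model counting page-table clear operations in __vunmap (not kernel
# code).  With DEBUG_PAGEALLOC, vmap_debug_free_range() already cleared
# the range, so the second vunmap_page_range() call is redundant work.

def old_vunmap(debug_pagealloc):
    clears = 0
    if debug_pagealloc:
        clears += 1   # vmap_debug_free_range() -> vunmap_page_range()
    clears += 1       # unconditional vunmap_page_range()
    return clears

def new_vunmap(debug_pagealloc):
    return 1          # the range is cleared exactly once either way

print(old_vunmap(True), new_vunmap(True))   # 2 1
```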
      
      [cpandya@codeaurora.org: v3]
        Link: http://lkml.kernel.org/r/1525176960-18408-1-git-send-email-cpandya@codeaurora.org
      Link: http://lkml.kernel.org/r/1523876342-10545-1-git-send-email-cpandya@codeaurora.org
      Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: pass the __highest_present_section_nr + 1 to alloc_func() · 08994b24
      Wei Yang authored
      In commit c4e1be9e ("mm, sparsemem: break out of loops early"),
      __highest_present_section_nr was introduced to reduce the loop count
      when iterating over present sections.  This is also helpful for usemap
      and memmap allocation.
      
      This patch uses __highest_present_section_nr + 1 to optimize the loop.
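
      The saving is just a tighter upper bound on the loop, which a small
      model (plain Python, illustrative numbers) shows:

```python
# Toy model of the loop-bound optimization (not kernel code).  Iterating
# up to __highest_present_section_nr + 1 instead of NR_MEM_SECTIONS skips
# the guaranteed-absent tail of the section array.

NR_MEM_SECTIONS = 1 << 16   # illustrative value

def loop_bounds(present_sections):
    highest_present = max(present_sections)
    old_bound = NR_MEM_SECTIONS        # old upper bound
    new_bound = highest_present + 1    # optimized upper bound
    return old_bound, new_bound

old, new = loop_bounds({0, 3, 42})
print(old, new)  # 65536 43
```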
      
      Link: http://lkml.kernel.org/r/20180326081956.75275-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: check __highest_present_section_nr only for a present section · d538c164
      Wei Yang authored
      
      
      When searching a present section, there are two boundaries:
      
          * __highest_present_section_nr
          * NR_MEM_SECTIONS
      
      It is known that __highest_present_section_nr is a stricter boundary
      than NR_MEM_SECTIONS, which means it is sufficient to check
      __highest_present_section_nr only.
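
      A sketch of such a search using only the stricter bound (hypothetical
      Python stand-in, mirroring the idea rather than mm/sparse.c):

```python
# Sketch of a next-present-section search using only the stricter bound
# (not the actual mm/sparse.c code; 'present' is an illustrative set).

present = {0, 3, 42}
highest_present_section_nr = max(present)

def next_present_section_nr(section_nr):
    """Return the next present section after section_nr, or -1."""
    section_nr += 1
    # No need to also compare against NR_MEM_SECTIONS: nothing past
    # highest_present_section_nr can be present.
    while section_nr <= highest_present_section_nr:
        if section_nr in present:
            return section_nr
        section_nr += 1
    return -1

print(next_present_section_nr(3))    # 42
print(next_present_section_nr(42))   # -1
```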
      
      Link: http://lkml.kernel.org/r/20180326081956.75275-2-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, gup: prevent pmd checking race in follow_pmd_mask() · 68827280
      Huang Ying authored
      
      
      mmap_sem will be read locked when calling follow_pmd_mask().  But this
      cannot prevent PMD from being changed for all cases when PTL is
      unlocked, for example, from pmd_trans_huge() to pmd_none() via
      MADV_DONTNEED.  So it is possible for the pmd_present() check in
      follow_pmd_mask() to encounter an invalid PMD.  This may cause an
      incorrect VM_BUG_ON() or an infinite loop.  Fix this by reading the PMD
      entry into a local variable with READ_ONCE() and checking the local
      variable and pmd_none() in the retry loop.
      
      As Kirill pointed out, with the PTL unlocked, the *pmd may be changed
      under us, so reading it directly again and again may incur weird bugs.
      So although using *pmd directly other than for the pmd_present() check
      may be safe, it is still better to read *pmd once and check the local
      variable multiple times.

      Replacing all direct uses of *pmd with the local variable while the
      PTL is unlocked was suggested by Kirill.
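
      The hazard of repeated dereferences versus one snapshot can be shown
      with a toy model (plain Python, not kernel code; the RacyPmd class and
      state names are invented for illustration):

```python
# Toy model of the follow_pmd_mask() fix (not kernel code).  The PMD can
# change under us while the PTL is unlocked, so every check must be made
# against one READ_ONCE() snapshot, not against repeated dereferences.

class RacyPmd:
    """Simulates *pmd changing between two dereferences."""
    def __init__(self, values):
        self.values = list(values)
    def read(self):
        return self.values.pop(0) if len(self.values) > 1 else self.values[0]

def buggy_follow(pmd):
    # Two independent dereferences can observe an impossible combination.
    if pmd.read() != "none":          # sees "trans_huge"
        if pmd.read() == "none":      # now sees "none": confused state
            return "BUG"
    return "ok"

def fixed_follow(pmd):
    pmdval = pmd.read()               # READ_ONCE(*pmd): single snapshot
    if pmdval != "none" and pmdval == "none":
        return "BUG"                  # impossible with one snapshot
    return "ok"

print(buggy_follow(RacyPmd(["trans_huge", "none"])))  # BUG
print(fixed_follow(RacyPmd(["trans_huge", "none"])))  # ok
```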
      
      Link: http://lkml.kernel.org/r/20180419083514.1365-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68827280
    • Roman Gushchin's avatar
      mm/docs: describe memory.low refinements · 7854207f
      Roman Gushchin authored
      
      
      Refine cgroup v2 docs after latest memory.low changes.
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7854207f
    • Roman Gushchin's avatar
      mm: treat memory.low value inclusive · 5f93ad67
      Roman Gushchin authored
      
      
      If memcg's usage is equal to the memory.low value, avoid reclaiming from
      this cgroup while there is a surplus of reclaimable memory.
      
      This sounds more logical and also matches memory.high and memory.max
      behavior: both are inclusive.
      
      Empty cgroups are not considered protected, so MEMCG_LOW events are not
      emitted for empty cgroups, if there is no more reclaimable memory in the
      system.
      
      Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f93ad67
    • Roman Gushchin's avatar
      mm: memory.low hierarchical behavior · 23067153
      Roman Gushchin authored
      
      
      This patch aims to address an issue in current memory.low semantics,
      which makes it hard to use it in a hierarchy, where some leaf memory
      cgroups are more valuable than others.
      
      For example, there are memcgs A, A/B, A/C, A/D and A/E:
      
        A      A/memory.low = 2G, A/memory.current = 6G
       //\\
      BC  DE   B/memory.low = 3G  B/memory.current = 2G
               C/memory.low = 1G  C/memory.current = 2G
               D/memory.low = 0   D/memory.current = 2G
               E/memory.low = 10G E/memory.current = 0
      
      If we apply memory pressure, B, C and D are reclaimed at the same pace
      while A's usage exceeds 2G.  This is obviously wrong, as B's usage is
      fully below B's memory.low, and C has 1G of protection as well.  Also,
      A is pushed to a size below A's 2G memory.low, which is also wrong.
      
      A simple bash script (provided below) can be used to reproduce
      the problem. Current results are:
        A:    1430097920
        A/B:  711929856
        A/C:  717426688
        A/D:  741376
        A/E:  0
      
      To address the issue, a concept of effective memory.low is introduced.
      Effective memory.low is always equal to or less than the original
      memory.low.  In the case when there is no memory.low overcommitment
      (and also for top-level cgroups), these two values are equal.

      Otherwise it's a part of the parent's effective memory.low, calculated
      as the cgroup's memory.low usage divided by the sum of the siblings'
      memory.low usages (by memory.low usage I mean the size of actually
      protected memory: memory.current if memory.current < memory.low, 0
      otherwise).  It's necessary to track the actual usage, because
      otherwise an empty cgroup with memory.low set (A/E in my example)
      would affect the actual memory distribution, which makes no sense.  To
      avoid traversing the cgroup tree twice, the page_counter code is
      reused.
      
      Calculating effective memory.low can be done in the reclaim path, as
      we are conveniently traversing the cgroup tree from top to bottom
      there, checking memory.low at each level.  So it's a perfect place to
      calculate effective memory.low and save it for use by children
      cgroups.

      This also eliminates the need to traverse the cgroup tree from bottom
      to top each time to check whether the parent's guarantee is exceeded.
      
      Setting/resetting effective memory.low is intentionally racy, but it's
      fine and shouldn't lead to any significant differences in actual memory
      distribution.
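
      The arithmetic described above can be modeled as follows (a simplified
      sketch, not the kernel implementation: the racy in-place propagation
      and page_counter details are ignored, and the numbers will not exactly
      match the reclaim results below):

```python
# Simplified model of effective memory.low propagation as described in
# the text (not the kernel implementation).

G = 1 << 30

def low_usage(current, low):
    # "the size of actually protected memory": memory.current if it is
    # below memory.low, 0 otherwise
    return current if current < low else 0

def effective_low(parent_elow, children):
    """children: dict name -> (memory.low, memory.current)."""
    total = sum(low_usage(cur, low) for low, cur in children.values())
    elow = {}
    for name, (low, cur) in children.items():
        share = parent_elow * low_usage(cur, low) // total if total else 0
        # effective memory.low never exceeds the original memory.low
        elow[name] = min(low, share)
    return elow

children = {
    "B": (3 * G, 2 * G),   # fully protected usage
    "C": (1 * G, 2 * G),   # usage above its low
    "D": (0,     2 * G),   # no low set
    "E": (10 * G, 0),      # empty cgroup: no actual protection
}
print(effective_low(2 * G, children))
```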
      
      With this patch applied results are matching the expectations:
        A:    2147930112
        A/B:  1428721664
        A/C:  718393344
        A/D:  815104
        A/E:  0
      
      Test script:
        #!/bin/bash
      
        CGPATH="/sys/fs/cgroup"
      
        truncate /file1 --size 2G
        truncate /file2 --size 2G
        truncate /file3 --size 2G
        truncate /file4 --size 50G
      
        mkdir "${CGPATH}/A"
        echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
        mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
      
        echo 2G > "${CGPATH}/A/memory.low"
        echo 3G > "${CGPATH}/A/B/memory.low"
        echo 1G > "${CGPATH}/A/C/memory.low"
        echo 0 > "${CGPATH}/A/D/memory.low"
        echo 10G > "${CGPATH}/A/E/memory.low"
      
        echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
        echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
        echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
        echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4
      
        echo "A:   " `cat "${CGPATH}/A/memory.current"`
        echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
        echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
        echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
        echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`
      
        rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
        rmdir "${CGPATH}/A"
        rm /file1 /file2 /file3 /file4
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename page_counter's count/limit into usage/max · bbec2e15
      Roman Gushchin authored
      
      
      This patch renames struct page_counter fields:
        count -> usage
        limit -> max
      
      and the corresponding functions:
        page_counter_limit() -> page_counter_set_max()
        mem_cgroup_get_limit() -> mem_cgroup_get_max()
        mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
        memcg_update_kmem_limit() -> memcg_update_kmem_max()
        memcg_update_tcp_limit() -> memcg_update_tcp_max()
      
      The idea behind this renaming is to have a direct match between the
      memory cgroup knobs (low, high, max) and the page_counter API.
      
      This is pure renaming, this patch doesn't bring any functional change.
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: introduce PHYS_ADDR_MAX · 1c4bc43d
      Stefan Agner authored
      
      
      So far the code was using ULLONG_MAX and type casting to obtain a
      phys_addr_t with all bits set.  The typecast is necessary to silence
      compiler warnings on 32-bit platforms.

      Use the simpler but still type-safe approach "~(phys_addr_t)0" to
      create a preprocessor define for all bits set.
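
      Why the complement of zero gives "all bits set" at any width can be
      modeled with fixed-width masking (Python stand-in for C's unsigned
      arithmetic, for illustration only):

```python
# Model of why ~(phys_addr_t)0 yields all bits set regardless of whether
# phys_addr_t is 32 or 64 bits wide (Python stand-in for C's fixed-width
# unsigned arithmetic).

def phys_addr_max(bits):
    """~(phys_addr_t)0 for a 'bits'-wide unsigned phys_addr_t."""
    mask = (1 << bits) - 1
    return ~0 & mask          # complement of zero, truncated to the type

print(hex(phys_addr_max(32)))  # 0xffffffff
print(hex(phys_addr_max(64)))  # 0xffffffffffffffff
```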
      
      Link: http://lkml.kernel.org/r/20180406213809.566-1-stefan@agner.ch
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove odd HAVE_PTE_SPECIAL · 00b3a331
      Laurent Dufour authored
      
      
      Remove the additional define HAVE_PTE_SPECIAL and rely directly on
      CONFIG_ARCH_HAS_PTE_SPECIAL.

      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523533733-25437-1-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce ARCH_HAS_PTE_SPECIAL · 3010a5ea
      Laurent Dufour authored
      
      
      Currently, PTE special support is turned on in per-architecture header
      files.  Most of the time, it is defined in
      arch/*/include/asm/pgtable.h, depending (or not) on some other
      per-architecture static definition.

      This patch introduces a new configuration variable to manage this
      directly in the Kconfig files.  It will later replace
      __HAVE_ARCH_PTE_SPECIAL.

      Here are notes for some architectures where the definition of
      __HAVE_ARCH_PTE_SPECIAL is not obvious:
      
      arm
       __HAVE_ARCH_PTE_SPECIAL which is currently defined in
      arch/arm/include/asm/pgtable-3level.h which is included by
      arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
      So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
      
      powerpc
      __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
       - arch/powerpc/include/asm/book3s/64/pgtable.h
       - arch/powerpc/include/asm/pte-common.h
      The first one is included if (PPC_BOOK3S & PPC64) while the second is
      included in all the other cases.
      So select ARCH_HAS_PTE_SPECIAL all the time.
      
      sparc:
      __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
      defined(__arch64__) which are defined through the compiler in
      sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
      So select ARCH_HAS_PTE_SPECIAL if SPARC64.
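
      Taken together, the per-architecture notes above amount to Kconfig
      select lines of roughly this shape (illustrative fragment, not the
      full patch; exact placement in each arch Kconfig may differ):

```
# arch/arm/Kconfig
config ARM
	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE

# arch/powerpc/Kconfig
config PPC
	select ARCH_HAS_PTE_SPECIAL

# arch/sparc/Kconfig
config SPARC64
	select ARCH_HAS_PTE_SPECIAL
```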
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Albert Ou <albert@sifive.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: remove realsize in free_area_init_core() · e6943859
      Wei Yang authored
      
      
      Highmem's realsize always equals freesize, so it is not necessary to
      spare a variable to record this.
      
      Link: http://lkml.kernel.org/r/20180413083859.65888-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: restructure memfd code · 5d752600
      Mike Kravetz authored
      
      
      With the addition of memfd hugetlbfs support, we now have a situation
      where memfd depends on TMPFS -or- HUGETLBFS.  Previously, memfd was
      only supported on tmpfs, so it made sense that the code resided in
      shmem.c.  In the current code, memfd is only functional if TMPFS is
      defined.  If HUGETLBFS is defined and TMPFS is not defined, then memfd
      functionality will not be available for hugetlbfs.  This does not
      cause BUGs, just a lack of potentially desired functionality.
      
      Code is restructured in the following way:
      - include/linux/memfd.h is a new file containing memfd specific
        definitions previously contained in shmem_fs.h.
      - mm/memfd.c is a new file containing memfd specific code previously
        contained in shmem.c.
      - memfd specific code is removed from shmem_fs.h and shmem.c.
      - A new config option MEMFD_CREATE is added that is defined if TMPFS
        or HUGETLBFS is defined.
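
      The last point amounts to a Kconfig fragment of roughly this shape (a
      sketch of the new option; the exact Kconfig wording and placement may
      differ):

```
# mm/Kconfig
config MEMFD_CREATE
	def_bool TMPFS || HUGETLBFS
```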
      
      No functional changes are made to the code: restructuring only.
      
      Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marc-André Lureau <marcandre.lureau@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d752600
    • mm/shmem: update file sealing comments and file checking · c49fcfcd
      Mike Kravetz authored
      
      
      In preparation for memfd code restructure, update comments, definitions
      and function names dealing with file sealing to indicate that tmpfs and
      hugetlbfs are the supported filesystems.  Also, change file pointer
      checks in memfd_file_seals_ptr to use defined interfaces instead of
      directly referencing file_operation structs.
      
      Link: http://lkml.kernel.org/r/20180415182119.4517-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marc-André Lureau <marcandre.lureau@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c49fcfcd
    • mm/shmem: add __rcu annotations and properly deref radix entry · 5b9c98f3
      Mike Kravetz authored
      
      
      Patch series "restructure memfd code", v4.
      
      This patch (of 3):
      
      In preparation for the memfd code restructure, clean up sparse warnings.
      Most changes required adding __rcu annotations.  The routine
      find_swap_entry was modified to properly dereference radix tree entries.
      
      Link: http://lkml.kernel.org/r/20180415182119.4517-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Marc-André Lureau <marcandre.lureau@gmail.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b9c98f3
    • zram: introduce zram memory tracking · c0265342
      Minchan Kim authored
      
      
      zRAM as swap is useful for small-memory devices.  However, swap means
      the pages on zram are mostly cold due to the VM's LRU algorithm.  In
      particular, once an application's init data has been touched during
      launch, it tends never to be accessed again and is eventually swapped
      out.  zRAM can store such cold pages in compressed form, but it is
      pointless to keep them in memory at all; a better approach is for app
      developers to free such pages directly rather than leaving them on the
      heap.
      
      This patch reports the last access time of each block of zram via
      "cat /sys/kernel/debug/zram/zram0/block_state".
      
      The output is as follows,
            300    75.033841 .wh
            301    63.806904 s..
            302    63.806919 ..h
      
      The first column is zram's block index, and the third column shows the
      block-state symbols (s: same page, w: written to backing store,
      h: huge page).  The second column is the time, in seconds with
      microsecond resolution, at which the block was last accessed.  So the
      example above means the 300th block was last accessed at 75.033841
      seconds; it was huge, so it was written to the backing store.
      
      Admins can leverage this information, together with *pagemap*, to
      catch cold or incompressible pages of a process once part of its heap
      has been swapped out.
      
      I used this feature a few years ago to find memory hoggers in
      userspace and notify them of memory they had left untouched for a long
      time.  With it, they could reduce unnecessary memory usage.  At that
      time I carried the feature as a private zram hack, but now that I need
      it again it seems better to upstream it than to keep it out of tree.
      I hope to submit the userspace tool that uses this feature soon.
      
      [akpm@linux-foundation.org: fix i386 printk warning]
      [minchan@kernel.org: use ktime_get_boottime() instead of sched_clock()]
        Link: http://lkml.kernel.org/r/20180420063525.GA253739@rodete-desktop-imager.corp.google.com
      [akpm@linux-foundation.org: documentation tweak]
      [akpm@linux-foundation.org: fix i386 printk warning]
      [minchan@kernel.org: fix compile warning]
        Link: http://lkml.kernel.org/r/20180508104849.GA8209@rodete-desktop-imager.corp.google.com
      [rdunlap@infradead.org: fix printk formats]
        Link: http://lkml.kernel.org/r/3652ccb1-96ef-0b0b-05d1-f661d7733dcc@infradead.org
      Link: http://lkml.kernel.org/r/20180416090946.63057-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0265342
    • zram: record accessed second · d7eac6b6
      Minchan Kim authored
      
      
      zRAM as swap is useful for small-memory devices.  However, swap means
      the pages on zram are mostly cold due to the VM's LRU algorithm.  In
      particular, once an application's init data has been touched during
      launch, it tends never to be accessed again and is eventually swapped
      out.  zRAM can store such cold pages in compressed form, but it is
      pointless to keep them in memory at all; a better approach is for app
      developers to free such pages directly rather than leaving them on the
      heap.
      
      This patch records the last access time of each block of zram so that,
      combined with the upcoming zram memory tracking, it can help userspace
      developers reduce memory footprint.
      
      Link: http://lkml.kernel.org/r/20180416090946.63057-4-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7eac6b6
    • zram: mark incompressible page as ZRAM_HUGE · 89e85bce
      Minchan Kim authored
      
      
      Mark incompressible pages so that, using the upcoming zram memory
      tracker feature, we can investigate who owns such pages once they are
      swapped out.
      
      With that information, we could prevent those pages from being swapped
      out by mlocking them, or simply remove them.
      
      This patch exposes new stat for huge pages via mm_stat.
      
      Link: http://lkml.kernel.org/r/20180416090946.63057-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e85bce
    • zram: correct flag name of ZRAM_ACCESS · c4d6c4cc
      Minchan Kim authored
      
      
      Patch series "zram memory tracking", v5.
      
      zRAM as swap is useful for small-memory devices.  However, swap means
      the pages on zram are mostly cold due to the VM's LRU algorithm.  In
      particular, once an application's init data has been touched during
      launch, it tends never to be accessed again and is eventually swapped
      out.  zRAM can store such cold pages in compressed form, but it is
      pointless to keep them in memory at all.  Likewise, it is pointless to
      store incompressible pages in zram, so a better approach is for app
      developers to manage such pages directly, with free or mlock, rather
      than leaving them on the heap.
      
      This patch provides a debugfs file,
      /sys/kernel/debug/zram/zram0/block_state, that represents each block's
      state, so admins can use pagemap to investigate which memory is cold,
      incompressible, or a same-filled page once the pages are swapped out.
      
      The output is as follows:
            300    75.033841 .wh
            301    63.806904 s..
            302    63.806919 ..h
      
      The first column is zram's block index, and the third column shows the
      block-state symbols (s: same page, w: written to backing store,
      h: huge page).  The second column is the time, in seconds with
      microsecond resolution, at which the block was last accessed.  So the
      example above means the 300th block was last accessed at 75.033841
      seconds; it was huge, so it was written to the backing store.
      
      This patch (of 4):
      
      ZRAM_ACCESS is used for locking a slot of zram, so correct the name.
      It is also not an ordinary flag indicating the status of a block, so
      move its declaration above the status flags.  Lastly, move the
      function to the top of the source file so it can be used without a
      forward declaration.
      
      Link: http://lkml.kernel.org/r/20180416090946.63057-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4d6c4cc
    • mm, memcontrol: implement memory.swap.events · f3a53a3a
      Tejun Heo authored
      
      
      Add swap max and fail events so that userland can monitor and respond to
      running out of swap.
      
      I'm not too sure about the fail event.  Right now it is a bit
      confusing which stats / events are recursive and which aren't, and
      also which ones reflect events that originate from a given cgroup and
      which ones target the cgroup.  I have no idea what the right long-term
      solution is; it could be that growing them organically is actually the
      only right thing to do.
      
      Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3a53a3a
    • mm, memcontrol: move swap charge handling into get_swap_page() · bb98f2c5
      Tejun Heo authored
      
      
      Patch series "mm, memcontrol: Implement memory.swap.events", v2.
      
      This patchset implements memory.swap.events which contains max and fail
      events so that userland can monitor and respond to swap running out.
      
      This patch (of 2):
      
      get_swap_page() is always followed by mem_cgroup_try_charge_swap().
      This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
      makes get_swap_page() call the function even after swap allocation
      failure.
      
      This simplifies the callers and consolidates memcg related logic and
      will ease adding swap related memcg events.
      
      Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb98f2c5
    • mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct · 88aa7cc6
      Yang Shi authored
      
      
      mmap_sem is on a hot path of the kernel and is very contended, but it
      is also abused.  It is used to protect arg_start|end and env_start|end
      when reading /proc/$PID/cmdline and /proc/$PID/environ, but that
      doesn't make sense: those proc files just expect to read four values
      atomically, the values are not related to VM operations, and they can
      be set to arbitrary values by C/R.
      
      And the mmap_sem contention may cause unexpected issues like the one
      below:
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
       Call Trace:
         schedule+0x36/0x80
         rwsem_down_read_failed+0xf0/0x150
         call_rwsem_down_read_failed+0x18/0x30
         down_read+0x20/0x40
         proc_pid_cmdline_read+0xd9/0x4e0
         __vfs_read+0x37/0x150
         vfs_read+0x96/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock
      for these fields to mitigate the abuse of mmap_sem.
      
      So, introduce a new spinlock in mm_struct to protect concurrent access
      to arg_start|end, env_start|end and friends.  Also downgrade the
      mmap_sem write lock to a read lock where it only guards the race
      between prctl and sys_brk that might break check_data_rlimit(); this
      makes prctl friendlier to other VM operations.
      
      This patch just eliminates the abuse of mmap_sem; it can't resolve the
      above hung-task warning completely, since the later access_remote_vm()
      call still needs to acquire mmap_sem.  The mmap_sem scalability issue
      will be solved in the future.
      
      [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
        Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88aa7cc6
    • slab: clean up the code comment in slab kmem_cache struct · 05fec35e
      Baoquan He authored
      In commit 3b0efdfa
      
       ("mm, sl[aou]b: Extract common fields from struct
      kmem_cache") the variable 'obj_size' was moved, but the related code
      comment was not updated accordingly.  Do it here.
      
      Link: http://lkml.kernel.org/r/20180603032402.27526-1-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05fec35e
    • mm/slub: remove obsolete comment · 05088e5d
      Canjiang Lu authored
      The obsolete comment removed in this patch was introduced by
      51df1142
      
       ("slub: Dynamically size kmalloc cache allocations").
      
      I paste the related modification from that commit:
      
      +#ifdef CONFIG_NUMA
      +       /*
      +        * Allocate kmem_cache_node properly from the kmem_cache slab.
      +        * kmem_cache_node is separately allocated so no need to
      +        * update any list pointers.
      +        */
      +       temp_kmem_cache_node = kmem_cache_node;
      
      +       kmem_cache_node = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
      +       memcpy(kmem_cache_node, temp_kmem_cache_node, kmem_size);
      +
      +       kmem_cache_bootstrap_fixup(kmem_cache_node);
      +
      +       caches++;
      +#else
      +       /*
      +        * kmem_cache has kmem_cache_node embedded and we moved it!
      +        * Update the list heads
      +        */
      +       INIT_LIST_HEAD(&kmem_cache->local_node.partial);
      +       list_splice(&temp_kmem_cache->local_node.partial, &kmem_cache->local_node.partial);
      +#ifdef CONFIG_SLUB_DEBUG
      +       INIT_LIST_HEAD(&kmem_cache->local_node.full);
      +       list_splice(&temp_kmem_cache->local_node.full, &kmem_cache->local_node.full);
      +#endif
      
      As we can see, these comments were used to distinguish the different
      handling between NUMA and non-NUMA configurations in the original
      commit.  They no longer make sense in the current implementation,
      which sits above kmem_cache_node = bootstrap(&boot_kmem_cache_node);
      so remove them now.
      
      Link: http://lkml.kernel.org/r/5af26f58.1c69fb81.1be0e.c520SMTPIN_ADDED_BROKEN@mx.google.com
      Signed-off-by: Canjiang Lu <canjiang.lu@samsung.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05088e5d
    • mm/slub.c: add __printf verification to slab_err() · a38965bf
      Mathieu Malaterre authored
      
      
      __printf is useful to verify format and arguments.  Remove the following
      warning (with W=1):
      
        mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format]
      
      Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a38965bf
    • slab: __GFP_ZERO is incompatible with a constructor · 128227e7
      Matthew Wilcox authored
      __GFP_ZERO requests that the object be initialised to all-zeroes, while
      the purpose of a constructor is to initialise an object to a particular
      pattern.  We cannot do both.  Add a warning to catch any users who
      mistakenly pass a __GFP_ZERO flag when allocating a slab with a
      constructor.
      
      Link: http://lkml.kernel.org/r/20180412191322.GA21205@bombadil.infradead.org
      Fixes: d07dbea4
      
       ("Slab allocators: support __GFP_ZERO in all allocators")
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      128227e7
    • net/9p/trans_xen.c: don't include rwlock.h directly · e56ee574
      Sebastian Andrzej Siewior authored
      
      
      rwlock.h should not be included directly; linux/spinlock.h should be
      included instead.  Among other things, including rwlock.h directly
      breaks the RT build.
      
      Link: http://lkml.kernel.org/r/20180504100319.11880-1-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e56ee574
    • fs/9p: detect invalid options as much as possible · 478ae0ca
      Chengguang Xu authored
      
      
      Currently, when invalid options are detected during option parsing,
      some options (e.g. msize) just set errno and let parsing continue to
      validate the remaining options, so that as many invalid options as
      possible are detected and proper error messages are given together.
      
      This patch applies the same rule to the options 'cache' and 'access'
      when detecting -EINVAL.
      
      Link: http://lkml.kernel.org/r/1525340676-34072-2-git-send-email-cgxu519@gmx.com
      Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      478ae0ca
    • net/9p: detect invalid options as much as possible · 8d856c72
      Chengguang Xu authored
      
      
      Currently, when invalid options are detected during option parsing,
      some options (e.g. msize) just set errno and let parsing continue to
      validate the remaining options, so that as many invalid options as
      possible are detected and proper error messages are given together.
      
      This patch applies the same rule to the options 'trans' and 'version'
      when detecting -EINVAL.
      
      Link: http://lkml.kernel.org/r/1525340676-34072-1-git-send-email-cgxu519@gmx.com
      Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d856c72
    • fs: ocfs2: use new return type vm_fault_t · c6137fe3
      Souptick Joarder authored
      Use the new return type vm_fault_t for the fault handler.  For now,
      this just documents that the function returns a VM_FAULT value rather
      than an errno.  Once all instances are converted, vm_fault_t will
      become a distinct type.
      
      Ref-> commit 1c8f4220
      
       ("mm: change return type to vm_fault_t")
      
      vmf_error() is the newly introduced inline function in 4.18.
      
      Fix one checkpatch.pl warning by replacing BUG_ON() with WARN_ON()
      
      [akpm@linux-foundation.org: undo BUG_ON->WARN_ON change]
      Link: http://lkml.kernel.org/r/20180523153258.GA28451@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6137fe3
    • ocfs2: drop a VLA in ocfs2_orphan_del() · 64202a21
      Salvatore Mesoraca authored
      
      
      Avoid a VLA by using a real constant expression instead of a variable.
      The compiler should be able to optimize the original code and avoid
      using an actual VLA anyway, but this change is useful because it
      avoids a false positive with -Wvla and might also help the compiler
      generate better code.
      
      Link: http://lkml.kernel.org/r/1520970710-19732-1-git-send-email-s.mesoraca16@gmail.com
      Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64202a21
    • ocfs2: correct the comments position of struct ocfs2_dir_block_trailer · f3797d8a
      Guozhonghua authored
      
      
      Correct the comments position of the structure ocfs2_dir_block_trailer.
      
      Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071C5FDE@H3CMLB12-EX.srv.huawei-3com.com
      Signed-off-by: guozhonghua <guozhonghua@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3797d8a
    • ocfs2: eliminate a misreported warning · 731a40fa
      Zhen Lei authored
      
      
      The warning is spurious because the parameter chunksize, passed from
      ocfs2_info_freefrag_scan_chain-->ocfs2_info_update_ffg, is guaranteed
      to be positive, so __ilog2_u32 cannot return -1.
      
        fs/ocfs2/ioctl.c: In function 'ocfs2_info_update_ffg':
        fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds]
          hist->fc_chunks[index]++;
                         ^
        fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds]
      
      Link: http://lkml.kernel.org/r/1524655799-12112-1-git-send-email-thunder.leizhen@huawei.com
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      731a40fa
    • ocfs2: ocfs2_inode_lock_tracker does not distinguish lock level · 133b81f2
      Larry Chen authored
      
      
      ocfs2_inode_lock_tracker, a variant of ocfs2_inode_lock, is used to
      prevent deadlock due to recursive lock acquisition.
      
      But this function does not distinguish whether the requested level is
      EX or PR.
      
      If a PR lock has been attained, the function will immediately return
      success afterwards even if an EX lock is requested.
      
      But the return value does not actually mean that the process got an EX
      lock, because ocfs2_inode_lock has not been called.
      
      When taking lock levels into account, we face several situations:
      
      1. No lock is held.
         In this case, just lock the inode and return 0.
      
      2. We are already holding a lock.
         This diverges into several cases:
      
         wanted     holding      what to do
         ex         ex           see 2.1 below
         ex         pr           see 2.2 below
         pr         ex           see 2.1 below
         pr         pr           see 2.1 below
      
         2.1 The lock level being held is compatible with the wanted level,
         so no lock action will be taken.
      
         2.2 Otherwise, an upgrade is needed, but it is forbidden.
      
      The reason lock upgrade within a process is forbidden is that it may
      cause deadlock.  The following illustrates how that happens.
      
              process 1                             process 2
      ocfs2_inode_lock_tracker(ex=0)
                                     <======   ocfs2_inode_lock_tracker(ex=1)
      
      ocfs2_inode_lock_tracker(ex=1)
      
      For the status quo of ocfs2, without this patch, neither a bug nor any
      end-user impact is caused, because the wrong logic is avoided.
      
      But I'm afraid this generic interface may be called by other
      developers in the future and used in this situation:
      
        a process
      ocfs2_inode_lock_tracker(ex=0)
      ocfs2_inode_lock_tracker(ex=1)
      
      Link: http://lkml.kernel.org/r/20180510053230.17217-1-lchen@suse.com
      Signed-off-by: Larry Chen <lchen@suse.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      133b81f2