  Jun 08, 2018
    • memcg: introduce memory.min · bf8d5d52
      Roman Gushchin authored
      
      
      Memory controller implements the memory.low best-effort memory
      protection mechanism, which works perfectly in many cases and allows
      protecting working sets of important workloads from sudden reclaim.
      
      But its semantics have a significant limitation: it works only as long
      as there is a supply of reclaimable memory.  This makes it pretty
      useless against any sort of slow memory leak or memory usage increase,
      especially on swapless systems.  If swap is enabled, memory soft
      protection effectively postpones the problem and allows a leaking
      application to fill the whole swap area, which makes no sense.  The
      only effective way to guarantee memory protection in this case is to
      invoke the OOM killer.
      
      It's possible to handle this case in userspace by reacting to MEMCG_LOW
      events; but there is still room for a fail-safe in-kernel mechanism
      to provide stronger guarantees.
      
      This patch introduces the memory.min interface for cgroup v2 memory
      controller.  It works very similarly to memory.low (sharing the same
      hierarchical behavior), except that it's not disabled if there is no
      more reclaimable memory in the system.
      
      If a cgroup is not populated, its memory.min is ignored, because otherwise
      even the OOM killer wouldn't be able to reclaim the protected memory,
      and the system could stall.
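
      As a rough illustration (not part of the patch: the "workload" cgroup
      and the 512M value are made up, and cgroup v2 is assumed to be mounted
      at /sys/fs/cgroup), memory.min is set exactly like memory.low:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                /* hypothetical cgroup, created beforehand with mkdir(1) */
                const char *knob = "/sys/fs/cgroup/workload/memory.min";
                const char *val = "512M";
                int fd = open(knob, O_WRONLY);

                if (fd < 0 || write(fd, val, strlen(val)) < 0) {
                        perror(knob);
                        return 1;
                }
                close(fd);
                /* unlike memory.low, this protection is not dropped when no
                 * reclaimable memory is left; the OOM killer kicks in instead */
                return 0;
        }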
      
      [guro@fb.com: s/low/min/ in docs]
      Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
      Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move is_pageblock_removable_nolock() to mm/memory_hotplug.c · fb52bbae
      Mathieu Malaterre authored
      
      
      is_pageblock_removable_nolock() is not used outside of
      mm/memory_hotplug.c.  Move it next to its unique caller,
      is_mem_section_removable(), and make it static.
      
      Remove the prototype from <linux/memory_hotplug.h> to silence a gcc
      warning (W=1):
      
        mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes]
      
      Link: http://lkml.kernel.org/r/20180509190001.24789-1-malat@debian.org
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: /proc/pid/pagemap: hide swap entries from unprivileged users · ab6ecf24
      Huang Ying authored
      In commit ab676b7d ("pagemap: do not leak physical addresses to
      non-privileged userspace"), /proc/PID/pagemap was restricted to be
      readable only with CAP_SYS_ADMIN to address a security issue.
      
      In commit 1c90308e ("pagemap: hide physical addresses from
      non-privileged users"), the restriction was relaxed to make
      /proc/PID/pagemap readable again, but with the physical addresses
      hidden from non-privileged users.
      
      But swap entries are still readable by non-privileged users, and this
      has similar security implications.  For example, for a page under
      migration, the swap entry carries physical address information.  So,
      with this patch, swap entries are hidden from non-privileged users too.
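
      For reference, userspace reads these entries as one 64-bit word per
      virtual page from /proc/PID/pagemap (layout per
      Documentation/admin-guide/mm/pagemap.rst).  A minimal sketch; without
      CAP_SYS_ADMIN the frame field below now reads as zero for swapped pages
      just like it does for present ones:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <inttypes.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
                long psize = sysconf(_SC_PAGESIZE);
                char *page = malloc(psize);
                uint64_t ent;
                int fd;

                page[0] = 1;    /* make sure the page is populated */
                fd = open("/proc/self/pagemap", O_RDONLY);
                if (fd < 0)
                        return 1;
                /* one 64-bit entry per virtual page */
                pread(fd, &ent, sizeof(ent), (uintptr_t)page / psize * 8);
                printf("present=%d swapped=%d frame=%#" PRIx64 "\n",
                       (int)(ent >> 63 & 1),             /* bit 63: present */
                       (int)(ent >> 62 & 1),             /* bit 62: swapped */
                       ent & ((UINT64_C(1) << 55) - 1)); /* PFN or swap entry */
                free(page);
                close(fd);
                return 0;
        }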
      
      Link: http://lkml.kernel.org/r/20180508012745.7238-1-ying.huang@intel.com
      Fixes: 1c90308e ("pagemap: hide physical addresses from non-privileged users")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: print memblock_remove · 25cf23d7
      Minchan Kim authored
      
      
      A memblock_remove report is useful to see why MemTotal in /proc/meminfo
      differs between two kernels.
      
      Link: http://lkml.kernel.org/r/20180508104223.8028-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: drain memcg stock on force_empty · d12c60f6
      Junaid Shahid authored
      
      
      The per-cpu memcg stock can retain a charge of up to 32 pages.  On a
      machine with a large number of cpus, this can amount to a decent amount
      of memory.  Additionally, the force_empty interface might be triggering
      unneeded memcg reclaims.
      
      Link: http://lkml.kernel.org/r/20180507201651.165879-1-shakeelb@google.com
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: drain stocks on resize limit · bb4a7ea2
      Shakeel Butt authored
      
      
      Resizing the memcg limit for cgroup-v2 drains the stocks before
      triggering the memcg reclaim.  Do the same for cgroup-v1 to make the
      behavior consistent.
      
      Link: http://lkml.kernel.org/r/20180504205548.110696-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: mark memcg1_events static const · 8dd53fd3
      Greg Thelen authored
      
      
      Mark memcg1_events static: it's only used by memcontrol.c.  And mark it
      const: it's not modified.
      
      Link: http://lkml.kernel.org/r/20180503192940.94971-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: writeback: use memcg->cgwb_list directly · 9ccc3617
      Wang Long authored
      
      
      mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
      outside of code under CONFIG_CGROUP_WRITEBACK, so use memcg->cgwb_list
      directly.
      
      Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.com
      Signed-off-by: Wang Long <wanglong19@meituan.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: allow decoding a file handle of an unlinked file · 12ba780d
      Amir Goldstein authored
      
      
      tmpfs uses the helper d_find_alias() to find a dentry from a decoded
      inode, but d_find_alias() skips unhashed dentries, so unlinked files
      cannot be decoded from a file handle.
      
      This can be reproduced using xfstests test program open_by_handle:
      
        $ open_by_handle -c /tmp/testdir
        $ open_by_handle -dk /tmp/testdir
        open_by_handle(/tmp/testdir/file000000) returned 116 incorrectly on an unlinked open file!
      
      To fix this, if d_find_alias() can't find a hashed alias, call
      d_find_any_alias() to return an unhashed one.
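
      The same reproducer can be written directly against the file handle
      syscalls.  A sketch only (error handling trimmed; open_by_handle_at()
      needs CAP_DAC_READ_SEARCH; argv[1] is a file on the tmpfs, argv[2] its
      mount point):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                struct file_handle *fh = malloc(sizeof(*fh) + 128);
                int mount_id, kept, mount_fd, fd;

                if (argc < 3)
                        return 2;
                fh->handle_bytes = 128;
                if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
                        perror("name_to_handle_at");
                        return 1;
                }
                kept = open(argv[1], O_RDONLY);   /* keep the inode alive ... */
                unlink(argv[1]);                  /* ... while the name goes away */

                mount_fd = open(argv[2], O_RDONLY | O_DIRECTORY);
                fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
                if (fd < 0)
                        perror("open_by_handle_at");  /* ESTALE (116) before this fix */
                else
                        printf("decoded unlinked file, fd=%d (kept=%d)\n", fd, kept);
                return 0;
        }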
      
      Link: http://lkml.kernel.org/r/CAOQ4uxg+qSLP0KwdW+h1tcPqOCQd+_pGZVXiePQB1TXCMBMctQ@mail.gmail.com
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ksm: move [set_]page_stable_node from ksm.h to ksm.c · 88484826
      Mike Rapoport authored
      
      
      page_stable_node() and set_page_stable_node() are only used in mm/ksm.c
      and there is no point in keeping them in include/linux/ksm.h.
      
      [akpm@linux-foundation.org: fix SYSFS=n build]
      Link: http://lkml.kernel.org/r/1524552106-7356-3-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ksm: remove unused page_referenced_ksm declaration · 48f49e1f
      Mike Rapoport authored
      Commit 9f32624b ("mm/rmap: use rmap_walk() in page_referenced()")
      removed the declaration of page_referenced_ksm for the case
      CONFIG_KSM=y, but left one for CONFIG_KSM=n.
      
      Remove the unused leftover.
      
      Link: http://lkml.kernel.org/r/1524552106-7356-2-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockdep: fix fs_reclaim annotation · 93781325
      Omar Sandoval authored
      While revisiting my Btrfs swapfile series [1], I introduced a situation
      in which reclaim would lock i_rwsem, and even though the swapon() path
      clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
      complaints from lockdep.  It turns out that the rework of the fs_reclaim
      annotation was broken: if the current task has PF_MEMALLOC set, we don't
      acquire the dummy fs_reclaim lock, but when reclaiming we always check
      this _after_ we've just set the PF_MEMALLOC flag.  In most cases, we can
      fix this by moving the fs_reclaim_{acquire,release}() outside of the
      memalloc_noreclaim_{save,restore}(), although kswapd is slightly
      different.  After applying this, I got the expected lockdep splats.
      
      1: https://lwn.net/Articles/625412/
      
      Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
      Fixes: d92a8cfc ("locking/lockdep: Rework FS_RECLAIM annotation")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shmem: make stat.st_blksize return huge page size if THP is on · 89fdcd26
      Yang Shi authored
      
      
      Since tmpfs gained THP support in 4.8, hugetlbfs is no longer the only
      filesystem with huge page support.  tmpfs can use huge pages via THP
      when mounted with the "huge=" mount option.
      
      When applications use huge pages on hugetlbfs, they just need to check
      the filesystem magic number, but that is not enough for tmpfs.  Make
      stat.st_blksize return the huge page size if the filesystem is mounted
      with an appropriate "huge=" option, to give applications a hint to
      optimize their behavior for THP.
      
      Some applications may not behave wisely with THP.  For example, QEMU
      may mmap a file at a hint address that is not huge page aligned, with
      MAP_FIXED, which results in no pages being PMD-mapped even though THP
      is used.  Some applications may mmap a file with a non huge page
      aligned offset.  Both behaviors make THP pointless.
      
      statfs.f_bsize still returns 4KB for tmpfs since THP could be split,
      and tmpfs also may fall back to 4KB pages silently if there are not
      enough huge pages.  Furthermore, a different f_bsize makes the
      max_blocks and free_blocks calculation harder without much benefit.
      Returning the huge page size via stat.st_blksize sounds good enough.
      
      Since PUD size huge pages are not yet supported by THP, it just returns
      HPAGE_PMD_SIZE for now.
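
      A minimal userspace check (a sketch, not part of the patch; it assumes
      an x86_64 box, where HPAGE_PMD_SIZE is 2MB, and a file living on a
      tmpfs mounted with huge=always):

        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/statfs.h>

        int main(int argc, char **argv)
        {
                struct stat st;
                struct statfs sfs;

                if (argc < 2 || stat(argv[1], &st) || statfs(argv[1], &sfs))
                        return 1;
                /* with this patch st_blksize reports 2MB ... */
                printf("st_blksize = %ld\n", (long)st.st_blksize);
                /* ... while f_bsize stays at the 4KB page size */
                printf("f_bsize    = %ld\n", (long)sfs.f_bsize);
                return 0;
        }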
      
      Hugh said:
      
      : Sorry, I have no enthusiasm for this patch; but do I feel strongly
      : enough to override you and everyone else to NAK it?  No, I don't feel
      : that strongly, maybe st_blksize isn't worth arguing over.
      :
      : We did look at struct stat when designing huge tmpfs, to see if there
      : were any fields that should be adjusted for it; but concluded none.
      : Yes, it would sometimes be nice to have a quickly accessible indicator
      : for when tmpfs has been mounted huge (scanning /proc/mounts for options
      : can be tiresome, agreed); but since tmpfs tries to supply huge (or not)
      : pages transparently, no difference seemed right.
      :
      : So, because st_blksize is a not very useful field of struct stat, with
      : "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead
      : of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge"
      : options (force or always, okay; within_size or advise, not so much).
      : Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize
      : for file system I/O" than PAGE_SIZE was.
      :
      : Which we can expect to speed up some applications and disadvantage
      : others, depending on how they interpret st_blksize: just like if we
      : changed it in the same way on non-huge tmpfs.  (Did I actually try
      : changing st_blksize early on, and find it broke something?  If so, I've
      : now forgotten what, and a search through commit messages didn't find
      : it; but I guess we'll find out soon enough.)
      :
      : If there were an mstat() syscall, returning a field "preferred
      : alignment", then we could certainly agree to put HPAGE_PMD_SIZE in
      : there; but in stat()'s st_blksize?  And what happens when (in future)
      : mm maps this or that hard-disk filesystem's blocks with a pmd mapping -
      : should that filesystem then advertise a bigger st_blksize, despite the
      : same disk layout as before?  What happens with DAX?
      :
      : And this change is not going to help the QEMU suboptimality that
      : brought you here (or does QEMU align mmaps according to st_blksize?).
      : QEMU ought to work well with kernels without this change, and kernels
      : with this change; and I hope it can easily deal with both by avoiding
      : that use of MAP_FIXED which prevented the kernel's intended alignment.
      
      [akpm@linux-foundation.org: remove unneeded `else']
      Link: http://lkml.kernel.org/r/1524665633-83806-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmalloc: pass proper vm_start into debugobjects · 05e3ff95
      Chintan Pandya authored
      
      
      A client can call vunmap with some intermediate 'addr' which may not be
      the start of the VM area.  The entire unmap code works with vm->vm_start,
      which is correct, but the debug object API is called with 'addr'.  This
      could be a problem within debug objects.
      
      Pass proper start address into debug object API.
      
      [akpm@linux-foundation.org: fix warning]
      Link: http://lkml.kernel.org/r/1523961828-9485-3-git-send-email-cpandya@codeaurora.org
      Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmalloc: avoid racy handling of debugobjects in vunmap · f3c01d2f
      Chintan Pandya authored
      
      
      Currently, __vunmap flow is,
       1) Release the VM area
       2) Free the debug objects corresponding to that vm area.
      
      This leaves a race window open:
       1) Release the VM area
       1.5) Some other client gets the same vm area
       1.6) This client allocates new debug objects on the same
            vm area
       2) Free the debug objects corresponding to this vm area.
      
      Here, we actually free the other client's debug objects.
      
      Fix this by freeing the debug objects first and then releasing the VM
      area.
      
      Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org
      Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmalloc: clean up vunmap to avoid pgtable ops twice · 82a2e924
      Chintan Pandya authored
      
      
      vunmap does page table clear operations twice in the case when
      DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled.
      
      So, clean up the code as that is unintended.
      
      As a performance gain, we save a few microseconds.  The ftrace data
      below was obtained while doing 1 MB of vmalloc/vfree on an ARM64 based
      SoC *without* this patch applied.  With this patch, we save ~3 us (one
      extra vunmap_page_range call is avoided).
      
        CPU  DURATION                  FUNCTION CALLS
        |     |   |                     |   |   |   |
       6)               |  __vunmap() {
       6)               |    vmap_debug_free_range() {
       6)   3.281 us    |      vunmap_page_range();
       6) + 45.468 us   |    }
       6)   2.760 us    |    vunmap_page_range();
       6) ! 505.105 us  |  }
      
      [cpandya@codeaurora.org: v3]
        Link: http://lkml.kernel.org/r/1525176960-18408-1-git-send-email-cpandya@codeaurora.org
      Link: http://lkml.kernel.org/r/1523876342-10545-1-git-send-email-cpandya@codeaurora.org
      Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: pass the __highest_present_section_nr + 1 to alloc_func() · 08994b24
      Wei Yang authored
      In commit c4e1be9e ("mm, sparsemem: break out of loops early"),
      __highest_present_section_nr was introduced to reduce the loop count
      when iterating over present sections.  This is also helpful for usemap
      and memmap allocation.
      
      This patch uses __highest_present_section_nr + 1 to optimize the loop.
      
      Link: http://lkml.kernel.org/r/20180326081956.75275-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: check __highest_present_section_nr only for a present section · d538c164
      Wei Yang authored
      
      
      When searching a present section, there are two boundaries:
      
          * __highest_present_section_nr
          * NR_MEM_SECTIONS
      
      It is known that __highest_present_section_nr is a stricter boundary
      than NR_MEM_SECTIONS.  This means it is sufficient to check
      __highest_present_section_nr only.
      
      Link: http://lkml.kernel.org/r/20180326081956.75275-2-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, gup: prevent pmd checking race in follow_pmd_mask() · 68827280
      Huang Ying authored
      
      
      mmap_sem will be read locked when calling follow_pmd_mask().  But this
      cannot prevent PMD from being changed for all cases when PTL is
      unlocked, for example, from pmd_trans_huge() to pmd_none() via
      MADV_DONTNEED.  So it is possible for the pmd_present() check in
      follow_pmd_mask() to encounter an invalid PMD.  This may cause an
      incorrect VM_BUG_ON() or an infinite loop.  Fix this by reading the PMD
      entry into a local variable with READ_ONCE() and checking the local
      variable and pmd_none() in the retry loop.
      
      As Kirill pointed out, with the PTL unlocked, *pmd may be changed under
      us, so reading it directly again and again may incur weird bugs.  So
      although using *pmd directly for things other than the pmd_present()
      check may be safe, it is still better to read *pmd once and check the
      local variable multiple times.
      
      Replacing all uses of *pmd with the local variable while the PTL is
      unlocked was suggested by Kirill.
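
      The pattern itself, shown with a userspace stand-in for the PMD (this is
      not the kernel code; it only illustrates "snapshot once, then test only
      the snapshot"):

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* stand-in for a page-table entry another thread may rewrite */
        static _Atomic unsigned long entry = 0x1UL;     /* "present" */

        static bool inspect(void)
        {
                /* one snapshot -- the kernel uses READ_ONCE(*pmd) for this */
                unsigned long snap = atomic_load_explicit(&entry,
                                                          memory_order_relaxed);

                /* every decision is made on the snapshot, never on a re-read */
                if (snap == 0)
                        return false;           /* "pmd_none": caller retries */
                if (snap & 0x1UL)
                        printf("present entry: %#lx\n", snap);
                else
                        printf("swap/migration entry: %#lx\n", snap);
                return true;
        }

        int main(void)
        {
                return inspect() ? 0 : 1;
        }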
      
      Link: http://lkml.kernel.org/r/20180419083514.1365-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/docs: describe memory.low refinements · 7854207f
      Roman Gushchin authored
      
      
      Refine cgroup v2 docs after latest memory.low changes.
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: treat memory.low value inclusive · 5f93ad67
      Roman Gushchin authored
      
      
      If memcg's usage is equal to the memory.low value, avoid reclaiming from
      this cgroup while there is a surplus of reclaimable memory.
      
      This sounds more logical and also matches memory.high and memory.max
      behavior: both are inclusive.
      
      Empty cgroups are not considered protected, so MEMCG_LOW events are not
      emitted for empty cgroups, if there is no more reclaimable memory in the
      system.
      
      Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memory.low hierarchical behavior · 23067153
      Roman Gushchin authored
      
      
      This patch aims to address an issue in current memory.low semantics,
      which makes it hard to use it in a hierarchy, where some leaf memory
      cgroups are more valuable than others.
      
      For example, there are memcgs A, A/B, A/C, A/D and A/E:
      
        A      A/memory.low = 2G, A/memory.current = 6G
       //\\
      BC  DE   B/memory.low = 3G  B/memory.current = 2G
               C/memory.low = 1G  C/memory.current = 2G
               D/memory.low = 0   D/memory.current = 2G
               E/memory.low = 10G E/memory.current = 0
      
      If we apply memory pressure, B, C and D are reclaimed at the same pace
      while A's usage exceeds 2G.  This is obviously wrong, as B's usage is
      fully below B's memory.low, and C has 1G of protection as well.  Also, A
      is pushed to a size which is less than A's 2G memory.low, which is
      also wrong.
      
      A simple bash script (provided below) can be used to reproduce
      the problem. Current results are:
        A:    1430097920
        A/B:  711929856
        A/C:  717426688
        A/D:  741376
        A/E:  0
      
      To address the issue, a concept of effective memory.low is introduced.
      Effective memory.low is always less than or equal to the original
      memory.low.  When there is no memory.low overcommitment (and also for
      top-level cgroups), these two values are equal.
      
      Otherwise it's a share of the parent's effective memory.low, calculated
      as the cgroup's memory.low usage divided by the sum of the siblings'
      memory.low usages (by memory.low usage I mean the size of actually
      protected memory: memory.current if memory.current < memory.low, 0
      otherwise).  It's necessary to track the actual usage, because otherwise
      an empty cgroup with memory.low set (A/E in my example) would affect the
      actual memory distribution, which makes no sense.  To avoid traversing
      the cgroup tree twice, the page_counters code is reused.
      
      Calculating effective memory.low can be done in the reclaim path, as we
      conveniently traverse the cgroup tree from top to bottom there, checking
      memory.low on each level.  So it's a perfect place to calculate the
      effective memory.low and save it for use by the children cgroups.
      
      This also eliminates a need to traverse the cgroup tree from bottom to
      top each time to check if parent's guarantee is not exceeded.
      
      Setting/resetting effective memory.low is intentionally racy, but it's
      fine and shouldn't lead to any significant differences in actual memory
      distribution.
      
      With this patch applied results are matching the expectations:
        A:    2147930112
        A/B:  1428721664
        A/C:  718393344
        A/D:  815104
        A/E:  0
      
      Test script:
        #!/bin/bash
      
        CGPATH="/sys/fs/cgroup"
      
        truncate /file1 --size 2G
        truncate /file2 --size 2G
        truncate /file3 --size 2G
        truncate /file4 --size 50G
      
        mkdir "${CGPATH}/A"
        echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
        mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
      
        echo 2G > "${CGPATH}/A/memory.low"
        echo 3G > "${CGPATH}/A/B/memory.low"
        echo 1G > "${CGPATH}/A/C/memory.low"
        echo 0 > "${CGPATH}/A/D/memory.low"
        echo 10G > "${CGPATH}/A/E/memory.low"
      
        echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
        echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
        echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
        echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4
      
        echo "A:   " `cat "${CGPATH}/A/memory.current"`
        echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
        echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
        echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
        echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`
      
        rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
        rmdir "${CGPATH}/A"
        rm /file1 /file2 /file3 /file4
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename page_counter's count/limit into usage/max · bbec2e15
      Roman Gushchin authored
      
      
      This patch renames struct page_counter fields:
        count -> usage
        limit -> max
      
      and the corresponding functions:
        page_counter_limit() -> page_counter_set_max()
        mem_cgroup_get_limit() -> mem_cgroup_get_max()
        mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
        memcg_update_kmem_limit() -> memcg_update_kmem_max()
        memcg_update_tcp_limit() -> memcg_update_tcp_max()
      
      The idea behind this renaming is to have a direct match between the
      memory cgroup knobs (low, high, max) and the page_counter API.
      
      This is pure renaming, this patch doesn't bring any functional change.
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: introduce PHYS_ADDR_MAX · 1c4bc43d
      Stefan Agner authored
      
      
      So far the code was using ULLONG_MAX and a typecast to obtain a
      phys_addr_t with all bits set.  The typecast is necessary to silence
      compiler warnings on 32-bit platforms.
      
      Use the simpler but still type safe approach "~(phys_addr_t)0" to create a
      preprocessor define for all bits set.
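
      The idiom, sketched with a stand-in typedef (the real phys_addr_t and
      PHYS_ADDR_MAX live in the kernel headers; this is only an illustration):

        #include <inttypes.h>
        #include <stdio.h>

        /* stand-in: on a 32-bit kernel without LPAE, phys_addr_t is 32 bits
         * wide, so assigning ULLONG_MAX to it may warn about truncation,
         * while ~(type)0 stays warning-free and type safe */
        typedef uint32_t phys_addr_t;

        #define PHYS_ADDR_MAX   (~(phys_addr_t)0)

        int main(void)
        {
                phys_addr_t end = PHYS_ADDR_MAX;        /* all bits set */

                printf("PHYS_ADDR_MAX = %#" PRIx32 "\n", end);
                return 0;
        }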
      
      Link: http://lkml.kernel.org/r/20180406213809.566-1-stefan@agner.ch
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove odd HAVE_PTE_SPECIAL · 00b3a331
      Laurent Dufour authored
      
      
      Remove the additional define HAVE_PTE_SPECIAL and rely directly on
      CONFIG_ARCH_HAS_PTE_SPECIAL.
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523533733-25437-1-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce ARCH_HAS_PTE_SPECIAL · 3010a5ea
      Laurent Dufour authored
      
      
      Currently, PTE special support is turned on in per-architecture header
      files.  Most of the time, it is defined in
      arch/*/include/asm/pgtable.h, sometimes depending on some other
      per-architecture static definition.
      
      This patch introduces a new configuration variable to manage this
      directly in the Kconfig files.  It will later replace
      __HAVE_ARCH_PTE_SPECIAL.
      
      Here are notes for some architectures where the definition of
      __HAVE_ARCH_PTE_SPECIAL is not obvious:
      
      arm
      __HAVE_ARCH_PTE_SPECIAL is currently defined in
      arch/arm/include/asm/pgtable-3level.h, which is included by
      arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
      So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
      
      powerpc
      __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
       - arch/powerpc/include/asm/book3s/64/pgtable.h
       - arch/powerpc/include/asm/pte-common.h
      The first one is included if (PPC_BOOK3S & PPC64) while the second is
      included in all the other cases.
      So select ARCH_HAS_PTE_SPECIAL all the time.
      
      sparc:
      __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
      defined(__arch64__) which are defined through the compiler in
      sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
      So select ARCH_HAS_PTE_SPECIAL if SPARC64
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Albert Ou <albert@sifive.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: remove realsize in free_area_init_core() · e6943859
      Wei Yang authored
      
      
      Highmem's realsize always equals its freesize, so it is not necessary
      to spare a variable to record this.
      
      Link: http://lkml.kernel.org/r/20180413083859.65888-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: restructure memfd code · 5d752600
      Mike Kravetz authored
      
      
      With the addition of memfd hugetlbfs support, we now have the situation
      where memfd depends on TMPFS -or- HUGETLBFS.  Previously, memfd was only
      supported on tmpfs, so it made sense that the code resided in shmem.c.
      In the current code, memfd is only functional if TMPFS is defined.  If
      HUGETLBFS is defined and TMPFS is not defined, then memfd functionality
      will not be available for hugetlbfs.  This does not cause BUGs, just a
      lack of potentially desired functionality.
      
      Code is restructured in the following way:
      - include/linux/memfd.h is a new file containing memfd specific
        definitions previously contained in shmem_fs.h.
      - mm/memfd.c is a new file containing memfd specific code previously
        contained in shmem.c.
      - memfd specific code is removed from shmem_fs.h and shmem.c.
      - A new config option MEMFD_CREATE is added that is defined if TMPFS
        or HUGETLBFS is defined.
      
      No functional changes are made to the code: restructuring only.
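
      For context, the interface being carved out is the memfd_create()
      syscall, which can back the descriptor with either filesystem.  A short
      sketch (it assumes a glibc new enough to provide the wrapper, 2.27+, and
      huge page support on the machine for the MFD_HUGETLB case):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                /* tmpfs-backed memfd */
                int shm_fd = memfd_create("tmpfs-backed", MFD_CLOEXEC);
                /* hugetlbfs-backed memfd -- the case the new MEMFD_CREATE
                 * option keeps working even when TMPFS is not configured */
                int huge_fd = memfd_create("hugetlb-backed",
                                           MFD_CLOEXEC | MFD_HUGETLB);

                if (shm_fd < 0 || huge_fd < 0) {
                        perror("memfd_create");
                        return 1;
                }
                ftruncate(huge_fd, 2UL << 20);  /* one 2MB huge page on x86_64 */
                printf("memfds created: %d and %d\n", shm_fd, huge_fd);
                return 0;
        }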
      
      Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marc-André Lureau <marcandre.lureau@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem: update file sealing comments and file checking · c49fcfcd
      Mike Kravetz authored
      
      
      In preparation for memfd code restructure, update comments, definitions
      and function names dealing with file sealing to indicate that tmpfs and
      hugetlbfs are the supported filesystems.  Also, change file pointer
      checks in memfd_file_seals_ptr to use defined interfaces instead of
      directly referencing file_operation structs.
      
      Link: http://lkml.kernel.org/r/20180415182119.4517-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marc-André Lureau <marcandre.lureau@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem: add __rcu annotations and properly deref radix entry · 5b9c98f3
      Mike Kravetz authored
      
      
      Patch series "restructure memfd code", v4.
      
      This patch (of 3):
      
      In preparation for the memfd code restructure, clean up sparse warnings.
      Most changes required adding __rcu annotations.  The routine
      find_swap_entry was modified to properly dereference radix tree entries.
      
      Link: http://lkml.kernel.org/r/20180415182119.4517-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Marc-André Lureau <marcandre.lureau@gmail.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: introduce zram memory tracking · c0265342
      Minchan Kim authored
      
      
      zram as swap is useful for small-memory devices.  However, swap means
      those pages on zram are mostly cold pages, due to the VM's LRU
      algorithm.  Especially, once an application's init data has been touched
      for launching, it tends not to be accessed any more and is finally
      swapped out.  zram can store such cold pages in compressed form, but
      it's pointless to keep them in memory.  The better idea is for app
      developers to free them directly rather than leaving them on the heap.
      
      This patch tells us the last access time of each block of zram via "cat
      /sys/kernel/debug/zram/zram0/block_state".
      
      The output is as follows,
            300    75.033841 .wh
            301    63.806904 s..
            302    63.806919 ..h
      
      The first column is zram's block index and the third one shows the
      state symbols of the block (s: same page, w: written to backing store,
      h: huge page).  The second column is the time (seconds.microseconds) at
      which the block was last accessed.  So the example above means the 300th
      block was last accessed at 75.033841 seconds, and it was huge so it was
      written to the backing store.
      
      Admins can leverage this information, together with *pagemap*, to catch
      the cold or incompressible pages of a process once part of its heap has
      been swapped out.
      
      I used the feature a few years ago to find memory hoggers in userspace
      and notify them what memory they had wasted without touching it for a
      long time.  With it, they could reduce unnecessary memory usage.
      However, at that time I hacked up zram for the feature, but now I need
      the feature again, so I decided it would be better to upstream it rather
      than keep it to myself.  I hope to submit the userspace tool that uses
      the feature soon.
      
      [akpm@linux-foundation.org: fix i386 printk warning]
      [minchan@kernel.org: use ktime_get_boottime() instead of sched_clock()]
        Link: http://lkml.kernel.org/r/20180420063525.GA253739@rodete-desktop-imager.corp.google.com
      [akpm@linux-foundation.org: documentation tweak]
      [akpm@linux-foundation.org: fix i386 printk warning]
      [minchan@kernel.org: fix compile warning]
        Link: http://lkml.kernel.org/r/20180508104849.GA8209@rodete-desktop-imager.corp.google.com
      [rdunlap@infradead.org: fix printk formats]
        Link: http://lkml.kernel.org/r/3652ccb1-96ef-0b0b-05d1-f661d7733dcc@infradead.org
      Link: http://lkml.kernel.org/r/20180416090946.63057-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: record accessed second · d7eac6b6
      Minchan Kim authored
      
      
      zram as swap is useful for small-memory devices.  However, swap means
      those pages on zram are mostly cold pages, due to the VM's LRU
      algorithm.  Especially, once an application's init data has been touched
      for launching, it tends not to be accessed any more and is finally
      swapped out.  zram can store such cold pages in compressed form, but
      it's pointless to keep them in memory.  The better idea is for app
      developers to free them directly rather than leaving them on the heap.
      
      This patch records the last access time of each block of zram so that,
      with the upcoming zram memory tracking, it can help userspace developers
      reduce memory footprint.
      
      Link: http://lkml.kernel.org/r/20180416090946.63057-4-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: mark incompressible page as ZRAM_HUGE · 89e85bce
      Minchan Kim authored
      
      
      Mark incompressible pages so that we can investigate who owns such pages
      once they are swapped out, using the upcoming zram memory tracker
      feature.
      
      With it, we could prevent such pages from being swapped out by using
      mlock.  Otherwise we might remove them.
      
      This patch exposes a new stat for huge pages via mm_stat.
      
      Link: http://lkml.kernel.org/r/20180416090946.63057-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: correct flag name of ZRAM_ACCESS · c4d6c4cc
      Minchan Kim authored
      
      
      Patch series "zram memory tracking", v5.
      
      zram as swap is useful for small-memory devices.  However, swap means
      those pages on zram are mostly cold pages, due to the VM's LRU
      algorithm.  Especially, once an application's init data has been touched
      for launching, it tends not to be accessed any more and is finally
      swapped out.  zram can store such cold pages in compressed form, but
      it's pointless to keep them in memory.  It's equally pointless to store
      incompressible pages on zram, so the better idea is for app developers
      to manage them directly, e.g. with free or mlock, rather than leaving
      them on the heap.
      
      This patch series provides a debugfs file,
      /sys/kernel/debug/zram/zram0/block_state, representing each block's
      state, so admins can investigate which memory is cold, incompressible,
      or a same-filled page, together with pagemap, once the pages are
      swapped out.
      
      The output is as follows:
            300    75.033841 .wh
            301    63.806904 s..
            302    63.806919 ..h
      
      The first column is zram's block index and the third one shows the
      state symbols of the block (s: same page, w: written to backing store,
      h: huge page).  The second column is the time (seconds.microseconds) at
      which the block was last accessed.  So the example above means the 300th
      block was last accessed at 75.033841 seconds, and it was huge so it was
      written to the backing store.
      
      This patch (of 4):
      
      ZRAM_ACCESS is used for locking a slot of zram, so correct the name.  It
      is also not a common flag indicating the status of the block, so move
      the declaration position above the other flags.  Lastly, let's move the
      function to the top of the source code so that it can be used easily
      without a forward declaration.
      
      Link: http://lkml.kernel.org/r/20180416090946.63057-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcontrol: implement memory.swap.events · f3a53a3a
      Tejun Heo authored
      
      
      Add swap max and fail events so that userland can monitor and respond to
      running out of swap.
      
      I'm not too sure about the fail event.  Right now, it's a bit confusing
      which stats / events are recursive and which aren't, and also which ones
      reflect events that originate from a given cgroup and which ones target
      the cgroup.  No idea what the right long-term solution is, and it could
      just be that growing them organically is actually the only right thing
      to do.
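
      A small monitor-side sketch (the "workload" cgroup name is made up) that
      just dumps the two new counters, "max" and "fail":

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/fs/cgroup/workload/memory.swap.events", "r");
                char key[32];
                unsigned long long val;

                if (!f) {
                        perror("memory.swap.events");
                        return 1;
                }
                /* "max": swap limit was hit, "fail": swap allocation failed */
                while (fscanf(f, "%31s %llu", key, &val) == 2)
                        printf("%s = %llu\n", key, val);
                fclose(f);
                return 0;
        }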
      
      Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcontrol: move swap charge handling into get_swap_page() · bb98f2c5
      Tejun Heo authored
      
      
      Patch series "mm, memcontrol: Implement memory.swap.events", v2.
      
      This patchset implements memory.swap.events which contains max and fail
      events so that userland can monitor and respond to swap running out.
      
      This patch (of 2):
      
      get_swap_page() is always followed by mem_cgroup_try_charge_swap().
      This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
      makes get_swap_page() call the function even after swap allocation
      failure.
      
      This simplifies the callers and consolidates memcg related logic and
      will ease adding swap related memcg events.
      
      Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct · 88aa7cc6
      Yang Shi authored
      
      
      mmap_sem is on the hot path of the kernel, and it is very contended, but
      it is abused too.  It is used to protect arg_start|end and env_start|end
      when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't
      make sense since those proc files just expect to read 4 values
      atomically; they are not related to VM and could be set to arbitrary
      values by C/R.
      
      And, the mmap_sem contention may cause unexpected issue like below:
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
       Call Trace:
         schedule+0x36/0x80
         rwsem_down_read_failed+0xf0/0x150
         call_rwsem_down_read_failed+0x18/0x30
         down_read+0x20/0x40
         proc_pid_cmdline_read+0xd9/0x4e0
         __vfs_read+0x37/0x150
         vfs_read+0x96/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
      for them to mitigate the abuse of mmap_sem.
      
      So, introduce a new spinlock in mm_struct to protect the concurrent
      access to arg_start|end, env_start|end and others, as well as replace
      the write lock on mmap_sem with a read lock, which still protects
      against the race between prctl and sys_brk that might break
      check_data_rlimit(), and makes prctl more friendly to other VM
      operations.
      
      This patch just eliminates the abuse of mmap_sem, but it can't resolve
      the above hung task warning completely, since the later
      access_remote_vm() call still needs to acquire mmap_sem.  The mmap_sem
      scalability issue will be solved in the future.
      
      [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
        Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: clean up the code comment in slab kmem_cache struct · 05fec35e
      Baoquan He authored
      In commit 3b0efdfa ("mm, sl[aou]b: Extract common fields from struct
      kmem_cache"), the variable 'obj_size' was moved above; however, the
      related code comment was not updated accordingly.  Do that here.
      
      Link: http://lkml.kernel.org/r/20180603032402.27526-1-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub: remove obsolete comment · 05088e5d
      Canjiang Lu authored
      The obsolete comment removed in this patch was introduced by commit
      51df1142 ("slub: Dynamically size kmalloc cache allocations").
      
      The related modification from that commit is pasted below:
      
      +#ifdef CONFIG_NUMA
      +       /*
      +        * Allocate kmem_cache_node properly from the kmem_cache slab.
      +        * kmem_cache_node is separately allocated so no need to
      +        * update any list pointers.
      +        */
      +       temp_kmem_cache_node = kmem_cache_node;
      
      +       kmem_cache_node = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
      +       memcpy(kmem_cache_node, temp_kmem_cache_node, kmem_size);
      +
      +       kmem_cache_bootstrap_fixup(kmem_cache_node);
      +
      +       caches++;
      +#else
      +       /*
      +        * kmem_cache has kmem_cache_node embedded and we moved it!
      +        * Update the list heads
      +        */
      +       INIT_LIST_HEAD(&kmem_cache->local_node.partial);
      +       list_splice(&temp_kmem_cache->local_node.partial, &kmem_cache->local_node.partial);
      +#ifdef CONFIG_SLUB_DEBUG
      +       INIT_LIST_HEAD(&kmem_cache->local_node.full);
      +       list_splice(&temp_kmem_cache->local_node.full, &kmem_cache->local_node.full);
      +#endif
      
      As we can see, these comments were used to distinguish the different
      handling between the NUMA and non-NUMA configurations in the original
      commit.  They don't make any sense in the current implementation, where
      the comment sits above kmem_cache_node = bootstrap(&boot_kmem_cache_node);
      so it's better to remove them now.
      
      Link: http://lkml.kernel.org/r/5af26f58.1c69fb81.1be0e.c520SMTPIN_ADDED_BROKEN@mx.google.com
      Signed-off-by: Canjiang Lu <canjiang.lu@samsung.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub.c: add __printf verification to slab_err() · a38965bf
      Mathieu Malaterre authored
      
      
      __printf is useful to verify format and arguments.  Remove the following
      warning (with W=1):
      
        mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format]
      
      Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>