  1. Feb 23, 2017
    • userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor · 9cd75c3c
      Pavel Emelyanov authored
      
      
      The custom events are queued in ctx->event_wqh not to disturb the
      fast-path-ed PF queue-wait-wakeup functions.
      
      The events to be generated (other than PF-s) are requested in UFFD_API
      ioctl with the uffd_api.features bits. Those, known by the kernel, are
      then turned on and reported back to the user-space.
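
      For reference, a minimal user-space sketch of requesting a non-PF event
      via the features bits could look like the following (the use of
      UFFD_FEATURE_EVENT_FORK here is only illustrative):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <linux/userfaultfd.h>

        int main(void)
        {
                int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
                struct uffdio_api api = {
                        .api = UFFD_API,
                        .features = UFFD_FEATURE_EVENT_FORK, /* a non-PF event */
                };

                if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) == -1) {
                        perror("userfaultfd/UFFDIO_API");
                        return 1;
                }
                /* api.features now holds only the bits the kernel turned on */
                printf("enabled features: 0x%llx\n",
                       (unsigned long long)api.features);
                return 0;
        }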
      
      Link: http://lkml.kernel.org/r/20161216144821.5183-7-aarcange@redhat.com
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: non-cooperative: Split the find_userfault() routine · 6dcc27fd
      Pavel Emelyanov authored
      
      
      I will need one to look up userfaultfd_wait_queue-s in a different
      wait queue.
      
      Link: http://lkml.kernel.org/r/20161216144821.5183-6-aarcange@redhat.com
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: use vma_is_anonymous · a94720bf
      Andrea Arcangeli authored
      
      
      Cleanup the vma->vm_ops usage.
      
      Side note: it would be more robust if vma_is_anonymous() also checked
      that vm_flags doesn't have VM_PFNMAP set.
      
      Link: http://lkml.kernel.org/r/20161216144821.5183-5-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: convert BUG() to WARN_ON_ONCE() · 8474901a
      Andrea Arcangeli authored
      
      
      Avoid BUG_ON()s and only WARN instead.  This is just a cleanup, it can't
      make any runtime difference.  This BUG_ON has never triggered and cannot
      trigger.
      
      Link: http://lkml.kernel.org/r/20161216144821.5183-4-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP · a4605a61
      Andrea Arcangeli authored
      
      
      Minor comment correction.
      
      Link: http://lkml.kernel.org/r/20161216144821.5183-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: document _IOR/_IOW · e067eba5
      Andrea Arcangeli authored
      
      
      Patch series "userfaultfd tmpfs/hugetlbfs/non-cooperative", v2
      
      These userfaultfd features are finished and are ready for larger
      exposure in -mm and upstream merging.
      
      1) tmpfs non present userfault
      2) hugetlbfs non present userfault
      3) non cooperative userfault for fork/madvise/mremap
      
      qemu development code is already exercising 2) and container postcopy
      live migration needs 3).
      
      1) is not currently used, but there's a selftest, and we know some qemu
      users for various reasons use tmpfs as backing for KVM, so they'll need
      it too to use postcopy live migration with tmpfs memory.

      All review feedback from the previous submit has been handled and the
      fixes are included.  There's no outstanding issue AFAIK.
      
      Upstream code just did a s/fe/vmf/ conversion in the page faults and
      this has been converted as well incrementally.
      
      In addition to the previous submits, this also wakes up stuck userfaults
      during UFFDIO_UNREGISTER.  The non cooperative testcase actually
      reproduced this problem by getting stuck instead of quitting cleanly in
      some rare cases, as it could call UFFDIO_UNREGISTER while some userfault
      was still in flight.  The other option would have been to keep leaving
      it up to userland to serialize itself and to patch the testcase instead,
      but I think the wakeup during unregister is preferable.
      
      David also asked for the UFFD_FEATURE_MISSING_HUGETLBFS and
      UFFD_FEATURE_MISSING_SHMEM feature flags to be added so QEMU can avoid
      probing whether the hugetlbfs/shmem missing support is available by
      calling UFFDIO_REGISTER.  QEMU already checks HUGETLBFS_MAGIC with
      fstatfs, so if UFFD_FEATURE_MISSING_HUGETLBFS is also set, it knows
      UFFDIO_REGISTER will succeed (or if it fails, it's for some other more
      concerning reason).  There's no reason to worry about adding too many
      feature flags.  There are 64 available, and worst case we'll have to
      bump the API if someday we really run out of them.
      
      The round-trip network latency of hugetlbfs userfaults during postcopy
      live migration is still on the order of a dozen milliseconds on 10GBit
      at 2MB hugepage granularity, so it's working perfectly and it should
      provide higher bandwidth or lower CPU usage (which makes it interesting
      to add an option in the future to support THP granularity too for
      anonymous memory; UFFDIO_COPY would then have to create a THP if
      alignment/len allows for it).  1GB hugetlbfs granularity will require
      big changes in hugetlbfs to work, so it's deferred for later.
      
      This patch (of 42):
      
      This adds proper documentation (inline) to avoid the risk of further
      misunderstandings about the semantics of _IOW/_IOR, and it also reminds
      whoever bumps the UFFDIO_API in the future to change the two ioctls to
      _IOW.

      This was found while implementing strace support for those ioctls;
      otherwise we could never have found it by just reviewing kernel code and
      testing it.
      
      _IOC_READ or _IOC_WRITE alters nothing but the ioctl number itself, so
      it's only worth fixing if the UFFDIO_API is bumped someday.
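
      To illustrate with a made-up ioctl (not the userfaultfd ones): the
      direction macros only encode a read/write flag into the command number,
      while the kernel-side copy_from_user()/copy_to_user() calls are what
      actually move the data.

        #include <linux/ioctl.h>

        struct foo_args { int value; };           /* hypothetical payload */

        #define FOO_GET _IOR('f', 0x01, struct foo_args) /* userspace reads  */
        #define FOO_SET _IOW('f', 0x02, struct foo_args) /* userspace writes */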
      
      Link: http://lkml.kernel.org/r/20161216144821.5183-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom, trace: add compaction retry tracepoint · 65190cff
      Michal Hocko authored
      
      
      Debugging OOMs of higher order requests is currently quite hard.  We do
      have some compaction trace points which can tell us how the compaction
      is operating, but there is no trace point to tell us about the
      compaction retry logic.  This patch adds one which will have the
      following format
      
                  bash-3126  [001] ....  1498.220001: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=0
      
      we can see that the order-9 request is not retried even though we are in
      the highest compaction priority mode, because the last compaction attempt
      was withdrawn.  This means that compaction_zonelist_suitable must have
      returned false: there is no suitable zone to compact for this request
      and so no need to retry further.
      
      another example would be
                 <...>-3137  [001] ....    81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0
      
      in this case the order-9 compaction failed to find any suitable block.
      We do not retry anymore because this is a costly request and those do
      not go below COMPACT_PRIO_SYNC_LIGHT priority.
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom, trace: add oom detection tracepoints · d379f01d
      Michal Hocko authored
      
      
      should_reclaim_retry is the central decision point for declaring the
      OOM.  It might be really useful to expose the data used for this
      decision making when debugging unexpected oom situations.
      
      Say we have an OOM report:
      [   52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
      [   52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G        W       4.8.0-oomtrace3-00006-gb21338b386d2 #1024
      
      Now we can check the tracepoint data to see how we have ended up in this
      situation:
             mem_eater-3148  [003] ....    52.432801: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11134 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433269: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11103 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433712: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11100 min_wmark=11084 no_progress_loops=2 wmark_check=1
             mem_eater-3148  [003] ....    52.434067: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11097 min_wmark=11084 no_progress_loops=3 wmark_check=1
             mem_eater-3148  [003] ....    52.434414: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11094 min_wmark=11084 no_progress_loops=4 wmark_check=1
             mem_eater-3148  [003] ....    52.434761: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11091 min_wmark=11084 no_progress_loops=5 wmark_check=1
             mem_eater-3148  [003] ....    52.435108: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11087 min_wmark=11084 no_progress_loops=6 wmark_check=1
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11084 min_wmark=11084 no_progress_loops=7 wmark_check=0
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA order=0 reclaimable=0 available=1126 min_wmark=179 no_progress_loops=7 wmark_check=0
      
      The above shows that we can quickly deduce that the reclaim stopped
      making any progress (see no_progress_loops increasing in each round) and
      that while there were still some 51 reclaimable pages, they couldn't be
      dropped for some reason (vmscan trace points would tell us more about
      that part).  available represents reclaimable + free_pages scaled
      down per the no_progress_loops factor.  This is essentially an optimistic
      estimate of how much memory we would have when reclaiming everything.
      This can be compared to min_wmark to get a rough idea, but the
      wmark_check tells the result of the watermark check, which is more
      precise (it includes lowmem reserves, considers the order etc.).  As we
      can see, no zone is eligible in the end and that is why we have triggered
      the oom in this situation.
      
      Please note that higher order requests might fail on the wmark_check
      even when there is much more memory available than min_wmark - e.g.
      when the memory is fragmented.  A follow up tracepoint will help to
      debug those situations.
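
      As a rough sketch of the estimate described above (not a verbatim copy
      of should_reclaim_retry(); the helper names are recalled from memory):

        /* available = reclaimable + free, scaled down as no_progress_loops grows */
        unsigned long reclaimable = zone_reclaimable_pages(zone);
        unsigned long available = reclaimable;

        available -= DIV_ROUND_UP(no_progress_loops * available,
                                  MAX_RECLAIM_RETRIES);
        available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

        /* wmark_check is the precise test (lowmem reserves, order, ...) */
        wmark_check = __zone_watermark_ok(zone, order, min_wmark_pages(zone),
                                          ac_classzone_idx(ac), alloc_flags,
                                          available);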
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, trace: extract COMPACTION_STATUS and ZONE_TYPE to a common header · aff28015
      Michal Hocko authored
      
      
      COMPACTION_STATUS and ZONE_TYPE are currently used to translate enum
      compact_result and struct zone index, respectively, into their symbolic
      names for easier post processing.  The follow-up patch would like to
      reuse this as well.  The code involves some preprocessor black magic
      which is better not duplicated elsewhere, so move it to a common mm
      tracing related header.
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: use rb_entry_safe · 4583e773
      Geliang Tang authored
      
      
      Use rb_entry_safe() instead of open-coding it.
      
      Link: http://lkml.kernel.org/r/81bb9820e5b9e4a1c596b3e76f88abf8c4a76cb0.1482221947.git.geliangtang@gmail.com
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: avoid page_to_pfn() when merging buddies · 13ad59df
      Vlastimil Babka authored
      
      
      On architectures that allow memory holes, page_is_buddy() has to perform
      page_to_pfn() to check for the memory hole.  After the previous patch,
      we have the pfn already available in __free_one_page(), which is the
      only caller of page_is_buddy(), so move the check there and avoid
      page_to_pfn().
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: don't convert pfn to idx when merging · 76741e77
      Vlastimil Babka authored
      
      
      In __free_one_page() we do the buddy merging arithmetic on the
      "page/buddy index", which is just the lower MAX_ORDER bits of the pfn.
      The operations we do that affect the higher bits are bitwise AND and
      subtraction (in that order), where the final result will be the same
      with the higher bits left unmasked, as long as these bits are equal for
      both buddies - which must be true by the definition of a buddy.

      We can therefore use pfns directly instead of the "index" and skip the
      zeroing of the >MAX_ORDER bits.  This can help a bit by itself, although
      the compiler might be smart enough already.  It also helps the next
      patch to avoid page_to_pfn() for memory hole checks.
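
      The arithmetic then boils down to something like this sketch
      (illustrative, not the literal diff):

        /* With pfns used directly, the higher bits simply cancel out: */
        unsigned long buddy_pfn    = pfn ^ (1UL << order);  /* flip the order bit  */
        unsigned long combined_pfn = buddy_pfn & pfn;       /* start of merged pair */
        /* no masking of the bits above MAX_ORDER is needed anymore */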
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: throttle show_mem() from warn_alloc() · aa187507
      Michal Hocko authored
      
      
      Tetsuo has been stressing the OOM killer path with many parallel
      allocation requests when he noticed that it is not all that hard to
      swamp the kernel log with warn_alloc messages caused by allocation
      stalls.  Even though the allocation stall message is triggered only once
      in 10s, there might be many different tasks hitting it roughly around
      the same time.
      
      A big part of the output is show_mem(), which can generate a lot of
      output even on small machines.  There is no reason to show the state
      of the memory counters for each allocation stall, especially when
      multiple of them are reported in a short time period.  Chances are that
      not much has changed since the last report.  This patch simply rate
      limits show_mem called from warn_alloc to only dump something once per
      second.  This should be enough to give us a clue why an allocation might
      be stalling while a burst of warnings will not swamp the log with too
      much data.
      
      While we are at it, extract all the show_mem related handling (filters)
      into a separate function warn_alloc_show_mem.  This will make the code
      cleaner and as a bonus point we can distinguish which part of warn_alloc
      got throttled due to rate limiting as ___ratelimit dumps the caller.
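
      A minimal sketch of the throttling, assuming the standard ratelimit
      helpers (not a verbatim copy of the warn_alloc() change, and the
      gfp-based show_mem filters are elided):

        #include <linux/ratelimit.h>

        static void warn_alloc_show_mem(void)
        {
                /* at most one dump per second, shared by all stalling tasks */
                static DEFINE_RATELIMIT_STATE(show_mem_rs, HZ, 1);

                if (!__ratelimit(&show_mem_rs))
                        return; /* ___ratelimit() logs which caller got throttled */

                show_mem(SHOW_MEM_FILTER_NODES);
        }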
      
      [akpm@linux-foundation.org: reduce scope of the ratelimit_states]
      Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: change shmem_mapping() to test shmem_aops · f8005451
      Hugh Dickins authored
      
      
      Callers of shmem_mapping() are interested in whether the mapping is swap
      backed - except for uprobes, which is interested in whether it should
      use shmem_read_mapping_page().  All these callers are better served by a
      shmem_mapping() which checks for shmem_aops, than the current version
      which goes through several indirections to find where the inode lives -
      and has the surprising effect that a private mmap of /dev/zero satisfies
      both vma_is_anonymous() and shmem_mapping(), when that device node is on
      devtmpfs.  I don't think anything in the tree suffers from that
      surprise, but it caught me out, and is better fixed.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052148530.13021@eggly.anvils
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: make sysfs directories for memcg sub-caches optional · 1663f26d
      Tejun Heo authored
      
      
      SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
      bunch of debug files.  Usually, there aren't that many caches on a
      system and this doesn't really matter; however, if memcg is in use, each
      cache can have per-cgroup sub-caches.  SLUB creates the same directories
      for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.
      
      Unfortunately, because there can be a lot of cgroups, active or
      draining, the product of the numbers of caches, cgroups and files in
      each directory can reach a very high number - hundreds of thousands is
      commonplace.  Millions and beyond aren't difficult to reach either.
      
      What's under /sys/kernel/slab is primarily for debugging and the
      information and control on a root cache already cover its sub-caches.
      While having a separate directory for each sub-cache can be helpful for
      development, it doesn't make much sense to pay this amount of overhead
      by default.
      
      This patch introduces a boot parameter slub_memcg_sysfs which determines
      whether to create sysfs directories for per-memcg sub-caches.  It also
      adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
      default value and defaults to 0.
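
      A hedged sketch of how such a boot parameter is typically wired up (the
      variable and helper names here are illustrative, not necessarily those
      used by the patch):

        static bool memcg_sysfs_enabled = IS_ENABLED(CONFIG_SLUB_MEMCG_SYSFS_ON);

        static int __init setup_slub_memcg_sysfs(char *str)
        {
                int v;

                if (get_option(&str, &v) > 0)
                        memcg_sysfs_enabled = v;
                return 1;
        }
        __setup("slub_memcg_sysfs=", setup_slub_memcg_sysfs);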
      
      [akpm@linux-foundation.org: kset_unregister(NULL) is legal]
      Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: use memcg_kmem_cache_wq for slab destruction operations · 17cc4dfe
      Tejun Heo authored
      
      
      If there's contention on slab_mutex, queueing the per-cache destruction
      work item on the system_wq can unnecessarily create and tie up a lot of
      kworkers.
      
      Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq, make it
      global, and use that workqueue for the destruction work items too.
      While at it, convert the workqueue from an unbound workqueue to a
      per-cpu one with concurrency limited to 1.  It's generally preferable to
      use per-cpu workqueues, and a concurrency limit of 1 is safe enough.
      
      This is suggested by Joonsoo Kim.
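
      For reference, a bound (per-cpu) workqueue with max_active == 1 is what
      alloc_workqueue() gives without WQ_UNBOUND; a minimal sketch, with the
      init hook name assumed:

        struct workqueue_struct *memcg_kmem_cache_wq;

        static void __init memcg_kmem_cache_wq_init(void)
        {
                /* flags == 0: per-cpu; max_active == 1: one work item at a time */
                memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);
                BUG_ON(!memcg_kmem_cache_wq);
        }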
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: remove slub sysfs interface files early for empty memcg caches · 50862ce7
      Tejun Heo authored
      
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      Each cache has a number of sysfs interface files under /sys/kernel/slab.
      On a system with a lot of memory and transient memcgs, the number of
      interface files which have to be removed once memory reclaim kicks in
      can reach millions.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-10-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: remove synchronous synchronize_sched() from memcg cache deactivation path · 01fb58bc
      Tejun Heo authored
      
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slub uses synchronize_sched() to deactivate a memcg cache.
      synchronize_sched() is an expensive and slow operation and doesn't scale
      when a huge number of caches are destroyed back-to-back.  While there
      used to be a simple batching mechanism, the batching was too restricted
      to be helpful.
      
      This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
      can use to schedule a sched RCU callback instead of performing
      synchronize_sched() synchronously while holding cgroup_mutex.  While
      this adds online cpus, mems and slab_mutex operations, operating on
      these locks back-to-back from the same kworker, which is what is going
      to happen when there are many to deactivate, isn't expensive at all, and
      this gets rid of the scalability problem completely.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: introduce __kmemcg_cache_deactivate() · c9fc5864
      Tejun Heo authored
      
      
      __kmem_cache_shrink() is called with %true @deactivate only for memcg
      caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
      __kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
      should implement it and it should deactivate and drain the cache.
      
      This is to allow memcg cache deactivation behavior to further deviate
      from simple shrinking without messing up __kmem_cache_shrink().
      
      This is pure reorganization and doesn't introduce any observable
      behavior changes.
      
      v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: implement slab_root_caches list · 510ded33
      Tejun Heo authored
      
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slab_caches currently lists all caches including root and memcg ones.
      This is the only data structure which lists the root caches and
      iterating root caches can only be done by walking the list while
      skipping over memcg caches.  As there can be a huge number of memcg
      caches, this can become very expensive.
      
      This also can make /proc/slabinfo behave very badly.  seq_file processes
      reads in 4k chunks and seeks to the previous Nth position on slab_caches
      list to resume after each chunk.  With a lot of memcg cache churns on
      the list, reading /proc/slabinfo can become very slow and its content
      often ends up with duplicate and/or missing entries.
      
      This patch adds a new list slab_root_caches which lists only the root
      caches.  When memcg is not enabled, it becomes just an alias of
      slab_caches.  memcg specific list operations are collected into
      memcg_[un]link_cache().
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: link memcg kmem_caches on their associated memory cgroup · bc2791f8
      Tejun Heo authored
      
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      While a memcg kmem_cache is listed on its root cache's ->children list,
      there is no direct way to iterate all kmem_caches which are associated
      with a memory cgroup.  The only way to iterate them is walking all
      caches while filtering out caches which don't match, which would be most
      of them.
      
      This makes memcg destruction operations O(N^2) where N is the total
      number of slab caches which can be huge.  This combined with the
      synchronous RCU operations can tie up a CPU and affect the whole machine
      for many hours when memory reclaim triggers offlining and destruction of
      the stale memcgs.
      
      This patch adds mem_cgroup->kmem_caches list which goes through
      memcg_cache_params->kmem_caches_node of all kmem_caches which are
      associated with the memcg.  All memcg specific iterations, including
      stat file access, are updated to use the new list instead.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: reorganize memcg_cache_params · 9eeadc8b
      Tejun Heo authored
      
      
      We're going to change how memcg caches are iterated.  In preparation,
      clean up and reorganize memcg_cache_params.
      
      * The shared ->list is replaced by ->children in root and
        ->children_node in children.
      
      * ->is_root_cache is removed.  Instead ->root_cache is moved out of
        the child union and now used by both root and children.  NULL
        indicates root cache.  Non-NULL a memcg one.
      
      This patch doesn't cause any observable behavior changes.
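
      A hedged sketch of the reorganized layout described above (field types
      are abridged, not copied verbatim from the patch):

        struct memcg_cache_params {
                struct kmem_cache *root_cache;  /* NULL for a root cache */
                union {
                        struct {                /* used by root caches */
                                struct memcg_cache_array __rcu *memcg_caches;
                                struct list_head children;
                        };
                        struct {                /* used by memcg (child) caches */
                                struct mem_cgroup *memcg;
                                struct list_head children_node;
                        };
                };
        };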
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: remove synchronous rcu_barrier() call in memcg cache release path · 657dc2f9
      Tejun Heo authored
      
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      SLAB_DESTROY_BY_RCU caches need to flush all RCU operations before
      destruction because slab pages are freed through RCU and they need to be
      able to dereference the associated kmem_cache.  Currently, it's done
      synchronously with rcu_barrier().  As rcu_barrier() is expensive
      time-wise, slab implements a batching mechanism so that rcu_barrier()
      can be done for multiple caches at the same time.
      
      Unfortunately, the rcu_barrier() is in synchronous path which is called
      while holding cgroup_mutex and the batching is too limited to be
      actually helpful.
      
      This patch updates the cache release path so that the batching is
      asynchronous and global.  All SLAB_DESTROY_BY_RCU caches are queued
      globally and a work item consumes the list.  The work item calls
      rcu_barrier() only once for all caches that are currently queued.
      
      * release_caches() is removed and shutdown_cache() now either directly
        releases the cache or schedules an RCU callback to do that.  This
        makes the cache inaccessible once shutdown_cache() is called and
        makes it impossible for shutdown_memcg_caches() to do memcg-specific
        cleanups afterwards.  Move the memcg-specific part into a helper,
        unlink_memcg_cache(), and make shutdown_cache() call it directly.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-4-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: separate out sysfs_slab_release() from sysfs_slab_remove() · bf5eb3de
      Tejun Heo authored
      
      
      Separate out slub sysfs removal and release, and call the former earlier
      from __kmem_cache_shutdown().  There's no reason to defer sysfs removal
      through RCU and this will later allow us to remove sysfs files way
      earlier during memory cgroup offline instead of release.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "slub: move synchronize_sched out of slab_mutex on shrink" · 290b6a58
      Tejun Heo authored
      Patch series "slab: make memcg slab destruction scalable", v3.
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.
      
      I've seen machines which end up with hundreds of thousands of caches and
      many millions of kernfs_nodes.  The current code is O(N^2) on the total
      number of caches and has synchronous rcu_barrier() and
      synchronize_sched() in cgroup offline / release path which is executed
      while holding cgroup_mutex.  Combined, this leads to very expensive and
      slow cache destruction operations which can easily keep running for half
      a day.
      
      This also messes up /proc/slabinfo along with other cache iterating
      operations.  seq_file operates on 4k chunks and on each 4k boundary
      tries to seek to the last position in the list.  With a huge number of
      caches on the list, this becomes very slow and very prone to the list
      content changing underneath it leading to a lot of missing and/or
      duplicate entries.
      
      This patchset addresses the scalability problem.
      
      * Add root and per-memcg lists.  Update each user to use the
        appropriate list.
      
      * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
        and asynchronous.
      
      * For dying empty slub caches, remove the sysfs files after
        deactivation so that we don't end up with millions of sysfs files
        without any useful information on them.
      
      This patchset contains the following ten patches.
      
       0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
       0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
       0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
       0004-slab-reorganize-memcg_cache_params.patch
       0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
       0006-slab-implement-slab_root_caches-list.patch
       0007-slab-introduce-__kmemcg_cache_deactivate.patch
       0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
       0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
       0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch
      
      0001 reverts an existing optimization to prepare for the following
      changes.  0002 is a prep patch.  0003 makes rcu_barrier() in release
      path batched and asynchronous.  0004-0006 separate out the lists.
      0007-0008 replace synchronize_sched() in slub destruction path with
      call_rcu_sched().  0009 removes sysfs files early for empty dying
      caches.  0010 makes destruction work items use a workqueue with limited
      concurrency.
      
      This patch (of 10):
      
      Revert commit 89e364db ("slub: move synchronize_sched out of slab_mutex
      on shrink").
      
      With kmem cgroup support enabled, kmem_caches can be created and destroyed
      frequently and a great number of near empty kmem_caches can accumulate if
      there are a lot of transient cgroups and the system is not under memory
      pressure.  When memory reclaim starts under such conditions, it can lead
      to consecutive deactivation and destruction of many kmem_caches, easily
      hundreds of thousands on moderately large systems, exposing scalability
      issues in the current slab management code.  This is one of the patches to
      address the issue.
      
      Moving synchronize_sched() out of slab_mutex isn't enough as it's still
      inside cgroup_mutex.  The whole deactivation / release path will be
      updated to avoid all synchronous RCU operations.  Revert this insufficient
      optimization in preparation to ease future changes.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slab: rename kmalloc-node cache to kmalloc-<size> · af3b5f87
      Vlastimil Babka authored
      
      
      SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
      the kmem_cache_node management structure, and puts it into the generic
      kmalloc cache array (e.g. for 128b objects).  The name of this cache is
      "kmalloc-node", which is confusing for readers of /proc/slabinfo as the
      cache is used for generic allocations (and not just the kmem_cache_node
      struct) and it appears as if the kmalloc-128 cache is missing.
      
      An easy solution is to use the kmalloc-<size> name when pre-creating the
      cache, which we can get from the kmalloc_info array.
      
      Example /proc/slabinfo before the patch:
      
        ...
        kmalloc-256         1647   1984    256   16    1 : tunables  120   60    8 : slabdata    124    124    828
        kmalloc-192         1974   1974    192   21    1 : tunables  120   60    8 : slabdata     94     94    133
        kmalloc-96          1332   1344    128   32    1 : tunables  120   60    8 : slabdata     42     42    219
        kmalloc-64          2505   5952     64   64    1 : tunables  120   60    8 : slabdata     93     93    715
        kmalloc-32          4278   4464     32  124    1 : tunables  120   60    8 : slabdata     36     36    346
        kmalloc-node        1352   1376    128   32    1 : tunables  120   60    8 : slabdata     43     43     53
        kmem_cache           132    147    192   21    1 : tunables  120   60    8 : slabdata      7      7      0
      
      After the patch:
      
        ...
        kmalloc-256         1672   2160    256   16    1 : tunables  120   60    8 : slabdata    135    135    807
        kmalloc-192         1992   2016    192   21    1 : tunables  120   60    8 : slabdata     96     96    203
        kmalloc-96          1159   1184    128   32    1 : tunables  120   60    8 : slabdata     37     37    116
        kmalloc-64          2561   4864     64   64    1 : tunables  120   60    8 : slabdata     76     76    785
        kmalloc-32          4253   4340     32  124    1 : tunables  120   60    8 : slabdata     35     35    270
        kmalloc-128         1256   1280    128   32    1 : tunables  120   60    8 : slabdata     40     40     39
        kmem_cache           125    147    192   21    1 : tunables  120   60    8 : slabdata      7      7      0
      
      [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
        Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
      Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub: add a dump_stack() to the unexpected GFP check · 65b9de75
      Borislav Petkov authored
      
      
      We wish to know who is doing such a thing. slab.c does this.
      
      Link: http://lkml.kernel.org/r/20170116091643.15260-1-bp@alien8.de
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: do not merge cache if slub_debug contains a never-merge flag · c6e28895
      Grygorii Maistrenko authored
      
      
      In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from
      commandline but never checks if there are features from the
      SLAB_NEVER_MERGE set.
      
      As a result, caches selected by slub_debug are always mergeable if they
      have been created without a custom constructor set or without one of the
      SLAB_* debug features on.

      This moves the SLAB_NEVER_MERGE check below the flags update from the
      command line to make sure it won't merge the slab cache if one of the
      debug features is on.
      
      Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d
      Signed-off-by: Grygorii Maistrenko <grygoriimkd@gmail.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/watchdog.c: do not hardcode CPU 0 as the initial thread · 8dcde9de
      Prarit Bhargava authored
      
      
      When CONFIG_BOOTPARAM_HOTPLUG_CPU0 is enabled, the socket containing the
      boot cpu can be replaced.  During the hot add event, the message
      
      NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
      
      is output implying that the NMI watchdog was disabled at some point.  This
      is not the case and the message has caused confusion for users of systems
      that support the removal of the boot cpu socket.
      
      The watchdog code is coded to assume that cpu 0 is always the first cpu to
      initialize the watchdog, and the last to stop its watchdog thread.  That
      is not the case for initializing if cpu 0 has been removed and added.  The
      removal case has never been correct because the smpboot code will remove
      the watchdog threads starting with the lowest cpu number.
      
      This patch adds watchdog_cpus to track the number of cpus with active NMI
      watchdog threads so that the first and last thread can be used to set and
      clear the value of firstcpu_err.  firstcpu_err is set when the first
      watchdog thread is enabled, and cleared when the last watchdog thread is
      disabled.
      
      Link: http://lkml.kernel.org/r/1480425321-32296-1-git-send-email-prarit@redhat.com
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Joshua Hunt <johunt@akamai.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Babu Moger <babu.moger@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • 9p: fix a potential acl leak · b5c66bab
      Cong Wang authored
      
      
      posix_acl_update_mode() could possibly clear 'acl', if so we leak the
      memory pointed by 'acl'.  Save this pointer before calling
      posix_acl_update_mode() and release the memory if 'acl' really gets
      cleared.
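
      The fix follows the usual pattern for posix_acl_update_mode() callers; a
      hedged sketch (not the literal 9p hunk):

        struct posix_acl *old_acl = acl;
        int retval;

        retval = posix_acl_update_mode(inode, &inode->i_mode, &acl);
        if (retval)
                return retval;
        if (!acl)
                posix_acl_release(old_acl); /* mode-only change: drop the old ACL */
        /* ... otherwise continue updating the acl xattr as before ... */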
      
      Link: http://lkml.kernel.org/r/1486678332-2430-1-git-send-email-xiyou.wangcong@gmail.com
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: Mark Salyzyn <salyzyn@android.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kurz <groug@kaod.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: use for_each_thread() in sys_ioprio_set()/sys_ioprio_get() · 612dafab
      Tetsuo Handa authored
      The IOPRIO_WHO_USER case in sys_ioprio_set()/sys_ioprio_get() is using
      while_each_thread(), which is unsafe under RCU lock according to commit
      0c740d0a ("introduce for_each_thread() to replace the buggy
      while_each_thread()").  Use for_each_thread() (via
      for_each_process_thread()), which is safe under RCU lock.
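
      A hedged sketch of the conversion pattern for the IOPRIO_WHO_USER case
      (surrounding details elided):

        rcu_read_lock();
        for_each_process_thread(g, p) {
                if (!uid_eq(task_uid(p), uid))
                        continue;
                /* set or read the io priority of thread p here */
        }
        rcu_read_unlock();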
      
      Link: http://lkml.kernel.org/r/201702011947.DBD56740.OMVHOLOtSJFFFQ@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/1486041779-4401-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • parisc: use generic current.h · 6302666d
      Davidlohr Bueso authored
      
      
      Given that the arch does not add its own implementations, simply use the
      asm-generic/current.h (generic-y) header instead of duplicating code.
      
      Link: http://lkml.kernel.org/r/1485992878-4780-4-git-send-email-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix deadlock issue when taking inode lock at vfs entry points · b891fa50
      Eric Ren authored
      Commit 743b5f14 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
      results in a deadlock, as the author "Tariq Saeed" realized shortly
      after the patch was merged.  The discussion happened here
      
        https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html
      
      The reason why taking cluster inode lock at vfs entry points opens up a
      self deadlock window, is explained in the previous patch of this series.
      
      So far, we have seen two different code paths that have this issue.
      
      1. do_sys_open
           may_open
            inode_permission
             ocfs2_permission
              ocfs2_inode_lock() <=== take PR
               generic_permission
                get_acl
                 ocfs2_iop_get_acl
                  ocfs2_inode_lock() <=== take PR
      
      2. fchmod|fchmodat
          chmod_common
           notify_change
            ocfs2_setattr <=== take EX
             posix_acl_chmod
              get_acl
               ocfs2_iop_get_acl <=== take PR
              ocfs2_iop_set_acl <=== take EX
      
      Fix them by adding the tracking logic (introduced in the previous
      patch) to the functions above: ocfs2_permission(),
      ocfs2_iop_[set|get]_acl(), and ocfs2_setattr().
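      
      As a rough illustration of the resulting shape (helper and field names
      are borrowed from the companion dlmglue patch or assumed for the sketch;
      this is not the exact ocfs2 code), the ACL getter only takes the cluster
      lock when the current process does not already hold it:
      
        /*
         * Sketch of the guard pattern: skip ocfs2_inode_lock() if this
         * process already holds the cluster lock, so the
         * ocfs2_permission() -> get_acl() nesting no longer self-deadlocks.
         * ocfs2_is_locked_by_me()'s real signature may differ; it is used
         * here as a simple predicate.  ocfs2_get_acl_nolock() stands in for
         * "read the ACL without taking the cluster lock again".
         * Relevant headers/files: <linux/posix_acl.h>, <linux/buffer_head.h>,
         * fs/ocfs2/dlmglue.h.
         */
        static struct posix_acl *ocfs2_iop_get_acl_sketch(struct inode *inode,
                                                          int type)
        {
            struct ocfs2_lock_res *lockres = &OCFS2_I(inode)->ip_inode_lockres;
            struct buffer_head *bh = NULL;
            struct posix_acl *acl;
            int had_lock = ocfs2_is_locked_by_me(lockres);
            int status;
        
            if (!had_lock) {
                status = ocfs2_inode_lock(inode, &bh, 0);   /* PR, first holder */
                if (status)
                    return ERR_PTR(status);
            }
            /*
             * The real fix passes OCFS2_META_LOCK_GETBH so the inode bh is
             * still returned when the lock is already held; elided here.
             */
            acl = ocfs2_get_acl_nolock(inode, type, bh);
        
            if (!had_lock)
                ocfs2_inode_unlock(inode, 0);
            brelse(bh);
            return acl;
        }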
      
      Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com
      Signed-off-by: default avatarEric Ren <zren@suse.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b891fa50
    • Eric Ren's avatar
      ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock · 439a36b8
      Eric Ren authored
      
      
      We are in a situation where we have to avoid recursive cluster locking,
      but there is no way to check whether a cluster lock has already been
      taken by a process.
      
      Mostly, we can avoid recursive locking by writing code carefully.
      However, we found that it's very hard to handle the routines that are
      invoked directly by vfs code.  For instance:
      
        const struct inode_operations ocfs2_file_iops = {
            .permission     = ocfs2_permission,
            .get_acl        = ocfs2_iop_get_acl,
            .set_acl        = ocfs2_iop_set_acl,
        };
      
      Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
      
        do_sys_open
         may_open
          inode_permission
           ocfs2_permission
            ocfs2_inode_lock() <=== first time
             generic_permission
              get_acl
               ocfs2_iop_get_acl
                ocfs2_inode_lock() <=== recursive one
      
      A deadlock will occur if a remote EX request comes in between the two
      ocfs2_inode_lock() calls.  Briefly, the deadlock forms as follows:
      
      On one hand, the OCFS2_LOCK_BLOCKED flag of this lockres is set in the
      BAST (ocfs2_generic_handle_bast) when a downconvert is started on
      behalf of the remote EX lock request.  On the other hand, the recursive
      cluster lock (the second one) will be blocked in __ocfs2_cluster_lock()
      because of OCFS2_LOCK_BLOCKED.  But the downconvert never completes,
      because there is no chance for the first cluster lock on this node to
      be unlocked - we block ourselves in the code path.
      
      The idea to fix this issue is mostly taken from gfs2 code.
      
      1. introduce a new field, struct ocfs2_lock_res.l_holders, to keep
         track of the pids of the processes that have taken the cluster lock
         on this lock resource;
      
      2. introduce a new flag for ocfs2_inode_lock_full:
         OCFS2_META_LOCK_GETBH; it means just get back the on-disk inode bh
         for us if we have already got the cluster lock;
      
      3. export a helper: ocfs2_is_locked_by_me() is used to check whether we
         have already taken the cluster lock in the upper code path.
      
      The tracking logic should be used by some of the ocfs2 vfs callbacks to
      solve the recursive locking issue caused by the fact that vfs routines
      can call into each other.
      
      The performance penalty of processing the holder list should only be
      seen in the few cases where the tracking logic is used, such as
      get/set acl.
      
      You may ask: what if the first time we take a PR lock, and the second
      time we want an EX lock?  Fortunately, as far as I can see, this case
      never happens in the real world for the paths involved (permission
      check, (get|set)_(acl|attr)), and the gfs2 code makes the same
      assumption.
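      
      A minimal sketch of the pid-based holder tracking described above
      (structure layout and signatures are assumptions for illustration, not
      the exact dlmglue interfaces):
      
        #include <linux/list.h>
        #include <linux/spinlock.h>
        #include <linux/pid.h>
        #include <linux/sched.h>
        
        struct ocfs2_lock_holder {
            struct list_head oh_list;
            struct pid *oh_owner_pid;    /* process holding the cluster lock */
        };
        
        /* Assumes lockres->l_holders is a list_head protected by l_lock. */
        static void ocfs2_add_holder(struct ocfs2_lock_res *lockres,
                                     struct ocfs2_lock_holder *oh)
        {
            spin_lock(&lockres->l_lock);
            oh->oh_owner_pid = get_pid(task_pid(current));
            list_add_tail(&oh->oh_list, &lockres->l_holders);
            spin_unlock(&lockres->l_lock);
        }
        
        static void ocfs2_remove_holder(struct ocfs2_lock_res *lockres,
                                        struct ocfs2_lock_holder *oh)
        {
            spin_lock(&lockres->l_lock);
            list_del(&oh->oh_list);
            spin_unlock(&lockres->l_lock);
            put_pid(oh->oh_owner_pid);
        }
        
        /* Returns non-zero if the current process already holds the lock. */
        static int ocfs2_is_locked_by_me(struct ocfs2_lock_res *lockres)
        {
            struct pid *pid = task_pid(current);
            struct ocfs2_lock_holder *oh;
            int found = 0;
        
            spin_lock(&lockres->l_lock);
            list_for_each_entry(oh, &lockres->l_holders, oh_list) {
                if (oh->oh_owner_pid == pid) {
                    found = 1;
                    break;
                }
            }
            spin_unlock(&lockres->l_lock);
            return found;
        }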
      
      [sfr@canb.auug.org.au: remove some inlines]
      Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
      Signed-off-by: default avatarEric Ren <zren@suse.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      439a36b8
    • Davidlohr Bueso's avatar
      score: remove asm/current.h · ca376b37
      Davidlohr Bueso authored
      
      
      ...  it's already using the generic version anyway, so just drop the
      file, as do the other archs that do not implement their own version of
      the current macro.
      
      Link: http://lkml.kernel.org/r/1485992878-4780-5-git-send-email-dave@stgolabs.net
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca376b37
    • Sudip Mukherjee's avatar
      m32r: fix build warning · af0de781
      Sudip Mukherjee authored
      Some m32r builds were producing a warning:
      
        arch/m32r/include/asm/cmpxchg.h:191:3: warning: value computed is not used
        arch/m32r/include/asm/cmpxchg.h:68:3: warning: value computed is not used
      
      Taking the idea from commit e001bbae ("ARM: cmpxchg: avoid warnings
      from macro-ized cmpxchg() implementations"), the m32r implementation is
      changed to use a similar construct with a compound expression instead
      of a typecast, which stops the compiler from complaining about an
      unused result.
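      
      The general shape of the change (a generic before/after illustration of
      the cast-vs-statement-expression technique; __cmpxchg() and the macro
      names are placeholders, not the exact m32r definitions):
      
        /* Before: the cast form is a bare expression, so callers that ignore
         * its value trip "value computed is not used" on some gcc versions. */
        #define cmpxchg_cast(ptr, o, n)                                       \
            ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o),         \
                                           (unsigned long)(n), sizeof(*(ptr))))
        
        /* After: a GCC statement expression; the last statement supplies the
         * macro's value, and discarding that value no longer warns. */
        #define cmpxchg_stmt(ptr, o, n)                                       \
        ({                                                                    \
            __typeof__(*(ptr)) __ret;                                         \
            __ret = (__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o),  \
                                                  (unsigned long)(n),         \
                                                  sizeof(*(ptr)));            \
            __ret;                                                            \
        })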
      
      Link: http://lkml.kernel.org/r/1484432664-7015-1-git-send-email-sudipm.mukherjee@gmail.com
      Signed-off-by: default avatarSudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af0de781
    • Davidlohr Bueso's avatar
      m32r: use generic current.h · 6408b6fb
      Davidlohr Bueso authored
      
      
      Given that the arch does not add its own implementations, simply
      use the asm-generic/current.h (generic-y) header instead of
      duplicating code.
      
      Link: http://lkml.kernel.org/r/1482896994-25863-1-git-send-email-dave@stgolabs.net
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6408b6fb
    • Hou Tao's avatar
      scripts/tags.sh: include arch/Kconfig* for tags generation · 7659c655
      Hou Tao authored
      
      
      Kconfig files under the arch/ directory are ignored by all_kconfigs(),
      so include them for tags generation.
      
      Link: http://lkml.kernel.org/r/1486206053-38223-1-git-send-email-houtao1@huawei.com
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Mathieu Maret <mathieu.maret@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7659c655
    • Cheah Kok Cheong's avatar
      scripts/checkincludes.pl: add exit message for no duplicates found · 8087a560
      Cheah Kok Cheong authored
      
      
      If no duplicates are found, inform the user.
      
      Link: http://lkml.kernel.org/r/1486391275-2843-1-git-send-email-thrust73@gmail.com
      Signed-off-by: default avatarCheah Kok Cheong <thrust73@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8087a560
    • Tobias Klauser's avatar
      scripts/checkstack.pl: add support for nios2 · 0e5a47a8
      Tobias Klauser authored
      
      
      Adjust checkstack.pl for the nios2 architecture.
      
      Link: http://lkml.kernel.org/r/20170116113052.15034-1-tklauser@distanz.ch
      Signed-off-by: default avatarTobias Klauser <tklauser@distanz.ch>
      Cc: Ley Foon Tan <lftan@altera.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e5a47a8