  1. Aug 08, 2020
    • mm: memcg/slab: charge individual slab objects instead of pages · f2fe7b09
      Roman Gushchin authored
      
      
      Switch to per-object accounting of non-root slab objects.
      
      Charging is performed using the obj_cgroup API in the pre_alloc hook.
      The obj_cgroup is charged with the size of the object plus the size of
      the per-object metadata, which is currently the size of an obj_cgroup
      pointer.  If the amount of memory has been charged successfully, the
      actual allocation code is executed.  Otherwise, -ENOMEM is returned.
      
      In the post_alloc hook, if the actual allocation succeeded, the
      corresponding vmstats are bumped and the obj_cgroup pointer is saved.
      Otherwise, the charge is canceled.

      On the free path the obj_cgroup pointer is obtained and used to uncharge
      the size of the object being released.

      Memcg and lruvec counters now represent only the memory used by active
      slab objects and do not include the free space.  The free space is shared
      and doesn't belong to any specific cgroup.
      
      Global per-node slab vmstats are still modified from
      (un)charge_slab_page() functions.  The idea is to keep all slab pages
      accounted as slab pages on system level.
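
      A minimal sketch of the pre_alloc side described above (helper names such
      as obj_full_size() and memcg_slab_pre_alloc_hook() are assumed here; the
      body is illustrative, not the exact upstream code):

        static inline size_t obj_full_size(struct kmem_cache *s)
        {
                /* object size plus the per-object obj_cgroup pointer metadata */
                return s->size + sizeof(struct obj_cgroup *);
        }

        static inline struct obj_cgroup *
        memcg_slab_pre_alloc_hook(struct kmem_cache *s, size_t objects, gfp_t flags)
        {
                struct obj_cgroup *objcg = get_obj_cgroup_from_current();

                if (!objcg)
                        return NULL;    /* root cgroup / accounting disabled */

                if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
                        obj_cgroup_put(objcg);
                        /* the caller translates a failed charge into -ENOMEM;
                         * the real code distinguishes "not accounted" from
                         * "charge failed" */
                        return NULL;
                }
                return objcg;
        }

      On the free path the stored pointer is looked up and
      obj_cgroup_uncharge(objcg, obj_full_size(s)) undoes the charge.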
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2fe7b09
    • mm: memcg/slab: save obj_cgroup for non-root slab objects · 964d4bd3
      Roman Gushchin authored
      
      
      Store the obj_cgroup pointer in the corresponding place of
      page->obj_cgroups for each allocated non-root slab object.  Make sure that
      each allocated object holds a reference to obj_cgroup.
      
      The objcg pointer is obtained by dereferencing memcg->objcg in
      memcg_kmem_get_cache() and is passed from the pre_alloc hook to the
      post_alloc hook.  In case of successful allocation(s) it is then stored
      in the page->obj_cgroups vector.

      The objcg-obtaining part looks a bit bulky now, but it will be simplified
      by the next commits in the series.
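
      A sketch of the post_alloc side: the obj_cgroup pointer obtained above is
      recorded per object and each object takes its own reference
      (page_obj_cgroups() is an assumed accessor for the page->obj_cgroups
      vector):

        static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
                                                      struct obj_cgroup *objcg,
                                                      size_t size, void **p)
        {
                size_t i;

                for (i = 0; i < size; i++) {
                        struct page *page;
                        unsigned int off;

                        if (!p[i])
                                continue;

                        page = virt_to_head_page(p[i]);
                        off = obj_to_index(s, page, p[i]);

                        obj_cgroup_get(objcg);          /* one reference per object */
                        page_obj_cgroups(page)[off] = objcg;
                }
                obj_cgroup_put(objcg);  /* reference taken in the pre_alloc hook */
        }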
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964d4bd3
    • mm: memcg/slab: allocate obj_cgroups for non-root slab pages · 286e04b8
      Roman Gushchin authored
      
      
      Allocate and release memory to store obj_cgroup pointers for each non-root
      slab page. Reuse page->mem_cgroup pointer to store a pointer to the
      allocated space.
      
      This commit temporarily increases the memory footprint of the kernel
      memory accounting: to store the obj_cgroup pointers we need room for one
      pointer per allocated object.  However, the following patches in the
      series will enable sharing of slab pages between memory cgroups, which
      will dramatically increase the total slab utilization, and the final
      memory footprint will be significantly smaller than before.
      
      To distinguish between obj_cgroups and memcg pointers in cases where it's
      not obvious which one is used (as in page_cgroup_ino()), let's always set
      the lowest bit in the obj_cgroup case.  The original (untagged)
      obj_cgroups pointer is marked to be ignored by kmemleak, which would
      otherwise report a memory leak for each allocated vector.
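
      A sketch of the vector allocation and the low-bit tagging
      (objs_per_slab_page() is an assumed helper; the real code also allocates
      on the right NUMA node):

        static inline int memcg_alloc_page_obj_cgroups(struct page *page,
                                                       struct kmem_cache *s, gfp_t gfp)
        {
                unsigned int objects = objs_per_slab_page(s, page);
                void *vec;

                vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
                if (!vec)
                        return -ENOMEM;

                kmemleak_not_leak(vec); /* only reachable through page->mem_cgroup */
                page->mem_cgroup = (struct mem_cgroup *)((unsigned long)vec | 0x1UL);
                return 0;
        }

        /* The lowest bit tells obj_cgroups vectors and plain memcg pointers apart. */
        static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
        {
                return (struct obj_cgroup **)((unsigned long)page->mem_cgroup & ~0x1UL);
        }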
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      286e04b8
    • mm: memcg/slab: obj_cgroup API · bf4f0599
      Roman Gushchin authored
      
      
      The obj_cgroup API provides the ability to account sub-page-sized kernel
      objects, which potentially outlive the original memory cgroup.
      
      The top-level API consists of the following functions:
        bool obj_cgroup_tryget(struct obj_cgroup *objcg);
        void obj_cgroup_get(struct obj_cgroup *objcg);
        void obj_cgroup_put(struct obj_cgroup *objcg);
      
        int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
        void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
      
        struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
        struct obj_cgroup *get_obj_cgroup_from_current(void);
      
      An object cgroup is basically a pointer to a memory cgroup with a per-cpu
      reference counter.  It substitutes for a memory cgroup in places where
      it's necessary to charge a custom number of bytes instead of pages.

      All charged memory, rounded down to whole pages, is charged to the
      corresponding memory cgroup using __memcg_kmem_charge().
      
      It implements reparenting: on memcg offlining the object cgroup is
      reattached to the parent memory cgroup.  Each online memory cgroup has an
      associated active object cgroup that handles new allocations, and a list
      of all attached object cgroups.  On offlining of a cgroup this list is
      reparented and, for each object cgroup in the list, the memcg pointer is
      swapped to the parent memory cgroup.  This prevents long-living objects
      from pinning the original memory cgroup in memory.

      The implementation is based on byte-sized per-cpu stocks.  A sub-page
      sized leftover is stored in an atomic field that is part of the
      obj_cgroup object, so on cgroup offlining the leftover is automatically
      reparented.

      memcg->objcg is RCU-protected.  objcg->memcg is a raw pointer that always
      points at a memory cgroup but can be atomically swapped to the parent
      memory cgroup, so a user must ensure the lifetime of the cgroup, e.g. by
      grabbing rcu_read_lock or css_set_lock.
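
      Roughly, the structure behind this API and a typical caller pattern look
      like the following (a sketch based on the description above, not the
      verbatim upstream definitions):

        struct obj_cgroup {
                struct percpu_ref refcnt;
                struct mem_cgroup *memcg;       /* swapped to the parent on offlining */
                atomic_t nr_charged_bytes;      /* sub-page leftover, follows the objcg */
                union {
                        struct list_head list;  /* entry in memcg->objcg_list */
                        struct rcu_head rcu;
                };
        };

        static struct obj_cgroup *example_charge_bytes(size_t size)
        {
                struct obj_cgroup *objcg = get_obj_cgroup_from_current();

                if (!objcg)
                        return NULL;            /* nothing to account */

                if (obj_cgroup_charge(objcg, GFP_KERNEL, size)) {
                        obj_cgroup_put(objcg);  /* charge failed */
                        return NULL;
                }
                /* pair with obj_cgroup_uncharge() + obj_cgroup_put() on release */
                return objcg;
        }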
      
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-7-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf4f0599
    • mm: memcontrol: decouple reference counting from page accounting · 1a3e1f40
      Johannes Weiner authored
      The reference counting of a memcg is currently coupled directly to how
      many 4k pages are charged to it.  This doesn't work well with Roman's new
      slab controller, which maintains pools of objects and doesn't want to keep
      an extra balance sheet for the pages backing those objects.
      
      This unusual refcounting design (reference counts usually track pointers
      to an object) is only for historical reasons: memcg used to not take any
      css references and simply stalled offlining until all charges had been
      reparented and the page counters had dropped to zero.  When we got rid of
      the reparenting requirement, the simple mechanical translation was to take
      a reference for every charge.
      
      More historical context can be found in commit e8ea14cc ("mm:
      memcontrol: take a css reference for each charged page"), commit
      64f21993 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
      commit b2052564 ("mm: memcontrol: continue cache reclaim from offlined
      groups").
      
      The new slab controller exposes the limitations in this scheme, so let's
      switch it to a more idiomatic reference counting model based on actual
      kernel pointers to the memcg (see the sketch after this list):
      
      - The per-cpu stock holds a reference to the memcg it's caching
      
      - User pages hold a reference for their page->mem_cgroup. Transparent
        huge pages will no longer acquire tail references in advance, we'll
        get them if needed during the split.
      
      - Kernel pages hold a reference for their page->mem_cgroup
      
      - Pages allocated in the root cgroup will acquire and release css
        references for simplicity. css_get() and css_put() optimize that.
      
      - The current memcg_charge_slab() already hacked around the per-charge
        references; this change gets rid of that as well.
      
      - tcp accounting will handle the reference in mem_cgroup_sk_{alloc,free}
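
      A minimal sketch of the pointer-based model for user pages (illustrative
      only; commit_charge() is a mainline name, clear_page_memcg() here is a
      made-up name for the uncharge side, and both real paths carry more logic):

        static void commit_charge(struct page *page, struct mem_cgroup *memcg)
        {
                css_get(&memcg->css);           /* one reference per page->mem_cgroup */
                page->mem_cgroup = memcg;
        }

        static void clear_page_memcg(struct page *page)
        {
                struct mem_cgroup *memcg = page->mem_cgroup;

                page->mem_cgroup = NULL;
                css_put(&memcg->css);           /* dropped when the pointer goes away */
        }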
      
      Roman:
      1) Rebased on top of the current mm tree: added css_get() in
         mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
      2) I've reformatted commit references in the commit log to make
         checkpatch.pl happy.
      
      [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a3e1f40
    • mm: slub: implement SLUB version of obj_to_index() · 4138fdfc
      Roman Gushchin authored
      This commit implements the SLUB version of the obj_to_index() function,
      which will be required to calculate the offset of an obj_cgroup in the
      obj_cgroups vector in order to store/obtain the objcg ownership data.

      To make it faster, let's repeat SLAB's trick introduced by commit
      6a2d7a95 ("SLAB: use a multiply instead of a divide in obj_to_index()")
      and avoid an expensive division.
      
      Vlastimil Babka noticed that SLUB already has a similar function called
      slab_index(), which is defined only if SLUB_DEBUG is enabled.  That
      function does similar math, but with a division, and it also takes a page
      address instead of a page pointer.
      
      Let's remove slab_index() and replace it with the new helper
      __obj_to_index(), which takes a page address.  obj_to_index() will be a
      simple wrapper taking a page pointer and passing page_address(page) into
      __obj_to_index().
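
      A sketch of the resulting helpers (assuming s->reciprocal_size is
      precomputed with reciprocal_value(s->size) at cache creation time, see
      <linux/reciprocal_div.h>):

        static inline unsigned int __obj_to_index(const struct kmem_cache *s,
                                                  void *addr, void *obj)
        {
                /* (obj - addr) / s->size, done with a multiply and shifts */
                return reciprocal_divide(obj - addr, s->reciprocal_size);
        }

        static inline unsigned int obj_to_index(const struct kmem_cache *s,
                                                const struct page *page, void *obj)
        {
                return __obj_to_index(s, page_address(page), obj);
        }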
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-5-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4138fdfc
    • mm: memcg: convert vmstat slab counters to bytes · d42f3245
      Roman Gushchin authored
      
      
      In order to prepare for per-object slab memory accounting, convert
      NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
      
      To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
      NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
      
      Internally, global and per-node counters are stored in pages, while memcg
      and lruvec counters are stored in bytes.  This scheme may look odd, but
      only for now.  As soon as slab pages are shared between multiple cgroups,
      global and node counters will reflect the total number of slab pages.
      However, memcg and lruvec counters will be used for per-memcg slab memory
      tracking, which accounts individual kernel objects.  Keeping global and
      node counters in pages helps to avoid additional overhead.

      The size of slab memory shouldn't exceed 4 GB on 32-bit machines, so it
      will fit into the atomic_long_t we use for vmstats.
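
      A sketch of how a whole slab page is accounted at the global/per-node
      level under the new names (cache_vmstat_idx() is the assumed name of the
      reclaimable/unreclaimable selector; the node-level helper converts the
      byte value back to pages internally):

        static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
        {
                return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
                        NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
        }

        static void account_slab_page(struct page *page, int order, struct kmem_cache *s)
        {
                /* bytes in the interface, whole pages in the node counter */
                mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
                                    PAGE_SIZE << order);
        }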
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d42f3245
    • mm: memcg: prepare for byte-sized vmstat items · ea426c2a
      Roman Gushchin authored
      
      
      To implement per-object slab memory accounting, we need to convert slab
      vmstat counters to bytes.  Actually, out of the four levels of counters
      (global, per-node, per-memcg and per-lruvec), only the last two will
      require byte-sized counters.  This is because global and per-node
      counters will count the number of slab pages, while per-memcg and
      per-lruvec counters will count the amount of memory taken by charged
      slab objects.
      
      Converting all vmstat counters to bytes or even all slab counters to bytes
      would introduce an additional overhead.  So instead let's store global and
      per-node counters in pages, and memcg and lruvec counters in bytes.
      
      To make the API clean all access helpers (both on the read and write
      sides) are dealing with bytes.
      
      To avoid back-and-forth conversions a new flavor of read-side helpers is
      introduced, which always returns values in pages: node_page_state_pages()
      and global_node_page_state_pages().
      
      In fact, the new helpers just read the raw values.  The old helpers are
      simple wrappers, which will complain on an attempt to read a byte-sized
      value, because at the moment no one actually needs bytes.
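
      A sketch of the read-side split (bodies simplified; the real helpers also
      clamp negative per-cpu diffs under SMP):

        /* raw value: global/node counters are already stored in pages */
        unsigned long global_node_page_state_pages(enum node_stat_item item)
        {
                long x = atomic_long_read(&vm_node_stat[item]);

                return x < 0 ? 0 : x;
        }

        /* old-style wrapper: complains if asked for a byte-sized item */
        unsigned long global_node_page_state(enum node_stat_item item)
        {
                VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));

                return global_node_page_state_pages(item);
        }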
      
      Thanks to Johannes Weiner for the idea of having the byte-sized API on top
      of the page-sized internal storage.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea426c2a
    • mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() · eedc4e5a
      Roman Gushchin authored
      
      
      Patch series "The new cgroup slab memory controller", v7.
      
      The patchset moves the accounting from the page level to the object
      level.  It allows sharing slab pages between memory cgroups.  This leads
      to a significant win in slab utilization (up to 45%) and a corresponding
      drop in the total kernel memory footprint.  The reduced number of
      unmovable slab pages should also have a positive effect on memory
      fragmentation.
      
      The patchset makes the slab accounting code simpler: there is no more
      need for the complicated dynamic creation and destruction of per-cgroup
      slab caches; all memory cgroups use a global set of shared slab caches.
      The lifetime of slab caches is no longer tied to the lifetime of memory
      cgroups.
      
      The more precise accounting does require more CPU, but in practice the
      difference seems to be negligible.  We've been using the new slab
      controller in Facebook production for several months with different
      workloads and haven't seen any noticeable regressions.  What we have seen
      are memory savings on the order of 1 GB per host (varying heavily with
      the actual workload, size of RAM, number of CPUs, memory pressure, etc.).
      
      The third version of the patchset added yet another step towards the
      simplification of the code: sharing of slab caches between accounted and
      non-accounted allocations.  It comes with significant upsides (most
      noticeably, a complete elimination of dynamic slab cache creation) but
      not without some regression risk, so this change sits on top of the
      patchset and is not completely merged in.  In the unlikely event of a
      noticeable performance regression it can be reverted separately.
      
      The slab memory accounting works in exactly the same way for SLAB and
      SLUB.  With both allocators the new controller shows significant memory
      savings; with SLUB the difference is bigger.  On my 16-core desktop
      machine running Fedora 32, the size of the slab memory measured after
      system start was lower by 58% with SLUB and by 38% with SLAB.
      
      As an estimate of the potential CPU overhead, below are the results of
      the slab_bulk_test01 test, kindly provided by Jesper D. Brouer, who also
      helped with the evaluation of the results.

      The test can be found here: https://github.com/netoptimizer/prototype-kernel/
      The smallest number in each row should be used for the comparison.
      
      SLUB-patched - bulk-API
       - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
      
      SLUB-original -  bulk-API
       - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)
      
      SLAB-patched -  bulk-API
       - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)
      
      SLAB-original-  bulk-API
       - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)
      
      This patch (of 19):
      
      To convert memcg and lruvec slab counters to bytes there must be a way to
      change these counters without touching node counters.  Factor
      __mod_memcg_lruvec_state() out of __mod_lruvec_state().
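
      A sketch of the factoring (close to the shape of the real code, but
      simplified):

        void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
                                int val)
        {
                /* global and per-node counter */
                __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

                /* per-memcg and per-lruvec counters, now a separate helper */
                if (!mem_cgroup_disabled())
                        __mod_memcg_lruvec_state(lruvec, idx, val);
        }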
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eedc4e5a
    • mm: kmem: make memcg_kmem_enabled() irreversible · d648bcc7
      Roman Gushchin authored
      
      
      Historically, kernel memory accounting was an opt-in feature which could
      be enabled for individual cgroups.  This is no longer true: it's on by
      default on both cgroup v1 and cgroup v2, and as long as a user has at
      least one non-root memory cgroup, kernel memory accounting is on.  So in
      most setups it's either always on (if memory cgroups are in use and kmem
      accounting is not disabled) or always off (otherwise).
      
      memcg_kmem_enabled() is used in many places to guard the kernel memory
      accounting code.  If memcg_kmem_enabled() can reverse from returning true
      to returning false (as now), we can't rely on it on release paths and have
      to check if it was on before.
      
      If we make memcg_kmem_enabled() irreversible (always returning true once
      it has returned true for the first time), the general logic becomes
      simpler and more robust.  It also allows guarding some checks which would
      otherwise stay unguarded.
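
      A sketch of the one-way switch (simplified; the real memcg_online_kmem()
      does more work):

        DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);

        static inline bool memcg_kmem_enabled(void)
        {
                return static_branch_likely(&memcg_kmem_enabled_key);
        }

        static void example_enable_kmem_accounting(void)
        {
                /* enabled once, never disabled again on offlining or freeing */
                static_branch_enable(&memcg_kmem_enabled_key);
        }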
      
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200702180926.1330769-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d648bcc7
    • tmpfs: support 64-bit inums per-sb · ea3271f7
      Chris Down authored
      
      
      The default is still set to inode32 for backwards compatibility, but
      system administrators can opt in to the new 64-bit inode numbers by
      either:
      
      1. Passing inode64 on the command line when mounting, or
      2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
      
      The inode64 and inode32 names are used based on existing precedent from
      XFS.
      
      [hughd@google.com: Kconfig fixes]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea3271f7
    • tmpfs: per-superblock i_ino support · e809d5f0
      Chris Down authored
      
      
      Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
      
      In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.  On
      affected tiers, in excess of 10% of hosts show multiple files with
      different content and the same inode number, with some servers even having
      as many as 150 duplicated inode numbers with differing file content.
      
      This causes actual, tangible problems in production.  For example, we have
      complaints from those working on remote caches that their application is
      reporting cache corruptions because it uses (device, inodenum) to
      establish the identity of a particular cache object, but because it's not
      unique any more, the application refuses to continue and reports cache
      corruption.  Even worse, sometimes applications may not even detect the
      corruption but may continue anyway, causing phantom and hard to debug
      behaviour.
      
      In general, userspace applications expect that (device, inodenum) should
      be enough to uniquely point to one inode, which seems fair enough.  One
      might also need to check the generation, but in this case:
      
      1. That's not currently exposed to userspace
         (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
      2. Even with generation, there shouldn't be two live inodes with the
         same inode number on one device.
      
      In order to mitigate this, we take a two-pronged approach:
      
      1. Moving inum generation from being global to per-sb for tmpfs. This
         itself allows some reduction in i_ino churn. This works on both 64-
         and 32- bit machines.
      2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
         64-bit ino_t only: we allow users to mount tmpfs with a new inode64
         option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
      
      You can see how this compares to previous related patches which didn't
      implement this per-superblock:
      
      - https://patchwork.kernel.org/patch/11254001/
      - https://patchwork.kernel.org/patch/11023915/
      
      This patch (of 2):
      
      get_next_ino has a number of problems:
      
      - It uses and returns a uint, which is susceptible to overflow if a
        lot of volatile inodes that use get_next_ino are created.
      - It's global, with no specificity per-sb or even per-filesystem. This
        means it's not that difficult to cause inode number wraparounds on a
        single device, which can result in having multiple distinct inodes
        with the same inode number.
      
      This patch adds a per-superblock counter that mitigates the second case.
      This design also allows us to later have a specific i_ino size per-device,
      for example, allowing users to choose whether to use 32- or 64-bit inodes
      for each tmpfs mount.  This is implemented in the next commit.
      
      For internal shmem mounts which may be less tolerant to spinlock delays,
      we implement a percpu batching scheme which only takes the stat_lock at
      each batch boundary.
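
      A simplified sketch of the batching (SHMEM_INO_BATCH, sbinfo->next_ino
      and the per-cpu sbinfo->ino_batch follow the description; the real
      shmem_reserve_inode() differs in detail):

        #define SHMEM_INO_BATCH 1024ULL

        static ino_t shmem_next_ino(struct shmem_sb_info *sbinfo)
        {
                ino_t *next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
                ino_t ino = *next_ino;

                if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
                        /* batch exhausted: reserve a fresh block under stat_lock */
                        spin_lock(&sbinfo->stat_lock);
                        ino = sbinfo->next_ino;
                        sbinfo->next_ino += SHMEM_INO_BATCH;
                        spin_unlock(&sbinfo->stat_lock);
                }
                *next_ino = ino + 1;
                put_cpu();
                return ino;
        }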
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e809d5f0
    • mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io · 0f190a7a
      Xianting Tian authored
      
      
      swap_readpage() does synchronous IO for one page.  The IO is not big and
      normally finishes quickly, but it may take a long time or wait forever in
      case of IO failure or discard.

      This patch uses blk_io_schedule() instead of io_schedule() to avoid a
      task hang and crash (when /proc/sys/kernel/hung_task_panic is set) when
      the above exception occurs.
      
      This is similar to the hung task avoidance in submit_bio_wait(),
      blk_execute_rq() and __blkdev_direct_IO().
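
      A sketch of the changed wait loop in swap_readpage() (loop structure
      simplified):

        set_current_state(TASK_UNINTERRUPTIBLE);
        while (!READ_ONCE(bio->bi_private)) {
                /* was io_schedule(); blk_io_schedule() sleeps in chunks bounded
                 * by the hung-task timeout, so the detector never fires */
                blk_io_schedule();
                set_current_state(TASK_UNINTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);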
      
      Signed-off-by: Xianting Tian <xianting_tian@126.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/1596461807-21087-1-git-send-email-xianting_tian@126.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f190a7a
    • mm: swap: fix kerneldoc of swap_vma_readahead() · 27ec4878
      Krzysztof Kozlowski authored
      
      
      Fix W=1 compile warnings (invalid kerneldoc):
      
          mm/swap_state.c:742: warning: Function parameter or member 'fentry' not described in 'swap_vma_readahead'
          mm/swap_state.c:742: warning: Excess function parameter 'entry' description in 'swap_vma_readahead'
      
      Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200728171109.28687-2-krzk@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ec4878
    • mm/swap_slots.c: remove redundant check for swap_slot_cache_initialized · e0f3ebba
      Zhen Lei authored
      
      
      swap_slot_cache_enabled can only become true in enable_swap_slots_cache(),
      and only after swap_slot_cache_initialized is already true.  That means
      that whenever swap_slot_cache_enabled is true, swap_slot_cache_initialized
      is true as well.

      So the condition
      "swap_slot_cache_enabled && swap_slot_cache_initialized"
      can be reduced to "swap_slot_cache_enabled".

      And, by De Morgan's law,
      "!swap_slot_cache_enabled || !swap_slot_cache_initialized"
      is equal to "!(swap_slot_cache_enabled && swap_slot_cache_initialized)".

      So there is no functional change.
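
      Illustrated as a generic before/after (use_the_cache() just stands in for
      the real body):

        /* before */
        if (swap_slot_cache_enabled && swap_slot_cache_initialized)
                use_the_cache();

        /* after: the second test is implied by the first */
        if (swap_slot_cache_enabled)
                use_the_cache();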
      
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200430061143.450-4-thunder.leizhen@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0f3ebba
    • mm/swap_slots.c: simplify enable_swap_slots_cache() · d69a9575
      Zhen Lei authored
      
      
      Whether swap_slot_cache_initialized is true or false,
      __reenable_swap_slots_cache() is always called.  To make this clear,
      leave only one call to __reenable_swap_slots_cache().  This also makes it
      clearer what extra work needs to be done when swap_slot_cache_initialized
      is false.
      
      No functional change.
      
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200430061143.450-3-thunder.leizhen@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d69a9575
    • mm/swap_slots.c: simplify alloc_swap_slot_cache() · f90eae2a
      Zhen Lei authored
      
      
      Patch series "clean up some functions in mm/swap_slots.c".
      
      While studying the code of mm/swap_slots.c, I found some places that can
      be improved.
      
      This patch (of 3):
      
      Both "slots" and "slots_ret" only need to be freed when the cache has
      already been allocated.  Moving the two frees closer together makes this
      clearer.
      
      No functional change.
      
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200430061143.450-1-thunder.leizhen@huawei.com
      Link: http://lkml.kernel.org/r/20200430061143.450-2-thunder.leizhen@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f90eae2a
    • mm/gup.c: fix the comment of return value for populate_vma_page_range() · 0a36f7f8
      Tang Yizhou authored
      
      
      The return value of populate_vma_page_range() is consistent with
      __get_user_pages(), so make the comment describing the return value
      consistent with it as well.
      
      Signed-off-by: Tang Yizhou <tangyizhou@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Link: http://lkml.kernel.org/r/20200720034303.29920-1-tangyizhou@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a36f7f8
    • mm: filemap: add missing FGP_ flags in kerneldoc comment for pagecache_get_page · 605cad83
      Yang Shi authored
      
      
      FGP_{WRITE|NOFS|NOWAIT} were missed in pagecache_get_page's kerneldoc
      comment.
      
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Gang Deng <gavin.dg@linux.alibaba.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/1593031747-4249-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      605cad83
    • mm: filemap: clear idle flag for writes · b9306a79
      Yang Shi authored
      Since commit bbddabe2 ("mm: filemap: only do access activations on
      reads"), mark_page_accessed() is called for reads only.  But the idle
      flag is cleared by mark_page_accessed(), so the flag won't get cleared if
      the page is only write-accessed.

      Idle page tracking is used to estimate the working-set size of a
      workload; a noticeable part of the working set might be missed if the
      idle flag is not maintained correctly.

      It seems good enough to just clear the idle flag for write operations.

      Fixes: bbddabe2 ("mm: filemap: only do access activations on reads")
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/1593020612-13051-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9306a79
    • mm, dump_page: do not crash with bad compound_mapcount() · 6dc5ea16
      John Hubbard authored
      
      
      If a compound page is being split while dump_page() is being run on that
      page, we can end up calling compound_mapcount() on a page that is no
      longer compound.  This leads to a crash (already seen at least once in the
      field), due to the VM_BUG_ON_PAGE() assertion inside compound_mapcount().
      
      (The above is from Matthew Wilcox's analysis of Qian Cai's bug report.)
      
      A similar problem is possible, via compound_pincount() instead of
      compound_mapcount().
      
      In order to avoid this kind of crash, make dump_page() slightly more
      robust, by providing a pair of simpler routines that don't contain
      assertions: head_mapcount() and head_pincount().
      
      For debug tools, we don't want to go *too* far in this direction, but this
      is a simple small fix, and the crash has already been seen, so it's a good
      trade-off.
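
      The shape of the assertion-free helpers (a sketch consistent with the
      description; they read straight from what is presumed to be a head page):

        static inline int head_mapcount(struct page *head)
        {
                /* like compound_mapcount(), minus the VM_BUG_ON_PAGE() check */
                return atomic_read(compound_mapcount_ptr(head)) + 1;
        }

        static inline int head_pincount(struct page *head)
        {
                return atomic_read(compound_pincount_ptr(head));
        }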
      
      Reported-by: Qian Cai <cai@lca.pw>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Link: http://lkml.kernel.org/r/20200804214807.169256-1-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dc5ea16
    • mm/debug: print hashed address of struct page · 54a75157
      Matthew Wilcox (Oracle) authored
      
      
      The actual address of the struct page isn't particularly helpful, while
      the hashed address helps match with other messages elsewhere.  Add the PFN
      that the page refers to in order to help diagnose problems where the page
      is improperly aligned for the purpose.
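
      A sketch of the resulting header line; plain %p prints a hashed kernel
      pointer, and the PFN is added explicitly (format roughly matching the
      sample dump shown further down this page):

        pr_warn("page:%p refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n",
                page, page_ref_count(page), page_mapcount(page),
                page->mapping, page->index, page_to_pfn(page));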
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Link: http://lkml.kernel.org/r/20200709202117.7216-7-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54a75157
    • mm/debug: print the inode number in dump_page · 9bdaf2cc
      Matthew Wilcox (Oracle) authored
      
      
      The inode number helps correlate this page with debug messages elsewhere
      in the kernel.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Link: http://lkml.kernel.org/r/20200709202117.7216-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bdaf2cc
    • mm/debug: switch dump_page to get_kernel_nofault · 9ad38265
      Matthew Wilcox (Oracle) authored
      
      
      This is simpler to use than copy_from_kernel_nofault().  Also make some of
      the related error messages less verbose.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Link: http://lkml.kernel.org/r/20200709202117.7216-5-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ad38265
    • mm/debug: print head flags in dump_page · 0b93d59e
      Matthew Wilcox (Oracle) authored
      
      
      Tail page flags contain very little useful information.  Print the head
      page's flags instead.  While the flags will contain "head" for tail pages,
      this should not be too confusing as the previous line starts with the word
      "head:" and so the flags should be interpreted as belonging to the head
      page.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Link: http://lkml.kernel.org/r/20200709202117.7216-4-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b93d59e
    • mm/debug: dump compound page information on a second line · 452b557c
      Matthew Wilcox (Oracle) authored
      
      
      Simplify both the implementation and the output by splitting all the
      compound page information onto a second line.
      
      Reported-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Link: http://lkml.kernel.org/r/20200709202117.7216-3-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      452b557c
    • mm/debug: handle page->mapping better in dump_page · e1ab96f8
      Matthew Wilcox (Oracle) authored
      
      
      Patch series "Improvements for dump_page()", v2.
      
      Here's a sample dump of a pagecache tail page with all of the patches
      applied:
      
      page:000000006d1c49ca refcount:6 mapcount:0 mapping:00000000136b8d90 index:0x109 pfn:0x6c645
      head:000000008bd38076 order:2 compound_mapcount:0 compound_pincount:0
      aops:xfs_address_space_operations ino:800042 dentry name:"fd"
      flags: 0x4000000000012014(uptodate|lru|private|head)
      raw: 4000000000000000 ffffd46ac1b19101 ffffffff00000202 dead000000000004
      raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
      head: 4000000000012014 ffffd46ac1b1bbc8 ffffd46ac1b1bc08 ffff91976f659560
      head: 0000000000000108 ffff919773220680 00000006ffffffff 0000000000000000
      page dumped because: testing
      
      This patch (of 6):
      
      If we can't call page_mapping() to get the page mapping, handle the
      anon/ksm/movable bits correctly.
      
      [akpm@linux-foundation.org: augmented code comment from John]
        Link: http://lkml.kernel.org/r/15cff11a-6762-8a6a-3f0e-dd227280cd6f@nvidia.com
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Link: http://lkml.kernel.org/r/20200709202117.7216-1-willy@infradead.org
      Link: http://lkml.kernel.org/r/20200709202117.7216-2-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1ab96f8
    • Documentation/mm: add descriptions for arch page table helpers · b1d00007
      Anshuman Khandual authored
      
      
      This adds a description file covering all arch page table helpers, kept
      in sync with the semantics being tested via CONFIG_DEBUG_VM_PGTABLE.  Any
      future change to either these descriptions or the debug test should keep
      the two in sync.
      
      [anshuman.khandual@arm.com: fold in Mike's patch for the rst document, fix typos in the rst document]
        Link: http://lkml.kernel.org/r/1594610587-4172-5-git-send-email-anshuman.khandual@arm.com
      
      Suggested-by: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/1593996516-7186-5-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1d00007
    • mm/debug_vm_pgtable: add debug prints for individual tests · 6315df41
      Anshuman Khandual authored
      
      
      This adds debug prints that list all the tests being executed on a given
      platform.  With dynamic debug enabled, the following information will be
      printed during boot.  For compactness, both the timestamp and the prefix
      (i.e. debug_vm_pgtable) have been dropped from this sample output.
      
      [debug_vm_pgtable      ]: Validating architecture page table helpers
      [pte_basic_tests       ]: Validating PTE basic
      [pmd_basic_tests       ]: Validating PMD basic
      [p4d_basic_tests       ]: Validating P4D basic
      [pgd_basic_tests       ]: Validating PGD basic
      [pte_clear_tests       ]: Validating PTE clear
      [pmd_clear_tests       ]: Validating PMD clear
      [pte_advanced_tests    ]: Validating PTE advanced
      [pmd_advanced_tests    ]: Validating PMD advanced
      [hugetlb_advanced_tests]: Validating HugeTLB advanced
      [pmd_leaf_tests        ]: Validating PMD leaf
      [pmd_huge_tests        ]: Validating PMD huge
      [pte_savedwrite_tests  ]: Validating PTE saved write
      [pmd_savedwrite_tests  ]: Validating PMD saved write
      [pmd_populate_tests    ]: Validating PMD populate
      [pte_special_tests     ]: Validating PTE special
      [pte_protnone_tests    ]: Validating PTE protnone
      [pmd_protnone_tests    ]: Validating PMD protnone
      [pte_devmap_tests      ]: Validating PTE devmap
      [pmd_devmap_tests      ]: Validating PMD devmap
      [pte_swap_tests        ]: Validating PTE swap
      [swap_migration_tests  ]: Validating swap migration
      [hugetlb_basic_tests   ]: Validating HugeTLB basic
      [pmd_thp_tests         ]: Validating PMD based THP
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Vineet Gupta <vgupta@synopsys.com>	[arc]
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Link: http://lkml.kernel.org/r/1593996516-7186-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6315df41
    • mm/debug_vm_pgtable: add tests validating advanced arch page table helpers · a5c3b9ff
      Anshuman Khandual authored
      
      
      This adds new tests validating the following advanced arch page table
      helpers.  These tests create and test specific mapping types at various
      page table levels.
      
      1. pxxp_set_wrprotect()
      2. pxxp_get_and_clear()
      3. pxxp_set_access_flags()
      4. pxxp_get_and_clear_full()
      5. pxxp_test_and_clear_young()
      6. pxx_leaf()
      7. pxx_set_huge()
      8. pxx_(clear|mk)_savedwrite()
      9. huge_pxxp_xxx()
      
      [anshuman.khandual@arm.com: drop RANDOM_ORVALUE from hugetlb_advanced_tests()]
        Link: http://lkml.kernel.org/r/1594610587-4172-3-git-send-email-anshuman.khandual@arm.com
      
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Vineet Gupta <vgupta@synopsys.com>	[arc]
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Link: http://lkml.kernel.org/r/1593996516-7186-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5c3b9ff
    • mm/debug_vm_pgtable: add tests validating arch helpers for core MM features · 05289402
      Anshuman Khandual authored
      
      
      Patch series "mm/debug_vm_pgtable: Add some more tests", v5.
      
      This series adds some more arch page table helper validation tests
      related to core and advanced memory functions.  It also adds
      documentation listing the expected semantics for all page table helpers,
      as suggested by Mike Rapoport previously
      (https://lkml.org/lkml/2020/1/30/40).

      There are many TRANSPARENT_HUGEPAGE and ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD
      ifdefs scattered across the test, but consolidating all the fallback
      stubs is not very straightforward because
      ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD is not explicitly dependent on
      ARCH_HAS_TRANSPARENT_HUGEPAGE.
      
      Tested on arm64 and x86 platforms, but only build-tested on all other
      platforms enabling ARCH_HAS_DEBUG_VM_PGTABLE, i.e. powerpc, arc, s390.
      The previously mentioned failure on arm64 still exists; it will be fixed
      by the upcoming series enabling THP migration on arm64.
      
      WARNING .... mm/debug_vm_pgtable.c:860 debug_vm_pgtable+0x940/0xa54
      WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))))
      
      This patch (of 4):
      
      This adds new tests validating arch page table helpers for the following
      core memory features.  These tests create and test specific mapping types
      at various page table levels.
      
      1. SPECIAL mapping
      2. PROTNONE mapping
      3. DEVMAP mapping
      4. SOFTDIRTY mapping
      5. SWAP mapping
      6. MIGRATION mapping
      7. HUGETLB mapping
      8. THP mapping
      
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Vineet Gupta <vgupta@synopsys.com>	[arc]
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Link: http://lkml.kernel.org/r/1594610587-4172-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r/1593996516-7186-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r/1593996516-7186-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      05289402
    • Marco Elver's avatar
      mm, kcsan: instrument SLAB/SLUB free with "ASSERT_EXCLUSIVE_ACCESS" · cfbe1636
      Marco Elver authored
      
      
      Provide the necessary KCSAN checks to assist with debugging racy
      use-after-frees.  While KASAN is more reliable at generally catching such
      use-after-frees (due to its use of a quarantine), it can be difficult to
      debug racy use-after-frees.  If a reliable reproducer exists, KCSAN can
      assist in debugging such issues.
      
      Note: ASSERT_EXCLUSIVE_ACCESS is a convenience wrapper that only works
      when the size is simply sizeof(var); here we call __kcsan_check_access()
      directly so that the correct object size can be passed.
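
      As a sketch of what that explicit check looks like on the free path
      (helper name invented here for illustration; the actual patch open-codes
      the call in the SLAB and SLUB free hooks):

      /* Called for each object being released back to the cache. */
      static __always_inline void kcsan_assert_exclusive_object(struct kmem_cache *s,
                                                                void *object)
      {
              /*
               * The size is s->object_size rather than sizeof(var), so call
               * __kcsan_check_access() directly instead of the
               * ASSERT_EXCLUSIVE_ACCESS() convenience wrapper.
               */
              __kcsan_check_access(object, s->object_size,
                                   KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);
      }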
      
      Signed-off-by: default avatarMarco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200623072653.114563-1-elver@google.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cfbe1636
    • Sebastian Andrzej Siewior's avatar
      mm/slub.c: drop lockdep_assert_held() from put_map() · b3cb9fc3
      Sebastian Andrzej Siewior authored
      
      
      There is no point in using lockdep_assert_held() on a lock that is about
      to be unlocked: the assertion only does anything with lockdep enabled,
      and lockdep will complain anyway if spin_unlock() is used on a lock that
      has not been locked.
      
      Remove superfluous lockdep_assert_held().
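
      For reference, a sketch of how put_map() reads after the change (an
      approximation of the code in mm/slub.c, not a verbatim quote):

      static void put_map(unsigned long *map) __releases(&object_map_lock)
      {
              VM_BUG_ON(map != object_map);
              /*
               * No lockdep_assert_held(&object_map_lock) needed here:
               * spin_unlock() already complains under lockdep if the lock
               * is not actually held.
               */
              spin_unlock(&object_map_lock);
      }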
      
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20200618201234.795692-2-bigeasy@linutronix.de
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3cb9fc3
    • Vlastimil Babka's avatar
      mm, slab/slub: improve error reporting and overhead of cache_from_obj() · e42f174e
      Vlastimil Babka authored
      cache_from_obj() was added by commit b9ce5ef4 ("sl[au]b: always get
      the cache from its page in kmem_cache_free()") to support kmemcg, where
      per-memcg cache can be different from the root one, so we can't use the
      kmem_cache pointer given to kmem_cache_free().
      
      Prior to that commit, SLUB already had debugging check+warning that could
      be enabled to compare the given kmem_cache pointer to one referenced by
      the slab page where the object-to-be-freed resides.  This check was moved
      to cache_from_obj().  Later the check was also enabled for
      SLAB_FREELIST_HARDENED configs by commit 598a0717 ("mm/slab: validate
      cache membership under freelist hardening").
      
      These checks and warnings can be useful, especially for debugging, which
      this patch improves.  Commit 598a0717 changed the pr_err() plus
      WARN_ON_ONCE() combination to WARN_ONCE(), so only the first hit is
      reported and subsequent ones are silent.  This patch changes it to WARN()
      so that all errors are reported.
      
      It's also useful to print SLUB allocation/free tracking info for the
      offending object, if tracking is enabled.  Thus, export the SLUB
      print_tracking() function and provide an empty one for SLAB.
      
      For SLUB we can also benefit from the static key check in
      kmem_cache_debug_flags(), but we need to move this function to slab.h and
      declare the static key there.
      
      [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com
      
      [vbabka@suse.cz: avoid bogus WARN()]
        Link: https://lore.kernel.org/r/20200623090213.GW5535@shao2-debian
        Link: http://lkml.kernel.org/r/b33e0fa7-cd28-4788-9e54-5927846329ef@suse.cz
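
      Taken together, the SLUB-side check described above ends up roughly along
      these lines; this is a sketch reconstructed from the description, so the
      exact condition and helper names may differ from the final code:

      static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
      {
              struct kmem_cache *cachep;

              /* Static key check: skip entirely without debugging/hardening. */
              if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                  !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
                      return s;

              cachep = virt_to_cache(x);
              /* WARN() instead of WARN_ONCE(): report every offending free. */
              if (WARN(cachep && cachep != s,
                       "%s: Wrong slab cache. %s but object is from %s\n",
                       __func__, s->name, cachep->name))
                      /* Dump SLUB alloc/free tracking for the bad object. */
                      print_tracking(cachep, x);
              return cachep;
      }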
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Garrett <mjg59@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/afeda7ac-748b-33d8-a905-56b708148ad5@suse.cz
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e42f174e
    • Vlastimil Babka's avatar
      mm, slab/slub: move and improve cache_from_obj() · d3c58f24
      Vlastimil Babka authored
      The function cache_from_obj() was added by commit b9ce5ef4 ("sl[au]b:
      always get the cache from its page in kmem_cache_free()") to support
      kmemcg, where per-memcg cache can be different from the root one, so we
      can't use the kmem_cache pointer given to kmem_cache_free().
      
      Prior to that commit, SLUB already had debugging check+warning that could
      be enabled to compare the given kmem_cache pointer to one referenced by
      the slab page where the object-to-be-freed resides.  This check was moved
      to cache_from_obj().  Later the check was also enabled for
      SLAB_FREELIST_HARDENED configs by commit 598a0717 ("mm/slab: validate
      cache membership under freelist hardening").
      
      These checks and warnings can be useful, especially for debugging, which
      this patch improves.  Commit 598a0717 changed the pr_err() plus
      WARN_ON_ONCE() combination to WARN_ONCE(), so only the first hit is
      reported and subsequent ones are silent.  This patch changes it to WARN()
      so that all errors are reported.
      
      It's also useful to print SLUB allocation/free tracking info for the
      offending object, if tracking is enabled.  We could export the SLUB
      print_tracking() function and provide an empty one for SLAB, or realize
      that both the debugging and hardening cases in cache_from_obj() are only
      supported by SLUB anyway.  So this patch moves cache_from_obj() from
      slab.h to separate instances in slab.c and slub.c, where the SLAB version
      only does the kmemcg lookup and could even be removed completely once the
      kmemcg rework [1] is merged.  The SLUB version can thus easily use the
      print_tracking() function.  It can also use the kmem_cache_debug_flags()
      static key check for improved performance in kernels without the hardening
      and with debugging not enabled on boot.
      
      [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com
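
      By contrast, the SLAB-side instance after the move is assumed to reduce
      to little more than the kmemcg lookup, roughly:

      /* mm/slab.c: only the kmemcg lookup remains; it can go away entirely
       * once the kmemcg slab rework [1] lands. */
      static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
      {
              if (memcg_kmem_enabled())
                      return virt_to_cache(x);
              return s;
      }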
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3c58f24
    • Vlastimil Babka's avatar
      mm, slub: extend checks guarded by slub_debug static key · 8fc8d666
      Vlastimil Babka authored
      
      
      There are a few more places in SLUB that could benefit from the reduced
      overhead of the static key introduced by a previous patch:
      
      - setup_object_debug(), called on each object in a newly allocated slab page
      - setup_page_debug(), called on a newly allocated slab page
      - __free_slab(), called on a freed slab page
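
      As one example of the pattern, setup_object_debug() can bail out early
      behind the static key via kmem_cache_debug_flags() (introduced by the
      previous patch in the series, listed below); a sketch, with the exact
      flag mask assumed:

      static void setup_object_debug(struct kmem_cache *s, struct page *page,
                                     void *object)
      {
              /* Jump-label NOP when slub_debug is off; cheap flag test otherwise. */
              if (!kmem_cache_debug_flags(s, SLAB_STORE_USER | SLAB_RED_ZONE |
                                             __OBJECT_POISON))
                      return;

              init_object(s, object, SLUB_RED_INACTIVE);
              init_tracking(s, object);
      }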
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200610163135.17364-9-vbabka@suse.cz
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fc8d666
    • Vlastimil Babka's avatar
      mm, slub: introduce kmem_cache_debug_flags() · 59052e89
      Vlastimil Babka authored
      
      
      There are a few places that call kmem_cache_debug(s) (which tests
      whether any debug flags are enabled for a cache) immediately followed by
      a test for a specific flag.  The compiler can probably eliminate the
      extra check, but we can make the code nicer by introducing
      kmem_cache_debug_flags(), which works like kmem_cache_debug() (including
      the static key check) but tests for specific flag(s).  The next patches
      will add more users.
      
      [vbabka@suse.cz: change return from int to bool, per Kees.  Add VM_WARN_ON_ONCE() for invalid flags, per Roman]
        Link: http://lkml.kernel.org/r/949b90ed-e0f0-07d7-4d21-e30ec0958a7c@suse.cz
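
      A sketch of the helper as described (bool return, static key check,
      VM_WARN_ON_ONCE() on flags outside SLAB_DEBUG_FLAGS); the exact form is
      an assumption:

      static inline bool kmem_cache_debug_flags(struct kmem_cache *s,
                                                slab_flags_t flags)
      {
      #ifdef CONFIG_SLUB_DEBUG
              VM_WARN_ON_ONCE(!(flags & SLAB_DEBUG_FLAGS));
              if (static_branch_unlikely(&slub_debug_enabled))
                      return s->flags & flags;
      #endif
              return false;
      }

      /* The general check then simply becomes: */
      static inline bool kmem_cache_debug(struct kmem_cache *s)
      {
              return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
      }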
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200610163135.17364-8-vbabka@suse.cz
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59052e89
    • Vlastimil Babka's avatar
      mm, slub: introduce static key for slub_debug() · ca0cab65
      Vlastimil Babka authored
      
      
      One advantage of CONFIG_SLUB_DEBUG is that a generic distro kernel can
      be built with the option enabled, while the debugging stays inactive
      unless it is enabled on boot, with no need to rebuild the kernel.  With a
      static key, we can further eliminate the overhead of checking whether a
      cache has a particular debug flag enabled when we know that there are no
      such caches (slub_debug was not enabled during boot).  The same mechanism
      is already used for e.g. page_owner, debug_pagealloc and kmemcg
      functionality.
      
      This patch introduces the static key and makes kmem_cache_debug(), the
      general check for per-cache debug flags, use it.  This benefits several
      call sites, including __slab_free(), which is a slow path but still
      rather frequent.  The next patches will add more uses.
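
      The mechanism boils down to a static key that is flipped during boot-time
      slub_debug parsing and then checked in kmem_cache_debug(); a sketch under
      those assumptions:

      #ifdef CONFIG_SLUB_DEBUG
      /* Off by default; enabled once during early boot if slub_debug is requested. */
      DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
      #endif

      static inline int kmem_cache_debug(struct kmem_cache *s)
      {
      #ifdef CONFIG_SLUB_DEBUG
              /* Compiles to a patched jump (NOP when the key is disabled). */
              if (static_branch_unlikely(&slub_debug_enabled))
                      return s->flags & SLAB_DEBUG_FLAGS;
      #endif
              return 0;
      }

      Boot-time parsing of slub_debug= would then call
      static_branch_enable(&slub_debug_enabled), so the branch is only patched
      in when debugging was actually requested.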
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200610163135.17364-7-vbabka@suse.cz
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca0cab65
    • Vlastimil Babka's avatar
      mm, slub: make reclaim_account attribute read-only · 8f58119a
      Vlastimil Babka authored
      
      
      The attribute reflects the SLAB_RECLAIM_ACCOUNT cache flag.  It's not
      clear why this attribute was writable in the first place: it is tied to
      how the cache is used by its creator and is not a user tunable.
      Furthermore:
      
      - it affects slab merging, but that's not being checked while toggled
      - it affects whether the __GFP_RECLAIMABLE flag is used to allocate pages,
        but the runtime toggle doesn't update allocflags
      - it affects cache_vmstat_idx(), so runtime toggling might lead to
        inconsistency between NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE
      
      Thus make it read-only.
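
      In sysfs terms this amounts to dropping the _store handler and defining
      the attribute with the existing read-only macro; a sketch assuming the
      usual SLUB attribute helpers:

      static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
      {
              return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
      }
      SLAB_ATTR_RO(reclaim_account);  /* previously SLAB_ATTR(), i.e. writable */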
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200610163135.17364-6-vbabka@suse.cz
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f58119a
    • Vlastimil Babka's avatar
      mm, slub: make remaining slub_debug related attributes read-only · 060807f8
      Vlastimil Babka authored
      
      
      SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that
      can be read to check whether the respective debugging options are
      enabled for a given cache.  Some options, namely sanity_checks, trace,
      and failslab, can also be enabled and disabled at runtime by writing
      into the files.
      
      The runtime toggling is racy.  Some options disable __CMPXCHG_DOUBLE when
      enabled, which means that with concurrent allocations some can still use
      __CMPXCHG_DOUBLE and some cannot, leading to potential corruption.  The
      s->flags field is also not updated or checked atomically.  The simplest
      solution is to remove the runtime toggling.  The extended slub_debug boot
      parameter syntax introduced by an earlier patch allows the debugging
      configuration to be fine-tuned during boot with the same granularity.
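
      With the runtime toggles gone, the equivalent control is the extended
      slub_debug= boot parameter.  A hypothetical example (cache names chosen
      purely for illustration) that enables sanity checks for one cache and red
      zoning plus poisoning for another:

      slub_debug=F,dentry;ZP,kmalloc-512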
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.cz
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      060807f8