  1. Sep 04, 2021
    • scatterlist: replace flush_kernel_dcache_page with flush_dcache_page · 0e84f5db
      Christoph Hellwig authored
      
      
      Pages used in scatterlist can be mapped page cache pages (and often are),
      so we must use flush_dcache_page here instead of the more limited
      flush_kernel_dcache_page that is intended for highmem pages only.
      
      Also remove the PageSlab check given that page_mapping_file as used by the
      flush_dcache_page implementations already contains that check.
      
      Link: https://lkml.kernel.org/r/20210712060928.4161649-5-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Cercueil <paul@crapouillou.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mmc: mmc_spi: replace flush_kernel_dcache_page with flush_dcache_page · 64a05fe6
      Christoph Hellwig authored
      
      
      Pages passed to block drivers can be mapped page cache pages, so we must
      use flush_dcache_page here instead of the more limited
      flush_kernel_dcache_page that is intended for highmem pages only.
      
      Link: https://lkml.kernel.org/r/20210712060928.4161649-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Cercueil <paul@crapouillou.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mmc: JZ4740: remove the flush_kernel_dcache_page call in jz4740_mmc_read_data · 79c62de8
      Christoph Hellwig authored
      
      
      Patch series "_kernel_dcache_page fixes and removal".
      
      While looking to convert the block layer away from kmap_atomic towards
      kmap_local_page, and preferably the helpers that abstract it away, I noticed
      that a few block drivers directly or implicitly call
      flush_kernel_dcache_page before kunmapping a page that has been written
      to.
      
      flush_kernel_dcache_page is documented to be used in such cases, but
      flush_dcache_page is actually required when the page could be in the page
      cache and mapped to userspace, which is pretty much always the case when
      kmapping an arbitrary page.  Unfortunately the documentation doesn't
      exactly make that clear, which led to this misuse.  And it turns out
      that only the copy_strings / copy_string_kernel in the exec code were
      actually correct users of flush_kernel_dcache_page, which is why I think
      we should just remove it and eat the very minor overhead in exec rather
      than confusing poor driver writers.
      
      This patch (of 6):
      
      MIPS now implements flush_kernel_dcache_page (as an alias to
      flush_dcache_page).
      
      Link: https://lkml.kernel.org/r/20210712060928.4161649-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20210712060928.4161649-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Paul Cercueil <paul@crapouillou.net>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: Alex Shi <alexs@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests: Fix spelling mistake "cann't" -> "cannot" · 0c52ec95
      Colin Ian King authored
      
      
      There is a spelling mistake in an error message. Fix it.
      
      Link: https://lkml.kernel.org/r/20210826121217.12885-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests/vm: use kselftest skip code for skipped tests · 6260618e
      Po-Hsu Lin authored
      
      
      Several test cases in the vm directory still use exit 0 when they need
      to be skipped.  Use the kselftest skip code instead, so that the return
      status distinguishes a skipped test from a passing one.

      Criterion used to find what should be fixed in the vm directory:
        grep -r "exit 0" -B1 | grep -i skip
      
      This change might cause some false-positives if people are running these
      test scripts directly and only checking their return codes, which will
      change from 0 to 4.  However I think the impact should be small as most of
      our scripts here are already using this skip code.  And there will be no
      such issue if running them with the kselftest framework.
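      A minimal C sketch of that convention, assuming a test that can only run
      when some file exists. The exit codes mirror the KSFT_* values in
      tools/testing/selftests/kselftest.h; check_prereq() and any path passed
      to it are hypothetical.

```c
#include <unistd.h>

/* Exit codes as defined in tools/testing/selftests/kselftest.h. */
#define KSFT_PASS 0
#define KSFT_FAIL 1
#define KSFT_SKIP 4

/*
 * Hypothetical prerequisite check: report "skip" (4) rather than
 * "pass" (0) when the feature under test is unavailable, so a runner
 * can tell a skipped test from a successful one.
 */
int check_prereq(const char *path)
{
	if (access(path, F_OK) != 0)
		return KSFT_SKIP;
	return KSFT_PASS;
}
```

      A test's main() would then return check_prereq()'s result early when it
      is KSFT_SKIP, instead of calling exit(0).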
      
      Link: https://lkml.kernel.org/r/20210823073433.37653-1-po-hsu.lin@canonical.com
      Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: make memcg->event_list_lock irqsafe · 4ba9515d
      Shakeel Butt authored
      
      
      The memcg->event_list_lock is usually taken in the normal context but when
      the userspace closes the corresponding eventfd, eventfd_release through
      memcg_event_wake takes memcg->event_list_lock with interrupts disabled.
      This is not an issue on its own but it creates a nested dependency from
      eventfd_ctx->wqh.lock to memcg->event_list_lock.
      
      Independently, for an unrelated eventfd, eventfd_signal() can be called in
      irq context, thus making eventfd_ctx->wqh.lock an irq lock.  For example,
      the FPGA DFL driver, the VHOST VDPA driver and a couple of VFIO drivers do
      this.  This forces memcg->event_list_lock to be an irqsafe lock as well.

      One way to break the nested dependency between eventfd_ctx->wqh.lock and
      memcg->event_list_lock is to add an indirection.  However, the simplest
      solution is to make memcg->event_list_lock irqsafe.  This is a cgroup v1
      feature that is in maintenance mode and may get deprecated in the near
      future, so there is no need to add more code.
      
      BTW this has been discussed previously [1] but there weren't irq users of
      eventfd_signal() at the time.
      
      [1] https://www.spinics.net/lists/cgroups/msg06248.html
      
      Link: https://lkml.kernel.org/r/20210830172953.207257-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix up drain_local_stock comment · 5c49cf9a
      Michal Hocko authored
      
      
      Thomas and Vlastimil have noticed that the comment in drain_local_stock
      doesn't quite make sense.  It talks about a synchronization with the
      memory hotplug but there is no actual memory hotplug involvement here.  I
      meant to talk about cpu hotplug here.  Fix that up and hopefully make the
      comment more helpful by referencing the cpu hotplug callback as well.
      
      Link: https://lkml.kernel.org/r/YRDwOhVglJmY7ES5@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: save some atomic ops when flush is already true · 27fb0956
      Miaohe Lin authored
      
      
      Add 'else' to save some atomic ops in obj_stock_flush_required() when
      flush is already true.  No functional change intended here.
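      A simplified userspace sketch of the pattern, assuming a hypothetical
      struct stock in place of the kernel's per-cpu stock and a call counter
      standing in for the atomic operations the commit wants to save: with the
      else, the second (more expensive) check runs only when the first one did
      not already set flush.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the per-cpu memcg stock. */
struct stock {
	unsigned int nr_pages;         /* cached page charges */
	unsigned int cached_obj_bytes; /* cached object (byte) charges */
};

/* Counts calls to the expensive check, purely for illustration. */
static int expensive_checks;

/* Stand-in for obj_stock_flush_required(); the kernel version is costlier. */
static bool obj_flush_required(const struct stock *s)
{
	expensive_checks++;
	return s->cached_obj_bytes != 0;
}

bool stock_needs_flush(const struct stock *s)
{
	bool flush = false;

	if (s->nr_pages)
		flush = true;
	else if (obj_flush_required(s))  /* skipped once flush is true */
		flush = true;
	return flush;
}
```

      Since the result is the same either way, the else only changes how much
      work is done, which matches "no functional change intended".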
      
      Link: https://lkml.kernel.org/r/20210807082835.61281-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: remove unused functions · bec49c06
      Miaohe Lin authored
      Since commit 2d146aa3 ("mm: memcontrol: switch to rstat"), the last user
      of memcg_stat_item_in_bytes() is gone.  And since commit fa40d1ee
      ("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only
      the declaration of mem_cgroup_select_victim_node() has remained here.
      Remove them.
      
      Link: https://lkml.kernel.org/r/20210807082835.61281-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: set the correct memcg swappiness restriction · 37bc3cb9
      Baolin Wang authored
      Commit c843966c ("mm: allow swappiness that prefers reclaiming anon
      over the file workingset") expanded the range of the swappiness value so
      that swap can be preferred on some systems.  The memcg swappiness
      restriction should be changed accordingly, so that a memcg can also be
      configured to prefer swap.
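      A sketch of the kind of range check involved, assuming the per-memcg
      limit simply follows the global 0..200 range introduced by that commit;
      validate_memcg_swappiness() and SWAPPINESS_MAX are hypothetical names,
      not the kernel's.

```c
#include <errno.h>

/* Hypothetical upper bound: values above 100 favour reclaiming anon memory. */
#define SWAPPINESS_MAX 200

/* Returns 0 if the value is accepted, -EINVAL if it is out of range. */
int validate_memcg_swappiness(unsigned long val)
{
	if (val > SWAPPINESS_MAX)
		return -EINVAL;
	return 0;
}
```

      With the old bound of 100, a swap-preferring value such as 150 would be
      rejected for a memcg even though the global setting accepts it.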
      
      Link: https://lkml.kernel.org/r/d77469b90c45c49953ccbc51e54a1d465bc18f70.1627626255.git.baolin.wang@linux.alibaba.com
      Fixes: c843966c ("mm: allow swappiness that prefers reclaiming anon over the file workingset")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: replace in_interrupt() by !in_task() in active_memcg() · 55a68c82
      Vasily Averin authored
      set_active_memcg() uses an in_interrupt() check to select the proper
      storage for the cgroup pointer: a field in the task struct or a per-cpu
      pointer.

      This isn't fully correct: the obsolete in_interrupt() also returns true
      for tasks with BH disabled.  It's better to use '!in_task()' instead.
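      The difference can be modelled in userspace with the preempt_count bit
      layout from include/linux/preempt.h (as of roughly v5.14; treat the exact
      layout as an assumption). local_bh_disable() adds SOFTIRQ_DISABLE_OFFSET
      to the softirq bits, which in_interrupt() counts but in_task() ignores,
      since it only tests the "serving a softirq" bit. The model_* names are
      hypothetical.

```c
#include <stdbool.h>

/*
 * Userspace model of the preempt_count layout:
 * bits 0-7 preempt, 8-15 softirq, 16-19 hardirq, 20-23 NMI.
 */
#define SOFTIRQ_OFFSET		(1UL << 8)
#define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
#define SOFTIRQ_MASK		(0xffUL << 8)
#define HARDIRQ_OFFSET		(1UL << 16)
#define HARDIRQ_MASK		(0xfUL << 16)
#define NMI_MASK		(0xfUL << 20)

/* Stand-in for the per-cpu preempt_count(). */
static unsigned long model_preempt_count;

/* Old check: any softirq bit counts, including "BH disabled". */
bool model_in_interrupt(void)
{
	return model_preempt_count & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK);
}

/* New check: task context unless in NMI, hardirq or serving a softirq. */
bool model_in_task(void)
{
	return !(model_preempt_count &
		 (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
}
```

      With only SOFTIRQ_DISABLE_OFFSET set (a task that called
      local_bh_disable()), model_in_interrupt() is true while model_in_task()
      is also true, which is the case where the old check picked the per-cpu
      storage for what is really task context.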
      
      Link: https://lkml.org/lkml/2021/7/26/487
      Link: https://lkml.kernel.org/r/ed4448b0-4970-616f-7368-ef9dd3cb628d@virtuozzo.com
      Fixes: 37d5985c ("mm: kmem: prepare remote memcg charging infra for interrupt contexts")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: cleanup racy sum avoidance code · 96e51ccf
      Shakeel Butt authored
      
      
      We used to have per-cpu memcg and lruvec stats, and readers had to
      traverse and sum the stats from each cpu.  This summing was racy and could
      expose transient negative values, so an explicit check was added to avoid
      such scenarios.  Now these stats have moved to the rstat infrastructure
      and are no longer per-cpu, so we can remove the fixup for transient
      negative values.
      
      Link: https://lkml.kernel.org/r/20210728012243.3369123-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for ldt_struct objects · ec403e2a
      Vasily Averin authored
      
      
      Each task can request its own LDT and force the kernel to allocate up to
      64KB of memory per mm.
      
      There are legitimate workloads with hundreds of processes and there can be
      hundreds of workloads running on large machines.  The unaccounted memory
      can cause isolation issues between the workloads particularly on highly
      utilized machines.
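      A rough worst-case estimate, assuming the x86 limits of 8192 LDT
      descriptors of 8 bytes each and a hypothetical host running 100 workloads
      of 100 processes, each process with its own mm and a full LDT; the helper
      names are hypothetical.

```c
#include <stdint.h>

/* x86 limits: an LDT holds at most 8192 descriptors of 8 bytes each. */
#define LDT_ENTRIES    8192
#define LDT_ENTRY_SIZE 8

uint64_t ldt_bytes_per_mm(void)
{
	return (uint64_t)LDT_ENTRIES * LDT_ENTRY_SIZE;  /* 65536 = 64 KiB */
}

/* Hypothetical host: `workloads` containers of `procs` processes each. */
uint64_t worst_case_ldt_bytes(uint64_t workloads, uint64_t procs)
{
	return workloads * procs * ldt_bytes_per_mm();
}
```

      worst_case_ldt_bytes(100, 100) comes to 655,360,000 bytes (625 MiB) that
      were previously invisible to memcg.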
      
      It makes sense to account for these objects to restrict the host's memory
      consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/38010594-50fe-c06d-7cb0-d1f77ca422f3@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for posix_timers_cache slab · c509723e
      Vasily Averin authored
      
      
      A program may create multiple interval timers using timer_create().  For
      each timer the kernel preallocates a "queued real-time signal";
      consequently, the number of timers is limited by the RLIMIT_SIGPENDING
      resource limit.  The allocated object is quite small, ~250 bytes, but even
      the default signal limits allow a user to consume up to 100 megabytes.
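      The arithmetic behind that bound can be sketched as follows, taking the
      ~250 bytes per timer from the text. getrlimit() and RLIMIT_SIGPENDING are
      standard Linux interfaces; the helper names are hypothetical, and the
      400,000-signal limit used below is a hypothetical value, since the real
      default depends on the machine.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/resource.h>

/* Approximate per-timer cost quoted above; the real size varies by config. */
#define POSIX_TIMER_BYTES 250

/* Upper bound on memory pinned by `max_objects` objects of `obj_size` bytes. */
uint64_t worst_case_bytes(uint64_t max_objects, uint64_t obj_size)
{
	return max_objects * obj_size;
}

/* Same estimate for the calling process's current RLIMIT_SIGPENDING. */
int estimate_timer_bytes(uint64_t *out)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_SIGPENDING, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -1;  /* no finite bound to report */
	*out = worst_case_bytes(rl.rlim_cur, POSIX_TIMER_BYTES);
	return 0;
}
```

      For example, worst_case_bytes(400000, POSIX_TIMER_BYTES) gives
      100,000,000 bytes, the order of magnitude quoted above.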
      
      It makes sense to account for them to limit the host's memory consumption
      from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/57795560-025c-267c-6b1a-dea852d95530@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for signals · 5f58c398
      Vasily Averin authored
      
      
      When a user sends a signal to another process, it forces the kernel to
      allocate memory for 'struct sigqueue' objects.  The number of signals is
      limited by the RLIMIT_SIGPENDING resource limit, but even the default
      settings allow each user to consume up to several megabytes of memory.
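      A small userspace demonstration, assuming Linux and glibc: real-time
      signals queued to a process while they are blocked are not merged, so
      each one stays pending as a separate kernel object until it is dequeued.
      queue_and_drain() is a hypothetical helper built on the POSIX sigqueue()
      and sigtimedwait() calls.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Queue up to n copies of SIGRTMIN to ourselves, then count them back. */
int queue_and_drain(int n)
{
	sigset_t set;
	struct timespec zero = { 0, 0 };
	union sigval val;
	int i, got = 0;

	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
		return -1;

	for (i = 0; i < n; i++) {
		val.sival_int = i;
		/* Fails with EAGAIN once RLIMIT_SIGPENDING is reached. */
		if (sigqueue(getpid(), SIGRTMIN, val) != 0)
			break;
	}

	/* Drain without waiting: each dequeue releases one queued entry. */
	while (sigtimedwait(&set, NULL, &zero) == SIGRTMIN)
		got++;

	sigprocmask(SIG_UNBLOCK, &set, NULL);
	return got;
}
```

      Every signal that is queued but not yet consumed corresponds to one
      allocation the kernel holds on the sender's behalf.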
      
      It makes sense to account for these allocations to restrict the host's
      memory consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting of ipc resources · 18319498
      Vasily Averin authored
      
      
      When a user creates IPC objects, it forces the kernel to allocate memory
      for these long-living objects.

      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.

      This patch enables accounting for IPC shared memory segments, messages,
      semaphores and semaphore undo lists.
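      A userspace sketch of why these objects are long-living, assuming System
      V IPC is available: a message queue and its messages stay allocated in
      the kernel until IPC_RMID, regardless of which process created them.
      queued_messages_demo() and struct demo_msg are hypothetical.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct demo_msg {
	long mtype;
	char mtext[64];
};

/* Create a queue, leave one message in it, and report the kernel's count. */
long queued_messages_demo(void)
{
	struct demo_msg m = { .mtype = 1 };
	struct msqid_ds ds;
	long qnum = -1;
	int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

	if (id < 0)
		return -1;
	strcpy(m.mtext, "hello");
	if (msgsnd(id, &m, strlen(m.mtext) + 1, 0) == 0 &&
	    msgctl(id, IPC_STAT, &ds) == 0)
		qnum = (long)ds.msg_qnum;
	/* Without IPC_RMID the queue would outlive this process. */
	msgctl(id, IPC_RMID, NULL);
	return qnum;
}
```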
      
      Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fi...
    • memcg: enable accounting for new namespaces and struct nsproxy · 30acd0bd
      Vasily Averin authored
      
      
      A container admin can create new namespaces and force the kernel to
      allocate up to several pages of memory for the namespaces and their
      associated structures.

      Net and uts namespaces already have accounting enabled for such
      allocations.  It makes sense to account for the remaining ones to
      restrict the host's memory consumption from inside the memcg-limited
      container.
      
      Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc:...
    • memcg: enable accounting for fasync_cache · 839d6820
      Vasily Averin authored
      
      
      fasync_struct is used by almost all character device drivers to set up the
      fasync queue, and for regular files by the file lease code.  This
      structure is quite small but long-living and it can be assigned for any
      open file.
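      A userspace sketch of how any process can trigger such an allocation,
      assuming Linux: setting O_ASYNC on a file whose driver implements the
      fasync hook (pipes do) makes the kernel add a fasync_struct for that
      file. enable_async_on_pipe() is a hypothetical helper.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Returns 1 if O_ASYNC ended up set on the pipe, 0 if not, -1 on error. */
int enable_async_on_pipe(void)
{
	int fds[2], flags, ok;

	/* SIGIO would terminate the process by default if data ever arrived. */
	signal(SIGIO, SIG_IGN);
	if (pipe(fds) != 0)
		return -1;
	if (fcntl(fds[0], F_SETOWN, getpid()) != 0 ||
	    (flags = fcntl(fds[0], F_GETFL)) < 0 ||
	    fcntl(fds[0], F_SETFL, flags | O_ASYNC) != 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	ok = (fcntl(fds[0], F_GETFL) & O_ASYNC) != 0;
	close(fds[0]);  /* the final close releases the fasync entry */
	close(fds[1]);
	return ok;
}
```

      A long-running process that simply keeps such a file open keeps the
      fasync_struct allocated for as long as it does so.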
      
      It makes sense to account for its allocations to restrict the host's
      memory consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/1b408625-d71c-0b26-b0b6-9baf00f93e69@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for file lock caches · 0f12156d
      Vasily Averin authored
      
      
      A user can create file locks for each open file and force the kernel to
      allocate small but long-living objects for each of them.
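      A userspace sketch of the scaling, assuming Linux POSIX record locks and
      a writable /tmp: locking every other byte of one file produces ranges the
      kernel cannot merge, so each fcntl(F_SETLK) call leaves a separate lock
      object behind while the file is open. lock_alternate_bytes() is
      hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns the number of locks taken, or -1 on error. */
int lock_alternate_bytes(int n)
{
	char path[] = "/tmp/lockdemoXXXXXX";
	int fd = mkstemp(path);
	int i, taken = 0;

	if (fd < 0)
		return -1;
	unlink(path);  /* the open descriptor keeps the file alive */
	for (i = 0; i < n; i++) {
		struct flock fl = {
			.l_type = F_WRLCK,
			.l_whence = SEEK_SET,
			.l_start = 2 * i,  /* every other byte: ranges never touch */
			.l_len = 1,
		};
		if (fcntl(fd, F_SETLK, &fl) == 0)
			taken++;
	}
	close(fd);  /* drops all of this process's POSIX locks on the file */
	return taken;
}
```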
      
      It makes sense to account for these objects to limit the host's memory
      consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/b009f4c7-f0ab-c0ec-8e83-918f47d677da@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for pollfd and select bits arrays · b6558434
      Vasily Averin authored
      
      
      A user can call the select/poll system calls with a large number of file
      descriptors and force the kernel to allocate up to several pages of
      memory that stay allocated until these sleeping system calls return.
      These are long-living, unaccounted per-task allocations.
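      A userspace sketch of the call pattern, assuming Linux: poll() accepts
      duplicate entries, so a single idle pipe can be polled through hundreds
      of pollfd slots (up to RLIMIT_NOFILE). poll_many() is hypothetical, and
      it uses a zero timeout so it returns immediately instead of sleeping.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * The kernel copies the whole pollfd array in for the duration of the
 * call; whatever does not fit in a small on-stack buffer is allocated,
 * and a blocking call (timeout -1) would hold it until it returns.
 */
int poll_many(nfds_t n)
{
	int fds[2], ret;
	nfds_t i;
	struct pollfd *pfd = calloc(n, sizeof(*pfd));

	if (!pfd || pipe(fds) != 0) {
		free(pfd);
		return -1;
	}
	for (i = 0; i < n; i++) {
		pfd[i].fd = fds[0];   /* duplicate entries are allowed */
		pfd[i].events = POLLIN;
	}
	ret = poll(pfd, n, 0);    /* nothing written yet: expect 0 ready */
	close(fds[0]);
	close(fds[1]);
	free(pfd);
	return ret;
}
```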
      
      It makes sense to account for these allocations to restrict the host's
      memory consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/56e31cb5-6e1e-bdba-d7ca-be64b9842363@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for mnt_cache entries · 79f6540b
      Vasily Averin authored
      
      
      Patch series "memcg accounting from OpenVZ", v7.
      
      OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
      kernels.  Initially we used our own accounting subsystem, then partially
      committed it upstream, and a few years ago switched to cgroups v1.  Now
      we're rebasing again, revising our old patches and trying to push them
      upstream.
      
      We try to protect the host system from any misuse of kernel memory
      allocation triggered by untrusted users inside the containers.
      
      This patch set is addressed mostly to the cgroups maintainers and the
      cgroups@ mailing list, though I would be very grateful for any comments
      from the maintainers of affected subsystems or other people added in cc:
      
      Compared to upstream, we additionally account the following kernel objects:
      - network devices and their Tx/Rx queues
      - ipv4/v6 addresses and routing-related objects
      - inet_bind_bucket cache objects
      - VLAN group arrays
      - ipv6/sit: ip_tunnel_prl
      - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
      - nsproxy and the namespace objects themselves
      - IPC objects: semaphores, message queues and shared memory segments
      - mounts
      - pollfd and select bits arrays
      - signals and posix timers
      - file locks
      - fasync_struct used by the file lease code and drivers' fasync queues
      - tty objects
      - per-mm LDT
      
      We also have incorrect/incomplete/obsolete accounting for a few other
      kernel objects: sk_filter, af_packets, netlink and xt_counters for
      iptables.  They require rework and will probably be dropped entirely.

      We also plan to add accounting for nft; however, it is not ready yet.
      
      We have not tested performance on upstream; however, our performance team
      compared our current RHEL7-based production kernel and reports that it
      performs at least as well as the corresponding original RHEL7 kernel.
      
      This patch (of 10):
      
      The kernel allocates ~400 bytes of 'struct mount' for any new mount.
      Creating a new mount namespace clones most of the parent mounts, and this
      can be repeated many times.  Additionally, each mount allocates up to
      PATH_MAX=4096 bytes for mnt->mnt_devname.
      
      It makes sense to account for these allocations to restrict the host's
      memory consumption from inside the memcg-limited container.
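      As a rough illustration of the scale involved, the sketch below bounds
      the memory that cloned mounts can pin. It uses only the figures quoted
      above: ~400 bytes per struct mount, plus up to PATH_MAX = 4096 bytes for
      mnt_devname. The helper name is hypothetical. It assumes the worst case,
      in which every clone copies every mount and every devname is PATH_MAX
      long.

```c
#include <stdint.h>

/* Figures quoted in the commit message; both are approximations. */
#define MOUNT_STRUCT_BYTES  400u   /* ~size of struct mount */
#define MAX_DEVNAME_BYTES  4096u   /* PATH_MAX upper bound for mnt_devname */

/*
 * Hypothetical worst-case estimate: each of nr_namespaces clones copies
 * mounts_per_ns mounts, and every mount carries a PATH_MAX-sized devname.
 */
static uint64_t worst_case_mount_bytes(uint64_t nr_namespaces,
                                       uint64_t mounts_per_ns)
{
    return nr_namespaces * mounts_per_ns *
           (uint64_t)(MOUNT_STRUCT_BYTES + MAX_DEVNAME_BYTES);
}
```

      For example, 1000 namespaces that each hold 30 mounts give
      1000 * 30 * 4496 = 134,880,000 bytes, roughly 135MB, all of it
      unaccounted without this patch.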
      
      Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79f6540b
    • Yutian Yang's avatar
      memcg: charge fs_context and legacy_fs_context · bb902cb4
      Yutian Yang authored
      
      
      This patch adds accounting flags to the fs_context and legacy_fs_context
      allocation sites so that the kernel can correctly charge these objects.

      We have written a PoC to demonstrate the effect of the missing-charging
      bugs.  The PoC takes around 1,200MB of unaccounted memory, while it is
      charged for only 362MB of memory usage.  We evaluated the PoC on QEMU
      x86_64 v5.2.90 + Linux kernel v5.10.19 + Debian buster.  All limits,
      including ulimits and sysctl variables, are left at their defaults.
      Specifically, the hard NOFILE limit and nr_open in sysctl are both
      1,048,576.
      
      /*------------------------- POC code ----------------------------*/
      
      #define _GNU_SOURCE
      #include <sys/types.h>
      #include <sys/file.h>
      #include <time.h>
      #include <sys/wait.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <signal.h>
      #include <sched.h>
      #include <fcntl.h>
      #include <linux/mount.h>
      
      #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                              } while (0)
      
      #define STACK_SIZE (8 * 1024)
      #ifndef __NR_fsopen
      #define __NR_fsopen 430
      #endif
      static inline int fsopen(const char *fs_name, unsigned int flags)
      {
              return syscall(__NR_fsopen, fs_name, flags);
      }
      
      static char thread_stack[512][STACK_SIZE];
      
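      int thread_fn(void* arg)
      {
        /* Each fsopen() allocates an fs_context in the kernel; the
         * descriptors are deliberately never closed. */
        for (int i = 0; i < 800000; ++i) {
          int fsfd = fsopen("nfs", FSOPEN_CLOEXEC);
          if (fsfd == -1) {
            errExit("fsopen");
          }
        }
        while(1);  /* keep the descriptors, and their kernel objects, alive */
        return 0;
      }

      int main(int argc, char *argv[]) {
        int thread_pid;
        for (int i = 0; i < 1; ++i) {
          thread_pid = clone(thread_fn, thread_stack[i] + STACK_SIZE, \
            SIGCHLD, NULL);
          if (thread_pid == -1) {
            errExit("clone");
          }
        }
        while(1);
        return 0;
      }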
      
      /*-------------------------- end --------------------------------*/
      
      Link: https://lkml.kernel.org/r/1626517201-24086-1-git-send-email-nglaive@gmail.com
      Signed-off-by: Yutian Yang <nglaive@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <shenwenbo@zju.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb902cb4
    • Shakeel Butt's avatar
      memcg: infrastructure to flush memcg stats · aa48e47e
      Shakeel Butt authored
      
      
      At the moment memcg stats are read in four contexts:
      
      1. memcg stat user interfaces
      2. dirty throttling
      3. page fault
      4. memory reclaim
      
      Currently the kernel flushes the stats for the first two cases.  Flushing
      the stats for the remaining two cases may have a performance impact.
      Always flushing the memcg stats on the page fault code path may
      negatively impact the performance of applications.  In addition,
      flushing in the memory reclaim code path, though treated as a slowpath,
      can become a source of contention on the global lock taken for stat
      flushing, because when the system or a memcg is under memory pressure,
      many tasks may enter the reclaim path.
      
      This patch uses the following mechanisms to solve these challenges:

      1. Periodically flush the stats from the root memcg every 2 seconds.
         This bounds how long the stats can stay out of sync.

      2. Asynchronously flush the stats after a fixed number of stat updates.
         In the worst case the stats can be out of sync by O(nr_cpus * BATCH)
         for 2 seconds.

      3. To avoid a thundering herd of stat flushers, particularly from the
         memory reclaim context, introduce a memcg-local spinlock and let only
         one flusher be active at a time.  This could have been done through
         cgroup_rstat_lock, but that lock is used by other subsystems and by
         userspace reading memcg stats.  So it is better to keep the flushers
         introduced by this patch decoupled from cgroup_rstat_lock.  However,
         we would have to use the irqsafe version of rstat flush, but that is
         fine because this code path flushes the whole tree and does the work
         for everyone.  No one will be waiting for that worker.
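      The batching and single-flusher ideas in points 2 and 3 can be sketched
      as a user-space analogue in C11.  This is not the kernel code.  The
      names, NR_CPUS_SKETCH and FLUSH_BATCH are hypothetical.  An atomic flag
      taken with compare-and-exchange stands in for a trylock on the
      memcg-local spinlock.  The flush runs inline rather than from a worker,
      and the periodic 2-second flush is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count */
#define FLUSH_BATCH   64   /* hypothetical per-CPU batch size */

struct stats_sketch {
    _Atomic long percpu_delta[NR_CPUS_SKETCH]; /* pending, unflushed updates */
    _Atomic long total;                        /* aggregated (flushed) value */
    atomic_int updates_since_flush;            /* drives batched flushes */
    atomic_bool flushing;                      /* one flusher at a time */
};

static void stats_init(struct stats_sketch *s)
{
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        atomic_init(&s->percpu_delta[cpu], 0);
    atomic_init(&s->total, 0);
    atomic_init(&s->updates_since_flush, 0);
    atomic_init(&s->flushing, false);
}

/* Fold every CPU's pending delta into the total.  Returns false without
 * waiting if another flusher is already active (the "trylock" failed). */
static bool stats_try_flush(struct stats_sketch *s)
{
    bool expected = false;
    if (!atomic_compare_exchange_strong(&s->flushing, &expected, true))
        return false;
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        atomic_fetch_add(&s->total, atomic_exchange(&s->percpu_delta[cpu], 0));
    atomic_store(&s->updates_since_flush, 0);
    atomic_store(&s->flushing, false);
    return true;
}

/* Record an update on `cpu`.  Once more than NR_CPUS_SKETCH * FLUSH_BATCH
 * updates are pending, flush, so the total lags by a roughly bounded
 * number of updates. */
static void stats_update(struct stats_sketch *s, int cpu, long delta)
{
    atomic_fetch_add(&s->percpu_delta[cpu], delta);
    if (atomic_fetch_add(&s->updates_since_flush, 1) + 1 >
        NR_CPUS_SKETCH * FLUSH_BATCH)
        (void)stats_try_flush(s); /* the kernel queues a worker instead */
}

static long stats_read(struct stats_sketch *s)
{
    return atomic_load(&s->total);
}
```

      A periodic timer (point 1) would simply call stats_try_flush() every
      2 seconds.  A caller that loses the compare-and-exchange returns
      immediately and relies on the active flusher's result, which is what
      prevents a thundering herd from the reclaim path.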
      
      [shakeelb@google.com: fix sleep-in-wrong context bug]
        Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com
      
      Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa48e47e
    • Shakeel Butt's avatar
      memcg: switch lruvec stats to rstat · 7e1c0d6f
      Shakeel Butt authored
      Commit 2d146aa3 ("mm: memcontrol: switch to rstat") switched memcg
      stats to the rstat infrastructure but skipped converting the lruvec
      stats, because such stats are read in performance-critical code paths
      and flushing them could have hurt application performance.  This patch
      converts the lruvec stats to rstat, and later patches add mechanisms to
      keep the performance impact to a minimum.
      
      The rstat conversion comes at a price, namely memory cost.  Effectively,
      this patch reverts the savings made by commit f3344adf ("mm:
      memcontrol: optimize per-lruvec stats counter memory usage").  However,
      this cost is justified by the negative impact of inaccurate lruvec
      stats on many heuristics.  One such case is reported in [1].
      
      The memory reclaim code is filled with a plethora of heuristics, and
      many of those heuristics read the lruvec stats.  So inaccurate stats can
      make such heuristics ineffective.  [1] reports the impact of inaccurate
      lruvec stats on the "cache trim mode" heuristic.  Inaccurate lruvec
      stats can affect the deactivation and anon aging heuristics as well.
      
      [1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/
      
      Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com
      Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e1c0d6f
    • Vasily Averin's avatar
      memcg: enable accounting for pids in nested pid namespaces · fab827db
      Vasily Averin authored
      Commit 5d097056 ("kmemcg: account certain kmem allocations to memcg")
      enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
      but forgot to adjust the setting for nested pid namespaces.  As a result,
      pid memory is not accounted exactly where it is really needed, inside
      memcg-limited containers with their own pid namespaces.
      
      Pid was one of the first kernel objects enabled for memcg accounting.
      init_pid_ns.pid_cachep is marked with SLAB_ACCOUNT, so one would expect
      any new pids in the system to be memcg-accounted.

      However, I recently noticed that this is wrong.  Nested pid namespaces
      create their own slab caches for pid objects, because nested pids are
      larger: they contain ids for all parent pid namespaces as well as for
      their own.  The problem is that these slab caches are _NOT_ marked with
      SLAB_ACCOUNT; as a result, pids allocated in nested pid namespaces are
      not memcg-accounted.

      A pid struct in a nested pid namespace consumes up to 500 bytes of
      memory; 100000 such objects give up to ~50MB of unaccounted memory,
      which allows a container to exceed its assigned memcg limits.
      
      Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com
      Fixes: 5d097056 ("kmemcg: account certain kmem allocations to memcg")
      Cc: stable@vger.kernel.org
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fab827db
    • Suren Baghdasaryan's avatar
      mm, memcg: inline swap-related functions to improve disabled memcg config · 01c4b28c
      Suren Baghdasaryan authored
      
      
      Inline the mem_cgroup_try_charge_swap, mem_cgroup_uncharge_swap and
      cgroup_throttle_swaprate functions to perform the mem_cgroup_disabled
      static key check inline before calling the main body of the function.
      This minimizes the memcg overhead in the pagefault and exit_mmap paths
      when memcgs are disabled using the cgroup_disable=memory command-line
      option.  This change results in a ~1% overhead reduction when running
      the PFT test [1], comparing the {CONFIG_MEMCG=n} configuration against
      {CONFIG_MEMCG=y, cgroup_disable=memory} on an 8-core ARM64 Android
      device.
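      The inlining pattern can be sketched in user-space C.  The names are
      hypothetical: charge_swap_sketch() stands in for a wrapper such as
      mem_cgroup_try_charge_swap().  A plain global bool replaces the
      mem_cgroup_disabled() static key.  The GCC/Clang noinline attribute
      keeps the main body out of line, as it would be if it lived in a
      separate .c file.

```c
#include <stdbool.h>

/* Stand-in for the mem_cgroup_disabled() static key. */
static bool memcg_disabled_sketch = true;

/* Counts entries into the out-of-line body, to show when it is skipped. */
static int slow_path_calls;

/* Out-of-line main body; noinline mimics a definition in another .c file. */
__attribute__((noinline)) static int charge_swap_body_sketch(int nr_pages)
{
    slow_path_calls++;
    return nr_pages > 0 ? 0 : -1;
}

/* Inline wrapper: the cheap disabled check runs at the call site, so the
 * memcg-disabled case never pays for the function call. */
static inline int charge_swap_sketch(int nr_pages)
{
    if (memcg_disabled_sketch)
        return 0;
    return charge_swap_body_sketch(nr_pages);
}
```

      With the check in the inline wrapper, the disabled case costs one
      branch at the call site instead of a call plus a branch.  A real static
      key goes further and replaces the test with a runtime-patched jump or
      no-op.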
      
      [1] https://lkml.org/lkml/2006/8/29/294, also used in the mmtests suite
      
      Link: https://lkml.kernel.org/r/20210713010934.299876-3-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c4b28c
    • Suren Baghdasaryan's avatar
      mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg config · 2c8d8f97
      Suren Baghdasaryan authored
      
      
      Inline the mem_cgroup_{charge/uncharge} and mem_cgroup_uncharge_list
      functions to perform the mem_cgroup_disabled static key check inline
      before calling the main body of the function.  This minimizes the memcg
      overhead in the pagefault and exit_mmap paths when memcgs are disabled
      using the cgroup_disable=memory command-line option.

      This change results in a ~0.4% overhead reduction when running the PFT
      test [1], comparing the {CONFIG_MEMCG=n} configuration against
      {CONFIG_MEMCG=y, cgroup_disable=memory} on an 8-core ARM64 Android
      device.
      
      [1] https://lkml.org/lkml/2006/8/29/294, also used in the mmtests suite
      
      Link: https://lkml.kernel.org/r/20210713010934.299876-2-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c8d8f97
    • Suren Baghdasaryan's avatar
      mm, memcg: add mem_cgroup_disabled checks in vmpressure and swap-related functions · 56cab285
      Suren Baghdasaryan authored
      
      
      Add mem_cgroup_disabled checks in the vmpressure, mem_cgroup_uncharge_swap
      and cgroup_throttle_swaprate functions.  This minimizes the memcg overhead
      in the pagefault and exit_mmap paths when memcgs are disabled using the
      cgroup_disable=memory command-line option.

      This change results in a ~2.1% overhead reduction when running the PFT
      test [1], comparing the {CONFIG_MEMCG=n, CONFIG_MEMCG_SWAP=n}
      configuration against {CONFIG_MEMCG=y, CONFIG_MEMCG_SWAP=y,
      cgroup_disable=memory} on an 8-core ARM64 Android device.
      
      [1] https://lkml.org/lkml/2006/8/29/294, also used in the mmtests suite
      
      Link: https://lkml.kernel.org/r/20210713010934.299876-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56cab285
    • Hugh Dickins's avatar
      shmem: shmem_writepage() split unlikely i915 THP · 1e6decf3
      Hugh Dickins authored
      drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback()
      which calls shmem_writepage() from a shrinker: that usually works well
      enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been
      set to "always" (intended to be usable) or "force" (forces huge everywhere
      for easy testing), shmem_writepage() is surprised to be called with a huge
      page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find out
      where the crash happens when CONFIG_DEBUG_VM is off).
      
      LRU page reclaim always splits the shmem huge page first: I'd prefer not
      to demand that of i915, so check and split compound in shmem_writepage().
      
      Patch history: when first sent last year
      http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils
      https://lore.kernel.org/linux-mm/20200919042009.bomzxmrg7%25akpm@linux-foundation.org/
      Matthew Wilcox noticed that tail pages were wrongly left clean.  This
      version brackets the split with Set and Clear PageDirty as he suggested:
      which works very well, even if it falls short of our aspirations.  And
      recently I realized that the crash is not limited to the testing option
      "force", but affects "always" too: which is more important to fix.
      
      Link: https://lkml.kernel.org/r/bac6158c-8b3d-4dca-cffc-4982f58d9794@google.com
      Fixes: 2d6692e6 ("drm/i915: Start writeback from the shrinker")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e6decf3
    • Hugh Dickins's avatar
      huge tmpfs: decide stat.st_blksize by shmem_is_huge() · a7fddc36
      Hugh Dickins authored
      4.18 commit 89fdcd26 ("mm: shmem: make stat.st_blksize return huge
      page size if THP is on") added is_huge_enabled() to decide st_blksize: if
      hugeness is to be defined per file, that will need to be replaced by
      shmem_is_huge().
      
      This does give a different answer (No) for small files on a
      "huge=within_size" mount: but that can be considered a minor bugfix.  And
      a different answer (No) for default files on a "huge=advise" mount: I'm
      reluctant to complicate it, just to reproduce the same debatable answer as
      before.
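      To observe this from user space, stat() a file on a tmpfs mount and
      inspect st_blksize.  The helper below is a minimal sketch with a
      hypothetical name.  Per the commit referenced above, the value should
      be the huge page size when shmem decides the file would be huge, and
      the base page size otherwise.  The exact numbers depend on the
      architecture and on the huge= mount option.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Return the st_blksize that stat() reports for `path`, or -1 on error. */
static long blksize_of(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    return (long)st.st_blksize;
}
```

      On x86_64, a large file on a tmpfs mounted with huge=always should
      report 2097152.  A small file on a huge=within_size mount should now
      report the base page size.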
      
      Link: https://lkml.kernel.org/r/af7fb3f9-4415-9e8e-fdac-b1a5253ad21@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7fddc36
    • Hugh Dickins's avatar
      huge tmpfs: shmem_is_huge(vma, inode, index) · 5e6e5a12
      Hugh Dickins authored
      
      
      Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
      that a consistent set of checks can be applied, even when the inode is
      accessed through read/write syscalls (with NULL vma) instead of mmaps (the
      index argument is seldom of interest, but required by mount option
      "huge=within_size").  Clean up and rearrange the checks a little.
      
      This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
      were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes.
      
      Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the
      obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing
      else needs that symlink check, so leave it there in shmem_getpage_gfp().
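      The decision order can be modelled in user-space C.  This is a
      simplified sketch, not shmem's code, and the enum, struct and function
      names are hypothetical.  The deny/force overrides of
      /sys/kernel/mm/transparent_hugepage/shmem_enabled are folded into a
      global_mode argument.  It assumes 4KiB base pages and 512-page (2MiB)
      huge pages, as on x86_64.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH    4096UL
#define HPAGE_PMD_NR_SKETCH  512UL  /* 2MiB / 4KiB, as on x86_64 */

enum huge_mode { HUGE_NEVER, HUGE_ALWAYS, HUGE_WITHIN_SIZE, HUGE_ADVISE,
                 HUGE_DENY, HUGE_FORCE };

struct vma_sketch {
    bool nohugepage;    /* range marked MADV_NOHUGEPAGE */
    bool thp_disabled;  /* process opted out via PR_SET_THP_DISABLE */
    bool hugepage;      /* range marked MADV_HUGEPAGE */
};

static unsigned long round_up_ul(unsigned long x, unsigned long to)
{
    return (x + to - 1) / to * to;
}

/* global_mode: only DENY/FORCE override; anything else defers to the
 * huge= mount option in mount_mode.  vma is NULL for read()/write(). */
static bool is_huge_sketch(enum huge_mode global_mode,
                           enum huge_mode mount_mode,
                           const struct vma_sketch *vma,
                           unsigned long i_size, unsigned long index)
{
    if (global_mode == HUGE_DENY)
        return false;
    if (vma && (vma->nohugepage || vma->thp_disabled))
        return false;
    if (global_mode == HUGE_FORCE)
        return true;

    switch (mount_mode) {
    case HUGE_ALWAYS:
        return true;
    case HUGE_WITHIN_SIZE: {
        /* Huge only if i_size covers the whole huge extent holding index. */
        unsigned long need = round_up_ul(index + 1, HPAGE_PMD_NR_SKETCH);
        unsigned long have = round_up_ul(i_size, PAGE_SIZE_SKETCH) /
                             PAGE_SIZE_SKETCH;
        if (have >= need)
            return true;
    }
        /* fall through */
    case HUGE_ADVISE:
        return vma != NULL && vma->hugepage;
    default:
        return false;
    }
}
```

      Passing NULL for vma models read()/write() access through the inode.
      The within_size case also shows why small files on such a mount now
      answer No.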
      
      Link: https://lkml.kernel.org/r/23a77889-2ddc-b030-75cd-44ca27fd4d1@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e6e5a12
    • Hugh Dickins's avatar
      huge tmpfs: SGP_NOALLOC to stop collapse_file() on race · acdd9f8e
      Hugh Dickins authored
      
      
      khugepaged's collapse_file() currently uses SGP_NOHUGE to tell
      shmem_getpage() not to try allocating a huge page, in the very unlikely
      event that a racing hole-punch removes the swapped or fallocated page as
      soon as the i_pages lock is dropped.
      
      We want to consolidate shmem's huge decisions, removing SGP_HUGE and
      SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress
      the protection in this case - Yang Shi points out that the huge page would
      remain indefinitely, charged to root instead of the intended memcg.
      
      collapse_file() should not even allocate a small page in this case: why
      proceed if someone is punching a hole?  SGP_READ is almost the right flag
      here, except that it optimizes away from a fallocated page, with NULL to
      tell caller to fill with zeroes (like a hole); whereas collapse_file()'s
      sequence relies on using a cache page.  Add SGP_NOALLOC just for this.
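      The lookup outcomes can be summarised with a small model.  The mode,
      slot and result names are hypothetical.  MODE_READ, MODE_NOALLOC and
      MODE_CACHE stand for SGP_READ, SGP_NOALLOC and SGP_CACHE.  Swap-in is
      ignored, and "fallocated" means a page that is present but not yet
      Uptodate.

```c
enum mode_sketch   { MODE_READ, MODE_NOALLOC, MODE_CACHE };
enum slot_state    { SLOT_HOLE, SLOT_FALLOCATED, SLOT_UPTODATE };
enum lookup_result { RET_NULL, RET_PAGE, RET_ALLOCATE };

/* What a shmem_getpage_gfp()-like lookup does for each mode and slot. */
static enum lookup_result getpage_sketch(enum mode_sketch mode,
                                         enum slot_state slot)
{
    switch (slot) {
    case SLOT_UPTODATE:
        return RET_PAGE;
    case SLOT_FALLOCATED:
        /* READ optimizes a fallocated page away: NULL tells the caller to
         * fill with zeroes, as for a hole.  NOALLOC keeps the cache page. */
        return mode == MODE_READ ? RET_NULL : RET_PAGE;
    case SLOT_HOLE:
    default:
        /* Neither READ nor NOALLOC may allocate a new page. */
        return (mode == MODE_READ || mode == MODE_NOALLOC) ? RET_NULL
                                                           : RET_ALLOCATE;
    }
}
```

      The NOALLOC row is the one collapse_file() relies on.  A punched hole
      yields NULL, so it can give up rather than allocate.  A fallocated page
      is still handed back for it to use.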
      
      There are too many consecutive "if (page"s there in shmem_getpage_gfp():
      group it better; and fix the outdated "bring it back from swap" comment.
      
      Link: https://lkml.kernel.org/r/1355343b-acf-4653-ef79-6aee40214ac5@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acdd9f8e
    • Hugh Dickins's avatar
      huge tmpfs: move shmem_huge_enabled() upwards · c852023e
      Hugh Dickins authored
      
      
      shmem_huge_enabled() is about to be enhanced into shmem_is_huge(), so that
      it can be used more widely throughout: before making functional changes,
      shift it to its final position (to avoid forward declaration).
      
      Link: https://lkml.kernel.org/r/16fec7b7-5c84-415a-8586-69d8bf6a6685@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c852023e
    • Hugh Dickins's avatar
      huge tmpfs: revert shmem's use of transhuge_vma_enabled() · b9e2faaf
      Hugh Dickins authored
      5.14 commit e6be37b2 ("mm/huge_memory.c: add missing read-only THP
      checking in transparent_hugepage_enabled()") added transhuge_vma_enabled()
      as a wrapper for two very different checks (one check is whether the app
      has marked its address range not to use THPs, the other check is whether
      the app is running in a hierarchy that has been marked never to use THPs).
      shmem_huge_enabled() prefers to show those two checks explicitly, as
      before.
      
      Link: https://lkml.kernel.org/r/45e5338-18d-c6f9-c17e-34f510bc1728@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9e2faaf
    • Hugh Dickins's avatar
      huge tmpfs: remove shrinklist addition from shmem_setattr() · 2b5bbcb1
      Hugh Dickins authored
      There's a block of code in shmem_setattr() to add the inode to
      shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates
      from before 5.7 changed truncation to do split_huge_page() for itself, and
      should have been removed at that time.
      
      I am over-stating that: split_huge_page() can fail (notably if there's an
      extra reference to the page at that time), so there might be value in
      retrying.  But there were already retries as truncation worked through the
      tails, and this addition risks repeating unsuccessful retries
      indefinitely: I'd rather remove it now, and work on reducing the chance of
      split_huge_page() failures separately, if we need to.
      
      Link: https://lkml.kernel.org/r/b73b3492-8822-18f9-83e2-938528cdde94@google.com
      Fixes: 71725ed1 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b5bbcb1
    • Hugh Dickins's avatar
      huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE · d144bf62
      Hugh Dickins authored
      A successful shmem_fallocate() guarantees that the extent has been
      reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
      But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
      split huge pages and free their excess beyond i_size; and by other uses of
      split_huge_page() near i_size.
      
      It's sad to add a shmem inode field just for this, but I did not find a
      better way to keep the guarantee.  A flag to say KEEP_SIZE has been used
      would be cheaper, but I'm averse to unclearable flags.  The fallocend
      field is not perfect either (many disjoint ranges might be fallocated),
      but good enough; and gains another use later on.
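      The guarantee can be expressed as a bound on which tail pages a split
      may free.  This is a minimal sketch, not the kernel's split_huge_page()
      code, and the function names are hypothetical.  It assumes page
      indices in 4KiB units and a fallocend value that records the end, in
      pages, of the furthest fallocate.

```c
#include <stdbool.h>

#define PAGE_SIZE_SK 4096UL

/* First page index that may be freed: never below the last page of data
 * (eof) nor below the end of a KEEP_SIZE reservation (fallocend). */
static unsigned long split_keep_end(unsigned long i_size,
                                    unsigned long fallocend)
{
    unsigned long eof = (i_size + PAGE_SIZE_SK - 1) / PAGE_SIZE_SK;
    return eof > fallocend ? eof : fallocend;
}

static bool tail_may_be_freed(unsigned long index, unsigned long i_size,
                              unsigned long fallocend)
{
    return index >= split_keep_end(i_size, fallocend);
}
```

      For example, a 1-byte file that was fallocated with
      FALLOC_FL_KEEP_SIZE across 512 pages keeps tail page 100.  Before
      this fix, only i_size was consulted, so that page could have been
      freed.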
      
      Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d144bf62
    • Hugh Dickins
      huge tmpfs: fix fallocate(vanilla) advance over huge pages · 050dcb5c
      Hugh Dickins authored
      Patch series "huge tmpfs: shmem_is_huge() fixes and cleanups".
      
      A series of huge tmpfs fixes and cleanups.
      
      This patch (of 9):
      
      shmem_fallocate() goes to a lot of trouble to leave its newly allocated
      pages !Uptodate, partly to identify and undo them on failure, partly to
      leave the overhead of clearing them until later.  But the huge page case
      did not skip to the end of the extent, walked through the tail pages one
      by one, and appeared to work just fine: but in doing so, cleared and
      Uptodated the huge page, so there was no way to undo it on failure.
      
      And by setting Uptodate too soon, it messed up both its nr_falloced and
      nr_unswapped counts, so that the intended "time to give up" heuristic did
      not work at all.
      
      Now advance immediately to the end of the huge extent, with a comment on
      why this is more than just an optimization.  But although this speeds up
      huge tmpfs fallocation, it does leave the clearing until first use, and
      some users may have come to appreciate slow fallocate but fast first use:
      if they complain, then we can consider adding a pass to clear at the end.
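The difference between stepping through tail pages and advancing past the whole huge extent can be seen in a small model. This is a hypothetical sketch, not shmem_fallocate() itself: `falloc_walk()` and `HPAGE_NR_SKETCH` are made-up names, it assumes every index in the range is backed by a huge page, and it only counts lookups. According to the message above, visiting the tail pages one by one is what ended up clearing the huge page and marking it Uptodate; jumping to the end of the extent leaves it !Uptodate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: base pages per huge page (512 = 2 MiB / 4 KiB).
 * Must be a power of two for the rounding trick below. */
#define HPAGE_NR_SKETCH 512UL

/*
 * Count the lookups a fallocate-style loop makes over page indices
 * [start, end), assuming every index is backed by a huge page.
 * With skip_huge, after handling one index the loop jumps to the first
 * index of the next huge extent instead of visiting each tail page.
 */
static unsigned long falloc_walk(unsigned long start, unsigned long end,
				 bool skip_huge)
{
	unsigned long index = start;
	unsigned long lookups = 0;

	assert(start <= end);
	while (index < end) {
		lookups++;
		if (skip_huge)
			index = (index | (HPAGE_NR_SKETCH - 1)) + 1;
		else
			index++;
	}
	return lookups;
}
```

For a 4 MiB range starting at index 0 (indices 0 to 1023), the per-page walk makes 1024 lookups while the jumping walk makes 2; an unaligned start such as index 100 still lands on the next huge-page boundary (512) after a single step.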
      
      Link: https://lkml.kernel.org/r/da632211-8e3e-6b1-aee-ab24734429a0@google.com
      Link: https://lkml.kernel.org/r/16201bd2-70e-37e2-e89b-5f929430da@google.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      050dcb5c
    • Miaohe Lin
      shmem: include header file to declare swap_info · 86a2f3f2
      Miaohe Lin authored
      
      
      It's bad practice to declare swap_info[] extern in a .c file.  Include
      the corresponding header file instead.
      
      Link: https://lkml.kernel.org/r/20210812120350.49801-5-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86a2f3f2
    • Miaohe Lin
      shmem: remove unneeded function forward declaration · cdd89d4c
      Miaohe Lin authored
      
      
      The forward declarations of shmem_should_replace_page() and
      shmem_replace_page() are unnecessary.  Remove them.
      
      Link: https://lkml.kernel.org/r/20210812120350.49801-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdd89d4c
    • Miaohe Lin
      shmem: remove unneeded header file · b6378fc8
      Miaohe Lin authored
      
      
      Since commit bf6ebd97aba0 ("userfaultfd/shmem: modify
      shmem_mfill_atomic_pte to use install_pte()"), the pte has been installed
      and the MMU cache updated by mfill_atomic_install_pte().  update_mmu_cache()
      is no longer called here, so remove the now-unneeded tlbflush.h include.
      
      Link: https://lkml.kernel.org/r/20210812120350.49801-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6378fc8