  1. Nov 07, 2021
      mm: remove redundant smp_wmb() · ed33b5a6
      Qi Zheng authored
      
      
       The smp_wmb() in __pte_alloc() is used to ensure that all pte setup is
       visible before the pte is made visible to other CPUs by being put into
       page tables.  We only need this when the pte is actually populated, so
       move it to pmd_install().  __pte_alloc_kernel(), __p4d_alloc(),
       __pud_alloc() and __pmd_alloc() are similar to this case.
      
      We can also defer smp_wmb() to the place where the pmd entry is really
       populated by the preallocated pte.  There are two kinds of users of
       preallocated ptes: one is filemap & finish_fault(), the other is THP.  The
      former does not need another smp_wmb() because the smp_wmb() has been
      done by pmd_install().  Fortunately, the latter also does not need
      another smp_wmb() because there is already a smp_wmb() before populating
      the new pte when the THP uses a preallocated pte to split a huge pmd.
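       
       A minimal sketch of where the barrier ends up (essentially the
       resulting pmd_install() in mm/memory.c; treat it as illustrative
       rather than the exact hunk):
       
          void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
          {
                  spinlock_t *ptl = pmd_lock(mm, pmd);
       
                  if (likely(pmd_none(*pmd))) {   /* Has another populated it ? */
                          mm_inc_nr_ptes(mm);
                          /*
                           * Ensure all pte setup is visible before the pte is
                           * made visible to other CPUs by being put into page
                           * tables.
                           */
                          smp_wmb();
                          pmd_populate(mm, pmd, *pte);
                          *pte = NULL;
                  }
                  spin_unlock(ptl);
          }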
      
      Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com
       Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
       Reviewed-by: Muchun Song <songmuchun@bytedance.com>
       Acked-by: David Hildenbrand <david@redhat.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mika Penttila <mika.penttila@nextfour.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: introduce pmd_install() helper · 03c4f204
      Qi Zheng authored
      
      
      Patch series "Do some code cleanups related to mm", v3.
      
      This patch (of 2):
      
       Currently the same few lines are repeated three times in the code.
       Deduplicate them with the newly introduced pmd_install() helper.
      
      Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com
       Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
       Reviewed-by: David Hildenbrand <david@redhat.com>
       Reviewed-by: Muchun Song <songmuchun@bytedance.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Mika Penttila <mika.penttila@nextfour.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: add zap_skip_check_mapping() helper · 91b61ef3
      Peter Xu authored
      
      
      Use the helper for the checks.  Rename "check_mapping" into
      "zap_mapping" because "check_mapping" looks like a bool but in fact it
      stores the mapping itself.  When it's set, we check the mapping (it must
      be non-NULL).  When it's cleared we skip the check, which works like the
      old way.
      
      Move the duplicated comments to the helper too.
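       
       Roughly what the new helper looks like (a sketch of
       zap_skip_check_mapping(), for illustration):
       
          /*
           * When details->zap_mapping is set, only pages belonging to that
           * mapping are zapped; when it is cleared, nothing is skipped.
           */
          static inline bool zap_skip_check_mapping(struct zap_details *details,
                                                    struct page *page)
          {
                  if (!details || !page)
                          return false;
       
                  return details->zap_mapping &&
                         (details->zap_mapping != page_rmapping(page));
          }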
      
      Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: drop first_index/last_index in zap_details · 232a6a1c
      Peter Xu authored
      
      
      The first_index/last_index parameters in zap_details are actually only
       used in unmap_mapping_range_tree().  Meanwhile, this function is only
       called once, by unmap_mapping_pages().
      
      Instead of passing these two variables through the whole stack of page
      zapping code, remove them from zap_details and let them simply be
      parameters of unmap_mapping_range_tree(), which is inlined.
      
      Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Alistair Popple <apopple@nvidia.com>
       Reviewed-by: David Hildenbrand <david@redhat.com>
       Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
       Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: clear vmf->pte after pte_unmap_same() returns · 2ca99358
      Peter Xu authored
      
      
       pte_unmap_same() will always unmap the pte pointer.  After the unmap,
       vmf->pte will not be valid any more, so we should clear it.
      
      It was safe only because no one is accessing vmf->pte after
      pte_unmap_same() returns, since the only caller of pte_unmap_same() (so
      far) is do_swap_page(), where vmf->pte will in most cases be overwritten
      very soon.
      
       Pass vmf directly into pte_unmap_same(), which also lets us avoid the
       long parameter list and should be a nice cleanup.
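       
       A sketch of the resulting helper (close to the upstream version; the
       CONFIG guards are kept to show where the pte_same() check still
       applies):
       
          static inline int pte_unmap_same(struct vm_fault *vmf)
          {
                  int same = 1;
          #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
                  if (sizeof(pte_t) > sizeof(unsigned long)) {
                          spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
                          spin_lock(ptl);
                          same = pte_same(*vmf->pte, vmf->orig_pte);
                          spin_unlock(ptl);
                  }
          #endif
                  pte_unmap(vmf->pte);
                  /* The pointer is no longer valid after pte_unmap(). */
                  vmf->pte = NULL;
                  return same;
          }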
      
      Link: https://lkml.kernel.org/r/20210915181533.11188-1-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
       Reviewed-by: David Hildenbrand <david@redhat.com>
       Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
       Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte · 9ae0f87d
      Peter Xu authored
      
      
      Patch series "mm: A few cleanup patches around zap, shmem and uffd", v4.
      
       IMHO all of them are very nice cleanups to existing code already;
       they're all small and self-contained.  They'll be needed by the
       upcoming uffd-wp series.
      
      This patch (of 4):
      
       It was done conditionally before, because there is one shmem special
       case that uses SetPageDirty() instead.  However that's not necessary;
       it is easier and cleaner to do it unconditionally in
       mfill_atomic_install_pte().
      
      The most recent discussion about this is here, where Hugh explained the
      history of SetPageDirty() and why it's possible that it's not required
      at all:
      
      https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
      
      Currently mfill_atomic_install_pte() has three callers:
      
              1. shmem_mfill_atomic_pte
              2. mcopy_atomic_pte
              3. mcontinue_atomic_pte
      
       After the change: case (1) has its SetPageDirty replaced by the dirty
       bit on the pte (so we finally unify them), case (2) has no functional
       change at all as it has page_in_cache==false, and case (3) may add a
       dirty bit to the pte.  However, since case (3) is UFFDIO_CONTINUE for
       shmem, the page is almost certainly dirty already, because
       UFFDIO_CONTINUE normally requires another process to modify the page
       cache and kick the faulted thread, so it should not make a real
       difference either.
      
       This should make it much easier to follow which cases set the dirty
       bit for uffd, as we now simply set it for all uffd-related ioctls.
       Meanwhile, there is no special handling of SetPageDirty() when it is
       not needed.
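       
       The relevant hunk in mfill_atomic_install_pte() then boils down to
       something like this (a sketch, not the complete function; the local
       variables come from the surrounding code):
       
          _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
          /* Set the dirty bit unconditionally for all three callers. */
          _dst_pte = pte_mkdirty(_dst_pte);
          if (page_in_cache && !vm_shared)
                  writable = false;
          if (writable)
                  _dst_pte = pte_mkwrite(_dst_pte);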
      
      Link: https://lkml.kernel.org/r/20210915181456.10739-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210915181456.10739-2-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/memory.c: avoid unnecessary kernel/user pointer conversion · b063e374
      Amit Daniel Kachhap authored
      
      
      Annotating a pointer from __user to kernel and then back again might
      confuse sparse.  In copy_huge_page_from_user() it can be avoided by
      removing the intermediate variable since it is never used.
      
      Link: https://lkml.kernel.org/r/20210914150820.19326-1-amit.kachhap@arm.com
       Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: use __pfn_to_section() instead of open coding it · f1dc0db2
      Rolf Eike Beer authored
      
      
      It is defined in the same file just a few lines above.
      
      Link: https://lkml.kernel.org/r/4598487.Rc0NezkW7i@mobilepool36.emlix.com
       Signed-off-by: Rolf Eike Beer <eb@emlix.com>
       Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/mmap.c: fix a data race of mm->total_vm · 7866076b
      Peng Liu authored
      
      
      The variable mm->total_vm could be accessed concurrently during mmaping
      and system accounting as noticed by KCSAN,
      
        BUG: KCSAN: data-race in __acct_update_integrals / mmap_region
      
        read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3:
         mmap_region+0x6dc/0x1400
         do_mmap+0x794/0xca0
         vm_mmap_pgoff+0xdf/0x150
         ksys_mmap_pgoff+0xe1/0x380
         do_syscall_64+0x37/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2:
         __acct_update_integrals+0x187/0x1d0
         acct_account_cputime+0x3c/0x40
         update_process_times+0x5c/0x150
         tick_sched_timer+0x184/0x210
         __run_hrtimer+0x119/0x3b0
         hrtimer_interrupt+0x350/0xaa0
         __sysvec_apic_timer_interrupt+0x7b/0x220
         asm_call_irq_on_stack+0x12/0x20
         sysvec_apic_timer_interrupt+0x4d/0x80
         asm_sysvec_apic_timer_interrupt+0x12/0x20
         smp_call_function_single+0x192/0x2b0
         perf_install_in_context+0x29b/0x4a0
         __se_sys_perf_event_open+0x1a98/0x2550
         __x64_sys_perf_event_open+0x63/0x70
         do_syscall_64+0x37/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Reported by Kernel Concurrency Sanitizer on:
        CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        Ubuntu-1.8.2-1ubuntu1 04/01/2014
      
       vm_stat_account(), which is called by mmap_region(), increases
       total_vm, and __acct_update_integrals() may read total_vm at the same
       time.  This causes a data race which leads to undefined behaviour.  To
       avoid the potential bad read/write, use volatile accesses (which also
       act as a compiler barrier) so that the behaviour is well defined.
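       
       A sketch of the kind of change this implies in vm_stat_account()
       (illustrative hunk; WRITE_ONCE()/READ_ONCE() provide the volatile
       access and compiler barrier mentioned above):
       
          -       mm->total_vm += npages;
          +       WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) + npages);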
      
      Link: https://lkml.kernel.org/r/20210913105550.1569419-1-liupeng256@huawei.com
       Signed-off-by: Peng Liu <liupeng256@huawei.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: prohibit unconditional exceeding the limit of dying tasks · a4ebf1b6
      Vasily Averin authored
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It is assumed that the amount of the memory charged by those
      tasks is bound and most of the memory will get released while the task
       is exiting.  This resembles the heuristic for the global OOM situation,
       where tasks get access to memory reserves.  There is no global memory
       shortage at the memcg level, so the memcg heuristic is more relaxed.
      
      The above assumption is overly optimistic though.  E.g.  vmalloc can
      scale to really large requests and the heuristic would allow that.  We
      used to have an early break in the vmalloc allocator for killed tasks
       but this has been reverted by commit b8c8a338 ("Revert "vmalloc: back
       off when the current task is killed"").  There are likely other
      similar code paths which do not check for fatal signals in an
      allocation&charge loop.  Also there are some kernel objects charged to a
      memcg which are not bound to a process life time.
      
      It has been observed that it is not really hard to trigger these
      bypasses and cause global OOM situation.
      
      One potential way to address these runaways would be to limit the amount
      of excess (similar to the global OOM with limited oom reserves).  This
      is certainly possible but it is not really clear how much of an excess
      is desirable and still protects from global OOMs as that would have to
      consider the overall memcg configuration.
      
      This patch is addressing the problem by removing the heuristic
      altogether.  Bypass is only allowed for requests which either cannot
      fail or where the failure is not desirable while excess should be still
      limited (e.g.  atomic requests).  Implementation wise a killed or dying
      task fails to charge if it has passed the OOM killer stage.  That should
      give all forms of reclaim chance to restore the limit before the failure
      (ENOMEM) and tell the caller to back off.
      
      In addition, this patch renames should_force_charge() helper to
       task_is_dying() because now its use is not associated with forced
      charging.
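       
       The renamed helper is small; roughly:
       
          /* A task is dying if it is an OOM victim, fatally signalled, or exiting. */
          static bool task_is_dying(void)
          {
                  return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
                          (current->flags & PF_EXITING);
          }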
      
      This patch depends on pagefault_out_of_memory() to not trigger
      out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
      and cause a global OOM killer.
      
      Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
       Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
       Suggested-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm, oom: do not trigger out_of_memory from the #PF · 60e2793d
      Michal Hocko authored
      
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM
      which in turn results in pagefault_out_of_memory.  This can happen for 2
      different reasons.  a) Memcg is out of memory and we rely on
      mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
      normal allocation fails.
      
      The latter is quite problematic because allocation paths already trigger
      out_of_memory and the page allocator tries really hard to not fail
      allocations.  Anyway, if the OOM killer has been already invoked there
      is no reason to invoke it again from the #PF path.  Especially when the
      OOM condition might be gone by that time and we have no way to find out
      other than allocate.
      
      Moreover if the allocation failed and the OOM killer hasn't been invoked
      then we are unlikely to do the right thing from the #PF context because
       we have already lost the allocation context and restrictions and
      therefore might oom kill a task from a different NUMA domain.
      
      This all suggests that there is no legitimate reason to trigger
      out_of_memory from pagefault_out_of_memory so drop it.  Just to be sure
      that no #PF path returns with VM_FAULT_OOM without allocation print a
      warning that this is happening before we restart the #PF.
      
      [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
      This is a local problem related to memcg, however, it causes unnecessary
      global OOM kills that are repeated over and over again and escalate into a
      real disaster.  This has been broken since kmem accounting has been
      introduced for cgroup v1 (3.8).  There was no kmem specific reclaim for
      the separate limit so the only way to handle kmem hard limit was to return
      with ENOMEM.  In upstream the problem will be fixed by removing the
      outdated kmem limit, however stable and LTS kernels cannot do it and are
      still affected.  This patch fixes the problem and should be backported
      into stable/LTS.]
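       
       With the full series applied, pagefault_out_of_memory() reduces to
       roughly the following (an illustrative sketch combining this patch
       with the fatal_signal_pending() check added elsewhere in the series):
       
          void pagefault_out_of_memory(void)
          {
                  static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                                DEFAULT_RATELIMIT_BURST);
       
                  /* Memcg OOM is handled by mem_cgroup_oom_synchronize(). */
                  if (mem_cgroup_oom_synchronize(true))
                          return;
       
                  /* A dying task just gets to die; no global OOM kill. */
                  if (fatal_signal_pending(current))
                          return;
       
                  /* No out_of_memory() call here any more, only a warning. */
                  if (__ratelimit(&pfoom_rs))
                          pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
          }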
      
      Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
       Signed-off-by: Michal Hocko <mhocko@suse.com>
       Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks · 0b28179a
      Vasily Averin authored
      
      
      Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
      
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It can be misused and allowed to trigger global OOM from inside
       a memcg-limited container.  On the other hand, if a memcg allocation
       called from inside the #PF handler fails, it triggers a global OOM from
       inside pagefault_out_of_memory().
      
      To prevent these problems this patchset:
        (a) removes execution of out_of_memory() from
            pagefault_out_of_memory(), because nobody can explain why it is
            necessary.
        (b) allows memcg to fail allocations of dying/killed tasks.
      
      This patch (of 3):
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM
       which in turn results in pagefault_out_of_memory(), which in turn
       executes out_of_memory() and can kill a random task.
      
      An allocation might fail when the current task is the oom victim and
      there are no memory reserves left.  The OOM killer is already handled at
      the page allocator level for the global OOM and at the charging level
      for the memcg one.  Both have much more information about the scope of
      allocation/charge request.  This means that either the OOM killer has
      been invoked properly and didn't lead to the allocation success or it
      has been skipped because it couldn't have been invoked.  In both cases
      triggering it from here is pointless and even harmful.
      
      It makes much more sense to let the killed task die rather than to wake
      up an eternally hungry oom-killer and send him to choose a fatter victim
      for breakfast.
      
      Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
       Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
       Suggested-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: list_lru: only add memcg-aware lrus to the global lru list · 3eef1127
      Muchun Song authored
      
      
       The non-memcg-aware lrus are always skipped when traversing the global
       lru list, which is not efficient.  We can instead add only the
       memcg-aware lrus to the global lru list, making traversal more
       efficient.
      
      Link: https://lkml.kernel.org/r/20211025124353.55781-1-songmuchun@bytedance.com
       Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: memcontrol: remove the kmem states · e80216d9
      Muchun Song authored
      
      
       Now the kmem state is only used to indicate whether kmem is offline.
       However, we can set ->kmemcg_id to -1 to indicate that instead, so the
       kmem states can be removed to simplify the code.
      
      Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.com
       Signed-off-by: Muchun Song <songmuchun@bytedance.com>
       Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: memcontrol: remove kmemcg_id reparenting · 64268868
      Muchun Song authored
      
      
      Since slab objects and kmem pages are charged to object cgroup instead
      of memory cgroup, memcg_reparent_objcgs() will reparent this cgroup and
      all its descendants to its parent cgroup.  This already makes further
      list_lru_add()'s add elements to the parent's list.  So it is
      unnecessary to change kmemcg_id of an offline cgroup to its parent's id.
      It just wastes CPU cycles.  Just remove the redundant code.
      
      Link: https://lkml.kernel.org/r/20211025125102.56533-1-songmuchun@bytedance.com
       Signed-off-by: Muchun Song <songmuchun@bytedance.com>
       Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: list_lru: fix the return value of list_lru_count_one() · 41d17431
      Muchun Song authored
       Since commit 2788cf0c ("memcg: reparent list_lrus and free kmemcg_id on
       css offline"), ->nr_items can be negative during memory cgroup
       reparenting.  In this case, list_lru_count_one() will return an unusual
       and huge value, which can surprise users.  At least for now it hasn't
       affected any users.  But it is better to let list_lru_count_one()
       return zero when ->nr_items is negative.
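       
       A sketch of the resulting read side (names approximate the list_lru
       code; the essential part is the clamp at the end):
       
          unsigned long list_lru_count_one(struct list_lru *lru,
                                           int nid, struct mem_cgroup *memcg)
          {
                  struct list_lru_node *nlru = &lru->node[nid];
                  struct list_lru_one *l;
                  long count;
       
                  rcu_read_lock();
                  l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
                  count = READ_ONCE(l->nr_items);
                  rcu_read_unlock();
       
                  /* ->nr_items can be transiently negative during reparenting. */
                  if (unlikely(count < 0))
                          count = 0;
       
                  return count;
          }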
      
      Link: https://lkml.kernel.org/r/20211025124910.56433-1-songmuchun@bytedance.com
       Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: list_lru: remove holding lru lock · 60ec6a48
      Muchun Song authored
       Since commit e5bc3af7 ("rcu: Consolidate PREEMPT and !PREEMPT
       synchronize_rcu()"), the critical section of a spin lock can serve as
       an RCU read-side critical section, which already allows readers that
       hold nlru->lock to avoid taking the rcu read lock.  So holding the
       lock there is no longer needed; just remove it.
      
      Link: https://lkml.kernel.org/r/20211025124534.56345-1-songmuchun@bytedance.com
       Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg, kmem: further deprecate kmem.limit_in_bytes · 58056f77
      Shakeel Butt authored
       The deprecation process of kmem.limit_in_bytes started with commit
       0158115f ("memcg, kmem: deprecate kmem.limit_in_bytes"), which also
       explains in detail the motivation behind the deprecation.  To summarize,
       it is the unexpected behavior on hitting the kmem limit.  This patch
       moves the deprecation process to the next stage by disallowing setting
       the kmem limit.  In future we might just remove the kmem.limit_in_bytes
       file completely.
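       
       The write handler then simply rejects the limit; roughly, inside the
       switch in mem_cgroup_write():
       
          case _KMEM:
                  /* kmem.limit_in_bytes is deprecated and can no longer be set. */
                  ret = -EOPNOTSUPP;
                  break;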
      
      [akpm@linux-foundation.org: s/ENOTSUPP/EOPNOTSUPP/]
      [arnd@arndb.de: mark cancel_charge() inline]
        Link: https://lkml.kernel.org/r/20211022070542.679839-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20211019153408.2916808-1-shakeelb@google.com
       Signed-off-by: Shakeel Butt <shakeelb@google.com>
       Signed-off-by: Arnd Bergmann <arnd@arndb.de>
       Acked-by: Roman Gushchin <guro@fb.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/list_lru.c: prefer struct_size over open coded arithmetic · 16f6bf26
      Len Baker authored
      
      
      As noted in the "Deprecated Interfaces, Language Features, Attributes,
      and Conventions" documentation [1], size calculations (especially
      multiplication) should not be performed in memory allocator (or similar)
      function arguments due to the risk of them overflowing.
      
      This could lead to values wrapping around and a smaller allocation being
      made than the caller was expecting.  Using those allocations could lead
      to linear overflows of heap memory and other misbehaviors.
      
      So, use the struct_size() helper to do the arithmetic instead of the
      argument "size + count * size" in the kvmalloc() functions.
      
      Also, take the opportunity to refactor the memcpy() call to use the
      flex_array_size() helper.
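       
       For illustration only, with a hypothetical flexible-array struct (not
       the actual list_lru structures), the transformation looks like this:
       
          struct item_array {
                  struct rcu_head rcu;
                  void *items[];          /* flexible array member */
          };
       
          /* Before: open-coded arithmetic in the allocator argument can wrap. */
          new = kvmalloc(sizeof(*new) + nr * sizeof(void *), GFP_KERNEL);
       
          /* After: struct_size()/flex_array_size() guard against overflow. */
          new = kvmalloc(struct_size(new, items, nr), GFP_KERNEL);
          memcpy(new->items, old->items, flex_array_size(new, items, old_nr));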
      
      This code was detected with the help of Coccinelle and audited and fixed
      manually.
      
      [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
      
      Link: https://lkml.kernel.org/r/20211017105929.9284-1-len.baker@gmx.com
       Signed-off-by: Len Baker <len.baker@gmx.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/memcg: remove obsolete memcg_free_kmem() · 38d4ef44
      Waiman Long authored
       Since commit d648bcc7 ("mm: kmem: make memcg_kmem_enabled()
       irreversible"), the only thing memcg_free_kmem() does is to call
       memcg_offline_kmem() when the memcg is still online, which can happen
       when online_css() fails due to -ENOMEM.
      
      However, the name memcg_free_kmem() is confusing and it is more clear
       and straightforward to call memcg_offline_kmem() directly from
      mem_cgroup_css_free().
      
      Link: https://lkml.kernel.org/r/20211005202450.11775-1-longman@redhat.com
       Signed-off-by: Waiman Long <longman@redhat.com>
       Suggested-by: Roman Gushchin <guro@fb.com>
       Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
       Reviewed-by: Shakeel Butt <shakeelb@google.com>
       Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Muchun Song <songmuchun@bytedance.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: unify memcg stat flushing · fd25a9e0
      Shakeel Butt authored
       The memcg stats can be flushed in multiple contexts and potentially in
       parallel too.  For example, multiple parallel user space readers for
      memcg stats will contend on the rstat locks with each other.  There is
      no need for that.  We just need one flusher and everyone else can
      benefit.
      
       In addition, after commit aa48e47e ("memcg: infrastructure to flush
       memcg stats") the kernel periodically flushes the memcg stats from the
       root, so the other flushers will potentially have much less work to do.
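       
       The single-flusher idea is roughly (an illustrative sketch of the
       trylock pattern):
       
          static void __mem_cgroup_flush_stats(void)
          {
                  unsigned long flag;
       
                  /*
                   * One flusher at a time is enough: contended callers simply
                   * return and benefit from the flush already in progress.
                   */
                  if (!spin_trylock_irqsave(&stats_flush_lock, flag))
                          return;
       
                  cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
                  spin_unlock_irqrestore(&stats_flush_lock, flag);
          }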
      
      Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
       Signed-off-by: Shakeel Butt <shakeelb@google.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: flush stats only if updated · 11192d9c
      Shakeel Butt authored
      At the moment, the kernel flushes the memcg stats on every refault and
       also on every reclaim iteration.  Although rstat maintains a per-cpu
       update tree, on a flush the kernel still has to go through all the
       per-cpu rstat update trees to check whether there is anything to flush.
       This patch adds tracking on the stats update side to make the flush
       side cleverer, by skipping the flush if there is no update.
      
       The stats update codepath is very performance sensitive for many
       workloads and benchmarks.  So, we cannot follow what commit aa48e47e
       ("memcg: infrastructure to flush memcg stats") did, which was to
       trigger an async flush through queue_work() and caused a lot of
       performance regression reports.  That got reverted by commit 1f828223
       ("memcg: flush lruvec stats in the refault").
      
       In this patch we keep the stats update codepath very minimal and let
       the stats reader side flush the stats only when the updates are over a
       specific threshold.  For now the threshold is (nr_cpus * CHARGE_BATCH).
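       
       A sketch of the two sides (names approximate the patch;
       MEMCG_CHARGE_BATCH is the existing charge batch size):
       
          static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
          {
                  cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
                  /* Count batches of updates per cpu, cheaply. */
                  if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH))
                          atomic_inc(&stats_flush_threshold);
          }
       
          void mem_cgroup_flush_stats(void)
          {
                  /* Readers flush only once enough updates have accumulated. */
                  if (atomic_read(&stats_flush_threshold) > num_online_cpus())
                          __mem_cgroup_flush_stats();
          }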
      
      To evaluate the impact of this patch, an 8 GiB tmpfs file is created on
      a system with swap-on-zram and the file was pushed to swap through
      memory.force_empty interface.  On reading the whole file, the memcg stat
      flush in the refault code path is triggered.  With this patch, we
      observed 63% reduction in the read time of 8 GiB file.
      
      Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.com
       Signed-off-by: Shakeel Butt <shakeelb@google.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: default avatar"Michal Koutný" <mkoutny@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      mm/memcg: drop swp_entry_t* in mc_handle_file_pte() · 48384b0b
      Peter Xu authored
       It is unused after the rework of commit f5df8635 ("mm: use
       find_get_incore_page in memcontrol").
      
      Link: https://lkml.kernel.org/r/20210916193014.80129-1-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Muchun Song <songmuchun@bytedance.com>
       Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: optimise put_pages_list() · 988c69f1
      Matthew Wilcox (Oracle) authored
      
      
      Instead of calling put_page() one page at a time, pop pages off the list
      if their refcount was too high and pass the remainder to
      put_unref_page_list().  This should be a speed improvement, but I have
      no measurements to support that.  Current callers do not care about
      performance, but I hope to add some which do.
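       
       A simplified sketch of the idea (the upstream version also
       special-cases compound pages; free_unref_page_list() is the batched
       page freeing path):
       
          void put_pages_list(struct list_head *pages)
          {
                  struct page *page, *next;
       
                  list_for_each_entry_safe(page, next, pages, lru) {
                          /* Pages that stay referenced are popped off the list. */
                          if (!put_page_testzero(page))
                                  list_del(&page->lru);
                  }
       
                  /* Whatever is left dropped to refcount zero: free in one batch. */
                  free_unref_page_list(pages);
                  INIT_LIST_HEAD(pages);
          }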
      
      Link: https://lkml.kernel.org/r/20211007192138.561673-1-willy@infradead.org
       Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
       Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/swapfile: fix an integer overflow in swap_show() · 642929a2
      Rafael Aquini authored
      
      
      This one is just a minor nuisance for people going through /proc/swaps
      if any of their swapareas is bigger than, or equal to 1073741824 pages
      (4TB).
      
      seq_printf() format string casts as uint the conversion from pages to
      KB, and that will overflow in the aforementioned case.
      
       Although it is almost unthinkable that someone would actually set up
       such a big single swap area, there is a ticket recently filed against RHEL:
      https://bugzilla.redhat.com/show_bug.cgi?id=2008812
      
       Given that all other code sites that use format strings for the same
       swap pages-to-KB conversion cast it as ulong, this patch just follows
       suit.
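       
       To illustrate the arithmetic (a standalone sketch, not the actual
       hunk; m stands in for the seq_file used by swap_show()): with 4 KiB
       pages the conversion is a shift left by 2, so 2^30 pages becomes 2^32
       KiB, which no longer fits in 32 bits.
       
          unsigned long pages = 1UL << 30;                /* a 4 TiB swap area   */
          unsigned long kb = pages << (PAGE_SHIFT - 10);  /* 1UL << 32 on 64-bit */
       
          seq_printf(m, "%u\n", (unsigned int)kb);        /* truncates to 0      */
          seq_printf(m, "%lu\n", kb);                     /* prints 4294967296   */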
      
      Link: https://lkml.kernel.org/r/20211006184011.2579054-1-aquini@redhat.com
       Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/swapfile: remove needless request_queue NULL pointer check · 363dc512
      vulab authored
      
      
      The request_queue pointer returned from bdev_get_queue() shall never be
      NULL, so the null check is unnecessary, just remove it.
      
      Link: https://lkml.kernel.org/r/20210917082111.33923-1-vulab@iscas.ac.cn
       Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
       Acked-by: David Hildenbrand <david@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/gup: further simplify __gup_device_huge() · 20b7fee7
      John Hubbard authored
       Commit 6401c4eb ("mm: gup: fix potential pgmap refcnt leak in
       __gup_device_huge()") simplified the return paths, but didn't go quite
       far enough, as discussed in [1].
      
      Remove the "ret" variable entirely, because there is enough information
      already available to provide the return value.
      
      [1] https://lore.kernel.org/r/CAHk-=wgQTRX=5SkCmS+zfmpqubGHGJvXX_HgnPG8JSpHKHBMeg@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/20210904004224.86391-1-jhubbard@nvidia.com
       Signed-off-by: John Hubbard <jhubbard@nvidia.com>
       Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: move more expensive part of XA setup out of mapping check · f8ee8909
      Jens Axboe authored
      
      
      The fast path here is not needing any writeback, yet we spend time
      setting up the xarray lookup data upfront.  Move the part that actually
      needs to iterate the address space mapping into a separate helper,
      saving ~30% of the time here.
      
      Link: https://lkml.kernel.org/r/49f67983-b802-8929-edab-d807f745c9ca@kernel.dk
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Cc: Matthew Wilcox <willy@infradead.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/filemap.c: remove bogus VM_BUG_ON · d417b49f
      Matthew Wilcox (Oracle) authored
      It is not safe to check page->index without holding the page lock.  It
      can be changed if the page is moved between the swap cache and the page
      cache for a shmem file, for example.  There is a VM_BUG_ON below which
      checks page->index is correct after taking the page lock.
      
      Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
       Fixes: 5c211ba2 ("mm: add and use find_lock_entries")
       Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
       Reported-by: <syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: don't read i_size of inode unless we need it · 61d0017e
      Jens Axboe authored
      
      
      We always go through i_size_read(), and we rarely end up needing it.
      Push the read to down where we need to check it, which avoids it for
      most cases.
      
      It looks like we can even remove this check entirely, which might be
      worth pursuing.  But at least this takes it out of the hot path.
      
      Link: https://lkml.kernel.org/r/6b67981f-57d4-c80e-bc07-6020aa601381@kernel.dk
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
       Acked-by: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: simplify bdi refcounting · efee1713
      Christoph Hellwig authored
      
      
      Move grabbing and releasing the bdi refcount out of the common
      wb_init/wb_exit helpers into code that is only used for the non-default
      memcg driven bdi_writeback structures.
      
      [hch@lst.de: add comment]
        Link: https://lkml.kernel.org/r/20211027074207.GA12793@lst.de
      [akpm@linux-foundation.org: fix typo]
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-6-hch@lst.de
       Signed-off-by: Christoph Hellwig <hch@lst.de>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: don't automatically unregister bdis · 702f2d1e
      Christoph Hellwig authored
      
      
      All BDI users now unregister explicitly.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-5-hch@lst.de
       Signed-off-by: Christoph Hellwig <hch@lst.de>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fs: explicitly unregister per-superblock BDIs · 0b3ea092
      Christoph Hellwig authored
      
      
      Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
      associated with them, and unregister it when the superblock is shut
      down.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
       Signed-off-by: Christoph Hellwig <hch@lst.de>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mtd: call bdi_unregister explicitly · 9718c59c
      Christoph Hellwig authored
      
      
      Call bdi_unregister explicitly instead of relying on the automatic
      unregistration.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-3-hch@lst.de
       Signed-off-by: Christoph Hellwig <hch@lst.de>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: export bdi_unregister · c6fd3ac0
      Christoph Hellwig authored
      
      
      Patch series "simplify bdi unregistation".
      
      This series simplifies the BDI code to get rid of the magic
      auto-unregister feature that hid a recent block layer refcounting bug.
      
      This patch (of 5):
      
      To wind down the magic auto-unregister semantics we'll need to push this
      into modular code.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20211021124441.668816-2-hch@lst.de
       Signed-off-by: Christoph Hellwig <hch@lst.de>
       Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: stop filemap_read() from grabbing a superfluous page · 8c8387ee
      David Howells authored
      
      
      Under some circumstances, filemap_read() will allocate sufficient pages
      to read to the end of the file, call readahead/readpages on them and
      copy the data over - and then it will allocate another page at the EOF
      and call readpage on that and then ignore it.  This is unnecessary and a
      waste of time and resources.
      
      filemap_read() *does* check for this, but only after it has already done
      the allocation and I/O.  Fix this by checking before calling
      filemap_get_pages() also.
      
      Link: https://lkml.kernel.org/r/163472463105.3126792.7056099385135786492.stgit@warthog.procyon.org.uk
      Link: https://lore.kernel.org/r/160588481358.3465195.16552616179674485179.stgit@warthog.procyon.org.uk/
      Link: https://lore.kernel.org/r/163456863216.2614702.6384850026368833133.stgit@warthog.procyon.org.uk/
       Signed-off-by: David Howells <dhowells@redhat.com>
       Acked-by: Jeff Layton <jlayton@kernel.org>
       Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/page_ext.c: fix a comment · d1fea155
      Yinan Zhang authored
      
      
       I have noticed that the preceding macro is #ifndef CONFIG_SPARSEMEM, so
       I think the comment on the #else should be CONFIG_SPARSEMEM.
      
      Link: https://lkml.kernel.org/r/20211008140312.6492-1-zhangyinan2019@email.szu.edu.cn
       Signed-off-by: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      percpu: add __alloc_size attributes for better bounds checking · 17197dd4
      Kees Cook authored
      
      
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate percpu allocator interfaces, to provide additional hinting
      for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Note that due to the implementation of the percpu API, this is unlikely
      to ever actually provide compile-time checking beyond very simple
      non-SMP builds.  But, since they are technically allocators, mark them
      as such.
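       
       The annotation itself is a one-liner per prototype; for example
       (illustrative, see the patch for the full set of annotated
       declarations):
       
          extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
                                                   __alloc_size(1);
          extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);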
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-9-keescook@chromium.org
       Signed-off-by: Kees Cook <keescook@chromium.org>
       Co-developed-by: Daniel Micay <danielmicay@gmail.com>
       Signed-off-by: Daniel Micay <danielmicay@gmail.com>
       Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/page_alloc: add __alloc_size attributes for better bounds checking · abd58f38
      Kees Cook authored
      
      
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate page allocator interfaces, to provide additional hinting for
      better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-8-keescook@chromium.org
       Signed-off-by: Kees Cook <keescook@chromium.org>
       Co-developed-by: Daniel Micay <danielmicay@gmail.com>
       Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/vmalloc: add __alloc_size attributes for better bounds checking · 894f24bb
      Kees Cook authored
      
      
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate vmalloc allocator interfaces, to provide additional hinting
      for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-7-keescook@chromium.org
       Signed-off-by: Kees Cook <keescook@chromium.org>
       Co-developed-by: Daniel Micay <danielmicay@gmail.com>
       Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>