  1. Feb 25, 2017
    • mm: convert page_mkclean_one() to use page_vma_mapped_walk() · f27176cf
      Kirill A. Shutemov authored
      
      
      For consistency, it is worth converting all page_check_address() calls to
      page_vma_mapped_walk(), so we can drop the former.
      
      PMD handling here is future-proofing; we don't have users yet.  ext4
      with huge pages will be the first.
      
      Link: http://lkml.kernel.org/r/20170129173858.45174-7-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f27176cf
    • mm, rmap: check all VMAs that PTE-mapped THP can be part of · a8fa41ad
      Kirill A. Shutemov authored
      
      
      The current rmap code can miss a VMA that maps a PTE-mapped THP if the
      first subpage of the THP was unmapped from the VMA.
      
      We need to walk the rmap for the whole range of offsets that the THP
      covers, not only the first one.
      
      vma_address() also needs to be corrected to check the range instead of
      only the first subpage.
      
      Link: http://lkml.kernel.org/r/20170129173858.45174-6-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8fa41ad
    • mm: fix handling PTE-mapped THPs in page_idle_clear_pte_refs() · 699fa216
      Kirill A. Shutemov authored
      
      
      For PTE-mapped THPs, page_check_address_transhuge() is not adequate: it
      cannot find all relevant PTEs, only the first one.
      
      Let's switch it to page_vma_mapped_walk().
      
      I don't think it's a candidate for stable@: it's not fatal.
      
      Link: http://lkml.kernel.org/r/20170129173858.45174-5-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      699fa216
    • mm: fix handling PTE-mapped THPs in page_referenced() · 8eaedede
      Kirill A. Shutemov authored
      
      
      For PTE-mapped THPs, page_check_address_transhuge() is not adequate: it
      cannot find all relevant PTEs, only the first one.  This means we can
      miss some references to the page, which can result in suboptimal
      decisions by vmscan.
      
      Let's switch it to page_vma_mapped_walk().
      
      I don't think it's a candidate for stable@: it's not fatal.  The only
      side effect is that THP can be swapped out when it shouldn't be.
      
      Link: http://lkml.kernel.org/r/20170129173858.45174-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8eaedede
    • mm: introduce page_vma_mapped_walk() · ace71a19
      Kirill A. Shutemov authored
      
      
      Introduce a new interface to check if a page is mapped into a vma.  It
      aims to address shortcomings of page_check_address{,_transhuge}.
      
      The existing interface is not able to handle PTE-mapped THPs: it only
      finds the first PTE; the rest are left unnoticed.
      
      page_vma_mapped_walk() iterates over all possible mappings of the page in
      the vma.
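
      For illustration, a minimal sketch of the calling pattern this enables
      (an assumption based on the description above, not the exact in-tree
      code):

      	/* Hedged sketch: count the mappings of @page within @vma. */
      	#include <linux/mm.h>
      	#include <linux/rmap.h>

      	static int count_mappings(struct page *page, struct vm_area_struct *vma,
      				  unsigned long address)
      	{
      		struct page_vma_mapped_walk pvmw = {
      			.page = page,
      			.vma = vma,
      			.address = address,
      		};
      		int count = 0;

      		/* each iteration reports one PTE (or PMD) that maps the page */
      		while (page_vma_mapped_walk(&pvmw))
      			count++;

      		return count;
      	}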
      
      Link: http://lkml.kernel.org/r/20170129173858.45174-3-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ace71a19
    • uprobes: split THPs before trying to replace them · c8394812
      Kirill A. Shutemov authored
      
      
      Patch series "Fix few rmap-related THP bugs", v3.
      
      The patchset fixes handling of PTE-mapped THPs in page_referenced() and
      page_idle_clear_pte_refs().
      
      To achieve that I've introduced a new helper -- page_vma_mapped_walk() --
      which replaces all page_check_address{,_transhuge}() calls and covers all
      THP cases.
      
      Patchset overview:
        - First patch fixes one uprobe bug (unrelated to the rest of the
          patchset, just spotted it at the same time);
      
        - Patches 2-5 fix handling PTE-mapped THPs in page_referenced(),
          page_idle_clear_pte_refs() and rmap core;
      
        - Patches 6-12 convert all page_check_address{,_transhuge}() users
          (plus remove_migration_pte()) to page_vma_mapped_walk() and drop
          unused helpers.
      
      I think the fixes are not critical enough for stable@ as they don't lead
      to crashes or hangs, only suboptimal behaviour.
      
      This patch (of 12):
      
      For THPs, page_check_address() always fails.  It leads to an endless loop
      in uprobe_write_opcode().
      
      Testcase with huge-tmpfs (uprobes cannot probe anonymous memory).
      
      	mount -t debugfs none /sys/kernel/debug
      	mount -t tmpfs -o huge=always none /mnt
      	gcc -Wall -O2 -o /mnt/test -x c - <<EOF
      	int main(void)
      	{
      		return 0;
      	}
      	/* Padding to map the code segment with huge pmd */
      	asm (".zero 2097152");
      	EOF
      	echo 'p /mnt/test:0' > /sys/kernel/debug/tracing/uprobe_events
      	echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
      	/mnt/test
      
      Let's split THPs before trying to replace them.
      
      Link: http://lkml.kernel.org/r/20170129173858.45174-2-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8394812
    • mm/hotplug: enable memory hotplug for non-lru movable pages · 0efadf48
      Yisheng Xie authored
      We had considered all of the non-lru pages as unmovable before commit
      bda807d4 ("mm: migrate: support non-lru movable page migration").  But
      now some non-lru pages, like zsmalloc and virtio-balloon pages, have also
      become movable.  So we can offline such blocks by using non-lru page
      migration.
      
      This patch straightforwardly adds the non-lru migration code, which means
      adding non-lru related code to the functions which scan over pfns,
      collect pages to be migrated, and isolate them before migration.
      
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0efadf48
    • HWPOISON: soft offlining for non-lru movable page · 85fbe5d1
      Yisheng Xie authored
      Extend the soft offlining framework to support non-lru pages, which
      already support migration after commit bda807d4 ("mm: migrate: support
      non-lru movable page migration").

      When corrected memory errors occur on a non-lru movable page, we can
      choose to stop using it by migrating the data onto another page and
      disabling the original (maybe half-broken) one.
      
      Link: http://lkml.kernel.org/r/1485867981-16037-4-git-send-email-ysxie@foxmail.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85fbe5d1
    • mm/migration: make isolate_movable_page always defined · cbae0170
      Yisheng Xie authored
      
      
      Define isolate_movable_page() as a static inline function when
      CONFIG_MIGRATION is not enabled.  It should return -EBUSY here, which
      means it failed to isolate the movable page.
      
      This patch does not have any functional change but prepares for a later
      patch.
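
      A hedged sketch of what such a fallback could look like (assuming the
      int-returning prototype introduced elsewhere in this series):

      	#include <linux/errno.h>
      	#include <linux/mmzone.h>	/* isolate_mode_t */

      	struct page;

      	#ifdef CONFIG_MIGRATION
      	extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
      	#else
      	/* no migration support: isolation always fails with -EBUSY */
      	static inline int isolate_movable_page(struct page *page,
      					       isolate_mode_t mode)
      	{
      		return -EBUSY;
      	}
      	#endif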
      
      Link: http://lkml.kernel.org/r/1485867981-16037-3-git-send-email-ysxie@foxmail.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbae0170
    • mm/migration: make isolate_movable_page() return int type · 9e5bcd61
      Yisheng Xie authored
      Patch series "HWPOISON: soft offlining for non-lru movable page", v6.
      
      After Minchan's commit bda807d4 ("mm: migrate: support non-lru movable
      page migration"), some types of non-lru pages, like zsmalloc and
      virtio-balloon pages, also support migration.
      
      Therefore, we can:
      
      1) soft offline non-lru movable pages, which means when corrected memory
         errors occur on a non-lru movable page, we can stop using it by
         migrating the data onto another page and disabling the original
         (maybe half-broken) one.
      
      2) enable memory hotplug for non-lru movable pages, i.e. we may offline
         memory blocks which include such pages, by using non-lru page
         migration.
      
      This patchset is heavily dependent on non-lru movable page migration.
      
      This patch (of 4):
      
      Change the return type of isolate_movable_page() from bool to int.  It
      will return 0 when it isolates the movable page successfully, and -EBUSY
      when it fails.
      
      There is no functional change within this patch, but it prepares for a
      later patch.
      
      [xieyisheng1@huawei.com: v6]
        Link: http://lkml.kernel.org/r/1486108770-630-2-git-send-email-xieyisheng1@huawei.com
      Link: http://lkml.kernel.org/r/1485867981-16037-2-git-send-email-ysxie@foxmail.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e5bcd61
    • z3fold: add kref refcounting · 5a27aa82
      Vitaly Wool authored
      
      
      With both the coming and the already present locking optimizations,
      introducing kref to reference-count z3fold objects is the right thing to
      do.  Moreover, it makes the buddied list no longer necessary and allows
      for simpler handling of headless pages.
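
      For reference, the general kref pattern this builds on looks roughly
      like the following (a sketch with assumed field and helper names, not
      the actual z3fold hunk):

      	#include <linux/kernel.h>
      	#include <linux/kref.h>

      	struct z3fold_header_sketch {
      		struct kref refcount;	/* new: per-object reference count */
      		/* ... existing fields elided ... */
      	};

      	static void release_z3fold_page(struct kref *ref)
      	{
      		struct z3fold_header_sketch *zhdr =
      			container_of(ref, struct z3fold_header_sketch, refcount);
      		/* last reference dropped: free the backing page here */
      		(void)zhdr;
      	}

      	static void get_z3fold_page(struct z3fold_header_sketch *zhdr)
      	{
      		kref_get(&zhdr->refcount);
      	}

      	static void put_z3fold_page(struct z3fold_header_sketch *zhdr)
      	{
      		kref_put(&zhdr->refcount, release_z3fold_page);
      	}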
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170131214650.8ea78033d91ded233f552bc0@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a27aa82
    • z3fold: use per-page spinlock · 2f1e5e4d
      Vitaly Wool authored
      
      
      Most z3fold operations are in-page, such as modifying the z3fold page
      header or moving z3fold objects within a page.  Taking the per-pool
      spinlock to protect per-page objects is therefore suboptimal, and the
      idea of having a per-page spinlock (or rwlock) has been around for some
      time.
      
      This patch implements a spinlock-based per-page locking mechanism which
      is lightweight enough to normally fit into the z3fold header.
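
      Conceptually (a sketch; the actual field and helper names in z3fold.c
      may differ):

      	#include <linux/spinlock.h>

      	struct z3fold_page_header_sketch {
      		spinlock_t page_lock;	/* new: protects this page's header/objects */
      		/* ... existing fields elided ... */
      	};

      	static void z3fold_page_lock(struct z3fold_page_header_sketch *zhdr)
      	{
      		spin_lock(&zhdr->page_lock);
      	}

      	static void z3fold_page_unlock(struct z3fold_page_header_sketch *zhdr)
      	{
      		spin_unlock(&zhdr->page_lock);
      	}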
      
      Link: http://lkml.kernel.org/r/20170131214438.433e0a5fda908337b63206d3@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f1e5e4d
    • z3fold: extend compaction function · 1b096e5a
      Vitaly Wool authored
      
      
      z3fold_compact_page() currently only handles the situation when there's
      a single middle chunk within the z3fold page.  However, it may be worth
      moving the middle chunk closer to either the first or the last chunk,
      whichever is there, if the gap between them is big enough.
      
      This patch adds the relevant code, using the BIG_CHUNK_GAP define as a
      threshold for the middle chunk to be worth moving.
      
      Link: http://lkml.kernel.org/r/20170131214334.c4f3eac9a477af0fa9a22c46@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b096e5a
    • z3fold: fix header size related issues · ede93213
      Vitaly Wool authored
      
      
      Currently the whole kernel build will be stopped if the size of struct
      z3fold_header is greater than the size of one chunk, which is 64 bytes
      by default.  This patch instead defines the offset for z3fold objects as
      the size of the z3fold header in chunks.
      
      Also fixed are the calculation of num_free_chunks() and the address to
      move the middle chunk to in case of in-page compaction in
      z3fold_compact_page().
      
      Link: http://lkml.kernel.org/r/20170131214057.d98677032bc7b1c6c59a80c9@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ede93213
    • z3fold: make pages_nr atomic · 12d59ae6
      Vitaly Wool authored
      
      
      Convert pages_nr per-pool counter to atomic64_t.
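
      The conversion follows the usual atomic64_t pattern, roughly (a sketch;
      the surrounding pool code is elided and the helper names are made up):

      	#include <linux/atomic.h>
      	#include <linux/types.h>

      	struct z3fold_pool_sketch {
      		atomic64_t pages_nr;	/* was a plain counter under the pool lock */
      	};

      	static void z3fold_page_added(struct z3fold_pool_sketch *pool)
      	{
      		atomic64_inc(&pool->pages_nr);	/* safe without the pool lock */
      	}

      	static u64 z3fold_pool_pages(struct z3fold_pool_sketch *pool)
      	{
      		return atomic64_read(&pool->pages_nr);
      	}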
      
      Link: http://lkml.kernel.org/r/20170131213946.b828676ab17bbea42022c213@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12d59ae6
    • mm: fix get_user_pages() vs device-dax pud mappings · 220ced16
      Dan Williams authored
      
      
      A new unit test for the device-dax 1GB enabling currently fails with
      this warning before hanging the test thread:
      
       WARNING: CPU: 0 PID: 21 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1e3/0x1f0
       percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
       [..]
       CPU: 0 PID: 21 Comm: rcuos/1 Tainted: G           O    4.10.0-rc7-next-20170207+ #944
       [..]
       Call Trace:
        dump_stack+0x86/0xc3
        __warn+0xcb/0xf0
        warn_slowpath_fmt+0x5f/0x80
        ? rcu_nocb_kthread+0x27a/0x510
        ? dax_pmem_percpu_exit+0x50/0x50 [dax_pmem]
        percpu_ref_switch_to_atomic_rcu+0x1e3/0x1f0
        ? percpu_ref_exit+0x60/0x60
        rcu_nocb_kthread+0x339/0x510
        ? rcu_nocb_kthread+0x27a/0x510
        kthread+0x101/0x140
      
      The get_user_pages() path needs to arrange for references to be taken
      against the dev_pagemap instance backing the pud mapping.  Refactor the
      existing __gup_device_huge_pmd() to also account for the pud case.
      
      Link: http://lkml.kernel.org/r/148653181153.38226.9605457830505509385.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      220ced16
    • mm: replace FAULT_FLAG_SIZE with parameter to huge_fault · c791ace1
      Dave Jiang authored
      
      
      Since the introduction of FAULT_FLAG_SIZE to the vm_fault flags, it has
      been somewhat painful to get the flags set and removed at the correct
      locations.  More than one kernel oops was introduced due to the
      difficulty of getting the placement correct.
      
      Remove the flag values and introduce an input parameter to huge_fault
      that indicates the size of the page entry.  This makes the code easier
      to trace and should avoid the issues we see with the fault flags where
      removal of the flag was necessary in the fallback paths.
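
      The resulting interface is roughly the following (a sketch of the shape
      described above; the struct name is abbreviated for illustration):

      	struct vm_fault;

      	/* which page-table level the fault handler is being asked to fill */
      	enum page_entry_size {
      		PE_SIZE_PTE = 0,
      		PE_SIZE_PMD,
      		PE_SIZE_PUD,
      	};

      	struct vm_operations_struct_excerpt {
      		/* replaces the old ->pmd_fault(); the caller passes the size */
      		int (*huge_fault)(struct vm_fault *vmf,
      				  enum page_entry_size pe_size);
      	};

      A handler can then switch on pe_size and return VM_FAULT_FALLBACK for
      sizes it cannot service, letting the core retry with smaller pages.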
      
      Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c791ace1
    • dax: support for transparent PUD pages for device DAX · 9557feee
      Dave Jiang authored
      
      
      Add transparent huge PUD page support for device DAX by adding a
      pud_fault handler.
      
      Link: http://lkml.kernel.org/r/148545060002.17912.6765687780007547551.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9557feee
    • mm, x86: add support for PUD-sized transparent hugepages · a00cc7d9
      Matthew Wilcox authored
      
      
      The current transparent hugepage code only supports PMDs.  This patch
      adds support for transparent use of PUDs with DAX.  It does not include
      support for anonymous pages.  x86 support code is also added.
      
      Most of this patch simply parallels the work that was done for huge
      PMDs.  The only major difference is how the new ->pud_entry method in
      mm_walk works.  The ->pmd_entry method replaces the ->pte_entry method,
      whereas the ->pud_entry method works along with either ->pmd_entry or
      ->pte_entry.  The pagewalk code takes care of locking the PUD before
      calling ->pud_walk, so handlers do not need to worry whether the PUD is
      stable.
      
      [dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
        Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
      [dave.jiang@intel.com: native_pud_clear missing on i386 build]
        Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
      Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Tested-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a00cc7d9
    • mm,fs,dax: change ->pmd_fault to ->huge_fault · a2d58167
      Dave Jiang authored
      
      
      Patch series "1G transparent hugepage support for device dax", v2.
      
      The following series implements support for 1G transparent hugepages on
      x86 for device dax.  The bulk of the code was written by Matthew Wilcox a
      while back supporting transparent 1G hugepages for fs DAX.  I have
      forward ported the relevant bits to 4.10-rc.  The current submission has
      only the necessary code to support device DAX.
      
      Comments from Dan Williams: So the motivation and intended user of this
      functionality mirrors the motivation and users of 1GB page support in
      hugetlbfs.  Given expected capacities of persistent memory devices an
      in-memory database may want to reduce tlb pressure beyond what they can
      already achieve with 2MB mappings of a device-dax file.  We have
      customer feedback to that effect as Willy mentioned in his previous
      version of these patches [1].
      
      [1]: https://lkml.org/lkml/2016/1/31/52
      
      Comments from Nilesh @ Oracle:
      
      There are applications which have a process model; and if you assume
      10,000 processes attempting to mmap all the 6TB memory available on a
      server; we are looking at the following:
      
      processes         : 10,000
      memory            :    6TB
      pte @ 4k page size: 6TB / 4K * 8 bytes per pte * 10,000 processes = 120,000GB
      pmd @ 2M page size: 120,000 / 512 = ~240GB
      pud @ 1G page size: 240GB / 512 = ~480MB
      
      As you can see, with 2M pages this system will use up an exorbitant
      amount of DRAM to hold the page tables; but the 1G pages finally bring it
      down to a reasonable level.  Memory sizes will keep increasing, so this
      number will keep increasing.
      
      An argument can be made to convert the applications from process model
      to thread model, but in the real world that may not be always practical.
      Hopefully this helps explain the use case where this is valuable.
      
      This patch (of 3):
      
      In preparation for adding the ability to handle PUD pages, convert
      vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault.  The
      vm_fault structure is extended to include a union of the different page
      table pointers that may be needed, and three flag bits are reserved to
      indicate which type of pointer is in the union.
      
      [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
        Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
      [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
        Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
      Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2d58167
    • mm, page_alloc: use static global work_struct for draining per-cpu pages · bd233f53
      Mel Gorman authored
      
      
      As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static
      work_struct to co-ordinate the draining of per-cpu pages on the
      workqueue.  Only one task can drain at a time but this is better than
      the previous scheme that allowed multiple tasks to send IPIs at a time.
      
      One consideration is whether parallel requests should synchronise
      against each other.  This patch does not synchronise for a global drain
      as the common case for such callers is expected to be multiple parallel
      direct reclaimers competing for pages when the watermark is close to
      min.  Draining the per-cpu list is unlikely to make much progress and
      serialising the drain is of dubious merit.  Drains are synchronised for
      callers such as memory hotplug and CMA that care about the drain being
      complete when the function returns.
      
      Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd233f53
    • mm, page_alloc: don't check cpuset allowed twice in fast-path · 51047820
      Vlastimil Babka authored
      Since commit 682a3385 ("mm, page_alloc: inline the fast path of the
      zonelist iterator") we replace a NULL nodemask with
      cpuset_current_mems_allowed in the fast path, so that
      get_page_from_freelist() filters nodes allowed by the cpuset via
      for_next_zone_zonelist_nodemask().
      
      In that case it's pointless to additionally check __cpuset_zone_allowed()
      in each iteration, which we can avoid by not adding ALLOC_CPUSET to
      alloc_flags in that scenario.
      
      This saves some cycles in the allocator fast path on systems with one or
      more non-root cpuset configured.  In the slow path, ALLOC_CPUSET is
      reset according to __alloc_pages_slowpath().  Without configured
      cpusets, this code is disabled by a static key.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51047820
    • mm, page_alloc: remove redundant checks from alloc fastpath · df76cee6
      Vlastimil Babka authored
      
      
      The allocation fast path contains two similar checks for zoneref->zone
      being NULL, where zoneref points either to the first zone in the
      zonelist, or to the preferred zone.  These can be NULL either due to
      empty zonelist, or no zone being compatible with given nodemask or
      task's cpuset.
      
      These checks are unnecessary, because the zonelist walks in
      first_zones_zonelist() and get_page_from_freelist() handle a NULL
      starting zoneref->zone or preferred_zoneref->zone safely.  It's safe to
      fall back to __alloc_pages_slowpath() where we also have the check early
      enough.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df76cee6
    • zram: remove waitqueue for IO done · a09759ac
      Minchan Kim authored
      
      
      zram_reset_device() waits for ongoing writepage operations to be
      completed by the zram->refcount logic.  However, it's pointless because,
      before the reset, we prevent further opening of zram via zram->claim and
      flush all pending IO with fsync_bdev, so there should be no pending IO at
      zram_reset_device().
      
      So let's remove that code, which is even broken due to the lack of a
      wake_up elsewhere.
      
      Link: http://lkml.kernel.org/r/1485145031-11661-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a09759ac
    • mm: fix comments for mmap_init() · 3edf41d8
      seokhoon.yoon authored
      
      
      mmap_init() is no longer associated with the VMA slab, so fix its comment.
      
      Link: http://lkml.kernel.org/r/1485182601-9294-1-git-send-email-iamyooon@gmail.com
      Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3edf41d8
    • mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf · 11bac800
      Dave Jiang authored
      
      
      ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
      take a vma and vmf parameter when the vma already resides in vmf.
      
      Remove the vma parameter to simplify things.
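
      In other words, the method signatures shrink roughly as follows (a
      sketch of the before/after shape, not the full diff):

      	struct vm_fault;

      	/*
      	 * Before:
      	 *	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
      	 * After (the vma is reachable as vmf->vma):
      	 */
      	struct vm_operations_struct_excerpt {
      		int (*fault)(struct vm_fault *vmf);
      		int (*page_mkwrite)(struct vm_fault *vmf);
      		int (*pfn_mkwrite)(struct vm_fault *vmf);
      	};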
      
      [arnd@arndb.de: fix ARM build]
        Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11bac800
    • mm, page_alloc: only use per-cpu allocator for irq-safe requests · 374ad05a
      Mel Gorman authored
      
      
      Many workloads that allocate pages are not handling an interrupt at the
      time of allocation.  As allocation requests may be from IRQ context, it's
      necessary to
      disable/enable IRQs for every page allocation.  This cost is the bulk of
      the free path but also a significant percentage of the allocation path.
      
      This patch alters the locking and checks such that only irq-safe
      allocation requests use the per-cpu allocator.  All others acquire the
      irq-safe zone->lock and allocate from the buddy allocator.  It relies on
      disabling preemption to safely access the per-cpu structures.  It could
      be slightly modified to avoid soft IRQs using it but it's not clear it's
      worthwhile.
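
      Conceptually, the allocation-side gate looks something like the sketch
      below (simplified; the helper names are made up and the real code in
      mm/page_alloc.c is structured differently):

      	#include <linux/gfp.h>
      	#include <linux/mmzone.h>
      	#include <linux/preempt.h>

      	struct page *pcplist_alloc_sketch(struct zone *zone, int migratetype);
      	struct page *buddy_alloc_sketch(struct zone *zone, unsigned int order,
      					gfp_t gfp_flags, int migratetype);

      	static struct page *rmqueue_sketch(struct zone *zone, unsigned int order,
      					   gfp_t gfp_flags, int migratetype)
      	{
      		if (likely(order == 0) && !in_interrupt()) {
      			struct page *page;

      			/* per-cpu list path: preemption, not IRQs, is disabled */
      			preempt_disable();
      			page = pcplist_alloc_sketch(zone, migratetype);
      			preempt_enable();
      			return page;
      		}

      		/* everyone else takes the irq-safe zone->lock / buddy path */
      		return buddy_alloc_sketch(zone, order, gfp_flags, migratetype);
      	}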
      
      This modification may slow allocations from IRQ context slightly but the
      main gain from the per-cpu allocator is that it scales better for
      allocations from multiple contexts.  There is an implicit assumption
      that intensive allocations from IRQ contexts on multiple CPUs from a
      single NUMA node are rare and that the vast majority of scaling issues
      are encountered in !IRQ contexts such as page faulting.  It's worth
      noting that this patch is not required for a bulk page allocator but it
      significantly reduces the overhead.
      
      The following is results from a page allocator micro-benchmark.  Only
      order-0 is interesting as higher orders do not use the per-cpu allocator
      
                                                4.10.0-rc2                 4.10.0-rc2
                                                   vanilla               irqsafe-v1r5
      Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
      Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
      Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
      Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
      Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
      Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
      Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
      Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
      Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
      Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
      Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
      Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
      Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
      Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
      Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
      Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
      Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
      Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
      Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
      Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
      Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
      Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
      Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
      Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
      Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
      Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
      Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
      Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
      Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
      Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
      Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
      Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
      Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
      Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
      Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
      Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
      Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
      Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
      Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
      Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
      Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
      Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
      Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
      Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
      Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
      
      This is the alloc, free and total overhead of allocating order-0 pages
      in batches of 1 page up to 16384 pages.  Avoiding disabling/enabling
      overhead massively reduces overhead.  Alloc overhead is roughly reduced
      by 14-20% in most cases.  The free path is reduced by 26-46% and the
      total reduction is significant.
      
      Many users require zeroing of pages from the page allocator, which is the
      bulk of the cost of allocation.  Hence, the impact on a basic page faulting
      benchmark is not that significant
      
                                    4.10.0-rc2            4.10.0-rc2
                                       vanilla          irqsafe-v1r5
      Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
      Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
      Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
      Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
      CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
      CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
      Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
      Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
      
      This is from aim9 and the most notable outcome is that fault variability
      is reduced by the patch.  The headline improvement is small as the
      overall fault cost, zeroing, page table insertion etc dominate relative
      to disabling/enabling IRQs in the per-cpu allocator.
      
      Similarly, little benefit was seen on networking benchmarks both
      localhost and between physical server/clients where other costs
      dominate.  It's possible that this will only be noticeable on very high
      speed networks.
      
      Jesper Dangaard Brouer independently tested this with a separate
      microbenchmark from
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      Micro-benchmarked with [1] page_bench02:
       modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
        rmmod page_bench02 ; dmesg --notime | tail -n 4
      
      Compared to baseline: 213 cycles(tsc) 53.417 ns
       - against this     : 184 cycles(tsc) 46.056 ns
       - Saving           : -29 cycles
       - Very close to expected 27 cycles saving [see below [2]]
      
      Micro benchmarking via time_bench_sample[3], we get the cost of these
      operations:
      
       time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
       time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
       time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
       time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
       time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
       time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
       time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
       time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
       time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
       [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
       time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
       [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
       time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
       time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
       time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
      
      Thus, expected improvement is: 38-11 = 27 cycles.
      
      [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
        Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      374ad05a
    • mm, page_alloc: do not depend on cpu hotplug locks inside the allocator · a459eeb7
      Michal Hocko authored
      
      
      Dmitry has reported the following lockdep splat:
        lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
        __mutex_lock_common kernel/locking/mutex.c:521 [inline]
        mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
        pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
        __alloc_percpu+0x24/0x30 mm/percpu.c:1075
        smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
        cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
        cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
        _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      pcpu_alloc
        pcpu_alloc_mutex
      
        get_online_cpus+0x62/0x90 kernel/cpu.c:248
        drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
        __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
        __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
        __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
        __alloc_pages include/linux/gfp.h:426 [inline]
        __alloc_pages_node include/linux/gfp.h:439 [inline]
        alloc_pages_node include/linux/gfp.h:453 [inline]
        pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
        pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
        pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
        __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
        bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
        array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
        find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
        map_create kernel/bpf/syscall.c:188 [inline]
        SYSC_bpf kernel/bpf/syscall.c:870 [inline]
        SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
        entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      pcpu_alloc
        pcpu_alloc_mutex
      drain_all_pages
        get_online_cpus
          cpu_hotplug.lock
      
        cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
        _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      
      Pulling cpu hotplug locks inside the page allocator is just too
      dangerous.  Let's remove the dependency by dropping get_online_cpus()
      from drain_all_pages().  This is not so simple though, because now we do
      not have protection against cpu hotplug, which means two things:

        - the work item might be executed on a different cpu by a worker from
          an unbound pool, so it doesn't run pinned on the cpu
      
        - we have to make sure that we do not race with page_alloc_cpu_dead
          calling drain_pages_zone
      
      Disabling preemption in drain_local_pages_wq will solve the first
      problem: drain_local_pages will determine its local CPU from the WQ
      context, which will be stable after that point; page_alloc_cpu_dead is
      pinned to the CPU already.  The latter condition is achieved by disabling
      IRQs in drain_pages_zone.
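
      That is, the workqueue callback pins itself to a CPU for the duration of
      the drain, roughly (a sketch of the change described above):

      	#include <linux/gfp.h>		/* drain_local_pages() */
      	#include <linux/preempt.h>
      	#include <linux/workqueue.h>

      	static void drain_local_pages_wq(struct work_struct *work)
      	{
      		/*
      		 * Disabling preemption keeps "the local CPU" stable while
      		 * drain_local_pages() runs, even if the work item was bounced
      		 * to an unbound worker; drain_pages_zone() disables IRQs to
      		 * avoid racing with page_alloc_cpu_dead().
      		 */
      		preempt_disable();
      		drain_local_pages(NULL);
      		preempt_enable();
      	}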
      
      Fixes: mm, page_alloc: drain per-cpu pages from workqueue context
      Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a459eeb7
    • mm, page_alloc: drain per-cpu pages from workqueue context · 0ccce3b9
      Mel Gorman authored
      
      
      The per-cpu page allocator can be drained immediately via
      drain_all_pages() which sends IPIs to every CPU.  In the next patch, the
      per-cpu allocator will only be used for interrupt-safe allocations which
      prevents draining it from IPI context.  This patch uses workqueues to
      drain the per-cpu lists instead.
      
      This is slower but no slowdown during intensive reclaim was measured and
      the paths that use drain_all_pages() are not that sensitive to
      performance.  This is particularly true as the path would only be
      triggered when reclaim is failing.  It also makes some sense to avoid
      storming a machine with IPIs when it's under memory pressure.  Arguably,
      it should be further adjusted so that only one caller at a time is
      draining pages but it's beyond the scope of the current patch.
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ccce3b9
    • mm, page_alloc: split alloc_pages_nodemask() · 9cd75558
      Mel Gorman authored
      
      
      alloc_pages_nodemask does a number of preparation steps that determine
      what zones can be used for the allocation depending on a variety of
      factors.  This is fine but a hypothetical caller that wanted multiple
      order-0 pages has to do the preparation steps multiple times.  This
      patch structures __alloc_pages_nodemask such that it's relatively easy
      to build a bulk order-0 page allocator.  There is no functional change.
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cd75558
    • mm, page_alloc: split buffered_rmqueue() · 066b2393
      Mel Gorman authored
      
      
      Patch series "Use per-cpu allocator for !irq requests and prepare for a
      bulk allocator", v5.
      
      This series is motivated by a conversation led by Jesper Dangaard Brouer
      at the last LSF/MM proposing a generic page pool for DMA-coherent pages.
      Part of his motivation was due to the overhead of allocating multiple
      order-0 pages, which led some drivers to use high-order allocations and
      split them.  This is very slow in some cases.
      
      The first two patches in this series restructure the page allocator such
      that it is relatively easy to introduce an order-0 bulk page allocator.
      A patch exists to do that and has been handed over to Jesper until an
      in-kernel user is created.  The third patch prevents the per-cpu
      allocator from being drained from IPI context, as that can potentially
      corrupt the list after patch four is merged.  The final patch alters the
      per-cpu allocator to make it exclusive to !irq requests.  This cuts
      allocation/free overhead by roughly 30%.
      
      Performance tests from both Jesper and me are included in the patch.
      
      This patch (of 4):
      
      buffered_rmqueue() removes a page from a given zone and uses the
      per-cpu list for order-0 requests.  This is fine, but a hypothetical
      caller that wanted multiple order-0 pages would have to disable and
      re-enable interrupts multiple times.  This patch structures
      buffered_rmqueue() such that it is relatively easy to build a bulk
      order-0 page allocator.  There is no functional change.
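      
      The gist of the restructuring, in simplified form (the helper below
      is illustrative, not the function added by the patch): the per-cpu
      list handling is factored out so a bulk caller could take several
      pages under a single irq-disabled section:
      
        /* Illustrative only: grab up to 'count' order-0 MOVABLE pages from
         * this CPU's per-cpu list with one irq disable/enable pair. */
        static unsigned int take_pages_from_pcplist(struct per_cpu_pages *pcp,
                                                    struct list_head *dst,
                                                    unsigned int count)
        {
                unsigned long flags;
                unsigned int taken = 0;
        
                local_irq_save(flags);
                while (taken < count &&
                       !list_empty(&pcp->lists[MIGRATE_MOVABLE])) {
                        struct page *page;
        
                        page = list_first_entry(&pcp->lists[MIGRATE_MOVABLE],
                                                struct page, lru);
                        list_del(&page->lru);
                        pcp->count--;
                        list_add(&page->lru, dst);
                        taken++;
                }
                local_irq_restore(flags);
                return taken;
        }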
      
      [mgorman@techsingularity.net: failed per-cpu refill may blow up]
        Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.net
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      066b2393
    • Johannes Weiner's avatar
      mm: vmscan: move dirty pages out of the way until they're flushed · c55e8d03
      Johannes Weiner authored
      
      
      We noticed a performance regression when moving hadoop workloads from
      3.10 kernels to 4.0 and 4.6.  This is accompanied by increased pageout
      activity initiated by kswapd as well as frequent bursts of allocation
      stalls and direct reclaim scans.  Even lowering the dirty ratios to the
      equivalent of less than 1% of memory would not eliminate the issue,
      suggesting that dirty pages concentrate where the scanner is looking.
      
      This can be traced back to the recent thrash avoidance work.  Where
      3.10 would not detect refaulting pages and continuously supply clean
      cache to the inactive list, a thrashing workload on 4.0+ will detect and
      activate refaulting pages right away, distilling used-once pages on the
      inactive list much more effectively.  This is by design, and it makes
      sense for clean cache.  But for the most part our workload's cache
      faults are refaults and its use-once cache is from streaming writes.  We
      end up with most of the inactive list dirty, and we don't go after the
      active cache as long as we have use-once pages around.
      
      But waiting for writes to avoid reclaiming clean cache that *might*
      refault is a bad trade-off.  Even if the refaults happen, reads are
      faster than writes.  Before getting bogged down on writeback, reclaim
      should first look at *all* cache in the system, even active cache.
      
      To accomplish this, activate pages that are dirty or under writeback
      when they reach the end of the inactive LRU.  The pages are marked for
      immediate reclaim, meaning they'll get moved back to the inactive LRU
      tail as soon as they're written back and become reclaimable.  But in the
      meantime, by reducing the inactive list to only immediately reclaimable
      pages, we allow the scanner to deactivate and refill the inactive list
      with clean cache from the active list tail to guarantee forward
      progress.
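      
      Stripped of the surrounding checks, the shrink_page_list() change
      boils down to something like this (a simplified sketch; the helper
      and its name are illustrative, not taken from the patch):
      
        #include <linux/page-flags.h>
        
        /* Return true if the caller should activate the page instead of
         * trying to reclaim it right now. */
        static bool defer_dirty_page(struct page *page)
        {
                if (PageDirty(page) || PageWriteback(page)) {
                        /* PG_reclaim makes end_page_writeback() rotate the
                         * page straight back to the inactive tail. */
                        SetPageReclaim(page);
                        return true;
                }
                return false;
        }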
      
      [hannes@cmpxchg.org: update comment]
        Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c55e8d03
    • Johannes Weiner's avatar
      mm: vmscan: only write dirty pages that the scanner has seen twice · 4eda4823
      Johannes Weiner authored
      
      
      Dirty pages can easily reach the end of the LRU while there are still
      clean pages to reclaim around.  Don't let kswapd write them back just
      because there are a lot of them.  It costs more CPU to find the clean
      pages, but that's almost certainly better than to disrupt writeback from
      the flushers with LRU-order single-page writes from reclaim.  And the
      flushers have been woken up by that point, so we spend IO capacity on
      flushing and CPU capacity on finding the clean cache.
      
      Only start writing dirty pages if they have cycled around the LRU twice
      now and STILL haven't been queued on the IO device.  It's possible that
      the dirty pages are so sparsely distributed across different bdis,
      inodes, and memory cgroups that the flushers take forever to get to
      the ones we want reclaimed.  Once we see them twice on the LRU, we
      know
      that's the quicker way to find them, so do LRU writeback.
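      
      In simplified terms, PG_reclaim doubles as the "seen once already"
      marker (a sketch; the real check also looks at whether kswapd is
      running and at the node's dirty state):
      
        /* Illustrative: only the second encounter with a still-dirty page
         * resorts to writeback from reclaim. */
        static bool should_write_from_reclaim(struct page *page)
        {
                if (!PageReclaim(page)) {
                        SetPageReclaim(page);   /* first pass: let flushers catch up */
                        return false;
                }
                return true;                    /* second pass: do LRU-order writeback */
        }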
      
      Link: http://lkml.kernel.org/r/20170123181641.23938-5-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4eda4823
    • Johannes Weiner's avatar
      mm: vmscan: remove old flusher wakeup from direct reclaim path · bbef9384
      Johannes Weiner authored
      
      
      Direct reclaim has been replaced by kswapd reclaim in pretty much all
      common memory pressure situations, so this code most likely doesn't
      accomplish the described effect anymore.  The previous patch wakes up
      flushers for all reclaimers when we encounter dirty pages at the tail
      end of the LRU.  Remove the crufty old direct reclaim invocation.
      
      Link: http://lkml.kernel.org/r/20170123181641.23938-4-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbef9384
    • Johannes Weiner's avatar
      mm: vmscan: kick flushers when we encounter dirty pages on the LRU · 726d061f
      Johannes Weiner authored
      
      
      Memory pressure can put dirty pages at the end of the LRU without
      anybody running into dirty limits.  Don't start writing individual pages
      from kswapd while the flushers might be asleep.
      
      Unlike the old direct reclaim flusher wakeup (removed in the next patch)
      that flushes the number of pages just scanned, this patch wakes the
      flushers for all outstanding dirty pages.  That seemed to perform better
      in a synthetic test that pushes dirty pages to the end of the LRU and
      into reclaim, because we know LRU aging outstrips writeback already, and
      this way we give younger dirty pages a head start rather than
      waiting until reclaim runs into them as well.  It also means less
      plugging and less risk of
      exhausting the struct request pool from reclaim.
      
      There is a concern that this will cause temporary files that used to get
      dirtied and truncated before writeback to now get written to disk under
      memory pressure.  If this turns out to be a real problem, we'll have to
      revisit this and tame the reclaim flusher wakeups.
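      
      The wakeup itself is a one-liner in the shrink_inactive_list() path,
      along the lines of the following (variable names are simplified;
      wakeup_flusher_threads() took an nr_pages argument at the time, with
      0 meaning "all dirty pages"):
      
        /* Every isolated page was dirty but not yet queued for IO: the
         * flushers are behind LRU aging, so wake them for all dirty pages. */
        if (nr_unqueued_dirty == nr_taken)
                wakeup_flusher_threads(0, WB_REASON_VMSCAN);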
      
      [hannes@cmpxchg.org: mention dirty expiration as a condition]
        Link: http://lkml.kernel.org/r/20170126174739.GA30636@cmpxchg.org
      Link: http://lkml.kernel.org/r/20170123181641.23938-3-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      726d061f
    • Johannes Weiner's avatar
      mm: vmscan: scan dirty pages even in laptop mode · 1276ad68
      Johannes Weiner authored
      
      
      Patch series "mm: vmscan: fix kswapd writeback regression".
      
      We noticed a regression on multiple hadoop workloads when moving from
      3.10 to 4.0 and 4.6, which involves kswapd getting tangled up in page
      writeout, causing direct reclaim herds that also don't make progress.
      
      I tracked it down to the thrash avoidance efforts after 3.10 that make
      the kernel better at keeping use-once cache and use-many cache sorted on
      the inactive and active list, with more aggressive protection of the
      active list as long as there is inactive cache.  Unfortunately, our
      workload's use-once cache is mostly from streaming writes.  Waiting for
      writes to avoid potential reloads in the future is not a good tradeoff.
      
      These patches do the following:
      
      1. Wake the flushers when kswapd sees a lump of dirty pages. It's
         possible to be below the dirty background limit and still have cache
         velocity push them through the LRU. So start a-flushin'.
      
      2. Let kswapd only write pages that have been rotated twice. This makes
         sure we really tried to get all the clean pages on the inactive list
         before resorting to horrible LRU-order writeback.
      
      3. Move rotating dirty pages off the inactive list. Instead of churning
         or waiting on page writeback, we'll go after clean active cache. This
         might lead to thrashing, but in this state memory demand outstrips IO
         speed anyway, and reads are faster than writes.
      
      Mel backported the series to 4.10-rc5 with one minor conflict and
      ran a couple of tests on it.  A mixed random read/write workload
      didn't show anything interesting.  A write-only database workload
      didn't show much difference in performance, but there were slight
      reductions in IO -- probably in the noise.
      
      simoop did show big differences, although not as big as Mel
      expected.  This is Chris Mason's workload that simulates the VM
      activity of hadoop.  Mel won't go through the full details, but over
      the samples measured during an hour it reported:
      
                                               4.10.0-rc5            4.10.0-rc5
                                                  vanilla         johannes-v1r1
      Amean    p50-Read             21346531.56 (  0.00%) 21697513.24 ( -1.64%)
      Amean    p95-Read             24700518.40 (  0.00%) 25743268.98 ( -4.22%)
      Amean    p99-Read             27959842.13 (  0.00%) 28963271.11 ( -3.59%)
      Amean    p50-Write                1138.04 (  0.00%)      989.82 ( 13.02%)
      Amean    p95-Write             1106643.48 (  0.00%)    12104.00 ( 98.91%)
      Amean    p99-Write             1569213.22 (  0.00%)    36343.38 ( 97.68%)
      Amean    p50-Allocation          85159.82 (  0.00%)    79120.70 (  7.09%)
      Amean    p95-Allocation         204222.58 (  0.00%)   129018.43 ( 36.82%)
      Amean    p99-Allocation         278070.04 (  0.00%)   183354.43 ( 34.06%)
      Amean    final-p50-Read       21266432.00 (  0.00%) 21921792.00 ( -3.08%)
      Amean    final-p95-Read       24870912.00 (  0.00%) 26116096.00 ( -5.01%)
      Amean    final-p99-Read       28147712.00 (  0.00%) 29523968.00 ( -4.89%)
      Amean    final-p50-Write          1130.00 (  0.00%)      977.00 ( 13.54%)
      Amean    final-p95-Write       1033216.00 (  0.00%)     2980.00 ( 99.71%)
      Amean    final-p99-Write       1517568.00 (  0.00%)    32672.00 ( 97.85%)
      Amean    final-p50-Allocation    86656.00 (  0.00%)    78464.00 (  9.45%)
      Amean    final-p95-Allocation   211712.00 (  0.00%)   116608.00 ( 44.92%)
      Amean    final-p99-Allocation   287232.00 (  0.00%)   168704.00 ( 41.27%)
      
      The latencies are actually completely horrific in comparison to 4.4 (and
      4.10-rc5 is worse than 4.9 according to historical data for reasons Mel
      hasn't analysed yet).
      
      Still, 95% of write latency (p95-write) is halved by the series and
      allocation latency is way down.  Direct reclaim activity is one fifth of
      what it was according to vmstats.  Kswapd activity is higher but this is
      not necessarily surprising.  Kswapd efficiency is unchanged at 99% (99%
      of pages scanned were reclaimed) but direct reclaim efficiency went from
      77% to 99%.
      
      In the vanilla kernel, 627MB of data was written back from reclaim
      context.  With the series, no data was written back.  With or without
      the patch, pages are being immediately reclaimed after writeback
      completes.  However, with the patch, only 1/8th of the pages are
      reclaimed like this.
      
      This patch (of 5):
      
      We have an elaborate dirty/writeback throttling mechanism inside the
      reclaim scanner, but for that to work the pages have to go through
      shrink_page_list() and get counted for what they are.  Otherwise, we
      mess up the LRU order and don't match reclaim speed to writeback.
      
      Especially during deactivation, there is never a reason to skip dirty
      pages; nothing is even trying to write them out from there.  Don't mess
      up the LRU order for nothing; shuffle these pages along.
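      
      Concretely, the effect on the LRU isolation path looks roughly like
      this (an illustrative fragment, assuming the ISOLATE_CLEAN filtering
      that used to exclude dirty pages; 'sc' is the usual scan_control):
      
        isolate_mode_t isolate_mode = 0;
        
        if (!sc->may_unmap)
                isolate_mode |= ISOLATE_UNMAPPED;
        /* Previously: if (!sc->may_writepage)
         *                     isolate_mode |= ISOLATE_CLEAN;
         * Dirty pages are now isolated too, so shrink_page_list() sees and
         * counts them, and the dirty/writeback throttling keeps working. */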
      
      Link: http://lkml.kernel.org/r/20170123181641.23938-2-hannes@cmpxchg.org
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1276ad68
    • Mike Rapoport's avatar
      userfaultfd: non-cooperative: selftest: enable REMOVE event test for shmem · 64527f5d
      Mike Rapoport authored
      
      
      Now that madvise(MADV_REMOVE) notifies the uffd reader, we should
      verify that the application actually sees zeros in the removed
      range.
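      
      In the spirit of the selftest, a standalone illustration (not the
      actual test code) of the property being verified:
      
        #define _GNU_SOURCE
        #include <assert.h>
        #include <string.h>
        #include <sys/mman.h>
        
        int main(void)
        {
                size_t len = 2 * 1024 * 1024;
                /* MAP_SHARED|MAP_ANONYMOUS is shmem-backed, so MADV_REMOVE
                 * punches a hole in the underlying object. */
                char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                assert(area != MAP_FAILED);
        
                memset(area, 0xaa, len);
                assert(madvise(area, len, MADV_REMOVE) == 0);
        
                /* The removed range must read back as zeros. */
                for (size_t i = 0; i < len; i++)
                        assert(area[i] == 0);
                return 0;
        }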
      
      Link: http://lkml.kernel.org/r/1484814154-1557-4-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64527f5d
    • Mike Rapoport's avatar
      userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE request · a6bf53eb
      Mike Rapoport authored
      
      
      When a page is removed from a shared mapping, the uffd reader should be
      notified, so that it won't attempt to handle #PF events for the removed
      pages.
      
      We can reuse UFFD_EVENT_REMOVE because, from the uffd monitor's
      point of view, the semantics of madvise(MADV_DONTNEED) and
      madvise(MADV_REMOVE) are exactly the same.
      
      Link: http://lkml.kernel.org/r/1484814154-1557-3-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarPavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6bf53eb
    • Mike Rapoport's avatar
      userfaultfd: non-cooperative: rename *EVENT_MADVDONTNEED to *EVENT_REMOVE · d811914d
      Mike Rapoport authored
      
      
      Patch series "userfaultfd: non-cooperative: add madvise() event for
      MADV_REMOVE request".
      
      These patches add notification of madvise(MADV_REMOVE) event to
      non-cooperative userfaultfd monitor.
      
      The first patch renames EVENT_MADVDONTNEED to EVENT_REMOVE along
      with the relevant functions and structures.  Using _REMOVE instead
      of _MADVDONTNEED describes the event semantics more clearly, and I
      hope it's not too late for such a change in the ABI.
      
      This patch (of 3):
      
      The purpose of UFFD_EVENT_MADVDONTNEED is to notify the uffd monitor
      about the removal of a range from an address space tracked by
      userfaultfd.  Hence, UFFD_EVENT_REMOVE better reflects the
      operation's semantics.  Correspondingly, the 'madv_dn' field of
      uffd_msg is renamed to 'remove' and the madvise_userfault_dontneed
      callback is renamed to userfaultfd_remove.
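      
      After the rename, a non-cooperative monitor's event loop deals with
      the new name roughly as follows (a simplified sketch; userfaultfd
      registration and fault handling are omitted, and the uapi header
      must be recent enough to define UFFD_EVENT_REMOVE):
      
        #include <linux/userfaultfd.h>
        #include <stdio.h>
        #include <unistd.h>
        
        /* Read one message from the userfaultfd file descriptor and report
         * removed ranges; events that used to be UFFD_EVENT_MADVDONTNEED
         * now arrive as UFFD_EVENT_REMOVE with the renamed 'remove' field. */
        static void handle_one_event(int uffd)
        {
                struct uffd_msg msg;
        
                if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
                        return;
        
                if (msg.event == UFFD_EVENT_REMOVE)
                        printf("removed: %llx-%llx\n",
                               (unsigned long long)msg.arg.remove.start,
                               (unsigned long long)msg.arg.remove.end);
        }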
      
      Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d811914d
    • Heiko Carstens's avatar
      memblock: embed memblock type name within struct memblock_type · 0262d9c8
      Heiko Carstens authored
      
      
      Provide the name of each memblock type within struct memblock_type.
      This allows us to get rid of the memblock_type_name() helper and
      avoids duplicating the type names in __memblock_dump_all().
      
      The only memblock_type usage outside of mm/memblock.c seems to be in
      arch/s390/kernel/crash_dump.c.  While at it, give that one a name as
      well.
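      
      A sketch of the idea (field list abbreviated; see
      include/linux/memblock.h for the real definition):
      
        struct memblock_type {
                unsigned long cnt;
                unsigned long max;
                phys_addr_t total_size;
                struct memblock_region *regions;
                char *name;             /* e.g. "memory", "reserved" */
        };
        
        /* Dumping code can now use the embedded name directly instead of
         * going through a memblock_type_name() lookup helper. */
        static void dump_memblock_type(const struct memblock_type *type)
        {
                pr_info("%s: %lu region(s), total size %pa\n",
                        type->name, type->cnt, &type->total_size);
        }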
      
      Link: http://lkml.kernel.org/r/20170120123456.46508-4-heiko.carstens@de.ibm.com
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0262d9c8