  1. Sep 09, 2017
    • fs, proc: remove priv argument from is_stack · 1240ea0d
      Michal Hocko authored
      Commit b18cb64e ("fs/proc: Stop trying to report thread stacks")
      removed the last user of the priv parameter in is_stack(), so the
      argument is redundant.  Drop it.
      
      [arnd@arndb.de: remove unused variable]
        Link: http://lkml.kernel.org/r/20170801120150.1520051-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170728075833.7241-1-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1240ea0d
    • mm/mempolicy.c: remove BUG_ON() checks for VMA inside mpol_misplaced() · 149728e9
      Anshuman Khandual authored
      
      
      VMA and its address bounds checks are too late in this function.  They
      must have been verified earlier in the page fault sequence.  Hence just
      remove them.
      
      Link: http://lkml.kernel.org/r/20170901130137.7617-1-khandual@linux.vnet.ibm.com
      Signed-off-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Suggested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      149728e9
    • mm/swapfile.c: fix swapon frontswap_map memory leak on error · b6b1fd2a
      David Rientjes authored
      
      
      Free frontswap_map if an error is encountered before enable_swap_info().
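      For illustration, a rough sketch of the shape of the fix (not the exact
      hunk; the label and variable names follow mm/swapfile.c's sys_swapon()
      error path of that era):
      
      	bad_swap:
      		/*
      		 * Everything allocated before the failure must be freed
      		 * here; frontswap_map was previously leaked on this path.
      		 */
      		kvfree(cluster_info);
      		kvfree(frontswap_map);	/* the missing free */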
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6b1fd2a
    • mm: kvfree the swap cluster info if the swap file is unsatisfactory · 8606a1a9
      Darrick J. Wong authored
      If initializing a small swap file fails because the swap file has a
      problem (holes, etc.) then we need to free the cluster info as part of
      cleanup.  Unfortunately a previous patch changed the code to use kvzalloc
      but did not change all the vfree calls to use kvfree.
      
      Found by running generic/357 from xfstests.
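      A minimal sketch of the allocation/free pairing involved (simplified;
      the real code lives in mm/swapfile.c):
      
      	cluster_info = kvzalloc(nr_cluster * sizeof(*cluster_info),
      				GFP_KERNEL);
      	...
      	/* on the error path: */
      	kvfree(cluster_info);	/* was vfree(), wrong for kvmalloc'd memory */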
      
      Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia
      Fixes: 54f180d3 ("mm, swap: use kvzalloc to allocate some swap data structures")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8606a1a9
    • mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt · f19360f0
      Tetsuo Handa authored
      We are erroneously initializing alloc_flags before gfp_allowed_mask is
      applied.  This could cause problems after pm_restrict_gfp_mask() is
      called during a suspend operation.  Apply gfp_allowed_mask before
      initializing alloc_flags so that the first allocation attempt uses the
      correct flags.
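      A sketch of the reordering in __alloc_pages_nodemask() (simplified; the
      surrounding code follows mm/page_alloc.c of that kernel):
      
      	gfp_mask &= gfp_allowed_mask;	/* apply the mask first ... */
      	alloc_mask = gfp_mask;		/* ... so the first attempt sees it */
      	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask,
      				 &ac, &alloc_mask, &alloc_flags))
      		return NULL;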
      
      Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
      Fixes: 83d4ca81 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f19360f0
    • tools/testing/selftests/kcmp/kcmp_test.c: add KCMP_EPOLL_TFD testing · 66559a69
      Cyrill Gorcunov authored
      KCMP's KCMP_EPOLL_TFD mode was merged in commit 0791e364 ("kcmp: add
      KCMP_EPOLL_TFD mode to compare epoll target files"), but we have had no
      selftest for it so far (except in the CRIU development tree).  Thus add
      one.
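      Not the selftest itself, but a small standalone sketch of how
      KCMP_EPOLL_TFD is exercised (needs Linux >= 4.13 UAPI headers for
      KCMP_EPOLL_TFD and struct kcmp_epoll_slot):
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/epoll.h>
      	#include <sys/syscall.h>
      	#include <linux/kcmp.h>
      
      	int main(void)
      	{
      		int epfd = epoll_create1(0), fds[2];
      		struct epoll_event ev = { .events = EPOLLIN };
      		struct kcmp_epoll_slot slot;
      		long ret;
      
      		if (epfd < 0 || pipe(fds) ||
      		    epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev))
      			return 1;
      
      		slot.efd  = epfd;	/* epoll fd in the target task */
      		slot.tfd  = fds[0];	/* target fd expected inside it */
      		slot.toff = 0;		/* first matching slot */
      
      		ret = syscall(SYS_kcmp, getpid(), getpid(), KCMP_EPOLL_TFD,
      			      fds[0], (unsigned long)&slot);
      		printf("KCMP_EPOLL_TFD -> %ld (0 means the fds match)\n", ret);
      		return 0;
      	}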
      
      Link: http://lkml.kernel.org/r/20170901151620.GK1898@uranus.lan
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66559a69
    • mm/sparse.c: fix typo in online_mem_sections · b4ccec41
      Michal Hocko authored
      online_mem_sections() accidentally marks online only the first section
      in the given range.  This is a typo which hasn't been noticed because I
      haven't tested large 2GB blocks previously.  All users of
      pfn_to_online_page() would get confused on the rest of the pfn range in
      the block.
      
      All we need to fix this is to use the iterator (pfn) rather than
      start_pfn.
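      A sketch of the corrected loop (helper names as in mm/sparse.c of that
      series):
      
      	void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
      	{
      		unsigned long pfn;
      
      		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
      			/* was pfn_to_section_nr(start_pfn): only the first
      			 * section in the range ever got marked online */
      			unsigned long section_nr = pfn_to_section_nr(pfn);
      			struct mem_section *ms;
      
      			if (WARN_ON(!valid_section_nr(section_nr)))
      				continue;
      
      			ms = __nr_to_section(section_nr);
      			online_section(ms);
      		}
      	}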
      
      Link: http://lkml.kernel.org/r/20170904112210.3401-1-mhocko@kernel.org
      Fixes: 2d070eab ("mm: consider zone which is not fully populated to have holes")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4ccec41
    • mm/memory.c: fix mem_cgroup_oom_disable() call missing · de0c799b
      Laurent Dufour authored
      Seen while reading the code: in handle_mm_fault(), when
      arch_vma_access_permitted() fails, the matching call to
      mem_cgroup_oom_disable() is never made.
      
      To fix that, move the call to mem_cgroup_oom_enable() after the
      arch_vma_access_permitted() check, since on that failure we should not
      have entered the memcg OOM handling in the first place.
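      A sketch of the reordered code in handle_mm_fault() (simplified from
      mm/memory.c):
      
      	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
      				       flags & FAULT_FLAG_INSTRUCTION,
      				       flags & FAULT_FLAG_REMOTE))
      		return VM_FAULT_SIGSEGV;
      
      	/*
      	 * Enable the memcg OOM handling only after the access check, so
      	 * the early SIGSEGV return above can no longer leave it enabled
      	 * without a matching mem_cgroup_oom_disable().
      	 */
      	if (flags & FAULT_FLAG_USER)
      		mem_cgroup_oom_enable();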
      
      Link: http://lkml.kernel.org/r/1504625439-31313-1-git-send-email-ldufour@linux.vnet.ibm.com
      Fixes: bae473a4 ("mm: introduce fault_env")
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de0c799b
    • mm: memcontrol: use per-cpu stocks for socket memory uncharging · 475d0487
      Roman Gushchin authored
      
      
      We've noticed a quite noticeable performance overhead on some hosts with
      significant network traffic when socket memory accounting is enabled.
      
      Perf top shows that socket memory uncharging path is hot:
        2.13%  [kernel]                [k] page_counter_cancel
        1.14%  [kernel]                [k] __sk_mem_reduce_allocated
        1.14%  [kernel]                [k] _raw_spin_lock
        0.87%  [kernel]                [k] _raw_spin_lock_irqsave
        0.84%  [kernel]                [k] tcp_ack
        0.84%  [kernel]                [k] ixgbe_poll
        0.83%  < workload >
        0.82%  [kernel]                [k] enqueue_entity
        0.68%  [kernel]                [k] __fget
        0.68%  [kernel]                [k] tcp_delack_timer_handler
        0.67%  [kernel]                [k] __schedule
        0.60%  < workload >
        0.59%  [kernel]                [k] __inet6_lookup_established
        0.55%  [kernel]                [k] __switch_to
        0.55%  [kernel]                [k] menu_select
        0.54%  libc-2.20.so            [.] __memcpy_avx_unaligned
      
      To address this issue, the existing per-cpu stock infrastructure can be
      used.
      
      refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
      charge to a per-cpu stock instead of calling atomic
      page_counter_uncharge().
      
      To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
      will explicitly drain the cached charge, if the cached value exceeds
      CHARGE_BATCH.
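      A sketch of the resulting uncharge path (helper names follow
      mm/memcontrol.c of that series; treat the details as approximate):
      
      	void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
      				       unsigned int nr_pages)
      	{
      		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
      			page_counter_uncharge(&memcg->tcpmem, nr_pages);
      			return;
      		}
      
      		mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
      
      		/* cache the uncharged pages on the per-cpu stock instead
      		 * of hitting the atomic page_counter */
      		refill_stock(memcg, nr_pages);
      	}
      
      and in refill_stock() itself:
      
      	if (stock->nr_pages > CHARGE_BATCH)
      		drain_stock(stock);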
      
      This significantly optimizes the load:
        1.21%  [kernel]                [k] _raw_spin_lock
        1.01%  [kernel]                [k] ixgbe_poll
        0.92%  [kernel]                [k] _raw_spin_lock_irqsave
        0.90%  [kernel]                [k] enqueue_entity
        0.86%  [kernel]                [k] tcp_ack
        0.85%  < workload >
        0.74%  perf-11120.map          [.] 0x000000000061bf24
        0.73%  [kernel]                [k] __schedule
        0.67%  [kernel]                [k] __fget
        0.63%  [kernel]                [k] __inet6_lookup_established
        0.62%  [kernel]                [k] menu_select
        0.59%  < workload >
        0.59%  [kernel]                [k] __switch_to
        0.57%  libc-2.20.so            [.] __memcpy_avx_unaligned
      
      Link: http://lkml.kernel.org/r/20170829100150.4580-1-guro@fb.com
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      475d0487
    • mm: fadvise: avoid fadvise for fs without backing device · 3a77d214
      Shakeel Butt authored
      
      
      The fadvise() manpage is silent on fadvise()'s effect on memory-based
      filesystems (shmem, hugetlbfs & ramfs) and pseudo filesystems (procfs,
      sysfs, kernfs).  The current implementation of fadvise() is mostly a
      noop for such filesystems, except for FADV_DONTNEED, which triggers
      expensive remote LRU cache draining.  This patch makes the noop
      behaviour of fadvise() on such filesystems explicit.
      
      However this change has two side effects for ramfs and one for tmpfs.
      First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
      pages of ramfs (allocated through read, readahead & read fault) and
      tmpfs (allocated through read fault).  Also fadvise(FADV_WILLNEED) could
      create such clean zero'ed pages for ramfs.  This change removes those
      possibilities.
      
      One of our generic libraries does fadvise(FADV_DONTNEED).  Recently we
      observed high latency in fadvise() and noticed that the users had
      started using tmpfs files, and the latency was due to the expensive
      remote LRU cache draining.  For normal tmpfs files (which have data
      written to them), fadvise(FADV_DONTNEED) will always trigger this
      unneeded remote cache draining.
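      A sketch of the early bail-out this adds to fadvise64_64() (hedged: the
      exact condition and the advice handling in the patch may differ
      slightly):
      
      	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
      
      	if (IS_DAX(inode) || bdi == &noop_backing_dev_info) {
      		/* memory-based or pseudo filesystem: known advice values
      		 * become an explicit no-op (invalid ones still -EINVAL) */
      		ret = 0;
      		goto out;
      	}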
      
      Link: http://lkml.kernel.org/r/20170818011023.181465-1-shakeelb@google.com
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a77d214
    • mm/zsmalloc.c: change stat type parameter to int · 3eb95fea
      Matthias Kaehlcke authored
      
      
      zs_stat_inc/dec/get() use enum zs_stat_type for the stat type, however
      some callers pass an enum fullness_group value.  Change the type to int
      to reflect the actual use of the functions and get rid of
      'enum-conversion' warnings.
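      The change is essentially just the parameter type (sketch, in
      mm/zsmalloc.c style):
      
      	/* was: enum zs_stat_type type */
      	static inline void zs_stat_inc(struct size_class *class, int type,
      				       unsigned long cnt)
      	{
      		class->stats.objs[type] += cnt;
      	}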
      
      Link: http://lkml.kernel.org/r/20170731175000.56538-1-mka@chromium.org
      Signed-off-by: default avatarMatthias Kaehlcke <mka@chromium.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Doug Anderson <dianders@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3eb95fea
    • mm/mlock.c: use page_zone() instead of page_zone_id() · 9472f23c
      Joonsoo Kim authored
      
      
      page_zone_id() is a specialized function for comparing the zone of
      pages that are within the section range.  If the sections of the pages
      differ, page_zone_id() can differ even if their zone is the same.  This
      wrong usage doesn't cause any actual problem since
      __munlock_pagevec_fill() would be called again with the failed index.
      However, it's better to use the more appropriate function here.
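      A sketch of the comparison in __munlock_pagevec_fill() after the change
      (simplified):
      
      	/* zone ids are only unique within a sparsemem section, so compare
      	 * the zone itself instead */
      	if (page_zone(page) != zone)
      		break;
      	/* previously: if (page_zone_id(page) != zoneid) break; */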
      
      Link: http://lkml.kernel.org/r/1503559211-10259-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9472f23c
    • mm: consider the number in local CPUs when reading NUMA stats · 63803222
      Kemi Wang authored
      
      
      To avoid deviation, the per-cpu NUMA stats in vm_numa_stat_diff[] are
      included when a user *reads* the NUMA stats.
      
      Since the NUMA stats are not read frequently by users, and the kernel
      does not need them to make decisions, it is not a problem to make the
      readers more expensive.
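      A sketch of the reader-side snapshot that folds in the per-cpu deltas
      (name and fields per mm/vmstat.c of this series; treat as approximate):
      
      	static unsigned long zone_numa_state_snapshot(struct zone *zone,
      						      enum numa_stat_item item)
      	{
      		long x = atomic_long_read(&zone->vm_numa_stat[item]);
      		int cpu;
      
      		for_each_online_cpu(cpu)
      			x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
      
      		return x;
      	}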
      
      Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63803222
    • mm: update NUMA counter threshold size · 1d90ca89
      Kemi Wang authored
      
      
      There is significant overhead from cache bouncing caused by updating
      the zone counters (NUMA-associated counters) in parallel during
      multi-threaded page allocation (suggested by Dave Hansen).
      
      This patch updates the NUMA counter threshold to a fixed size of
      MAX_U16 - 2, as a small threshold greatly increases the update
      frequency of the global counter from the local per-cpu counters
      (suggested by Ying Huang).
      
      The rationale is that these statistics counters don't affect the
      kernel's decisions, unlike other VM counters, so it's not a problem to
      use a large threshold.
      
      With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369) per
      single page allocation and reclaim on Jesper's page_bench03 benchmark.
      
      Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
      https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
       Threshold   CPU cycles    Throughput(88 threads)
           32          799         241760478
           64          640         301628829
           125         537         358906028 <==> system by default (base)
           256         468         412397590
           512         428         450550704
           4096        399         482520943
           20000       394         489009617
           30000       395         488017817
           65533       369(-31.3%) 521661345(+45.3%) <==> with this patchset
           N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
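      A sketch of the per-cpu update path with the large fixed threshold
      (names per mm/vmstat.c of this series; treat as approximate):
      
      	/* #define NUMA_STATS_THRESHOLD (U16_MAX - 2) */
      	void __inc_numa_state(struct zone *zone, enum numa_stat_item item)
      	{
      		struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
      		u16 __percpu *p = pcp->vm_numa_stat_diff + item;
      		u16 v;
      
      		v = __this_cpu_inc_return(*p);
      
      		if (unlikely(v > NUMA_STATS_THRESHOLD)) {
      			zone_numa_state_add(v, zone, item);
      			__this_cpu_write(*p, 0);
      		}
      	}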
      
      Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Suggested-by: default avatarYing Huang <ying.huang@intel.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d90ca89
    • mm: change the call sites of numa statistics items · 3a321d2a
      Kemi Wang authored
      
      
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call
      to zone_statistics().  As discussed at the 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very
      rarely consumed.  The significant overhead comes from cache bouncing
      caused by updating the zone counters (NUMA-associated counters) in
      parallel during multi-threaded page allocation (pointed out by Dave
      Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates the NUMA statistics
      from the zone statistics framework and updates the NUMA counter
      threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly
      increases the update frequency of the global counter from the local
      per-cpu counters (suggested by Ying Huang).  The rationale is that
      these statistics counters don't need to be read often, unlike other VM
      counters, so it's not a problem to use a large threshold and make
      readers more expensive.
      
      With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369, see
      below) per single page allocation and reclaim on Jesper's page_bench03
      benchmark.  Meanwhile, this patchset keeps the same style of virtual
      memory statistics with little end-user-visible effect (it only moves
      the NUMA stats to show after the zone page stats; see the first patch
      for details).
      
      I did an experiment with single page allocation and reclaim running
      concurrently, using Jesper's page_bench03 benchmark on a 2-socket
      Broadwell-based server (88 processors with 126G memory), with
      different sizes of the pcp counter threshold.
      
      Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, the NUMA statistics are separated from the zone
      statistics framework, and all the call sites of NUMA stats are changed
      to use numa-stats-specific functions.  There is no functional change
      except that the NUMA stats are shown after the zone page stats when
      users *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a321d2a
    • mm/memory.c: remove redundant check for write access · fde26bed
      Anshuman Khandual authored
      
      
      The flags argument has been copied into vmf.flags and is not changed in
      between.  Hence a single write-access check can be used for both PUD
      and PMD.
      
      Link: http://lkml.kernel.org/r/20170823082839.1812-1-khandual@linux.vnet.ibm.com
      Signed-off-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fde26bed
    • userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS · 656710a6
      Andrea Arcangeli authored
      
      
      This is an enhancement to avoid a non-cooperative userfaultfd manager
      having to unregister all regions before it can close the uffd after all
      userfaultfd activity has completed.
      
      UFFDIO_UNREGISTER would serialize against handle_userfault() by taking
      the mmap_sem for writing, but we can simply repeat the page fault if we
      detect that the uffd was closed, and so the regular page fault paths
      take over.
      
      Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      656710a6
    • mm: remove useless vma parameter to offset_il_node · 98c70baa
      Laurent Dufour authored
      
      
      While reading the code I found that offset_il_node() has a vm_area_struct
      pointer parameter which is unused.
      
      Link: http://lkml.kernel.org/r/1502899755-23146-1-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98c70baa
    • mm/hmm: fix build when HMM is disabled · de540a97
      Jérôme Glisse authored
      
      
      Combinatorial Kconfig is painful.  With this patch all of the
      combinations below build.
      
      1)
      
      2)
      CONFIG_HMM_MIRROR=y
      
      3)
      CONFIG_DEVICE_PRIVATE=y
      
      4)
      CONFIG_DEVICE_PUBLIC=y
      
      5)
      CONFIG_HMM_MIRROR=y
      CONFIG_DEVICE_PUBLIC=y
      
      6)
      CONFIG_HMM_MIRROR=y
      CONFIG_DEVICE_PRIVATE=y
      
      7)
      CONFIG_DEVICE_PRIVATE=y
      CONFIG_DEVICE_PUBLIC=y
      
      8)
      CONFIG_HMM_MIRROR=y
      CONFIG_DEVICE_PRIVATE=y
      CONFIG_DEVICE_PUBLIC=y
      
      Link: http://lkml.kernel.org/r/20170826002149.20919-1-jglisse@redhat.com
      Reported-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de540a97
    • mm/hmm: avoid bloating arch that do not make use of HMM · 6b368cd4
      Jérôme Glisse authored
      
      
      This moves all the new code, including the new page migration helpers,
      behind a kernel Kconfig option so that there is no code bloat for
      architectures or users that do not want to use HMM or any of its
      associated features.
      
      arm allyesconfig (first without this patchset, then with the patchset
      and this patch):
         text	   data	    bss	    dec	    hex	filename
      83721896	46511131	27582964	157815991	96814b7	../without/vmlinux
      83722364	46511131	27582964	157816459	968168b	vmlinux
      
      [jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
        Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
      [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
        Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b368cd4
    • mm/hmm: add new helper to hotplug CDM memory region · d3df0a42
      Jérôme Glisse authored
      
      
      Unlike unaddressable memory, coherent device memory has a real resource
      associated with it on the system (as the CPU can address it).  Add a
      new helper to hotplug such memory within the HMM framework.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-20-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3df0a42
    • mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse authored
      
      
      Platforms with an advanced system bus (like CAPI or CCIX) allow device
      memory to be accessible from the CPU in a cache-coherent fashion.  Add
      a new type of ZONE_DEVICE to represent such memory.  The use cases are
      the same as for un-addressable device memory, but without all the
      corner cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df6ad698
    • mm/migrate: allow migrate_vma() to alloc new page on empty entry · 8315ada7
      Jérôme Glisse authored
      
      
      This allows callers of migrate_vma() to allocate a new page for an
      empty CPU page table entry (pte_none or backed by the zero page).  This
      is only for anonymous memory, and no new page will be instantiated if
      the userfaultfd is armed.
      
      This is useful for device drivers that want to migrate a range of
      virtual addresses and would rather allocate new memory than having to
      fault later on.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8315ada7
    • mm/migrate: support un-addressable ZONE_DEVICE page in migration · a5430dda
      Jérôme Glisse authored
      
      
      Allow unmapping and restoring the special swap entries of
      un-addressable ZONE_DEVICE memory.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5430dda
    • mm/migrate: migrate_vma() unmap page from vma while collecting pages · 8c3328f1
      Jérôme Glisse authored
      
      
      The common case for migration of a virtual address range is that pages
      are mapped only once, inside the vma in which migration is taking
      place.  Because we already walk the CPU page table for that range, we
      can directly do the unmap there and set up the special migration swap
      entries.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c3328f1
    • mm/migrate: new memory migration helper for use with device memory · 8763cb45
      Jérôme Glisse authored
      
      
      This patch adds a new memory migration helper, which migrates memory
      backing a range of virtual addresses of a process to different memory
      (which can be allocated through a special allocator).  It differs from
      NUMA migration by working on a range of virtual addresses, and thus by
      doing migration in chunks that can be large enough to use a DMA engine
      or a special copy-offloading engine.
      
      Expected users are anyone with heterogeneous memory where different
      memories have different characteristics (latency, bandwidth, ...).  As
      an example, IBM platforms with a CAPI bus can make use of this feature
      to migrate between regular memory and CAPI device memory.  New CPU
      architectures with a pool of high-performance memory not managed as a
      cache but presented as regular memory (while being faster and with
      lower latency than DDR) will also be prime users of this patch.
      
      Migration to private device memory will be useful for devices that
      have a large pool of such memory, like GPUs; NVIDIA plans to use HMM
      for that.
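      The helper added here has roughly this shape (per
      include/linux/migrate.h of this series; treat the exact prototype as a
      recollection rather than a reference):
      
      	int migrate_vma(const struct migrate_vma_ops *ops,
      			struct vm_area_struct *vma,
      			unsigned long start,
      			unsigned long end,
      			unsigned long *src,
      			unsigned long *dst,
      			void *private);
      
      where the ops provide the driver's alloc_and_copy() and
      finalize_and_map() callbacks that do the actual device-side copy.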
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8763cb45
    • mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Jérôme Glisse authored
      
      
      Introduce a new migration mode that allows offloading the copy to a
      device DMA engine.  This changes the workflow of migration, and not all
      address_space migratepage callbacks can support it.
      
      This is intended to be used by migrate_vma(), which itself is used for
      things like HMM (see include/linux/hmm.h).
      
      No additional per-filesystem migratepage testing is needed.  I disabled
      MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks and
      added comments in those to explain why (part of this patch).  To be
      clear: any callback that wishes to support this new mode needs to be
      aware of the difference in the migration flow from the other modes.
      
      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...), and for DMA to be effective you want to copy multiple
      pages in one DMA operation.  But in the problematic cases you cannot
      easily hold the extra lock across multiple calls to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
      
      To support MIGRATE_SYNC_NO_COPY in the problematic cases we would need
      a new callback, migratepages() for instance, that deals with multiple
      pages in one transaction.
      
      Because the problematic cases are not important for current usage, I
      did not want to complicate this patchset even more for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2916ecc0
    • mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory · 858b54da
      Jérôme Glisse authored
      
      
      This introduces a dummy HMM device class so a device driver can use it
      to create an hmm_device for the sole purpose of registering device
      memory.  It is useful for device drivers that want to manage multiple
      physical device memories under the same struct device umbrella.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-13-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      858b54da
    • mm/hmm/devmem: device memory hotplug using ZONE_DEVICE · 4ef589dc
      Jérôme Glisse authored
      
      
      This introduces a simple struct and associated helpers for device
      drivers to use when hotplugging un-addressable device memory as
      ZONE_DEVICE.  It will find an unused physical address range and trigger
      memory hotplug for it, which allocates and initializes struct pages for
      the device memory.
      
      A device driver should use these helpers during device initialization
      to hotplug the device memory.  It should only need to remove the memory
      once the device is going offline (shutdown or hot-remove).  There
      should not be any userspace API to hotplug memory, except maybe for a
      host device driver to allow adding more memory to a guest device
      driver.
      
      The device's memory is managed by the device driver; HMM only provides
      helpers to that effect.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-12-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Signed-off-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ef589dc
    • mm/memcontrol: support MEMORY_DEVICE_PRIVATE · c733a828
      Jérôme Glisse authored
      
      
      HMM pages (private or public device pages) are ZONE_DEVICE pages and
      thus need special handling when it comes to lru or refcount.  This
      patch makes sure that memcontrol properly handles them when it faces
      them.  Those pages are used like regular pages in a process address
      space, either as anonymous pages or as file-backed pages, so from the
      memcg point of view we want to handle them like regular pages, for now
      at least.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c733a828
    • mm/memcontrol: allow to uncharge page without using page->lru field · a9d5adee
      Jérôme Glisse authored
      
      
      HMM pages (private or public device pages) are ZONE_DEVICE pages, and
      thus you cannot use the page->lru field of those pages.  This patch
      rearranges the uncharge path to allow a single page to be uncharged
      without modifying the lru field of the struct page.
      
      There is no change to the memcontrol logic; it is the same as it was
      before this patch.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9d5adee
    • mm/ZONE_DEVICE: special case put_page() for device private pages · 7b2d55d2
      Jérôme Glisse authored
      
      
      A ZONE_DEVICE page that reaches a refcount of 1 is free, ie it no
      longer has any user.  For device private pages this is important to
      catch, and thus we need to special-case put_page() for this.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-9-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b2d55d2
    • mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory · 5042db43
      Jérôme Glisse authored
      
      
      HMM (heterogeneous memory management) needs struct page to support
      migration from system main memory to device memory.  The reasons for
      HMM and migration to device memory are explained in the HMM core patch.
      
      This patch deals with device memory that is un-addressable (ie the CPU
      cannot access it).  Hence we do not want those struct pages to be
      managed like regular memory.  That is why we extend ZONE_DEVICE to
      support different types of memory.
      
      A persistent memory type is defined for the existing users of
      ZONE_DEVICE, and a new device un-addressable type is added for the
      un-addressable memory type.  There is a clear separation between what
      is expected from each memory type, and existing users of ZONE_DEVICE
      are unaffected by the new requirements and the new use of the
      un-addressable type.  All type-specific code paths are protected by
      tests against the memory type.
      
      Because the memory is un-addressable, we use a new special swap type
      for when a page is migrated to device memory (this reduces the maximum
      number of swap files).
      
      The two main additions beside the memory type to ZONE_DEVICE are two
      callbacks.  The first one, page_free(), is called whenever the page
      refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
      never reaches a refcount of 0).  This allows the device driver to
      manage its memory and the associated struct pages.
      
      The second callback, page_fault(), happens when there is a CPU access
      to an address that is backed by a device page (which is un-addressable
      by the CPU).  This callback is responsible for migrating the page back
      to system main memory.  The device driver cannot block migration back
      to system memory; HMM makes sure that such pages cannot be pinned into
      device memory.
      
      If the device is in some error condition and cannot migrate memory
      back, then a CPU page fault to device memory should end with SIGBUS.
      
      [arnd@arndb.de: fix warning]
        Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5042db43
    • mm/memory_hotplug: introduce add_pages · 3072e413
      Michal Hocko authored
      
      
      There are new users of memory hotplug emerging.  Some of them require
      different subset of arch_add_memory.  There are some which only require
      allocation of struct pages without mapping those pages to the kernel
      address space.  We currently have __add_pages for that purpose.  But this
      is rather lowlevel and not very suitable for the code outside of the
      memory hotplug.  E.g.  x86_64 wants to update max_pfn which should be done
      by the caller.  Introduce add_pages() which should care about those
      details if they are needed.  Each architecture should define its
      implementation and select CONFIG_ARCH_HAS_ADD_PAGES.  All others use the
      currently existing __add_pages.
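      A sketch of an architecture implementation (x86_64 flavour, hedged; the
      arch selects CONFIG_ARCH_HAS_ADD_PAGES and handles arch details such as
      max_pfn that __add_pages() alone does not):
      
      	int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
      		      bool want_memblock)
      	{
      		int ret;
      
      		ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
      		WARN_ON_ONCE(ret);
      
      		/* update max_pfn, max_low_pfn and high_memory */
      		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
      					  nr_pages << PAGE_SHIFT);
      
      		return ret;
      	}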
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3072e413
    • Jérôme Glisse's avatar
      mm/hmm/mirror: device page fault handler · 74eee180
      Jérôme Glisse authored
      
      
      This handles a page fault on behalf of a device driver; unlike
      handle_mm_fault() it does not trigger a migration back to system memory for
      device memory.
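
      A hedged sketch of how a driver's fault path might use such a helper.  The
      hmm_vma_fault() name comes from this series, but the exact parameter list,
      the hmm_pfn_t result array handling and the my_*() helpers below are
      assumptions made for the example:

          /* Sketch: resolve a device-side fault for one virtual address range. */
          static int my_handle_device_fault(struct my_dev *dev,
                                            struct vm_area_struct *vma,
                                            unsigned long start,
                                            unsigned long end,
                                            bool write)
          {
                  unsigned long npages = (end - start) >> PAGE_SHIFT;
                  hmm_pfn_t *pfns;
                  int ret;

                  pfns = kcalloc(npages, sizeof(*pfns), GFP_KERNEL);
                  if (!pfns)
                          return -ENOMEM;

                  /* Fault the range in on the CPU side; unlike a plain
                   * handle_mm_fault() this does not migrate device memory
                   * back to system RAM. */
                  ret = hmm_vma_fault(vma, &dev->range, start, end,
                                      pfns, write, false);
                  if (!ret)
                          my_update_device_page_table(dev, start, pfns, npages);

                  kfree(pfns);
                  return ret;
          }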
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-6-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74eee180
    • Jérôme Glisse's avatar
      mm/hmm/mirror: helper to snapshot CPU page table · da4c3c73
      Jérôme Glisse authored
      
      
      This does not use the existing page table walker because we want to share the
      same code with our page fault handler.
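
      A rough usage sketch; hmm_vma_get_pfns() and hmm_vma_range_done() are the
      helpers from this series, but their exact signatures and the my_*() names
      are assumptions for illustration:

          /* Sketch: snapshot the CPU page table for a range and only commit the
           * result to the device if no concurrent invalidation happened. */
          static int my_snapshot_range(struct my_dev *dev,
                                       struct vm_area_struct *vma,
                                       unsigned long start, unsigned long end,
                                       hmm_pfn_t *pfns)
          {
                  int ret;

                  ret = hmm_vma_get_pfns(vma, &dev->range, start, end, pfns);
                  if (ret)
                          return ret;

                  /* Program the device page table from the snapshot. */
                  my_program_device_ptes(dev, start, end, pfns);

                  /* Returns false if a concurrent CPU page table update
                   * invalidated the snapshot, in which case we must retry. */
                  if (!hmm_vma_range_done(vma, &dev->range))
                          return -EAGAIN;
                  return 0;
          }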
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-5-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da4c3c73
    • Jérôme Glisse's avatar
      mm/hmm/mirror: mirror process address space on device with HMM helpers · c0b12405
      Jérôme Glisse authored
      
      
      This is heterogeneous memory management (HMM) process address space
      mirroring.  In a nutshell it provides an API to mirror a process address
      space on a device.  This boils down to keeping the CPU and device page
      tables synchronized (we assume that both device and CPU are cache coherent,
      as PCIe devices can be).
      
      This patch provides a simple API for device drivers to achieve address space
      mirroring, so that each device driver does not have to grow its own CPU page
      table walker and its own CPU page table synchronization mechanism.
      
      This is useful for NVidia GPUs >= Pascal, Mellanox IB >= mlx5, and more
      hardware in the future.
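
      A sketch of the driver-facing shape of that API.  hmm_mirror_register() and
      the sync_cpu_device_pagetables() callback are the names used in this series;
      the exact signatures and the my_*() helpers here are assumptions:

          /* Sketch: one mirror per (device, process) pair.  The callback is
           * invoked whenever the CPU page table changes for that process. */
          static void my_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                                    enum hmm_update_type update,
                                                    unsigned long start,
                                                    unsigned long end)
          {
                  struct my_dev_mirror *m = container_of(mirror,
                                                         struct my_dev_mirror,
                                                         mirror);

                  /* Invalidate the matching device page table entries. */
                  my_device_invalidate_range(m->dev, start, end);
          }

          static const struct hmm_mirror_ops my_mirror_ops = {
                  .sync_cpu_device_pagetables = my_sync_cpu_device_pagetables,
          };

          static int my_mirror_process(struct my_dev_mirror *m,
                                       struct mm_struct *mm)
          {
                  m->mirror.ops = &my_mirror_ops;
                  return hmm_mirror_register(&m->mirror, mm);
          }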
      
      [jglisse@redhat.com: fix hmm for "mmu_notifier kill invalidate_page callback"]
        Link: http://lkml.kernel.org/r/20170830231955.GD9445@redhat.com
      Link: http://lkml.kernel.org/r/20170817000548.32038-4-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0b12405
    • Jérôme Glisse's avatar
      mm/hmm: heterogeneous memory management (HMM for short) · 133ff0ea
      Jérôme Glisse authored
      
      
      HMM provides 3 separate types of functionality:
          - Mirroring: synchronize CPU page table and device page table
          - Device memory: allocating struct page for device memory
          - Migration: migrating regular memory to device memory
      
      This patch introduces some common helpers and definitions shared by all
      three of those functionalities.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      133ff0ea
    • Jérôme Glisse's avatar
      hmm: heterogeneous memory management documentation · bffc33ec
      Jérôme Glisse authored
      
      
      Patch series "HMM (Heterogeneous Memory Management)", v25.
      
      Heterogeneous Memory Management (HMM) (description and justification)
      
      Today device drivers expose a dedicated memory allocation API through their
      device file, often relying on a combination of IOCTL and mmap calls.
      The device can only access and use memory allocated through this API.
      This effectively splits the program address space into objects allocated
      for the device and usable by the device, and other regular memory
      (malloc, mmap of a file, shared memory, ...) only accessible by the
      CPU (or in a very limited way by a device, by pinning memory).
      
      Allowing different isolated components of a program to use a device thus
      requires duplicating the input data structures using the device memory
      allocator.  This is reasonable for simple data structures (array, grid,
      image, ...) but it gets extremely complex with advanced data
      structures (list, tree, graph, ...) that rely on a web of memory
      pointers.  This is becoming a serious limitation on the kind of workloads
      that can be offloaded to devices like GPUs.
      
      New industry standards like C++, OpenCL or CUDA are pushing to remove
      this barrier.  This requires a shared address space between the GPU device
      and the CPU so that the GPU can access any memory of a process (while still
      obeying memory protections such as read-only).  This kind of feature is also
      appearing in various other operating systems.
      
      HMM is a set of helpers to facilitate several aspects of address space
      sharing and device memory management.  Unlike existing sharing mechanisms
      that rely on pinning the pages used by a device, HMM relies on mmu_notifier
      to propagate CPU page table updates to the device page table.
      
      Duplicating the CPU page table is only one aspect necessary for efficiently
      using a device like a GPU.  GPU local memory has bandwidth in the terabytes/
      second range, but it is connected to main memory through a system bus
      like PCIe, which is limited to 32 gigabytes/second (PCIe 4.0 x16).  Thus it
      is necessary to allow migration of process memory from main system memory
      to device memory.  The issue is that on platforms that only have PCIe, the
      device memory is not accessible by the CPU with the same properties as
      main memory (cache coherency, atomic operations, ...).
      
      To allow migration from main memory to device memory, HMM provides a set of
      helpers to hotplug device memory as a new type of ZONE_DEVICE memory which
      is un-addressable by the CPU but still has struct pages representing it.
      This allows most of the core kernel logic that deals with process memory to
      stay oblivious to the peculiarities of device memory.
      
      When a page backing an address of a process is migrated to device memory,
      the CPU page table entry is set to a new specific swap entry.  A CPU access
      to such an address triggers a migration back to system memory, just as if
      the page had been swapped out to disk.  HMM also blocks anyone from pinning
      a ZONE_DEVICE page so that it can always be migrated back to system memory
      if the CPU accesses it.  Conversely, HMM does not migrate to device memory
      any page that is pinned in system memory.
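
      The effect on the CPU fault path can be sketched roughly as below.  The
      struct vm_fault fields and the generic swap entry helpers are the existing
      kernel ones; my_swapin_page() and my_device_fault() stand in for the real
      handling and are assumptions made for this illustration:

          /* Heavily simplified sketch of the decision made on a swap PTE. */
          static int my_fault_on_swap_pte(struct vm_fault *vmf,
                                          struct vm_area_struct *vma)
          {
                  swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

                  if (!non_swap_entry(entry))
                          return my_swapin_page(vmf);     /* regular swap */

                  if (is_migration_entry(entry)) {
                          /* Page is in the middle of a migration: wait. */
                          migration_entry_wait(vma->vm_mm, vmf->pmd,
                                               vmf->address);
                          return 0;
                  }

                  /*
                   * Device-private entry: the page sits in un-addressable
                   * device memory, so ask the owning driver to migrate it
                   * back to system memory; a failure surfaces as SIGBUS.
                   */
                  return my_device_fault(vma, vmf->address, entry,
                                         vmf->flags, vmf->pmd);
          }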
      
      To allow efficient migration between device memory and main memory, a new
      migrate_vma() helper is added with this patchset.  It makes it possible to
      leverage the device DMA engine to perform the copy operation.
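
      A possible calling pattern for the new helper.  The ops-based shape
      (alloc_and_copy() doing the copy to freshly allocated destination pages,
      finalize_and_map() committing the result) follows this series, but the
      precise signatures and the my_*() helpers are assumptions for the sketch:

          static void my_alloc_and_copy(struct vm_area_struct *vma,
                                        const unsigned long *src,
                                        unsigned long *dst,
                                        unsigned long start,
                                        unsigned long end,
                                        void *private)
          {
                  /* Allocate a device page for each migrating entry in src[]
                   * and kick the device DMA engine to copy the data over,
                   * recording the chosen destination pages in dst[]. */
                  my_device_dma_copy(private, src, dst, start, end);
          }

          static void my_finalize_and_map(struct vm_area_struct *vma,
                                          const unsigned long *src,
                                          const unsigned long *dst,
                                          unsigned long start,
                                          unsigned long end,
                                          void *private)
          {
                  /* The copy is done: update device-side mappings/bookkeeping. */
                  my_device_map_range(private, dst, start, end);
          }

          static const struct migrate_vma_ops my_migrate_ops = {
                  .alloc_and_copy   = my_alloc_and_copy,
                  .finalize_and_map = my_finalize_and_map,
          };

          /* Migrate one range of a process's address space to device memory. */
          ret = migrate_vma(&my_migrate_ops, vma, start, end,
                            src_pfns, dst_pfns, my_dev);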
      
      This feature will be used by upstream drivers like nouveau and mlx5, and
      probably others in the future (amdgpu is the next suspect in line).  We are
      actively working on nouveau and mlx5 support.  To test this patchset we also
      worked with the NVidia closed-source driver team; they have more resources
      than us to test this kind of infrastructure, and also a bigger and better
      userspace ecosystem with various real industry workloads that can be used
      to test and profile HMM.
      
      The expected workload is a program that builds a data set on the CPU (from
      disk, from network, from sensors, ...).  The program uses a GPU API (OpenCL,
      CUDA, ...) to give hints on memory placement for the input data and also
      for the output buffer.  The program calls the GPU API to schedule a GPU job;
      this happens through a device driver specific ioctl.  All this is hidden
      from the programmer's point of view in the case of a C++ compiler that
      transparently offloads some parts of a program to the GPU.  The program can
      keep doing other work on the CPU while the GPU is crunching numbers.
      
      It is expected that the CPU will not access the same data set as the GPU
      while the GPU is working on it, but this is not mandatory.  In fact we
      expect some small memory objects to be actively accessed by both GPU and CPU
      concurrently, as a synchronization channel and/or for monitoring purposes.
      Such objects will stay in system memory and should not be bottlenecked by
      system bus bandwidth (rare write and read accesses from both CPU and GPU).
      
      As we are relying on the device driver API, HMM does not introduce any new
      syscall nor does it modify any existing ones.  It does not change any
      POSIX semantics or behaviors.  For instance, the child after a fork of a
      process that is using HMM will not be impacted in any way, nor is there any
      data hazard between child COW or parent COW of memory that was migrated to
      the device prior to the fork.
      
      HMM assumes a number of hardware features.  The device must allow its page
      table to be updated at any time (i.e. device jobs must be preemptable).
      The device page table must provide memory protection such as read-only.
      The device must track write accesses (dirty bit).  The device must have a
      minimum granularity that matches PAGE_SIZE (i.e. 4k).
      
      Reviewers (just a hint):
      Patch 1  HMM documentation
      Patch 2  introduce core infrastructure and definitions of HMM, a pretty
               small patch and easy to review
      Patch 3  introduce the mirror functionality of HMM; it relies on
               mmu_notifier and thus someone familiar with that part would be
               in a better position to review
      Patch 4  is a helper to snapshot the CPU page table while synchronizing
               with concurrent page table updates.  Understanding mmu_notifier
               makes review easier.
      Patch 5  is mostly a wrapper around handle_mm_fault()
      Patch 6  add the new add_pages() helper to avoid modifying each arch memory
               hotplug function
      Patch 7  add a new memory type for ZONE_DEVICE and also add all the logic
               in various core mm paths to support this new type.  Dan Williams
               and any core mm contributor are the best people to review each
               half of this patchset
      Patch 8  special case HMM ZONE_DEVICE pages inside put_page(); Kirill and
               Dan Williams are the best people to review this
      Patch 9  allow uncharging a page from a memory cgroup without using the lru
               list field of struct page (best reviewer: Johannes Weiner or
               Vladimir Davydov or Michal Hocko)
      Patch 10 add support to uncharge a ZONE_DEVICE page from a memory cgroup
               (best reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko)
      Patch 11 add a helper to hotplug un-addressable device memory as a new type
               of ZONE_DEVICE memory (the new type introduced in patch 3 of this
               series).  This is boilerplate code around memory hotplug, and it
               also picks a free range of physical addresses for the device memory.
               Note that the physical addresses do not point to anything (at least
               as far as the kernel knows).
      Patch 12 introduce a new hmm_device class as a helper for device drivers
               that want to expose multiple device memories under a common fake
               device driver.  This is useful for multi-GPU configurations.
               Anyone familiar with the device driver infrastructure can review
               this.  Boilerplate code really.
      Patch 13 add a new migrate mode.  Anyone familiar with page migration is
               welcome to review.
      Patch 14 introduce a new migration helper (migrate_vma()) that allows
               migrating a range of virtual addresses of a process using the
               device DMA engine to perform the copy.  It is not limited to copies
               from and to the device but can also copy between any kind of source
               and destination memory.  Again, anyone familiar with the migration
               code should be able to verify the logic.
      Patch 15 optimize the new migrate_vma() by unmapping pages while we are
               collecting them.  This can be reviewed by any mm folks.
      Patch 16 add un-addressable memory migration to the helper introduced in
               patch 7; this can be reviewed by anyone familiar with the migration
               code
      Patch 17 add a feature that allows a device to allocate non-present pages on
               the GPU when migrating a range of addresses to device memory.  This
               is a helper for device drivers to avoid having to first allocate
               system memory before migrating to device memory
      Patch 18 add a new kind of ZONE_DEVICE memory for cache coherent device
               memory (CDM)
      Patch 19 add a helper to hotplug CDM memory
      
      Previous patchset posting :
      v1 http://lwn.net/Articles/597289/
      v2 https://lkml.org/lkml/2014/6/12/559
      v3 https://lkml.org/lkml/2014/6/13/633
      v4 https://lkml.org/lkml/2014/8/29/423
      v5 https://lkml.org/lkml/2014/11/3/759
      v6 http://lwn.net/Articles/619737/
      v7 http://lwn.net/Articles/627316/
      v8 https://lwn.net/Articles/645515/
      v9 https://lwn.net/Articles/651553/
      v10 https://lwn.net/Articles/654430/
      v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
      v12 http://www.kernelhub.org/?msg=972982&p=2
      v13 https://lwn.net/Articles/706856/
      v14 https://lkml.org/lkml/2016/12/8/344
      v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
      v16 http://www.spinics.net/lists/linux-mm/msg119814.html
      v17 https://lkml.org/lkml/2017/1/27/847
      v18 https://lkml.org/lkml/2017/3/16/596
      v19 https://lkml.org/lkml/2017/4/5/831
      v20 https://lwn.net/Articles/720715/
      v21 https://lkml.org/lkml/2017/4/24/747
      v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
      v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1404788.html
      v24 https://lwn.net/Articles/726691/
      
      This patch (of 19):
      
      This adds documentation for HMM (Heterogeneous Memory Management).  It
      presents the motivation behind it, the features necessary for it to be
      useful, and gives an overview of how this is implemented.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-2-jglisse@redhat.com
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bffc33ec
    • Naoya Horiguchi's avatar
      mm: memory_hotplug: memory hotremove supports thp migration · 8135d892
      Naoya Horiguchi authored
      
      
      This patch enables thp migration for memory hotremove.
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarZi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8135d892