  1. Apr 03, 2020
    • mm/memcontrol.c: make mem_cgroup_id_get_many() __maybe_unused · c1514c0a
      Vincenzo Frascino authored
      
      
      mem_cgroup_id_get_many() is currently used only when the MMU or MEMCG_SWAP
      configuration options are enabled.  Having both disabled triggers the
      following warning at compile time:
      
        linux/mm/memcontrol.c:4797:13: warning: `mem_cgroup_id_get_many' defined but not used [-Wunused-function]
         static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
      
      Make mem_cgroup_id_get_many() __maybe_unused to address the issue.
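
      A stand-alone sketch (not the kernel code) of what __maybe_unused does:
      it expands to a GCC/Clang attribute that suppresses -Wunused-function
      when a configuration leaves the helper unreferenced.  The helper name
      and body below are made up for illustration.

        #include <stdio.h>

        #define __maybe_unused __attribute__((__unused__))

        /* Only referenced under some hypothetical build configurations. */
        static void __maybe_unused helper_used_only_sometimes(int n)
        {
                printf("n = %d\n", n);
        }

        int main(void)
        {
                /* Build with -Wunused-function: no warning is emitted for
                 * the annotated helper even though nothing calls it here. */
                return 0;
        }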
      
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200305164354.48147-1-vincenzo.frascino@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1514c0a
    • memcg: css_tryget_online cleanups · 8965aa28
      Shakeel Butt authored
      
      
      Currently, multiple locations in the memcg code use css_tryget_online(),
      even though it doesn't matter to the callers whether the cgroup is
      online.  Online used to matter when we had reparenting on offlining and
      we needed a way to prevent new ones from showing up.
      
      The failure case for a couple of these css_tryget_online() usages is to
      fall back to root_mem_cgroup, which makes it possible for some workloads
      to bypass the memcg limits.  For example, creating an inotify group in a
      subcontainer and then deleting that container after moving the process
      to a different container will charge all the event objects allocated for
      that group to the root_mem_cgroup.  So, using css_tryget_online() is
      dangerous for such cases.
      
      Two locations still use the online version: the swapin of an offlined
      memcg's pages and the memcg kmem cache creation.  The kmem cache case
      indeed needs the online version, as the kernel does the reparenting of
      memcg kmem caches.  The swapin case has been left for later, as the
      fallback there is not really that concerning.
      
      With swap accounting enabled, if the memcg of the swapped-out page is
      not online, then the memcg extracted from the given 'mm' will be
      charged, and if 'mm' is NULL then the root memcg will be charged.
      However, I could not find a code path where the given 'mm' will be NULL
      for the swap-in case.
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200302203109.179417-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8965aa28
    • mm: memcontrol: recursive memory.low protection · 8a931f80
      Johannes Weiner authored
      
      
      Right now, the effective protection of any given cgroup is capped by its
      own explicit memory.low setting, regardless of what the parent says.  The
      reasons for this are mostly historical and ease of implementation: to make
      delegation of memory.low safe, effective protection is the min() of all
      memory.low up the tree.
      
      Unfortunately, this limitation makes it impossible to protect an entire
      subtree from another without forcing the user to make explicit protection
      allocations all the way to the leaf cgroups - something that is highly
      undesirable in real life scenarios.
      
      Consider memory in a data center host.  At the cgroup top level, we have a
      distinction between system management software and the actual workload the
      system is executing.  Both branches are further subdivided into individual
      services, job components etc.
      
      We want to protect the workload as a whole from the system management
      software, but that doesn't mean we want to protect and prioritize
      individual workload components with respect to each other.  Their
      memory demand can vary over
      time, and we'd want the VM to simply cache the hottest data within the
      workload subtree.  Yet, the current memory.low limitations force us to
      allocate a fixed amount of protection to each workload component in order
      to get protection from system management software in general.  This
      results in very inefficient resource distribution.
      
      Another concern with mandating downward allocation is that, as the
      complexity of the cgroup tree grows, it gets harder for the lower levels
      to be informed about decisions made at the host-level.  Consider a
      container inside a namespace that in turn creates its own nested tree of
      cgroups to run multiple workloads.  It'd be extremely difficult to
      configure memory.low parameters in those leaf cgroups that on one hand
      balance pressure among siblings as the container desires, while also
      reflecting the host-level protection from e.g.  rpm upgrades, that lie
      beyond one or more delegation and namespacing points in the tree.
      
      It's highly unusual from a cgroup interface POV that nested levels have to
      be aware of and reflect decisions made at higher levels for them to be
      effective.
      
      To enable such use cases and scale configurability for complex trees, this
      patch implements a resource inheritance model for memory that is similar
      to how the CPU and the IO controller implement work-conserving resource
      allocations: a share of a resource allocated to a subtree always applies to
      the entire subtree recursively, while allowing, but not mandating,
      children to further specify distribution rules.
      
      That means that if protection is explicitly allocated among siblings,
      those configured shares are being followed during page reclaim just like
      they are now.  However, if the memory.low set at a higher level is not
      fully claimed by the children in that subtree, the "floating" remainder is
      applied to each cgroup in the tree in proportion to its size.  Since
      reclaim pressure is applied in proportion to size as well, each child in
      that tree gets the same boost, and the effect is neutral among siblings -
      with respect to each other, they behave as if no memory control was
      enabled at all, and the VM simply balances the memory demands optimally
      within the subtree.  But collectively those cgroups enjoy a boost over the
      cgroups in neighboring trees.
      
      E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
      it's not getting a share of the hierarchically assigned resource, just
      that it doesn't claim a fixed amount of it to protect from its siblings.
      
      This allows us to recursively protect one subtree (workload) from another
      (system management), while letting subgroups compete freely among each
      other - without having to assign fixed shares to each leaf, and without
      nested groups having to echo higher-level settings.
      
      The floating protection composes naturally with fixed protection.
      Consider the following example tree:
      
                       A             A: low = 2G
                      / \            A1: low = 1G
                    A1   A2          A2: low = 0G
      
      As outside pressure is applied to this tree, A1 will enjoy a fixed
      protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
      evenly among A1 and A2, coming out to 1.5G and 0.5G.
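
      A minimal stand-alone model of that split (an illustration, not the
      kernel's protection code; equal usage for A1 and A2 is an assumption
      made to reproduce the 1.5G/0.5G numbers): each child keeps its
      explicitly claimed low, and the parent's unclaimed remainder is handed
      out in proportion to the children's usage.

        #include <stdio.h>

        int main(void)
        {
                const double parent_low   = 2.0;            /* A:  low = 2G  */
                const double child_low[2] = { 1.0, 0.0 };   /* A1, A2        */
                const double usage[2]     = { 2.0, 2.0 };   /* assumed equal */
                double unclaimed = parent_low - (child_low[0] + child_low[1]);
                double total_usage = usage[0] + usage[1];
                int i;

                for (i = 0; i < 2; i++) {
                        double e = child_low[i] +
                                   unclaimed * usage[i] / total_usage;
                        printf("A%d effective low = %.1fG\n", i + 1, e);
                }
                return 0;       /* prints 1.5G for A1 and 0.5G for A2 */
        }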
      
      There is a slight risk of regressing theoretical setups where the
      top-level cgroups don't know about the true budgeting and set bogusly high
      "bypass" values that are meaningfully allocated down the tree.  Such
      setups would rely on unclaimed protection to be discarded, and
      distributing it would change the intended behavior.  Be safe and hide the
      new behavior behind a mount option, 'memory_recursiveprot'.
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a931f80
    • mm: memcontrol: clean up and document effective low/min calculations · bc50bcc6
      Johannes Weiner authored
      
      
      The effective protection of any given cgroup is a somewhat complicated
      construct that depends on the ancestor's configuration, siblings'
      configurations, as well as current memory utilization in all these groups.
      It's done this way to satisfy hierarchical delegation requirements while
      also making the configuration semantics flexible and expressive in complex
      real life scenarios.
      
      Unfortunately, all the rules and requirements are sparsely documented, and
      the code is a little too clever in merging different scenarios into a
      single min() expression.  This makes it hard to reason about the
      implementation and avoid breaking semantics when making changes to it.
      
      This patch documents each semantic rule individually and splits out the
      handling of the overcommit case from the regular case.
      
      Michal Koutný also points out that the points of equilibrium as described
      in the existing example scenarios aren't actually accurate.  Delete these
      examples for now to avoid confusion.
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-3-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc50bcc6
    • mm: memcontrol: fix memory.low proportional distribution · 503970e4
      Johannes Weiner authored
      Patch series "mm: memcontrol: recursive memory.low protection", v3.
      
      The current memory.low (and memory.min) semantics require protection to be
      assigned to a cgroup in an uninterrupted chain from the top-level cgroup
      all the way to the leaf.
      
      In practice, we want to protect entire cgroup subtrees from each other
      (system management software vs.  workload), but we would like the VM to
      balance memory optimally *within* each subtree, without having to make
      explicit weight allocations among individual components.  The current
      semantics make that impossible.
      
      They also introduce unmanageable complexity into more advanced resource
      trees.  For example:
      
                host root
                `- system.slice
                   `- rpm upgrades
                   `- logging
                `- workload.slice
                   `- a container
                      `- system.slice
                      `- workload.slice
                         `- job A
                            `- component 1
                            `- component 2
                         `- job B
      
      From a host-level perspective, we would like to protect the outer
      workload.slice subtree as a whole from rpm upgrades, logging etc.  But for
      that to be effective, right now we'd have to propagate it down through the
      container, the inner workload.slice, into the job cgroup and ultimately
      the component cgroups where memory is actually, physically allocated.
      This may cross several tree delegation points and namespace boundaries,
      which make such a setup nearly impossible.
      
      CPU and IO on the other hand are already distributed recursively.  The
      user would simply configure allowances at the host level, and they would
      apply to the entire subtree without any downward propagation.
      
      To enable the above-mentioned usecases and bring memory in line with other
      resource controllers, this patch series extends memory.low/min such that
      settings apply recursively to the entire subtree.  Users can still assign
      explicit shares in subgroups, but if they don't, any ancestral protection
      will be distributed such that children compete freely amongst each other -
      as if no memory control were enabled inside the subtree - but enjoy
      protection from neighboring trees.
      
      In the above example, the user would then be able to configure shares of
      CPU, IO and memory at the host level to comprehensively protect and
      isolate the workload.slice as a whole from system.slice activity.
      
      Patch #1 fixes an existing bug that can give a cgroup tree more protection
      than it should receive as per ancestor configuration.
      
      Patch #2 simplifies and documents the existing code to make it easier to
      reason about the changes in the next patch.
      
      Patch #3 finally implements recursive memory protection semantics.
      
      Because of a risk of regressing legacy setups, the new semantics are
      hidden behind a cgroup2 mount option, 'memory_recursiveprot'.
      
      More details in patch #3.
      
      This patch (of 3):
      
      When memory.low is overcommitted - i.e.  the children claim more
      protection than their shared ancestor grants them - the allowance is
      distributed in proportion to how much each sibling uses their own declared
      protection:
      
      	low_usage = min(memory.low, memory.current)
      	elow = parent_elow * (low_usage / siblings_low_usage)
      
      However, siblings_low_usage is not the sum of all low_usages. It sums
      up the usages of *only those cgroups that are within their memory.low*.
      That means that low_usage can be *bigger* than siblings_low_usage, and
      consequently the total protection afforded to the children can be
      bigger than what the ancestor grants the subtree.
      
      Consider three groups where two are in excess of their protection:
      
        A/memory.low = 10G
        A/A1/memory.low = 10G, memory.current = 20G
        A/A2/memory.low = 10G, memory.current = 20G
        A/A3/memory.low = 10G, memory.current =  8G
        siblings_low_usage = 8G (only A3 contributes)
      
        A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
        A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
        A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
      
        (the 12.5G are capped to the explicit memory.low setting of 10G)
      
      With that, the sum of all awarded protection below A is 30G, when A
      only grants 10G for the entire subtree.
      
      What does this mean in practice? A1 and A2 would still be in excess of
      their 10G allowance and would be reclaimed, whereas A3 would not. As
      they eventually drop below their protection setting, they would be
      counted in siblings_low_usage again and the error would right itself.
      
      When reclaim was applied in a binary fashion (cgroup is reclaimed when
      it's above its protection, otherwise it's skipped) this would actually
      work out just fine. However, since 1bc63fb1 ("mm, memcg: make scan
      aggression always exclude protection"), reclaim pressure is scaled to
      how much a cgroup is above its protection. As a result this
      calculation error unduly skews pressure away from A1 and A2 toward the
      rest of the system.
      
      But why did we do it like this in the first place?
      
      The reasoning behind exempting groups in excess from
      siblings_low_usage was to go after them first during reclaim in an
      overcommitted subtree:
      
        A/memory.low = 2G, memory.current = 4G
        A/A1/memory.low = 3G, memory.current = 2G
        A/A2/memory.low = 1G, memory.current = 2G
      
        siblings_low_usage = 2G (only A1 contributes)
        A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
      
      While the children combined are overcommitting A and are technically
      both at fault, A2 is actively declaring unprotected memory and we
      would like to reclaim that first.
      
      However, while this sounds like a noble goal on the face of it, it
      doesn't make much difference in actual memory distribution: Because A
      is overcommitted, reclaim will not stop once A2 gets pushed back to
      within its allowance; we'll have to reclaim A1 either way. The end
      result is still that protection is distributed proportionally, with A1
      getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
      
      [ If A weren't overcommitted, it wouldn't make a difference since each
        cgroup would just get the protection it declares:
      
        A/memory.low = 2G, memory.current = 3G
        A/A1/memory.low = 1G, memory.current = 1G
        A/A2/memory.low = 1G, memory.current = 2G
      
        With the current calculation:
      
        siblings_low_usage = 1G (only A1 contributes)
        A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
      
        Including excess groups in siblings_low_usage:
      
        siblings_low_usage = 2G
        A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
      
      Simplify the calculation and fix the proportional reclaim bug by
      including excess cgroups in siblings_low_usage.
      
      After this patch, the effective memory.low distribution from the
      example above would be as follows:
      
        A/memory.low = 10G
        A/A1/memory.low = 10G, memory.current = 20G
        A/A2/memory.low = 10G, memory.current = 20G
        A/A3/memory.low = 10G, memory.current =  8G
        siblings_low_usage = 28G
      
        A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
        A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
        A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G
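
      The arithmetic above is simple enough to check with a stand-alone model
      (an illustration of the changelog's formula, not the kernel code):
      compute elow = parent_elow * low_usage / siblings_low_usage, with the
      old rule excluding cgroups above their memory.low from
      siblings_low_usage and the new rule including them, and cap the result
      at each cgroup's own setting.

        #include <stdio.h>

        struct cg { const char *name; double low, current; };

        static double low_usage(const struct cg *c)
        {
                return c->current < c->low ? c->current : c->low;
        }

        static void distribute(const struct cg *kids, int n,
                               double parent_elow, int include_excess)
        {
                double siblings = 0;
                int i;

                for (i = 0; i < n; i++)
                        if (include_excess || kids[i].current <= kids[i].low)
                                siblings += low_usage(&kids[i]);

                for (i = 0; i < n; i++) {
                        double elow = parent_elow * low_usage(&kids[i]) / siblings;

                        if (elow > kids[i].low)   /* cap at the explicit setting */
                                elow = kids[i].low;
                        printf("  %s: elow = %.2fG\n", kids[i].name, elow);
                }
        }

        int main(void)
        {
                const struct cg kids[] = {
                        { "A1", 10, 20 }, { "A2", 10, 20 }, { "A3", 10, 8 },
                };

                printf("old rule, excess cgroups excluded (sums to 30G):\n");
                distribute(kids, 3, 10, 0);     /* 10.00G, 10.00G, 10.00G */
                printf("new rule, excess cgroups included (sums to 10G):\n");
                distribute(kids, 3, 10, 1);     /* 3.57G, 3.57G, 2.86G; the
                                                   changelog rounds these down
                                                   to 3.5G, 3.5G and 2.8G */
                return 0;
        }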
      
      Fixes: 1bc63fb1 ("mm, memcg: make scan aggression always exclude protection")
      Fixes: 23067153 ("mm: memory.low hierarchical behavior")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      503970e4
    • mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge() · 4b13f64d
      Roman Gushchin authored
      
      
      Drop the _memcg suffix from (__)memcg_kmem_(un)charge functions.  It's
      shorter and more obvious.
      
      These are the most basic functions which are just (un)charging the given
      cgroup with the given amount of pages.
      
      Also fix up the corresponding comments.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200109202659.752357-7-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b13f64d
    • mm: memcg/slab: cache page number in memcg_(un)charge_slab() · 9c315e4d
      Roman Gushchin authored
      
      
      There are many places in memcg_charge_slab() and memcg_uncharge_slab()
      which calculate the number of pages to charge, css references to grab,
      etc., depending on the order of the slab page.
      
      Let's simplify the code by calculating it once and caching it in a local
      variable.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200109202659.752357-6-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c315e4d
    • mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg() · 92d0510c
      Roman Gushchin authored
      
      
      These functions are charging the given number of kernel pages to the given
      memory cgroup.  The number doesn't have to be a power of two.  Let's make
      them take an unsigned int nr_pages argument instead of the page
      order.
      
      It makes them look consistent with the corresponding uncharge functions
      and functions like: mem_cgroup_charge_skmem(memcg, nr_pages).
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200109202659.752357-5-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92d0510c
    • mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() · f4b00eab
      Roman Gushchin authored
      
      
      Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
      to better reflect what they are actually doing:
      
      1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
         the current memcg
      
      2) set or clear the PageKmemcg flag
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4b00eab
    • mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments · 50591183
      Roman Gushchin authored
      
      
      Drop the unused page argument and put the memcg pointer in the first
      place.  This makes the function consistent with its peers:
      __memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200109202659.752357-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50591183
    • mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments · 10eaec2f
      Roman Gushchin authored
      
      
      Patch series "mm: memcg: kmem API cleanup", v2.
      
      This patchset aims to clean up the kernel memory charging API.  It doesn't
      bring any functional changes, just removes unused arguments, renames some
      functions and fixes some comments.
      
      Currently it's not obvious which functions are most basic
      (memcg_kmem_(un)charge_memcg()) and which are based on them
      (memcg_kmem_(un)charge()).  The patchset renames these functions and
      removes unused arguments:
      
      TL;DR:
      was:
        memcg_kmem_charge_memcg(page, gfp, order, memcg)
        memcg_kmem_uncharge_memcg(memcg, nr_pages)
        memcg_kmem_charge(page, gfp, order)
        memcg_kmem_uncharge(page, order)
      
      now:
        memcg_kmem_charge(memcg, gfp, nr_pages)
        memcg_kmem_uncharge(memcg, nr_pages)
        memcg_kmem_charge_page(page, gfp, order)
        memcg_kmem_uncharge_page(page, order)
      
      This patch (of 6):
      
      The first argument of memcg_kmem_charge_memcg() and
      __memcg_kmem_charge_memcg() is the page pointer and it's not used.  Let's
      drop it.
      
      Memcg pointer is passed as the last argument.  Move it to the first place
      for consistency with other memcg functions, e.g.
      __memcg_kmem_uncharge_memcg() or try_charge().
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10eaec2f
    • mm: memcg/slab: use mem_cgroup_from_obj() · 4f103c63
      Roman Gushchin authored
      
      
      Sometimes we need to get a memcg pointer from a charged kernel object.
      The right way to get it depends on whether it's a proper slab object or
      it's backed by raw pages (e.g.  it's a vmalloc allocation).  In the first
      case the kmem_cache->memcg_params.memcg indirection should be used; in
      other cases it's just page->mem_cgroup.
      
      To simplify this task and hide the implementation details let's use the
      mem_cgroup_from_obj() helper, which takes a pointer to any kernel object
      and returns a valid memcg pointer or NULL.
      
      Passing a kernel address rather than a pointer to a page will allow this
      helper to be used for per-object (rather than per-page) tracked objects
      in the future.
      
      The caller is still responsible for ensuring that the returned memcg
      isn't going away underneath it: take the rcu read lock, the cgroup
      mutex, etc., depending on the context.
      
      mem_cgroup_from_kmem() defined in mm/list_lru.c is now obsolete and can be
      removed.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Yafang Shao <laoar.shao@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200117203609.3146239-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f103c63
    • mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node · 86daf94e
      Kirill Tkhai authored
      
      
      The shrinker_map may be touched from any cpu (e.g., a bit there may be
      set by a task running anywhere), but kswapd is always bound to a
      specific node.  So allocate the shrinker_map on the related NUMA node to
      respect its NUMA locality.  Also, this follows the generic way we use
      for allocating memcg's per-node data.
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86daf94e
    • mm, memcg: fix build error around the usage of kmem_caches · a87425a3
      Yafang Shao authored
      When I manually set MEMCG_KMEM to 'default n' in init/Kconfig, the below
      error occurs:
      
        mm/slab_common.c: In function 'memcg_slab_start':
        mm/slab_common.c:1530:30: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          return seq_list_start(&memcg->kmem_caches, *pos);
                                      ^
        mm/slab_common.c: In function 'memcg_slab_next':
        mm/slab_common.c:1537:32: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          return seq_list_next(p, &memcg->kmem_caches, pos);
                                        ^
        mm/slab_common.c: In function 'memcg_slab_show':
        mm/slab_common.c:1551:16: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          if (p == memcg->kmem_caches.next)
                        ^
          CC      arch/x86/xen/smp.o
        mm/slab_common.c: In function 'memcg_slab_start':
        mm/slab_common.c:1531:1: warning: control reaches end of non-void function
        [-Wreturn-type]
         }
         ^
        mm/slab_common.c: In function 'memcg_slab_next':
        mm/slab_common.c:1538:1: warning: control reaches end of non-void function
        [-Wreturn-type]
         }
         ^
      
      That's because kmem_caches is defined only when CONFIG_MEMCG_KMEM is set,
      while memcg_slab_start() will use it regardless of whether
      CONFIG_MEMCG_KMEM is defined or not.
      
      By the way, the reason I manually undefined CONFIG_MEMCG_KMEM is to
      verify whether some other code change of mine is still stable when
      CONFIG_MEMCG_KMEM is not set.  Unfortunately, the existing code has
      already been unstable since v4.11.
      
      Fixes: bc2791f8 ("slab: link memcg kmem_caches on their associated memory cgroup")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/1580970260-2045-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a87425a3
    • mm/swap_state.c: use the same way to count page in [add_to|delete_from]_swap_cache · cb774451
      Wei Yang authored
      
      
      add_to_swap_cache() and delete_from_swap_cache() are counterparts, while
      currently they use different ways to count pages.
      
      It doesn't break anything because we only have two sizes for PageAnon, but
      this is confusing and not good practice.
      
      This patch corrects it by making both functions use hpage_nr_pages().
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200315012920.2687-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb774451
    • mm: swap: use smp_mb__after_atomic() to order LRU bit set · 9a9b6cce
      Yang Shi authored
      A memory barrier is needed after setting the LRU bit, but smp_mb() is too
      strong.  Some architectures, e.g. x86, imply a memory barrier with atomic
      operations, so replacing it with smp_mb__after_atomic() sounds better;
      it is a nop on strongly ordered machines and a full memory barrier on
      others.  With this change the vm-scalability cases perform better on
      x86; I saw a total 6% improvement with this patch and the previous inline fix.
      
      The test data (lru-file-readtwice throughput) against v5.6-rc4:
      	mainline	w/ inline fix	w/ both (adding this)
      	150MB		154MB		159MB
      
      Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Link: http://lkml.kernel.org/r/1584500541-46817-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a9b6cce
    • mm: swap: make page_evictable() inline · 1eb6234e
      Yang Shi authored
      When backporting commit 9c4e6b1a ("mm, mlock, vmscan: no more skipping
      pagevecs") to our 4.9 kernel, our test bench noticed around a 10% drop
      with a couple of vm-scalability's test cases (lru-file-readonce,
      lru-file-readtwice and lru-file-mmap-read).  I didn't see that much of a
      drop on my VM (32c-64g-2nodes).  It might be caused by the test
      configuration, which is 32c-256g with NUMA disabled and the tests run in
      the root memcg, so the tests actually stress only one inactive and
      active lru.  That is not very usual in a modern production environment.
      
      That commit made two major changes:
      1. Call page_evictable()
      2. Use smp_mb() to force the PG_lru bit set to be visible
      
      It looks like they contribute most of the overhead.  page_evictable() is
      a function which does a function prologue and epilogue, and it was used
      by the page reclaim path only.  However, lru add is a very hot path, so
      it sounds better to make it inline.  It calls page_mapping(), which is
      not inlined either, but the disassembly shows it doesn't do push and pop
      operations, and it doesn't seem very straightforward to inline it.
      
      Other than this, it seems smp_mb() is not necessary for x86 since
      SetPageLRU is atomic, which enforces a memory barrier already; replace it
      with smp_mb__after_atomic() in the following patch.
      
      With the two fixes applied, the tests get back around 5% on that test
      bench and get back to normal on my VM.  Since the test bench
      configuration is not that usual, and I also saw around a 6% improvement
      on the latest upstream, it sounds good enough IMHO.
      
      Below is the test data (lru-file-readtwice throughput) against v5.6-rc4:
      	mainline	w/ inline fix
                150MB            154MB
      
      With this patch the throughput goes up by 2.67%.  The data with
      smp_mb__after_atomic() is shown in the following patch.
      
      Shakeel Butt did the below test:
      
      On a real machine, limiting the 'dd' to a single node and reading a 100
      GiB sparse file (less than a single node).  Just ran a single instance
      to avoid lru lock contention.  The cmdline used is "dd if=file-100GiB
      of=/dev/null bs=4k".  Ran the cmd 10 times with drop_caches in between
      and measured the time it took.
      
      Without patch: 56.64143 +- 0.672 sec
      
      With patches: 56.10 +- 0.21 sec
      
      [akpm@linux-foundation.org: move page_evictable() to internal.h]
      Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1eb6234e
    • mm/swap_slots.c: assign|reset cache slot by value directly · 2406b76f
      Wei Yang authored
      
      
      Currently we use a temporary pointer, pentry, to transfer and reset the
      swap cache slot, which is a little redundant.  The swap cache slot
      stores the entry value directly; assigning and resetting it by value is
      more straightforward.
      
      Also, this patch merges the else and if, since this is the only case in
      which we refill the swap cache and repeat.
      
      Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200311055352.50574-1-richard.weiyang@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2406b76f
    • mm/swapfile: fix data races in try_to_unuse() · 21820948
      Qian Cai authored
      
      
      si->inuse_pages could be accessed concurrently as noticed by KCSAN,
      
       write to 0xffff98b00ebd04dc of 4 bytes by task 82262 on cpu 92:
        swap_range_free+0xbe/0x230
        swap_range_free at mm/swapfile.c:719
        swapcache_free_entries+0x1be/0x250
        free_swap_slot+0x1c8/0x220
        __swap_entry_free.constprop.19+0xa3/0xb0
        free_swap_and_cache+0x53/0xa0
        unmap_page_range+0x7e0/0x1ce0
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0xe7/0x240
        do_exit+0x598/0xfd0
        do_group_exit+0x8b/0x180
        get_signal+0x293/0x13d0
        do_signal+0x37/0x5d0
        prepare_exit_to_usermode+0x1b7/0x2c0
        ret_from_intr+0x32/0x42
      
       read to 0xffff98b00ebd04dc of 4 bytes by task 82499 on cpu 46:
        try_to_unuse+0x86b/0xc80
        try_to_unuse at mm/swapfile.c:2185
        __x64_sys_swapoff+0x372/0xd40
        do_syscall_64+0x91/0xb05
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The plain reads in try_to_unuse() are outside the si->lock critical
      section, which results in data races that could be dangerous when the
      values are used in a loop.  Fix them by adding READ_ONCE().
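
      A stand-alone sketch of the pattern (a user-space model, not the
      kernel's READ_ONCE() implementation or the actual try_to_unuse() code;
      the variable and function names are made up): without the volatile
      access, the compiler is free to load the value once and spin on a
      stale copy held in a register.

        #include <stdio.h>

        #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

        static unsigned int inuse_pages;   /* written by other threads in
                                              the real code              */

        static void wait_for_drain(void)
        {
                /* A plain read here could be hoisted out of the loop; the
                 * volatile access forces a fresh load on every iteration. */
                while (READ_ONCE(inuse_pages))
                        ;
        }

        int main(void)
        {
                inuse_pages = 0;
                wait_for_drain();
                printf("drained\n");
                return 0;
        }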
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/1582578903-29294-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21820948
    • mm/swap.c: not necessary to export __pagevec_lru_add() · bde07cfc
      Wei Yang authored
      
      
      __pagevec_lru_add() is only used in mm directory now.
      
      Remove the export symbol.
      
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200126011436.22979-1-richardw.yang@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bde07cfc
    • mm/swapfile.c: fix comments for swapcache_prepare · 3eeba135
      Chen Wandun authored
      
      
      It is -EEXIST, not -EBUSY, that __swap_duplicate() returns when there is
      a swap cache; fix the comment accordingly.
      
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200212145754.27123-1-chenwandun@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3eeba135
    • mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path · df3a0a21
      Pingfan Liu authored
      
      
      FOLL_LONGTERM is a special case of FOLL_PIN.  It suggests a pin which is
      going to be given to hardware and can't move.  It would truncate CMA
      permanently and should be excluded.
      
      In the gup slow path,
      __gup_longterm_locked->check_and_migrate_cma_pages() handles
      FOLL_LONGTERM, but the fast path lacks such a check, which means a CMA
      page could leak into a long-term pin.
      
      Place a check in try_grab_compound_head() in the fast path to fix the
      leak; if FOLL_LONGTERM happens on CMA, it will fall back to the slow
      path to migrate the page.
      
      A note about the check: a huge page's subpages have the same migrate
      type, due to allocation either from a free_list[] or from
      alloc_contig_range() with param MIGRATE_MOVABLE.  So it is enough to
      check a single subpage with is_migrate_cma_page(subpage).
      
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Link: http://lkml.kernel.org/r/1584876733-17405-3-git-send-email-kernelfans@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df3a0a21
    • mm/gup: rename nr as nr_pinned in get_user_pages_fast() · 4628b063
      Pingfan Liu authored
      
      
      To better reflect the held state of pages and make the code self-explanatory,
      rename nr as nr_pinned.
      
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Link: http://lkml.kernel.org/r/1584876733-17405-2-git-send-email-kernelfans@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4628b063
    • mm/gup/writeback: add callbacks for inaccessible pages · f28d4363
      Claudio Imbrenda authored
      
      
      With the introduction of protected KVM guests on s390 there is now a
      concept of inaccessible pages.  These pages need to be made accessible
      before the host can access them.
      
      While cpu accesses will trigger a fault that can be resolved, I/O accesses
      will just fail.  We need to add a callback into architecture code for
      places that will do I/O, namely when writeback is started or when a page
      reference is taken.
      
      This is not only to enable paging, file backing etc, it is also necessary
      to protect the host against a malicious user space.  For example a bad
      QEMU could simply start direct I/O on such protected memory.  We do not
      want userspace to be able to trigger I/O errors and thus the logic is
      "whenever somebody accesses that page (gup) or does I/O, make sure that
      this page can be accessed".  When the guest tries to access that page we
      will wait in the page fault handler for writeback to have finished and for
      the page_ref to be the expected value.
      
      On s390x the function is not supposed to fail, so it is ok to use a
      WARN_ON on failure.  If we ever need some more fine-grained handling we can
      tackle this when we know the details.
      
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f28d4363
    • mm: dump_page(): additional diagnostics for huge pinned pages · dc8fb2f2
      John Hubbard authored
      
      
      As part of pin_user_pages() and related API calls, pages are "dma-pinned".
      For the case of compound pages of order > 1, the per-page accounting of
      dma pins is accomplished via the 3rd struct page in the compound page.  In
      order to support debugging of any pin_user_pages()- related problems,
      enhance dump_page() so as to report the pin count in that case.
      
      Documentation/core-api/pin_user_pages.rst is also updated accordingly.
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-13-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc8fb2f2
    • mm: improve dump_page() for compound pages · 6197ab98
      Matthew Wilcox (Oracle) authored
      
      
      There was no protection against a corrupted struct page having an
      implausible compound_head().  Sanity check that a compound page has a head
      within reach of the maximum allocatable page (this will need to be
      adjusted if one of the plans to allocate 1GB pages comes to fruition).  In
      addition,
      
      - Print the mapping pointer using %p instead of %px.  The actual value of
         the pointer can be read out of the raw page dump and using %p gives a
         chance to correlate it with an earlier printk of the mapping pointer
       - Print the mapping pointer from the head page, not the tail page
         (the tail ->mapping pointer may be in use for other purposes, eg part
         of a list_head)
       - Print the order of the page for compound pages
       - Dump the raw head page as well as the raw page
       - Print the refcount from the head page, not the tail page
      
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Co-developed-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-12-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6197ab98
    • selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage · be871411
      John Hubbard authored
      
      
      It's good to have basic unit test coverage of the new FOLL_PIN behavior.
      Fortunately, the gup_benchmark unit test is extremely fast (a few
      milliseconds), so adding it to the run_vmtests suite is going to cause no
      noticeable change in running time.
      
      So, add two new invocations to run_vmtests:
      
      1) Run gup_benchmark with normal get_user_pages().
      
      2) Run gup_benchmark with pin_user_pages().  This is much like the
         first call, except that it sets FOLL_PIN.
      
      Running these two in quick succession also provides a visual comparison of
      the running times, which is convenient.
      
      The new invocations are fairly early in the run_vmtests script, because
      with test suites, it's usually preferable to put the shorter, faster tests
      first, all other things being equal.
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-11-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be871411
    • mm/gup_benchmark: support pin_user_pages() and related calls · 41c45d37
      John Hubbard authored
      
      
      Up until now, gup_benchmark supported testing of the following kernel
      functions:
      
      * get_user_pages(): via the '-U' command line option
      * get_user_pages_longterm(): via the '-L' command line option
      * get_user_pages_fast(): as the default (no options required)
      
      Add test coverage for the new corresponding pin_*() functions:
      
      * pin_user_pages_fast(): via the '-a' command line option
      * pin_user_pages():      via the '-b' command line option
      
      Also, add an option for clarity: '-u' for what is now (still) the default
      choice: get_user_pages_fast().
      
      Also, for the commands that set FOLL_PIN, verify that the pages really are
      dma-pinned, via the new is_dma_pinned() routine.  Those commands are:
      
          PIN_FAST_BENCHMARK     : calls pin_user_pages_fast()
          PIN_BENCHMARK          : calls pin_user_pages()
      
      In between the calls to pin_*() and unpin_user_pages(), check each page:
      if page_maybe_dma_pinned() returns false, then WARN and return.
      
      Do this outside of the benchmark timestamps, so that it doesn't affect
      reported times.
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-10-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41c45d37
    • mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting · 1970dc6f
      John Hubbard authored
      
      
      Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
      unpin_user_pages*(), we need some visibility into whether all of this is
      working correctly.
      
      Add two new fields to /proc/vmstat:
      
          nr_foll_pin_acquired
          nr_foll_pin_released
      
      These are documented in Documentation/core-api/pin_user_pages.rst.  They
      represent the number of pages (since boot time) that have been pinned
      ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
      pin_user_pages*() and unpin_user_pages*().
      
      In the absence of long-running DMA or RDMA operations that hold pages
      pinned, the above two fields will normally be equal to each other.
      
      Also, update Documentation/core-api/pin_user_pages.rst to remove an
      earlier (now confirmed untrue) claim about a performance problem with
      /proc/vmstat, and to rename the new /proc/vmstat entries to the names
      listed here.
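      
      For a quick sanity check from userspace, something like the following
      (a hypothetical standalone helper, not part of the kernel tree) will
      print both counters:
      
          /* Print the two FOLL_PIN counters from /proc/vmstat. */
          #include <stdio.h>
          #include <string.h>
      
          int main(void)
          {
                  char line[128];
                  FILE *f = fopen("/proc/vmstat", "r");
      
                  if (!f) {
                          perror("/proc/vmstat");
                          return 1;
                  }
                  while (fgets(line, sizeof(line), f)) {
                          if (!strncmp(line, "nr_foll_pin_", 12))
                                  fputs(line, stdout);
                  }
                  fclose(f);
                  return 0;
          }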
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1970dc6f
    • John Hubbard's avatar
      mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      John Hubbard authored
      
      
      For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
      scheme tends to overflow too easily: each tail page increments the head
      page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the number
      of huge pages that can be pinned.
      
      This patch removes that limitation, by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach uses the 3rd struct page in the compound page, and order 1
      compound pages only have two pages, so that won't work there.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_maybe_dma_pinned() comment block.
      That is because these compound pages have extra space for tracking
      things, so they get exact pin counts instead of overloading
      page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
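      
      Conceptually, the pinning path now branches roughly like this (a hedged
      sketch with paraphrased helper names; the in-tree code is more involved):
      
          /*
           * Sketch: compound pages of order > 1 get an exact pin count in
           * the 3rd struct page; everything else keeps the bias scheme.
           * hpage_pincount_available() and compound_pincount_ptr() are the
           * assumed helper names used here for illustration.
           */
          static void add_pin(struct page *page, int refs)
          {
                  page = compound_head(page);
      
                  if (hpage_pincount_available(page)) {
                          page_ref_add(page, refs);
                          atomic_add(refs, compound_pincount_ptr(page));
                  } else {
                          page_ref_add(page, refs * GUP_PIN_COUNTING_BIAS);
                  }
          }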
      
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47e29d32
    • John Hubbard's avatar
      mm/gup: track FOLL_PIN pages · 3faa52c0
      John Hubbard authored
      
      
      Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
      implemented via overloading of page->_refcount: pins are added by adding
      GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
      indication of pinning, and it can have false positives (and that's OK).
      Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
      details.
      
      As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
      (typically via pin_user_pages*()) are required to ultimately free such
      pages via unpin_user_page().
      
      Please also note the limitation, discussed in pin_user_pages.rst under the
      "TODO: for 1GB and larger huge pages" section.  (That limitation will be
      removed in a following patch.)
      
      The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
      thought of as "FOLL_GET for DIO and/or RDMA use".
      
      Pages that have been pinned via FOLL_PIN are identifiable via a new
      function call:
      
         bool page_maybe_dma_pinned(struct page *page);
      
      What to do in response to encountering such a page is left to later
      patchsets.  There is discussion about this in [1], [2], [3], and [4].
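      
      A rough sketch of what page_maybe_dma_pinned() boils down to for the
      fuzzy, bias-based case described here (not the verbatim mm.h body; the
      exact-count case is added later in this series):
      
          /*
           * Sketch: a page is "maybe dma-pinned" if its refcount has been
           * bumped by at least GUP_PIN_COUNTING_BIAS.  False positives are
           * possible and acceptable by design.
           */
          static inline bool page_maybe_dma_pinned(struct page *page)
          {
                  return page_ref_count(compound_head(page)) >=
                         GUP_PIN_COUNTING_BIAS;
          }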
      
      This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      [4] LWN kernel index: get_user_pages():
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      [jhubbard@nvidia.com: add kerneldoc]
        Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
      [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
        Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
      [akpm@linux-foundation.org: fix put_compound_head defined but not used]
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3faa52c0
    • John Hubbard's avatar
      mm/gup: require FOLL_GET for get_user_pages_fast() · 94202f12
      John Hubbard authored
      
      
      Internal to mm/gup.c, require that get_user_pages_fast() and
      __get_user_pages_fast() identify themselves by setting FOLL_GET.  This is
      needed so that upcoming patches can make decisions based on whether
      FOLL_PIN, FOLL_GET, both, or neither are set.
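      
      The shape of the change is roughly as follows (a sketch; the shared
      worker's name is an assumption here):
      
          /*
           * Sketch: the fast-GUP entry points tag their requests with
           * FOLL_GET before calling the common worker, so that later code
           * can distinguish FOLL_GET, FOLL_PIN, both, or neither.
           */
          int get_user_pages_fast(unsigned long start, int nr_pages,
                                  unsigned int gup_flags, struct page **pages)
          {
                  return internal_get_user_pages_fast(start, nr_pages,
                                                      gup_flags | FOLL_GET,
                                                      pages);
          }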
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94202f12
    • John Hubbard's avatar
      mm/gup: pass gup flags to two more routines · 3b78d834
      John Hubbard authored
      
      
      In preparation for an upcoming patch, pass the gup flags argument to two
      more routines: put_compound_head() and undo_dev_pagemap().
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b78d834
    • John Hubbard's avatar
      mm: introduce page_ref_sub_return() · 566d774a
      John Hubbard authored
      
      
      An upcoming patch requires subtracting a large chunk of refcounts from a
      page and checking what the resulting refcount is.  This is a little
      different from the usual "check for zero refcount" that many of the page
      ref functions already do.  However, it is similar to a few other routines
      that (like this one) are generally useful for things such as 1-based
      refcounting.
      
      Add page_ref_sub_return(), which subtracts a chunk of refcounts
      atomically and returns an atomic snapshot of the result.
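      
      A minimal sketch of such a helper (the in-tree version would also hook
      into the page_ref tracepoints):
      
          /* Subtract 'nr' from the page refcount and return the new value. */
          static inline int page_ref_sub_return(struct page *page, int nr)
          {
                  return atomic_sub_return(nr, &page->_refcount);
          }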
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-4-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      566d774a
    • John Hubbard's avatar
      mm/gup: pass a flags arg to __gup_device_* functions · 86dfbed4
      John Hubbard authored
      
      
      A subsequent patch requires access to gup flags, so pass the flags
      argument through to the __gup_device_* functions.
      
      Also placate checkpatch.pl by shortening a nearby line.
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86dfbed4
    • John Hubbard's avatar
      mm/gup: split get_user_pages_remote() into two routines · 22bf29b6
      John Hubbard authored
      
      
      Patch series "mm/gup: track FOLL_PIN pages", v6.
      
      This activates tracking of FOLL_PIN pages.  This is in support of fixing
      the get_user_pages()+DMA problem described in [1]-[4].
      
      FOLL_PIN support is now in the mainline Linux tree.  However, the patch to
      use FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA
      test suite failure that involved (I think) page refcount overflows when
      huge pages were used.
      
      This patch definitively solves that kind of overflow problem by adding an
      exact pincount, for compound pages of order > 1, in the 3rd struct page of
      the compound page.  If available, that form of pincounting is used instead
      of the GUP_PIN_COUNTING_BIAS approach.  Thanks again to Jan Kara for that
      idea.
      
      Other interesting changes:
      
      * dump_page(): added one or two new things to report for compound
        pages: head refcount (for all compound pages), and map_pincount (for
        compound pages of order > 1).
      
      * Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
        huge page refcount upper limit problems, and added notes about how it
        works now.  Also added a note about the dump_page() enhancements.
      
      * Added some comments in gup.c and mm.h, to explain that there are two
        ways to count pinned pages: exact (for compound pages of order > 1) and
        fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
      
      ============================================================
      General notes about the tracking patch:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages and [R]DMA activities, as discussed in [1],
      [2], [3], [4] and in a remarkable number of email threads since about
      2017.  :)
      
      In contrast to earlier approaches, the page tracking can be incrementally
      applied to the kernel call sites that, until now, have simply been calling
      get_user_pages() ("gup").  In other words, callers opt in by changing from
      this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
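      
      A slightly fuller sketch of a converted caller (the variable names and
      surrounding error handling here are hypothetical):
      
          /* Hypothetical caller, after conversion to the FOLL_PIN API. */
          struct page *pages[16];
          int i, pinned;
      
          pinned = pin_user_pages_fast(user_addr, 16, FOLL_WRITE, pages);
          if (pinned <= 0)
                  return pinned ? pinned : -EFAULT;
      
          /* ... drive DMA or Direct IO against 'pages' ... */
      
          for (i = 0; i < pinned; i++)
                  unpin_user_page(pages[i]);
      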
      
      ============================================================
      Future steps:
      
      * Convert more subsystems from get_user_pages() to pin_user_pages().
        The first probably needs to be bio/biovecs, because any filesystem
        testing is too difficult without those in place.
      
      * Change VFS and filesystems to respond appropriately when encountering
        dma-pinned pages.
      
      * Work with Ira and others to connect this all up with file system
        leases.
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      
      [4] LWN kernel index: get_user_pages()
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      This patch (of 12):
      
      An upcoming patch requires reusing the implementation of
      get_user_pages_remote().  Split up get_user_pages_remote() into an outer
      routine that checks flags, and an implementation routine that will be
      reused.  This makes subsequent changes much easier to understand.
      
      There should be no change in behavior due to this patch.
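      
      The resulting shape is roughly as follows (a sketch; the inner routine's
      name and the exact flag check are assumptions here):
      
          /* Outer routine: sanity-check flags, then call the implementation. */
          long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
                                     unsigned long start, unsigned long nr_pages,
                                     unsigned int gup_flags, struct page **pages,
                                     struct vm_area_struct **vmas, int *locked)
          {
                  /* FOLL_PIN is reserved for the pin_user_pages*() APIs. */
                  if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
                          return -EINVAL;
      
                  return __get_user_pages_remote(tsk, mm, start, nr_pages,
                                                 gup_flags, pages, vmas, locked);
          }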
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22bf29b6
    • Matthew Wilcox (Oracle)'s avatar
      mm/filemap.c: rewrite pagecache_get_page documentation · 2294b32e
      Matthew Wilcox (Oracle) authored
      
      
       - These were never called PCG flags; they've been called FGP flags since
         their introduction in 2014.
       - The FGP_FOR_MMAP flag was misleadingly documented as if it were an
         alternative to FGP_CREAT instead of an option to it (see the sketch
         below).
       - Rename the 'offset' parameter to 'index'.
       - Capitalisation, formatting, rewording.
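      
      For context, a typical FGP_CREAT-style call looks roughly like this (a
      hedged example; 'mapping' and 'index' are assumed to be in scope):
      
          /*
           * FGP_CREAT allocates and adds a page if none is present at
           * 'index'; flags such as FGP_LOCK (return it locked) or
           * FGP_FOR_MMAP modify that behaviour rather than replace it.
           */
          struct page *page;
      
          page = pagecache_get_page(mapping, index, FGP_LOCK | FGP_CREAT,
                                    mapping_gfp_mask(mapping));
          if (!page)
                  return -ENOMEM;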
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-9-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2294b32e
    • Matthew Wilcox (Oracle)'s avatar
      mm/filemap.c: unexport find_get_entry · 83daf837
      Matthew Wilcox (Oracle) authored
      
      
      No in-tree users (proc, madvise, memcg, mincore) can be built as a module.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-8-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83daf837
    • Matthew Wilcox (Oracle)'s avatar
      mm/page-writeback.c: use VM_BUG_ON_PAGE in clear_page_dirty_for_io · 184b4fef
      Matthew Wilcox (Oracle) authored
      
      
      Dumping the page information in this circumstance helps with debugging.
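      
      In other words, the check becomes something along these lines (sketch):
      
          /*
           * VM_BUG_ON_PAGE() dumps the offending page via dump_page()
           * when the assertion fires (with CONFIG_DEBUG_VM enabled).
           */
          VM_BUG_ON_PAGE(!PageLocked(page), page);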
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-7-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      184b4fef
    • Matthew Wilcox (Oracle)'s avatar
      include/linux/pagemap.h: rename arguments to find_subpage · ec848215
      Matthew Wilcox (Oracle) authored
      
      
      This isn't just a random struct page; it's known to be a head page, and
      calling it head makes the function better self-documenting.  The pgoff_t
      is less confusing if it's named index instead of offset.  Also add a
      couple of comments to explain why we're doing various things.
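      
      After the rename, the helper reads roughly as follows (a paraphrased
      sketch, not the verbatim pagemap.h body):
      
          static inline struct page *find_subpage(struct page *head, pgoff_t index)
          {
                  /* hugetlbfs mappings always want the head page. */
                  if (PageHuge(head))
                          return head;
      
                  /* Index within the compound page, masked to its size. */
                  return head + (index & (compound_nr(head) - 1));
          }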
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-3-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec848215