  1. Oct 09, 2013
    • mm: numa: Copy cpupid on page migration · 7851a45c
      Rik van Riel authored
      
      
      After page migration, the new page has the cpupid unset. This makes
      every fault on a recently migrated page look like a first NUMA fault,
      leading to another page migration.
      
      Copying the cpupid over at page migration time should prevent erroneous
      migrations of recently migrated pages (an illustrative sketch follows
      this entry).
      
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-46-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7851a45c
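      A minimal user-space sketch of the idea in the entry above, assuming a
      hypothetical page descriptor that caches the last faulting cpupid; the
      kernel's real helpers and field names are not reproduced here:

        #include <stdint.h>

        struct page_meta {
            int32_t last_cpupid;   /* last {cpu,pid} to fault on this page, -1 if unset */
        };

        /* Carry the last-fault information over to the new page during migration
         * so the first fault on the copy is not mistaken for a fresh NUMA fault
         * that would immediately trigger another migration. */
        static void copy_last_cpupid(struct page_meta *newpage,
                                     const struct page_meta *oldpage)
        {
            newpage->last_cpupid = oldpage->last_cpupid;
        }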
    • sched/numa: Report a NUMA task group ID · e29cf08b
      Mel Gorman authored
      
      
      It is desirable to model from userspace how the scheduler groups tasks
      over time. This patch adds an ID to the numa_group and reports it via
      /proc/PID/status.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e29cf08b
    • sched/numa: Use {cpu, pid} to create task groups for shared faults · 8c8a743c
      Peter Zijlstra authored
      
      
      While parallel applications tend to align their data on the cache
      boundary, they tend not to align on the page or THP boundary.
      Consequently, tasks that partition their data can still "false-share"
      pages, presenting a problem for optimal NUMA placement.
      
      This patch uses NUMA hinting faults to chain tasks together into
      numa_groups. As well as storing the NID a task was running on when
      accessing a page, a truncated representation of the faulting PID is
      stored. If a subsequent fault comes from a different PID, it is
      reasonable to assume that the two tasks share a page and are candidates
      for being grouped together (a sketch of this rule follows this entry).
      Note that this patch makes no scheduling decisions based on the
      grouping information.
      
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-44-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8c8a743c
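      A rough model of the grouping rule described in the entry above, with
      illustrative types and masks; in the kernel the equivalent logic lives in
      the NUMA fault path and builds reference-counted numa_group structures:

        #include <stdbool.h>

        #define PID_MASK 0xff   /* only a truncated PID fits in the page flags */

        /* On a hinting fault, compare the truncated PID stored with the page to
         * the faulting task's PID.  A mismatch suggests another task recently
         * touched the same page, so the two are candidates for one numa_group. */
        static bool faults_suggest_sharing(int page_last_pid_bits, int faulting_pid)
        {
            return page_last_pid_bits != (faulting_pid & PID_MASK);
        }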
    • mm: numa: Change page last {nid,pid} into {cpu,pid} · 90572890
      Peter Zijlstra authored
      
      
      Change the per page last fault tracking to use cpu,pid instead of
      nid,pid. This will allow us to try and look up the alternate task more
      easily. Note that even though it is the CPU that is stored in the page
      flags, the mpol_misplaced decision is still based on the node (see the
      encoding sketch after this entry).
      
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
      [ Fixed build failure on 32-bit systems. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90572890
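      An illustrative sketch of packing a CPU number and a truncated PID into a
      small integer, with made-up field widths; the real layout in the kernel
      depends on the configured limits, and the node is derived from the CPU
      (e.g. via cpu_to_node()) rather than stored directly:

        #include <stdint.h>

        #define PID_BITS 8
        #define CPU_BITS 10
        #define PID_MASK ((1u << PID_BITS) - 1)
        #define CPU_MASK ((1u << CPU_BITS) - 1)

        static inline uint32_t make_cpupid(unsigned int cpu, unsigned int pid)
        {
            return ((cpu & CPU_MASK) << PID_BITS) | (pid & PID_MASK);
        }

        static inline unsigned int cpupid_to_cpu(uint32_t cpupid)
        {
            return (cpupid >> PID_BITS) & CPU_MASK;
        }

        static inline unsigned int cpupid_to_pid(uint32_t cpupid)
        {
            return cpupid & PID_MASK;
        }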
    • sched/numa: Fix placement of workloads spread across multiple nodes · e1dda8a7
      Rik van Riel authored
      
      
      The load balancer will spread workloads across multiple NUMA nodes,
      in order to balance the load on the system. This means that sometimes
      a task's preferred node has available capacity, but moving the task
      there will not succeed, because that would create too large an imbalance.
      
      In that case, other NUMA nodes need to be considered.
      
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-42-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e1dda8a7
    • sched/numa: Favor placing a task on the preferred node · 2c8a50aa
      Mel Gorman authored
      
      
      A task's preferred node is selected based on the number of faults
      recorded for a node, but task_numa_migrate() conducts a global search
      regardless of the preferred nid. This patch checks whether the
      preferred nid has capacity and, if so, searches for a CPU within that
      node. This avoids a global search when the preferred node is not
      overloaded.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-41-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c8a50aa
    • sched/numa: Use a system-wide search to find swap/migration candidates · fb13c7ee
      Mel Gorman authored
      
      
      This patch implements a system-wide search for swap/migration candidates
      based on total NUMA hinting faults. It has a balance limit, but it does
      not properly consider total node balance.
      
      In the old scheme a task selected a preferred node based on the highest
      number of private faults recorded on the node. In this scheme, the preferred
      node is based on the total number of faults. If the preferred node for a
      task changes then task_numa_migrate will search the whole system looking
      for tasks to swap with that would improve both the overall compute
      balance and minimise the expected number of remote NUMA hinting faults.
      
      Note that there is no guarantee that the node the source task is placed
      on by task_numa_migrate() has any relationship to the newly selected
      task->numa_preferred_nid due to compute overloading.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      [ Do not swap with tasks that cannot run on source cpu. ]
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [ Fixed compiler warning on UP. ]
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-40-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb13c7ee
    • sched/numa: Introduce migrate_swap() · ac66f547
      Peter Zijlstra authored
      
      
      Use the new stop_two_cpus() to implement migrate_swap(), a function that
      flips two tasks between their respective cpus.
      
      I'm fairly sure there's a less crude way than employing the stop_two_cpus()
      method, but everything I tried either got horribly fragile and/or complex. So
      keep it simple for now.
      
      The notable detail is how we 'migrate' tasks that aren't runnable
      anymore. We'll make it appear like we migrated them before they went to
      sleep. The sole difference is the previous cpu in the wakeup path, so we
      override this.
      
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac66f547
    • stop_machine: Introduce stop_two_cpus() · 1be0bd77
      Peter Zijlstra authored
      
      
      Introduce stop_two_cpus() in order to allow controlled swapping of two
      tasks. It repurposes the stop_machine() state machine but only stops
      the two CPUs in question, which we can do with on-stack structures,
      avoiding machine-wide synchronization issues.
      
      The ordering of CPUs is important to avoid deadlocks. If unordered, two
      CPUs calling stop_two_cpus() on each other simultaneously would attempt
      to queue in the opposite order on each CPU, causing an AB-BA style
      deadlock. By always having the lowest-numbered CPU do the queueing of
      work, we guarantee that work is always queued in the same order, and
      deadlocks are avoided (a minimal sketch follows this entry).
      
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      [ Implemented deadlock avoidance. ]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Link: http://lkml.kernel.org/r/1381141781-10992-38-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1be0bd77
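      A minimal sketch of the deadlock-avoidance ordering described above, using
      plain mutexes in place of the kernel's stopper machinery; all names are
      illustrative and the mutexes are assumed to be initialised at startup:

        #include <pthread.h>

        #define NR_CPUS_MODEL 64
        static pthread_mutex_t stopper_lock[NR_CPUS_MODEL];   /* pthread_mutex_init() at startup */

        /* Always queue on the lower-numbered CPU first.  Two CPUs calling this
         * against each other then take the locks in the same order, so the
         * AB-BA deadlock described above cannot occur. */
        static void stop_two_cpus_model(int cpu1, int cpu2)
        {
            if (cpu2 < cpu1) {
                int tmp = cpu1;
                cpu1 = cpu2;
                cpu2 = tmp;
            }
            pthread_mutex_lock(&stopper_lock[cpu1]);
            pthread_mutex_lock(&stopper_lock[cpu2]);
            /* ... run the paired stopper work on both CPUs ... */
            pthread_mutex_unlock(&stopper_lock[cpu2]);
            pthread_mutex_unlock(&stopper_lock[cpu1]);
        }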
    • mm: numa: Trap pmd hinting faults only if we would otherwise trap PTE faults · 25cbbef1
      Mel Gorman authored
      
      
      Base page PMD faulting is meant to batch-handle NUMA hinting faults from
      PTEs. However, even if no PTE faults would ever be handled within a
      range, the kernel still traps PMD hinting faults. This patch avoids the
      overhead.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-37-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      25cbbef1
    • sched/numa: Do not trap hinting faults for shared libraries · 4591ce4f
      Mel Gorman authored
      
      
      NUMA hinting faults will not migrate a shared executable page mapped by
      multiple processes on the grounds that the data is probably in the CPU
      cache already and the page may just bounce between tasks running on multiple
      nodes. Even if the migration is avoided, there is still the overhead of
      trapping the fault, updating the statistics, making scheduler placement
      decisions based on the information etc. If we are never going to migrate
      the page, it is overhead for no gain and worse a process may be placed on
      a sub-optimal node for shared executable pages. This patch avoids trapping
      faults for shared libraries entirely.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-36-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4591ce4f
    • sched/numa: Increment numa_migrate_seq when task runs in correct location · 06ea5e03
      Rik van Riel authored
      
      
      When a task is already running on its preferred node, increment
      numa_migrate_seq to indicate that the task is settled if migration is
      temporarily disabled, and memory should migrate towards it.
      
      Signed-off-by: Rik van Riel <riel@redhat.com>
      [ Only increment migrate_seq if migration temporarily disabled. ]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-35-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      06ea5e03
    • sched/numa: Retry migration of tasks to CPU on a preferred node · 6b9a7460
      Mel Gorman authored
      
      
      When a preferred node is selected for a task there is an attempt to migrate
      the task to a CPU there. This may fail, in which case the task will only
      migrate if the active load balancer takes action. This may never happen if
      the conditions are not right. This patch checks at NUMA hinting fault
      time whether another attempt should be made to migrate the task. It will
      only make an attempt once every five seconds (a sketch of the throttle
      follows this entry).
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-34-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6b9a7460
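      A sketch of the retry throttle described above, with hypothetical field
      names; the kernel version keys off jiffies inside the NUMA hinting fault
      path:

        #include <stdbool.h>
        #include <time.h>

        struct task_model {
            time_t numa_migrate_retry;   /* earliest time the next attempt is allowed */
            int    numa_preferred_nid;
            int    current_nid;
        };

        /* Called from the hinting-fault path: if the task is off its preferred
         * node and the back-off has expired, attempt migration again and re-arm. */
        static bool maybe_retry_migration(struct task_model *p, time_t now)
        {
            if (p->current_nid == p->numa_preferred_nid)
                return false;
            if (now < p->numa_migrate_retry)
                return false;
            p->numa_migrate_retry = now + 5;   /* at most once every five seconds */
            return true;                       /* caller attempts the task migration */
        }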
    • sched/numa: Avoid overloading CPUs on a preferred NUMA node · 58d081b5
      Mel Gorman authored
      
      
      This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
      find_idlest_cpu_node has two critical limitations. It does not take the
      scheduling class into account when calculating the load and it is
      unsuitable for use when comparing loads between NUMA nodes.
      
      task_numa_find_cpu uses similar load calculations to wake_affine() when
      selecting the least loaded CPU within a scheduling domain common to the
      source and destination nodes. It avoids causing CPU load imbalances in
      the machine by refusing to migrate if the relative load on the target
      CPU is higher than on the source CPU.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-33-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      58d081b5
    • mm: numa: Limit NUMA scanning to migrate-on-fault VMAs · fc314724
      Mel Gorman authored
      
      
      There is a 90% regression observed with a large Oracle performance test
      on a 4 node system. Profiles indicated that the overhead was due to
      contention on sp_lock when looking up shared memory policies. These
      policies do not have the appropriate flags to allow them to be
      automatically balanced so trapping faults on them is pointless. This
      patch skips VMAs that do not have MPOL_F_MOF set.
      
      [riel@redhat.com: Initial patch]
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-and-tested-by: Joe Mario <jmario@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fc314724
    • sched/numa: Do not migrate memory immediately after switching node · 6fe6b2d6
      Rik van Riel authored
      
      
      The load balancer can move tasks between nodes and does not take NUMA
      locality into account. With automatic NUMA balancing this may result in
      the task's working set being migrated to the new node. However, as the
      fault buffer will still store faults from the old node, the scheduler
      may decide to reset the preferred node and migrate the task back,
      resulting in more migrations.
      
      The ideal would be that the scheduler did not migrate tasks with a heavy
      memory footprint, but this may result in nodes being overloaded. We
      could also discard the fault information on task migration, but this
      would still cause the task's entire working set to be migrated. This
      patch simply avoids migrating the memory for a short time after a task
      is migrated.
      
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6fe6b2d6
    • sched/numa: Set preferred NUMA node based on number of private faults · b795854b
      Mel Gorman authored
      
      
      Ideally it would be possible to distinguish between NUMA hinting faults that
      are private to a task and those that are shared. If treated identically
      there is a risk that shared pages bounce between nodes depending on
      the order they are referenced by tasks. Ultimately what is desirable is
      that task private pages remain local to the task while shared pages are
      interleaved between sharing tasks running on different nodes to give good
      average performance. This is further complicated by THP as even
      applications that partition their data may not be partitioning on a huge
      page boundary.
      
      To start with, this patch assumes that multi-threaded or multi-process
      applications partition their data and that in general the private accesses
      are more important for cpu->memory locality in the general case. Also,
      no new infrastructure is required to treat private pages properly but
      interleaving for shared pages requires additional infrastructure.
      
      To detect private accesses the PID of the last accessing task is required
      but the storage requirements are high. This patch borrows heavily from
      Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
      to encode some bits from the last accessing task in the page flags as
      well as the node information. Collisions will occur but it is better than
      just depending on the node information. Node information is then used to
      determine if a page needs to migrate. The PID information is used to detect
      private/shared accesses. The preferred NUMA node is selected based on where
      the maximum number of approximately private faults were measured. Shared
      faults are not taken into consideration for a few reasons.
      
      First, if there are many tasks sharing the page then they'll all move
      towards the same node. The node will be compute overloaded and then
      scheduled away later only to bounce back again. Alternatively the shared
      tasks would just bounce around nodes because the fault information is
      effectively noise. Either way accounting for shared faults the same as
      private faults can result in lower performance overall.
      
      The second reason is based on a hypothetical workload that has a small
      number of very important, heavily accessed private pages but a large shared
      array. The shared array would dominate the number of faults and be selected
      as a preferred node even though it's the wrong decision.
      
      The third reason is that multiple threads in a process will race each
      other to fault the shared page making the fault information unreliable.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      [ Fix compilation error when !NUMA_BALANCING. ]
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b795854b
    • sched/numa: Remove check that skips small VMAs · 073b5bee
      Mel Gorman authored
      
      
      task_numa_work skips small VMAs. At the time the logic was to reduce the
      scanning overhead which was considerable. It is a dubious hack at best.
      It would make much more sense to cache where faults have been observed
      and only rescan those regions during subsequent PTE scans. Remove this
      hack as motivation to do it properly in the future.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-29-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      073b5bee
    • mm: numa: Scan pages with elevated page_mapcount · 1bc115d8
      Mel Gorman authored
      
      
      Currently automatic NUMA balancing is unable to distinguish between
      false-shared and private pages except by ignoring pages with an elevated
      page_mapcount entirely. This avoids shared pages bouncing between the
      nodes whose tasks are using them, but it ignores quite a lot of data.
      
      This patch kicks away the training wheels in preparation for the
      shared/private page detection added later in the series. The ordering is
      so that the impact of the shared/private detection can be easily
      measured. Note that the patch does not migrate shared, file-backed pages
      within VMAs marked VM_EXEC as these are generally shared library pages.
      Migrating such pages is not beneficial as there is an expectation they
      are read-shared between caches and iTLB and iCache pressure is generally
      low.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1bc115d8
    • sched/numa: Check current->mm before allocating NUMA faults · 9ff1d9ff
      Mel Gorman authored
      
      
      task_numa_placement checks current->mm, but only after buffers for faults
      have already been uselessly allocated. Move the check earlier.
      
      [peterz@infradead.org: Identified the problem]
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-27-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9ff1d9ff
    • sched/numa: Add infrastructure for split shared/private accounting of NUMA hinting faults · ac8e895b
      Mel Gorman authored
      
      
      Ideally it would be possible to distinguish between NUMA hinting faults
      that are private to a task and those that are shared.  This patch prepares
      infrastructure for separately accounting shared and private faults by
      allocating the necessary buffers and passing in relevant information. For
      now, all faults are treated as private and detection will be introduced
      later.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-26-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac8e895b
    • sched/numa: Reschedule task on preferred NUMA node once selected · e6628d5b
      Mel Gorman authored
      
      
      A preferred node is selected based on the node on which the most NUMA
      hinting faults were incurred. There is no guarantee that the task is
      running on that node at the time, so this patch reschedules the task to
      run on the most idle CPU of the selected node. This avoids waiting for
      the balancer to make a decision.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-25-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e6628d5b
    • sched/numa: Resist moving tasks towards nodes with fewer hinting faults · 7a0f3083
      Mel Gorman authored
      
      
      Just as "sched: Favour moving tasks towards the preferred node" favours
      moving tasks towards nodes with a higher number of recorded NUMA hinting
      faults, this patch resists moving tasks towards nodes with lower faults.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-24-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7a0f3083
    • sched/numa: Favour moving tasks towards the preferred node · 3a7053b3
      Mel Gorman authored
      
      
      This patch favours moving tasks towards the NUMA node that recorded a
      higher number of NUMA faults during active load balancing. Ideally this
      is self-reinforcing as the longer the task runs on that node, the more
      faults it should incur, causing task_numa_placement to keep the task
      running on that node. In reality a big weakness is that the node's CPUs
      can be overloaded and it would be more efficient to queue tasks on an
      idle node and migrate to the new node. This would require additional
      smarts in the balancer, so for now the balancer will simply prefer to
      place the task on the preferred node for a number of PTE scans, which is
      controlled by the numa_balancing_settle_count sysctl. Once the
      settle_count number of scans has completed, the scheduler is free to
      place the task on an alternative node if the load is imbalanced (a
      sketch of the locality bias follows this entry).
      
      [srikar@linux.vnet.ibm.com: Fixed statistics]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [ Tunable and use higher faults instead of preferred. ]
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3a7053b3
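      An illustrative sketch of the locality bias, assuming a hypothetical
      per-node fault counter array; a move is treated as improving locality if
      it targets the preferred node or a node with more recorded faults:

        #include <stdbool.h>

        struct task_model {
            const unsigned long *numa_faults;   /* per-node hinting fault counts */
            int numa_preferred_nid;
        };

        static bool move_improves_locality(const struct task_model *p,
                                           int src_nid, int dst_nid)
        {
            if (dst_nid == p->numa_preferred_nid)
                return true;
            return p->numa_faults[dst_nid] > p->numa_faults[src_nid];
        }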
    • sched/numa: Update NUMA hinting faults once per scan · 745d6147
      Mel Gorman authored
      
      
      NUMA hinting fault counts and placement decisions are both recorded in the
      same array which distorts the samples in an unpredictable fashion. The values
      linearly accumulate during the scan and then decay creating a sawtooth-like
      pattern in the per-node counts. It also means that placement decisions are
      time sensitive. At best it means that it is very difficult to state that
      the buffer holds a decaying average of past faulting behaviour. At worst,
      it can confuse the load balancer if it sees one node with an artificially
      high count due to very recent faulting activity and may create a bouncing
      effect.
      
      This patch adds a second array. numa_faults stores the historical data
      which is used for placement decisions. numa_faults_buffer holds the
      fault activity during the current scan window. When the scan completes,
      numa_faults decays and the values from numa_faults_buffer are copied
      across (a sketch follows this entry).
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      745d6147
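      A sketch of the two-array scheme described above, with illustrative names:
      at the end of each scan window the long-term counts decay, the per-window
      buffer is folded in, and the buffer is cleared for the next window:

        /* numa_faults holds the decaying history used for placement decisions;
         * numa_faults_buffer collects faults during the current scan window. */
        static void fold_scan_window(unsigned long *numa_faults,
                                     unsigned long *numa_faults_buffer,
                                     int nr_nodes)
        {
            for (int nid = 0; nid < nr_nodes; nid++) {
                numa_faults[nid] >>= 1;                      /* decay old history */
                numa_faults[nid] += numa_faults_buffer[nid]; /* fold in this window */
                numa_faults_buffer[nid] = 0;                 /* start the next window clean */
            }
        }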
    • sched/numa: Select a preferred node with the most numa hinting faults · 688b7585
      Mel Gorman authored
      
      
      This patch selects a preferred node for a task to run on based on its
      NUMA hinting fault counts. This information is later used to migrate
      tasks towards the node during balancing (a selection sketch follows this
      entry).
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      688b7585
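      A sketch of the selection step described above, assuming the per-node
      fault counts introduced by the tracking patch below:

        /* Pick the node on which this task has incurred the most hinting faults. */
        static int pick_preferred_node(const unsigned long *numa_faults, int nr_nodes)
        {
            int best_nid = -1;
            unsigned long best = 0;

            for (int nid = 0; nid < nr_nodes; nid++) {
                if (numa_faults[nid] > best) {
                    best = numa_faults[nid];
                    best_nid = nid;
                }
            }
            return best_nid;   /* -1 means no faults have been recorded yet */
        }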
    • sched/numa: Track NUMA hinting faults on per-node basis · f809ca9a
      Mel Gorman authored
      
      
      This patch tracks what nodes numa hinting faults were incurred on.
      This information is later used to schedule a task on the node storing
      the pages most frequently faulted by the task.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f809ca9a
    • sched/numa: Slow scan rate if no NUMA hinting faults are being recorded · f307cd1a
      Mel Gorman authored
      
      
      NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page
      was migrated. For long-lived but idle processes there may be no faults,
      but the scan rate will remain high and simply waste CPU. This patch slows
      the scan rate for processes that are not trapping faults (a sketch of the
      back-off follows this entry).
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-19-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f307cd1a
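      A sketch of the back-off described above, with illustrative names and
      bounds; when a scan window completes without any recorded hinting faults,
      the scan period is pushed towards its maximum:

        static unsigned int next_scan_period(unsigned int period_ms,
                                             unsigned long faults_this_window,
                                             unsigned int period_max_ms)
        {
            if (faults_this_window == 0) {
                period_ms *= 2;                 /* idle working set: scan less often */
                if (period_ms > period_max_ms)
                    period_ms = period_max_ms;
            }
            return period_ms;
        }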
    • sched/numa: Set the scan rate proportional to the memory usage of the task being scanned · 598f0ec0
      Mel Gorman authored
      
      
      The NUMA PTE scan rate is controlled with a combination of the
      numa_balancing_scan_period_min, numa_balancing_scan_period_max and
      numa_balancing_scan_size. This scan rate is independent of the size
      of the task and as an aside it is further complicated by the fact that
      numa_balancing_scan_size controls how many pages are marked pte_numa and
      not how much virtual memory is scanned.
      
      In combination, it is almost impossible to meaningfully tune the min and
      max scan periods, and reasoning about performance is complex when the
      time to complete a full scan is partially a function of the task's memory
      size. This patch alters the semantics of the min and max tunables to be
      about the length of time it takes to complete a scan of a task's occupied
      virtual address space. Conceptually this is a lot easier to understand.
      There is a "sanity" check to ensure the scan rate is never extremely fast,
      based on the amount of virtual memory that should be scanned in a second.
      The default of 2.5G seems arbitrary but it makes the maximum scan rate
      after the patch roughly match the maximum scan rate before the patch was
      applied (a worked sketch follows this entry).
      
      On a similar note, numa_scan_period is in milliseconds and not jiffies.
      Properly placed pages slow the scanning rate, but adding 10 jiffies to
      numa_scan_period means that the rate at which scanning slows depends on
      HZ, which is confusing. Get rid of the jiffies_to_msec conversion and
      treat it as ms.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      598f0ec0
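      A worked sketch of the rescaling with made-up numbers: if a full pass over
      the task's occupied virtual address space should take the configured
      period and each scan window marks scan_size MB, then a 2.5 GB task with a
      256 MB window needs ten windows per pass, so each window gets one tenth
      of the full-pass period:

        static unsigned int scan_delay_ms(unsigned long task_mb,
                                          unsigned long scan_size_mb,
                                          unsigned int full_pass_period_ms)
        {
            unsigned long windows = (task_mb + scan_size_mb - 1) / scan_size_mb;

            if (windows == 0)
                windows = 1;
            return full_pass_period_ms / windows;   /* larger tasks scan windows more often */
        }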
    • sched/numa: Initialise numa_next_scan properly · 7e8d16b6
      Mel Gorman authored
      
      
      Scan delay logic and resets are currently initialised to start scanning
      immediately instead of delaying properly. Initialise them properly at
      fork time and catch when a new mm has been allocated.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7e8d16b6
    • Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" · b726b7df
      Mel Gorman authored
      PTE scanning and NUMA hinting fault handling is expensive, so commit
      5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
      on a new node") deferred the PTE scan until a task had been scheduled on
      another node. The problem is that in the purely shared memory case this
      may never happen and no NUMA hinting fault information will be captured.
      We are not ruling out the possibility that something better can be done
      here but for now, this patch needs to be reverted and depend entirely on
      the scan_delay to avoid punishing short-lived processes.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b726b7df
    • sched/numa: Continue PTE scanning even if migrate rate limited · 9e645ab6
      Peter Zijlstra authored
      
      
      Avoiding marking PTEs pte_numa because a particular NUMA node is migrate
      rate limited seems like a bad idea. Even if this node can't migrate any
      more, other nodes might, and we want up-to-date information to make
      balancing decisions. We already rate limit the actual migrations; this
      should leave enough bandwidth to allow the non-migrating scanning. I
      think it's important we keep up-to-date information if we're going to do
      placement based on it.
      
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9e645ab6
    • sched/numa: Mitigate chance that same task always updates PTEs · 19a78d11
      Peter Zijlstra authored
      
      
      With a trace_printk("working\n"); right after the cmpxchg in
      task_numa_work() we can see that, for a 4 thread process, it is always
      the same task winning the race and doing the protection change.
      
      This is a problem since the task doing the protection change has a
      penalty for taking faults -- it is busy when marking the PTEs. If it is
      always the same task, the ->numa_faults[] statistics get severely skewed.
      
      Avoid this by delaying the task doing the protection change such that
      it is unlikely to win the privilege again (a sketch follows this entry).
      
      Before:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
            thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
            thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
            thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
            thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
            thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
            thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
            thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
            thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
            thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
            thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
            thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
            thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
            thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
            thread 0/0-3232  [022] ....   214.209342: task_numa_work: working
      
      After:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
            thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
            thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
            thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
            thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
            thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
            thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
            thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
            thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
            thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
            thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
            thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
            thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
            thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
            thread 0/3-3256  [024] ....   138.267207: task_numa_work: working
      
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19a78d11
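      A sketch of the mitigation, modelled with an atomic claim plus a per-task
      back-off; the field names are illustrative and do not mirror the kernel's:

        #include <stdatomic.h>
        #include <stdbool.h>

        struct mm_model   { atomic_long next_scan; };   /* shared by all threads */
        struct task_model { long node_stamp; };         /* per-thread */

        /* Only one thread wins the compare-and-swap and does the expensive PTE
         * marking.  Push the winner's own stamp further out so a different
         * thread is likely to win the next round, spreading the fault-taking
         * penalty across the threads. */
        static bool claim_scan(struct mm_model *mm, struct task_model *t,
                               long now, long period)
        {
            long expected = atomic_load(&mm->next_scan);

            if (now < expected)
                return false;
            if (!atomic_compare_exchange_strong(&mm->next_scan, &expected, now + period))
                return false;               /* another thread claimed this window */

            t->node_stamp += 2 * period;    /* penalise the winner's next attempt */
            return true;
        }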
    • mm: numa: Do not migrate or account for hinting faults on the zero page · a1a46184
      Mel Gorman authored
      
      
      The zero page is not replicated between nodes and is often shared between
      processes. The data is read-only and likely to be cached in local CPUs
      if heavily accessed meaning that the remote memory access cost is less
      of a concern. This patch prevents trapping faults on the zero page. For
      tasks using the zero page this will reduce the number of PTE updates,
      TLB flushes and hinting faults (a sketch of the filter follows this
      entry).
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [ Correct use of is_huge_zero_page. ]
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-13-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a1a46184
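      A sketch of the filter, assuming a hypothetical zero-page test; the kernel
      has its own checks for the shared zero page and the huge zero page (e.g.
      is_huge_zero_page()) in the protection-change and fault paths:

        #include <stdbool.h>

        static unsigned long zero_pfn;   /* assumed to be set up at boot */

        static bool pfn_is_shared_zero_page(unsigned long pfn)
        {
            return pfn == zero_pfn;
        }

        /* Skip zero pages entirely: no pte_numa marking, no hinting fault and no
         * migration attempt, since the data is identical from every node. */
        static bool should_mark_pte_numa(unsigned long pfn)
        {
            return !pfn_is_shared_zero_page(pfn);
        }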
    • mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning · f123d74a
      Mel Gorman authored
      
      
      NUMA PTE scanning is expensive both in terms of the scanning itself and
      the TLB flush if there are any updates. The TLB flush is avoided if no
      PTEs are updated but there is a bug where transhuge PMDs are considered
      to be updated even if they were already pmd_numa. This patch addresses
      the problem and TLB flushes should be reduced.
      
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-12-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f123d74a
    • mm: Do not flush TLB during protection change if !pte_present && !migration_entry · e920e14c
      Mel Gorman authored
      
      
      NUMA PTE scanning is expensive both in terms of the scanning itself and
      the TLB flush if there are any updates. Currently non-present PTEs are
      accounted for as an update and incurring a TLB flush where it is only
      necessary for anonymous migration entries. This patch addresses the
      problem and should reduce TLB flushes.
      
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-11-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e920e14c
    • mm: Account for a THP NUMA hinting update as one PTE update · afcae265
      Mel Gorman authored
      
      
      A THP PMD update is accounted for as 512 pages updated in vmstat. This is a
      large difference when estimating the cost of automatic NUMA balancing and
      can be misleading when comparing results that had collapsed versus split
      THP. This patch addresses the accounting issue.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      afcae265
    • mm: Close races between THP migration and PMD numa clearing · a54a407f
      Mel Gorman authored
      
      
      THP migration uses the page lock to guard against parallel allocations
      but there are cases like this still open
      
        Task A					Task B
        ---------------------				---------------------
        do_huge_pmd_numa_page				do_huge_pmd_numa_page
        lock_page
        mpol_misplaced == -1
        unlock_page
        goto clear_pmdnuma
      						lock_page
      						mpol_misplaced == 2
      						migrate_misplaced_transhuge
        pmd = pmd_mknonnuma
        set_pmd_at
      
      During hours of testing, one crashed with weird errors and while I have
      no direct evidence, I suspect something like the race above happened.
      This patch extends the page lock to being held until the pmd_numa is
      cleared to prevent migration starting in parallel while the pmd_numa is
      being cleared. It also flushes the old pmd entry and orders pagetable
      insertion before rmap insertion.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a54a407f
    • mm: numa: Sanitize task_numa_fault() callsites · 8191acbd
      Mel Gorman authored
      
      
      There are three callers of task_numa_fault():
      
       - do_huge_pmd_numa_page():
           Accounts against the current node, not the node where the
           page resides, unless we migrated, in which case it accounts
           against the node we migrated to.
      
       - do_numa_page():
           Accounts against the current node, not the node where the
           page resides, unless we migrated, in which case it accounts
           against the node we migrated to.
      
       - do_pmd_numa_page():
           Accounts not at all when the page isn't migrated, otherwise
           accounts against the node we migrated towards.
      
      This seems wrong to me; all three sites should have the same
      semantics. Furthermore, we should account against where the page
      really is, as we already know where the task is.
      
      So modify all three sites to always account; we did after all receive
      the fault; and always account to where the page is after migration,
      regardless of success.
      
      They all still differ on when they clear the PTE/PMD; ideally that
      would get sorted too.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8191acbd
    • mm: Prevent parallel splits during THP migration · b8916634
      Mel Gorman authored
      
      
      THP migrations are serialised by the page lock but on its own that does
      not prevent THP splits. If the page is split during THP migration then
      the pmd_same checks will prevent page table corruption, but the page unlock
      and other fix-ups can potentially cause corruption. This patch takes the
      anon_vma lock to prevent parallel splits during migration.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b8916634