  1. Sep 09, 2017
    • mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse authored
      Platforms with an advanced system bus (like CAPI or CCIX) allow device
      memory to be accessible from the CPU in a cache-coherent fashion.  Add a
      new type of ZONE_DEVICE to represent such memory.  The use cases are the
      same as for un-addressable device memory, but without all the corner
      cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: allow migrate_vma() to alloc new page on empty entry · 8315ada7
      Jérôme Glisse authored
      This allows callers of migrate_vma() to allocate a new page for an empty
      CPU page table entry (pte_none or backed by the zero page).  This is only
      for anonymous memory, and it won't allow a new page to be instantiated if
      userfaultfd is armed.

      This is useful to device drivers that want to migrate a range of virtual
      addresses and would rather allocate new memory than have to fault later
      on.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: support un-addressable ZONE_DEVICE page in migration · a5430dda
      Jérôme Glisse authored
      Allow unmapping and restoring the special swap entry of un-addressable
      ZONE_DEVICE memory.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: migrate_vma() unmap page from vma while collecting pages · 8c3328f1
      Jérôme Glisse authored
      The common case for migration of a virtual address range is that pages
      are mapped only once, inside the vma in which migration is taking place.
      Because we already walk the CPU page table for that range, we can
      directly do the unmap there and set up the special migration swap entry.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: new memory migration helper for use with device memory · 8763cb45
      Jérôme Glisse authored
      This patch adds a new memory migration helper, which migrates memory
      backing a range of virtual addresses of a process to different memory
      (which can be allocated through a special allocator).  It differs from
      NUMA migration by working on a range of virtual addresses, and thus by
      doing migration in chunks that can be large enough to use a DMA engine or
      a special copy-offloading engine.

      Expected users are anyone with heterogeneous memory where different
      memories have different characteristics (latency, bandwidth, ...).  As an
      example, an IBM platform with a CAPI bus can make use of this feature to
      migrate between regular memory and CAPI device memory.  New CPU
      architectures with a pool of high-performance memory not managed as a
      cache but presented as regular memory (while being faster and with lower
      latency than DDR) will also be prime users of this patch.

      Migration to private device memory will be useful for devices that have
      a large pool of such memory, like GPUs; NVidia plans to use HMM for that.
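
      The idea of migrating a whole virtual address range in one chunk can be
      illustrated with a small userspace toy model.  This is only a sketch, not
      the kernel interface: the flag names, the single copy hook and every
      toy_* identifier are assumptions made for illustration.  It shows the
      three phases (collect candidates, one batched copy call, finalize).

```c
#include <stddef.h>

#define TOY_ENTRY_VALID   0x1UL	/* entry refers to a page */
#define TOY_ENTRY_MIGRATE 0x2UL	/* page was collected and may be moved */

/* Hypothetical driver hook: fill dst[] for the whole range in one call. */
typedef void (*toy_copy_range_t)(const unsigned long *src, unsigned long *dst,
				 size_t npages, void *private);

static size_t toy_migrate_range(toy_copy_range_t copy_range,
				unsigned long *src, unsigned long *dst,
				size_t npages, void *private)
{
	size_t i, migrated = 0;

	/* Collect: mark every populated entry as a migration candidate. */
	for (i = 0; i < npages; i++) {
		dst[i] = 0;
		if (src[i] & TOY_ENTRY_VALID)
			src[i] |= TOY_ENTRY_MIGRATE;
	}

	/* One call for the whole chunk, so the driver can batch the copy. */
	copy_range(src, dst, npages, private);

	/* Finalize: only entries with a valid destination really moved. */
	for (i = 0; i < npages; i++) {
		if ((src[i] & TOY_ENTRY_MIGRATE) && (dst[i] & TOY_ENTRY_VALID))
			migrated++;
		else
			src[i] &= ~TOY_ENTRY_MIGRATE;
	}
	return migrated;
}

/* Example hook: "allocates" a destination for every candidate entry. */
static void toy_copy_all(const unsigned long *src, unsigned long *dst,
			 size_t npages, void *private)
{
	size_t i;
	size_t *calls = private;

	(*calls)++;	/* count how often the driver is asked to copy */
	for (i = 0; i < npages; i++)
		if (src[i] & TOY_ENTRY_MIGRATE)
			dst[i] = TOY_ENTRY_VALID;
}
```

      In this toy, the copy hook runs once per range rather than once per
      page, which is the property that makes a DMA or offload engine
      worthwhile.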
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Jérôme Glisse authored
      Introduce a new migration mode that allows offloading the copy to a
      device DMA engine.  This changes the workflow of migration, and not all
      address_space migratepage callbacks can support this.

      This is intended to be used by migrate_vma(), which itself is used for
      things like HMM (see include/linux/hmm.h).

      No additional per-filesystem migratepage testing is needed.  I disabled
      MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks and
      added comments in those to explain why (part of this patch).  The commit
      message is unclear: it should say that any callback that wishes to
      support this new mode needs to be aware of the difference in the
      migration flow from other modes.

      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...), and for DMA to be effective you want to copy multiple
      pages in one DMA operation.  But in the problematic cases you cannot
      easily hold the extra lock across multiple calls to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
      
      To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
      new callback migratepages() (for instance) that deals with multiple
      pages in one transaction.
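
      The two flows above can be mimicked in a userspace toy.  This is a
      sketch under assumptions, not kernel code: the toy_* names are
      hypothetical, the enum only mirrors the mode names, and returning
      -EINVAL is one plausible reading of "abort" in step 3.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's enum migrate_mode. */
enum migrate_mode {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC,
	MIGRATE_SYNC_NO_COPY,
};

/* Hypothetical page: just a lock flag and a payload. */
struct toy_page {
	int locked;
	int data;
};

typedef int (*toy_migratepage_t)(struct toy_page *dst, struct toy_page *src,
				 enum migrate_mode mode);

/*
 * A callback that needs its own lock around the copy cannot take part in
 * a batched copy, so it refuses the new mode up front.
 */
static int toy_locked_migratepage(struct toy_page *dst, struct toy_page *src,
				  enum migrate_mode mode)
{
	if (mode == MIGRATE_SYNC_NO_COPY)
		return -EINVAL;
	dst->data = src->data;	/* state transfer and copy in one step */
	return 0;
}

/* A callback that only moves state; in NO_COPY mode the caller copies. */
static int toy_simple_migratepage(struct toy_page *dst, struct toy_page *src,
				  enum migrate_mode mode)
{
	if (mode != MIGRATE_SYNC_NO_COPY)
		dst->data = src->data;
	return 0;
}

/* Batched flow: lock all, call callbacks, one "DMA" copy, unlock all. */
static int toy_migrate_batch(toy_migratepage_t cb, struct toy_page *dst,
			     struct toy_page *src, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++)
		src[i].locked = 1;
	for (i = 0; i < n && !ret; i++)
		ret = cb(&dst[i], &src[i], MIGRATE_SYNC_NO_COPY);
	if (!ret)
		for (i = 0; i < n; i++)	/* stands in for one DMA transfer */
			dst[i].data = src[i].data;
	for (i = 0; i < n; i++)
		src[i].locked = 0;
	return ret;
}
```

      Passing toy_locked_migratepage to toy_migrate_batch() fails before any
      data moves, which is the behaviour the message describes for the
      problematic callbacks.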
      
      Because the problematic cases are not important for current usage, I did
      not want to complicate this patchset even more for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory · 858b54da
      Jérôme Glisse authored
      This introduces a dummy HMM device class so device drivers can use it to
      create an hmm_device for the sole purpose of registering device memory.
      It is useful to device drivers that want to manage multiple physical
      device memories under the same struct device umbrella.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-13-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm/devmem: device memory hotplug using ZONE_DEVICE · 4ef589dc
      Jérôme Glisse authored
      This introduces a simple struct and associated helpers for device
      drivers to use when hotplugging un-addressable device memory as
      ZONE_DEVICE.  It will find an unused physical address range and trigger
      memory hotplug for it, which allocates and initializes struct page for
      the device memory.

      Device drivers should use this helper during device initialization to
      hotplug the device memory.  It should only be necessary to remove the
      memory once the device is going offline (shutdown or hotremove).  There
      should not be any userspace API to hotplug memory, except maybe for a
      host device driver to allow adding more memory to a guest device driver.

      The device's memory is managed by the device driver, and HMM only
      provides helpers to that effect.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-12-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: support MEMORY_DEVICE_PRIVATE · c733a828
      Jérôme Glisse authored
      HMM pages (private or public device pages) are ZONE_DEVICE pages and
      thus need special handling when it comes to lru or refcount.  This patch
      makes sure that memcontrol properly handles those when it faces them.
      Those pages are used like regular pages in a process address space,
      either as anonymous pages or as file-backed pages.  So from the memcg
      point of view we want to handle them like regular pages, for now at
      least.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: allow to uncharge page without using page->lru field · a9d5adee
      Jérôme Glisse authored
      HMM pages (private or public device pages) are ZONE_DEVICE pages, and
      thus you cannot use the page->lru field of those pages.  This patch
      rearranges the uncharge path to allow a single page to be uncharged
      without modifying the lru field of the struct page.
      
      There is no change to memcontrol logic, it is the same as it was
      before this patch.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ZONE_DEVICE: special case put_page() for device private pages · 7b2d55d2
      Jérôme Glisse authored
      A ZONE_DEVICE page that reaches a refcount of 1 is free, i.e. it no
      longer has any user.  For device private pages this is important to
      catch, and thus we need to special-case put_page() for this.
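
      The refcount rule can be shown with a small userspace toy.  This is a
      sketch only: struct toy_dev_page, toy_put_page() and the driver hook are
      hypothetical names, and the real put_page() path is more involved.

```c
#include <stdbool.h>

/* Toy stand-in for struct page; every name here is illustrative. */
struct toy_dev_page {
	int refcount;
	bool device_private;
	int free_calls;		/* how often the driver was told "page is free" */
	bool released;		/* regular page handed back to the allocator */
};

/* Hypothetical driver hook, invoked when a device private page is free. */
static void toy_driver_page_free(struct toy_dev_page *page)
{
	page->free_calls++;
}

static void toy_put_page(struct toy_dev_page *page)
{
	int count = --page->refcount;

	if (page->device_private) {
		/* A count of 1 means no user is left for a ZONE_DEVICE page. */
		if (count == 1)
			toy_driver_page_free(page);
		return;
	}
	/* Regular pages are free only once the count drops to 0. */
	if (count == 0)
		page->released = true;
}
```

      The only difference between the two branches is the threshold: 1 for
      the device private page, 0 for a regular page.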
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-9-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory · 5042db43
      Jérôme Glisse authored
      HMM (heterogeneous memory management) needs struct page to support
      migration from system main memory to device memory.  The reasons for HMM
      and migration to device memory are explained with the HMM core patch.

      This patch deals with device memory that is un-addressable (i.e. the CPU
      cannot access it).  Hence we do not want those struct pages to be managed
      like regular memory.  That is why we extend ZONE_DEVICE to support
      different types of memory.

      A persistent memory type is defined for existing users of ZONE_DEVICE,
      and a new device un-addressable type is added for the un-addressable
      memory.  There is a clear separation between what is expected from each
      memory type, and existing users of ZONE_DEVICE are unaffected by the new
      requirements and the new use of the un-addressable type.  All specific
      code paths are protected with a test against the memory type.

      Because the memory is un-addressable, we use a new special swap type for
      when a page is migrated to device memory (this reduces the maximum
      number of swap files).

      The two main additions to ZONE_DEVICE, besides the memory type, are two
      callbacks.  The first one, page_free(), is called whenever the page
      refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
      never reaches a refcount of 0).  This allows the device driver to manage
      its memory and the associated struct page.

      The second callback, page_fault(), is called when there is a CPU access
      to an address that is backed by a device page (which is un-addressable
      by the CPU).  This callback is responsible for migrating the page back to
      system main memory.  The device driver cannot block migration back to
      system memory; HMM makes sure that such pages cannot be pinned into
      device memory.

      If the device is in some error condition and cannot migrate memory back,
      then a CPU page fault to device memory should end with SIGBUS.
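
      The page_fault() contract described above can be modelled in a few
      lines of userspace C.  This is a sketch, not the kernel's fault path:
      all toy_* types and the result enum are assumptions for illustration.

```c
#include <stdbool.h>

enum toy_fault_result { TOY_FAULT_OK, TOY_FAULT_SIGBUS, TOY_FAULT_OTHER };

struct toy_device {
	bool in_error;		/* device cannot migrate memory back */
};

struct toy_pte {
	bool device_private;	/* special swap entry pointing at device memory */
	bool present;		/* CPU can access the page directly */
};

/* Hypothetical driver callback: migrate the page back to system memory. */
static int toy_driver_page_fault(struct toy_device *dev, struct toy_pte *pte)
{
	if (dev->in_error)
		return -1;
	pte->device_private = false;
	pte->present = true;
	return 0;
}

/* CPU touches an address: resolve it, or report why it cannot be resolved. */
static enum toy_fault_result toy_cpu_fault(struct toy_device *dev,
					   struct toy_pte *pte)
{
	if (pte->present)
		return TOY_FAULT_OK;
	if (!pte->device_private)
		return TOY_FAULT_OTHER;	/* ordinary faults are out of scope */
	if (toy_driver_page_fault(dev, pte) == 0)
		return TOY_FAULT_OK;
	return TOY_FAULT_SIGBUS;	/* device could not give the page back */
}
```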
      
      [arnd@arndb.de: fix warning]
        Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: introduce add_pages · 3072e413
      Michal Hocko authored
      There are new users of memory hotplug emerging.  Some of them require a
      different subset of arch_add_memory.  There are some which only require
      allocation of struct pages without mapping those pages to the kernel
      address space.  We currently have __add_pages for that purpose.  But this
      is rather low-level and not very suitable for code outside of memory
      hotplug.  E.g.  x86_64 wants to update max_pfn, which should be done by
      the caller.  Introduce add_pages(), which should take care of those
      details if they are needed.  Each architecture should define its
      implementation and select CONFIG_ARCH_HAS_ADD_PAGES.  All others use the
      currently existing __add_pages.
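
      The config-selected wrapper pattern can be sketched in userspace as
      follows.  This is a toy under assumptions: the toy_* functions and
      globals are hypothetical, the signatures are simplified, and only the
      CONFIG_ARCH_HAS_ADD_PAGES symbol comes from the message above.

```c
/* Toy bookkeeping; the real kernel state is far richer. */
unsigned long toy_max_pfn;
unsigned long toy_pages_added;

/* Low-level helper: only "allocates struct pages" for the range. */
static int toy_lowlevel_add_pages(unsigned long start_pfn,
				  unsigned long nr_pages)
{
	(void)start_pfn;
	toy_pages_added += nr_pages;
	return 0;
}

#ifdef CONFIG_ARCH_HAS_ADD_PAGES
/* Architecture version: also handles arch details such as max_pfn. */
static int toy_add_pages(unsigned long start_pfn, unsigned long nr_pages)
{
	int ret = toy_lowlevel_add_pages(start_pfn, nr_pages);

	if (!ret && start_pfn + nr_pages > toy_max_pfn)
		toy_max_pfn = start_pfn + nr_pages;
	return ret;
}
#else
/* Generic fallback: nothing extra to do beyond the low-level helper. */
static int toy_add_pages(unsigned long start_pfn, unsigned long nr_pages)
{
	return toy_lowlevel_add_pages(start_pfn, nr_pages);
}
#endif
```

      Callers use toy_add_pages() in both builds; compiling with
      -DCONFIG_ARCH_HAS_ADD_PAGES selects the variant that also updates
      toy_max_pfn.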
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm/mirror: device page fault handler · 74eee180
      Jérôme Glisse authored
      This handles page faults on behalf of a device driver.  Unlike
      handle_mm_fault(), it does not trigger migration back to system memory
      for device memory.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-6-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm/mirror: helper to snapshot CPU page table · da4c3c73
      Jérôme Glisse authored
      This does not use the existing page table walker because we want to
      share the same code with our page fault handler.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-5-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm/mirror: mirror process address space on device with HMM helpers · c0b12405
      Jérôme Glisse authored
      This is heterogeneous memory management (HMM) process address space
      mirroring.  In a nutshell, this provides an API to mirror a process
      address space on a device.  This boils down to keeping the CPU and device
      page tables synchronized (we assume that both device and CPU are cache
      coherent, like PCIe devices can be).

      This patch provides a simple API for device drivers to achieve address
      space mirroring, thus sparing each device driver from growing its own
      CPU page table walker and its own CPU page table synchronization
      mechanism.
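
      What "keeping the page tables synchronized" means can be shown with a
      userspace toy.  This is a sketch under assumptions, not the HMM API: the
      flat arrays and all toy_* names are hypothetical.  The device table is
      filled on device faults and cleared before the CPU side changes.

```c
#include <stddef.h>

#define TOY_NPAGES 8

struct toy_mirror {
	unsigned long cpu_pt[TOY_NPAGES];	/* 0 means "not mapped" */
	unsigned long dev_pt[TOY_NPAGES];	/* device's copy of the entries */
};

/* Device fault: copy the CPU entry into the device table. */
static int toy_mirror_fault(struct toy_mirror *m, size_t idx)
{
	if (idx >= TOY_NPAGES || !m->cpu_pt[idx])
		return -1;
	m->dev_pt[idx] = m->cpu_pt[idx];
	return 0;
}

/* Invalidation for [start, end): drop the device entries. */
static void toy_mirror_invalidate(struct toy_mirror *m, size_t start,
				  size_t end)
{
	size_t i;

	for (i = start; i < end && i < TOY_NPAGES; i++)
		m->dev_pt[i] = 0;
}

/* CPU unmap: the device is told first, so it never sees a stale entry. */
static void toy_cpu_unmap(struct toy_mirror *m, size_t start, size_t end)
{
	size_t i;

	toy_mirror_invalidate(m, start, end);
	for (i = start; i < end && i < TOY_NPAGES; i++)
		m->cpu_pt[i] = 0;
}
```

      In this toy the invariant is that a non-zero device entry always equals
      the CPU entry at the same index.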
      
      This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
      hardware in the future.
      
      [jglisse@redhat.com: fix hmm for "mmu_notifier kill invalidate_page callback"]
        Link: http://lkml.kernel.org/r/20170830231955.GD9445@redhat.com
      Link: http://lkml.kernel.org/r/20170817000548.32038-4-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0b12405
    • Jérôme Glisse's avatar
      mm/hmm: heterogeneous memory management (HMM for short) · 133ff0ea
      Jérôme Glisse authored
      HMM provides 3 separate types of functionality:
          - Mirroring: synchronize CPU page table and device page table
          - Device memory: allocating struct page for device memory
          - Migration: migrating regular memory to device memory
      
      This patch introduces some common helpers and definitions shared by all
      three.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      133ff0ea
    • Jérôme Glisse's avatar
      hmm: heterogeneous memory management documentation · bffc33ec
      Jérôme Glisse authored
      Patch series "HMM (Heterogeneous Memory Management)", v25.
      
      Heterogeneous Memory Management (HMM) (description and justification)
      
      Today device drivers expose a dedicated memory allocation API through their
      device file, often relying on a combination of IOCTL and mmap calls.
      The device can only access and use memory allocated through this API.
      This effectively splits the program address space into objects allocated
      for the device and usable by the device, and other regular memory
      (malloc, mmap of a file, shared memory, ...) only accessible by
      the CPU (or in a very limited way by a device by pinning memory).
      
      Allowing different isolated components of a program to use a device thus
      requires duplication of the input data structures using the device memory
      allocator.  This is reasonable for simple data structures (array, grid,
      image, ...) but it gets extremely complex with advanced data
      structures (list, tree, graph, ...) that rely on a web of memory
      pointers.  This is becom...
      bffc33ec
    • Naoya Horiguchi's avatar
      mm: memory_hotplug: memory hotremove supports thp migration · 8135d892
      Naoya Horiguchi authored
      This patch enables thp migration for memory hotremove.
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
      
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8135d892
    • Naoya Horiguchi's avatar
      mm: migrate: move_pages() supports thp migration · e8db67eb
      Naoya Horiguchi authored
      This patch enables thp migration for move_pages(2).
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-10-zi.yan@sent.com
      
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8db67eb
    • Naoya Horiguchi's avatar
      mm: mempolicy: mbind and migrate_pages support thp migration · c8633798
      Naoya Horiguchi authored
      
      
      This patch enables thp migration for mbind(2) and migrate_pages(2).
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8633798
    • Naoya Horiguchi's avatar
      mm: soft-dirty: keep soft-dirty bits over thp migration · ab6e3d09
      Naoya Horiguchi authored
      
      
      The soft-dirty bit is designed to stay tracked across page migration.  This
      patch makes it work in the same manner for thp migration too.
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab6e3d09
    • Zi Yan's avatar
      mm: thp: check pmd migration entry in common path · 84c3fc4e
      Zi Yan authored
      
      
      When THP migration is being used, memory management code needs to handle
      pmd migration entries properly.  This patch uses !pmd_present() or
      is_swap_pmd() (depending on whether pmd_none() needs separate code or
      not) to check pmd migration entries at the places where a pmd entry is
      present.
      
      Since pmd-related code uses split_huge_page(), split_huge_pmd(),
      pmd_trans_huge(), pmd_trans_unstable(), or
      pmd_none_or_trans_huge_or_clear_bad(), this patch:
      
      1. adds pmd migration entry split code in split_huge_pmd(),
      
      2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
      
      3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
      
      Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
      is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
      them.
      
      Until this commit, a pmd entry should be:
      1. pointing to a pte page,
      2. is_swap_pmd(),
      3. pmd_trans_huge(),
      4. pmd_devmap(), or
      5. pmd_none().
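The five cases listed above can be pictured as a classification over a PMD entry. The following is only a toy user-space model, not kernel code: `struct toy_pmd`, `enum toy_pmd_kind` and every `toy_` function are hypothetical names invented for illustration, and the real helpers operate on `pmd_t` and architecture-specific bits. The one idea it tries to capture is that a swap-like PMD (such as a migration entry) is an entry that is neither empty nor present.

```c
#include <stdbool.h>

/* Toy model of a PMD entry: NOT the kernel's pmd_t, just enough state
 * to mirror the five cases in the list. All names are hypothetical. */
struct toy_pmd {
	bool none;    /* entry is empty */
	bool present; /* entry would be treated as present */
	bool huge;    /* entry maps a transparent huge page directly */
	bool devmap;  /* entry maps device memory */
};

enum toy_pmd_kind {
	TOY_PMD_NONE,
	TOY_PMD_SWAP,       /* non-empty and non-present, e.g. a migration entry */
	TOY_PMD_DEVMAP,
	TOY_PMD_TRANS_HUGE,
	TOY_PMD_PTE_TABLE,  /* points to a page of ptes */
};

/* The idea behind is_swap_pmd() in this model: not none and not present. */
static bool toy_is_swap_pmd(struct toy_pmd pmd)
{
	return !pmd.none && !pmd.present;
}

/* Check the non-present cases first, so that later tests such as "huge"
 * are only ever applied to entries that are actually present. */
static enum toy_pmd_kind toy_pmd_classify(struct toy_pmd pmd)
{
	if (pmd.none)
		return TOY_PMD_NONE;
	if (toy_is_swap_pmd(pmd))
		return TOY_PMD_SWAP;
	if (pmd.devmap)
		return TOY_PMD_DEVMAP;
	if (pmd.huge)
		return TOY_PMD_TRANS_HUGE;
	return TOY_PMD_PTE_TABLE;
}
```

The ordering in `toy_pmd_classify()` reflects the point of the patch as described: code that used to ask only "is this pmd_trans_huge()?" now also has to consider the non-present, non-empty case.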
      
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84c3fc4e
    • Zi Yan's avatar
      mm: thp: enable thp migration in generic path · 616b8371
      Zi Yan authored
      Add thp migration's core code, including conversions between a PMD entry
      and a swap entry, setting PMD migration entry, removing PMD migration
      entry, and waiting on PMD migration entries.
      
      This patch makes it possible to support thp migration.  If you fail to
      allocate a destination page as a thp, you just split the source thp as
      we do now, and then enter the normal page migration.  If you succeed in
      allocating a destination thp, you enter thp migration.  Subsequent patches
      actually enable thp migration for each caller of page migration by
      allowing its get_new_page() callback to allocate thps.
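The fallback rule in the paragraph above can be summarised as a small decision function. This is a sketch under stated assumptions, not the kernel's migration code: `enum toy_migrate_path` and `toy_choose_path()` are hypothetical names, and the two booleans stand in for "the source page is a thp" and "the get_new_page() callback returned a thp".

```c
#include <stdbool.h>

/* Hypothetical labels for the three outcomes described in the text. */
enum toy_migrate_path {
	TOY_MIGRATE_BASE_PAGE,       /* ordinary page, normal page migration */
	TOY_MIGRATE_SPLIT_THEN_BASE, /* split the source thp, then normal migration */
	TOY_MIGRATE_THP,             /* migrate the thp as a whole */
};

static enum toy_migrate_path toy_choose_path(bool src_is_thp,
					     bool dst_thp_allocated)
{
	if (!src_is_thp)
		return TOY_MIGRATE_BASE_PAGE;
	/* Destination thp allocation failed: fall back to the old behaviour. */
	if (!dst_thp_allocated)
		return TOY_MIGRATE_SPLIT_THEN_BASE;
	return TOY_MIGRATE_THP;
}
```

In this model a caller that never allocates a thp destination always lands on the split path, which matches the statement that each caller has to be enabled separately.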
      
      [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
        Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
      
      
      [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      616b8371
    • Naoya Horiguchi's avatar
      mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION · 9c670ea3
      Naoya Horiguchi authored
      Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
      functionality to x86_64, which should be safer at the first step.
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com
      
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c670ea3
    • Naoya Horiguchi's avatar
      mm: thp: introduce separate TTU flag for thp freezing · b5ff8161
      Naoya Horiguchi authored
      TTU_MIGRATION is used to convert pte into migration entry until thp
      split completes.  This behavior conflicts with thp migration added in
      later patches, so let's introduce a new TTU flag specifically for freezing.
      
      try_to_unmap() is used both for thp split (via freeze_page()) and page
      migration (via __unmap_and_move()).  In freeze_page(), ttu_flag given
      for head page is like below (assuming anonymous thp):
      
          (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
           TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
      
      and ttu_flag given for tail pages is:
      
          (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
           TTU_MIGRATION)
      
      __unmap_and_move() calls try_to_unmap() with ttu_flag:
      
          (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
      
      Now I'm trying to insert a branch for thp migration at the top of
      try_to_unmap_one() like below
      
          static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                                      unsigned long address, void *arg)
          {
                  ...
                  /* PMD-mapped THP migration entry */
                  if (!pvmw.pte && (flags & TTU_MIGRATION)) {
                          if (!PageAnon(page))
                                  continue;
      
                          set_pmd_migration_entry(&pvmw, page);
                          continue;
                  }
                  ...
          }
      
      so try_to_unmap() for tail pages called by thp split can go into thp
      migration code path (which converts *pmd* into migration entry), while
      the expectation is to freeze thp (which converts *pte* into migration
      entry.)
      
      I detected this failure as a "bad page state" error in a testcase where
      split_huge_page() is called from queue_pages_pte_range().
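The flag conflict described above can be reproduced in a few lines of user-space C. This is a toy model only: the bit values are illustrative rather than the kernel's, `TOY_TTU_SPLIT_FREEZE` is a hypothetical name for the new freezing flag (the message does not name it), and `toy_takes_pmd_migration_path()` just restates the condition of the branch quoted in the message.

```c
#include <stdbool.h>

/* Illustrative bit values; the real TTU_* values are not given in the text. */
enum toy_ttu_flags {
	TOY_TTU_MIGRATION      = 1u << 0,
	TOY_TTU_SPLIT_HUGE_PMD = 1u << 1,
	TOY_TTU_IGNORE_MLOCK   = 1u << 2,
	TOY_TTU_IGNORE_ACCESS  = 1u << 3,
	TOY_TTU_RMAP_LOCKED    = 1u << 4,
	TOY_TTU_SPLIT_FREEZE   = 1u << 5, /* hypothetical name for the new flag */
};

/* Flags for tail pages during thp split, as listed in the message. */
static const unsigned int toy_tail_flags_old =
	TOY_TTU_IGNORE_MLOCK | TOY_TTU_IGNORE_ACCESS | TOY_TTU_RMAP_LOCKED |
	TOY_TTU_MIGRATION;

/* Assumed shape after the change: freezing no longer borrows TTU_MIGRATION. */
static const unsigned int toy_tail_flags_new =
	TOY_TTU_IGNORE_MLOCK | TOY_TTU_IGNORE_ACCESS | TOY_TTU_RMAP_LOCKED |
	TOY_TTU_SPLIT_FREEZE;

/* Flags used by __unmap_and_move(), as listed in the message. */
static const unsigned int toy_migrate_flags =
	TOY_TTU_MIGRATION | TOY_TTU_IGNORE_MLOCK | TOY_TTU_IGNORE_ACCESS;

/* Condition of the quoted branch: no pte was found (PMD-mapped) and the
 * caller asked for migration, so a PMD migration entry would be set. */
static bool toy_takes_pmd_migration_path(bool have_pte, unsigned int flags)
{
	return !have_pte && (flags & TOY_TTU_MIGRATION) != 0;
}
```

With the old tail-page flags the freeze path satisfies the same condition as real migration, which is the confusion the message reports; with a dedicated flag only the `__unmap_and_move()` flags reach the PMD branch.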
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-4-zi.yan@sent.com
      
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5ff8161
    • Naoya Horiguchi's avatar
      mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 · eee4818b
      Naoya Horiguchi authored
      _PAGE_PSE is used to distinguish between a truly non-present
      (_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and
      should be treated as present.
      
      But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would
      cause confusion between one of those PMDs undergoing a THP split, and a
      soft-dirty PMD.  Dropping _PAGE_PSE check in pmd_present() does not work
      well, because it can hurt optimization of tlb handling in thp split.
      
      Thus, we need to move the bit.
      
      In the current kernel, bits 1-4 are not used in non-present format since
      commit 00839ee3 ("x86/mm: Move swap offset/type up in PTE to work
      around erratum").  So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.  Bit 7
      is used as reserved (always clear), so please don't use it for any other
      purpose.
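The bit clash can be shown with a simplified model. Everything here is an assumption made for illustration: the `TOY_` macros only mirror the bit positions named in the message (bit 0 for present, bit 7 for PSE, bit 7 versus bit 1 for the swap soft-dirty flag), `toy_pmd_present()` is a reduced form of the check described in the text, and the position of the swap payload is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_PRESENT       (UINT64_C(1) << 0)
#define TOY_PAGE_PSE           (UINT64_C(1) << 7)

#define TOY_SWP_SOFT_DIRTY_OLD (UINT64_C(1) << 7) /* shares the PSE bit */
#define TOY_SWP_SOFT_DIRTY_NEW (UINT64_C(1) << 1) /* unused in non-present format */

/* Simplified check: a PMD counts as present if either bit is set, so a
 * PMD undergoing a THP split (PSE set, PRESENT clear) stays "present". */
static bool toy_pmd_present(uint64_t pmd)
{
	return (pmd & (TOY_PAGE_PRESENT | TOY_PAGE_PSE)) != 0;
}

/* Build a non-present, swap-style entry. The shift of 9 for the payload
 * is arbitrary; it only keeps the payload away from the low flag bits. */
static uint64_t toy_make_swap_pmd(uint64_t payload, uint64_t soft_dirty_bit)
{
	return (payload << 9) | soft_dirty_bit;
}
```

In this model a soft-dirty swap entry built with the old bit is indistinguishable from a PMD under split, while the same entry built with bit 1 is correctly seen as non-present.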
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-3-zi.yan@sent.com
      
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eee4818b
    • Naoya Horiguchi's avatar
      mm: mempolicy: add queue_pages_required() · 88aaa2a1
      Naoya Horiguchi authored
      Patch series "mm: page migration enhancement for thp", v9.
      
      Motivations:
      
      1. THP migration becomes important in the upcoming heterogeneous memory
         systems. As David Nellans from NVIDIA pointed out from other threads
         (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349227.html),
         future GPUs or other accelerators will have their memory managed by
         operating systems. Moving data into and out of these memory nodes
         efficiently is critical to applications that use GPUs or other
         accelerators. Existing page migration only supports base pages, which
         has a very low memory bandwidth utilization. My experiments (see
         below) show THP migration can migrate pages more efficiently.
      
      2. Base page migration vs THP migration throughput.
      
         Here are cross-socket page migration results from calling
         move_pages() syscall:
      
         In x86_64, an Intel two-socket E5-2640v3 box,
          - single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
          - single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
          - 512 4KB base page migration takes 1987.38 us, using 0.98 GB/s BW.
      
         In ppc64, a two-socket Power8 box,
          - single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
          - single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
          - 256 64KB base page migration takes 2543.65 us, using 6.14 GB/s BW.
      
         THP migration can give us 3x and 1.15x throughput over base page
         migration in x86_64 and ppc64 respectively.
      
         You can test it out by using the code here:
            https://github.com/x-y-z/thp-migration-bench
      
      3. Existing page migration splits THP before migration and cannot
         guarantee the migrated pages are still contiguous. Contiguity is
         always what GPUs and accelerators look for. Without THP migration,
         khugepaged needs to do extra work to reassemble the migrated pages
         back to THPs.
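The bandwidth and speedup figures quoted in item 2 can be rechecked from the raw timings. The helpers below are a small sketch with hypothetical names; they assume that "GB/s" in the message means 2^30 bytes per second, which is the interpretation under which the quoted numbers reproduce.

```c
/* Bandwidth in units of 2^30 bytes per second, from bytes moved and the
 * elapsed time in microseconds. */
static double toy_gib_per_sec(double bytes, double usec)
{
	return bytes / (usec * 1e-6) / (1024.0 * 1024.0 * 1024.0);
}

/* Speedup of THP migration over migrating the same amount of memory as
 * base pages: ratio of the two elapsed times. */
static double toy_speedup(double base_usec, double thp_usec)
{
	return base_usec / thp_usec;
}
```

For example, `toy_gib_per_sec(2.0 * 1024 * 1024, 658.54)` is about 2.97 for the x86_64 2MB case, and `toy_speedup(1987.38, 658.54)` and `toy_speedup(2543.65, 2202.17)` come out at roughly 3.02 and 1.155, consistent with the 3x and 1.15x quoted.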
      
      This patch (of 10):
      
      Introduce a separate check routine related to MPOL_MF_INVERT flag.  This
      patch just does cleanup, no behavioral change.
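As an illustration of what a separate check routine for MPOL_MF_INVERT might look like, here is a user-space sketch. It is an assumption about the shape of the helper, not a copy of it: the node mask is modelled as a plain `unsigned long`, `TOY_MPOL_MF_INVERT` has an illustrative value, and the semantics assumed are that the flag inverts the "is the page's node in the mask" test.

```c
#include <stdbool.h>

#define TOY_MPOL_MF_INVERT (1u << 0) /* illustrative value only */

/* Decide whether a page on node `nid` should be queued.
 * Without the invert flag: queue pages whose node IS in the mask.
 * With the invert flag:    queue pages whose node is NOT in the mask.
 * Assumes 0 <= nid < number of bits in unsigned long. */
static bool toy_queue_pages_required(int nid, unsigned long nodemask,
				     unsigned int flags)
{
	bool in_mask = ((nodemask >> nid) & 1UL) != 0;
	bool invert = (flags & TOY_MPOL_MF_INVERT) != 0;

	return in_mask != invert;
}
```

Pulling the test into one function means every call site applies the inversion the same way, which fits the description of the patch as a cleanup with no behavioral change.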
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-2-zi.yan@sent.com
      
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88aaa2a1
    • Linus Torvalds's avatar
      RDMA/netlink: clean up message validity array initializer · 015a9e66
      Linus Torvalds authored
      
      
      The fix in the parent made me look at that function, and react to how
      illogical and illegible the array initializer was.
      
      Use named array indexes to make it clearer what is going on, and make
      the initializer not depend silently on the exact index numbers.
      
      [ The initializer now also shows an odd inconsistency in the naming:
        note the IWCM vs IWPM..   - Linus ]
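For readers unfamiliar with the technique, named array indexes are C99 designated initializers. The sketch below uses made-up client names and op counts (everything prefixed `toy_`/`TOY_` is hypothetical and not the RDMA netlink definitions); it only shows why the style is more robust than a positional list.

```c
#include <stdbool.h>

/* Hypothetical client types; type 0 is deliberately not a valid client. */
enum toy_nl_client {
	TOY_NL_RDMA_CM = 1,
	TOY_NL_IWCM,
	TOY_NL_LS,
	TOY_NL_NUM_CLIENTS
};

/* Each slot is tied to its enum name, so reordering or renumbering the
 * enum cannot silently shift the values. Slots that are not named
 * (here index 0) are zero-initialized, i.e. "no valid ops". */
static const unsigned int toy_max_num_ops[TOY_NL_NUM_CLIENTS] = {
	[TOY_NL_RDMA_CM] = 4,
	[TOY_NL_IWCM]    = 7,
	[TOY_NL_LS]      = 3,
};

static bool toy_is_nl_msg_valid(unsigned int type, unsigned int op)
{
	if (type >= TOY_NL_NUM_CLIENTS)
		return false;
	return op < toy_max_num_ops[type];
}
```

Because the array is indexed directly by the client type, a message with type 0 reads a real slot holding zero and is rejected without any index arithmetic.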
      
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      015a9e66
    • Leon Romanovsky's avatar
      RDAM/netlink: Fix out-of-bound access while checking message validity · 8b2c7e7a
      Leon Romanovsky authored
      
      
      A netlink message sent with type == 0, which doesn't have any client
      behind it, caused an out-of-bounds access in the max_num_ops array.
      
      Fix it by declaring zero number of ops for the first client.
      
      Fixes: c9901724 ("RDMA/netlink: Remove netlink clients infrastructure")
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b2c7e7a
  2. Sep 08, 2017
    • Linus Torvalds's avatar
      Merge branch 'gperf-removal' · 5969d1bb
      Linus Torvalds authored
      Remove our use of 'gperf' for generating perfect hashes from some of our
      build tools.
      
      This removal was prompted by Masahiro Yamada sending out a patch that
      removes all our pre-generated files, and when I tested it, I noticed
      that the gperf version I have (3.1) apparently generates code that no
      longer works with our code-base because the function interfaces
      generated by gperf have changed.
      
      We really don't care that much, and the gperf people changed their
      interfaces in ways that make it annoying to work with them.  Tools that
      make it hard to use them should not be used, and the kernel is not at
      all interested in some autoconf mess.  So remove the gperf dependency
      entirely.
      
      It turns out that if you ignore the pre-generated files, the use of
      gperf apparently saved us a whopping fifteen lines of code.  It
      obviously wasn't worth it, considering that the pre-generated files are
      about 500 lines.
      
      I sent this out as a patch about three weeks ago, and got absolutely
      zero responses.  So let's see if anybody notices now that I merge it.
      Because there might be serious bugs here, but it WorksForMe(tm).
      
      * gperf-removal:
        Remove gperf usage from toolchain
      5969d1bb
    • Linus Torvalds's avatar
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 572c01ba
      Linus Torvalds authored
      Pull SCSI updates from James Bottomley:
       "This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas,
        megaraid_sas, zfcp and a host of minor updates.
      
        The major driver change here is the elimination of the block based
        cciss driver in favour of the SCSI based hpsa driver (which now drives
        all the legacy cases cciss used to be required for). Plus a reset
        handler clean up and the redo of the SAS SMP handler to use bsg lib"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits)
        scsi: scsi-mq: Always unprepare before requeuing a request
        scsi: Show .retries and .jiffies_at_alloc in debugfs
        scsi: Improve requeuing behavior
        scsi: Call scsi_initialize_rq() for filesystem requests
        scsi: qla2xxx: Reset the logo flag, after target re-login.
        scsi: qla2xxx: Fix slow mem alloc behind lock
        scsi: qla2xxx: Clear fc4f_nvme flag
        scsi: qla2xxx: add missing includes for qla_isr
        scsi: qla2xxx: Fix an integer overflow in sysfs code
        scsi: aacraid: report -ENOMEM to upper layer from aac_convert_sgraw2()
        scsi: aacraid: get rid of one level of indentation
        scsi: aacraid: fix indentation errors
        scsi: storvsc: fix memory leak on ring buffer busy
        scsi: scsi_transport_sas: switch to bsg-lib for SMP passthrough
        scsi: smartpqi: remove the smp_handler stub
        scsi: hpsa: remove the smp_handler stub
        scsi: bsg-lib: pass the release callback through bsg_setup_queue
        scsi: Rework handling of scsi_device.vpd_pg8[03]
        scsi: Rework the code for caching Vital Product Data (VPD)
        scsi: rcu: Introduce rcu_swap_protected()
        ...
      572c01ba
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk · cef5d0f9
      Linus Torvalds authored
      Pull printk updates from Petr Mladek:
      
       - Do not allow use of freed init data and code even when boot consoles
         are forced to stay. Also check for the init memory more precisely.
      
       - Some code cleanup by new contributors.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
        printk: Clean up do_syslog() error handling
        printk/console: Enhance the check for consoles using init memory
        printk/console: Always disable boot consoles that use init memory before it is freed
        printk: Modify operators of printed_len and text_len
      cef5d0f9
    • Linus Torvalds's avatar
      Merge tag 'audit-pr-20170907' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit · 0fb02e71
      Linus Torvalds authored
      Pull audit updates from Paul Moore:
       "A small pull request for audit this time, only four patches and only
        two with any real code changes.
      
        Those two changes are the removal of a pointless SELinux AVC
        initialization audit event and a fix to improve the audit timestamp
        overhead.
      
        The other two patches are comment cleanup and administrative updates,
        nothing very exciting.
      
        Everything passes our tests"
      
      * tag 'audit-pr-20170907' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
        audit: update the function comments
        selinux: remove AVC init audit log message
        audit: update the audit info in MAINTAINERS
        audit: Reduce overhead using a coarse clock
      0fb02e71
    • Linus Torvalds's avatar
      Merge tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 828f4257
      Linus Torvalds authored
      Pull secureexec update from Kees Cook:
       "This series has the ultimate goal of providing a sane stack rlimit
        when running set*id processes.
      
        To do this, the bprm_secureexec LSM hook is collapsed into the
        bprm_set_creds hook so the secureexec-ness of an exec can be
        determined early enough to make decisions about rlimits and the
        resulting memory layouts. Other logic acting on the secureexec-ness of
        an exec is similarly consolidated. Capabilities needed some special
        handling, but the refactoring removed other special handling, so that
        was a wash"
      
      * tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        exec: Consolidate pdeath_signal clearing
        exec: Use sane stack rlimit under secureexec
        exec: Consolidate dumpability logic
        smack: Remove redundant pdeath_signal clearing
        exec: Use secureexec for clearing pdeath_signal
        exec: Use secureexec for setting dumpability
        LSM: drop bprm_secureexec hook
        commoncap: Move cap_elevated calculation into bprm_set_creds
        commoncap: Refactor to remove bprm_secureexec hook
        smack: Refactor to remove bprm_secureexec hook
        selinux: Refactor to remove bprm_secureexec hook
        apparmor: Refactor to remove bprm_secureexec hook
        binfmt: Introduce secureexec flag
        exec: Correct comments about "point of no return"
        exec: Rename bprm->cred_prepared to called_set_creds
      828f4257
    • Linus Torvalds's avatar
      Merge tag 'gcc-plugins-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 44ccba3f
      Linus Torvalds authored
      Pull gcc plugins update from Kees Cook:
       "This finishes the porting work on randstruct, and introduces a new
        option to structleak, both noted below:
      
         - For the randstruct plugin, enable automatic randomization of
           structures that are entirely function pointers (along with a couple
           designated initializer fixes).
      
         - For the structleak plugin, provide an option to perform zeroing
           initialization of all otherwise uninitialized stack variables that
           are passed by reference (Ard Biesheuvel)"
      
      * tag 'gcc-plugins-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        gcc-plugins: structleak: add option to init all vars used as byref args
        randstruct: Enable function pointer struct detection
        drivers/net/wan/z85230.c: Use designated initializers
        drm/amd/powerplay: rv: Use designated initializers
      44ccba3f
    • Linus Torvalds's avatar
      Merge tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 21d236bf
      Linus Torvalds authored
      Pull pstore update from Kees Cook:
       "Make pstore permissions more versatile by removing CAP_SYSLOG
        requirement and defining more restrictive root directory DAC
        permissions default (0750, which can be adjusted after boot unlike the
        CAP_SYSLOG check).
      
        Suggested by Nick Kralevich"
      
      * tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"
        pstore: Make default pstorefs root dir perms 0750
    • Merge tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6 · 8dc5b3a6
      Linus Torvalds authored
      Pull cifs update from Steve French:
       "Enable xattr support for smb3 and also a bugfix"
      
      * tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: Check for timeout on Negotiate stage
        cifs: Add support for writing attributes on SMB2+
        cifs: Add support for reading attributes on SMB2+
    • Merge git://git.kvack.org/~bcrl/aio-next · 2500e287
      Linus Torvalds authored
      Pull aio fix from Ben LaHaise:
       "Improve aio-nr counting on large SMP systems.
      
        It has been in linux-next for quite some time"
      
      * git://git.kvack.org/~bcrl/aio-next:
        fs: aio: fix the increment of aio-nr and counting against aio-max-nr
    • Merge branch 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · ae8ac6b7
      Linus Torvalds authored
      Pull quota scaling updates from Jan Kara:
       "This contains changes to make the quota subsystem more scalable.
      
        Reportedly it improves the number of files created per second on an
        ext4 filesystem on fast storage by about a factor of 2"
      
      * 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
        quota: Add lock annotations to struct members
        quota: Reduce contention on dq_data_lock
        fs: Provide __inode_get_bytes()
        quota: Inline dquot_[re]claim_reserved_space() into callsite
        quota: Inline inode_{incr,decr}_space() into callsites
        quota: Inline functions into their callsites
        ext4: Disable dirty list tracking of dquots when journalling quotas
        quota: Allow disabling tracking of dirty dquots in a list
        quota: Remove dq_wait_unused from dquot
        quota: Move locking into clear_dquot_dirty()
        quota: Do not dirty bad dquots
        quota: Fix possible corruption of dqi_flags
        quota: Propagate ->quota_read errors from v2_read_file_info()
        quota: Fix error codes in v2_read_file_info()
        quota: Push dqio_sem down to ->read_file_info()
        quota: Push dqio_sem down to ->write_file_info()
        quota: Push dqio_sem down to ->get_next_id()
        quota: Push dqio_sem down to ->release_dqblk()
        quota: Remove locking for writing to the old quota format
        quota: Do not acquire dqio_sem for dquot overwrites in v2 format
        ...