  1. Jul 27, 2016
    • page-flags: relax policy for PG_mappedtodisk and PG_reclaim · e2f0a0db
      Kirill A. Shutemov authored
      
      
      These flags are in use for file THP.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-23-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: split file huge pages before paging them out · 7751b2da
      Kirill A. Shutemov authored
      
      
      This is preparation of vmscan for file huge pages.  We cannot write out
      huge pages, so we need to split them on the way out.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-22-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, mlock: do not mlock PTE-mapped file huge pages · 9a73f61b
      Kirill A. Shutemov authored
      
      
      As with anon THP, we only mlock file huge pages if we can prove that the
      page is not mapped with PTEs.  This way we can avoid an mlock leak into a
      non-mlocked VMA on split.

      We rely on PageDoubleMap() under lock_page() to check whether the page
      may be PTE-mapped.  PG_double_map is set by page_add_file_rmap() when
      the page is mapped with PTEs.
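
      A minimal sketch of the check this relies on (hedged: the helpers are
      real kernel symbols, but the hunk is illustrative, not the exact diff):

        if (PageTransCompound(page)) {
                lock_page(page);
                if (!PageDoubleMap(page))       /* no PTE mappings of the THP */
                        mlock_vma_page(page);   /* safe to mlock the whole page */
                unlock_page(page);
        }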
      
      Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: file pages support for split_huge_page() · baa355fd
      Kirill A. Shutemov authored
      
      
      Basic scheme is the same as for anon THP.
      
      Main differences:
      
        - File pages are in the radix tree, so head->_count is offset by
          HPAGE_PMD_NR. The count gets distributed to the small pages during
          split.

        - mapping->tree_lock prevents non-lockless access to pages under split
          over the radix tree;

        - Lockless access is prevented by setting head->_count to 0 during
          split;

        - After split, some pages can be beyond i_size. We drop them from the
          radix tree.

        - We don't set up migration entries; we just unmap the pages. This
          helps handle the case when i_size falls in the middle of a huge
          page: there is no need to handle pages beyond i_size manually.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: run vma_adjust_trans_huge() outside i_mmap_rwsem · 37f9f559
      Kirill A. Shutemov authored
      
      
      vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
      During the split we munlock the huge page, which requires an rmap walk.
      rmap wants to take i_mmap_rwsem on its own.
      
      Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: prepare change_huge_pmd() for file thp · b237aded
      Kirill A. Shutemov authored
      
      
      change_huge_pmd() has an assert which is not relevant for file pages.
      For a shared mapping it's perfectly fine to have the page table entry
      writable, without an explicit mkwrite.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: skip file huge pmd on copy_huge_pmd() · 628d47ce
      Kirill A. Shutemov authored
      
      
      copy_page_range() has a check for "Don't copy ptes where a page fault
      will fill them correctly."  It works at the VMA level.  We still copy
      all page table entries from private mappings, even if they map page
      cache.

      We can simplify copy_huge_pmd() a bit by skipping file PMDs.

      We don't map file private pages with PMDs, so they can only map page
      cache.  It's safe to skip them as they can be re-faulted later.
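
      In code terms that is roughly a one-line skip at the top of
      copy_huge_pmd() (hedged sketch, not necessarily the exact hunk):

        if (!vma_is_anonymous(vma))
                return 0;       /* file PMDs are re-faulted from page cache */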
      
      Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: handle file COW faults · af9e4d5f
      Kirill A. Shutemov authored
      
      
      File COW for THP is handled at the pte level: just split the pmd.

      It's not clear how beneficial allocating huge pages on COW faults would
      be, and it would require some code to make them work.

      I think at some point we can consider teaching khugepaged to collapse
      pages in COW mappings, but allocating huge pages on fault is probably
      overkill.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: handle file pages in split_huge_pmd() · d21b9e57
      Kirill A. Shutemov authored
      
      
      Splitting a THP PMD is simple: just unmap it, as in the DAX case.  This
      way we can avoid the memory overhead of allocating a page table to
      deposit.

      It's probably a good idea to try to allocate a page table with
      GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area,
      but clearing the pmd should be good enough for now.

      Unlike DAX, we also remove the page from rmap and drop the reference.
      pmd_young() is transferred to PageReferenced().
      
      Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: support file pages in zap_huge_pmd() · b5072380
      Kirill A. Shutemov authored
      
      
      split_huge_pmd() for file mappings (and DAX too) is implemented by just
      clearing the pmd entry, as we can re-fill this area from page cache at
      the pte level later.

      This means we don't need to deposit page tables when file THP is
      mapped.  Therefore we shouldn't try to withdraw a page table when
      zap_huge_pmd() runs into a file THP PMD.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, vmstats: add counters for huge file pages · 95ecedcd
      Kirill A. Shutemov authored
      
      
      THP_FILE_ALLOC: how many times a huge page was allocated and put into
      page cache.

      THP_FILE_MAPPED: how many times a file huge page was mapped.
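
      The counters are bumped with the usual vmstat hook at the corresponding
      sites, roughly (hedged sketch):

        count_vm_event(THP_FILE_ALLOC);   /* huge page added to page cache */
        count_vm_event(THP_FILE_MAPPED);  /* file huge page mapped with a PMD */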
      
      Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce do_set_pmd() · 10102459
      Kirill A. Shutemov authored
      
      
      With postponed page table allocation we have a chance to set up huge
      pages.  do_set_pte() calls do_set_pmd() if the following criteria are
      met (a minimal sketch of the check follows the list):

       - the page is compound;
       - the pmd entry is pmd_none();
       - the vma has suitable size and alignment;
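
      Hedged sketch of those checks on the fault path; vma_suits_huge_pmd()
      is a hypothetical stand-in for the size/alignment test:

        if (PageTransCompound(page) && pmd_none(*fe->pmd) &&
            vma_suits_huge_pmd(fe->vma, fe->address))
                return do_set_pmd(fe, page);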
      
      Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rmap: support file thp · dd78fedd
      Kirill A. Shutemov authored
      
      
      Naive approach: on mapping/unmapping the page as compound we update
      ->_mapcount on each 4k page.  That's not efficient, but it's not obvious
      how we can optimize this.  We can look into optimization later.
      
      The PG_double_map optimization doesn't work for file pages since the
      lifecycle of file pages differs from that of anon pages: a file page
      can be mapped again at any time.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: postpone page table allocation until we have page to map · 7267ec00
      Kirill A. Shutemov authored
      
      
      The idea (and most of the code) is borrowed again: from Hugh's patchset
      on huge tmpfs[1].

      Instead of allocating the pte page table upfront, we postpone this
      until we have a page to map in hand.  This approach opens the
      possibility to map the page as huge if the filesystem supports it.

      Compared to Hugh's patch, I've pushed page table allocation a bit
      further: into do_set_pte().  This way we can postpone allocation even
      in the faultaround case without moving do_fault_around() after
      __do_fault().

      do_set_pte() got renamed to alloc_set_pte() as it can allocate a page
      table if required.
      
      [1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils
      
      Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce fault_env · bae473a4
      Kirill A. Shutemov authored
      
      
      The idea is borrowed from Peter's patch from the patchset on
      speculative page faults[1]:
      
      Instead of passing around the endless list of function arguments,
      replace the lot with a single structure so we can change context without
      endless function signature changes.
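
      The structure introduced here looks roughly like the following (the
      exact field list is recalled from the v4.8-era code, so treat it as an
      approximation):

        struct fault_env {
                struct vm_area_struct *vma;     /* target VMA */
                unsigned long address;          /* faulting virtual address */
                unsigned int flags;             /* FAULT_FLAG_xxx flags */
                pmd_t *pmd;                     /* pmd entry for the address */
                pte_t *pte;                     /* pte entry matching address */
                spinlock_t *ptl;                /* page table lock */
                pgtable_t prealloc_pte;         /* pre-allocated pte table */
        };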
      
      The changes are mostly mechanical with exception of faultaround code:
      filemap_map_pages() got reworked a bit.
      
      This patch is preparation for the next one.
      
      [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org
      
      Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not pass mm_struct into handle_mm_fault · dcddffd4
      Kirill A. Shutemov authored
      
      
      We always have vma->vm_mm around.
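
      The resulting interface change for callers, as a hedged before/after
      (from the changelog, not the full diff):

        /* before */
        ret = handle_mm_fault(mm, vma, address, flags);

        /* after: mm is taken from vma->vm_mm internally */
        ret = handle_mm_fault(vma, address, flags);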
      
      Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, mlock: update unevictable-lru.txt · 6fb8ddfc
      Kirill A. Shutemov authored
      
      
      Add description of THP handling into unevictable-lru.txt.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-7-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • khugepaged: recheck pmd after mmap_sem re-acquired · 1f52e67e
      Kirill A. Shutemov authored
      
      
      Vlastimil noted[1] that the pmd may no longer be valid after we drop
      mmap_sem.  We need to recheck it once mmap_sem is taken again.
      
      [1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz
      
      Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: fix locking inconsistency in collapse_huge_page · 8024ee2a
      Ebru Akagunduz authored
      
      
      After creating the vma revalidate function, a locking inconsistency
      occurred because the code path was directed to the wrong label.  This
      patch directs it to the correct label and fixes the inconsistency.
      
      Related commit that caused inconsistency:
       http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0
      
      Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com
      Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: make swapin readahead under down_read of mmap_sem · 72695862
      Ebru Akagunduz authored
      
      
      Currently khugepaged does swapin readahead under down_write.  This
      patch makes swapin readahead happen under down_read instead of
      down_write.

      The patch was tested with a test program that allocates 800MB of
      memory, writes to it, and then sleeps.  The system was forced to swap
      it all out.  Afterwards, the test program touches the area by writing
      to it, skipping one page in every 20 pages of the area.
      
      [akpm@linux-foundation.org: update comment to match new code]
      [kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revalidate() is useless]
        Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com
        Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com
      Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com
      Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make swapin readahead to improve thp collapse rate · 8a966ed7
      Ebru Akagunduz authored
      
      
      This patch adds swapin readahead to improve the THP collapse rate.
      When khugepaged scans pages, a few of them can be in the swap area.

      With the patch, khugepaged can collapse 4kB pages into a THP when there
      are up to max_ptes_swap swap ptes in a 2MB range.

      The patch was tested with a test program that allocates 400B of memory,
      writes to it, and then sleeps.  The system was forced to swap it all
      out.  Afterwards, the test program touches the area by writing to it,
      skipping one page in every 20 pages of the area.

      Without the patch, the system did not do swapin readahead.  The THP
      rate was 65% of the program's memory, and it did not change over time.

      With this patch, after 10 minutes of waiting, khugepaged had collapsed
      99% of the program's memory.
      
      [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
      [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
      [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make optimistic check for swapin readahead · 70652f6e
      Ebru Akagunduz authored
      
      
      Introduce a new sysfs integer knob
      /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap which adds
      an optimistic check for swapin readahead to increase the THP collapse
      rate.  Before bringing swapped-out pages back to memory, khugepaged
      checks them and allows up to a certain number.  It also reports the
      amount of unmapped ptes via tracepoints.
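
      Hedged sketch of the scan-time check the knob controls; the counter and
      label names follow the changelog, not necessarily the exact hunk:

        if (is_swap_pte(pteval)) {
                if (++unmapped > khugepaged_max_ptes_swap) {
                        result = SCAN_EXCEED_SWAP_PTE;
                        goto out_unmap;
                }
                continue;
        }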
      
      [vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE]
      [sfr@canb.auug.org.au: build fix]
        Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c:memblock_add_range(): if nr_new is 0 just return · ef3cc4db
      nimisolo authored
      
      
      If nr_new is 0, no region would be added, so just return to the
      caller.
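
      A minimal sketch of the early return in memblock_add_range(), placed
      after the first pass that only counts the regions to be added (hedged):

        if (!nr_new)
                return 0;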
      
      Signed-off-by: nimisolo <nimisolo@gmail.com>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: use consistent gfp flags during readahead · 8a5c743e
      Michal Hocko authored
      
      
      Vladimir has noticed that we might declare memcg oom even during
      readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
      restriction) while __do_page_cache_readahead uses
      page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
      OOMs.  This gfp mask discrepancy is really unfortunate and easily
      fixable.  Drop page_cache_alloc_readahead() which only has one user and
      outsource the gfp_mask logic into readahead_gfp_mask and propagate this
      mask from __do_page_cache_readahead down to read_pages.
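
      The helper ends up looking roughly like this (hedged; the flag set is
      recalled from the v4.8-era pagemap.h rather than quoted from the
      patch):

        static inline gfp_t readahead_gfp_mask(struct address_space *x)
        {
                return mapping_gfp_mask(x) |
                       __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
        }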
      
      This alone would have only very limited impact as most filesystems are
      implementing ->readpages and the common implementation mpage_readpages
      does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
      use readahead_gfp_mask instead as this function is called only during
      readahead as well.  The same applies to read_cache_pages.
      
      ext4 has its own ext4_mpage_readpages but the path which has pages !=
      NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
      doing a very similar pattern to mpage_readpages so the same can be
      applied to them as well.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@suse.com: restrict gfp mask in mpage_alloc]
        Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Changman Lee <cm224.lee@samsung.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom_reaper: make sure that mmput_async is called only when memory was reaped · e5e3f4c4
      Michal Hocko authored
      
      
      Tetsuo is worried that mmput_async might still lead to a premature new
      oom victim selection due to the following race:
      
      __oom_reap_task				exit_mm
        find_lock_task_mm
        atomic_inc(mm->mm_users) # = 2
        task_unlock
        					  task_lock
      					  task->mm = NULL
      					  up_read(&mm->mmap_sem)
      		< somebody write locks mmap_sem >
      					  task_unlock
      					  mmput
        					    atomic_dec_and_test # = 1
      					  exit_oom_victim
        down_read_trylock # failed - no reclaim
        mmput_async # Takes unpredictable amount of time
        		< new OOM situation >
      
      the final __mmput will be executed in the delayed context which might
      happen far in the future.  Such a race is highly unlikely because the
      write holder of mmap_sem would have to be an external task (all direct
      holders are already killed or exiting) and it usually has to pin
      mm_users in order to do anything reasonable.
      
      We can, however, make sure that the mmput_async is only called when we
      do not back off and reap some memory.  That would reduce the impact of
      the delayed __mmput because the real content would be already freed.
      Pin mm_count to keep it alive after we drop task_lock and before we try
      to get mmap_sem.  If the mmap_sem trylock succeeds, we can try to grab
      an mm_users reference and then go on with unmapping the address space.
      
      It is not clear whether this race is possible at all, but it is better
      to be more robust and not pin mm_users unless we are sure we are
      actually doing some real work during __oom_reap_task.
      
      Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/memblock.h: Clean up code for several trivial details · ba6c19fd
      Chen Gang authored
      
      
      Correct the function parameter alignment, since the original code
      already uses both tabs and white spaces together for the functions with
      incorrect parameter alignment.

      If one line can hold one statement within 80 columns, keep it on one
      line (the original code did not account for the tabs/spaces on the 2nd
      line when a statement is split across 2 lines).

      Try to keep the '\' line continuations aligned within one macro, since
      all related lines are short enough.

      Remove the useless statement "idx = 0;", and always assign rgn within
      the 'for' statement.
      
      Link: http://lkml.kernel.org/r/1464904899-1714-1-git-send-email-chengang@emindsoft.com.cn
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add NR_ZSMALLOC to vmstat · 91537fee
      Minchan Kim authored
      
      
      zram is very popular in parts of the embedded world (e.g., TVs, mobile
      phones).  On those systems, zsmalloc's consumed memory size is never
      trivial (one example from a real product system: total memory 800M,
      zsmalloc consumed 150M), so we have used this out-of-tree patch to
      monitor system memory behavior via /proc/vmstat.

      With zsmalloc in vmstat, it helps track down system behavior related to
      memory usage.
      
      [minchan@kernel.org: zsmalloc: follow up zsmalloc vmstat]
        Link: http://lkml.kernel.org/r/20160607091737.GC23435@bbox
      [akpm@linux-foundation.org: fix build with CONFIG_ZSMALLOC=m]
      Link: http://lkml.kernel.org/r/1464919731-13255-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Chanho Min <chanho.min@lge.com>
      Cc: Chan Gyun Jeong <chan.jeong@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, frontswap: convert frontswap_enabled to static key · 8ea1d2a1
      Vlastimil Babka authored
      
      
      I have noticed that frontswap.h first declares "frontswap_enabled" as
      extern bool variable, and then overrides it with "#define
      frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y or (0) when disabled.  The
      bool variable isn't actually instantiated anywhere.
      
      This all looks like an unfinished attempt to make frontswap_enabled
      reflect whether a backend is instantiated.  But in the current state,
      all frontswap hooks call unconditionally into frontswap.c just to check
      if frontswap_ops is non-NULL.  This should at least be checked inline,
      but we can further eliminate the overhead when CONFIG_FRONTSWAP is
      enabled and no backend registered, using a static key that is initially
      disabled, and gets enabled only upon first backend registration.
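
      A hedged sketch of the static-key wrapper pattern this converts to
      (close to, but not quoted from, the resulting frontswap.h):

        extern struct static_key_false frontswap_enabled_key;

        static inline bool frontswap_enabled(void)
        {
                return static_branch_unlikely(&frontswap_enabled_key);
        }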
      
      Thus, checks for "frontswap_enabled" are replaced with
      "frontswap_enabled()" wrapping the static key check.  There are two
      exceptions:
      
      - xen's selfballoon_process() was testing frontswap_enabled in code guarded
        by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable.
        The patch just removes this check. Using frontswap_enabled() does not sound
        correct here, as this can be true even without xen's own backend being
        registered.
      
      - in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP)
        as it seems the bitmap allocation cannot currently be postponed until a
        backend is registered. This means that frontswap will still have some
        memory overhead by being configured, but without a backend.
      
      After the patch, we can expect that some functions in frontswap.c are
      called only when frontswap_ops is non-NULL.  Change the checks there to
      VM_BUG_ONs.  While at it, convert other BUG_ONs to VM_BUG_ONs as
      frontswap has been stable for some time.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,oom: remove unused argument from oom_scan_process_thread(). · fbe84a09
      Tetsuo Handa authored
      
      
      oom_scan_process_thread() does not use the totalpages argument.
      oom_badness() uses it.
      
      Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • af_unix: charge buffers to kmemcg · 3aa9799e
      Vladimir Davydov authored
      
      
      Unix sockets can consume a significant amount of system memory, hence
      they should be accounted to kmemcg.
      
      Since unix socket buffers are always allocated from process context, all
      we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
      sock->sk_allocation mask.
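
      In code terms that is roughly a one-liner in the unix socket setup path
      (hedged sketch, not the exact hunk):

        sk->sk_allocation |= __GFP_ACCOUNT;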
      
      Eric asked:
      
      > 1) What happens when a buffer, allocated from socket <A> lands in a
      > different socket <B>, maybe owned by another user/process.
      >
      > Who owns it now, in term of kmemcg accounting ?
      
      We never move memcg charges.  E.g.  if two processes from different
      cgroups are sharing a memory region, each page will be charged to the
      process which touched it first.  Or if two processes are working with
      the same directory tree, inodes and dentries will be charged to the
      first user.  The same is fair for unix socket buffers - they will be
      charged to the sender.
      
      > 2) Has performance impact been evaluated ?
      
      I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
      x2 HT box.  The results are below:
      
       # clients            bandwidth (10^6bits/sec)
                          base              patched
               1      67643 +-  725      64874 +-  353    - 4.0 %
               4     193585 +- 2516     186715 +- 1460    - 3.5 %
               8     194820 +-  377     187443 +- 1229    - 3.7 %
      
      So the accounting doesn't come for free - it takes ~4% of performance.
      I believe we could optimize it by using per cpu batching not only on
      charge, but also on uncharge in memcg core, but that's beyond the scope
      of this patch set - I'll take a look at this later.
      
      Anyway, if performance impact is found to be unacceptable, it is always
      possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
      or not use memory cgroups at runtime at all (thanks to jump labels
      there'll be no overhead even if they are compiled in).
      
      Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pipe: account to kmemcg · d86133bd
      Vladimir Davydov authored
      
      
      Pipes can consume a significant amount of system memory, hence they
      should be accounted to kmemcg.
      
      This patch marks pipe_inode_info and anonymous pipe buffer page
      allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
      Note, since a pipe buffer page can be "stolen" and get reused for other
      purposes, including mapping to userspace, we clear PageKmemcg thus
      resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
      is introduced by this patch.
      
      A note regarding anon_pipe_buf_steal implementation.  We allow to steal
      the page if its ref count equals 1.  It looks racy, but it is correct
      for anonymous pipe buffer pages, because:
      
       - We lock out all other pipe users, because ->steal is called with
         pipe_lock held, so the page can't be spliced to another pipe from
         under us.
      
       - The page is not on LRU and it never was.
      
       - Thus a parallel thread can access it only by PFN. Although this is
         quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
         this is not dangerous, because all such functions do is increase page
         ref count, check if the page is the one they are looking for, and
         decrease ref count if it isn't. Since our page is clean except for
         PageKmemcg mark, which doesn't conflict with other _mapcount users,
         the worst that can happen is we see page_count > 2 due to a transient
         ref, in which case we false-positively abort ->steal, which is still
         fine, because ->steal is not guaranteed to succeed.
      
      Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch: x86: charge page tables to kmemcg · 3e79ec7d
      Vladimir Davydov authored
      
      
      Page tables can bite a relatively big chunk off system memory and their
      allocations are easy to trigger from userspace, so they should be
      accounted to kmemcg.
      
      This patch marks page table allocations as __GFP_ACCOUNT for x86.  Note
      we must not charge allocations of kernel page tables, because they can
      be shared among processes from different cgroups so accounting them to a
      particular one can pin other cgroups for indefinitely long.  So we clear
      __GFP_ACCOUNT flag if a page table is allocated for the kernel.
      
      Link: http://lkml.kernel.org/r/7d5c54f6a2bcbe76f03171689440003d87e6c742.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: teach uncharge_list to deal with kmem pages · 5e8d35f8
      Vladimir Davydov authored
      
      
      Page table pages are batched-freed in release_pages on most
      architectures.  If we want to charge them to kmemcg (this is what is
      done later in this series), we need to teach mem_cgroup_uncharge_list to
      handle kmem pages.
      
      Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: charge/uncharge kmemcg from generic page allocator paths · 4949148a
      Vladimir Davydov authored
      
      
      Currently, to charge a non-slab allocation to kmemcg one has to use
      alloc_kmem_pages helper with __GFP_ACCOUNT flag.  A page allocated with
      this helper should finally be freed using free_kmem_pages, otherwise it
      won't be uncharged.
      
      This API suits its current users fine, but it turns out to be impossible
      to use along with page reference counting, i.e.  when an allocation is
      supposed to be freed with put_page, as it is the case with pipe or unix
      socket buffers.
      
      To overcome this limitation, this patch moves charging/uncharging to
      generic page allocator paths, i.e.  to __alloc_pages_nodemask and
      free_pages_prepare, and zaps alloc/free_kmem_pages helpers.  This way,
      one can use any of the available page allocation functions to get the
      allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
      just like in case of kmalloc and friends.  A charged page will be
      automatically uncharged on free.
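
      Hedged usage sketch of the resulting model: any page allocation can opt
      into kmemcg accounting through the gfp mask, and the charge is dropped
      on the normal free path.

        struct page *page = alloc_pages(GFP_KERNEL | __GFP_ACCOUNT, order);

        if (page) {
                /* ... use the memory; it is charged to the current memcg ... */
                __free_pages(page, order);      /* uncharged automatically */
        }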
      
      To make it possible, we need to mark pages charged to kmemcg somehow.
      To avoid introducing a new page flag, we make use of page->_mapcount for
      marking such pages.  Since pages charged to kmemcg are not supposed to
      be mapped to userspace, it should work just fine.  There are other
      (ab)users of page->_mapcount - buddy and balloon pages - but we don't
      conflict with them.
      
      In case kmemcg is compiled out or not used at runtime, this patch
      introduces no overhead to generic page allocator paths.  If kmemcg is
      used, it will be plus one gfp flags check on alloc and plus one
      page->_mapcount check on free, which shouldn't hurt performance, because
      the data accessed are hot.
      
      Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: cleanup kmem charge functions · 45264778
      Vladimir Davydov authored
      
      
       - Move the memcg_kmem_enabled() check out to the caller.  This reduces
         the number of function definitions, making the code easier to
         follow.  At the same time it doesn't result in code bloat, because
         all of these functions are used only in one or two places.
      
       - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
         have to dive deep into memcg implementation to see which allocations
         are charged and which are not.
      
       - Refresh comments.
      
      Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clean up non-standard page->_mapcount users · 632c0a1a
      Vladimir Davydov authored
      
      
       - Add a proper comment to page->_mapcount.
      
       - Introduce a macro for generating helper functions.
      
       - Place all special page->_mapcount values next to each other so that
         readers can see all possible values and so we don't get duplicates.
      
      Link: http://lkml.kernel.org/r/502f49000e0b63e6c62e338fac6b420bf34fb526.1464079537.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove pointless struct in struct page definition · 99691add
      Vladimir Davydov authored
      
      
      This patchset implements per kmemcg accounting of page tables
      (x86-only), pipe buffers, and unix socket buffers.
      
      Patches 1-3 are just cleanups that are not supposed to introduce any
      functional changes.  Patches 4 and 5 move charge/uncharge to generic
      page allocator paths for the sake of accounting pipe and unix socket
      buffers.  Patches 6-8 make x86 page tables, pipe buffers, and unix
      socket buffers accountable.
      
      This patch (of 8):
      
      ... to reduce indentation level thus leaving more space for comments.
      
      Link: http://lkml.kernel.org/r/f34ffe70fce2b0b9220856437f77972d67c14275.1464079537.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_gather: track page size with mmu gather and force flush if page size change · e77b0852
      Aneesh Kumar K.V authored
      
      
      This allows an arch which needs special handling with respect to
      different page sizes when flushing the TLB to implement the same in the
      mmu gather.
      
      Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change the interface for __tlb_remove_page() · e9d55e15
      Aneesh Kumar K.V authored
      
      
      This updates the generic and arch-specific implementations to return
      true if we need to do a TLB flush.  That means if __tlb_remove_page()
      indicates a flush is needed, the page we are trying to remove needs to
      be tracked and added again after the flush.  We need to track it
      because we have already updated the pte to none and we can't just loop
      back.

      This change is done to enable us to do a tlb_flush when we try to flush
      a range that consists of different page sizes.  For architectures like
      ppc64, we can do a range-based TLB flush and we need to track the page
      size for that.  When we try to remove a huge page, we will force a TLB
      flush and start a new mmu gather.
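
      Hedged sketch of the new calling convention as described above: a
      'true' return means the page was not queued, so the caller flushes and
      then hands the page to the gather again.

        if (__tlb_remove_page(tlb, page)) {
                tlb_flush_mmu(tlb);
                __tlb_remove_page(tlb, page);   /* re-add after the flush */
        }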
      
      [aneesh.kumar@linux.vnet.ibm.com: mm-change-the-interface-for-__tlb_remove_page-v3]
        Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: simplify hugetlb unmap · 31d49da5
      Aneesh Kumar K.V authored
      
      
      For hugetlb, as with THP (and unlike regular pages), we do the TLB
      flush after dropping the ptl.  Because of that, we don't need to track
      force_flush like we do now.  Instead we can simply call
      tlb_remove_page(), which will do the flush if needed.
      
      No functionality change in this patch.
      
      Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>