  1. Sep 07, 2017
    • userfaultfd: provide pid in userfault msg - add feat union · a36985d3
      Andrea Arcangeli authored
      
      
      No ABI change, but this makes it more explicit to software that ptid
      is only available if requested by passing UFFD_FEATURE_THREAD_ID to
      UFFDIO_API.  The fact that it's a union also self-documents that the
      presence of a ptid should not be taken for granted.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a36985d3
    • userfaultfd: provide pid in userfault msg · 9d4ac934
      Alexey Perevalov authored
      
      
      This could be useful for calculating per-vCPU downtime during
      postcopy live migration.  A side observer, or the application itself,
      is informed about the task's sleep during userfaultfd processing.

      The process's thread id is provided when the user requests it by
      setting the UFFD_FEATURE_THREAD_ID bit in uffdio_api.features.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
      Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d4ac934
    • userfaultfd: call userfaultfd_unmap_prep only if __split_vma succeeds · 2376dd7c
      Andrea Arcangeli authored
      
      
      A __split_vma is not a worthy event to report, and it's definitely
      not an unmap, so it would be incorrect to report an unmap for the
      whole region to the userfaultfd manager if a __split_vma fails.

      So only call userfaultfd_unmap_prep after the vma splitting is over
      and do_munmap cannot fail anymore.

      Also add unlikely because it's better to optimize for the vast
      majority of apps that aren't using userfaultfd in a non-cooperative
      way.  Ideally we should also find a way to eliminate the branch
      entirely if CONFIG_USERFAULTFD=n, but that would complicate things,
      so stick to unlikely for now.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-5-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2376dd7c
    • userfaultfd: selftest: explicit failure if the SIGBUS test failed · d312cb1e
      Andrea Arcangeli authored
      
      
      Showing zero in the output isn't very self-explanatory as a
      successful result.  Show a more explicit error output if the test
      fails.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-4-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d312cb1e
    • userfaultfd: selftest: exercise UFFDIO_COPY/ZEROPAGE -EEXIST · 67e80328
      Andrea Arcangeli authored
      
      
      This will retry the UFFDIO_COPY/ZEROPAGE to verify it returns -EEXIST at
      the first invocation and then later every 10 seconds.
      
      In the filebacked MAP_SHARED case this also verifies the -EEXIST
      triggered in the filesystem pagecache insertion, if the offset in the
      file was not a hole.
      
      shmem MAP_SHARED tries to index the newly allocated pagecache in the
      radix tree before checking the pagetable so it doesn't need any
      assistance to exercise that case.
      
      hugetlbfs checks that the pmd is not none before trying to index the
      hugetlbfs page in the radix tree, so it requires running UFFDIO_COPY
      into an alias mapping (the alternative would be to use MADV_DONTNEED
      to only zap the pagetables, but that doesn't work on hugetlbfs).
      
      [akpm@linux-foundation.org: fix uffdio_zeropage(), per Mike Kravetz]
      Link: http://lkml.kernel.org/r/20170802165145.22628-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67e80328
    • userfaultfd: selftest: add tests for UFFD_FEATURE_SIGBUS feature · 81aac3a1
      Prakash Sangappa authored
      
      
      Add tests for the UFFD_FEATURE_SIGBUS feature.  The tests verify
      signal delivery instead of userfault events.  They also test the use
      of UFFDIO_COPY to allocate memory and retry accessing the monitored
      area after signal delivery.

      Also fix a bug in uffd_poll_thread() where 'uffd' is leaked.
      
      Link: http://lkml.kernel.org/r/1501552446-748335-3-git-send-email-prakash.sangappa@oracle.com
      Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81aac3a1
    • mm: userfaultfd: add feature to request for a signal delivery · 2d6d6f5a
      Prakash Sangappa authored
      
      
      In some cases, the userfaultfd mechanism should just deliver a SIGBUS
      signal to the faulting process instead of the page-fault event.
      Dealing with the page-fault event using a monitor thread can be an
      overhead in these cases.  For example, applications like databases
      could use the signaling mechanism for robustness purposes.

      A database uses hugetlbfs for performance reasons.  Files on the
      hugetlbfs filesystem are created and huge pages allocated using the
      fallocate() API.  Pages are deallocated/freed using fallocate() hole
      punching support.  These files are mmapped and accessed by many
      processes as shared memory.  The database keeps track of which
      offsets in the hugetlbfs file have pages allocated.

      Any access to a mapped address over a hole in the file, which can
      occur due to bugs in the application, is considered invalid, and the
      process is expected to simply receive a SIGBUS.  However, currently
      when a hole in the file is accessed via the mapped address, kernel/mm
      attempts to automatically allocate a page at page-fault time,
      implicitly filling the hole in the file.  This may not be the desired
      behavior for applications like databases that want to explicitly
      manage page allocations of hugetlbfs files.

      Using the userfaultfd mechanism with this support to get a signal, a
      database application can prevent pages from being allocated
      implicitly when processes access a mapped address over a hole in the
      file.

      This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd
      mechanism to request a SIGBUS signal.
      
      See following for previous discussion about the database requirement
      leading to this proposal as suggested by Andrea.
      
      http://www.spinics.net/lists/linux-mm/msg129224.html
      
      Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
      Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d6d6f5a
    • mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko authored
      
      
      global_page_state is error prone, as a recent bug report pointed out
      [1].  It only returns proper values for zone-based counters, as the
      enum it takes suggests.  We already have global_node_page_state, so
      let's rename global_page_state to global_zone_page_state to be more
      explicit here.  All existing users seem to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • mm: shm: use new hugetlb size encoding definitions · 4da243ac
      Mike Kravetz authored
      
      
      Use the common definitions from hugetlb_encode.h header file for
      encoding hugetlb size definitions in shmget system call flags.
      
      In addition, move these definitions from the internal (kernel) to user
      (uapi) header file.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-4-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4da243ac
    • mm: arch: consolidate mmap hugetlb size encodings · aafd4562
      Mike Kravetz authored
      
      
      A non-default huge page size can be encoded in the flags argument of the
      mmap system call.  The definitions for these encodings are in arch
      specific header files.  However, all architectures use the same values.
      
      Consolidate all the definitions in the primary user header file
      (uapi/linux/mman.h).  Include definitions for all known huge page sizes.
      Use the generic encoding definitions in hugetlb_encode.h as the basis
      for these definitions.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aafd4562
    • mm: hugetlb: define system call hugetlb size encodings in single file · e652f694
      Mike Kravetz authored
      
      
      Patch series "Consolidate system call hugetlb page size encodings".
      
      These patches are the result of discussions in
      https://lkml.org/lkml/2017/3/8/548.  The following changes are made in the
      patch set:
      
      1) Put all the log2 encoded huge page size definitions in a common
         header file.  The idea is to have a set of definitions that can be
         used as the basis for system call specific definitions such as
         MAP_HUGE_* and SHM_HUGE_*.
      
      2) Remove MAP_HUGE_* definitions in arch specific files.  All these
         definitions are the same.  Consolidate all definitions in the primary
         user header file (uapi/linux/mman.h).
      
      3) Remove SHM_HUGE_* definitions intended for user space from kernel
         header file, and add to user (uapi/linux/shm.h) header file.  Add
         definitions for all known huge page size encodings as in mmap.
      
      This patch (of 3):
      
      If hugetlb pages are requested in mmap or shmget system calls, a huge
      page size other than default can be requested.  This is accomplished by
      encoding the log2 of the huge page size in the upper bits of the flag
      argument.  asm-generic and arch specific headers all define the same
      values for these encodings.
      
      Put common definitions in a single header file.  The primary uapi header
      files for mmap and shm will use these definitions as a basis for
      definitions specific to those system calls.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-2-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e652f694
    • include/linux/fs.h: remove unneeded forward definition of mm_struct · a446d6f9
      Jeff Layton authored
      
      
      Link: http://lkml.kernel.org/r/20170525102927.6163-1-jlayton@redhat.com
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a446d6f9
    • fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_range · de23abd1
      Jeff Layton authored
      
      
      The fsync codepath assumes that f_mapping can never be NULL, but
      sync_file_range has a check for that.

      Remove the check from sync_file_range, as I don't see how you'd ever
      get a NULL pointer in here.
      
      Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.com
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de23abd1
    • userfaultfd: selftest: enable testing of UFFDIO_ZEROPAGE for shmem · 824f9739
      Mike Rapoport authored
      
      
      Link: http://lkml.kernel.org/r/1497939652-16528-8-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      824f9739
    • userfaultfd: report UFFDIO_ZEROPAGE as available for shmem VMAs · ce53e8e6
      Mike Rapoport authored
      
      
      Now that shmem VMAs can be filled with the zero page via userfaultfd,
      we can report that UFFDIO_ZEROPAGE is available for those VMAs.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce53e8e6
    • userfaultfd: shmem: wire up shmem_mfill_zeropage_pte · 8fb44e54
      Mike Rapoport authored
      
      
      For shmem VMAs we can use shmem_mfill_zeropage_pte for
      UFFDIO_ZEROPAGE.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-6-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fb44e54
    • userfaultfd: mcopy_atomic: introduce mfill_atomic_pte helper · 3217d3c7
      Mike Rapoport authored
      
      
      Shuffle the code a bit to improve readability.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-5-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3217d3c7
    • userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support · 8d103963
      Mike Rapoport authored
      
      
      shmem_mfill_zeropage_pte is the low level routine that implements the
      userfaultfd UFFDIO_ZEROPAGE command.  Since for shmem mappings zero
      pages are always allocated and accounted, the new method is a slight
      extension of the existing shmem_mcopy_atomic_pte.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d103963
    • shmem: introduce shmem_inode_acct_block · 0f079694
      Mike Rapoport authored
      
      
      The shmem_acct_block call and the update of used_blocks follow one
      another in all the places they are used.  Combine the two into a
      helper function.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-3-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f079694
    • shmem: shmem_charge: verify max_block is not exceeded before inode update · b1cc94ab
      Mike Rapoport authored
      
      
      Patch series "userfaultfd: enable zeropage support for shmem".
      
      These patches enable support for UFFDIO_ZEROPAGE for shared memory.
      
      The first two patches are not strictly related to userfaultfd; they
      are just minor refactoring to reduce the amount of code duplication.
      
      This patch (of 7):
      
      Currently we update the inode and shmem_inode_info before verifying
      that used_blocks will not exceed max_blocks, and undo the update if
      it would.  Let's switch the order and move the verification of the
      block count before the inode and shmem_inode_info update.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-2-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1cc94ab
    • mm, THP, swap: add THP swapping out fallback counting · fe490cc0
      Huang Ying authored
      
      
      When swapping out a THP (Transparent Huge Page), instead of swapping
      out the THP as a whole, sometimes we have to fall back to splitting
      the THP into normal pages before swapping because, for example, no
      free swap clusters are available or the cgroup limit is exceeded.
      To count these fallbacks, a new VM event, THP_SWPOUT_FALLBACK, is
      added and counted when we fall back to splitting the THP.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zra...
      fe490cc0
    • mm, THP, swap: delay splitting THP after swapped out · bd4c82c2
      Huang Ying authored
      
      
      In this patch, splitting a transparent huge page (THP) during
      swap-out is delayed from after adding the THP to the swap cache to
      after swap-out finishes.  After the patch, more operations for
      anonymous THP reclaim, such as writing the THP to the swap device
      and removing the THP from the swap cache, can be batched, so the
      performance of anonymous THP swap-out is improved.
      
      This is the second step for the THP swap support.  The plan is to delay
      splitting the THP step by step and avoid splitting the THP finally.
      
      With the patchset, the swap out throughput improves 42% (from about
      5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
      with 16 processes.  At the same time, the IPI (reflect TLB flushing)
      reduced about 78.9%.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-12-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd4c82c2
    • memcg, THP, swap: make mem_cgroup_swapout() support THP · d6810d73
      Huang Ying authored
      
      
      This patch makes mem_cgroup_swapout() work for transparent huge
      pages (THP), moving the memory cgroup charge from memory to swap for
      a THP.

      This will be used for THP swap support, where a THP may be swapped
      out as a whole to a set of (HPAGE_PMD_NR) contiguous swap slots on
      the swap device.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-11-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6810d73
    • memcg, THP, swap: avoid to duplicated charge THP in swap cache · abe2895b
      Huang Ying authored
      
      
      For a THP (Transparent Huge Page), tail_page->mem_cgroup is NULL, so
      to check whether the page is charged already, we need to check the
      head page.  This was not an issue before because it was impossible
      for a THP to be in the swap cache.  But after we add support for
      delaying the THP split until after swap-out, it is possible now.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-10-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abe2895b
    • memcg, THP, swap: support move mem cgroup charge for THP swapped out · 3e14a57b
      Huang Ying authored
      
      
      A PTE-mapped THP (Transparent Huge Page) is ignored when moving a
      memory cgroup charge.  But for a THP in the swap cache, the memory
      cgroup charge for the swap of a tail page may be moved in the
      current implementation.  That isn't correct, because the swap
      charges for all sub-pages of a THP should be moved together.
      Following the handling of PTE-mapped THPs, moving the mem cgroup
      charge for the swap entry of a tail page of a THP is now ignored
      too.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-9-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e14a57b
    • mm, THP, swap: support splitting THP for THP swap out · 59807685
      Huang Ying authored
      
      
      After adding swap-out support for THP (Transparent Huge Pages), it is
      possible that a THP in the swap cache (partly swapped out) needs to be
      split.  To split such a THP, the swap cluster backing the THP needs to
      be split too, that is, the CLUSTER_FLAG_HUGE flag needs to be cleared
      for the swap cluster.  This patch implements that.
      
      Because writing a THP to the swap device requires the page to stay
      huge for the duration of the write, the PageWriteback flag is checked
      before splitting.
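      The constraint can be sketched in plain C; the types, names, and flag
      value below are stand-ins for illustration, not the kernel's actual
      definitions (the real code also holds the appropriate cluster and
      page locks):

      ```c
      #include <assert.h>
      #include <stdbool.h>

      #define CLUSTER_FLAG_HUGE 0x4   /* stand-in value for the cluster flag */

      struct swap_cluster { unsigned int flags; };
      struct page_state  { bool writeback; };

      /* Refuse to split while the THP is under writeback: the swap write
       * path needs the page to stay huge until the IO completes.  Once the
       * write is done, clear CLUSTER_FLAG_HUGE so the backing swap cluster
       * is split along with the THP. */
      static bool try_split_thp_cluster(struct page_state *page,
                                        struct swap_cluster *ci)
      {
              if (page->writeback)
                      return false;
              ci->flags &= ~CLUSTER_FLAG_HUGE;
              return true;
      }

      int main(void)
      {
              struct swap_cluster ci = { .flags = CLUSTER_FLAG_HUGE };
              struct page_state page = { .writeback = true };

              assert(!try_split_thp_cluster(&page, &ci));
              assert(ci.flags & CLUSTER_FLAG_HUGE);   /* untouched during IO */

              page.writeback = false;
              assert(try_split_thp_cluster(&page, &ci));
              assert(!(ci.flags & CLUSTER_FLAG_HUGE));
              return 0;
      }
      ```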
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-8-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59807685
    • mm: test code to write THP to swap device as a whole · 225311a4
      Huang Ying authored
      
      
      To support delayed splitting of THP (Transparent Huge Pages) after
      swap-out, we need to enhance the swap writing code to support writing
      a THP as a whole.  This will improve swap write IO performance.
      
      As Ming Lei <ming.lei@redhat.com> pointed out, this should be based on
      multipage bvec support, which hasn't been merged yet.  So this patch
      is only for testing the functionality of the other patches in the
      series, and will be reimplemented after multipage bvec support is
      merged.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      225311a4
    • block, THP: make block_device_operations.rw_page support THP · 98cc093c
      Huang Ying authored
      
      
      The .rw_page method in struct block_device_operations is used by the
      swap subsystem to read/write the page contents from/into the
      corresponding swap slot in the swap device.  To support the THP
      (Transparent Huge Page) swap optimization, .rw_page is enhanced to
      read/write a THP if possible.
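      The essential change for an enhanced .rw_page implementation is
      sizing the transfer by whether the page is huge.  A minimal userspace
      sketch with stand-in constants and names (not the kernel's helpers):

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define PAGE_SIZE     4096UL
      #define HPAGE_PMD_NR  512     /* 4k pages per 2M THP */

      /* A .rw_page implementation supporting THP must size the transfer
       * by whether the page is a THP, rather than assuming a single 4k
       * page as before. */
      static size_t rw_page_bytes(bool is_thp)
      {
              size_t nr_pages = is_thp ? HPAGE_PMD_NR : 1;
              return nr_pages * PAGE_SIZE;
      }

      int main(void)
      {
              assert(rw_page_bytes(false) == 4096UL);           /* normal page */
              assert(rw_page_bytes(true)  == 2UL * 1024 * 1024); /* whole THP */
              return 0;
      }
      ```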
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98cc093c
    • mm, THP, swap: don't allocate huge cluster for file backed swap device · f0eea189
      Huang Ying authored
      
      
      It is hard to write a whole transparent huge page (THP) to a
      file-backed swap device during swap-out, and file-backed swap devices
      aren't very popular.  So huge cluster allocation is disabled for
      file-backed swap devices.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-5-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0eea189
    • mm, THP, swap: make reuse_swap_page() works for THP swapped out · ba3c4ce6
      Huang Ying authored
      
      
      After adding support for delaying THP (Transparent Huge Page)
      splitting until after swap-out, it is possible that some page table
      mappings of the THP are turned into swap entries.  So
      reuse_swap_page() needs to check the swap count in addition to the
      map count it checked before.  This patch does that.
      
      In the huge PMD write protect fault handler, the swap count needs to
      be checked in addition to the page map count, so the page lock needs
      to be acquired as well as the page table lock when calling
      reuse_swap_page().
      
      [ying.huang@intel.com: silence a compiler warning]
        Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
      Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba3c4ce6
    • mm, THP, swap: support to reclaim swap space for THP swapped out · e0709829
      Huang Ying authored
      
      
      Normal swap slot reclaiming can be done when the swap count reaches
      SWAP_HAS_CACHE.  But for a swap slot which is backing a THP, all swap
      slots backing that THP must be reclaimed together, because a swap
      slot may be used again when the THP is swapped out again later.  So
      the swap slots backing one THP can be reclaimed together only when
      the swap count for all of them has reached SWAP_HAS_CACHE.  In this
      patch, functions to check whether the swap count for all swap slots
      backing one THP has reached SWAP_HAS_CACHE are implemented and used
      when checking whether a swap slot can be reclaimed.
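      The all-slots check can be sketched in userspace C.  The function
      name and the plain array standing in for the swap map are
      hypothetical; SWAP_HAS_CACHE below mirrors the kernel's flag value:

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define SWAP_HAS_CACHE 0x40   /* slot cached, no other references */
      #define HPAGE_NR       512    /* swap slots backing one 2M THP */

      /* Return true only if every slot backing the THP has reached
       * SWAP_HAS_CACHE, i.e. the whole cluster may be reclaimed
       * together; a single still-referenced slot blocks reclaim. */
      static bool thp_cluster_reclaimable(const unsigned char *swap_map)
      {
              for (size_t i = 0; i < HPAGE_NR; i++)
                      if (swap_map[i] != SWAP_HAS_CACHE)
                              return false;
              return true;
      }

      int main(void)
      {
              unsigned char map[HPAGE_NR];
              for (size_t i = 0; i < HPAGE_NR; i++)
                      map[i] = SWAP_HAS_CACHE;
              assert(thp_cluster_reclaimable(map));

              map[17] = SWAP_HAS_CACHE | 1;   /* one slot still referenced */
              assert(!thp_cluster_reclaimable(map));
              return 0;
      }
      ```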
      
      To make it easier to determine whether a swap slot is backing a THP,
      a new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a
      swap cluster which is backing a THP (Transparent Huge Page).  Because
      swapping a THP back in as a whole isn't supported yet, the
      CLUSTER_FLAG_HUGE flag is cleared after the THP is deleted from the
      swap cache (for example, when swap-out finishes), so that the normal
      pages inside the THP can be swapped in individually.
      
      [ying.huang@intel.com: fix swap_page_trans_huge_swapped on HDD]
        Link: http://lkml.kernel.org/r/874ltsm0bi.fsf@yhuang-dev.intel.com
      Link: http://lkml.kernel.org/r/20170724051840.2309-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0709829
    • mm, THP, swap: support to clear swap cache flag for THP swapped out · a3aea839
      Huang Ying authored
      
      
      Patch series "mm, THP, swap: Delay splitting THP after swapped out", v3.
      
      This is the second step of THP (Transparent Huge Page) swap
      optimization.  In the first step, huge page splitting is delayed from
      almost the beginning of swap-out to after the swap space for the THP
      has been allocated and the THP has been added into the swap cache.
      In the second step, the splitting is delayed further, until after the
      swap-out has finished.  The plan is to delay splitting THP step by
      step, finally avoiding splitting the THP during swap-out and swapping
      the THP out/in as a whole.
      
      In this patchset, more operations for anonymous THP reclaiming, such
      as TLB flushing, writing the THP to the swap device, and removing the
      THP from the swap cache, are batched, so that the performance of
      anonymous THP swap-out is improved.
      
      During development, the following scenarios/code paths have been
      checked:
      
       - swap out/in
       - swap off
       - write protect page fault
       - madvise_free
       - process exit
       - split huge page
      
      With the patchset, the swap-out throughput improves by 42% (from
      about 5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq
      test case with 16 processes.  At the same time, the IPIs (which
      reflect TLB flushing) are reduced by about 78.9%.  The test was done
      on a Xeon E5 v3 system, with a RAM-simulated PMEM (persistent memory)
      device as the swap device.  To test sequential swap-out, the test
      case creates 8 processes, which sequentially allocate and write to
      anonymous pages until the RAM and part of the swap device are used
      up.
      
      Below is the part of the cover letter for the first step patchset of THP
      swap optimization which applies to all steps.
      
      =========================
      
      Recently, the performance of storage devices has improved so fast
      that we cannot saturate the disk bandwidth with a single logical CPU
      when swapping pages out, even on a high-end server machine, because
      storage device performance has improved faster than that of a single
      logical CPU.  It seems that this trend will not change in the near
      future.  On the other hand, THP is becoming more and more popular
      because of increased memory sizes.  So it becomes necessary to
      optimize THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce TLB flushing and lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. This is
         particularly helpful for swap reads, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help memory fragmentation, especially when the THP is
         heavily used by applications. The 2M of continuous pages will be
         freed up after the THP is swapped out.
      
       - It will improve THP utilization on systems with swap turned on,
         because the speed at which khugepaged collapses normal pages into
         a THP is quite slow. If the THP is split during swap-out, it will
         take quite a long time for the normal pages to collapse back into
         a THP after being swapped in. High THP utilization also helps the
         efficiency of page-based memory management.
      
      There are some concerns regarding THP swap-in, mainly because the
      possibly enlarged read/write IO size (for swap-in/out) may put more
      overhead on the storage device.  To deal with that, THP swap-in
      should be turned on only when necessary.
      
      For example, it could be selected via "always/never/madvise" logic:
      turned on globally, turned off globally, or turned on only for VMAs
      with MADV_HUGEPAGE, etc.
      
      This patch (of 12):
      
      Previously, swapcache_free_cluster() was used only in the error path
      of shrink_page_list(), to free the swap cluster just allocated if the
      THP (Transparent Huge Page) fails to be split.  In this patch, it is
      enhanced to also clear the swap cache flag (SWAP_HAS_CACHE) for the
      swap cluster that holds the contents of a THP swapped out.
      
      This will be used to support delaying the splitting of THP until
      after swap-out.  Because there is no support for swapping a THP back
      in as a whole yet, after the swap cache flag is cleared, the swap
      cluster backing the THP swapped out will be split, so that the swap
      slots in the swap cluster can be swapped in as normal pages later.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3aea839
    • mm: memcontrol: use int for event/state parameter in several functions · 04fecbf5
      Matthias Kaehlcke authored
      
      
      Several functions use an enum type as parameter for an event/state, but
      are called in some locations with an argument of a different enum type.
      Adjust the interface of these functions to reality by changing the
      parameter to int.
      
      This fixes a ton of enum-conversion warnings that are generated when
      building the kernel with clang.
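      The warning pattern can be reproduced with a small stand-alone
      example (the enum names and values are illustrative, not the
      kernel's definitions):

      ```c
      #include <assert.h>

      /* Two distinct enum types, standing in for the memcg state/event
       * enums (illustrative values). */
      enum memcg_stat_item  { MEMCG_CACHE = 0, MEMCG_RSS = 1 };
      enum memcg_event_item { MEMCG_LOW = 2, MEMCG_HIGH = 3 };

      static long counters[8];

      /* Declaring this parameter as `enum memcg_stat_item` would make
       * every call that passes an `enum memcg_event_item` value trigger
       * clang's -Wenum-conversion warning.  Taking plain `int` matches
       * how the callers actually use the function: any of the enum
       * values indexes the same array. */
      static void mod_state(int idx, long val)
      {
              counters[idx] += val;
      }

      int main(void)
      {
              mod_state(MEMCG_CACHE, 1);   /* an enum memcg_stat_item */
              mod_state(MEMCG_HIGH, 2);    /* an enum memcg_event_item */
              assert(counters[MEMCG_CACHE] == 1);
              assert(counters[MEMCG_HIGH] == 2);
              return 0;
      }
      ```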
      
      [mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()]
        Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org
      Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Doug Anderson <dianders@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04fecbf5
    • mm/hugetlb.c: constify attribute_group structures · 67e5ed96
      Arvind Yadav authored
      
      
      attribute_group structures are not supposed to change at runtime.
      All functions provided by <linux/sysfs.h> that work with
      attribute_group take pointers to const attribute_group.  So mark the
      non-const structs as const.
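      The change itself is mechanical.  Sketched below with simplified
      stand-ins for the sysfs types (the real definitions live in
      <linux/sysfs.h>; the attribute names here are illustrative):

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Simplified stand-ins for the kernel's sysfs types. */
      struct attribute { const char *name; };

      struct attribute_group {
              const char *name;
              struct attribute **attrs;
      };

      static struct attribute nr_hugepages_attr = { .name = "nr_hugepages" };
      static struct attribute *hstate_attrs[] = { &nr_hugepages_attr, NULL };

      /* Before: `static struct attribute_group hstate_attr_group = ...`.
       * The group is only ever read through const pointers by the sysfs
       * helpers, so it can be marked const and placed in read-only data. */
      static const struct attribute_group hstate_attr_group = {
              .attrs = hstate_attrs,
      };

      int main(void)
      {
              assert(hstate_attr_group.attrs[0] == &nr_hugepages_attr);
              assert(strcmp(hstate_attr_group.attrs[0]->name,
                            "nr_hugepages") == 0);
              assert(hstate_attr_group.attrs[1] == NULL);
              return 0;
      }
      ```

      Since the consumers already take const pointers, the conversion
      needs no call-site changes.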
      
      Link: http://lkml.kernel.org/r/1501157260-3922-1-git-send-email-arvind.yadav.cs@gmail.com
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67e5ed96
    • mm/huge_memory.c: constify attribute_group structures · 8aa95a21
      Arvind Yadav authored
      
      
      attribute_group structures are not supposed to change at runtime.
      All functions provided by <linux/sysfs.h> that work with
      attribute_group take pointers to const attribute_group.  So mark the
      non-const structs as const.
      
      Link: http://lkml.kernel.org/r/1501157240-3876-1-git-send-email-arvind.yadav.cs@gmail.com
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8aa95a21
    • mm/page_idle.c: constify attribute_group structures · fd147cbb
      Arvind Yadav authored
      
      
      attribute_group structures are not supposed to change at runtime.
      All functions provided by <linux/sysfs.h> that work with
      attribute_group take pointers to const attribute_group.  So mark the
      non-const structs as const.
      
      Link: http://lkml.kernel.org/r/1501157221-3832-1-git-send-email-arvind.yadav.cs@gmail.com
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd147cbb
    • mm/slub.c: constify attribute_group structures · 1fdaaa23
      Arvind Yadav authored
      
      
      attribute_group structures are not supposed to change at runtime.
      All functions provided by <linux/sysfs.h> that work with
      attribute_group take pointers to const attribute_group.  So mark the
      non-const structs as const.
      
      Link: http://lkml.kernel.org/r/1501157186-3749-1-git-send-email-arvind.yadav.cs@gmail.com
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fdaaa23
    • mm/ksm.c: constify attribute_group structures · f907c26a
      Arvind Yadav authored
      
      
      attribute_group structures are not supposed to change at runtime.
      All functions provided by <linux/sysfs.h> that work with
      attribute_group take pointers to const attribute_group.  So mark the
      non-const structs as const.
      
      Link: http://lkml.kernel.org/r/1501157167-3706-2-git-send-email-arvind.yadav.cs@gmail.com
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f907c26a
    • cgroup: revert fa06235b ("cgroup: reset css on destruction") · 65f3975f
      Roman Gushchin authored
      Commit fa06235b ("cgroup: reset css on destruction") caused the
      css_reset callback to be called from the offlining path.  Although it
      solves the problem mentioned in the commit description ("For
      instance, memory cgroup needs to reset memory.low, otherwise pages
      charged to a dead cgroup might never get reclaimed."), generally
      speaking, it's not correct.
      
      An offline cgroup can still be a resource domain, and we shouldn't grant
      it more resources than it had before deletion.
      
      For instance, if an offline memory cgroup has dirty pages, we should
      still imply i/o limits during writeback.
      
      The css_reset callback is designed to return a cgroup to its original
      state, that is, to reset all limits and counters.  That is something
      different from offlining, and we shouldn't use it from the offlining
      path.  Instead, we should adjust the necessary settings from the
      per-controller css_offline callbacks (e.g. reset memory.low).
      
      Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65f3975f
    • mm, memcg: reset memory.low during memcg offlining · 63677c74
      Roman Gushchin authored
      A removed memory cgroup with a defined memory.low and some remaining
      pagecache has very little chance of being freed.
      
      If a cgroup has been removed, there is likely no memory pressure
      inside the cgroup, and the pagecache is protected from external
      pressure by the defined low limit.  The cgroup will be freed only
      after all of its pages are reclaimed, and that will not happen while
      there is other reclaimable memory in the system.  That means there is
      a good chance that cold pagecache will reside in memory for an
      undefined amount of time, wasting system resources.
      
      This problem was fixed earlier by fa06235b ("cgroup: reset css on
      destruction"), but that's not the best way to do it, as we can't
      really reset all limits/counters during cgroup offlining.
      
      Link: http://lkml.kernel.org/r/20170727130428.28856-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63677c74