Commit 2bad466c authored by Peter Xu's avatar Peter Xu Committed by Andrew Morton
Browse files

mm/uffd: UFFD_FEATURE_WP_UNPOPULATED

Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.

The new feature bit makes anonymous memory acts the same as file memory on
userfaultfd-wp in that it'll also wr-protect none ptes.

It can be useful in two cases:

(1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot,
    so pre-fault can be replaced by enabling this flag and speed up
    protections

(2) It helps to implement async uffd-wp mode that Muhammad is working on [1]

It's debatable whether this is the most ideal solution because with the
new feature bit set, wr-protect none pte needs to pre-populate the
pgtables to the last level (PAGE_SIZE).  But it seems fine so far to
service either purpose above, so we can leave optimizations for later.

The series brings pte markers to anonymous memory too.  There's some
change in the common mm code path in the 1st patch, great to have some eye
looking at it, but hopefully they're still relatively straightforward.


This patch (of 2):

This is a new feature that controls how uffd-wp handles none ptes.  When
it's set, the kernel will handle anonymous memory the same way as file
memory, by allowing the user to wr-protect unpopulated ptes.

File memories handles none ptes consistently by allowing wr-protecting of
none ptes because of the unawareness of page cache being exist or not. 
For anonymous it was not as persistent because we used to assume that we
don't need protections on none ptes or known zero pages.

One use case of such a feature bit was VM live snapshot, where if without
wr-protecting empty ptes the snapshot can contain random rubbish in the
holes of the anonymous memory, which can cause misbehave of the guest when
the guest OS assumes the pages should be all zeros.

QEMU worked it around by pre-populate the section with reads to fill in
zero page entries before starting the whole snapshot process [1].

Recently there's another need raised on using userfaultfd wr-protect for
detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
case if without being able to wr-protect none ptes by default, the dirty
info can get lost, since we cannot treat every none pte to be dirty (the
current design is identify a page dirty based on uffd-wp bit being
cleared).

In general, we want to be able to wr-protect empty ptes too even for
anonymous.

This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
uffd-wp handling on none ptes being consistent no matter what the memory
type is underneath.  It doesn't have any impact on file memories so far
because we already have pte markers taking care of that.  So it only
affects anonymous.

The feature bit is by default off, so the old behavior will be maintained.
Sometimes it may be wanted because the wr-protect of none ptes will
contain overheads not only during UFFDIO_WRITEPROTECT (by applying pte
markers to anonymous), but also on creating the pgtables to store the pte
markers.  So there's potentially less chance of using thp on the first
fault for a none pmd or larger than a pmd.

The major implementation part is teaching the whole kernel to understand
pte markers even for anonymously mapped ranges, meanwhile allowing the
UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
new feature bit is set.

Note that even if the patch subject starts with mm/uffd, there're a few
small refactors to major mm path of handling anonymous page faults.  But
they should be straightforward.

With WP_UNPOPUATED, application like QEMU can avoid pre-read faults all
the memory before wr-protect during taking a live snapshot.  Quotting from
Muhammad's test result here [3] based on a simple program [4]:

  (1) With huge page disabled
  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  ./uffd_wp_perf
  Test DEFAULT: 4
  Test PRE-READ: 1111453 (pre-fault 1101011)
  Test MADVISE: 278276 (pre-fault 266378)
  Test WP-UNPOPULATE: 11712

  (2) With Huge page enabled
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  ./uffd_wp_perf
  Test DEFAULT: 4
  Test PRE-READ: 22521 (pre-fault 22348)
  Test MADVISE: 4909 (pre-fault 4743)
  Test WP-UNPOPULATE: 14448

There'll be a great perf boost for no-thp case, while for thp enabled with
extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE,
but that's low possibility in reality, also the overhead was not reduced
but postponed until a follow up write on any huge zero thp, so potentially
it is faster by making the follow up writes slower.

[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
[2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
[3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
[4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c

[peterx@redhat.com: comment changes, oneliner fix to khugepaged]
  Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n
Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com


Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Paul Gofman <pgofman@codeweavers.com>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
parent c6a690e0
Loading
Loading
Loading
Loading
+17 −0
Original line number Diff line number Diff line
@@ -219,6 +219,23 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.

Userfaultfd write-protect mode currently behave differently on none ptes
(when e.g. page is missing) over different types of memories.

For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
(e.g. when pages are missing and not populated).  For file-backed memories
like shmem and hugetlbfs, none ptes will be write protected just like a
present pte.  In other words, there will be a userfaultfd write fault
message generated when writing to a missing page on file typed memories,
as long as the page range was write-protected before.  Such a message will
not be generated on anonymous memories by default.

If the application wants to be able to write protect none ptes on anonymous
memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ.  On
newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
and set the feature bit in advance to make sure none ptes will also be
write protected even upon anonymous memory.

QEMU/KVM
========

+16 −0
Original line number Diff line number Diff line
@@ -108,6 +108,21 @@ static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
	return ctx->features & UFFD_FEATURE_INITIALIZED;
}

/*
 * Whether WP_UNPOPULATED is enabled on the uffd context.  It is only
 * meaningful when userfaultfd_wp()==true on the vma and when it's
 * anonymous.
 */
bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
{
	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;

	if (!ctx)
		return false;

	return ctx->features & UFFD_FEATURE_WP_UNPOPULATED;
}

static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
				     vm_flags_t flags)
{
@@ -1971,6 +1986,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
#endif
#ifndef CONFIG_PTE_MARKER_UFFD_WP
	uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
	uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
#endif
	uffdio_api.ioctls = UFFD_API_IOCTLS;
	ret = -EFAULT;
+6 −0
Original line number Diff line number Diff line
@@ -557,6 +557,12 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
	/* The current status of the pte should be "cleared" before calling */
	WARN_ON_ONCE(!pte_none(*pte));

	/*
	 * NOTE: userfaultfd_wp_unpopulated() doesn't need this whole
	 * thing, because when zapping either it means it's dropping the
	 * page, or in TTU where the present pte will be quickly replaced
	 * with a swap pte.  There's no way of leaking the bit.
	 */
	if (vma_is_anonymous(vma) || !userfaultfd_wp(vma))
		return;

+23 −0
Original line number Diff line number Diff line
@@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
				  unsigned long end, struct list_head *uf);
extern void userfaultfd_unmap_complete(struct mm_struct *mm,
				       struct list_head *uf);
extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);

#else /* CONFIG_USERFAULTFD */

@@ -274,8 +275,30 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
	return false;
}

static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
{
	return false;
}

#endif /* CONFIG_USERFAULTFD */

static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
{
	/* Only wr-protect mode uses pte markers */
	if (!userfaultfd_wp(vma))
		return false;

	/* File-based uffd-wp always need markers */
	if (!vma_is_anonymous(vma))
		return true;

	/*
	 * Anonymous uffd-wp only needs the markers if WP_UNPOPULATED
	 * enabled (to apply markers on zero pages).
	 */
	return userfaultfd_wp_unpopulated(vma);
}

static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
{
#ifdef CONFIG_PTE_MARKER_UFFD_WP
+9 −1
Original line number Diff line number Diff line
@@ -38,7 +38,8 @@
			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
			   UFFD_FEATURE_MINOR_SHMEM |		\
			   UFFD_FEATURE_EXACT_ADDRESS |		\
			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
			   UFFD_FEATURE_WP_UNPOPULATED)
#define UFFD_API_IOCTLS				\
	((__u64)1 << _UFFDIO_REGISTER |		\
	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -203,6 +204,12 @@ struct uffdio_api {
	 *
	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
	 * write-protection mode is supported on both shmem and hugetlbfs.
	 *
	 * UFFD_FEATURE_WP_UNPOPULATED indicates that userfaultfd
	 * write-protection mode will always apply to unpopulated pages
	 * (i.e. empty ptes).  This will be the default behavior for shmem
	 * & hugetlbfs, so this flag only affects anonymous memory behavior
	 * when userfault write-protection mode is registered.
	 */
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
#define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -217,6 +224,7 @@ struct uffdio_api {
#define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
#define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
#define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
	__u64 features;

	__u64 ioctls;
Loading