Unverified Commit 094e5940 authored by openeuler-ci-bot, committed by Gitee

!2831 [22.03-LTS-SP3] TDP MMU Support

Merge Pull Request from: @yuzhang_intel 
 
Title: Add TDP MMU support for 22.03-LTS-SP3

Content:
This PR adds TDP MMU support in KVM. The main features are as follows (see the sketch after this list):
- Parallel page fault support in KVM MMU.
- Parallel operations to interact with Linux MMU notifier.
- Fast page fault support in TDP MMU (for fast dirty logging).
- Lazy rmap allocation, which opts out of KVM's rmap data structures in TDP MMU mode and saves host memory.
- Misc cleanups and improvements for KVM MMU (e.g., dirty-logging fixes, MMU notifier optimizations, etc.).
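
A minimal sketch of the idea behind the parallel page fault feature, assuming the rwlock-based mmu_lock described in the locking documentation changes below; this is illustrative only, not code from this PR, and example_install_spte() is a hypothetical helper:

```c
/*
 * Illustrative sketch only -- not code from this PR. With the TDP MMU,
 * kvm->arch.mmu_lock is an rwlock, so several vCPUs can handle their
 * stage-2 faults concurrently under the shared (read) lock.
 */
static int example_handle_tdp_fault(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	int r;

	/* Shared lock: other vCPUs may be handling faults at the same time. */
	read_lock(&vcpu->kvm->arch.mmu_lock);

	/*
	 * Walk/build the TDP page tables and install the missing SPTE.
	 * Individual SPTE updates must be atomic (cmpxchg), because other
	 * walkers can race with us while only the read lock is held.
	 */
	r = example_install_spte(vcpu, gpa);	/* hypothetical helper */

	read_unlock(&vcpu->kvm->arch.mmu_lock);
	return r;
}
```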

Intel-kernel issue:
https://gitee.com/open_euler/dashboard?issue_id=I7S3VQ

Test:
Ran all KVM selftest cases for branch intel/OLK-tdp-mmu-5.10 in 3 modes on a Skylake server:
- TDP MMU enabled
- TDP MMU disabled
- EPT disabled

Also compared the above results with those from branch intel/OLK-5.10 in 2 modes (w/ and w/o EPT). No new failures were found.
Note: the "triple_fault_event_test" case failed in both branches; the cause is unrelated to KVM MMU.

Known issue:
N/A

Default config change:
N/A 
 
Link: https://gitee.com/openeuler/kernel/pulls/2831

 

Reviewed-by: Mao Bibo <maobibo@loongson.cn>
Reviewed-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Jun Tian <jun.j.tian@intel.com>
Reviewed-by: Jason Zeng <jason.zeng@intel.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
parents 35e6f352 685f4d8a
+37 −26
@@ -16,7 +16,19 @@ The acquisition orders for mutexes are as follows:
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
  synchronize_srcu(&kvm->srcu).  Therefore kvm->slots_arch_lock
  can be taken inside a kvm->srcu read-side critical section,
  while kvm->slots_lock cannot.

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock and
  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
  cannot be taken without already holding kvm->arch.mmu_lock (typically with
  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).

Everything else is a leaf: no other lock is taken inside the critical
sections.
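
The ordering described above can be pictured with a short sketch; this is illustrative only (not an excerpt from KVM), and the list manipulation in the middle is a placeholder:

```c
/*
 * Illustrative ordering sketch: the nested spinlock may only be taken
 * while kvm->arch.mmu_lock is already held, typically for read on
 * TDP MMU paths.
 */
read_lock(&kvm->arch.mmu_lock);
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
/* ... add/remove entries on the TDP MMU page lists ... */
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
read_unlock(&kvm->arch.mmu_lock);
```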
@@ -31,25 +43,24 @@ the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail later below.
   tracking. That means we need to restore the saved R/X bits. This is
   described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protect. That means we just need to change the W bit of
   the spte.
2. Write-Protection: The SPTE is present and the fault is caused by
   write-protect. That means we just need to change the W bit of the spte.

What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
SPTE_MMU_WRITEABLE bit on the spte:
What we use to avoid all the race is the Host-writable bit and MMU-writable bit
on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on host.
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
  the gfn is writable on guest mmu and it is not write-protected by shadow
  page write-protection.
- Host-writable means the gfn is writable in the host kernel page tables and in
  its KVM memslot.
- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
is safe because whenever changing these bits can be detected by cmpxchg.
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved
R/X bits if for an access-traced spte, or both. This is safe because whenever
changing these bits can be detected by cmpxchg.
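
As a rough illustration of the cmpxchg step (not the in-tree fast_page_fault(); EXAMPLE_W_BIT is a stand-in for the real writable bit):

```c
/*
 * Illustrative only: set the W bit with cmpxchg. If any bit of the SPTE
 * changed since it was read, the exchange fails and the fault falls back
 * to the slow path under the write lock.
 */
#define EXAMPLE_W_BIT	(1ULL << 1)	/* placeholder for the writable bit */

static bool example_fast_make_writable(u64 *sptep, u64 old_spte)
{
	u64 new_spte = old_spte | EXAMPLE_W_BIT;

	return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}
```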

But we need carefully check these cases:

@@ -178,17 +189,17 @@ See the comments in spte_has_volatile_bits() and mmu_spte_update().
Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.
bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in more
unused/ignored bits. When the VM tries to access the page later on, a fault is
generated and the fast page fault mechanism described above is used to
atomically restore the PTE to a Present state. The W bit is not saved when the
PTE is marked for access tracking and during restoration to the Present state,
the W bit is set depending on whether or not it was a write access. If it
wasn't, then the W bit will remain clear until a write access happens, at which
time it will be set using the Dirty tracking mechanism described above.
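
A rough sketch of the marking step described above; the bit positions and names here are placeholders, not the real SPTE layout:

```c
/*
 * Illustrative only: clear RWX so the SPTE is not-present to hardware,
 * and park the original R/X bits (W is deliberately not saved) in
 * ignored bits so a later fast page fault can restore them.
 */
#define EXAMPLE_RWX_MASK	0x7ULL	/* EPT R/W/X live in bits 0-2 */
#define EXAMPLE_SAVED_SHIFT	54	/* hypothetical ignored-bit area */

static u64 example_mark_for_access_tracking(u64 spte)
{
	u64 saved_rx = spte & (EXAMPLE_RWX_MASK & ~0x2ULL);	/* keep R and X only */

	spte &= ~EXAMPLE_RWX_MASK;			/* non-present to hardware */
	spte |= saved_rx << EXAMPLE_SAVED_SHIFT;	/* stash R/X for restore */
	return spte;
}
```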

3. Reference
------------
+1 −6
@@ -29,7 +29,6 @@

#define __KVM_HAVE_ARCH_INTC_INITIALIZED

#define KVM_USER_MEM_SLOTS 512
#define KVM_HALT_POLL_NS_DEFAULT 500000

#include <kvm/arm_vgic.h>
@@ -526,11 +525,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
			      struct kvm_vcpu_events *events);

#define KVM_ARCH_WANT_MMU_NOTIFIER
int kvm_unmap_hva_range(struct kvm *kvm,
			unsigned long start, unsigned long end, unsigned flags);
int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS

void kvm_arm_halt_guest(struct kvm *kvm);
void kvm_arm_resume_guest(struct kvm *kvm);
+1 −1
@@ -5,12 +5,12 @@
#ifndef __ASM_SPINLOCK_H
#define __ASM_SPINLOCK_H

#include <asm/qrwlock.h>
#include <asm/qspinlock.h>
#include <asm/paravirt.h>

/* How long a lock should spin before we consider blocking */
#define SPIN_THRESHOLD			(1 << 15)
#include <asm/qrwlock.h>

/* See include/linux/spinlock.h */
#define smp_mb__after_spinlock()	smp_mb()
+31 −87
@@ -872,7 +872,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
	 * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
	 * the page we just got a reference to gets unmapped before we have a
	 * chance to grab the mmu_lock, which ensure that if the page gets
	 * unmapped afterwards, the call to kvm_unmap_hva will take it away
	 * unmapped afterwards, the call to kvm_unmap_gfn will take it away
	 * from us again properly. This smp_rmb() interacts with the smp_wmb()
	 * in kvm_mmu_notifier_invalidate_<page|range_end>.
	 */
@@ -1106,126 +1106,70 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
	return ret;
}

static int handle_hva_to_gpa(struct kvm *kvm,
			     unsigned long start,
			     unsigned long end,
			     int (*handler)(struct kvm *kvm,
					    gpa_t gpa, u64 size,
					    void *data),
			     void *data)
{
	struct kvm_memslots *slots;
	struct kvm_memory_slot *memslot;
	int ret = 0;

	slots = kvm_memslots(kvm);

	/* we only care about the pages that the guest sees */
	kvm_for_each_memslot(memslot, slots) {
		unsigned long hva_start, hva_end;
		gfn_t gpa;

		hva_start = max(start, memslot->userspace_addr);
		hva_end = min(end, memslot->userspace_addr +
					(memslot->npages << PAGE_SHIFT));
		if (hva_start >= hva_end)
			continue;

		gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
		ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
	}

	return ret;
}

static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
{
	unsigned flags = *(unsigned *)data;
	bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;

	__unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
	return 0;
}

int kvm_unmap_hva_range(struct kvm *kvm,
			unsigned long start, unsigned long end, unsigned flags)
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	if (!kvm->arch.mmu.pgt)
		return 0;

	trace_kvm_unmap_hva_range(start, end);
	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, &flags);
	return 0;
}

static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
{
	kvm_pfn_t *pfn = (kvm_pfn_t *)data;

	WARN_ON(size != PAGE_SIZE);
	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
			     (range->end - range->start) << PAGE_SHIFT,
			     range->may_block);

	/*
	 * The MMU notifiers will have unmapped a huge PMD before calling
	 * ->change_pte() (which in turn calls kvm_set_spte_hva()) and
	 * therefore we never need to clear out a huge PMD through this
	 * calling path and a memcache is not required.
	 */
	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, gpa, PAGE_SIZE,
			       __pfn_to_phys(*pfn), KVM_PGTABLE_PROT_R, NULL);
	return 0;
}

int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	unsigned long end = hva + PAGE_SIZE;
	kvm_pfn_t pfn = pte_pfn(pte);
	kvm_pfn_t pfn = pte_pfn(range->pte);

	if (!kvm->arch.mmu.pgt)
		return 0;

	trace_kvm_set_spte_hva(hva);
	WARN_ON(range->end - range->start != 1);

	/*
	 * We've moved a page around, probably through CoW, so let's treat it
	 * just like a translation fault and clean the cache to the PoC.
	 */
	clean_dcache_guest_page(pfn, PAGE_SIZE);
	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);

	/*
	 * The MMU notifiers will have unmapped a huge PMD before calling
	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
	 * therefore we never need to clear out a huge PMD through this
	 * calling path and a memcache is not required.
	 */
	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
			       PAGE_SIZE, __pfn_to_phys(pfn),
			       KVM_PGTABLE_PROT_R, NULL);

	return 0;
}

static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	pte_t pte;
	u64 size = (range->end - range->start) << PAGE_SHIFT;
	kvm_pte_t kpte;
	pte_t pte;

	if (!kvm->arch.mmu.pgt)
		return 0;

	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt, gpa);

	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
					range->start << PAGE_SHIFT);
	pte = __pte(kpte);
	return pte_valid(pte) && pte_young(pte);
}

static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
{
	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt, gpa);
}

int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	if (!kvm->arch.mmu.pgt)
		return 0;
	trace_kvm_age_hva(start, end);
	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
}

int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
{
	if (!kvm->arch.mmu.pgt)
		return 0;
	trace_kvm_test_age_hva(hva);
	return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
				 kvm_test_age_hva_handler, NULL);
	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
					   range->start << PAGE_SHIFT);
}

phys_addr_t kvm_mmu_get_httbr(void)
+0 −66
@@ -136,72 +136,6 @@ TRACE_EVENT(kvm_mmio_emulate,
		  __entry->vcpu_pc, __entry->instr, __entry->cpsr)
);

TRACE_EVENT(kvm_unmap_hva_range,
	TP_PROTO(unsigned long start, unsigned long end),
	TP_ARGS(start, end),

	TP_STRUCT__entry(
		__field(	unsigned long,	start		)
		__field(	unsigned long,	end		)
	),

	TP_fast_assign(
		__entry->start		= start;
		__entry->end		= end;
	),

	TP_printk("mmu notifier unmap range: %#016lx -- %#016lx",
		  __entry->start, __entry->end)
);

TRACE_EVENT(kvm_set_spte_hva,
	TP_PROTO(unsigned long hva),
	TP_ARGS(hva),

	TP_STRUCT__entry(
		__field(	unsigned long,	hva		)
	),

	TP_fast_assign(
		__entry->hva		= hva;
	),

	TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva)
);

TRACE_EVENT(kvm_age_hva,
	TP_PROTO(unsigned long start, unsigned long end),
	TP_ARGS(start, end),

	TP_STRUCT__entry(
		__field(	unsigned long,	start		)
		__field(	unsigned long,	end		)
	),

	TP_fast_assign(
		__entry->start		= start;
		__entry->end		= end;
	),

	TP_printk("mmu notifier age hva: %#016lx -- %#016lx",
		  __entry->start, __entry->end)
);

TRACE_EVENT(kvm_test_age_hva,
	TP_PROTO(unsigned long hva),
	TP_ARGS(hva),

	TP_STRUCT__entry(
		__field(	unsigned long,	hva		)
	),

	TP_fast_assign(
		__entry->hva		= hva;
	),

	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
);

TRACE_EVENT(kvm_set_way_flush,
	    TP_PROTO(unsigned long vcpu_pc, bool cache),
	    TP_ARGS(vcpu_pc, cache),