Commit a6283010 authored by Alex Sierra's avatar Alex Sierra Committed by Alex Deucher
Browse files

drm/amdkfd: avoid recursive lock in migrations back to RAM



[Why]:
When we call hmm_range_fault to map memory after a migration, we don't
expect memory to be migrated again as a result of hmm_range_fault. The
driver ensures that all memory is in GPU-accessible locations so that
no migration should be needed. However, there is one corner case where
hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE
back to system memory due to a write-fault when a system memory page in
the same range was mapped read-only (e.g. COW). Ranges with individual
pages in different locations are usually the result of failed page
migrations (e.g. page lock contention). The unexpected migration back
to system memory causes a deadlock from recursive locking in our
driver.

[How]:
Creating a task reference new member under svm_range_list struct.
Setting this with "current" reference, right before the hmm_range_fault
is called. This member is checked against "current" reference at
svm_migrate_to_ram callback function. If equal, the migration will be
ignored.

Signed-off-by: default avatarAlex Sierra <alex.sierra@amd.com>
Reviewed-by: default avatarFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
parent 25a1a08f
Loading
Loading
Loading
Loading
+5 −0
Original line number Diff line number Diff line
@@ -861,6 +861,11 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
		pr_debug("failed find process at fault address 0x%lx\n", addr);
		return VM_FAULT_SIGBUS;
	}
	if (READ_ONCE(p->svms.faulting_task) == current) {
		pr_debug("skipping ram migration\n");
		kfd_unref_process(p);
		return 0;
	}
	addr >>= PAGE_SHIFT;
	pr_debug("CPU page fault svms 0x%p address 0x%lx\n", &p->svms, addr);

+1 −0
Original line number Diff line number Diff line
@@ -768,6 +768,7 @@ struct svm_range_list {
	atomic_t			evicted_ranges;
	struct delayed_work		restore_work;
	DECLARE_BITMAP(bitmap_supported, MAX_GPU_INSTANCE);
	struct task_struct 		*faulting_task;
};

/* Process data */
+2 −0
Original line number Diff line number Diff line
@@ -1496,9 +1496,11 @@ static int svm_range_validate_and_map(struct mm_struct *mm,

		next = min(vma->vm_end, end);
		npages = (next - addr) >> PAGE_SHIFT;
		WRITE_ONCE(p->svms.faulting_task, current);
		r = amdgpu_hmm_range_get_pages(&prange->notifier, mm, NULL,
					       addr, npages, &hmm_range,
					       readonly, true, owner);
		WRITE_ONCE(p->svms.faulting_task, NULL);
		if (r) {
			pr_debug("failed %d to get svm range pages\n", r);
			goto unreserve_out;