Commit 408c46bd authored by Tomer Tayar, committed by Oded Gabbay

habanalabs: print context refcount value if hard reset fails

Failing to kill a user process during a hard reset can be due to a
reference to the user context which isn't released.
To make it easier to tell whether this is the reason for the failure and
not something else, add a print of the context refcount value.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
parent 0abcae8b
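
The patch reads the refcount while holding its own temporary reference, which is why it prints the value minus one. The idea can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct ref` and the `ref_*` helpers are hypothetical stand-ins built on C11 atomics, not the kernel's `kref` API and not the driver's `hl_ctx` code.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a kernel reference counter. */
struct ref {
	atomic_uint count;
};

/* A new object starts with one reference, owned by its creator. */
static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

static void ref_get(struct ref *r)
{
	atomic_fetch_add(&r->count, 1);
}

static void ref_put(struct ref *r)
{
	atomic_fetch_sub(&r->count, 1);
}

static unsigned int ref_read(struct ref *r)
{
	return atomic_load(&r->count);
}

/*
 * Report how many references are held by everyone except the reader.
 * The reader takes a temporary reference so the object stays valid
 * during the read, then subtracts that reference from the result.
 */
static unsigned int ref_read_excluding_self(struct ref *r)
{
	unsigned int others;

	ref_get(r);                /* temporary reference protecting the read */
	others = ref_read(r) - 1;  /* leave out the reader's own reference */
	ref_put(r);                /* drop the temporary reference */

	return others;
}
```

In this model, an object that has only its initial reference reports 1, and each reference that is taken and never released raises the reported value by one, which is the kind of hint the new log message is meant to give.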
+15 −3
@@ -696,10 +696,22 @@ static void device_hard_reset_pending(struct work_struct *work)
	flags = device_reset_work->flags | HL_DRV_RESET_FROM_RESET_THR;

	rc = hl_device_reset(hdev, flags);

	if ((rc == -EBUSY) && !hdev->device_fini_pending) {
		struct hl_ctx *ctx = hl_get_compute_ctx(hdev);

		if (ctx) {
			/* The read refcount value should be reduced by one, because the
			 * read itself is protected by the reference taken in
			 * hl_get_compute_ctx().
			 */
			dev_info(hdev->dev,
			"Could not reset device. will try again in %u seconds",
				"Could not reset device (compute_ctx refcount %u). will try again in %u seconds",
				kref_read(&ctx->refcount) - 1, HL_PENDING_RESET_PER_SEC);
			hl_ctx_put(ctx);
		} else {
			dev_info(hdev->dev, "Could not reset device. will try again in %u seconds",
				HL_PENDING_RESET_PER_SEC);
		}

		queue_delayed_work(hdev->reset_wq, &device_reset_work->reset_work,
					msecs_to_jiffies(HL_PENDING_RESET_PER_SEC * 1000));