Commits · 6d06779e86724322d79eb53b26989edd9db188f6 · Mirrors / github.com / c-sky_csky-linux

May 22, 2019

drm/i915: Load balancing across a virtual engine · 6d06779e

Chris Wilson authored May 21, 2019

Having allowed the user to define a set of engines that they will want
to only use, we go one step further and allow them to bind those engines
into a single virtual instance. Submitting a batch to the virtual engine
will then forward it to any one of the set in a manner as best to
distribute load.  The virtual engine has a single timeline across all
engines (it operates as a single queue), so it is not able to concurrently
run batches across multiple engines by itself; that is left up to the user
to submit multiple concurrent batches to multiple queues. Multiple users
will be load balanced across the system.

The mechanism used for load balancing in this patch is a late greedy
balancer. When a request is ready for execution, it is added to each
engine's queue, and when an engine is ready for its next request it
claims it from the virtual engine. The first engine to do so, wins, i.e.
the request is executed at the earliest opportunity (idle moment) in the
system.

As not all HW is created equal, the user is still able to skip the
virtual engine and execute the batch on a specific engine, all within the
same queue. It will then be executed in order on the correct engine,
with execution on other virtual engines being moved away due to the load
detection.

A couple of areas for potential improvement left!

- The virtual engine always take priority over equal-priority tasks.
Mostly broken up by applying FQ_CODEL rules for prioritising new clients,
and hopefully the virtual and real engines are not then congested (i.e.
all work is via virtual engines, or all work is to the real engine).

- We require the breadcrumb irq around every virtual engine request. For
normal engines, we eliminate the need for the slow round trip via
interrupt by using the submit fence and queueing in order. For virtual
engines, we have to allow any job to transfer to a new ring, and cannot
coalesce the submissions, so require the completion fence instead,
forcing the persistent use of interrupts.

- We only drip feed single requests through each virtual engine and onto
the physical engines, even if there was enough work to fill all ELSP,
leaving small stalls with an idle CS event at the end of every request.
Could we be greedy and fill both slots? Being lazy is virtuous for load
distribution on less-than-full workloads though.

Other areas of improvement are more general, such as reducing lock
contention, reducing dispatch overhead, looking at direct submission
rather than bouncing around tasklets etc.

sseu: Lift the restriction to allow sseu to be reconfigured on virtual
engines composed of RENDER_CLASS (rcs).

v2: macroize check_user_mbz()
v3: Cancel virtual engines on wedging
v4: Commence commenting
v5: Replace 64b sibling_mask with a list of class:instance
v6: Drop the one-element array in the uabi
v7: Assert it is an virtual engine in to_virtual_engine()
v8: Skip over holes in [class][inst] so we can selftest with (vcs0, vcs2)

Link: https://github.com/intel/media-driver/pull/283


Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190521211134.16117-6-chris@chris-wilson.co.uk

6d06779e

drm/i915: Allow userspace to clone contexts on creation · b81dde71

Chris Wilson authored May 21, 2019



A usecase arose out of handling context recovery in mesa, whereby they
wish to recreate a context with fresh logical state but preserving all
other details of the original. Currently, they create a new context and
iterate over which bits they want to copy across, but it would much more
convenient if they were able to just pass in a target context to clone
during creation. This essentially extends the setparam during creation
to pull the details from a target context instead of the user supplied
parameters.

The ideal here is that we don't expose control over anything more than
can be obtained via CONTEXT_PARAM. That is userspace retains explicit
control over all features, and this api is just convenience.

For example, you could replace

	struct context_param p = { .param = CONTEXT_PARAM_VM };

	param.ctx_id = old_id;
	gem_context_get_param(&p.param);

	new_id = gem_context_create();

	param.ctx_id = new_id;
	gem_context_set_param(&p.param);

	gem_vm_destroy(param.value); /* drop the ref to VM_ID handle */

with

	struct create_ext_param p = {
	  { .name = CONTEXT_CREATE_CLONE },
	  .clone_id = old_id,
	  .flags = CLONE_FLAGS_VM
	}
	new_id = gem_context_create_ext(&p);

and not have to worry about stray namespace pollution etc.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190521211134.16117-5-chris@chris-wilson.co.uk

b81dde71

drm/i915: Re-expose SINGLE_TIMELINE flags for context creation · 8319f44c

Chris Wilson authored May 21, 2019

The SINGLE_TIMELINE flag can be used to create a context such that all
engine instances within that context share a common timeline. This can
be useful for mixing operations between real and virtual engines, or
when using a composite context for a single client API context.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190521211134.16117-4-chris@chris-wilson.co.uk

8319f44c

drm/i915: Extend I915_CONTEXT_PARAM_SSEU to support local ctx->engine[] · e620f7b3

Chris Wilson authored May 21, 2019



Allow the user to specify a local engine index (as opposed to
class:index) that they can use to refer to a preset engine inside the
ctx->engine[] array defined by an earlier I915_CONTEXT_PARAM_ENGINES.
This will be useful for setting SSEU parameters on virtual engines that
are local to the context and do not have a valid global class:instance
lookup.

Note that due to the ambiguity in using class:instance with
ctx->engines[], if a user supplied engine map is active the user must
specify the engine to alter by its index into the ctx->engines[].

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190521211134.16117-3-chris@chris-wilson.co.uk

e620f7b3

drm/i915: Allow a context to define its set of engines · 976b55f0

Chris Wilson authored May 21, 2019

Over the last few years, we have debated how to extend the user API to
support an increase in the number of engines, that may be sparse and
even be heterogeneous within a class (not all video decoders created
equal). We settled on using (class, instance) tuples to identify a
specific engine, with an API for the user to construct a map of engines
to capabilities. Into this picture, we then add a challenge of virtual
engines; one user engine that maps behind the scenes to any number of
physical engines. To keep it general, we want the user to have full
control over that mapping. To that end, we allow the user to constrain a
context to define the set of engines that it can access, order fully
controlled by the user via (class, instance). With such precise control
in context setup, we can continue to use the existing execbuf uABI of
specifying a single index; only now it doesn't automagically map onto
the engines, it uses the user defined engine map from the context.

v2: Fixup freeing of local on success of get_engines()
v3: Allow empty engines[]
v4: s/nengine/num_engines/
v5: Replace 64 limit on num_engines with a note that execbuf is
currently limited to only using the first 64 engines.
v6: Actually use the engines_mutex to guard the ctx->engines.

Testcase: igt/gem_ctx_engines
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190521211134.16117-2-chris@chris-wilson.co.uk

976b55f0

drm/i915: Restore control over ppgtt for context creation ABI · 7f3f317a

Chris Wilson authored May 21, 2019



Having hid the partially exposed new ABI from the PR, put it back again
for completion of context recovery. A significant part of context
recovery is the ability to reuse as much of the old context as is
feasible (to avoid expensive reconstruction). The biggest chunk kept
hidden at the moment is fine-control over the ctx->ppgtt (the GPU page
tables and associated translation tables and kernel maps), so make
control over the ctx->ppgtt explicit.

This allows userspace to create and share virtual memory address spaces
(within the limits of a single fd) between contexts they own, along with
the ability to query the contexts for the vm state.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190521211134.16117-1-chris@chris-wilson.co.uk

7f3f317a

May 20, 2019

drm/i915: Bump gen7+ fb size limits to 16kx16k · 5c000fb3

Ville Syrjälä authored May 09, 2019



With gtt remapping in place we can use arbitrarily large
framebuffers. Let's bump the limits to 16kx16k on gen7+.
The limit was chosen to match the maximum 2D surface size
of the 3D engine.

With the remapping we could easily go higher than that for the
display engine. However the modesetting ddx will blindly assume
it can handle whatever is reported via kms. The oversized
buffer dimensions are not caught by glamor nor Mesa until
finally an assert will trip when genxml attempts to pack the
SURFACE_STATE. So we pick a safe limit to avoid the X server
from crashing (or potentially misbehaving if the genxml asserts
are compiled out).

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110187


Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-9-ville.syrjala@linux.intel.com


Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

5c000fb3

drm/i915: Bump fb stride limit to 128KiB for gen4+ and 256KiB for gen7+ · 20330129

Ville Syrjälä authored May 09, 2019



With gtt remapping plugged in we can simply raise the stride
limit on gen4+. Let's just pick the limit to match the render
engine max stride (256KiB on gen7+, 128KiB on gen4+).

No remapping CCS because the virtual address of each page actually
matters due to the new hash mode
(WaCompressedResourceDisplayNewHashMode:skl,kbl etc.), and no
remapping on gen2/3 due extra complications from fence alignment
and gen2 2KiB GTT tile size. Also no real benefit since the
display engine limits already match the other limits.

v2: Rebase due to is_ccs_modifier()
v3: Tweak the comment and commit msg
v4: Fix gen4+ stride limit to be 128KiB

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> #v3
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-8-ville.syrjala@linux.intel.com


Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

20330129

drm/i915: Align dumb buffer stride to 4k to allow for gtt remapping · aa5ca8b7

Ville Syrjälä authored May 09, 2019



Align dumb buffer stride to 4k if the fb will be big enough to
require gtt remapping.

v2: Leave the stride alone for buffers that look to be for the cursor
v3: Make it not a hack (Daniel)

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-7-ville.syrjala@linux.intel.com


Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

aa5ca8b7

drm/i915: Overcome display engine stride limits via GTT remapping · 54d4d719

Ville Syrjälä authored May 09, 2019



The display engine stride limits are getting in our way. On SKL+
we are limited to 8k pixels, which is easily exceeded with three
4k displays. To overcome this limitation we can remap the pages
in the GTT to provide the display engine with a view of memory
with a smaller stride.

The code is mostly already there as We already play tricks with
the plane surface address and x/y offsets.

A few caveats apply:
* linear buffers need the fb stride to be page aligned, as
  otherwise the remapped lines wouldn't start at the same
  spot
* compressed buffers can't be remapped due to the new
  ccs hash mode causing the virtual address of the pages
  to affect the interpretation of the compressed data. IIRC
  the old hash was limited to the low 12 bits so if we were
  using that mode we could remap. As it stands we just refuse
  to remapp with compressed fbs.
* no remapping gen2/3 as we'd need a fence for the remapped
  vma, which we currently don't have. Need to deal with the
  fence POT requirements, and do something about the gen2
  gtt page size vs tile size difference

v2: Rebase due to is_ccs_modifier()
    Fix up the skl+ stride_mult mess
    memset() the gtt_view because otherwise we could leave
    junk in plane[1] when going from 2 plane to 1 plane format
v3: intel_check_plane_stride() was split out
v4: Drop the aligned viewport stuff, it was meant for ccs which
    can't be remapped anyway
v5: Introduce intel_plane_can_remap()
    Reorder the code so that plane_state->view gets filled
    even for invisible planes, otherwise we'd keep using
    stale values and could explode during remapping. The new
    logic never remaps invisible planes since we don't have
    a viewport, and instead pins the full fb instead
v6: Fix plane src coord checks after remapping by moving
    plane_state->base.src to the final plane x/y offsets.
    Allow intel_plane_check_stride() to fail even with
    remapping (can happen at least with a linear 64bpp
    fb with a 4k plane and a suitably inconvenient src
    coordinates).
    Improve aux plane FIXME (Daniel)
    Move some code shuffling into a separate patch (Daniel)

Testcase: igt/kms_big_fb
Cc: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-6-ville.syrjala@linux.intel.com


Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

54d4d719

drm/i915: Shuffle stride checking code around · a88c40eb

Ville Syrjälä authored May 09, 2019



Reorganize some fb stride checking code a bit to prepare for
gtt remapping. And do a bit of s/pitch/stride/ renaming in the
process for a bit more uniformity (apart from the whole
fb->pitches[] thing).

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-5-ville.syrjala@linux.intel.com


Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

a88c40eb

drm/i915/selftests: Add live vma selftest · bb211c3d

Ville Syrjälä authored May 09, 2019



Add a live selftest to excercise rotated/remapped vmas. We simply
write through the rotated/remapped vma, and confirm that the data
appears in the right page when read through the normal vma.

Not sure what the fallout of making all rotated/remapped vmas
mappable/fenceable would be, hence I just hacked it in the test.

v2: Grab rpm reference (Chris)
    GEM_BUG_ON(view.type not as expected) (Chris)
    Allow CAN_FENCE for rotated/remapped vmas (Chris)
    Update intel_plane_uses_fence() to ask for a fence
    only for normal vmas on gen4+
v3: Deal with intel_wakeref_t
v4: Rebase

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-4-ville.syrjala@linux.intel.com


Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

bb211c3d

drm/i915/selftests: Add mock selftest for remapped vmas · e2e394bf

Ville Syrjälä authored May 09, 2019



Extend the rotated vma mock selftest to cover remapped vmas as
well.

TODO: reindent the loops I guess? Left like this for now to
ease review

v2: Include the vma type in the error message (Chris)
v3: Deal with trimmed sg
v4: Drop leftover debugs

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-3-ville.syrjala@linux.intel.com


Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

e2e394bf

drm/i915: Add a new "remapped" gtt_view · 1a74fc0b

Ville Syrjälä authored May 09, 2019



To overcome display engine stride limits we'll want to remap the
pages in the GTT. To that end we need a new gtt_view type which
is just like the "rotated" type except not rotated.

v2: Use intel_remapped_plane_info base type
    s/unused/unused_mbz/ (Chris)
    Separate BUILD_BUG_ON()s (Chris)
    Use I915_GTT_PAGE_SIZE (Chris)
v3: Use i915_gem_object_get_dma_address() (Chris)
    Trim the sg (Tvrtko)
v4: Actually trim this time. Limit the max length
    to one row of pages to keep things simple

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509122159.24376-2-ville.syrjala@linux.intel.com


Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

1a74fc0b

May 17, 2019

drm/i915/execlists: Drop promotion on unsubmit · 4cc79cbb

Chris Wilson authored May 15, 2019



With the disappearance of NEWCLIENT, we no longer need to provide the
priority boost on preemption in order to prevent repeated gazumping,
and we can remove the dead code.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-5-chris@chris-wilson.co.uk

4cc79cbb

drm/i915: Downgrade NEWCLIENT to non-preemptive · 68fc728b

Chris Wilson authored May 15, 2019

Commit 1413b2bc ("drm/i915: Trim NEWCLIENT boosting") had the
intended consequence of not allowing a sequence of work that merely
crossed into a new engine the privilege to be promoted to NEWCLIENT
status. It also had the unintended consequence of actually making
NEWCLIENT effective on heavily oversubscribed transcode machines and
impacting upon their throughput.

If we consider a client packet composed of (rcsA, rcsB, vcs) and 30 of
those clients, using the NEWCLIENT boost that will be scheduled as

	rcsA x 30, (rcsB, vcs) x 30

where as before it would have been

	(rcsA, rcsB, vcs) x 30

That is with NEWCLIENT only boosting the first request of each client,
we would execute all rcsA requests prior to running on the vcs engines;
acruing a lot of dead time as compared to the previous case where the
vcs engine would be started in parallel to processing the second client.

The previous patch has the effect of delaying submission until it is
required by a third party (either the user with an explicit wait, or by
another client/engine). We reduce the NEWCLIENT bump to a mere WAIT,
which has the effect of removing its preemptive grant and reducing it to
the same level as any other user interaction -- that it will not be
promoted above the interengine dependencies, and so preventing NEWCLIENTS
from starving other engines. This a large nerf to the rrul properties of
the current NEWCLIENT, but it still does give prioritised submission to
new requests from light workloads.

References: b16c7651 ("drm/i915: Priority boost for new clients")
Fixes: 1413b2bc

 ("drm/i915: Trim NEWCLIENT boosting") # customer impact
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-4-chris@chris-wilson.co.uk

68fc728b

drm/i915: Bump signaler priority on adding a waiter · 6e7eb7a8

Chris Wilson authored May 15, 2019

The handling of the no-preemption priority level imposes the restriction
that we need to maintain the implied ordering even though preemption is
disabled. Otherwise we may end up with an AB-BA deadlock across multiple
engine due to a real preemption event reordering the no-preemption
WAITs. To resolve this issue we currently promote all requests to WAIT
on unsubmission, however this interferes with the timeslicing
requirement that we do not apply any implicit promotion that will defeat
the round-robin timeslice list. (If we automatically promote the active
request it will go back to the head of the queue and not the tail!)

So we need implicit promotion to prevent reordering around semaphores
where we are not allowed to preempt, and we must avoid implicit
promotion on unsubmission. So instead of at unsubmit, if we apply that
implicit promotion on adding the dependency, we avoid the semaphore
deadlock and we also reduce the gains made by the promotion for user
space waiting. Furthermore, by keeping the earlier dependencies at a
higher level, we reduce the search space for timeslicing without
altering runtime scheduling too badly (no dependencies at all will be
assigned a higher priority for rrul).

v2: Limit the bump to external edges (as originally intended) i.e.
between contexts and out to the user.

Testcase: igt/gem_concurrent_blit
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-3-chris@chris-wilson.co.uk

6e7eb7a8

drm/i915/hdcp: Use both bits for device_count · af461ff3

Chris Wilson authored May 17, 2019

Smatch spotted:
drivers/gpu/drm/i915//intel_hdcp.c:1406 hdcp2_authenticate_repeater_topology() warn: should this be a bitwise op?

and indeed looks to be suspect that we do need to use a bitwise or to
combine the two register fields into one counter.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ramalingam C <ramalingam.c@intel.com>
Reviewed-by: Ramalingam C <ramalingam.c@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190517102225.3069-3-chris@chris-wilson.co.uk

af461ff3

drm/i915/dp: Initialise locals for static analysis · 96ac0813

Chris Wilson authored May 17, 2019

Just to squelch an smatch warning that doesn't see the with_() being
taken unconditionally:
drivers/gpu/drm/i915//intel_dp.c:230 intel_dp_get_fia_supported_lane_count() error: uninitialized symbol 'lane_info'.
drivers/gpu/drm/i915//intel_dp.c:5338 intel_digital_port_connected() error: uninitialized symbol 'is_connected'.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Imre Deak <imre.deak@intel.com>
Reviewed-by: Imre Deak <imre.deak@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190517102225.3069-2-chris@chris-wilson.co.uk

96ac0813

drm/i915: Truly bump ready tasks ahead of busywaits · 17db337f

Chris Wilson authored May 15, 2019

In commit b7404c7e ("drm/i915: Bump ready tasks ahead of
busywaits"), I tried cutting a corner in order to not install a signal
for each of our dependencies, and only listened to requests on which we
were intending to busywait. The compromise that was made was that
instead of then being able to promote the request with a full
NOSEMAPHORE like its non-busywaiting brethren, as we had not ensured we
had cleared the semaphore chain, we settled for only using the NEWCLIENT
boost. With an over saturated system with multiple NEWCLIENTS in flight
at any time, this was found to be an inadequate promotion and left us
with a much poorer scheduling order than prior to using semaphores.

The outcome of this patch, is that all requests have NOSEMAPHORE
priority when they have no dependencies and are ready to run and not
busywait, restoring the pre-semaphore ordering on saturated systems.

We can demonstrate the effect of poor scheduling order by oversaturating
the system using gem_wsim on a system with multiple vcs engines
(i.e running the same workloads across more clients than required for
peak throughput, e.g. media_load_balance_17i7.wsim -c4 -b context):

x v5.1 (normalized)
+ tip
* fix
+------------------------------------------------------------------------+
|                                                                    x   |
|                                                                    x   |
|                                                                    x   |
|                                                                    x   |
|                                                                   %x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %%x   |
|                                                                  %#x   |
|                                                                  %#x   |
|                                                                  %#x   |
|                                                                  %#x   |
|                                                                  %#x   |
|         +                                                        %#xx  |
|         +                                                        %#xx  |
|         +                                                       %%#xx  |
|         +                                                       %%#xx  |
|         +                                                       %%#xx  |
|         +                                                       %%#xx  |
|         +                                                       %%##x  |
|         +++                                                     %%##x  |
|         +++                                                     %%##x  |
|         +++                                                     %%##x  |
|        ++++                                                     %%##x  |
|        ++++                                                     %%##x  |
|        ++++                                                     %%##xx |
|        ++++                                                     %###xx |
|        ++++                                                     %###xx |
|        ++++                                                     %###xx |
|        ++++                                                     %###xx |
|        ++++ +                                                   %#O#xx |
|        ++++ +                                                   %#O#xx |
|        ++++++ +                                                 %#O#xx |
|       ++++++++++                                                %OOOxxx|
|       ++++++++++       +                                       %#OOO#xx|
|     + ++++++++++++ ++ +++++    +                        ++    @@OOOO#xx|
|                                                                   |A_| |
||__________M_______A____________________|                               |
|                                                                 |A_|   |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 120       0.99456       1.00628      0.999985     1.0001545  0.0024387139
+ 120      0.873021       1.00037      0.884134    0.90148752   0.039190862
Difference at 99.5% confidence
	-0.098667 +/- 0.0110762
	-9.86517% +/- 1.10745%
	(Student's t, pooled s = 0.0277657)
% 120      0.990207       1.00165     0.9970265    0.99699748     0.0021024
Difference at 99.5% confidence
	-0.003157 +/- 0.000908245
	-0.315651% +/- 0.0908105%
	(Student's t, pooled s = 0.00227678)

Fixes: b7404c7e

 ("drm/i915: Bump ready tasks ahead of busywaits")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-2-chris@chris-wilson.co.uk

17db337f

drm/i915: Mark semaphores as complete on unsubmit out if payload was started · dba5a7f3

Chris Wilson authored May 15, 2019

Avoid charging us for the presumed busywait if the request was preempted
after successfully using semaphores to reduce inter-engine latency.

v2: Bump the priority to reflect the lack of semaphores now required.

References: ca6e56f6

 ("drm/i915: Disable semaphore busywaits on saturated systems")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-1-chris@chris-wilson.co.uk

dba5a7f3

May 14, 2019

drm/i915: Assert that TypeC ports are not used for eDP · 4e309baf

Imre Deak authored May 09, 2019

Add an assert that we don't use TypeC ports for eDP. That may in theory
be possible on TypeC legacy ports, but I'm not sure if that's a
practical scenario, so let's deal with that only if there's a use case.
Adding support for that wouldn't be too difficult, since TypeC mode
switching is not possible on TypeC legacy ports.

Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-12-imre.deak@intel.com

4e309baf

drm/i915: Avoid taking the PPS lock for non-eDP/VLV/CHV · b4c7ea63

Imre Deak authored May 09, 2019



On ICL we have to make sure that we enable the AUX power domain in a
controlled way (corresponding to the port's actual TypeC mode). Since
the PPS lock - which takes an AUX power ref - is only needed on
eDP on all platforms and eDP/DP on VLV/CHV avoid taking it in all
other cases.

v2:
- Clarify commit log about the condition for taking the PPS lock.
  (Ville)

Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-11-imre.deak@intel.com

b4c7ea63

drm/i915: Replace use of PLLS power domain with DISPLAY_CORE domain · 08d8e170

Imre Deak authored May 09, 2019



There isn't a separate power domain specific to PLLs. When programming
them we require the same power domain to be enabled which is needed when
accessing other display core parts (not specific to any
pipe/port/transcoder). This corresponds to the DISPLAY_CORE domain added
previously in this patchset, so use that instead to save bits in the
power domain mask.

Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-10-imre.deak@intel.com

08d8e170

drm/i915: Remove the unneeded AUX power ref from intel_dp_hpd_pulse() · 6f08ebe7

Imre Deak authored May 09, 2019

The power get/put was added in

commit 1c767b33 ("drm/i915: take display port power domain in DP HPD handler")
Author: Imre Deak <imre.deak@intel.com>
Date:   Mon Aug 18 14:42:42 2014 +0300

to account for the HW access in ibx_digital_port_connected(). This
latter call was in turn removed in

commit 7d23e3c3

 ("drm/i915: Cleaning up intel_dp_hpd_pulse")
Author: Shubhangi Shrivastava <shubhangi.shrivastava@intel.com>
Date:   Wed Mar 30 18:05:23 2016 +0530

after which we didn't actually need the power reference.

One way we are accessing the HW during HPD pulse handling is via DP AUX
transfers, but the transfer function takes its own reference, so doesn't
need the reference in intel_dp_hpd_pulse().

The other spot is in

	intel_psr_short_pulse()->intel_psr_disable_locked()

but that can only happen when the panel is enabled with the
corresponding modeset already holding the required power reference.

v2:
- Remove the unneeded power get/put from intel_psr_disable_locked().
  (Ville)
- Checkpatch commit quoting format fix in the commit log.

Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-9-imre.deak@intel.com

6f08ebe7

drm/i915: Remove the unneeded AUX power ref from intel_dp_detect() · 6cfe7ec0

Imre Deak authored May 09, 2019



We don't need the AUX power for the whole duration of the detect, only
when we're doing AUX transfers. The AUX transfer function takes its own
reference on the AUX power domain already. The two places during detect
which access display core registers (not specific to a
pipe/port/transcoder) only need the power domain that is required for
that access. That power domain is equivalent to the device global power
domain on most platforms (enabled whenever we hold a runtime PM
reference) except on CHV/VLV where it's equivalent to the display power
well.

Add a new power domain that reflects the above, and use this at the two
spots accessing registers. With that we can avoid taking the AUX
reference for the whole duration of the detect function.

Put the domains asynchronously to avoid the unneeded on-off-on toggling.

Also adapt the idea from with_intel_runtime_pm et al. for making it easy
to write short sequences where a display power ref is needed.

v2: (Ville)
- Add with_intel_display_power() helper to simplify things.
- s/bool res/bool is_connected/

Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-8-imre.deak@intel.com

6cfe7ec0

drm/i915: WARN for eDP encoders in intel_dp_detect_dpcd() · ad5125d6

Imre Deak authored May 09, 2019



We are not calling this function for eDP, so add an early assert about
this for clarity.

Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-7-imre.deak@intel.com

ad5125d6

drm/i915: Disable power asynchronously during DP AUX transfers · f39194a7

Imre Deak authored May 09, 2019

In a follow-up patch we will restrict holding the reference on the AUX
power domain to the AUX transfer function. To avoid the unnecessary
on-off-on power togglings drop the reference asynchronously.

There is no reason we couldn't do this in general and also put the
reference asynchronously in pps_unlock(); but that's a separate change
that can be done as a follow-up.

f39194a7

drm/i915: Add support for asynchronous display power disabling · e0da2d63

Imre Deak authored May 13, 2019



By disabling a power domain asynchronously we can restrict holding a
reference on that power domain to the actual code sequence that
requires the power to be on for the HW access it's doing, by also
avoiding unneeded on-off-on togglings of the power domain (since the
disabling happens with a delay).

One benefit is potential power saving due to the following two reasons:
1. The fact that we will now be holding the reference only for the
   necessary duration by the end of the patchset. While simply not
   delaying the disabling has the same benefit, it has the problem that
   frequent on-off-on power switching has its own power cost (see the 2.
   point below) and the debug trace for power well on/off events will
   cause a lot of dmesg spam (see details about this further below).
2. Avoiding the power cost of freuqent on-off-on power switching. This
   requires us to find the optimal disabling delay based on the measured
   power cost of on->off and off->on switching of each power well vs.
   the power of keeping the given power well on.

   In this patchset I'm not providing this optimal delay for two
   reasons:
   a) I don't have the means yet to perform the measurement (with high
      enough signal-to-noise ratio, or with the help of an energy
      counter that takes switching into account). I'm currently looking
      for a way to measure this.

   b) Before reducing the disabling delay we need an alternative way for
      debug tracing powerwell on/off events. Simply avoiding/throttling
      the debug messages is not a solution, see further below.

   Note that even in the case where we can't measure any considerable
   power cost of frequent on-off switching of powerwells, it still would
   make sense to do the disabling asynchronously (with 0 delay) to avoid
   blocking on the disabling. On VLV I measured this disabling time
   overhead to be 1ms on average with a worst case of 4ms.

In the case of the AUX power domains on ICL we would also need to keep
the sequence where we hold the power reference short, the way it would
be by the end of this patchset where we hold it only for the actual AUX
transfer. Anything else would make the locking we need for ICL TypeC
ports (whenever we hold a reference on any AUX power domain) rather
problematic, adding for instance unnecessary lockdep dependencies to
the required TypeC port lock.

I chose the disabling delay to be 100msec for now to avoid the unneeded
toggling (and so not to introduce dmesg spamming) in the DP MST sideband
signaling code. We could optimize this delay later, once we have the
means to measure the switching power cost (see above).

Note that simply removing/throttling the debug tracing for power well
on/off events is not a solution. We need to know the exact spots of
these events and cannot rely only on incorrect register accesses caught
(due to not holding a wakeref at the time of access). Incorrect
powerwell enabling/disabling could lead to other problems, for instance
we need to keep certain powerwells enabled for the duration of modesets
and AUX transfers.

v2:
- Clarify the commit log parts about power cost measurement and the
  problem of simply removing/throttling debug tracing. (Chris)
- Optimize out local wakeref vars at intel_runtime_pm_put_raw() and
  intel_display_power_put_async() call sites if
  CONFIG_DRM_I915_DEBUG_RUNTIME_PM=n. (Chris)
- Rebased on v2 of the wakeref w/o power-on guarantee patch.
- Add missing docbook headers.
v3:
- Checkpatch spelling/missing-empty-line fix.
v4:
- Fix unintended local wakeref var optimization when using
  call-arguments with side-effects, by using inline funcs instead of
  macros. In this patch in particular this will fix the
  intel_display_power_grab_async_put_ref()->intel_runtime_pm_put_raw()
  call).

  No size change in practice (would be the same disregarding the
  corresponding change in intel_display_power_grab_async_put_ref()):
  $ size i915-macro.ko
     text	   data	    bss	    dec	    hex	filename
  2455190	 105890	  10272	2571352	 273c58	i915-macro.ko
  $ size i915-inline.ko
     text	   data	    bss	    dec	    hex	filename
  2455195	 105890	  10272	2571357	 273c5d	i915-inline.ko

  Kudos to Stan for reporting the raw-wakeref WARNs this issue caused. His
  config has CONFIG_DRM_I915_DEBUG_RUNTIME_PM=n, which I didn't retest
  after v1, and we are also not testing this config in CI.

  Now tested both with CONFIG_DRM_I915_DEBUG_RUNTIME_PM=y/n on ICL,
  connecting both Chamelium and regular DP, HDMI sinks.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190513192533.12586-1-imre.deak@intel.com

e0da2d63

drm/i915: Verify power domains state during suspend in all cases · ee70080a

Imre Deak authored May 09, 2019



There is no reason why we couldn't verify the power domains state during
suspend in all cases, so do that. I overlooked this when originally
adding the check.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-4-imre.deak@intel.com

ee70080a

drm/i915: Force printing wakeref tacking during pm_cleanup · dbf99c1f

Imre Deak authored May 09, 2019

Make sure we print and drop the wakeref tracking info during pm_cleanup
even if there are wakeref holders (either raw-wakeref or wakelock
holders). Dropping the wakeref tracking means that a late put on the ref
will WARN since the wakeref will be unknown, but that is rightly so,
since the put is late and we want to catch that case.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-3-imre.deak@intel.com

dbf99c1f

drm/i915: Add support for tracking wakerefs w/o power-on guarantee · 4547c255

Imre Deak authored May 09, 2019



It's useful to track runtime PM refs that don't guarantee a device
power-on state to the rest of the driver. One such case is holding a
reference that will be put asynchronously, during which normal users
without their own reference shouldn't access the HW. A follow-up patch
will add support for disabling display power domains asynchronously
which needs this.

For this we can split wakeref_count into a low half-word tracking
all references (raw-wakerefs) and a high half-word tracking
references guaranteeing a power-on state (wakelocks).

Follow-up patches will make use of the API added here.

While at it add the missing docbook header for the unchecked
display-power and runtime_pm put functions.

No functional changes, except for printing leaked raw-wakerefs
and wakelocks separately in intel_runtime_pm_cleanup().

v2:
- Track raw wakerefs/wakelocks in the low/high half-word of
  wakeref_count, instead of adding a new counter. (Chris)
v3:
- Add a struct_member(T, m) helper instead of open-coding it. (Chris)
- Checkpatch indentation formatting fix.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190509173446.31095-2-imre.deak@intel.com

4547c255

drm/i915/icl: More workaround for port F detection due to broken VBTs · 1aa37508

Imre Deak authored May 10, 2019

Add another ICL-Y PCIID that proved to have only 5 ports to the
corresponding PCIID list.

Meanwhile I'm trying to get a complete list of all PCIIDs with less than
6 ports and/or get a VBT fix to mark these ports non-existent, but until
then the only way is to go one-by-one.

This fixes the following error on machines with less than 6 port:

	[drm:intel_power_well_enable [i915]] enabling AUX F
	------------[ cut here ]------------
	WARN_ON(intel_wait_for_register(&dev_priv->uncore, regs->driver, (0x1 << ((pw_idx) * 2)), (0x1 << ((pw_idx) * 2)), 1))

(Internal reference: BSpec/Index/20584/Issues, HSD/1306084116)

Cc: Mika Kahola <mika.kahola@intel.com>
References: https://bugs.freedesktop.org/show_bug.cgi?id=108915


Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Mika Kahola <mika.kahola@intel.com>
Tested-by: Mika Kahola <mika.kahola@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190510140255.25215-1-imre.deak@intel.com

1aa37508

drm/i915: Fix pixel clock and crtc clock config mismatch · e3c54da0

Vandita Kulkarni authored May 02, 2019



In case of dual link mode, the mode clock that we get
from the VBT is halved.

v2: Simplify the calculation (Jani).

Signed-off-by: Vandita Kulkarni <vandita.kulkarni@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1556809862-31203-4-git-send-email-vandita.kulkarni@intel.com

e3c54da0

drm/i915: Fix pipe config mismatch for bpp, output format · 30bd7efd

Vandita Kulkarni authored May 02, 2019

Read back the pixel fomrat register and get the bpp.

v2: Read the PIPE_MISC register (Jani).

Signed-off-by: Vandita Kulkarni <vandita.kulkarni@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1556809862-31203-3-git-send-email-vandita.kulkarni@intel.com

30bd7efd

drm/i915: Refactor bdw_get_pipemisc_bpp · 8ae89743

Vandita Kulkarni authored May 02, 2019

Move bdw_get_pipemisc_bpp alongside bdw_set_pipemisc

8ae89743

drm/i915: Fix the pipe state timing mismatch warnings · 3c23ed13

Vandita Kulkarni authored May 02, 2019



Adjust the get transcoder timings for mipi dsi as per the
set timing calculations.

v2: Use the existing intel_get_pipe_timings and do the dsi
    specific adjustments in the encoder get_config hook.(Ville, Jani)

v3: Exclude VBLANK and HBLANK registers for dsi transcoder.

v4: Fix the incomplete conditional logic.

Signed-off-by: Vandita Kulkarni <vandita.kulkarni@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1556809862-31203-1-git-send-email-vandita.kulkarni@intel.com

3c23ed13

drm/hdcp: drm_hdcp_request_srm() as static · cdd07596

Ramalingam C authored May 13, 2019



Below Sparsh warnings are fixed.

Commit: drm: revocation check at drm subsystem
+drivers/gpu/drm/drm_hdcp.c:235:6: warning: symbol
'drm_hdcp_request_srm' was not declared. Should it be static?
+drivers/gpu/drm/drm_hdcp.c:27:3: warning: symbol 'srm_data' was not
declared. Should it be static?
+drivers/gpu/drm/drm_hdcp.c:317:5: warning: symbol 'drm_setup_hdcp_srm'
was not declared. Should it be static?
+drivers/gpu/drm/drm_hdcp.c:327:6: warning: symbol
'drm_teardown_hdcp_srm' was not declared. Should it be static?

cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
Acked-by: Dave Airlie <airlied@gmail.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20190513133504.18612-1-ramalingam.c@intel.com

cdd07596

May 13, 2019

drm/i915: Check for no-op priority changes first · 19098018

Chris Wilson authored May 13, 2019



In all likelihood, the priority and node are already in the CPU cache
and by checking them first, we can avoid having to chase the
*request->hwsp for the current breadcrumb.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190513120102.29660-3-chris@chris-wilson.co.uk

19098018

drm/i915: Pass i915_sched_node around internally · 52c76fb1

Chris Wilson authored May 13, 2019



To simplify the next patch, update bump_priority and schedule to accept
the internal i915_sched_ndoe directly and not expect a request pointer.

add/remove: 0/0 grow/shrink: 2/1 up/down: 8/-15 (-7)
Function                                     old     new   delta
i915_schedule_bump_priority                  109     113      +4
i915_schedule                                 50      54      +4
__i915_schedule                              922     907     -15

v2: Adopt node for the old rq local, since it no longer is a request but
the origin node.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190513120102.29660-2-chris@chris-wilson.co.uk

52c76fb1