Unlike the Gen11 code, this requires us to allocate a pipe_resource
for the pixel pipe hashing tables and hold a reference to it from the
context, since we need to add it to the validation list of every
batch, the tables may be accessed by the hardware at any time after
they're specified via 3DSTATE_SLICE_TABLE_STATE_POINTERS.
Note that this has an effect even for unfused native die platforms,
since the pixel pipe hashing tables we intend to program aren't
equivalent to the hardware's defaults on such configs.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13569>
INTEL_DEBUG is defined (since 4015e1876a) as:
#define INTEL_DEBUG __builtin_expect(intel_debug, 0)
which unfortunately chops off upper 32 bits from intel_debug
on platforms where sizeof(long) != sizeof(uint64_t) because
__builtin_expect is defined only for the long type.
Fix this by changing the definition of INTEL_DEBUG to be function-like
macro with "flags" argument. New definition returns 0 or 1 when
any of the flags match.
Most of the changes in this commit were generated using:
for c in `git grep INTEL_DEBUG | grep "&" | grep -v i915 | awk -F: '{print $1}' | sort | uniq`; do
perl -pi -e "s/INTEL_DEBUG & ([A-Z0-9a-z_]+)/INTEL_DBG(\1)/" $c
perl -pi -e "s/INTEL_DEBUG & (\([A-Z0-9_ |]+\))/INTEL_DBG\1/" $c
done
but it didn't handle all cases and required minor cleanups (like removal
of round brackets which were not needed anymore).
Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13334>
We don't want to export suballocated resources to external consumers,
for a variety of reasons. First of all, it would be exporting random
other pieces of memory which we may not want those external consumers
to have access to. Secondly, external clients wouldn't be aware of
what buffers are packed together and busy-tracking implications there.
Nor should they be. And those are just the obvious reasons.
When we allocate a resource with the PIPE_BIND_SHARED flag, indicating
that it's going to be used externally, we avoid suballocation.
However, there are times when the client may suddenly decide to export
a texture or buffer, without any prior warning. Since we had no idea
this buffer would be exported, we suballocated it. Unfortunately, this
means we need to transition it to a dedicated allocation on the fly, by
allocating a new buffer and copying the contents over.
Making things worse, this often happens in DRI hooks that don't have an
associated context (which we need to say, run BLORP commands). We have
to create an temporary context for this purpose, perform our blit, then
destroy it. The radeonsi driver uses a permanent auxiliary context
stored in the screen for this purpose, but we can't do that because it
causes circular reference counting. radeonsi doesn't do the reference
counting that we do, but also doesn't use u_transfer_helper, so they
get lucky in avoiding stale resource->screen pointers. Other drivers
don't create an auxiliary context, so they avoid this problem for now.
For auxiliary data, rather than copying it over bit-for-bit, we simply
copy over the underlying data using iris_copy_region (GPU memcpy), and
take whatever the resulting aux state is from that operation. Assuming
the copy operation compresses, the result will be compressed.
v2: Stop using a screen->aux_context and just invent one on the fly to
avoid circular reference counting issues.
Acked-by: Paulo Zanoni <paulo.r.zanoni@intel.com> [v1]
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12623>
Use the flag that was set by nir_lower_passthrough_edgeflags. The
lowering passes will soon be moved to a finalize_nir hook, so there
won't be any choice. Ideally we'd like to eliminate iris_fix_edge_flags
completely, and this is a first step.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12858>
...instead of looking for "ARB" in the name of the shader. This matches
the behavior of i965. Using "ARB" was added in a1ebac3750 ("iris:
Implement ALT mode for ARB_{vertex,fragment}_shader"), but there's no
explanation of why that method was used.
v2: Just use shader_info::is_arb_asm everywhere instead of
iris_uncompiled_shader::use_alt_mode. Suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12858>
This is the last step before we can start removing the history flush
mechanism: In cases where a dirtied buffer has the potential to be
concurrently bound to the pipeline (as indicated by the bind_history
mask), flag the "flush" dirty bits corresponding to its binding point.
This ensures that the buffer-local memory barriers introduced earlier
in this series are executed before the next draw call, which in turn
will emit any necessary PIPE_CONTROLs in cases where the buffer is
bound through a cache incoherent with the cache that performed the
write.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12691>
There are a couple minor things that can be improved:
1. Eliminate (or reduce) the dynamic allocation of the
threaded_compile_job.
2. For apps like shader-db, improve the case where nr_threads=0. Right
now this adds thread switching and mutex overhead.
3. Other performance improvements? iris_uncompiled_shader::variants has
some special properties that make it ripe for replacement with a
lockless list. Without gathering some data, it's hard to guess what
impact that could have.
v2: Fix whitespace and formatting issues. Noticed by Ken.
s/threaded_compile_job/iris_threaded_compile_job/g. Suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11229>
I tried to find a way to break this into some smaller commits, but
everything is very intertwined. :(
When searching the variants list in the iris_uncompiled_shader, add the
new variant if it is not found. This will be necessary for threaded
shader compilation. This conceptually simple change had a bunch of
fallout.
Much of this was at least conceptually borrowed from radeonsi.
- Other threads might find a variant in the list before the variant has
been compiled. To accomdate this, add a fence. Each thread will wait
on the fence in the variant when searching the list.
- A variant in the list may fail compilation. To accomodate this, add a
flag. All paths will examine iris_compiled_shader::compilation_failed
before trying to use the variant.
- The race condition between multiple threads trying to create the same
variant at the same time is handled *before* both thread spend the
effort to compile the shader. The means that iris_upload_shader
cannot change shaders on the caller, so it does not need to return
anything.
v2: Change "found" parameter of find_or_add_variant to "added." This
inverts the values returned, and it probably makes uses of the returned
value more easily understood. Always set the value in the called
function. Suggested by Ken.
v3: Move shader->compilation_failed check to avoid shader != NULL test.
Rearrange some logic and add a comment in iris_update_compiled_tcs.
Suggested by Ken. Don't call find_or_add_variant in
iris_create_shader_state. See
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11229#note_1000843
for more details. Noticed by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11229>
The added assertion in iris_create_shader_variant helped catch a bug in
the next commit.
v2: Drop (unnecessary) initialization of shader->assembly.res when
moving to iris_create_shader_variant. Suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11229>
I tried /just/ ref counting the uncompiled shaders, but that is not
sufficient. At the very least, it's a problem for blorp shaders that
only have variants (and no uncompiled shader).
This is in prepartion for using the live shader cache.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11229>
+ 9.31% drawover:gdrv0 iris_dri.so [.] iris_binder_reserve_3d
+ 2.36% drawover:gdrv0 iris_dri.so [.] iris_binder_reserve_3d
If the app never uses compute, then the compute bindings bit will always
be dirty causing these two paths never get shortcuts.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11699>
Rework:
* Jordan: Handle prog_data->total_scratch==0 in iris_upload_compute_walker
* Jordan: Resolve iris_get_scratch_space conflict with e2c5ef6cd6
* Jordan: Rebase on 4256f7ed58. broken
* Ken: Mostly fixed the rebase
* Jordan: Fix two small compilation issues
* Jordan: Rebase on Ken's ("iris: Make a pin_scratch_space() helper")
* Lionel: Fix a few bugs with scratch handles
* Jason: Tidy the patch up a bit
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11582>
the only case in which this is nonzero is if a multidraw gets split by the frontend,
i.e., mesa core, and in all other cases it can be ignored. the value can also be ignored
for all indirect draws, though it seems many (most?) gallium drivers are not aware of this
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10166>
We are flushing tile cache more often than is necessary. In
unified cache mode, tile cache flushing is expensive, evicting all
depth/pixel data from the L3$. This is only need for a handful of
cases, such as: making cpu or gpu changes globally visible
(e.g. map), fast color clears, or slow depth clears. Tile cache
flushing is a gen12+ feature.
Remove blanket flushing of tile cache on all depth/RT flushes.
Replace with selective tile cache flushing.
Improves performance in several workloads:
AztecRuins.ogl-high-offscreen-1440p 1%
UnigineValley.ogl-g2 1%
Dota 2 (replay Jul 2020).ogl-g2 1%
Counter-Strike GO.ogl-g2 1%
Manhattan.ogl-Off-19x10 2%
CarChase.ogl-Off-19x10 1%
Bioshock Infinite.ogl-g2 1%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10217>
This implements most of the remaining u_threaded_context support. Most
of the heavy lifting was done in the previous patches which fixed things
up for the new thread safety requirements. Only a few things remain.
u_threaded_context support can be disabled via an environment variable:
GALLIUM_THREAD=0
On Felix's Tigerlake with the GPU at fixed frequency, enabling
u_threaded_context improves performance of several games:
- Civilization VI: +17%
- Shadow of Mordor: +6%
- Bioshock Infinite +6%
- Xonotic: +6%
Various microbenchmarks improve substantially as well:
- GfxBench5 gl_driver2: +58%
- SynMark2 OglBatch6: +54%
- Piglit drawoverhead: +25%
Reviewed-by: Zoltán Böszörményi <zboszor@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8964>
pipe->transfer_map can be called from u_threaded_context's thread
rather than the driver thread. We need to use two different slab
allocators, one for each thread. transfer_unmap, on the other hand,
is only ever called from the driver thread.
Reviewed-by: Zoltán Böszörményi <zboszor@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8964>
u_threaded_context requires various objects to inherit from a new
threaded_foo base class rather than directly from pipe_foo. This
patch does most of the mechanical changes required for that.
It also initializes the new threaded_resource fields.
Reviewed-by: Zoltán Böszörményi <zboszor@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8964>
When we enable u_threaded_context, the pipe->create_*_state hooks
(precompile variants) are going to be called from one thread, while
iris_update_compiled_shaders (on-the-fly variants) are going to be
called from a driver thread. BLORP shaders also happen from
clear, blit, and so on in the driver thread.
u_upload_mgr isn't thread-safe, so use an uploader for each purpose.
Reviewed-by: Zoltán Böszörményi <zboszor@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8964>
The previous mechanism was a bit fragile. We stored the zero offset
in the pre-baked packet, and used an flag to override 0xFFFFFFFF
(append) offsets until our first emit - then prohibited anyone from
trying to re-emit the packet by flagging IRIS_DIRTY_SO_BUFFERS,
because that would re-emit the version with the zeroing of the offset.
Now, we always store 0xFFFFFFFF in the pre-baked packet, and use a
flag to override it to zero on the first emit. That way, we can
re-emit that packet at any time, and it'll just keep appending.
Reviewed-by: Zoltán Böszörményi <zboszor@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8964>
We call update_last_vue_map after updating the shaders, which compares
the new and old VUE maps. Except...updating the shaders may have
dropped the last reference to the variant that ice->shaders.last_vue_map
belonged to, leading to a classic use-after-free.
Fix this by taking a reference to the variant for the last VUE stage,
so it stays around until we're done with it.
Fixes: 1afed51445 ("iris: Store a list of shader variants in the shader itself")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/4311
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9143>
Prepare/finish a framebuffer's depth buffer with the aux usage that's
appropriate for the given miplevel instead of wrongly assuming that
compression is always enabled. Enables code simplifications later on.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8853>
We've traditionally stored shader variants in a per-context hash table,
based on a key with many per-stage fields. On older hardware supported
by i965, there were potentially quite a few variants, as many features
had to be emulated in shaders, including things like texture swizzling.
However, on the modern hardware targeted by iris, our NOS dependencies
are much smaller. We almost always guess the correct state when doing
the initial precompile, and so we have maybe 1-3 variants. iris NOS
keys are also dramatically smaller (4 to 24 bytes) than i965's.
Unlike the classic world, Gallium also provides a single kind of object
for API shaders---pipe_shader_state aka iris_uncompiled_shader. We can
simply store a list of shader variants there. This makes it possible
to access shader variants across contexts, rather than compiling them
separately for each context, which better matches how the APIs work.
To look up variants, we simply walk the list and memcmp the keys.
Since the list is almost always singular (and rarely ever long),
and the keys are tiny, this should be quite low overhead.
We continue storing internally generated shaders for BLORP and
passthrough TCS in the per-context hash table, as they don't have
an associated pipe_shader_state / iris_uncompiled_shader object.
(There can also be many BLORP shaders, and the blit keys are large,
so having a hash table rather than a list makes sense there.)
Because iris_uncompiled_shaders are shared across multiple contexts,
we do require locking when accessing this list. Fortunately, this
is a per-shader lock, rather than a global one. Additionally, since
we only append variants to the list, and generate the first one at
precompile time (while only one context has the uncompiled shader),
we can assume that it is safe to access that first entry without
locking the list. This means that we only have to lock when we
have multiple variants, which is relatively uncommon.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7668>
There is a small gap of time where the currently bound uncompiled
shaders, and compiled shader variant, are out of sync. Specifically,
between pipe->bind_*_state() and the next draw.
Currently, shaders variants live entirely within a single context,
and when deleting an iris_uncompiled_shader, we check if any of its
variants are currently bound, and defer deleting those until the next
iris_update_compiled_shaders() hook runs and binds new shaders to
replace them. (This is due to the time gap between binding new
uncompiled shaders, and updating variants at draw time when we have
the required NOS in place.)
This works pretty well in a single context world. But as we move to
share compiled shader variants across multiple contexts, it breaks down.
When deleting a shader, we can't look at all contexts to see if its
variants are bound anywhere. We can't even quantify whether those
contexts will run a future draw any time soon, to update and unbind.
One fairly crazy solution would be to delete the variants anyway, and
leave the stale pointers to dead variants in place. This requires
removing any code that compares old and new variants. Today, we do
that sometimes for seeing if the old/new shaders toggled some feature.
Worse than that, though, we don't just have to avoid dereferences, we'd
have to avoid pointer comparisons. If we free a variant, and quickly
allocate a new variant, malloc may return the same pointer. If it's
for the same shader stage, we may get a new different program that has
the same pointer as a previously bound stale one, causing us to think
nothing had changed when we really needed to do updates. Again, this
is doable, but leaves the code fragile - we'd have to guard against
future patches adding such checks back in.
So, don't do that. Instead, do basic reference counting. When a
variant is bound in a context, up the reference. When it's unbound,
decrement it. When it hits zero, we know it's not bound anywhere and
is safe to delete, with no stale references. This ends up being
reasonably cheap anyway, since the atomic is usually uncontested.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7668>
Reconfiguring the URB partitioning is likely to cause shader stalls,
as the dividing line between each stage's section of memory is moving.
(Technically, 3DSTATE_URB_* are pipelined commands, but that mostly
means that the command streamer doesn't need to stall.) So it should
be beneficial to update the URB configuration less often.
If the previous URB configuration already has enough space for our
current shader's needs, we can just continue using it, assuming we
are able to allocate the maximum number of URB entries per stage.
However, if we ran out of URB space and had to limit the number of
URB entrties for a stage, and the per-entry size is larger than we
need, we should reconfigure it to try and improve concurrency.
So, we begin tracking the last URB configuration in the context,
and compare against that when updating shader variants.
Cuts 36% of the URB reconfigurations (excluding BLORP) from a
Shadow of Mordor trace, and 46% from a GFXBench Manhattan 3.0 trace.
One nice thing is that this removes the need to look at the old
prog_data when updating shaders, which should make it possible to
unbind shader variants without causing spurious URB updates.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8721>
I have never used this to debug anything in iris, and it's been years
since I even thought about using i965's similar functionality. I'm
planning to move a bunch of shaders out of the global hash table, at
which point it'll be much less useful. So, just drop it.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8634>