Stop using brw_compiler to lower the final binding table indices for
surface access. This is done by simply not setting the
'prog_data->binding_table.*_start' fields. Then make the driver
perform this lowering.
This is a better place to perfom the binding table assignments, since
the driver has more information and will also later consume those
assignments to upload resources.
This also prepares us for two changes: use ibc without having to
implement binding table logic there; and remove unused entries from
the binding table.
Since the `block` field in brw_ubo_range now refers to the final
binding table index, we need to adjust it before using to index
shs->constbuf.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Shader-db results on Kaby Lake:
total instructions in shared programs: 15306230 -> 15304726 (<.01%)
instructions in affected programs: 4570 -> 3066 (-32.91%)
helped: 16
HURT: 0
total cycles in shared programs: 361703436 -> 361680041 (<.01%)
cycles in affected programs: 129388 -> 105993 (-18.08%)
helped: 16
HURT: 0
LOST: 0
GAINED: 2
The helped programs were in XCom 2, Deus Ex: Mankind Divided, and Kerbal
Space Program
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Jason pointed out that we don't need to keep an entire copy of the
serialized NIR around, we just need the SHA1. This does change our
disk cache key to be taking a SHA1 of a SHA1, which is a bit odd,
but should work out and be faster and use less memory.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Felix noticed a crash when using INTEL_DEBUG=bat decoding. It turned
out that we were sometimes placing variable length data near the end
of a buffer, and with the decoder guessing random lengths rather than
having an actual count, it was walking off the end and crashing. So
this does more than improve the decoder output.
Unfortunately, this is a bit more complicated than i965's handling,
because we don't have a single state buffer. Various places upload
data via u_upload_mgr, and so there isn't a central place to record
the size. We don't need to catch every single place, however, since
it's only important to record variable length packets (like viewports
and binding tables).
State data also lives arbitrarily long, rather than being discarded on
every batch like i965, so we don't know when to clear out old entries
either. (We also don't have a callback when an upload buffer is
released.) So, this tracking may space leak over time. That's probably
okay though, as this is only a debugging feature and it's a slow leak.
We may also get lucky and overwrite existing entries as we reuse BOs,
though I find this unlikely to happen.
The fact that the decoder works in terms of offsets from a state base
address is also not ideal, as dynamic state base address and surface
state base address differ for iris. However, because dynamic state
addresses start from the top of a 4GB region, and binding tables start
from addresses [0, 64K), it's highly unlikely that we'll get overlap.
We can always improve this, but for now it's better than what we had.
We were checking this based on nir->info.name, but with the shader
cache enabled, nir_strip throws out the name, causing us to use IEEE
mode for ARB programs.
gl-1.0-spot-light regressed because it wants ALT mode for 0^0 behavior.
Fixes: dc5dc727d5 iris: Serialize the NIR to a blob we can use for shader cache purposes.
We will use a hash of the serialized NIR together with brw_prog_*_key
(for NOS) as the disk cache key, where the disk cache contains actual
assembly shaders.
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
It had been internal to iris_program.c, but with the upcoming disk cache
code, the "program module" is going to be spread across a couple source
files. Into a header it goes!
Now it lives alongside iris_compiled_shader, which makes sense.
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
This mechanism lets the driver inform the state tracker about GPU
resets, say for destroying a robust API context and reporting a "device
lost" error to the application, making it take action to deal with this.
The iris batch module now tries to detect that the kernel has banned
our GEM context, creates a new non-banned context, and informs the
iris context module that all assumptions about state are now invalid
and it needs to reinitialize the relevant state.
Based on Chris Wilson's work, but significantly rewritten by me.
I was setting it based off a pipe_rasterizer_state field that appears
to be entirely dead outside of the draw module respecting it.
I should be setting it when the primitive type reaching the SF is
neither points nor lines. This is, unfortunately, rather dirty,
as we have to look at the rasterizer state, the geometry shader state,
the tessellation evaluation shader state, and the primitive type...
Some of the dEQP.functional.transform_feedback tests end up doing
the following sequence of operations:
1. BeginTransformFeedback
2. PauseTransformFeedback
3. Draw
4. ResumeTransformFeedback
At step 1, we'd pack 3DSTATE_SO_BUFFER commands saying to zero the
SO_WRITE_OFFSET registers. At step 2, we disable streamout, so step 3
doesn't bother emitting those commands. Then, step 4 re-packs new
3DSTATE_SO_BUFFER commands with offset = 0xFFFFFFFF, saying to continue
appending at the existing offset. This loads the value from the BO as
the offsets - but we never actually zeroed it.
So, just maintain a flag saying "we actually emitted the commands",
and stomp offset back to zero until we emit some.
Just enable it during init_render_context on Gen10+, and move the
Gen9 state tracking into iris_genx_state so it only exists on Gen9.
Reviewed-by: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
We need to subtract the starting offset from the final offset before
dividing by the stride. See src/intel/vulkan/genX_cmd_buffer.c:3142.
Not known to fix anything.
This implements PIPE_CAP_INVALIDATE_BUFFER and invalidate_resource(),
as well as the PIPE_TRANSFER_DISCARD_WHOLE_RESOURCE flag. When either
of these happen, we swap out the backing storage of the buffer for a
new idle BO, allowing us to write to it immediately without stalling
or queueing a blit.
On my Skylake GT4e at 1920x1080, this improves performance in games:
-----------------------------------------------
| DiRT Rally | +25% (avg) | +17% (max) |
| Bioshock Infinite | +22% (avg) | +11% (max) |
| Shadow of Mordor | +27% (avg) | +83% (max) |
-----------------------------------------------
This unifies a bunch of the UBO and SSBO code to use common structures.
Beyond iris_state_ref, pipe_shader_buffer also gives us a buffer size,
which can be useful when filling out the surface state.
Marek recently extended pipe->set_shader_buffers() to take an extra
writable_bitmask parameter, indicating which SSBOs are writable (some
may be bound read-only). We can use this to decide whether to set
EXEC_OBJECT_WRITE when pinning. Avoiding the write flag can save us
some cross-batch flushing if the SSBO is used for reading in both the
render and compute engines.
Pipeline statistics queries should not count BLORP's rectangles.
(23) How do operations like Clear, TexSubImage, etc. affect the
results of the newly introduced queries?
DISCUSSION: Implementations might require "helper" rendering
commands be issued to implement certain operations like Clear,
TexSubImage, etc.
RESOLVED: They don't. Only application submitted rendering
commands should have an effect on the results of the queries.
Piglit's arb_pipeline_statistics_query-vert_adj exposes this bug when
the driver is hacked to always perform glBufferData via a GPU staging
copy (for debugging purposes).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
libintel_common depends on libintel_compiler, but it contains debug
functionality that is needed by libintel_compiler. Break the circular
dependency by moving gen_debug files to libintel_dev.
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
MI_PREDICATE_DATA is an intermediate storage for the MI_PREDICATE
command's calculations - it holds the result of the subtraction when
the compare operation is SRCS_EQUAL or DELTAS_EQUAL. But the actual
result of the predication is MI_PREDICATE_RESULT, which is what we
want to copy from the render context to the compute context.
This function can be used to stall on the CPU and resolve the predicate
for the conditional render. It will convert ice->state.predicate from
IRIS_PREDICATE_STATE_USE_BIT to either IRIS_PREDICATE_STATE_RENDER or
IRIS_PREDICATE_STATE_DONT_RENDER, depending on the result of the query.
v2:
- return void (Ken)
- update the stored condition (Ken)
- simplify the code leading to resolve the predicate (Ken)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
I'll want to use this for transfer maps, which already do their own
flushing. This lets us avoid a double flush, and also gives us more
control over the batch which is selected.
Gallium might call us multiple times to bind subsets of the samplers,
at which point we'd recreate the table a bunch of times. It doesn't
really buy us anything to do it here - even if we defer to draw time,
the dirty tracking ensures we'll only do it on the first draw after a
bind_sampler_states() call.
We now use the number of samplers specified by the shader instead of
the binding count. If this number changes, we flag sampler state as
dirty so we re-upload a table with the right number of entries.
This also fixes a bug where ice->state.need_border_colors was never
unset, so once something needed border colors, the pool would always
be pinned in all future batches.
v2: Explicitly flag sampler states as dirty, rather than assuming that
bind_sampler_states() will be called if the program texture count
changes. While this may be true for st/mesa, it isn't the case for
Gallium HUD.
Tested-by: Timur Kristóf <timur.kristof@gmail.com>
Tested-by: Andre Heider <a.heider@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
If Vertex Shader uses EdgeFlag the hardware request that it is setup
as the last VERTEX_ELEMENT_STATE. If SGVS are add at draw time we
need to also reconfigure the last 3DSTATE_VF_INSTANCING so its
VertexElementIndex points to the new Vertex Element that contains
the EdgeFlag.
So if draw parameters or edgeflag are not used the CSO generated at
iris_create_vertex_element is sent directly in the batches. But if
edge flag is used we adjust last VERTEX_ELEMENT_STATE and
last 3DSTATE_VF_INSTANCING using their alternative edge flag version
we generate at iris_create_vertex_element and store at the CSO.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Additional VERTEX_ELEMENT_STATE are used to store basevertex and
baseinstance and drawid updating the DWordLength of the
3DSTATE_VERTEX_ELEMENTS command.
This passes all piglit tests for spec.*draw_parameters.* tests
and VK-GL-CTS KHR-GL45.shader_draw_parameters_tests.* tests.
Now we only mark a dirty_update when parameters are changed or
when we have an indirect draw.
We enable PIPE_CAP_DRAW_PARAMETERS on Iris.
There is no edge flag support in the Vertex Elements setup.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of allocating 4K BO per query object, we can create a large blob
of memory and split it into pieces as required.
Having one BO for multiple query objects, we don't want to wait on all
of them, instead when we write last snapshot, we create a sync point, and
check syncpoints while waiting on particular object.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
In st_nir_lower_uniforms_to_ubo() all UBO access in the shader have
its index incremented to open room for uniforms in constbuf0. So if
we use UBOs, we always need to include the extra binding entry in the
table.
To avoid doing this checks both when compiling the shader and when
assigning binding tables, store the num_cbufs in iris_compiled_shader.
Fixes a bunch of tests from Piglit and CTS that use UBOs but don't use
uniforms or system values. Note that some tests fitting this criteria
were passing because the UBOs were moved to be push
constants (avoiding the problem).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We were relying on CSE/GVN/etc to coalesce all intrinsics that load the
same value, but that's a bad idea. We might have a couple intrinsics
that reload the same value. If so, we only want to set up the uniform
on the first one we see.
I was using the Gallium API wrong. set_* functions with start_slot
and count parameters are supposed to update a subrange of the items.
I had been trashing all bound vertex buffers and starting over.
This should hopefully also make it easier to slot in additional
VERTEX_BUFFER_STATEs at draw time, say, for shader draw parameters.