It'll grow further, and we'd like to avoid adding an additional
parameter to fs_generator() for each new piece of data.
v2 (idr): Rebase after 17 months. Track a visitor instead of a cfg.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Gen11 SLM is not on L3 anymore, so now the hardware has two separate
fences. Add a way to control which fence types to use.
At this time, we don't have enough information in NIR to control the
visibility of the memory being fenced, so for now be conservative and
assume that fences will need a stall. With more information later
we'll be able to avoid some of those stalls.
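A rough sketch of the shape this takes in the fs backend; the
conditions and temporaries (l3_fence, slm_fence, stall) are
illustrative stand-ins, not the literal patch:

    /* Gen11 SLM sits outside L3, so a workgroup-scope fence may need
     * two SHADER_OPCODE_MEMORY_FENCE instructions, one per fence
     * type, with the target picked by the BTI source (see v2 below). */
    const fs_builder ubld = bld.group(8, 0);
    const fs_reg tmp = ubld.vgrf(BRW_REGISTER_TYPE_UD);
    if (l3_fence)
       ubld.emit(SHADER_OPCODE_MEMORY_FENCE, tmp,
                 brw_vec8_grf(0, 0), brw_imm_ud(stall),
                 brw_imm_ud(/* BTI */ 0));
    if (slm_fence)
       ubld.emit(SHADER_OPCODE_MEMORY_FENCE, tmp,
                 brw_vec8_grf(0, 0), brw_imm_ud(stall),
                 brw_imm_ud(GEN7_BTI_SLM));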
Fixes Vulkan CTS tests on ICL:
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_nonlocal.workgroup.guard_local.buffer.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.image.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.image.guard_nonlocal.workgroup.comp
The whole set of supported tests in the dEQP-VK.memory_model.* group
should be passing on ICL now.
v2: Pass BTI around instead of having an enum. (Jason)
Emit two SHADER_OPCODE_MEMORY_FENCE instead of one that gets
transformed into two. (Jason)
List tests fixed. (Lionel)
v3: For clarity, split the decision of which fences to emit from the
emission code. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2: 1) Drop changes for vec4 backend as on Gen11+ we don't support
align16 mode (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
This rewrites the ddy in EXECUTE_4 mode as a loop to make it more
obvious what is going on, and also sets the group on which each of the
4-wide instructions is supposed to execute.
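The loop is roughly the following sketch (offsets simplified; the real
code builds strided source regions for the top and bottom rows of each
quad):

    /* One 4-wide ADD per quad; setting the default group makes each
     * instruction execute on its own quad's channels. */
    const unsigned ts = type_sz(src.type);
    for (unsigned g = 0; g < inst->exec_size; g += 4) {
       brw_set_default_group(p, inst->group + g);
       brw_ADD(p, byte_offset(dst, g * ts),
               negate(byte_offset(src, g * ts)),        /* top row    */
               byte_offset(src, (g + 2) * ts));         /* bottom row */
    }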
Fixes the following CTS tests:
dEQP-VK.glsl.derivate.dfdyfine.dynamic_*
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Co-Authored-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 2134ea3800 ("intel/compiler/fs: Implement ddy without using align16 for Gen11+")
With 8 and 16-bit types and anything where we have to use non-trivial
register strides to deal with restrictions, we end up with things that
look like partial writes even though we don't care about any values in
the register except those written by that instruction. This is
particularly important when dealing with loops: liveness sees
is_partial_write, assumes an old value from a previous loop iteration
may still be needed at that point, and extends every partially written
value's live range to the entire loop.
This commit adds a new UNDEF instruction which does nothing (the
generator doesn't emit anything) but which does a fake write to the
register. This informs liveness that we don't care about any values
before that point so it won't consider those registers to be falsely
live. We can safely emit UNDEF instructions for all SSA values that
come in from NIR and nearly all temporaries generated by various stages
of the compiler. In particular, we need to insert UNDEF instructions
when we handle region restrictions because the newly allocated registers
are almost guaranteed to be partially written.
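A sketch of the intended usage; the temporary and the surrounding code
are made up:

    /* Freshly allocated 16-bit temporary about to be written with a
     * non-trivial stride.  UNDEF performs a fake full write (the
     * generator emits no code for it), so liveness knows no value of
     * tmp from before this point -- e.g. from a previous loop
     * iteration -- can matter. */
    fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_HF);
    bld.UNDEF(tmp);
    /* ...strided/partial writes filling tmp follow... */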
No shader-db changes.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110432
Reviewed-by: Matt Turner <mattst88@gmail.com>
We set header_present but then pass it some random garbage. Give it g0
instead. I'm not actually sure this does anything, but g0 is the usual
header data and it's what the Windows driver does, so it seems like a
good idea.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
In the FS IR we pretend that the instruction is predicated with (+f0.1)
just for flag dependency tracking purposes. Since the instruction
doesn't support predication before Haswell, we unset the predicate; we
should also unset the flag register so that we can round-trip the
disassembly.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We now have a lowering pass that will do this at the fs_visitor level,
so we can remove this code from gen11+.
v2: Reduce size of the "i" array from 4 to 2 (Matt).
Reviewed-by: Matt Turner <mattst88@gmail.com>
Move the scalar-region conversion from the IR to the generator, so it
doesn't affect the Gen11 path. We need the non-scalar regioning
for a later lowering pass that we are adding.
v2: Better commit message (Matt)
Reviewed-by: Matt Turner <mattst88@gmail.com>
Broadwell has restrictions that apply to Align16 half-float that
make the Align16 implementation of this invalid for this platform.
Use the gen11 path for this instead, which uses Align1 mode.
The restriction is not present in Cherryview, gen9 or gen10, where
the Align16 implementation seems to work just fine.
v2:
- Rework the comment in the code, move the PRM citation from the
commit message to the comment in the code (Matt)
- Cherryview isn't affected, only Broadwell (Matt)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v1)
Reviewed-by: Matt Turner <mattst88@gmail.com>
We were assuming 32-bit elements. Also, in SIMD8 we pack 2 vector
components in a single SIMD register, so, for example, component Y of a
16-bit vec2 starts at byte offset 16B. This means that when we compute
the offset of the elements to be differentiated we should not stomp
whatever base offset we have, but instead add to it.
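The fix boils down to the following shape (comp_size_in_bytes is a
stand-in; see the v2 note below):

    /* Offset to vector component c without stomping the base offset:
     * byte_offset() adds to src.offset rather than assigning it. */
    /* before (broken): src.offset = c * comp_size_in_bytes;        */
    src = byte_offset(src, c * comp_size_in_bytes);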
v2:
- Use byte_offset() helper (Jason)
- Merge the fix for SIMD8: using byte_offset() fixes that too.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v1)
Reviewed-by: Matt Turner <mattst88@gmail.com>
For split indirect sends we have to put the EOT parameter in the
extended descriptor as well as the instruction itself, so just calling
brw_inst_set_eot is insufficient. Moving the EOT handling into the
send_indirect_[split]_message helper lets us handle it properly.
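Schematically, with a deliberately abbreviated, hypothetical helper
signature:

    /* The helper now takes EOT explicitly and applies it in both
     * places, rather than callers poking the instruction word alone. */
    brw_send_indirect_split_message(p, sfid, dst, payload, payload2,
                                    desc, ex_desc, /* eot */ true);
    /* inside: set the EOT field of the extended descriptor, and
     *         brw_inst_set_eot(devinfo, send, true);               */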
Strides up to 32B can be implemented for the source regions of most
instructions by leveraging either the vertical or the horizontal
stride of the hardware Align1 region. The main motivation for this is
that currently the lower_integer_multiplication() pass will happily
double the stride of one of the 32-bit sources, which can blow up if
the stride of the original source was already the maximum value
allowed by the hardware.
An alternative would be to use the regioning legalization pass in
order to lower such strides into the composition of multiple legal
strides, but that would be somewhat less efficient.
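For illustration (not the literal lowering code): a 32-bit source with
a per-channel stride of 8 elements (32B) exceeds the maximum horizontal
stride of 4, but fits in the vertical component of a width-1 region:

    /* g2<8;1,0>:ud reads elements 0, 8, 16, 24, ...: each "row" is a
     * single element and rows advance by 8 elements (32 bytes), which
     * the horizontal stride alone (max 4) could not express. */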
This showed up as a regression from my commit cbea91eb57
in Vulkan 1.1 CTS tests on CHV/BXT platforms; however, it was really a
pre-existing problem that had affected conformance on other platforms
without native support for integer multiplication. CHV/BXT were
getting around it because the code I removed in that commit had the
"fortunate" side effect of emitting narrower regions that didn't hit
the hardware stride limit after lowering. Beyond fixing the
regression this fixes ~90 additional Vulkan 1.1 subgroup CTS tests on
ICL (that's why this patch is marked for inclusion in mesa-stable even
though the original regressing patch was not).
According to Jason, a nearly equivalent change had been committed
previously as e8c9e65185 and then (mistakenly?) reverted as
a31d038208.
Cc: mesa-stable@lists.freedesktop.org
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109328
Reported-by: Mark Janes <mark.a.janes@intel.com>
Tested-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The original idea was that the backend compiler could eliminate
surfaces, so we would have it mark which ones are actually used,
then shrink the binding table accordingly. Unfortunately, it's a
pretty blunt mechanism - it can only prune things from the end,
not the middle - since we decide the layout before we even start
the backend compiler, and only limit the size. It also basically
gives up if it sees indirect array access.
Besides, we do the vast majority of our surface elimination in NIR
anyway, not the backend - and I don't see that trend changing any
time soon. Vulkan abandoned this plan a long time ago, and I don't
use it in Iris, but it's still been kicking around in i965.
I hacked shader-db to print the binding table size in bytes, and
observed no changes with this patch. So, this code appears to do
nothing useful.
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
These are broken on a future platform, but it turns out we don't need
to fix them, since they're just type-converting moves with a strided
source. Kill them.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Align16 is no longer a thing, so a new implementation is provided
using Align1 instead. Not all possible swizzles can be represented as
a single Align1 region, but some fast paths are provided for
frequently used swizzles that can be represented efficiently in Align1
mode.
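To give a flavor of the fast paths (illustrative; indices are element
positions within each group of four channels):

    /* Quad swizzles that map onto a single Align1 region:
     *
     *   XXZZ: g2<2;2,0>:ud    -> 0,0,2,2
     *   YYWW: g2.1<2;2,0>:ud  -> 1,1,3,3
     *   XYXY: g2<0;2,1>:ud    -> 0,1,0,1
     *
     * Something like YXWZ (1,0,3,2) has no single-region encoding and
     * needs multiple instructions instead. */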
Fixes ~90 subgroup quad swap Vulkan CTS tests.
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
To get uniform behavior when disassembling send(c) instructions, use a
register type of unsigned doubleword for src1 when the message
descriptor is an immediate value. The Bspec does not specify a default
type for an immediate src1.
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
v2: Split changes to the message type field into another patch.
Suggested by Caio.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This makes the message length available at the IR level, which should
save some guesswork in a future commit.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The current approach of returning a setup instruction where additional
descriptor fields can be specified is still supported in order to keep
things working, but it will be removed later in this series.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
On g4x through Sandy Bridge, src1 (the coordinates) of the PLN
instruction is required to be an even register number. When it's odd
(which can happen with SIMD32), we have to emit a LINE+MAC combination
instead. Unfortunately, we can't just fall through to the gen4 case
because the input registers are still set up for PLN which lays out the
four src1 registers differently in SIMD16 than LINE.
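A rough sketch of the replacement sequence; the delta registers and
offsets are illustrative (see the v2/v3 notes below):

    /* LINE computes a*x + c into the accumulator implied by the
     * channel group; MAC then adds b*y.  Emitting both LINEs first
     * uses one accumulator per SIMD8 half: LINE LINE MAC MAC. */
    brw_set_default_group(p, inst->group);
    brw_LINE(p, brw_null_reg(), interp, delta_lo);
    brw_set_default_group(p, inst->group + 8);
    brw_LINE(p, brw_null_reg(), interp, delta_hi);
    brw_set_default_group(p, inst->group);
    brw_MAC(p, dst, suboffset(interp, 1), offset(delta_lo, 1));
    brw_set_default_group(p, inst->group + 8);
    brw_MAC(p, offset(dst, 1), suboffset(interp, 1), offset(delta_hi, 1));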
v2 (Jason Ekstrand):
- Take advantage of both accumulators and emit LINE LINE MAC MAC
(Based on a patch from Francisco Jerez)
- Unify the gen4 and g4x-6 cases using a loop
v3 (Jason Ekstrand):
- Don't unify gen4 with g4x-6 as this turns out to be more fragile
than first thought without reworking the gen4 barycentric coordinate
layout.
Reviewed-by: Matt Turner <mattst88@gmail.com>
This reworks INTERPOLATE_AT_PER_SLOT_OFFSET to work more like an ALU
operation and less like a send. This is less code over-all and, as a
side-effect, it now properly handles execution groups and lowering so
SIMD32 support just falls out.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
We want consistent behavior in the meaning of the flag_subreg field
between SNB and IVB+.
v2 (Jason Ekstrand):
- Add some extra commentary
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Doing instruction header setup in the generator is awful for a number
of reasons. For one, we can't schedule the header setup at all. For
another, it means lots of implied writes which the instruction scheduler
and other passes can't properly reason about. The second isn't a huge
problem for FB writes since they always happen at the end. We made a
similar change to sampler handling in ff4726077d.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Now that we have the implied header in src[0] for tracking purposes, we
may as well use it in the generator. This makes things a tiny bit more
general.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Adds support for ARB_fragment_shader_interlock. We achieve
the interlock and fragment ordering by issuing a memory fence
via sendc.
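A sketch of the NIR-to-fs path (opcode and arguments approximate):

    case nir_intrinsic_begin_invocation_interlock: {
       /* The interlock is a memory-fence message issued with sendc,
        * which stalls until overlapping earlier fragments retire. */
       const fs_builder ubld = bld.group(8, 0);
       const fs_reg tmp = ubld.vgrf(BRW_REGISTER_TYPE_UD, 2);
       ubld.emit(SHADER_OPCODE_INTERLOCK, tmp)->size_written = 2 * REG_SIZE;
       break;
    }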
Signed-off-by: Plamena Manolova <plamena.manolova@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
When using multiple RT write messages to the same RT such as for
dual-source blending or all RT writes in SIMD32, we have to set the
"Last Render Target Select" bit on all write messages that target the
last RT but only set EOT on the last RT write in the shader.
Special-casing for dual-source blend works today because that is the
only case which requires multiple RT write messages per RT. When we
start doing SIMD32, this will become much more common so we add a
dedicated bit for it.
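Conceptually (field and variable names illustrative):

    /* Every message aimed at the last render target sets the new bit
     * (-> "Last Render Target Select" in the message header); EOT is
     * reserved for the final write the shader emits. */
    inst->last_rt = (target == key->nr_color_regions - 1);
    inst->eot     = inst->last_rt && is_final_write;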
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
The only reason it was its own opcode was so that we could detect it
and adjust the source register based on the payload setup. Now that
we're using the ATTR file for FS inputs, there's no point in having a
magic opcode for this.
v2 (Jason Ekstrand):
- Break the bit which removes the CINTERP opcode into its own patch
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
This is better than compression control because it naturally extends to
SIMD32.
v2:
- Push/pop instruction state around adjusted codegen (Ken)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
The fall-back does not work correctly in SIMD16 mode and the register
allocator should ensure that we never hit this case anyway.
Reviewed-by: Matt Turner <mattst88@gmail.com>