With the advent of SPIR-V subgroup operations, compute shaders will have
to be slightly different depending on the SIMD size at which they
execute. In order to allow us to do dispatch-width specific things in
NIR, we re-run the final NIR stages for each sIMD width.
One side-effect of this change is that we start rallocing fs_visitors
which means we need DECLARE_RALLOC_CXX_OPERATORS.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Previously, brw_nir_lower_intrinsics added the param and then emitted a
load_uniform intrinsic to load it directly. This commit switches things
over to use a specific NIR intrinsic for the thread id. The one thing I
don't like about this approach is that we have to copy thread_local_id
over to the new visitor in import_uniforms.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
This isn't often a problem , when we're in a compute shader, we must
push the thread local ID so we decrement the amount of available push
space by 1 and it's no longer even and 64-bit data can, in theory, span
it. By marking those uniforms contiguous, we ensure that they never get
split in half between push and pull constants.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
The only things that adjust fs_visitor::max_dispatch_width are render
target writes which don't happen in compute shaders so they're
pointless.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
It's 8 for everything except compute shaders. For compute shaders,
there's no need to duplicate the computation and it's just a possible
source of error.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Before, we bailing in assign_constant_locations based on the minimum
dispatch size. The more direct thing to do is simply to check for
whether or not we have constant locations and bail if we do. For
nir_setup_uniforms, it's completely safe to do it multiple times because
we just copy a value from the NIR shader.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
This is what we really wanted all along. Always retyping to D works
because that's what get_nir_src() always gives us, at least for 32-bit
types. The SPIR-V variants of these operations accept arbitrary types
and we need this if we're going to handle 64 or 16-bit values.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
The index is any value provided by the shader and this can be called in
non-uniform control flow so we can't just take component 0. Found by
inspection.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Stop retyping the output of shuffle_64bit_data_for_32bit_write. It's
always BRW_REGISTER_TYPE_D which is perfectly fine for writing out.
Also, when we change get_nir_src to return something with a 64-bit type
for 64-bit values, the retyping will not be at all what we want. Also,
retyping the output based on src.type before we whack it back to 32 bits
is a problem because the output is always 32 bits.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
All callers of this function allocate a fs_reg expressly to pass into
it. It's much easier if we just let the helper allocate the register.
While we're here, we switch it to doing the MOVs with an integer type so
that we don't accidentally canonicalize floats on half of a double.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
The swizzles weren't doing any good because swiz is just XYZW. Also, we
were emitting an extra set of MOVs because shuffle_64bit_data_for_32bit
already does a MOV for us. Finally, the temporary was only ever used
inside the inner loop so there's no need for it to actually be an array.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Some hardware (CHV, BXT) have special restrictions on register regions
when doing integer multiplication. We want to respect those when we
lower to DxW multiplication.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
The same workaround we need for 64-bit values on little core also takes
care of the Ivy Bridge problem and does so a bit more efficiently so we
can drop that code while we're here.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
We're not using broadcast for any 32-bit types right now since we mostly
use it for emit_uniformize on 32-bit buffer indices. However, SPIR-V
subgroups are going to need it for 64-bit so let's make it work.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
This means we have to drop const from a variable but it also means that
100% of the code which deals with the offset limit is in one place.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
These restrictions effectively already existed due to the way we use
indirect sources but weren't being directly enforced.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
For some reason, the any/all predicates don't work properly with SIMD32.
In particular, it appears that a SEL with a QtrCtrl of 2H doesn't read
the correct subset of the flag register and you end up getting garbage
in the second half. Work around this by using a pair of 1-wide MOVs and
scattering the result. This fixes the any/all instructions for SIMD32.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
The any/all intrinsics return a boolean value so D or UD is the correct
type. Unfortunately, get_nir_dest has the annoying behavior of
returnning a float type by default. This causes format conversion which
gives us -1.0f or 0.0f in the register. If the consumer of the result
does an integer comparison to zero, it will give you the right boolean
value but if we do something more clever based on the 0/~0 assumption
for booleans, this will give the wrong value.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
In fragment shaders f0.1 is used for discards so doing ballot after a
discard can potentially cause the discard to not happen. However, we
don't support SIMD32 fragment shaders yet so this isn't a problem.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
We have ANY/ALL32 predicates and, for the most part, they work just
fine. (See the next commit for more details.) Also, due to the way
that flag registers are handled in hardware, instruction splitting is
able to split the CMP correctly. Specifically, that hardware looks at
the execution group and knows to shift it's flag usage up correctly so a
2H instruction will write to f0.1 instead of f0.0.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
Before, we were careful to place the zip after the last of the split
instructions but did unzip on-demand. This changes things so that the
unzips go before all of the split instructions and the unzip comes
explicitly after all the split instructions. As a side-effect of this
change, we now emit the split instruction from highest SIMD group to
lowest instead of low to high. We could have kept the old behavior, but
it shouldn't matter and this made the code easier.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
This makes it far more explicit where we're inserting the instructions
rather than the magic "before and after" stuff that the emit_[un]zip
helpers did based on block and inst.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
Register strides higher than 4 are uncommon but they can happen. For
instance, if you have a 64-bit extract_u8 operation, we turn that into
UB -> UQ MOV with a source stride of 8. Our previous calculation would
try to generate a stride of <32;8,8>:ub which is invalid because the
maximum horizontal stride is 4. To solve this problem, we instead use a
stride of <8;1,0>. As noted in the comment, this does not work as a
destination but that's ok as very few things actually generate that
stride.
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
Thanks to the ralloc invariant of "any pointer returned from ralloc can
be used as a context", calling ralloc_size with a size of zero will
cause it to allocate at least a header. If we don't have any push
constants, then NULL is perfectly acceptable (and even preferred).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
This will be used by the on disk shader cache.
v2:
* Set in brw_compile_* rather than brw_codegen_*. (Jason)
Signed-off-by: Timothy Arceri <timothy.arceri@collabora.com>
[jordan.l.justen@intel.com: Only add to brw_stage_prog_data]
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
NIR does not have these instructions. TGSI and Mesa IR both implement
them using < and >=, repsectively. Removing them deletes a bunch of
code and means I don't have to add code to the SPIR-V generator for
them.
v2: Rebase on 2+ years of change... and fix a major bug added in the
rebase.
text data bss dec hex filename
8255291 268856 294072 8818219 868e2b 32-bit i965_dri.so before
8254235 268856 294072 8817163 868a0b 32-bit i965_dri.so after
7815339 345592 420592 8581523 82f193 64-bit i965_dri.so before
7813995 345560 420592 8580147 82ec33 64-bit i965_dri.so after
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Patch uses mem_ctx for allocation to ensure param array gets freed
later.
==6164== 48 bytes in 1 blocks are definitely lost in loss record 61 of 193
==6164== at 0x4C2EB6B: malloc (vg_replace_malloc.c:299)
==6164== by 0x12E31C6C: ralloc_size (ralloc.c:121)
==6164== by 0x130189F1: fs_visitor::assign_constant_locations() (brw_fs.cpp:2095)
==6164== by 0x13022D32: fs_visitor::optimize() (brw_fs.cpp:5715)
==6164== by 0x13024D5A: fs_visitor::run_fs(bool, bool) (brw_fs.cpp:6229)
==6164== by 0x1302549A: brw_compile_fs (brw_fs.cpp:6570)
==6164== by 0x130C4B07: blorp_compile_fs (blorp.c:194)
==6164== by 0x130D384B: blorp_params_get_clear_kernel (blorp_clear.c:79)
==6164== by 0x130D3C56: blorp_fast_clear (blorp_clear.c:332)
==6164== by 0x12EFA439: do_single_blorp_clear (brw_blorp.c:1261)
==6164== by 0x12EFC4AF: brw_blorp_clear_color (brw_blorp.c:1326)
==6164== by 0x12EFF72B: brw_clear (brw_clear.c:297)
Fixes: 8d90e28839 ("intel/compiler: Allocate pull_param in assign_constant_locations")
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable@lists.freedesktop.org
Fixes intermittent GPU hangs on Broxton with an Intel internal
test case.
There are plenty of similar fragment shaders in piglit that do
not use any varyings and any uniforms. According to the
documentation special timing is needed between pipeline stages.
Apparently we just don't hit that with piglit. Even with the
failing test case one doesn't always get the hang.
Moreover, according to the error states the hang happens
significantly later than the execution of the problematic shader.
There are multiple render cycles (primitive submissions) in between.
I've also seen error states where the ACTHD points outside the
batch. Almost as if the hardware writes somewhere that gets used
later on. That would also explain why piglit doesn't suffer from
this - most tests kick off one render cycle and any corruption
is left unseen.
v2 (Ken): Instead of enabling push constants, enable one of the
inputs (PSIZ).
v3 (Ken, Jason): Use LAYER instead making vulkan emit_3dstate_sbe()
happy.
Cc: "17.3 17.2" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
The PRM says "The execution size must be 1." In 73137997e2, the
execution size was set to 1 when it should have been BRW_EXECUTE_1
(which maps to 0). Later, in dc2d3a7f5c, JMPI was used for
line AA on gen6 and earlier and we started manually stomping the
exeution size to BRW_EXECUTE_1 in the generator. This commit fixes the
original bug and makes brw_JMPI just do the right thing.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 73137997e2
Returns the brw_type for a given ssa.bit_size, and a reference type.
So if bit_size is 64, and the reference type is BRW_REGISTER_TYPE_F,
it returns BRW_REGISTER_TYPE_DF. The same applies if bit_size is 32
and reference type is BRW_REGISTER_TYPE_HF it returns BRW_REGISTER_TYPE_F
v2 (Jason Ekstrand):
- Use better unreachable() messages
- Add Q types
Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
In order to implement the ballot intrinsic, we do a MOV from flag
register to some GRF. If that GRF is used in a SEL, cmod propagation
helpfully changes it into a MOV from the flag register with a cmod.
This is perfectly valid but when lower_simd_width comes along, it simply
splits into two instructions which both have conditional modifiers.
This is a problem since we're reading the flag register. This commit
makes us check whether or not flags_written() overlaps with the flag
values that we are reading via the instruction source and, if we have
any interference, will force us to emit a copy of the source.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org