We only support 32-bit versions of ufind_msb, find_lsb, and bit_count,
so we need to lower them via nir_lower_int64.
Previously, we were failing to do so on platforms older than Icelake
and let those operations fall through to nir_lower_bit_size, which
used a callback to decide that they should be lowered when bit_size != 32.
However, that pass only emulates small bit-size operations by promoting
them to supported, larger bit-sizes (e.g. 16-bit using 32-bit). It
doesn't support emulating larger operations (e.g. 64-bit using 32-bit).
So nir_lower_bit_size would just u2u32 the 64-bit source, causing us to
flat ignore half of the bits.
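As a rough illustration of the difference (plain C, not the NIR passes
themselves; all function names here are made up for this sketch),
truncating the source loses any bit set in the upper half, while an
int64-style lowering inspects both 32-bit halves:

```c
#include <stdint.h>

/* 32-bit ufind_msb: index of the highest set bit, or -1 if x == 0. */
static int ufind_msb32(uint32_t x)
{
    int msb = -1;
    while (x != 0) {
        x >>= 1;
        msb++;
    }
    return msb;
}

/* What a u2u32 of the 64-bit source amounts to: the upper 32 bits
 * are dropped before the 32-bit operation runs. */
static int ufind_msb64_truncated(uint64_t x)
{
    return ufind_msb32((uint32_t)x);
}

/* Sketch of an int64-style lowering: check the high half first and
 * fall back to the low half only when the high half is zero. */
static int ufind_msb64_lowered(uint64_t x)
{
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t lo = (uint32_t)x;

    if (hi != 0)
        return 32 + ufind_msb32(hi);
    return ufind_msb32(lo);
}
```

For a value such as 1 << 40, the truncated version sees zero and
reports "no bit set", whereas the lowered version reports bit 40.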
Commit 78a195f252 (intel/compiler: Postpone most int64 lowering to
brw_postprocess_nir) provoked this bug on Icelake and later as well,
by moving the nir_lower_int64 handling for ufind_msb until late in
compilation, allowing it to reach nir_lower_bit_size which broke it.
To fix this, we always set int64 lowering for these opcodes, and also
correct the nir_lower_bit_size callback to ignore 64-bit operations.
Cc: mesa-stable
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23123>
GLSL IR opcodes generated for bitfieldExtract and bitfieldInsert are
lowered by lower_instructions. 4dff3ff005 ("nir/opt_algebraic:
Optimize open coded bfm.") adds an optimization that can rematerialize
nir_op_bfm that was prevented by the GLSL IR lowering.
It appears that every piece of hardware, except older Intel GPUs, that
has real integers (i.e., lower_bitops is not set) also sets
lower_bitfield_extract_to_shifts and lower_bitfield_insert_to_shifts.
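For reference, a minimal C sketch of what "lowering to shifts" means for
the unsigned case. The helper names are hypothetical and this is not the
code NIR generates; it assumes offset + bits <= 32, as GLSL requires for
defined results:

```c
#include <assert.h>
#include <stdint.h>

/* Mask of `bits` ones starting at bit `offset` (the "bfm" idea). */
static uint32_t bitfield_mask(unsigned offset, unsigned bits)
{
    assert(offset + bits <= 32);
    if (bits == 0)
        return 0;
    if (bits == 32)
        return 0xffffffffu;          /* avoid an undefined 32-bit shift */
    return ((1u << bits) - 1u) << offset;
}

/* Unsigned extract: shift the field to the top, then back to bit 0. */
static uint32_t bitfield_extract_u(uint32_t value, unsigned offset,
                                   unsigned bits)
{
    assert(offset + bits <= 32);
    if (bits == 0)
        return 0;
    return (value << (32 - bits - offset)) >> (32 - bits);
}

/* Insert: clear the field in `base`, then OR in the shifted value. */
static uint32_t bitfield_insert(uint32_t base, uint32_t insert,
                                unsigned offset, unsigned bits)
{
    uint32_t mask = bitfield_mask(offset, bits);

    if (bits == 0)
        return base;
    return (base & ~mask) | ((insert << offset) & mask);
}
```

The open-coded mask in bitfield_insert is the kind of expression an
algebraic optimization could recognize and turn back into a single mask
operation.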
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 4dff3ff005 ("nir/opt_algebraic: Optimize open coded bfm.")
Closes: #7874
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20323>
Setting the NIR options takes care of iris thanks to the common st/mesa
linking code, and updating brw_nir_link_shaders should handle anv.
The main effort here is updating remap_tess_levels, which needs to
handle vector stores, writemasking, and swizzling. Unfortunately,
we also need to continue handling the existing single-component
access because it's used for TES inputs, which we don't vectorize.
We could try to vectorize TES inputs too, but they're all pushed
anyway, so it wouldn't buy us much other than deleting this code.
Also, we do have opt_combine_stores, but not one for loads.
One limitation of using nir_vectorize_tess_levels is that it works
on variables, and so isn't able to combine outer/inner writes that
happen to live in the same vec4 slot (for triangle domains). That
said, it's still better than before.
For writes, we allow the intrinsics to supply up to the full size
of the variable (vec4 for outer, vec2 for inner) even if the domain
only requires a subset of those components (e.g. triangles need 3).
shader-db results on Icelake:
total instructions in shared programs: 19600314 -> 19597528 (-0.01%)
instructions in affected programs: 65338 -> 62552 (-4.26%)
helped: 271 / HURT: 0
helped stats (abs) min: 6 max: 24 x̄: 10.28 x̃: 12
helped stats (rel) min: 1.30% max: 18.18% x̄: 5.80% x̃: 7.59%
95% mean confidence interval for instructions value: -10.71 -9.85
95% mean confidence interval for instructions %-change: -6.17% -5.43%
Instructions are helped.
total cycles in shared programs: 851842332 -> 851808165 (<.01%)
cycles in affected programs: 618577 -> 584410 (-5.52%)
helped: 271 / HURT: 0
helped stats (abs) min: 64 max: 540 x̄: 126.08 x̃: 111
helped stats (rel) min: 2.57% max: 37.97% x̄: 6.12% x̃: 5.06%
95% mean confidence interval for cycles value: -135.35 -116.80
95% mean confidence interval for cycles %-change: -6.67% -5.57%
Cycles are helped.
total sends in shared programs: 1025238 -> 1024308 (-0.09%)
sends in affected programs: 6454 -> 5524 (-14.41%)
helped: 271 / HURT: 0
helped stats (abs) min: 2 max: 8 x̄: 3.43 x̃: 4
helped stats (rel) min: 5.71% max: 25.00% x̄: 14.98% x̃: 17.39%
95% mean confidence interval for sends value: -3.57 -3.29
95% mean confidence interval for sends %-change: -15.42% -14.54%
Sends are helped.
According to Felix DeGrood, this results in a 10% improvement in
the draw call time for certain draw calls from Strange Brigade.
v2: Fix assertions about number of components and add more of them.
Combine the quads and triangles handling as it's nearly identical.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com> [v1]
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19061>
Setting the NIR options takes care of iris thanks to the common st/mesa
linking code, and updating brw_nir_link_shaders should handle anv.
The main effort here is updating remap_tess_levels, which needs to
handle vector stores, writemasking, and swizzling. Unfortunately,
we also need to continue handling the existing single-component
access because it's used for TES inputs, which we don't vectorize.
We could try to vectorize TES inputs too, but they're all pushed
anyway, so it wouldn't buy us much other than deleting this code.
Also, we do have opt_combine_stores, but not one for loads.
One limitation of using nir_vectorize_tess_levels is that it works
on variables, and so isn't able to combine outer/inner writes that
happen to live in the same vec4 slot (for triangle domains). That
said, it's still better than before.
For writes, we allow the intrinsics to supply up to the full size
of the variable (vec4 for outer, vec2 for inner) even if the domain
only requires a subset of those components (e.g. triangles need 3).
shader-db results on Icelake:
total instructions in shared programs: 19605070 -> 19602284 (-0.01%)
instructions in affected programs: 65338 -> 62552 (-4.26%)
helped: 271 / HURT: 0
helped stats (abs) min: 6 max: 24 x̄: 10.28 x̃: 12
helped stats (rel) min: 1.30% max: 18.18% x̄: 5.80% x̃: 7.59%
95% mean confidence interval for instructions value: -10.71 -9.85
95% mean confidence interval for instructions %-change: -6.17% -5.43%
Instructions are helped.
total cycles in shared programs: 851854659 -> 851820320 (<.01%)
cycles in affected programs: 618749 -> 584410 (-5.55%)
helped: 271 / HURT: 0
helped stats (abs) min: 69 max: 540 x̄: 126.71 x̃: 108
helped stats (rel) min: 2.57% max: 37.97% x̄: 6.17% x̃: 5.06%
95% mean confidence interval for cycles value: -135.89 -117.54
95% mean confidence interval for cycles %-change: -6.72% -5.63%
Cycles are helped.
total sends in shared programs: 1025285 -> 1024355 (-0.09%)
sends in affected programs: 6454 -> 5524 (-14.41%)
helped: 271 / HURT: 0
helped stats (abs) min: 2 max: 8 x̄: 3.43 x̃: 4
helped stats (rel) min: 5.71% max: 25.00% x̄: 14.98% x̃: 17.39%
95% mean confidence interval for sends value: -3.57 -3.29
95% mean confidence interval for sends %-change: -15.42% -14.54%
Sends are helped.
According to Felix DeGrood, this results in a 10% improvement in
the draw call time for certain draw calls from Strange Brigade.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17944>
For Gen11 and prior, the dispatch mode for TCS was SINGLE_PATCH, and
this debug setting could be used to change it to 8_PATCH (falling back
to SINGLE_PATCH when the shader couldn't be in the multi dispatch mode).
However, after talking to Ken, it seems this debug setting is not really
worth keeping around, so remove it.
For Gen12+ the only option is 8_PATCH, so those platforms keep using
that dispatch mode as before.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18151>
Don't compute it based on devinfo->has_64bit_float. Otherwise we may
end up emitting 64bit-int (Q) instructions on platforms with 64bit
floats but not 64bit integers.
Right now, the only platforms where has_64bit_int is different from
has_64bit_float are the platforms that use GFX7_FEATURES.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15835>
Task/Mesh stages are CS-like stages, and include many
builtins (e.g. workgroup ID/index) and intrinsics (e.g. workgroup
memory primitives) originally present only in CS.
This commit adds two new stages (task and mesh) that 'inherit' from CS
by embedding a brw_cs_prog_data in their own prog_data structure, so
that CS functionality can be easily reused. They also currently use
the same helpers, recently added for CS, to select the SIMD variant.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13661>
INTEL_DEBUG is defined (since 4015e1876a) as:
#define INTEL_DEBUG __builtin_expect(intel_debug, 0)
which unfortunately chops off upper 32 bits from intel_debug
on platforms where sizeof(long) != sizeof(uint64_t) because
__builtin_expect is defined only for the long type.
Fix this by changing the definition of INTEL_DEBUG to be a function-like
macro with a "flags" argument. The new definition returns 1 when any of
the flags match and 0 otherwise.
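A small C sketch of the problem and the fix. The macro and flag names
below are illustrative, not the ones in the tree, and the truncation is
modelled with an explicit uint32_t cast so that it also shows up on
platforms where long is 64 bits. __builtin_expect is a GCC/Clang builtin:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the global debug flag word. */
static uint64_t intel_debug;

/* Hypothetical flag that lives in the upper 32 bits. */
#define DEBUG_HIGH_FLAG (1ull << 40)

/* Models __builtin_expect(intel_debug, 0) where long is 32 bits:
 * the value is narrowed before the caller can AND it with a flag. */
#define OLD_DEBUG_32BIT_LONG ((uint64_t)(uint32_t)intel_debug)

/* Function-like form: the 64-bit AND happens first, and only the
 * resulting 0/1 goes through __builtin_expect. */
#define DEBUG_ENABLED(flags) \
    (__builtin_expect((intel_debug & (flags)) != 0, 0))

static bool old_check(uint64_t flag)
{
    return (OLD_DEBUG_32BIT_LONG & flag) != 0;
}

static bool new_check(uint64_t flag)
{
    return DEBUG_ENABLED(flag);
}
```

With intel_debug set to DEBUG_HIGH_FLAG, old_check() misses the flag
while new_check() sees it.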
Most of the changes in this commit were generated using:
for c in `git grep INTEL_DEBUG | grep "&" | grep -v i915 | awk -F: '{print $1}' | sort | uniq`; do
  perl -pi -e "s/INTEL_DEBUG & ([A-Z0-9a-z_]+)/INTEL_DBG(\1)/" $c
  perl -pi -e "s/INTEL_DEBUG & (\([A-Z0-9_ |]+\))/INTEL_DBG\1/" $c
done
but it didn't handle all cases and required minor cleanups (like removal
of round brackets which were not needed anymore).
Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13334>
A lot of CTS tests write a u8vec4 or an i8vec4 to an SSBO. This results
in a lot of shifts and MOVs. When that pattern can be recognized, the
individual 8-bit components can be packed much more efficiently.
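As a sketch of the data layout involved (plain C with hypothetical
helper names, assuming component 0 lands in the low byte; this is not
the instruction sequence the pass emits), four 8-bit components occupy
one 32-bit word:

```c
#include <stdint.h>

/* Pack four 8-bit components into one 32-bit word, component 0 in the
 * low byte. Written per component, this is a chain of shifts and ORs;
 * recognizing the whole pattern lets it be treated as one pack. */
static uint32_t pack_u8vec4(const uint8_t v[4])
{
    return (uint32_t)v[0] |
           ((uint32_t)v[1] << 8) |
           ((uint32_t)v[2] << 16) |
           ((uint32_t)v[3] << 24);
}

/* Recover one component from the packed word. */
static uint8_t unpack_u8(uint32_t packed, unsigned component)
{
    return (uint8_t)(packed >> (8 * (component & 3)));
}
```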
v2: Rebase on b4369de27f ("nir/lower_packing: use
shader_instructions_pass")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9025>
This is where it should be rather than having to pass it into the
optimisation pass every time.
It also allows us to call the loop analysis pass without having to
duplicate these options which we will do later in this series.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12064>
The Intel bindless thread dispatch model is very simple. When a compute
shader is to be used for bindless dispatch, it can request a set of
stack IDs. These are allocated per-dual-subslice by the hardware and
recycled automatically when the stack ID is returned. Passed to the
bindless dispatch are a global argument address, a stack ID, and an
address of the BINDLESS_SHADER_RECORD to invoke. When the bindless
shader is dispatched, it is passed its stack ID as well as the global
and local argument pointers. The local argument pointer is the address
of the BINDLESS_SHADER_RECORD plus some offset which is specified as
part of the BINDLESS_SHADER_RECORD.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7356>
Intel hardware supports 8-bit arithmetic but it's tricky and annoying:
- Byte operations don't actually execute with a byte type. The
execution type for byte operations is actually word. (I don't know
if this has implications for the HW implementation. Probably?)
- Destinations are required to be strided out to at least the
execution type size. This means that B-type operations always have
a stride of at least 2. This wreaks havoc on the back-end in
multiple ways.
- Thanks to the strided destination, we don't actually save register
space by storing things in bytes. We could, in theory, interleave
two byte values into a single 2B-strided register, but that's both a
pain for RA and would lead to piles of false dependencies pre-Gen12.
On Gen12+, we'd need some significant improvements to the SWSB
pass.
- Also thanks to the strided destination, all byte writes are treated
as partial writes by the back-end and we don't know how to copy-prop
them.
- On Gen11, they added a new hardware restriction that byte types
aren't allowed in the 2nd and 3rd sources of instructions. This
means that we have to emit B->W conversions all over to resolve
things. If we emit said conversions in NIR, instead, there's a
chance NIR can get rid of some of them for us.
We can get rid of a lot of this pain by just asking NIR to get rid of
8-bit arithmetic for us. It may lead to a few more conversions in some
cases but having back-end copy-prop actually work is probably a bigger
bonus. There is still a bit we have to handle in the back-end: in
particular, basic MOVs and conversions, because 8-bit load/store ops
still require 8-bit types.
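Conceptually, the lowering widens the sources, does the arithmetic at
16 bits, and narrows the result. A minimal C sketch of that idea (the
function names are made up, and this is not the NIR pass itself):

```c
#include <stdbool.h>
#include <stdint.h>

/* 8-bit add done as a 16-bit add: widen both sources, add, then
 * narrow. The narrowing discards the carry, so the result wraps
 * exactly like a native 8-bit add would. */
static uint8_t iadd8_via_16(uint8_t a, uint8_t b)
{
    uint16_t wa = a;                      /* zero-extend to 16 bits */
    uint16_t wb = b;
    uint16_t sum = (uint16_t)(wa + wb);
    return (uint8_t)sum;                  /* narrow back to 8 bits */
}

/* Signed 8-bit compare done at 16 bits: sources must be
 * sign-extended, not zero-extended, to keep the ordering. */
static bool ilt8_via_16(int8_t a, int8_t b)
{
    int16_t wa = a;                       /* sign-extend to 16 bits */
    int16_t wb = b;
    return wa < wb;
}
```

The extra conversions are the cost mentioned above; the arithmetic
itself then uses word types throughout.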
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7482>