In 4ceaed7839 we made scratch surface state allocations part of the
internal heap (mapped to STATE_BASE_ADDRESS::SurfaceStateBaseAddress)
so that it doesn't uses slots in the application's expected 1M
descriptors (especially with vkd3d-proton).
But all our compiler code relies on BSS
(STATE_BASE_ADDRESS::BindlessSurfaceStateBaseAddress).
The additional issue is that there is only 26bits of surface offset
available in CS instruction (CFE_STATE, 3DSTATE_VS, etc...) for
scratch surfaces. So we need the drivers to put the scratch surfaces
in the first chunk of STATE_BASE_ADDRESS::SurfaceStateBaseAddress
(hence all the driver changes).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 4ceaed7839 ("anv: split internal surface states from descriptors")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7687
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19727>
The code builds up the dynamic array of objects (spirv_objs) and
collect pointers to each of them into another dynamic
array (spirv_ptr_objs).
If the growth of the first array cause a reallocation, it is
possible that the previous pointers end up invalid.
Fixes: 77e929a527 ("intel/clc: allow multiple CL files to be compiled together")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19730>
Some review feedback of an earlier commit caused me to rearrange some
code quite a bit. I wasn't paying enough attention while applying the
later commits, and these breaks should have been returns. As it is, the
result of the imin or imax analysis is overwritten by the default case
handling... effectively the original commit does nothing. :(
Tiger Lake and Ice Lake had similar results. (Ice Lake shown)
total instructions in shared programs: 19914090 -> 19904772 (-0.05%)
instructions in affected programs: 121258 -> 111940 (-7.68%)
helped: 445 / HURT: 0
total cycles in shared programs: 855291535 -> 855266659 (<.01%)
cycles in affected programs: 2737005 -> 2712129 (-0.91%)
helped: 426 / HURT: 17
LOST: 0
GAINED: 3
Skylake and Broadwell had similar results. (Skylake shown)
total cycles in shared programs: 842395356 -> 842338259 (<.01%)
cycles in affected programs: 5460985 -> 5403888 (-1.05%)
helped: 458 / HURT: 0
Haswell and Ivy Bridge had similar results. (Haswell shown)
total instructions in shared programs: 16710449 -> 16708449 (-0.01%)
instructions in affected programs: 44101 -> 42101 (-4.54%)
helped: 75 / HURT: 0
total cycles in shared programs: 882760230 -> 882727923 (<.01%)
cycles in affected programs: 2867797 -> 2835490 (-1.13%)
helped: 62 / HURT: 10
No shader-db change on any other Intel platform.
No fossil-db changes on any Intel platform.
Fixes: 5ec75ca10d ("intel/compiler: Teach signed integer range analysis about imax and imin")
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19602>
The back-end swizzles dwords so that our indirect scratch messages match
the memory layout of spill/fill messages for better cache coherency.
The swizzle happens at a DWORD granularity. If a read or write crosses
a DWORD boundary, the first bit will get correctly swizzled but whatever
piece lands in the next dword will not because the scatter instructions
assume sequential addresses for all bytes. For DWORD writes, this is
handled naturally as part of scalarizing. For smaller writes, we need
to be sure that a single write never escapes a dword.
Fixes: fd04f858b0 ("intel/nir: Don't try to emit vector load_scratch instructions")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7364
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19580>
Because dup_mem_intrinsic() retains the SSA offset from the original
intrinsic and only modifies it by adding a constant, we can compute the
alignment based on the original alignment and the constant offset. This
is both easier and more accurate.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19580>
Many Intel platforms can only perform 32x16 bit multiplication. The
straightforward way to implement 32x32 bit multiplications is by
splitting one of the operands into high and low parts called H and L,
repsectively. The full multiplication can be implemented as:
((A * H) << 16) + (A * L)
On Intel platforms, special register accesses can be used to eliminate
the shift operation. This results in three instructions and a temporary
register for most values.
If H or L is 1, then one (or both) of the multiplications will later be
eliminated. On some platforms it may be possible to eliminate the
multiplication when H is 256.
If L is zero (note that H cannot be zero), one of the multiplications
will also be eliminated.
Instead of splitting the operand into high and low parts, it may
possible to factor the operand into two 16-bit factors X and Y. The
original multiplication can be replaced with (A * (X * Y)) = ((A * X) *
Y). This requires two instructions without a temporary register.
I may have gone a bit overboard with optimizing the factorization
routine. It was a fun brainteaser, and I couldn't put it down. :) On my
1.3GHz Ice Lake, a standalone test could chug through 1,000,000 randomly
selected values in about 5.7 seconds. This is about 9x the performance
of the obvious, straightforward implementation that I started with.
v2: Drop an unnecessary return. Rearrange logic slightly and rename
variables in factor_uint32 to better match the names used in the large
comment. Both suggested by Caio. Rearrange logic to avoid possibly
using `a` uninitialized. Noticed by Marcin.
v3: Use DIV_ROUND_UP instead of open coding it. Noticed by Caio.
Tiger Lake, Ice Lake, Haswell, and Ivy Bridge had similar results. (Ice Lake shown)
total instructions in shared programs: 19912558 -> 19912526 (<.01%)
instructions in affected programs: 3432 -> 3400 (-0.93%)
helped: 10 / HURT: 0
total cycles in shared programs: 856413218 -> 856412810 (<.01%)
cycles in affected programs: 122032 -> 121624 (-0.33%)
helped: 9 / HURT: 0
No shader-db changes on any other Intel platforms.
Tiger Lake and Ice Lake had similar results. (Ice Lake shown)
Instructions in all programs: 141997227 -> 141996923 (-0.0%)
Instructions helped: 71
Cycles in all programs: 9162524757 -> 9162523886 (-0.0%)
Cycles helped: 63
Cycles hurt: 5
No fossil-db changes on any other Intel platforms.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
This is especially helpful for a*isign(a) generated by idiv_by_const
optimization. On many GPUs, isign(a) is lowered to imax(imin(a, 1),
-1).
There are no changes on fossil-db because ANV uses a different
optimization path for idiv with a constant denominator. A future MR
will change this.
NOTE: This commit used to help a few hundred shader-db shaders, but
now none are affected. I suspect this is due to some change in the
idiv_by_const optimization. This could possibly be dropped.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
Only iabs and ineg are treated specially. Everything else just uses
nir_unsigned_upper_bound. The special treatment of source modifiers is
because they cause problems for nir_unsigned_upper_bound. Once those
are peeled off, nir_unsigned_upper_bound can generally produce a
tighter bound.
Future commits will add more opcodes. This mostly introduces the
basic framework.
v2: Add a bunch of comments to signed_integer_range_analysis. Re-arrange
the code a little to reduce duplication. Both suggested by
Caio. Rearrange some logic to simplify things. Suggested by Marcin.
Tiger Lake, Ice Lake, Haswell, and Ivy Bridge had similar results. (Ice Lake shown)
total instructions in shared programs: 19912894 -> 19912558 (<.01%)
instructions in affected programs: 109275 -> 108939 (-0.31%)
helped: 74 / HURT: 0
total cycles in shared programs: 856422769 -> 856413218 (<.01%)
cycles in affected programs: 15268102 -> 15258551 (-0.06%)
helped: 65 / HURT: 4
total fills in shared programs: 8218 -> 8217 (-0.01%)
fills in affected programs: 1171 -> 1170 (-0.09%)
helped: 1 / HURT: 0
Skylake and Broadwell had similar results. (Skylake shown)
total cycles in shared programs: 845145547 -> 845142263 (<.01%)
cycles in affected programs: 15261465 -> 15258181 (-0.02%)
helped: 65 / HURT: 0
Tiger Lake
Tiger Lake
Instructions in all programs: 157580768 -> 157579730 (-0.0%)
Instructions helped: 312
Instructions hurt: 28
Cycles in all programs: 7566977172 -> 7566967746 (-0.0%)
Cycles helped: 288
Cycles hurt: 53
Spills in all programs: 19701 -> 19700 (-0.0%)
Spills helped: 2
Spills hurt: 4
Fills in all programs: 33311 -> 33335 (+0.1%)
Fills helped: 5
Fills hurt: 4
Ice Lake
Instructions in all programs: 141998667 -> 141997227 (-0.0%)
Instructions helped: 420
Instructions hurt: 3
Cycles in all programs: 9162565297 -> 9162524757 (-0.0%)
Cycles helped: 389
Cycles hurt: 29
Spills in all programs: 19918 -> 19916 (-0.0%)
Spills helped: 2
Spills hurt: 3
Fills in all programs: 32795 -> 32814 (+0.1%)
Fills helped: 6
Fills hurt: 3
Skylake
Instructions in all programs: 132567691 -> 132567745 (+0.0%)
Instructions hurt: 24
Cycles in all programs: 8828897462 -> 8828889517 (-0.0%)
Cycles helped: 405
Cycles hurt: 6
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
Gfx8 and Gfx9 platforms are helped for cycles because now many
instructions like
mul(8) g12<1>D g10<8,8,1>D 6D
become
mul(8) g12<1>D g10<8,8,1>D 6W
It is the same number of instructions, but the 32x16 multiply is a
little faster.
v2: Fix transposed hi and lo in "(hi >= INT16_MIN && lo <= INT16_MAX)".
Noticed by Caio. Use nir_src_is_const instead of open coding it.
Suggested by Caio.
Broadwell and Skylake had similar results. (Skylake shown)
total cycles in shared programs: 845748380 -> 845145547 (-0.07%)
cycles in affected programs: 446346348 -> 445743515 (-0.14%)
helped: 6017
HURT: 0
helped stats (abs) min: 2 max: 7380 x̄: 100.19 x̃: 8
helped stats (rel) min: <.01% max: 3.72% x̄: 0.41% x̃: 0.39%
95% mean confidence interval for cycles value: -113.37 -87.00
95% mean confidence interval for cycles %-change: -0.42% -0.41%
Cycles are helped.
Skylake
Cycles in all programs: 8844820715 -> 8828897462 (-0.2%)
Cycles helped: 47914
Cycles hurt: 1
No shader-db or fossil-db changes on any other Intel platform.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
This enables copy propagation of
mov(8) g5<1>UD 0x00000180UD
mul(8) g10<1>D g2.3<0,1,0>D g5<16,8,2>W
into
mul(8) g10<1>D g2.3<0,1,0>D 180W
This is necessary for any optimization passes that generate imul_32x16
instructions.
No fossil-db or shader-db changes on any Intel platform.
v2: Fix type size check to (src size != 2) || (dest size != 4). It was
previously &&. :( This allowed copying constants into UB sources, and
that is invalid.
v3: Fix incorrect extraction of upper 16-bits of immediate value when
subnr=2. Noticed by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
The previous bounds checking would cause
mul(8) g121<1>D g120<8,8,1>D 0xec4dD
to be lowered to
mul(8) g121<1>D g120<8,8,1>D 0xec4dUW
mul(8) g41<1>D g120<8,8,1>D 0x0000UW
add(8) g121.1<2>UW g121.1<16,8,2>UW g41<16,8,2>UW
Instead of picking the bounds (and the new type) based on the old type,
pick the new type based on the value only.
This helps a few fossil-db shaders in Witcher 3 and Geekbench5. No
changes on any other Intel platforms.
Tiger Lake
Instructions in all programs: 157581069 -> 157580768 (-0.0%)
Instructions helped: 24
Cycles in all programs: 7566979620 -> 7566977172 (-0.0%)
Cycles helped: 22
Cycles hurt: 4
Ice Lake
Instructions in all programs: 141998965 -> 141998667 (-0.0%)
Instructions helped: 26
Cycles in all programs: 9162568666 -> 9162565297 (-0.0%)
Cycles helped: 24
Cycles hurt: 2
Skylake
No changes.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
Don't copy propagate the constant in situations like
mov(8) g8<1>D 0x7fffffffD
mul(8) g16<1>D g8<8,8,1>D g15<16,8,2>W
On platforms that only have a 32x16 multiplier, this will result in
lowering the multiply to
mul(8) g15<1>D g14<8,8,1>D 0xffffUW
mul(8) g16<1>D g14<8,8,1>D 0x7fffUW
add(8) g15.1<2>UW g15.1<16,8,2>UW g16<16,8,2>UW
On Gfx8 and Gfx9, which have the full 32x32 multiplier, it results in
mul(8) g16<1>D g15<16,8,2>W 0x7fffffffD
Volume 2a of the Skylake PRM says:
When multiplying a DW and any lower precision integer, the
DW operand must on src0.
See also https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/104.
Previous to INTEL_shader_integer_functions2 (in Vulkan or OpenGL), I
don't think it would be possible to create a situation where this could
occur. I discovered this via some optimizations that can determine that
the non-constant source must be able to fit in 16-bits. The case listed
above came from piglit's "ext_transform_feedback-order arrays points"
with those optimizations in place.
No shader-db or fossil-db changes on any Intel platform.
Fixes: de6c0f8487 ("intel/fs: Implement support for NIR opcodes for INTEL_shader_integer_functions2")
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
nir_opt_large_constants balks at seeing a store_deref of a variable
where the source is a vecN operation of multiple load_consts, and thinks
that isn't a constant, so it should not bother promoting it.
Unfortunately, we were running nir_lower_load_const_to_scalar before
nir_opt_large_constants, so this prevented a ton of constant promotion.
This commit /used to help/ some shaders in shader-db. Presumably since
!16770 landed, those shaders were already helped. Currently ther are
no shader-db changes on any Intel platform.
Fossil-db results:
All Intel platforms had similar results. (Ice Lake shown)
Instructions in all programs: 141998227 -> 141421756 (-0.4%)
Instructions helped: 12515
Instructions hurt: 237
SENDs in all programs: 7437925 -> 7468033 (+0.4%)
SENDs hurt: 12806
Cycles in all programs: 9161655753 -> 9132869800 (-0.3%)
Cycles helped: 10163
Cycles hurt: 2637
Spills in all programs: 19977 -> 18678 (-6.5%)
Spills helped: 384
Spills hurt: 40
Fills in all programs: 32863 -> 31396 (-4.5%)
Fills helped: 385
Fills hurt: 42
Lost: 1
Lots of Shadow of the Tomb Raider fragment shaders and Batman Arkham
Origins vertex shaders were hurt for SENDs in this commit. A couple
Aztec Ruins compute shaders and Spaceship shaders (multiple stages)
were also hurt.
All of the shaders hurt for spills or fills were Spaceship compute
shaders. Nearly all of the shaders helped were Shadow of the Tomb
Raider fragmenet shaders. One Spaceship shader was reall, REALLY helped:
Spills helped fossils/fossil-db/Spaceship.run.9f90a2a226fcc57f.1.foz/0b507d3abe2e3c28/compute: 321 -> 13 (-96.0%)
Fills helped fossils/fossil-db/Spaceship.run.9f90a2a226fcc57f.1.foz/0b507d3abe2e3c28/compute: 279 -> 21 (-92.5%)
Overall this seems like an improvement, but we may want to actually
run these few benchmarks before landing.
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16539>
When lowering a single instruction with a destination VGRF to 2 or
more, the VGRF is now considered partially written by each generated
instruction and that increases its liveness especially in loops. Thus
potentially increasing the number of spills/fills due to register
allocation.
Putting an UNDEF instruction in front of the lowered instructions
allows the IR to limit the liveness of the VGRF, reducing register
pressure.
This has a pretty dramatic effect on spills/fills for RT shaders. Here
the stats on Q2RTX shaders on DG2 (wipping out any spills/fills due to
register allocation) :
Instructions in all programs: 26150 -> 24955 (-4.6%)
SENDs in all programs: 1148 -> 1148 (+0.0%)
Loops in all programs: 4 -> 4 (+0.0%)
Cycles in all programs: 392179 -> 332787 (-15.1%)
Spills in all programs: 132 -> 116 (-12.1%)
Fills in all programs: 262 -> 154 (-41.2%)
Shader-db results on TGL :
total instructions in shared programs: 21158140 -> 21158377 (<.01%)
instructions in affected programs: 76629 -> 76866 (0.31%)
helped: 18
HURT: 20
helped stats (abs) min: 1 max: 60 x̄: 18.89 x̃: 12
helped stats (rel) min: 0.21% max: 3.61% x̄: 1.02% x̃: 0.77%
HURT stats (abs) min: 1 max: 79 x̄: 28.85 x̃: 18
HURT stats (rel) min: 0.04% max: 2.81% x̄: 1.13% x̃: 0.79%
95% mean confidence interval for instructions value: -4.82 17.30
95% mean confidence interval for instructions %-change: -0.34% 0.57%
Inconclusive result (value mean confidence interval includes 0).
total loops in shared programs: 5753 -> 5753 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 798856834 -> 798870688 (<.01%)
cycles in affected programs: 6208395 -> 6222249 (0.22%)
helped: 22
HURT: 17
helped stats (abs) min: 2 max: 8794 x̄: 1438.18 x̃: 782
helped stats (rel) min: 0.05% max: 2.28% x̄: 0.63% x̃: 0.44%
HURT stats (abs) min: 2 max: 19178 x̄: 2676.12 x̃: 1358
HURT stats (rel) min: 0.04% max: 23.49% x̄: 2.25% x̃: 0.71%
95% mean confidence interval for cycles value: -952.19 1662.65
95% mean confidence interval for cycles %-change: -0.64% 1.90%
Inconclusive result (value mean confidence interval includes 0).
total spills in shared programs: 4078 -> 4066 (-0.29%)
spills in affected programs: 40 -> 28 (-30.00%)
helped: 2
HURT: 0
total fills in shared programs: 2856 -> 2832 (-0.84%)
fills in affected programs: 127 -> 103 (-18.90%)
helped: 2
HURT: 0
total sends in shared programs: 998554 -> 998554 (0.00%)
sends in affected programs: 0 -> 0
helped: 0
HURT: 0
LOST: 0
GAINED: 0
Total CPU time (seconds): 2346.06 -> 2304.80 (-1.76%)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18657>
NIR has two implementations of lower_idiv, keyed on the
imprecise_32bit_lowering flag. This flag is misleading: the results when
setting this flag "imprecise", they're completely wrong for some values.
If a backend has a native implementation of umul_high, the correct path
isn't that much more expensive. If it doesn't, it's substantially slower
for highp integer divison... but in practice, non-constant highp integer
division is pretty rare.
After a painful migration of the tree, this code path has no more users.
Remove it so nobody else gets the bright idea of using it again.
Closes: #6555
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19303>
That was previously listed on the getopt_long struct but not actually
being used. This makes intel_clc argument processing easier as now
all of its arguments are handled with getopt and anything after the
special argument '--' is passed along to clang to form the final build
command.
Thanks to Dylan Baker for help with changes to the meson file.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19153>
VK_PIPELINE_CREATE_RAY_TRACING_SKIP_AABBS_BIT_KHR and
VK_PIPELINE_CREATE_RAY_TRACING_SKIP_TRIANGLES_BIT_KHR, when specified,
make TraceRay behave as if the corresponding shader flags were set, but
without affecting the value of IncomingRayFlags in shaders.
v2 (Lionel): Improve comments
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19152>