Consider the following case:
// g119-123 are written somewhere above
mul.sat(16) g67<1>F g6.4<0,1,0>F g125<8,8,1>F
mul.sat(16) g69<1>F g6.5<0,1,0>F g125<8,8,1>F
mul.sat(16) g71<1>F g6.6<0,1,0>F g125<8,8,1>F
mov(16) g119<1>F g67<8,8,1>F
mov(16) g121<1>F g69<8,8,1>F
mov(16) g123<1>F g71<8,8,1>F
We should be able to coalesce it into
mul.sat(16) g119<1>F g6.4<0,1,0>F g125<8,8,1>F
mul.sat(16) g121<1>F g6.5<0,1,0>F g125<8,8,1>F
mul.sat(16) g123<1>F g6.6<0,1,0>F g125<8,8,1>F
What's stopping us is an overly conservative check for writes to the two
registers being coalesced. The check walks over the intersection of
their live ranges and checks for no writes to either one. However,
because the register which starts the live range (the mul.sat in this
case) is inside that intersection, we flag it as a write in the
intersection and don't coalesce. However, this case is safe because the
destination register of the copy is never read after the source is
written.
Shader-db changes on ICL:
total instructions in shared programs: 16043613 -> 16042610 (<.01%)
instructions in affected programs: 43036 -> 42033 (-2.33%)
helped: 226
HURT: 0
helped stats (abs) min: 1 max: 30 x̄: 4.44 x̃: 4
helped stats (rel) min: 0.09% max: 26.67% x̄: 4.89% x̃: 3.43%
95% mean confidence interval for instructions value: -4.86 -4.02
95% mean confidence interval for instructions %-change: -5.57% -4.22%
Instructions are helped.
total cycles in shared programs: 334766372 -> 334710124 (-0.02%)
cycles in affected programs: 617548 -> 561300 (-9.11%)
helped: 214
HURT: 2
helped stats (abs) min: 15 max: 1512 x̄: 263.21 x̃: 212
helped stats (rel) min: 0.30% max: 75.36% x̄: 25.30% x̃: 21.58%
HURT stats (abs) min: 40 max: 40 x̄: 40.00 x̃: 40
HURT stats (rel) min: 0.15% max: 0.15% x̄: 0.15% x̃: 0.15%
95% mean confidence interval for cycles value: -277.91 -242.90
95% mean confidence interval for cycles %-change: -27.58% -22.55%
Cycles are helped.
No spill/fill changes or gained/lost
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4627>
Change brw_memory_fence to return the number of messages emitted, and
use that to update the send_count statistic in code generation.
This will fix the book-keeping for IVB since the memory fences will
result in two SEND messages.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4646>
The versions that take a specific number of operands will do various
fixups depending on the platform and the opcode. However, the version
that takes an array of sources did not. This makes all version operate
similarly.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>
This commit fixes performance regressions introduced by e03f965280
in which we started bounds checking our push constants. This added a
LOT of shader code to shaders which use the robustBufferAccess feature
and led to substantial spilling. The checking we just added to the FS
back-end is far more efficient for two reasons:
1. It can be done at a whole register granularity rather than per-
scalar and so we emit one SIMD8 SEL per 32B GRF rather than one
SIMD16 SEL (executed as two SELs) for each component loaded.
2. Because we do it with NoMask instructions, we can do it on whole
pushed GRFs without splatting them out to SIMD8 or SIME16 values.
This means that robust buffer access no longer explodes our register
pressure for no good reason.
As a tiny side-benefit, we're now using can use AND instead of SEL which
means no need for the flag and better scheduling.
Vulkan pipeline database results on ICL:
Instructions in all programs: 293586059 -> 238009118 (-18.9%)
SENDs in all programs: 13568515 -> 13568515 (+0.0%)
Loops in all programs: 149720 -> 149720 (+0.0%)
Cycles in all programs: 88499234498 -> 84348917496 (-4.7%)
Spills in all programs: 1229018 -> 184339 (-85.0%)
Fills in all programs: 1348397 -> 246061 (-81.8%)
This also improves the performance of a few apps:
- Shadow of the Tomb Raider: +4%
- Witcher 3: +3.5%
- UE4 Shooter demo: +2%
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4447>
Add new builtin parameters that are used to keep track of the group
size. This will be used to implement ARB_compute_variable_group_size.
The compiler will use the maximum group size supported to pick a
suitable SIMD variant. A later improvement will be to keep all SIMD
variants (like FS) so the driver can select the best one at dispatch
time.
When variable workgroup size is used, the small workgroup optimization
is disabled as it we can't prove at compile time that the barriers
won't be needed.
Extracted from original i965 patch with additional changes by
Caio Marcelo de Oliveira Filho.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
The push.total field had three values but only one was directly
used (size). Replace it with a helper function that explicitly takes
the cs_prog_data and the number of threads -- and use that in the
drivers.
This is a preparation for ARB_compute_variable_group_size where the
number of threads (hence the total size for push constants) is not
defined at compile time (not cs_prog_data->threads).
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
Change brw_compute_vue_map() to also take the number of pos slots. If
more than one slot is used, the VARYING_SLOT_POS is treated as an
array.
When using Primitive Replication, instead of a single position, the
VUE must contain an array of positions. Padding might be
necessary (after clip distance) to ensure rest of attributes start
aligned.
v2: Add note about array in the commit message and assert that
pos_slots >= 1 to make clear 0 is invalid. (Jason)
Move padding to be after the clip distance.
v3: Apply the correct offset when gathering the sources from outputs.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> [v2]
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/2313>
This commit enables the I/O vectorization pass that was originally
written for ACO for Intel drivers. We enable it for UBOs, SSBOs, global
memory, and SLM. We only enable vectorization for the scalar back-end
because it vec4 makes certain alignment assumptions.
Shader-db results with iris on ICL:
total instructions in shared programs: 16077927 -> 16068236 (-0.06%)
instructions in affected programs: 199839 -> 190148 (-4.85%)
helped: 324
HURT: 0
helped stats (abs) min: 2 max: 458 x̄: 29.91 x̃: 4
helped stats (rel) min: 0.11% max: 38.94% x̄: 4.32% x̃: 1.64%
95% mean confidence interval for instructions value: -37.02 -22.80
95% mean confidence interval for instructions %-change: -5.07% -3.58%
Instructions are helped.
total cycles in shared programs: 336806135 -> 336151501 (-0.19%)
cycles in affected programs: 16009735 -> 15355101 (-4.09%)
helped: 458
HURT: 154
helped stats (abs) min: 1 max: 77812 x̄: 1542.50 x̃: 75
helped stats (rel) min: <.01% max: 34.46% x̄: 5.16% x̃: 2.01%
HURT stats (abs) min: 1 max: 22800 x̄: 336.55 x̃: 20
HURT stats (rel) min: <.01% max: 17.11% x̄: 2.12% x̃: 1.00%
95% mean confidence interval for cycles value: -1596.83 -542.49
95% mean confidence interval for cycles %-change: -3.83% -2.82%
Cycles are helped.
total sends in shared programs: 814177 -> 809049 (-0.63%)
sends in affected programs: 15422 -> 10294 (-33.25%)
helped: 324
HURT: 0
helped stats (abs) min: 1 max: 256 x̄: 15.83 x̃: 2
helped stats (rel) min: 1.33% max: 67.90% x̄: 21.21% x̃: 15.38%
95% mean confidence interval for sends value: -19.67 -11.98
95% mean confidence interval for sends %-change: -23.03% -19.39%
Sends are helped.
LOST: 7
GAINED: 2
Most of the helped shaders were in the following titles:
- Doom
- Deus Ex: Mankind Divided
- Aztec Ruins
- Shadow of Mordor
- DiRT Showdown
- Tomb Raider (Rise, I think)
Five of the lost programs are SIMD16 shaders we lost from dirt showdown.
The other two are compute shaders in Aztec Ruins which switched from
SIMD8 to SIMD16.
Vulkan pipeline-db stats on ICL:
Instructions in all programs: 296780486 -> 293493363 (-1.1%)
Loops in all programs: 149669 -> 149669 (+0.0%)
Cycles in all programs: 90999206722 -> 88513844563 (-2.7%)
Spills in all programs: 1710217 -> 1730691 (+1.2%)
Fills in all programs: 1931235 -> 1958138 (+1.4%)
By far the most help was in the Tomb Raider games. A couple of Batman
games with DXVK were also helped. In Shadow of the Tomb Raider:
Instructions in all programs: 41614336 -> 39408023 (-5.3%)
Loops in all programs: 32200 -> 32200 (+0.0%)
Cycles in all programs: 1875498485 -> 1667034831 (-11.1%)
Spills in all programs: 196307 -> 214945 (+9.5%)
Fills in all programs: 282736 -> 307113 (+8.6%)
Benchmarks of real games I've done on this patch:
- Rise of the Tomb Raider: +3%
- Shadow of the Tomb Raider: +10%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4367>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4367>
Thanks to the NIR vectorizing pass, we're about to see alignments that
are higher than the bit size. Previously, we could use either and we
just happened to choose alignment (probably the wrong choice) so it's
harmless to switch to detecting based on bit size. This commit changes
things to take both into account which is more accurate to what the
messages we're using do. We also beef up the asserts and make them more
consistent, more accurate, and more complete.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4367>
This change incurs a small amount of hurt now, but it enables a lot of
benefit on vec4 shaders on the next commit. nir_opt_algebraic_late
converts dph, dot3, etc. to dhp_replicated, dot_replicated3, etc. In
the process, it introduces extra moves. If the original NIR contained
vec1 32 ssa_45 = fdot4 ssa_51, ssa_44
vec1 32 ssa_46 = fneg ssa_45
nir_opt_algebraic_late will produce
vec4 32 ssa_18 = fdot_replicated4 ssa_1, ssa_15
vec1 32 ssa_19 = mov ssa_18.x
vec1 32 ssa_17 = fneg ssa_19
The algebraic pass added in the next commit can't see through the move
to know that the fneg applies to a fdot_replicated4.
Haswell, Ivy Bridge, and Sandybridge had similar results. (Haswell shown)
total cycles in shared programs: 187077604 -> 187079858 (<.01%)
cycles in affected programs: 350132 -> 352386 (0.64%)
helped: 174
HURT: 194
helped stats (abs) min: 2 max: 124 x̄: 23.60 x̃: 16
helped stats (rel) min: 0.12% max: 15.88% x̄: 4.98% x̃: 3.86%
HURT stats (abs) min: 2 max: 164 x̄: 32.78 x̃: 16
HURT stats (rel) min: 0.17% max: 22.82% x̄: 6.46% x̃: 0.86%
95% mean confidence interval for cycles value: 2.04 10.21
95% mean confidence interval for cycles %-change: 0.17% 1.93%
Cycles are HURT.
No shader-db changes on any other Intel platform.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/1359>
This has been reported to fix a hang in Shadow of Mordor on Gen12.
One of its compute shaders seems to cause an in-order exec_all
dependency to be merged into an out-of-order SET dependency slot,
which would prevent us from baking the SET dependency into the parent
instruction, leading to an assert failure in emit_inst_dependencies()
(Thanks to Rafael for noticing that). Prevent that by avoiding
combination of in-order dependencies whenever that would cause a SET
dependency to be demoted to a SYNC.NOP instruction.
Fixes: e14529ff32 "intel/fs/gen12: Workaround data coherency issues due to broken NoMask control flow."
Tested-by: Rafael Antognolli <rafael.antognolli@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This will avoid generating multiple identical fences in a row.
For Gen11+ we have multiple types of fences (affecting different
variable modes), but is still better to combine them in a single
scoped barrier so that the translation to backend IR have the option
of dispatching both fences in parallel.
This will clean up redundant barriers from various
dEQP-VK.memory_model.* tests.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3224>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3224>
Knowing following:
- CMP writes to flag register the result of
applying cmod to the `src0 - src1`.
After that it stores the same value to dst.
Other instructions first store their result to
dst, and then store cmod(dst) to the flag
register.
- inst is either CMP or MOV
- inst->dst is null
- inst->src[0] overlaps with scan_inst->dst
- inst->src[1] is zero
- scan_inst wrote to a flag register
There can be three possible paths:
- scan_inst is CMP:
Considering that src0 is either 0x0 (false),
or 0xffffffff (true), and src1 is 0x0:
- If inst's cmod is NZ, we can always remove
scan_inst: NZ is invariant for false and true. This
holds even if src0 is NaN: .nz is the only cmod,
that returns true for NaN.
- .g is invariant if src0 has a UD type
- .l is invariant if src0 has a D type
- scan_inst and inst have the same cmod:
If scan_inst is anything than CMP, it already
wrote the appropriate value to the flag register.
- else:
We can change cmod of scan_inst to that of inst,
and remove inst. It is valid as long as we make
sure that no instruction uses the flag register
between scan_inst and inst.
Nine new cmod_propagation unit tests:
- cmp_cmpnz
- cmp_cmpg
- plnnz_cmpnz
- plnnz_cmpz (*)
- plnnz_sel_cmpz
- cmp_cmpg_D
- cmp_cmpg_UD (*)
- cmp_cmpl_D (*)
- cmp_cmpl_UD
(*) this would fail without changes to brw_fs_cmod_propagation.
This fixes optimisation that used to be illegal (see issue #2154)
= Before =
0: linterp.z.f0.0(8) vgrf0:F, g2:F, attr0<0>:F
1: cmp.nz.f0.0(8) null:F, vgrf0:F, 0f
= After =
0: linterp.z.f0.0(8) vgrf0:F, g2:F, attr0<0>:F
Now it is optimised as such (note change of cmod in line 0):
= Before =
0: linterp.z.f0.0(8) vgrf0:F, g2:F, attr0<0>:F
1: cmp.nz.f0.0(8) null:F, vgrf0:F, 0f
= After =
0: linterp.nz.f0.0(8) vgrf0:F, g2:F, attr0<0>:F
No shaderdb changes
Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2154
Signed-off-by: Yevhenii Kolesnikov <yevhenii.kolesnikov@globallogic.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3348>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3348>
Avoid looping over all VARYING_SLOT_MAX urb_setup array
entries from genX_upload_sbe. Prepare an array indirection
to the active entries of urb_setup already in the compile
step. On upload only walk the active arrays.
v2: Use uint8_t to store the attribute numbers.
v3: Change loop to build up the array indirection.
v4: Rebase.
v5: Style fix.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/308>
The current workaround for this hardware bug involved marking the ADD
instruction used to initialize the address register as NoMask on
Gen12, which was based on the assumption that the problem was caused
by a hardware bug affecting the application of the execution mask to
the address register write.
However that doesn't seem to be the case: The address register write
was working correctly, the real problem leading to hangs on TGL is
that the indirect addressing logic is unable to deal with garbage
values in the address register (e.g. misaligned offsets), even for
channels which are currently inactive due to non-uniform control flow.
The current workaround isn't able to avoid that situation in general,
since the result of the NoMask ADD instruction for a dead channel is
calculated based on the corresponding (dead) component of the
indirect_byte_offset source, which would still be undefined in the
likely case that the source was initialized under control flow itself.
This would lead to hangs whenever MOV_INDIRECT was used under
non-uniform control flow in some scenarios like a tessellation shader
from GFXBench5/gl_4 (AKA Car Chase) on TGL. In addition I've managed
to reproduce the same issue on earlier platforms by initializing the
whole address register with garbage before the ADD instruction, so
this seems to be a long-standing issue we have avoided mostly by luck.
This patch fixes the problem and applies the workaround to all
platforms, since even when the hardware is able to deal with garbage
address values without hanging there might be a significant
performance cost from reading random GRF registers due to the useless
extra EU cycles spent fetching registers for dead channels and due to
the potential for unintended serialization with respect to other
random instructions that could be executed in parallel, which may have
had a cost of the order of hundreds of cycles in the worst case
scenario.
Fixes: f93dfb509c "intel/fs: Write the address register with NoMask for MOV_INDIRECT"
Tested-by: Rafael Antognolli <rafael.antognolli@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Passing shader_stats to the fs_generator constructor means that the
SIMD8 shader stats from the visitor (such as the scheduler mode) will be
reported out for the SIMD16/SIMD32 versions as well.
As you can see, we are now passing 'shader_stats' and 'stats' to
generate_code(), which is obviously odd looking. Ian rebased and
committed an old patch of mine which added the shader_stats struct on
July 30 in commit dabb5d4bee (i965/fs: Add a shader_stats struct.) and
shortly after on August 12 Jason added the brw_compile_stats struct in
commit 134607760a (intel/compiler: Fill a compiler statistics struct).
I'd like to combine the two, but I'm not sure how. shader_stats is an
input to generate_code() while brw_compile_stats is an output and is
only used by the Vulkan driver. Leave it as is for now...
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>
This defines a new BRW_ANALYSIS object which wraps the register
pressure computation code along with its result. For the rationale
see the previous commits converting the liveness and dominance
analysis passes to the IR analysis framework.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>