Originally, we had virtual opcodes for scratch access, and let the
generator count spills/fills separately from other sends. Later, we
started using the generic SHADER_OPCODE_SEND for spills/fills on some
generations of hardware, and simply detected stateless messages there.
But then we started using stateless messages for other things:
- anv uses stateless messages for the buffer device address feature.
- nir_opt_large_constants generates stateless messages.
- XeHP curbe setup can generate stateless messages.
So counting stateless messages is not accurate. Instead, we move the
spill/fill accounting to the register allocator, as it generates such
things, as well as the load/store_scratch intrinsic handling, as those
are basically spill/fills, just at a higher level.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16691>
Since 40e1d798c6, we are now using physical register numbers for
everything which makes it all simpler. In particular, we no longer need
the special case for setting up the payload for SIMD16 on Gen4-5. This
fixes a pile of piglit tests on ILK and similar.
Fixes: 40e1d798c6 "intel/fs: Use ra_alloc_contig_reg_class()..."
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11221>
By using the new class type, we don't need to make 1928 different
registers to represent each contigous reg size starting from the actual
128 HW register, or have a mapping between RA regs and HW base regs. With
the number of regs reduced, and the fast q computation when using the new
classes, we no longer need to compute our own q.
This drops the FS RA initialization time on my CFL system from about 1ms to
50us.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9437>
Those helpers exist primarily to sort out some of the weirdness around
Gen4-6 dataport access. On Gen5 and earlier, everything was called
"dataport" and, instead of the SFID we have today there was a "target
cache" parameter in the descriptor. There are also some bits that moved
around on various gens depending on read vs. write. Starting with Gen6,
most things which target one of the data cache SFIDs should use
brw_dp_desc() instead.
v2: Drop backward comment (Ken)
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7455>
Starting with d0d039a4d3, we emit writes to the push constant chunk
of the payload to stomp out-of-bounds data to zero for Vulkan. Then, in
369eab9420, we started emitting shader preamble code for emulated
push constants on Gen12.5 parts. In either of these cases, we can run
into issues if we don't have a proper live range for some of the payload
registers where they get used for something and then smashed by our push
handling code. We've not seen many issues with this yet because it only
happens when you have dead push constants.
Fixes: d0d039a4d3 "anv: Emit pushed UBO bounds checking code..."
Fixes: 369eab9420 "intel/fs: Emit code for Gen12-HP indirect..."
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9501>
The current scratch mechanism uses an MRF hack where we reserve a few
GRF registers to treat like the MRF and we collect the data into that
MRF region before doing a scratch write. We also use that region for
the header for scratch reads.
This commit changes things and gets rid of the MRF hack. Instead, we
reserve a single register (which RA is free to pick) for the scratch
header and uses split sends for scratch writes to avoid having to do
the copy. This should provide RA with more freedom in the presence of
spilling as well as avoid some unnecessary data moves. In future, the
new GEN9_SCRATCH_HEADER opcode gives us a place where we can do our own
per-thread scratch base address calculations rather than depending on
the scratch base address that gets pushed into g0. Having an opcode for
this lets us do it once at the top of the shader rather than repeating
it at every read/write.
One other noticeable difference is the use of SHADER_OPCODE_SEND. We
can get away with this thanks to the fact that we're now using a set to
track which instructions are generated by spills and don't rely on the
opcodes to find spill/fill instructions. This allows us to avoid adding
more virtual opcodes and let the normal code paths handle things like
scoreboard dependencies between header setup and the SEND. It also
means that post-RA scheduling may be able to space out the header setup
MOV and the SEND for better latency hiding.
Shader-db results on Skylake:
total spills in shared programs: 12137 -> 10604 (-12.63%)
spills in affected programs: 6685 -> 5152 (-22.93%)
helped: 274
HURT: 2
total fills in shared programs: 13065 -> 11515 (-11.86%)
fills in affected programs: 9007 -> 7457 (-17.21%)
helped: 275
HURT: 1
Shader-db results on Ice Lake:
total spills in shared programs: 12482 -> 10953 (-12.25%)
spills in affected programs: 6586 -> 5057 (-23.22%)
helped: 275
HURT: 0
total fills in shared programs: 12819 -> 11234 (-12.36%)
fills in affected programs: 7867 -> 6282 (-20.15%)
helped: 274
HURT: 0
Shader-db results on Tigerlake:
total spills in shared programs: 11689 -> 10233 (-12.46%)
spills in affected programs: 4740 -> 3284 (-30.72%)
helped: 259
HURT: 0
total fills in shared programs: 10840 -> 9443 (-12.89%)
fills in affected programs: 6244 -> 4847 (-22.37%)
helped: 259
HURT: 0
Fossil-db results on Ice Lake:
Spills in all programs: 245249 -> 201633 (-17.8%)
Fills in all programs: 366066 -> 314368 (-14.1%)
More practically, this seems to give about a 0.5-1% perf boost in
Witcher 3 (DXVK) and Shadow of the Tomb Raider (Vulkan native).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
Starting with e99081e76d, we don't re-construct liveness information
every time we spill a register. Instead, we're very careful to track
which instructions are spill instructions and not contribute those to
the IP count so that we can continue to use the old liveness information
even though instructions have been added. This commit adds an assert
that sanity-checks that we count the same number of instructions as our
liveness information is based on.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
These variables seem to be initialized before being used, so this
patch is not fixing any bug, but leaving them unitialized may become
a bug after some refactoring.
These classes were affected: fs_reg_alloc, fs_visitor, fs_generator,
instruction_scheduler.
Found by Coverity.
Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6667>
This involves wrapping fs_live_variables in a BRW_ANALYSIS object and
hooking it up to invalidate_analysis() so it's properly invalidated.
Seems like a lot of churn but it's fairly straightforward. The
fs_visitor invalidate_ and calculate_live_intervals() methods are no
longer necessary after this change.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
This moves the following methods that are currently defined in
fs_visitor (even though they are side products of the liveness
analysis computation) and are already implemented in
brw_fs_live_variables.cpp:
> bool virtual_grf_interferes(int a, int b) const;
> int *virtual_grf_start;
> int *virtual_grf_end;
It makes sense for them to be part of the fs_live_variables object,
because they have the same lifetime as other liveness analysis results
and because this will allow some extra validation to happen wherever
they are accessed in order to make sure that we only ever use
up-to-date liveness analysis results.
This shortens the virtual_grf prefix in order to compensate for the
slightly increased lexical overhead from the live_intervals pointer
dereference.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
Have fun reading through the whole back-end optimizer to verify
whether I've missed any dependency flags -- Or alternatively, just
trust that any mistake here will trigger an assertion failure during
analysis pass validation if it ever poses a problem for the
consistency of any of the analysis passes managed by the framework.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
The invalidate_analysis() method knows what analysis passes there are
in the back-end and calls their invalidate() method to report changes
in the IR. For the moment it just calls invalidate_live_intervals()
(which will eventually be fully replaced by this function) if anything
changed.
This makes all optimization passes invalidate DEPENDENCY_EVERYTHING,
which is clearly far from ideal -- The dependency classes passed to
invalidate_analysis() will be refined in a future commit.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
This is mainly meant to avoid shader-db regressions on SNB as we start
using VGRFs for barycentrics more frequently. Currently the
aligned_pairs_class is only useful in SIMD8 mode, because in SIMD16
mode barycentric vectors are typically 4 GRFs. This is not a problem
on Gen4-5, because on those platforms all VGRF allocations are
pair-aligned in SIMD16 mode. However on Gen6 we end up using either
the fast or the slow path of LINTERP rather non-deterministically
based on the behavior of the register allocator.
Fix it by repurposing aligned_pairs_class to hold PLN-aligned
registers of whatever the natural size of a barycentric vector is in
the current dispatch width.
On SNB this prevents the following shader-db regressions (including
SIMD32 programs) in combination with the interpolation rework part of
this series:
total instructions in shared programs: 13983257 -> 14527274 (3.89%)
instructions in affected programs: 1766255 -> 2310272 (30.80%)
helped: 0
HURT: 11608
LOST: 26
GAINED: 13
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Previously we would hardcode fs_visitor::delta_xy barycentrics to be
allocated from aligned_pairs_class on hardware with PLN source
alignment restrictions (pre-Gen7). Instead allocate any registers
consumed by LINTERP from aligned_pairs_class, even if some barycentric
vector had ended up in a temporary.
On SNB this prevents the following shader-db regressions (including
SIMD32 programs) in combination with the interpolation rework part of
this series:
total instructions in shared programs: 13983257 -> 14527274 (3.89%)
instructions in affected programs: 1766255 -> 2310272 (30.80%)
helped: 0
HURT: 11608
LOST: 26
GAINED: 13
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This will be convenient in a later commit enabling SIMD32 fragment
shaders, and happens to fix the calculation for MATH instructions
which is currently inaccurate for SIMD-lowered instructions on Gen4-5
platforms (all of them on Gen4 in SIMD16 mode), since it was based on
the shader's dispatch width rather than on the actual execution size
of the instruction.
This causes some shader-db noise on Gen4 due to the more compact
register allocation interacting with the SEND dependency workarounds,
but otherwise no major changes.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The problem occured when the return payload of a SIMD8 SEND
instruction was re-used as source payload of an EOT SEND message. In
such cases the interference edge added by that workaround between the
payload and grf127_send_hack_node would have no effect, because the
payload would be allocated to a fixed range of registers containing
r127 by the special handling of EOT message payloads in the same
function. This would cause things to blow up if the source payload of
the first SIMD8 message ended up being allocated to a range which
happened to overlap the destination.
Fix it by avoiding r127 altogether in the allocation of EOT message
payloads.
The problem can be reproduced on ICL with the fp-indirections2 Piglit
test-case in combination with the other optimizer changes of this
series.
Fixes: 232ed89802 "i965/fs: Register allocator shoudn't use grf127 for sends dest"
Cc: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This might be slightly faster since we're doing one read rather than
two before we decide to skip. The more important reason, however, is
because no_spill prevents us from re-spilling spill registers. In the
new world in which we don't re-calculate liveness every spill, we may
not have valid liveness for spill registers so we shouldn't even look
their live ranges up.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110825
Fixes: e99081e76d "intel/fs/ra: Spill without destroying the..."
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Tested-by: Tapani Pälli <tapani.palli@intel.com>
Otherwise, we get an effectively random spill reg because we no longer
have the information from RA to guide us. Also, a completely clean
graph has undefined data in in_stack which is used for choosing the
spill reg so it really is non-deterministic.
Fixes: e99081e76d "intel/fs/ra: Spill without destroying the..."
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of re-building the interference graph every time we spill, we
modify it in place so we can avoid recalculating liveness and the whole
O(n^2) interference graph building process. We make a simplifying
assumption in order to do so which is that all spill/fill temporary
registers live for the entire duration of the instruction around which
we're spilling. This isn't quite true because a spill into the source
of an instruction doesn't need to interfere with its destination, for
instance. Not re-calculating liveness also means that we aren't
adjusting spill costs based on the new liveness. The combination of
these things results in a bit of churn in spilling. It takes a large
cut out of the run-time of shader-db on my laptop.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311224 -> 15311360 (<.01%)
instructions in affected programs: 77027 -> 77163 (0.18%)
helped: 11
HURT: 18
total cycles in shared programs: 355544739 -> 355830749 (0.08%)
cycles in affected programs: 203273745 -> 203559755 (0.14%)
helped: 234
HURT: 190
total spills in shared programs: 12049 -> 12042 (-0.06%)
spills in affected programs: 2465 -> 2458 (-0.28%)
helped: 9
HURT: 16
total fills in shared programs: 25112 -> 25165 (0.21%)
fills in affected programs: 6819 -> 6872 (0.78%)
helped: 11
HURT: 16
Total CPU time (seconds): 2469.68 -> 2360.22 (-4.43%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This is slightly less convenient in some places but it will make it much
easier when we want to start adding nodes dynamically.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The old code was arranged by the type of interference being added. It
would set up payload registers and then add payload interference for all
VGRFs. It would set up MRFs and add MRF interference for all VGRFs.
This commit re-arranges things to be organized differently. It first
creates and sets up all RA nodes and then groups interference into two
new categories: live range and instruction interference. Once all the
RA nodes have been set up, it walks the list of VGRFs and sets up their
live range interference and then walks the list of instructions and sets
up instruction interference. This new arrangement will be advantageous
for a future patch but, at the moment, it cuts 2% off the run-time of
shader-db on my laptop.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311224 -> 15311224 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355544739 -> 355544739 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2523.45 -> 2469.68 (-2.13%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The only use of the MRF hack these days is for spilling and there we
don't need the precise MRF usage information. If we're spilling then we
know pretty well how many MRFs are going to be used. It is possible if
the only things that are spilled have fewer SIMD channels than the
dispatch width of the shader that this may be more MRFs than needed.
That's a risk we're willing to takd.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311224 (<.01%)
instructions in affected programs: 16664 -> 16788 (0.74%)
helped: 1
HURT: 5
total cycles in shared programs: 355543197 -> 355544739 (<.01%)
cycles in affected programs: 731864 -> 733406 (0.21%)
helped: 3
HURT: 6
The hurt shaders are all SIMD32 compute shaders where we reserve enough
space for a 32-wide spill/fill but don't need it.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This accomplishes two things. First, it makes interfaces which are
really private to RA private to RA. Second, it gives us a place to
store some common stuff as we go through the algorithm.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
There's no reason why we need to use the calculated payload_node_count
value which is just first_non_payload_grf aligned up. The grf_used
value will be aligned up to 16 anyway (which is a much bigger alignment)
before being handed off to hardware.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We only have one node per VGRF so this was adding way too much
interference. No idea how we didn't catch this before.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355543197 (0.02%)
cycles in affected programs: 2472492 -> 2547639 (3.04%)
helped: 17
HURT: 20
Fixes: 014edff0d2 "intel/fs: Add interference between SENDS sources"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This reverts commit 40b3abb4d1.
It is not clear that this commit was entirely correct, and unfortunately
it was pushed by error.
CC: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
The current register allocator has a concept of "spill benefit" which is
based on the number of nodes with which a given node interferes. The
idea is that you want to spill stuff with high interference because
those are the most likely registers to help when spilling. However,
this fails to take into account the length of the live range so the
allocator frequently picks "cheap" (not many uses) registers which are
actually very short lived and so spilling them doesn't help with the
pressure situation.
This commit takes into account the length of the live range to make
long-lived registers more likely to get spilled than short-lived ones.
This encourages the spill chooser to choose slightly larger registers
which will affect a larger area of the program and hopefully we have to
spill fewer of them to get the same reduction in over-all register
pressure.
Shader-db results on Kaby Lake:
total spills in shared programs: 23664 -> 12050 (-49.08%)
spills in affected programs: 19243 -> 7629 (-60.35%)
helped: 296
HURT: 8
total fills in shared programs: 32028 -> 25139 (-21.51%)
fills in affected programs: 20378 -> 13489 (-33.81%)
helped: 295
HURT: 16
Of course, most of that is in Deus Ex...
Shader-db results on Kaby Lake (without Deus Ex):
total spills in shared programs: 6479 -> 5834 (-9.96%)
spills in affected programs: 3231 -> 2586 (-19.96%)
helped: 40
HURT: 4
total fills in shared programs: 17165 -> 17099 (-0.38%)
fills in affected programs: 6951 -> 6885 (-0.95%)
helped: 40
HURT: 7
Even without Deus Ex, the spill help is pretty respectable. The worst
hurt shaders were one compute shader in Aztec Ruins and one fragment
shader in KSP that were each hurt by around 13% fill 9% spill.
VkPipeline-db results on Kaby Lake:
total spills in shared programs: 9149 -> 8069 (-11.80%)
spills in affected programs: 5197 -> 4117 (-20.78%)
helped: 27
HURT: 16
total fills in shared programs: 26390 -> 25477 (-3.46%)
fills in affected programs: 12662 -> 11749 (-7.21%)
helped: 24
HURT: 22
The Vulkan results were decidedly more mixed but we don't have nearly as
many apps in that database yet.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>