The old code was arranged by the type of interference being added. It
would set up payload registers and then add payload interference for all
VGRFs. It would set up MRFs and add MRF interference for all VGRFs.
This commit re-arranges things. It first creates and sets up all the
RA nodes and then groups interference into two
new categories: live range and instruction interference. Once all the
RA nodes have been set up, it walks the list of VGRFs and sets up their
live range interference and then walks the list of instructions and sets
up instruction interference. This new arrangement will be advantageous
for a future patch but, at the moment, it cuts 2% off the run-time of
shader-db on my laptop.
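A rough outline of the new arrangement (the helper names below are
illustrative, not necessarily the ones in brw_fs_reg_allocate.cpp):

    /* Create and configure every RA node first... */
    setup_payload_mrf_and_vgrf_nodes();

    /* ...then add live-range interference per VGRF... */
    for (unsigned i = 0; i < fs->alloc.count; i++)
       setup_live_range_interference(i);

    /* ...and finally instruction-specific interference per instruction. */
    foreach_block_and_inst(block, fs_inst, inst, fs->cfg)
       setup_inst_interference(inst);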
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311224 -> 15311224 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355544739 -> 355544739 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2523.45 -> 2469.68 (-2.13%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The only use of the MRF hack these days is for spilling and there we
don't need the precise MRF usage information. If we're spilling then we
know pretty well how many MRFs are going to be used. If the only things
that get spilled have fewer SIMD channels than the dispatch width of the
shader, this may reserve more MRFs than needed. That's a risk we're
willing to take.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311224 (<.01%)
instructions in affected programs: 16664 -> 16788 (0.74%)
helped: 1
HURT: 5
total cycles in shared programs: 355543197 -> 355544739 (<.01%)
cycles in affected programs: 731864 -> 733406 (0.21%)
helped: 3
HURT: 6
The hurt shaders are all SIMD32 compute shaders where we reserve enough
space for a 32-wide spill/fill but don't need it.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This accomplishes two things. First, it makes interfaces which are
really private to RA private to RA. Second, it gives us a place to
store some common stuff as we go through the algorithm.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
There's no reason why we need to use the calculated payload_node_count
value which is just first_non_payload_grf aligned up. The grf_used
value will be aligned up to 16 anyway (which is a much bigger alignment)
before being handed off to hardware.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We only have one node per VGRF so this was adding way too much
interference. No idea how we didn't catch this before.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355543197 (0.02%)
cycles in affected programs: 2472492 -> 2547639 (3.04%)
helped: 17
HURT: 20
Fixes: 014edff0d2 "intel/fs: Add interference between SENDS sources"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
In the last phase of the schedule and RA loop, the RA call is redundant
if we spill. Immediately afterwards, we're going to see that we
couldn't allocate without spilling and call back into RA and tell it to
go ahead and spill. We've known about it for a while but we've always
brushed over it on the theory that, if you're going to spill, you'll be
calling RA a bunch anyway and what does one extra RA hurt? As it turns
out, it hurts more than you'd expect. Because the RA interference graph
gets sparser with each spill and the RA algorithm is more efficient on
sparser graphs, the RA call that we're duplicating is actually the most
expensive call in the RA-and-spill loop.
There's another extra RA call, a bit harder to see, which this commit
also removes. If we try to compile a shader that isn't the minimum
dispatch width and it fails to allocate without spilling, we call
fail() to set an error, but then go ahead and do the first spilling RA
pass anyway; only after that's complete do we detect the fail and bail
out. By
making minimum dispatch widths part of the spill condition, we side-step
this problem.
Getting rid of these extra RA calls takes the compile time of a nasty
Aztec Ruins shader from about 28 seconds to about 26 seconds on my
laptop. It also makes shader-db 1.5% faster.
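The shape of the change, roughly (variable names are illustrative, not
the exact code):

    for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
       schedule_instructions(pre_modes[i]);

       /* On the last scheduling mode, allow RA to spill directly -- but
        * only if this is the minimum dispatch width -- instead of first
        * running a doomed non-spilling allocation and then re-running RA
        * with spilling enabled. */
       bool last_mode = (i == ARRAY_SIZE(pre_modes) - 1);
       bool allow_spilling = last_mode &&
                             dispatch_width == min_dispatch_width;

       if (assign_regs(allow_spilling, spill_all))
          return;
    }
    fail("Failure to register allocate.");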
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355468050 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2524.31 -> 2486.63 (-1.49%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This can be used by both etnaviv and freedreno/a2xx as they are both vec4
architectures with some instructions being scalar-only.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
In order to set up KILL sets, the dataflow code was walking the entire
array of ACPs for every instruction. If you assume the number of ACPs
increases roughly with the number of instructions, this is O(n^2). As
it turns out, regions_overlap() is not nearly as cheap as one would like
and shows up as a significant chunk on perf traces.
This commit changes things around and instead first builds an array of
exec_lists which it uses like a hash table (keyed off ACP source or
destination) similar to what's done in the rest of the copy-prop code.
By first walking the list of ACPs and populating the table and then
walking instructions and only looking at ACPs which probably have the
same VGRF number, we can reduce the complexity to O(n). This takes the
execution time of the piglit vs-isnan-dvec test from about 56.4 seconds
on an unoptimized debug build (what we run in CI) with NIR_VALIDATE=0 to
about 38.7 seconds.
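Sketched out (the names here are illustrative; the real ones in
brw_fs_copy_propagation.cpp may differ, and only the destination side is
shown for brevity), the kill-set setup becomes:

    /* Bucket the ACPs by destination VGRF number, then have each
     * instruction scan just its bucket instead of the full ACP array. */
    exec_list acp_table[ACP_HASH_SIZE];

    for (int i = 0; i < num_acp; i++)
       acp_table[acp[i]->dst.nr % ACP_HASH_SIZE].push_tail(&acp[i]->link);

    foreach_block_and_inst(block, fs_inst, inst, cfg) {
       if (inst->dst.file != VGRF)
          continue;

       exec_list *bucket = &acp_table[inst->dst.nr % ACP_HASH_SIZE];
       foreach_in_list(acp_entry, entry, bucket) {
          if (regions_overlap(inst->dst, inst->size_written,
                              entry->dst, entry->size_written))
             BITSET_SET(bd[block->num].kill, entry->global_idx);
       }
    }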
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
If the destination of an ACP entry exists only within this block, then
there's no need to keep it for dataflow analysis. We can delete it from
the out_acp table and avoid growing the bitsets any bigger than we
absolutely have to. This reduces the maximum number of global ACP
entries in the vs-isnan-dvec with software fp64 on Kaby Lake from 8630
to 3942 and takes the execution time of the piglit vs-isnan-dvec test
from about 1:16.2 on an unoptimized debug build (what we run in CI) with
NIR_VALIDATE=0 to about 56.4 seconds.
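The pruning looks something like this (helper and field names here are
hypothetical):

    foreach_in_list_safe(acp_entry, entry, &out_acp[block->num][i]) {
       /* If the ACP's destination VGRF dies before the end of this block,
        * it can never be propagated into another block, so drop it from
        * the out set and keep the global dataflow bitsets small. */
       if (live.vgrf_end[entry->dst.nr] <= block->end_ip)
          entry->remove();
    }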
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
While the number of ACPs is generally not huge compared to the number of
blocks, 16 does seem a bit small. Bumping it to 64 takes the execution
time of the piglit vs-isnan-dvec test from about 1:18.1 on an unoptimized
debug build (what we run in CI) with NIR_VALIDATE=0 to about 1:16.2.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
In the FS IR we pretend that the instruction is predicated with (+f0.1)
just for flag dependency tracking purposes. Since the instruction
doesn't support predication before Haswell, we unset the predicate; we
should also unset the flag register so that we can round-trip the
disassembly.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
On Haswell, for the DIM instruction, we encode the immediate float
value operand as a double float.
v2: Fix comment (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
For the W or UW (signed or unsigned word) source types, the 16-bit value
must be replicated in both the low and high words of the 32-bit
immediate value.
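For example (illustrative helper, not the exact encoder code):

    static uint32_t
    pack_word_imm(uint16_t v)
    {
       /* Replicate the 16-bit value into both the low and high words of
        * the 32-bit immediate. */
       return (uint32_t)v | ((uint32_t)v << 16);
    }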
v2: Fix replication in other places as well
v3: Fix a few nits (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Print the quad value the same as unsigned quad so that we can
distinguish between quarter control disassembled values (e.g. 1/2/3[Q])
and an immediate quad value (e.g. 1Q). This allows round-tripping
through the assembler/disassembler.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
I tried to be very careful while updating all the various drivers, but I
don't have any of that hardware for testing. :(
i965 is the only platform that sets always_precise = true, and it is
only set true for fragment shaders. Gen4 and Gen5 both set lower_flrp32
only for vertex shaders. For fragment shaders, nir_op_flrp is lowered
during code generation as a(1-c)+bc. On all other platforms 64-bit
nir_op_flrp and on Gen11 32-bit nir_op_flrp are lowered using the old
nir_opt_algebraic method.
No changes on any other Intel platforms.
v2: Add panfrost changes.
Iron Lake and GM45 had similar results. (Iron Lake shown)
total cycles in shared programs: 188647754 -> 188647748 (<.01%)
cycles in affected programs: 5096 -> 5090 (-0.12%)
helped: 3
HURT: 0
helped stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2
helped stats (rel) min: 0.12% max: 0.12% x̄: 0.12% x̃: 0.12%
Reviewed-by: Matt Turner <mattst88@gmail.com>
Drivers which do not support native integers should use a lowering
pass to go from integers to floats.
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Since 09f1de97a7 "anv,i965: Lower away image derefs in the driver"
the backend compiler is not expected to handle any derefs, so let's
assert on it.
This helps identify problems when a deref is not lowered and
"leaks" into the backend compiler.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
One special case, `src/util/xmlpool/.gitignore` is not entirely deleted,
as `xmlpool.pot` still gets generated (eg. by `ninja xmlpool-pot`).
Signed-off-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
While we can clean this up later, it's trivial to not generate the
stupid code in the first place, which saves some optimization work.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
These will be lowered by nir_lower_tex() with the
lower_tex_when_implicit_lod_not_supported option, so they don't need
the extra handling here.
Reviewed-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
One cannot write the URB arbitrarily and therefore the message
has to be carefully constructed. The clever tricks originate
from Kenneth and Jason; I'm just writing the patch.
Fixes GPU hangs on ICL with Vulkan CTS.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
This reverts commit 40b3abb4d1.
It is not clear that this commit was entirely correct, and
unfortunately it was pushed by mistake.
CC: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Found by GCC warning:
src/intel/compiler/brw_fs_combine_constants.cpp: In function ‘bool needs_negate(const fs_reg*, const imm*)’:
src/intel/compiler/brw_fs_combine_constants.cpp:306:34: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
return ((reg->d & 0xffffu) < 0) != (imm->w < 0);
~~~~~~~~~~~~~~~~~~~^~~
The result of the bit-and is a 32-bit value with the top bits all zero.
This will never be < 0. Instead of masking off the bits, just cast to
int16_t and let the compiler handle the actual conversion.
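i.e., roughly:

    -   return ((reg->d & 0xffffu) < 0) != (imm->w < 0);
    +   return ((int16_t)reg->d < 0) != (imm->w < 0);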
Fixes: e64be391dd ("intel/compiler: generalize the combine constants pass")
Cc: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
flrp32 is also a 3-source instruction, but there is another pending
series that handles that for Gen4 and Gen5.
v2: Rebase on "intel/compiler: Don't have sepearate, per-Gen
nir_options"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Instead, just have separate scalar vs. vector nir_options and do
per-Gen "fix ups".
Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Allow ATTR and IMM sources unconditionally (ATTR are just GRFs, IMM
will be handled by opt_combine_constants()). Both are already allowed
by opt_copy_propagation().
Also allow FIXED_GRF if the regioning is 8,8,1. Could also allow other
stride=1 regions (e.g., 4,4,1) and scalar regions but I don't think
those occur. This is sufficient to allow a pass added in a future commit
(fs_visitor::lower_linterp) to avoid emitting extra MOV instructions.
I removed the 'src.stride > 1' case because it seems wrong: 3-src
instructions on Gen6-9 are align16-only and can only do stride=1 or
stride=0. A run through Jenkins with an assert(src.stride <= 1) never
triggers, so it seems that it was dead code.
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
The two new unit tests verify that propagating a saturate between
instructions of different exec sizes does not happen.
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Will allow us to test that propagation between instructions of different
exec sizes does not happen (in the next commit).
The stray-looking change in intervening_dest_write is to adjust the size
of the texture result to keep the test functioning identically when the
instructions' exec sizes are doubled. Without the change, the texture
does not overwrite the destination fully as the unit test intends.
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
We now have a lowering pass that will do this at the fs_visitor level,
so we can remove this code from gen11+.
v2: Reduce size of the "i" array from 4 to 2 (Matt).
Reviewed-by: Matt Turner <mattst88@gmail.com>
On gen11, instead of using a PLN instruction, we convert
FS_OPCODE_LINTERP to 2 or 4 multiply adds. That is done in the
fs_generator code.
This patch adds a lowering pass that does the same thing at the
fs_visitor level. It also drops the usage of NF types, since we don't
need the
extra precision and it lets us skip the accumulator. With all that, some
optimizations will still be run on the generated code, and we should get
better scheduling.
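For illustration (not the exact code in the new pass), each SIMD8 half
of a LINTERP turns into roughly the following, with placeholder register
names:

    /* dst = p0 + p_x * delta_x + p_y * delta_y, built from two MADs
     * instead of one PLN. */
    fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F);
    bld.MAD(tmp, p0,  delta_x, p_x);   /* tmp = p0 + delta_x * p_x  */
    bld.MAD(dst, tmp, delta_y, p_y);   /* dst = tmp + delta_y * p_y */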
v2: Update comment about saturation and conditional mod (Matt)
Reviewed-by: Matt Turner <mattst88@gmail.com>
Move the scalar-region conversion from the IR to the generator, so it
doesn't affect the Gen11 path. We need the non-scalar regioning
for a later lowering pass that we are adding.
v2: Better commit message (Matt)
Reviewed-by: Matt Turner <mattst88@gmail.com>
Otherwise we could propagate the saturation from a SIMD16 instruction
into a SIMD8 instruction. In that case, only part of the destination
register (the source of the saturating move) would have been updated.
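Sketch of the added guard (names illustrative):

    /* Don't fold the saturate back into the producing instruction when
     * the exec sizes differ: the saturating MOV only covers part of the
     * producer's destination. */
    if (scan_inst->exec_size != inst->exec_size)
       break;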
Reviewed-by: Matt Turner <mattst88@gmail.com>
Commit ad98fbc217 ("intel/fs: Refactor code generation for nir_op_fsign
to its own function") criss-crossed with c2b8fb9a81 ("anv/device:
expose VK_KHR_shader_float16_int8 in gen8+"), and I was not paying
enough attention when I rebased. This adds back the float16 changes and
enables the optimization.
v2: Incorporate more changes from 19cd2f5deb and a8d8b1a139 that I
missed in the previous version.
Fixes: ad98fbc217 ("intel/fs: Refactor code generation for nir_op_fsign to its own function")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110474
Reviewed-by: Matt Turner <mattst88@gmail.com> [v1]
We add two new texture sources for bindless surface and sampler handles.
Bindless surface handles are expected to be pre-shifted so that the
20-bit surface state table index is in the top 20 bits of the 32-bit
handle. This lets us avoid any extra shifts in the shader. Bindless
sampler handles are 32-byte aligned byte offsets from general state base
address. We use 32-byte aligned instead of 16-byte aligned to avoid
having to use more indirect messages than needed. It means we can't
tightly pack samplers but that's probably not a big deal.
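For example (illustrative, driver-side; not part of this patch), the
handles would be built as:

    /* Surface: 20-bit surface state index pre-shifted into the top 20
     * bits of the 32-bit handle (32 - 20 = 12-bit shift). */
    uint32_t surface_handle = surface_state_index << 12;

    /* Sampler: a 32-byte-aligned byte offset from general state base
     * address, used directly as the handle. */
    uint32_t sampler_handle = sampler_state_offset;   /* offset % 32 == 0 */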
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
When we have a bindless sampler, we need an instruction header. Even in
SIMD8, this pushes the instruction over the sampler message size maximum
of 11 registers. Instead, we have to lower TXD to TXL.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We're about to start doing 64-bit pointer calculations in ANV. They
will get applied after brw_preprocess_nir which is where we currently do
64-bit integer arithmetic lowering. Because we're adding 64-bit integer
arithmetic after the initial lowering has happened, we need to lower
again.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The current register allocator has a concept of "spill benefit" which is
based on the number of nodes with which a given node interferes. The
idea is that you want to spill stuff with high interference because
those are the most likely registers to help when spilling. However,
this fails to take into account the length of the live range so the
allocator frequently picks "cheap" (not many uses) registers which are
actually very short lived and so spilling them doesn't help with the
pressure situation.
This commit takes into account the length of the live range to make
long-lived registers more likely to get spilled than short-lived ones.
This encourages the spill chooser to pick slightly larger registers
which affect a larger area of the program, so hopefully we have to
spill fewer of them to get the same reduction in overall register
pressure.
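A minimal sketch of the idea (this is not the exact formula used by the
allocator; names are hypothetical):

    float
    spill_benefit(int node, const int *ip_first_use, const int *ip_last_use,
                  const float *interference_benefit)
    {
       /* Scale the interference-based benefit by the extent of the live
        * range so that long-lived registers look like better spill
        * candidates than short-lived ones. */
       int live_length = ip_last_use[node] - ip_first_use[node];
       return interference_benefit[node] * live_length;
    }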
Shader-db results on Kaby Lake:
total spills in shared programs: 23664 -> 12050 (-49.08%)
spills in affected programs: 19243 -> 7629 (-60.35%)
helped: 296
HURT: 8
total fills in shared programs: 32028 -> 25139 (-21.51%)
fills in affected programs: 20378 -> 13489 (-33.81%)
helped: 295
HURT: 16
Of course, most of that is in Deus Ex...
Shader-db results on Kaby Lake (without Deus Ex):
total spills in shared programs: 6479 -> 5834 (-9.96%)
spills in affected programs: 3231 -> 2586 (-19.96%)
helped: 40
HURT: 4
total fills in shared programs: 17165 -> 17099 (-0.38%)
fills in affected programs: 6951 -> 6885 (-0.95%)
helped: 40
HURT: 7
Even without Deus Ex, the spill help is pretty respectable. The worst
hurt shaders were one compute shader in Aztec Ruins and one fragment
shader in KSP that were each hurt by around 13% fill and 9% spill.
VkPipeline-db results on Kaby Lake:
total spills in shared programs: 9149 -> 8069 (-11.80%)
spills in affected programs: 5197 -> 4117 (-20.78%)
helped: 27
HURT: 16
total fills in shared programs: 26390 -> 25477 (-3.46%)
fills in affected programs: 12662 -> 11749 (-7.21%)
helped: 24
HURT: 22
The Vulkan results were decidedly more mixed but we don't have nearly as
many apps in that database yet.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Normally fsign generates -1, 0, or +1. The new scale factor, S, causes
fsign to generate -S, 0, or +S.
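In plain C, the new semantics are (illustration only, not the backend
lowering):

    #include <math.h>

    float
    fsign_scaled(float x, float s)
    {
       /* -S, 0, or +S depending on the sign of x. */
       return x == 0.0f ? 0.0f : copysignf(s, x);
    }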
v2: Rebase on v2 changes in previous commit.
v3: Rebase on 85c35885b3 ("nir: Rework nir_src_as_alu_instr to not take
a pointer").
Reviewed-by: Matt Turner <mattst88@gmail.com> [v2]