In programs with a lot of unused temps, if we don't do this, we may
end up recycling previously used rfs more often, which can be
detrimental to instruction pairing.
total instructions in shared programs: 11464335 -> 11444136 (-0.18%)
instructions in affected programs: 8976743 -> 8956544 (-0.23%)
helped: 33196
HURT: 33778
Inconclusive result
total max-temps in shared programs: 2230150 -> 2229445 (-0.03%)
max-temps in affected programs: 86413 -> 85708 (-0.82%)
helped: 2217
HURT: 1523
Max-temps are helped.
total sfu-stalls in shared programs: 18077 -> 17104 (-5.38%)
sfu-stalls in affected programs: 8669 -> 7696 (-11.22%)
helped: 2657
HURT: 2182
Sfu-stalls are helped.
total inst-and-stalls in shared programs: 11482412 -> 11461240 (-0.18%)
inst-and-stalls in affected programs: 8995697 -> 8974525 (-0.24%)
helped: 33319
HURT: 33708
Inconclusive result
total nops in shared programs: 298140 -> 296185 (-0.66%)
nops in affected programs: 52805 -> 50850 (-3.70%)
helped: 3797
HURT: 2662
Inconclusive result
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
The last 3 instructions can't use specific registers, so flag the nodes
for temps used in the last program instructions and try to avoid
assigning any of those registers to them. This may help us avoid
injecting nops for the last thread switch instruction.
Because register allocation needs to happen before QPU scheduling
and instruction merging, we can't tell exactly what the last 3
instructions will be, so we do this for a few more instructions than
just 3.
We only do this for fragment shaders because other shader stages
always end with VPM store instructions that take a small immediate
and therefore will never allow us to merge the final thread switch
earlier, so limiting allocation for these shaders would never improve
anything and might instead be detrimental.
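A minimal sketch of the idea, using hypothetical names (qinst,
flag_tail_temps and LAST_INST_SLACK are illustrative, not the actual
backend identifiers): walk the last few instructions and mark every
temp they touch so the allocator can steer those nodes away from the
restricted registers.

#include <stdbool.h>

/* Hypothetical, simplified instruction: reads up to two temps and
 * writes at most one.  Real backend instructions carry much more. */
struct qinst {
   int dst_temp;      /* -1 if the destination is not a temp */
   int src_temp[2];   /* -1 if the source slot is unused */
};

/* Register allocation runs before scheduling and instruction merging,
 * so we don't know the exact final 3 instructions; flag a few extra. */
#define LAST_INST_SLACK 6

static void
flag_tail_temps(const struct qinst *insts, int num_insts,
                bool *avoid_restricted /* indexed by temp */)
{
   int first = num_insts > LAST_INST_SLACK ? num_insts - LAST_INST_SLACK : 0;

   for (int i = first; i < num_insts; i++) {
      if (insts[i].dst_temp >= 0)
         avoid_restricted[insts[i].dst_temp] = true;
      for (int s = 0; s < 2; s++) {
         if (insts[i].src_temp[s] >= 0)
            avoid_restricted[insts[i].src_temp[s]] = true;
      }
   }
}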
total instructions in shared programs: 11471389 -> 11464335 (-0.06%)
instructions in affected programs: 582908 -> 575854 (-1.21%)
helped: 4669
HURT: 578
Instructions are helped.
total max-temps in shared programs: 2230497 -> 2230150 (-0.02%)
max-temps in affected programs: 5662 -> 5315 (-6.13%)
helped: 344
HURT: 44
Max-temps are helped.
total sfu-stalls in shared programs: 18068 -> 18077 (0.05%)
sfu-stalls in affected programs: 264 -> 273 (3.41%)
helped: 37
HURT: 48
Inconclusive result (value mean confidence interval includes 0).
total inst-and-stalls in shared programs: 11489457 -> 11482412 (-0.06%)
inst-and-stalls in affected programs: 585180 -> 578135 (-1.20%)
helped: 4659
HURT: 588
Inst-and-stalls are helped.
total nops in shared programs: 301738 -> 298140 (-1.19%)
nops in affected programs: 14680 -> 11082 (-24.51%)
helped: 3252
HURT: 108
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
The rf variants need to encode the destination in the cond bits, which
prevents them from being merged with any other instruction that needs
those bits.
In 4.x, ldunif(a) writes to r5, which is a special register that only
ldunif(a) and ldvary can write, so we have a special register class for
it and only allow it for them. Then, when we need to choose a register
for a node, if this register is available we always use it.
In 7.x these instructions write to rf0, which can be used by any
instruction, so instead of restricting rf0, we track the temps that
are used as ldunif(a) destinations and use that information to favor
rf0 for them.
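A minimal sketch of the 7.x selection bias, with hypothetical names
(choose_reg_for_node and PHYS_RF0 are illustrative, not the actual
register allocator entry points):

#include <stdbool.h>

/* rf0 is just a regular register file entry on 7.x, so rather than a
 * dedicated class we bias selection towards it for ldunif(a) temps. */
#define PHYS_RF0 0

static int
choose_reg_for_node(bool node_is_ldunif_dst,
                    const bool *reg_is_free, int num_regs)
{
   /* Prefer rf0 for ldunif(a) destinations when it is available. */
   if (node_is_ldunif_dst && reg_is_free[PHYS_RF0])
      return PHYS_RF0;

   /* Otherwise pick the first free register. */
   for (int r = 0; r < num_regs; r++) {
      if (reg_is_free[r])
         return r;
   }

   return -1; /* allocation failed */
}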
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
Having this pointer in the key is undesirable since it makes
copying keys difficult and error prone (as seen in previous
patches). Also, it is only there for convenience and we don't
strictly need it (in fact, the Vulkan driver doesn't use it at
all), so let's just get rid of it so our v3d_key is fully
static.
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25418>
Our shader key includes a void pointer that we can't just memcmp,
so add helpers that allow us to get the 'static' portion and size
of a key. We will use this to fix up the shader cache in v3d in
a later patch.
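A minimal sketch of what such helpers might look like, assuming a
hypothetical key layout (demo_key, shader_state and the field names
are illustrative; the real v3d_key has different contents): the
pointer sits at the start and the comparable bytes follow it.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key: a driver-owned pointer followed by the static
 * data that actually identifies the shader variant. */
struct demo_key {
   void *shader_state;      /* convenience pointer, not hashable */
   unsigned num_samples;    /* first piece of "static" data */
   unsigned swizzle[4];
};

/* The portion of the key that is safe to memcmp or sha1. */
static const void *
demo_key_static_data(const struct demo_key *key)
{
   return (const char *)key + offsetof(struct demo_key, num_samples);
}

static size_t
demo_key_static_size(void)
{
   return sizeof(struct demo_key) - offsetof(struct demo_key, num_samples);
}

/* Usage: compare two keys while ignoring the pointer. */
static bool
demo_key_equal(const struct demo_key *a, const struct demo_key *b)
{
   return memcmp(demo_key_static_data(a), demo_key_static_data(b),
                 demo_key_static_size()) == 0;
}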
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25418>
It seems NIR is tracking this for us now, so we can stop doing it
in the backend.
Also, new CTS tests seem to add the requirement that, in the presence
of some builtins like gl_SampleID in a shader, sample shading is
expected to be enabled even if the builtin is unused, which is
something we can't track in the backend since the variable may have
been dropped by then.
Fixes 2 failures in:
dEQP-VK.draw.renderpass.implicit_sample_shading.sample*
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23984>
This was only used for versions < 40 (see commit 22a02f3e3).
Add some extra explanations and asserts at the places where it is used.
While we are here, also move the definition of the register with
QFILE_VPM to avoid defining it when it is not needed.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22984>
If we are trying to lower register pressure this can make a big
difference in some cases. To avoid adding even more strategies,
merge this with disabling UBO load sorting, since they are basically
trying to do the same thing.
total instructions in shared programs: 12848024 -> 12844510 (-0.03%)
instructions in affected programs: 236537 -> 233023 (-1.49%)
helped: 195
HURT: 87
Instructions are helped.
total uniforms in shared programs: 3815601 -> 3814932 (-0.02%)
uniforms in affected programs: 31773 -> 31104 (-2.11%)
helped: 67
HURT: 115
Inconclusive result (value mean confidence interval includes 0).
total max-temps in shared programs: 2210803 -> 2210622 (<.01%)
max-temps in affected programs: 9362 -> 9181 (-1.93%)
helped: 114
HURT: 34
Max-temps are helped.
total spills in shared programs: 2556 -> 2330 (-8.84%)
spills in affected programs: 1391 -> 1165 (-16.25%)
helped: 39
HURT: 9
total fills in shared programs: 3840 -> 3317 (-13.62%)
fills in affected programs: 2379 -> 1856 (-21.98%)
helped: 39
HURT: 23
total sfu-stalls in shared programs: 21965 -> 21978 (0.06%)
sfu-stalls in affected programs: 2618 -> 2631 (0.50%)
helped: 45
HURT: 81
Inconclusive result (value mean confidence interval includes 0).
total inst-and-stalls in shared programs: 12869989 -> 12866488 (-0.03%)
inst-and-stalls in affected programs: 238771 -> 235270 (-1.47%)
helped: 193
HURT: 87
Inst-and-stalls are helped.
total nops in shared programs: 303501 -> 303274 (-0.07%)
nops in affected programs: 4159 -> 3932 (-5.46%)
helped: 87
HURT: 105
Inconclusive result (value mean confidence interval includes 0).
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22824>
nir_opt_gcm gets us worse shader-db stats, but that is expected.
However, we want to avoid making spills/fills worse. Analyzing the
outcome with shader-db, this mostly happens with shaders that are
already complex and already spilling/filling.
So the best option here is to add a new strategy that falls back if
we get spills/fills when using nir_opt_gcm (see the sketch below).
It is not clear at which point in the strategy order we should
disable gcm; for now we disable it before loop unrolling.
We get a slight performance gain (on average) using nir_opt_gcm.
We don't show the shader-db stats, as they are worse, but as mentioned,
this is expected.
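A minimal sketch of how such a fallback might be wired, using
hypothetical names (strategy, compile_with and compile_best are
illustrative, not the actual backend code):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical strategy table: each entry toggles expensive
 * optimizations, and the entry disabling GCM sits before the more
 * drastic fallbacks such as disabling loop unrolling. */
struct strategy {
   const char *name;
   bool enable_gcm;
   bool enable_loop_unroll;
};

static const struct strategy strategies[] = {
   { "default",           true,  true  },
   { "disable gcm",       false, true  },
   { "disable unrolling", false, false },
};

/* Stub standing in for the real compile; returns the spill count. */
static int
compile_with(const struct strategy *s)
{
   return s->enable_gcm ? 2 : 0; /* pretend GCM causes two spills */
}

static void
compile_best(void)
{
   for (unsigned i = 0; i < sizeof(strategies) / sizeof(strategies[0]); i++) {
      if (compile_with(&strategies[i]) == 0) {
         printf("compiled with strategy '%s'\n", strategies[i].name);
         return;
      }
      /* Spilled: fall back to the next, more conservative strategy. */
   }
}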
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17185>
These are optimizations that we are already calling in the Vulkan
driver, as preparation for the Vulkan frontend to use v3d_optimize_nir
too.
We need to add a new parameter to v3d_optimize_nir in order to know if
we can call nir_opt_find_array_copies. As we don't track whether we are
calling nir_lower_var_copies, we explicitly call it when the uncompiled
shader is created. So instead of tracking it, we assume that each
driver (v3d/v3dv) calls it when the shader is created, and when
v3d_optimize_nir is called later as part of compiling the shader, we
call it with allow_copies set to false.
We exclude nir_opt_gcm on purpose, as it is a case of an optimization
that could help performance even if it hurts shader-db stats.
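A simplified sketch of the gating, assuming the Mesa NIR headers
(optimize_once is an illustrative name; the real v3d_optimize_nir
loops over many more passes):

#include "nir.h"

static bool
optimize_once(nir_shader *s, bool allow_copies)
{
   bool progress = false;

   /* Only look for array copies when the caller guarantees the
    * shader's variable copies have not been lowered yet. */
   if (allow_copies)
      NIR_PASS(progress, s, nir_opt_find_array_copies);

   NIR_PASS(progress, s, nir_copy_prop);
   NIR_PASS(progress, s, nir_opt_dce);

   return progress;
}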
shaderdb stats:
total instructions in shared programs: 11705923 -> 11705034 (<.01%)
instructions in affected programs: 88350 -> 87461 (-1.01%)
helped: 201
HURT: 80
Instructions are helped.
total threads in shared programs: 375552 -> 375558 (<.01%)
threads in affected programs: 6 -> 12 (100.00%)
helped: 3
HURT: 0
total uniforms in shared programs: 3486108 -> 3485789 (<.01%)
uniforms in affected programs: 7473 -> 7154 (-4.27%)
helped: 90
HURT: 1
Uniforms are helped.
total max-temps in shared programs: 2021860 -> 2021802 (<.01%)
max-temps in affected programs: 800 -> 742 (-7.25%)
helped: 21
HURT: 3
Max-temps are helped.
total sfu-stalls in shared programs: 19299 -> 19296 (-0.02%)
sfu-stalls in affected programs: 18 -> 15 (-16.67%)
helped: 10
HURT: 7
Inconclusive result (value mean confidence interval includes 0).
total inst-and-stalls in shared programs: 11725222 -> 11724330 (<.01%)
inst-and-stalls in affected programs: 88402 -> 87510 (-1.01%)
helped: 201
HURT: 80
Inst-and-stalls are helped.
total nops in shared programs: 269674 -> 269386 (-0.11%)
nops in affected programs: 3641 -> 3353 (-7.91%)
helped: 103
HURT: 29
Nops are helped.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17185>
Two advantages:
* When using NIR_DEBUG=nir_print_xx, the outcome will only be printed
  if there is a change.
* We can use NIR_PASS(_, ...) instead of NIR_PASS_V, which has
  slightly more validation checks.
This includes:
* v3d_nir_lower_image_load_store
* v3d_nir_lower_io
* v3d_nir_lower_line_smooth
* v3d_nir_lower_load_store_bitsize
* v3d_nir_lower_robust_buffer_access
* v3d_nir_lower_scratch
* v3d_nir_lower_txf_ms
While we are here, we also simplify some of them by using the
nir_shader_instructions_pass helper (see the sketch below).
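A minimal sketch of that style of pass, assuming the Mesa NIR headers
(v3d_nir_lower_example and lower_instr are illustrative names, not one
of the passes listed above):

#include "nir.h"
#include "nir_builder.h"

/* Per-instruction callback: return true only when the instruction
 * was actually rewritten, so progress is reported accurately. */
static bool
lower_instr(nir_builder *b, nir_instr *instr, void *data)
{
   if (instr->type != nir_instr_type_intrinsic)
      return false;

   /* A real pass would rewrite the instruction here using 'b'. */
   return false;
}

static bool
v3d_nir_lower_example(nir_shader *s)
{
   return nir_shader_instructions_pass(s, lower_instr,
                                       nir_metadata_block_index |
                                       nir_metadata_dominance,
                                       NULL);
}

/* Called as NIR_PASS(_, s, v3d_nir_lower_example), the shader is only
 * printed with NIR_DEBUG when the pass reports progress. */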
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17609>
Inline uniform blocks store their contents in pool memory rather
than in a separate buffer, and are intended to give some platforms
a way to provide more efficient access to the uniform data, similar
to push constants but with more flexible size constraints.
We implement these in a similar way to push constants: for constant
access we copy the data into the uniform stream (using the new
QUNIFORM_UNIFORM_UBO_* enums to identify the inline buffer from
which we need to copy), and for indirect access we fall back to
regular UBO access.
Because at the NIR level there is no distinction between inline and
regular UBOs and the compiler isn't aware of Vulkan descriptor
sets, we use the UBO index on UBO load intrinsics to identify
inline UBOs, just like we do for push constants. Specifically,
we reserve indices 1..MAX_INLINE_UNIFORM_BUFFERS for this.
However, unlike push constants, inline buffers are accessed
through descriptor sets, so we need to make sure they are located
in the first slots of the UBO descriptor map. This means we store
them in the first MAX_INLINE_UNIFORM_BUFFERS slots of the map,
with regular UBOs always coming after these slots.
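A minimal sketch of the index handling described above; the exact
offsets are illustrative (it assumes UBO index 0 is the push constant
buffer) and the helper names are hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define MAX_INLINE_UNIFORM_BUFFERS 4 /* illustrative limit */

/* UBO indices 1..MAX_INLINE_UNIFORM_BUFFERS are reserved for inline
 * uniform blocks; anything above that is a regular UBO. */
static bool
ubo_index_is_inline(uint32_t ubo_index)
{
   return ubo_index >= 1 && ubo_index <= MAX_INLINE_UNIFORM_BUFFERS;
}

/* Inline blocks occupy the first MAX_INLINE_UNIFORM_BUFFERS slots of
 * the UBO descriptor map, so inline UBO index N maps to slot N - 1.
 * Regular UBOs are then added starting at slot
 * MAX_INLINE_UNIFORM_BUFFERS. */
static uint32_t
inline_ubo_descriptor_map_slot(uint32_t ubo_index)
{
   return ubo_index - 1;
}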
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15575>
This can add quite a bit of register pressure, so it makes sense to
disable it to prevent us from dropping to 2 threads or increasing
spills:
total instructions in shared programs: 12672813 -> 12642413 (-0.24%)
instructions in affected programs: 256721 -> 226321 (-11.84%)
helped: 719
HURT: 77
total threads in shared programs: 415534 -> 416322 (0.19%)
threads in affected programs: 788 -> 1576 (100.00%)
helped: 394
HURT: 0
total uniforms in shared programs: 3711370 -> 3703861 (-0.20%)
uniforms in affected programs: 28859 -> 21350 (-26.02%)
helped: 204
HURT: 455
total max-temps in shared programs: 2159439 -> 2150686 (-0.41%)
max-temps in affected programs: 32945 -> 24192 (-26.57%)
helped: 585
HURT: 47
total spills in shared programs: 5966 -> 3255 (-45.44%)
spills in affected programs: 2933 -> 222 (-92.43%)
helped: 192
HURT: 4
total fills in shared programs: 9328 -> 4630 (-50.36%)
fills in affected programs: 5184 -> 486 (-90.62%)
helped: 196
HURT: 0
Compared to the stats before adding scheduling of non-filtered
memory reads, we see that we have now gotten back all that was
lost and then some:
total instructions in shared programs: 12663186 -> 12642413 (-0.16%)
instructions in affected programs: 2051803 -> 2031030 (-1.01%)
helped: 4885
HURT: 3338
total threads in shared programs: 415870 -> 416322 (0.11%)
threads in affected programs: 896 -> 1348 (50.45%)
helped: 300
HURT: 74
total uniforms in shared programs: 3711629 -> 3703861 (-0.21%)
uniforms in affected programs: 158766 -> 150998 (-4.89%)
helped: 1973
HURT: 499
total max-temps in shared programs: 2138857 -> 2150686 (0.55%)
max-temps in affected programs: 177920 -> 189749 (6.65%)
helped: 2666
HURT: 2035
total spills in shared programs: 3860 -> 3255 (-15.67%)
spills in affected programs: 2653 -> 2048 (-22.80%)
helped: 77
HURT: 21
total fills in shared programs: 5573 -> 4630 (-16.92%)
fills in affected programs: 3839 -> 2896 (-24.56%)
helped: 81
HURT: 15
total sfu-stalls in shared programs: 39583 -> 38154 (-3.61%)
sfu-stalls in affected programs: 8993 -> 7564 (-15.89%)
helped: 1808
HURT: 1038
total nops in shared programs: 324894 -> 323685 (-0.37%)
nops in affected programs: 30362 -> 29153 (-3.98%)
helped: 2513
HURT: 2077
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
Now that we don't sort our nodes, we can arrange them so we can
easily translate between nodes and temps without a mapping table,
just by applying an offset.
To do this we have a single array of nodes where we put the nodes
for accumulators first and then the nodes for temps. With this setup
we can ensure that for any given temp T, its node is always
T + ACC_COUNT.
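A minimal sketch of that translation (temp_to_node/node_to_temp and
the ACC_COUNT value are illustrative):

#include <assert.h>

#define ACC_COUNT 6 /* accumulator nodes come first in the array */

/* With accumulators first and temps after them, translating between a
 * temp index and its RA node is just a constant offset. */
static inline unsigned
temp_to_node(unsigned temp)
{
   return temp + ACC_COUNT;
}

static inline unsigned
node_to_temp(unsigned node)
{
   assert(node >= ACC_COUNT);
   return node - ACC_COUNT;
}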
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
When we spill we add new temps. We should be careful not to access
liveness for these until it has been re-computed, after all spills
and fills for the spilled temp have been processed, so as to avoid
out-of-bounds accesses to the c->temp_start and c->temp_end arrays.
This fixes a crash in a Three.js demo when we try to patch register
classes after a TMU spill, caused because we would incorrectly try
to patch the same temps we had just added for the spill itself,
which is not only unnecessary but also incorrect, since these temps
don't have liveness information available yet and thus cause
out-of-bounds accesses.
Fixes: f3c3228522 ('broadcom/compiler: do not rebuild the interference graph after each spill')
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15107>
Instead, we only recompute liveness and add new nodes and
interferences to the graph manually (we also need to patch
register classes in some cases).
To assist in this process, we add an ip counter to our instructions,
which we recompute after each spill and use to identify registers
whose live ranges cross the thrsw boundaries introduced by TMU spills
and fills, adjusting their register classes accordingly (removing
their ability to use accumulators).
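A minimal sketch of that class patching, with hypothetical names
(temp_liveness and patch_classes_for_new_thrsw are illustrative):
after recomputing liveness in ip units, any temp whose range crosses
the new thrsw loses the accumulator class.

#include <stdbool.h>

/* Hypothetical per-temp liveness record, in instruction-counter (ip)
 * units that are recomputed after each spill. */
struct temp_liveness {
   int start_ip;
   int end_ip;
   bool can_use_accumulators;
};

/* Accumulator contents are not preserved across thread switches, so
 * any temp live across the newly added thrsw must drop that class. */
static void
patch_classes_for_new_thrsw(struct temp_liveness *temps, int num_temps,
                            int thrsw_ip)
{
   for (int t = 0; t < num_temps; t++) {
      if (temps[t].start_ip < thrsw_ip && temps[t].end_ip > thrsw_ip)
         temps[t].can_use_accumulators = false;
   }
}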
This significantly reduces the CPU cost of spills. Using
shaders/closed/gputest/piano/7.shader_test as reference:
Compile time up to the first successful compile strategy in main is
~24s and with this change it is ~11s. With this speed up, we can now
try all 2-thread compile strategies (including the fallback scheduler)
in only ~15s.
A full shader-db run results in:
Total CPU time (seconds): 9904.67 -> 9087.98 (-8.25%)
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
Instead of whether they are allowed to spill or not. This is more
flexible.
Also, while we are not currently enabling spilling on any 4-thread
strategies, should we do that in the future, we always prefer a
4-thread compile.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
V3D hardware doesn't support vector access for general TMU load/store
operations like the ones we use for UBO and SSBO, so we need to split
these into scalar operations.
It should be noted that we also have a vectorization pass (which runs
later, during optimization) that may reconstruct some of these into
32-bit operations when possible (i.e. when the resulting operation
is 32-bit aligned).
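A minimal sketch of the splitting, with hypothetical names
(scalar_access and split_vector_access are illustrative): a vecN
32-bit access at a given offset becomes N scalar accesses at
consecutive 4-byte offsets.

#include <assert.h>
#include <stdint.h>

struct scalar_access {
   uint32_t offset;     /* byte offset of the scalar word */
   uint8_t  component;  /* which component of the original vector */
};

/* Split a vecN 32-bit load/store at 'offset' into N scalar accesses.
 * The later vectorization pass may re-merge neighbours when the
 * result is suitably aligned. */
static int
split_vector_access(uint32_t offset, uint8_t num_components,
                    struct scalar_access out[4])
{
   assert(num_components <= 4);

   for (uint8_t c = 0; c < num_components; c++) {
      out[c].offset = offset + c * 4;
      out[c].component = c;
   }

   return num_components;
}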
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
We had been storing pointers to a driver-owned swizzle table
rather than the actual swizzle values in various shader and
pipeline keys in both the GL and Vulkan drivers.
This doesn't look very robust, particularly since we also
compute sha1 hashes from these values and may store those
hashes to disk (for the disk cache).
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13738>