Commit Graph

86 Commits

Author SHA1 Message Date
Iago Toral Quiroga
917e8e5439 broadcom/compiler: rename is_ldunif_dst to try_rf0
We flag nodes used to ldunif dst so we can try and favor allocating
rf0 to them, so be more explicit about its purpose.

Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31355>
2024-09-25 14:21:46 +00:00
Iago Toral Quiroga
d1f8351f3c broadcom/compiler: fix per-quad spilling
This is not safe when we have conditional spills since we could be
spilling disabled lanes with undefined values that could overwrite
valid data for those lanes from a previous spill of the same temp
that was unconditional (or that condionally enabled those same
lanes).

Fixes some Piglit OpenCL tests as well as the following OpenCL tests:
integer_divideAssign
integer_moduloAssign
integer_mad_sat
integer_ops integer_divideAssign
integer_ops integer_mad_sat
integer_ops integer_moduloAssign
integer_ops quick_char_math
integer_ops quick_short_math
math_brute_force half_powr
math_brute_force pow
math_brute_force pown
math_brute_force powr
math_brute_force rootn

Fixes: 597560e27c ('broadcom/compiler: always enable per-quad on spill operations')
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29909>
2024-06-27 06:43:09 +00:00
Iago Toral Quiroga
38b7f411a1 broadcom/compiler: don't spill in between multop and umul24
The multop instruction implicitly writes rtop which is not preserved
acrosss thread switches. We can spill the sources of the multop
(since these would happen before multop) and the destination of
umul24 (since that would happen after umul24).

Fixes some OpenCL tests when V3D_DEBUG=opt_compile_time is used to
choose a different compile configuration.

cc: mesa-stable

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29909>
2024-06-27 06:43:09 +00:00
Iago Toral Quiroga
02f33b7d92 broadcom/compiler: initialize payload_conflict for all initial nodes
Fixes: cb83f25b39 ('broadcom/compiler: don't assign payload registers to spilling setup temps')
cc: mesa-stable

Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29759>
2024-06-18 07:19:07 +00:00
Iago Toral Quiroga
865e682ad7 broadcom/compiler: apply payload conflict to spill setup before RA
We can emit spill setup before RA if we use scratch. In that case
we have the same situation as during spilling, with the caveat that
we have already emitted the instructions so we need to find them
(they should be the only instructions ones before the instructions
accessing payload registers) and flag them as such.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
cc: mesa-stable

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29343>
2024-05-24 05:25:22 +00:00
Iago Toral Quiroga
cb83f25b39 broadcom/compiler: don't assign payload registers to spilling setup temps
We read our payload registers first in the shader so we generally don't have
to care about temps being allocated to them and stomping their value before
we can read them. Hoewer, spilling setup instructions are an exception since
these will be inserted first when there is any spilling in the program.
To fix this, we flag RA nodes involved with these instructions so we can
then try to avoid assiging these registers to them.

Fixes CTS failures with V3D_DEBUG=opt_compile_time, particularly:
dEQP-VK.binding_model.buffer_device_address.set0.depth2.basessbo.convertcheckuv2.nostore.single.std140.comp_offset_nonzero

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
cc: mesa-stable

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29343>
2024-05-24 05:25:22 +00:00
Iago Toral Quiroga
901c485997 broadcom/compiler: make add_node return the node index
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
cc: mesa-stable

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29343>
2024-05-24 05:25:21 +00:00
Alejandro Piñeiro
85f26828fe broadcom: only support v42 and v71
Acked-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25851>
2023-11-02 11:59:08 +01:00
Iago Toral Quiroga
1f5a3391bb broadcom/compiler: only assign rf0 as last resort in V3D 7.x
So we can use it for ldunif(a) and avoid generating ldunif(a)rf which
can't be paired with conditional instructions.

shader-db (pi5):

total instructions in shared programs: 11357802 -> 11338883 (-0.17%)
instructions in affected programs: 7117889 -> 7098970 (-0.27%)
helped: 24264
HURT: 17574
Instructions are helped.

total uniforms in shared programs: 3857808 -> 3857815 (<.01%)
uniforms in affected programs: 92 -> 99 (7.61%)
helped: 0
HURT: 1

total max-temps in shared programs: 2230904 -> 2230199 (-0.03%)
max-temps in affected programs: 52309 -> 51604 (-1.35%)
helped: 1219
HURT: 725
Max-temps are helped.

total sfu-stalls in shared programs: 15021 -> 15236 (1.43%)
sfu-stalls in affected programs: 6848 -> 7063 (3.14%)
helped: 1866
HURT: 1704
Inconclusive result

total inst-and-stalls in shared programs: 11372823 -> 11354119 (-0.16%)
inst-and-stalls in affected programs: 7149177 -> 7130473 (-0.26%)
helped: 24315
HURT: 17561
Inst-and-stalls are helped.

total nops in shared programs: 273624 -> 273711 (0.03%)
nops in affected programs: 31562 -> 31649 (0.28%)
helped: 1619
HURT: 1854
Inconclusive result (value mean confidence interval includes 0).

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
c8e4ee8ecb broadcom/compiler: don't assign registers to unused nodes/temps
In programs with a lot of unused temps, if we don't do this, we may
end up recycling previously used rfs more often, which can be
detrimental to instruction pairing.

total instructions in shared programs: 11464335 -> 11444136 (-0.18%)
instructions in affected programs: 8976743 -> 8956544 (-0.23%)
helped: 33196
HURT: 33778
Inconclusive result

total max-temps in shared programs: 2230150 -> 2229445 (-0.03%)
max-temps in affected programs: 86413 -> 85708 (-0.82%)
helped: 2217
HURT: 1523
Max-temps are helped.

total sfu-stalls in shared programs: 18077 -> 17104 (-5.38%)
sfu-stalls in affected programs: 8669 -> 7696 (-11.22%)
helped: 2657
HURT: 2182
Sfu-stalls are helped.

total inst-and-stalls in shared programs: 11482412 -> 11461240 (-0.18%)
inst-and-stalls in affected programs: 8995697 -> 8974525 (-0.24%)
helped: 33319
HURT: 33708
Inconclusive result

total nops in shared programs: 298140 -> 296185 (-0.66%)
nops in affected programs: 52805 -> 50850 (-3.70%)
helped: 3797
HURT: 2662
Inconclusive result

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
ce13aa4ee7 broadcom/compiler: improve allocation for final program instructions
The last 3 instructions can't use specific registers so flag all the
nodes for temps used in the last program instructions and try to
avoid assigning any of these. This may help us avoid injecting nops
for the last thread switch instruction.

Because regisster allocation needs to happen before QPU scheduling
and instruction merging we can't tell exactly what the last 3
instructions will be, so we do this for a few more instructions than
just 3.

We only do this for fragment shaders because other shader stages
always end with VPM store instructions that take an small immediate
and therefore will never allow us to merge the final thread switch
earlier, so limiting allocation for these shaders will never improve
anything and might instead be detrimental.

total instructions in shared programs: 11471389 -> 11464335 (-0.06%)
instructions in affected programs: 582908 -> 575854 (-1.21%)
helped: 4669
HURT: 578
Instructions are helped.

total max-temps in shared programs: 2230497 -> 2230150 (-0.02%)
max-temps in affected programs: 5662 -> 5315 (-6.13%)
helped: 344
HURT: 44
Max-temps are helped.

total sfu-stalls in shared programs: 18068 -> 18077 (0.05%)
sfu-stalls in affected programs: 264 -> 273 (3.41%)
helped: 37
HURT: 48
Inconclusive result (value mean confidence interval includes 0).

total inst-and-stalls in shared programs: 11489457 -> 11482412 (-0.06%)
inst-and-stalls in affected programs: 585180 -> 578135 (-1.20%)
helped: 4659
HURT: 588
Inst-and-stalls are helped.

total nops in shared programs: 301738 -> 298140 (-1.19%)
nops in affected programs: 14680 -> 11082 (-24.51%)
helped: 3252
HURT: 108
Nops are helped.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
818fc41e7e broadcom/compiler: don't allocate spill base to rf0 in V3D 7.x
Otherwise it can be stomped by instructions doing implicit rf0 writes.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
d4285d7f2a broadcom/compiler: start allocating from RF 4 in V7.x
In V3D 4.x we start at RF3 so that we allocate RF0-2 only if there
aren't any other RFs available. This is useful with small shaders to
ensure that our TLB writes don't use these registers because these are
the last instructions we emit in fragment shaders and the last
instructions in a program can't write to these registers, so if we do,
we need to emit NOPs.

In V3D 7.x the registers affected by this restriction are RF2-3, so we
choose to start at RF4.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
cbedf14687 broadcom/compiler: don't assign rf0 to temps that conflict with ldvary
ldvary writes to rf0 implicitly, so we don't want to allocate rf0 to
any temps that are live across ldvary's rf0 live ranges.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
3a36a618d7 broadcom/compiler: try to use ldunif(a) instead of ldunif(a)rf in v71
The rf variants need to encode the destination in the cond bits, which
prevents these to be merged with any other instruction that need them.

In 4.x, ldunif(a) write to r5 which is a special register that only
ldunif(a) and ldvary can write so we have a special register class for
it and only allow it for them. Then when we need to choose a register
for a node, if this register is available we always use it.

In 7.x these instructions write to rf0, which can be used by any
instruction, so instead of restricting rf0, we track the temps that
are used as ldunif(a) destinations and use that information to favor
rf0 for them.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
2b15df963e broadcom/compiler: don't assign rf0 to temps across implicit rf0 writes
In platforms that don't have accumulators and have implicit writes to
the register file we need to be careful and avoid assigning a physical
register to a temp that lives across an implicit write to that same
physical register.

For now, we have the case of implicit writes to rf0 from various
signals, but it should be easy to extend this to include additional
registers if needed.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:42 +00:00
Iago Toral Quiroga
03594b3dca broadcom/compiler: only handle accumulator classes if present
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:41 +00:00
Iago Toral Quiroga
b1548b18d3 broadcom/compiler: rename vir_writes_rX to vir_writes_rX_implicitly
Since that represents more accurately what they check..

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:41 +00:00
Alejandro Piñeiro
083d082d8e broadcom/compiler: update register classes to not include accumulators on v71
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:41 +00:00
Alejandro Piñeiro
5a035af931 broadcom/compiler: payload_w is loaded on rf3 for v71
And in general rf0 is now used for other needs.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:41 +00:00
Alejandro Piñeiro
d00f7ef23e broadcom/compiler: don't favor/select accum registers for hw not supporting it
Note that what we do is to just return false on the favor/select accum
methods. We could just avoid to call them, but as the select is called
more than once, it is just easier this way.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:41 +00:00
Alejandro Piñeiro
1260b202be broadcom/compiler: phys index depends on hw version
For 7.1 there are not accumulators. So we replace the macro with a
function call.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:41 +00:00
Iago Toral Quiroga
63d633ca7a broadcom/compiler: update node/temp translation for v71
As the offset applied needs to take into account if we have
accumulators or not.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
2023-10-13 22:37:41 +00:00
Iago Toral Quiroga
6114e66124 broadcom/compiler: only use last thread switch flag to detect final section
Since commit 'c98ddc778a3 broadcom/compiler: force a last thrsw for spilling'
we always ensure we signal the last thread section explicitly with a
last thread switch.

Relying on VPM stores to detect the last thread section is particularly bad,
because we can have VPM stores occurring quite early in a shader program,
which would disable TMU spilling almost entirely.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22461>
2023-06-14 09:27:50 +00:00
Iago Toral Quiroga
1e28f2a6f2 broadcom/compiler: track pending ldtmu count with each TMU lookup
And use this information when scheduling QPU to avoid merging
a new TMU request into a previous ldtmu instruction when doing
so may cause TMU output fifo overflow due to a stalling ldtmu.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22044>
2023-03-21 11:29:05 +00:00
Iago Toral Quiroga
cfccd93efc broadcom/compiler: don't predicate postponed spills
The postponed spill is predicated using the condition from the
last write, but this is only correct if the register was only
written once in the TMU sequence, or if it is always written with
the same predication.

While we could try to track whether this is the case or not, it
would make the postponed spill path even more complex than it
already is, so let's just avoid predicating these. We are already
discouraging TMU spilling of registers in the middle of TMU
sequences, so this should not be a very common case.

Cc: mesa-stable
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17201>
2022-06-28 05:49:51 +00:00
Iago Toral Quiroga
98420408d0 broadcom/compiler: fix postponed TMU spills with multiple writes
If we are spilling a register that is used in the middle of a TMU
sequence, we postpone the spill until the TMU sequence finishes,
at which point we inject the spill and rewrite the original
instruction to write to the new temp.

However, this doesn't work if the register is written multiple
times during the TMU sequence. In that scenario, we need to ensure
that all writes are rewritten to use the new temp, not just the last
one.

Cc: mesa-stable
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17201>
2022-06-28 05:49:51 +00:00
Iago Toral Quiroga
cf4b3cb563 broadcom/compiler: prefer reconstruction over TMU spills when possible
We have been reconstructing/rematerializing uniforms for a while, but we
can do this in more scenarios, namely instructions which result is
immutable along the execution of a shader across all channels.

By doing this we gain the capacity to eliminate TMU spills which not
only are slower, but can also make us drop to a fallback compilation
strategy.

Shader-db results show a small increase in instruction counts caused
by us now being able to choose preferential compiler strategies that
are intended to reduce TMU latency. In some cases, we are now also
able to avoid dropping thread counts:

total instructions in shared programs: 12658092 -> 12659245 (<.01%)
instructions in affected programs: 75812 -> 76965 (1.52%)
helped: 55
HURT: 107

total threads in shared programs: 416286 -> 416412 (0.03%)
threads in affected programs: 126 -> 252 (100.00%)
helped: 63
HURT: 0

total uniforms in shared programs: 3716916 -> 3716396 (-0.01%)
uniforms in affected programs: 19327 -> 18807 (-2.69%)
helped: 94
HURT: 50

total max-temps in shared programs: 2161796 -> 2161578 (-0.01%)
max-temps in affected programs: 3961 -> 3743 (-5.50%)
helped: 80
HURT: 24

total spills in shared programs: 3274 -> 3266 (-0.24%)
spills in affected programs: 98 -> 90 (-8.16%)
helped: 6
HURT: 0

total fills in shared programs: 4657 -> 4642 (-0.32%)
fills in affected programs: 130 -> 115 (-11.54%)
helped: 6
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15710>
2022-04-08 05:37:28 +00:00
Iago Toral Quiroga
597560e27c broadcom/compiler: always enable per-quad on spill operations
This ensures that any channels used for helper invocations are
also spilled/filled correctly.

Alternatively, we could recursively track all temps that get
involved in computing values that are then used in explicit
(dfdx,dfdy) or implicit (texture coordinates for mipmap or
anisotropic filtering, etc) derivatives, and only enable
per-quad on these (or disable spilling of any of these
values).

Fixes:
dEQP-VK.graphicsfuzz.cov-dfdx-dfdy-after-nested-loops

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15705>
2022-04-01 08:53:50 +00:00
Iago Toral Quiroga
49b5431197 broadcom/compiler: remove unused functions
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15302>
2022-03-10 07:25:37 +00:00
Iago Toral Quiroga
44feff93c2 broadcom/compiler: don't always assign r5 if available
Instead, only favor assigning r5 if we have first decided to
assign an accumulator. This helps with assining r5 to short
lived uniforms, favoring accumulator rotation to facilitate
QPU merges.

total instructions in shared programs: 12656164 -> 12628339 (-0.22%)
instructions in affected programs: 5368373 -> 5340548 (-0.52%)
helped: 17420
HURT: 9996

total uniforms in shared programs: 3704776 -> 3704863 (<.01%)
uniforms in affected programs: 12247 -> 12334 (0.71%)
helped: 23
HURT: 78

total max-temps in shared programs: 2153505 -> 2152684 (-0.04%)
max-temps in affected programs: 26468 -> 25647 (-3.10%)
helped: 569
HURT: 328

total fills in shared programs: 4656 -> 4657 (0.02%)
fills in affected programs: 43 -> 44 (2.33%)
helped: 0
HURT: 1

total sfu-stalls in shared programs: 34728 -> 34403 (-0.94%)
sfu-stalls in affected programs: 3411 -> 3086 (-9.53%)
helped: 842
HURT: 534

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
77f58b46d9 broadcom/compiler: add comment on why we don't use r5 with ldunifa
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
f761f8fd9e broadcom/compiler: simplify node/temp translation during register allocation
Now that we don't sort our nodes we can arrange them so we can
easily translate between nodes and temps without a mapping table,
just applying an offset.

To do this we have a single array of nodes where twe put first the nodes
for accumulators and then the nodes for temps. With this setup we can
ensure that for any given temp T, its node is always T + ACC_COUNT.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga
871b0a7f6a broadcom/compiler: don't sort nodes for register allocation
Nodes are allocated in order to registers so initially sorting
was used to ensure that nodes with smaller life ranges would
be assigned first and therefore be more likely to get
accumulators.

However, since d81a6e5f1d now we don't rely on order to make
decisions about accumulators and instead we make policy decisions
based on actual liveness, so sorting is no longer strictly
relevant to this decision.

Furthermore, we are not re-sorting nodes after each spill either,
since that would probably require that we rebuild the interference
graph after each spill (the graph identifies nodes by their index).

Shader-db results show a significant improvement in instruction
counts, due to more optimal accumulator assignments. The reason for
this is that we use a round-robin policy for choosing the next
accumulator to assign. The idea behind this is preventing nearby
temps to be assigned to the same accumulator so that QPU scheduling
is more flexible, but if we  sort our nodes, we are basically not
assigning temps in program order any more and the round-robin policy
becomes less effective:

total instructions in shared programs: 13000420 -> 12663189 (-2.59%)
instructions in affected programs: 11791267 -> 11454036 (-2.86%)
helped: 62890
HURT: 19987

total threads in shared programs: 415874 -> 415870 (<.01%)
threads in affected programs: 20 -> 16 (-20.00%)
helped: 2
HURT: 4

total uniforms in shared programs: 3711652 -> 3711624 (<.01%)
uniforms in affected programs: 43430 -> 43402 (-0.06%)
helped: 134
HURT: 173

total max-temps in shared programs: 2144876 -> 2138822 (-0.28%)
max-temps in affected programs: 123334 -> 117280 (-4.91%)
helped: 4112
HURT: 1195

total spills in shared programs: 3870 -> 3860 (-0.26%)
spills in affected programs: 1013 -> 1003 (-0.99%)
helped: 14
HURT: 12

total fills in shared programs: 5560 -> 5573 (0.23%)
fills in affected programs: 1765 -> 1778 (0.74%)
helped: 14
HURT: 17

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga
a1998a9f43 broadcom/compiler: disallow TMU spills if max tmu spills is 0
If we are compiling with a strategy that does not allow TMU spills
we should not allow spilling anything that is not a uniform.
Otherwise the RA cost/benefit algorithm may choose to spill a
temp that is not uniform and that will cause us to immediately
fail the strategy and fallback to the next one, even if we
could've instead chosen to spill more uniforms to compile the
program successfully with that strategy.

Some relevant shader-db stats:

total instructions in shared programs: 13040711 -> 13043585 (0.02%)
instructions in affected programs: 234238 -> 237112 (1.23%)
helped: 73
HURT: 172

total threads in shared programs: 415664 -> 415860 (0.05%)
threads in affected programs: 196 -> 392 (100.00%)
helped: 98
HURT: 0

total uniforms in shared programs: 3717266 -> 3721953 (0.13%)
uniforms in affected programs: 12831 -> 17518 (36.53%)
helped: 6
HURT: 100

total max-temps in shared programs: 2174177 -> 2173431 (-0.03%)
max-temps in affected programs: 4597 -> 3851 (-16.23%)
helped: 79
HURT: 21

total spills in shared programs: 4010 -> 4005 (-0.12%)
spills in affected programs: 55 -> 50 (-9.09%)
helped: 5
HURT: 0

total fills in shared programs: 5820 -> 5801 (-0.33%)
fills in affected programs: 186 -> 167 (-10.22%)
helped: 5
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga
cbb4d0dded broadcom/compiler: increase cost of TMU spills to 10
Our cost was 5 which matches the number of instructions we have to
add for a TMU spill (a fill is 4 instructions).

Uniform spills on the other hand add an extra instruction for each
fill and remove one instruction for the spill itself. These have
a cost of 1.

Therefore, if we have a single spill+fill, we end up with +9
instructions if it is a TMU spill and +0 instructions with a uniform
spill, so making the former only 5 times more costly is probably
not a good idea, and this is without even considering the added
latency of the TMU accesses.

Relevant shader-db changes show this causes as a marginal instruction
count increase in a few shaders but better thread counts and lower
TMU spilling overall:

total instructions in shared programs: 13037315 -> 13040711 (0.03%)
instructions in affected programs: 370106 -> 373502 (0.92%)
helped: 187
HURT: 321

total threads in shared programs: 415090 -> 415664 (0.14%)
threads in affected programs: 574 -> 1148 (100.00%)
helped: 287
HURT: 0

total uniforms in shared programs: 3706674 -> 3717266 (0.29%)
uniforms in affected programs: 63075 -> 73667 (16.79%)
helped: 40
HURT: 395

total max-temps in shared programs: 2176080 -> 2174177 (-0.09%)
max-temps in affected programs: 15838 -> 13935 (-12.02%)
helped: 316
HURT: 34

total spills in shared programs: 4247 -> 4010 (-5.58%)
spills in affected programs: 2599 -> 2362 (-9.12%)
helped: 107
HURT: 14

total fills in shared programs: 6121 -> 5820 (-4.92%)
fills in affected programs: 3622 -> 3321 (-8.31%)
helped: 108
HURT: 13

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga
c4a78a2d2a broadcom/compiler: fix register class patching for postponed spills
If we have a postponed spill, the temp we create at ip is no longer
the spilled temp and therefore is affected by the thrsw injection.

Fixes corruption in the additive blending animation demo from
Three.js.

Fixes: f3c3228522 ('broadcom/compiler: do not rebuild the interference graph after each spill')
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15112>
2022-02-22 11:17:10 +00:00
Iago Toral Quiroga
a4b164b57b broadcom/compiler: only patch temps that existed before the current spill
When we spill we add new temps. We should be careful not to access
liveness for these until we have re-computed it after all spills and
fill for that the spilled temp have been processed so as to avoid
out-of-bounds accesses to the c->temp_start and c->temp_end arrays.

This fixes a crash in a Three.js demo when we try to patch register
classes after a TMU spill that was caused because we would incorrectly
try to patch the same temps we had just added for the spill itself,
which is not only unnecessary but also incorrect since we these temps
would not have liveness information available yet and thus would
cause out of bounds accesses.

Fixes: f3c3228522 ('broadcom/compiler: do not rebuild the interference graph after each spill')
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15107>
2022-02-22 06:41:51 +00:00
Iago Toral Quiroga
8883975209 broadcom/compiler: drop spill_count and add spilling boolean
We added spill_count to handle uniform batch spills, which we no longer do.
What we want now is a way to know if we are spilling registers.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga
f3c3228522 broadcom/compiler: do not rebuild the interference graph after each spill
Instead, we only recompute liveness and we add new nodes and
interferences to the graph manually (we also need to patch
register classes in some cases).

To assist in this process, we also add an ip counter to our
instructions that we also recompute after each spill, which we use
to identify registers that cross thrsw boundries introduced with
TMU spills and fills and adjust their register classes accordingly
(removing their capacity to use accumulators).

This significantly reduces the CPU cost of spills. Using
shaders/closed/gputest/piano/7.shader_test as reference:

Compile time up to the first successful compile strategy in main is
~24s and with this change it is ~11s. With this speed up, we can now
try all 2-thread compile strategies (including the fallback scheduler)
in only ~15s.

A full shader-db run results in:
Total CPU time (seconds): 9904.67 -> 9087.98 (-8.25%)

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga
92d819aaa0 broadcom/compiler: fix end of TMU sequence check
We may be pipelining TMU writes and reads, in which case we can
see both TMUWT and LDTMU at the end of a TMU sequence, so we should
not assume that a TMUWT always terminates a sequence.

Also, we had a bug where we were using inst instead of scan_inst
to check if we find another TMUWT after the curent instruction.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga
40e091267d broadcom/compiler: define max number of tmu spills for compile strategies
Instead of whether they are allowed to spill or not. This is more flexible.
Also, while we are not currently enabling spilling on any 4-thread strategies,
should we do that in the future, always prefer a 4-thread compile.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Alejandro Piñeiro
d50be41f8f broadcom/compiler: remove unused macro and function definition
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13444>
2021-10-20 10:08:27 +00:00
Juan A. Suarez Romero
c98ddc778a broadcom/compiler: force a last thrsw for spilling
As we don't know if we are going to have spilling or not, emit always a
last thrsw at the end of the shader.

If later we don't have spillings and we don't need that last thrsw, we
remove it and switch back to the previous one.

This way we ensure all the spilling happens always before the last
thrsw.

v2 (Juan):
 - Rework the code to force a last thrsw and remove later if no spilling

v3:
 - Merge functionality inside vir_emit_last_thrsw (Iago)
 - Add vir_restore_last_thrsw (Juan)

v4 (Iago):
 - Fix/add new comments
 - Rename variables/parameters

v5 (Iago):
 - Fix comments
 - Add assertion

Cc: mesa-stable
Fixes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/4760
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12322>
2021-09-10 09:18:05 +00:00
Iago Toral Quiroga
c335c03ae2 broadcom/compiler: make spills of conditional writes also conditional
A spill of a conditional write generates code like this:

mov.ifa t5000, 0
mov tmud, t5000
nop t5001; ldunif (0x00008100 / 0.000000)
add tmua, t11, t5001

Here, we are spilling t5000, which has a conditional write, and we
produce an inconditional spill for it. This implicitly means that
our spill requires a correct value for all channels of t5000.

If we do a conditional spill, then we emit:

mov.ifa t5000, 0
mov tmud.ifa, t5000
nop t5001; ldunif (0x00008100 / 0.000000)
add tmua.ifa, t11, t5001

Which only uses channels of t5000 that have been written by the
instruction being spilled.

By doing the latter, we can then narrow down the liveness for t5000
more effectively, as we can use this to detect that the block only reads
(in the tmud instruction) the values that have been written previously
in the same block (in the mov instruction). This means that values in
other channels are not used, and therefore, we don't need them to be
alive at the start of the block. This means that if this is the only
write of t5000 in this block, we can consider that the block
completely defines t5000.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12278>
2021-08-10 08:47:40 +00:00
Juan A. Suarez Romero
d0e83b6174 broadcom/compiler: change current block on setting spill base
The spill base setting instructions (which includes some uniforms) are
added in the entry block, not in the current block. When ldunif
optimization is applied, the cursor is pointing to instructions in the
entry block, but the current block is a different one. This leads to a
heap-buffer-overflow when going through the list of instructions
(detected by the address sanitizer).

Thus change the current block to entry block, and restore it after the
setup is done.

This fixes
dEQP-VK.ssbo.readonly.layout.single_struct.single_buffer.std430_instance_array_comp_access_store_cols
with address sanitizer enabled.

v2:
 - Set current block instead of disabling ldunif optimization (Iago)

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12221>
2021-08-09 13:15:24 +00:00
Eric Anholt
ec3bc5da74 v3d: Use the ra_alloc_contig_reg_class() function to speed up RA.
It means we don't need to do the n^2 loop over the regs to set up the pq
values, nor do we need the register conflicts lists.

Acked-by: Erico Nunes <nunes.erico@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9437>
2021-06-04 19:08:57 +00:00
Eric Anholt
95d41a3525 ra: Use struct ra_class in the public API.
All these unsigned ints are awful to keep track of.  Use pointers so we
get some type checking.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9437>
2021-06-04 19:08:57 +00:00
Juan A. Suarez Romero
54ec9c95cf broadcom/compiler: fix dynamic-stack-buffer-overflow error
When spilling a register, the number of temps can be increased when
introducing a temporal variable.

Those nodes are not elegible to be spilled, but we need to take care of
no accessing out-of-bounds of the arrays defined with a size equal to
the original number of temps.

Fixes address sanitizer error on
KHR-GLES3.shaders.uniform_block.random.all_shared_buffer.14 (and many
others).

v2 (Iago):
 - Add clarification in assertion.
 - Use `vir_get_temp` to increase num_temps.

v3 (Iago):
 - Update clarification

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10643>
2021-05-11 07:46:17 +00:00
Iago Toral Quiroga
d81a6e5f1d broadcom/compiler: change register allocation policy for accumulators
The current policy is to always favor accumulators if possible, however,
this is not always optimal.

Particularly, accumulators play a crucial role in enabling QPU instruction
merges, since these are limited to both the ADD and the ALU instructions
addressing at most 2 physical registers. For 2-src instructions, this means
that to be able to merge we need them to address at least 2 accumulators.

While favoring accumulators does help the case for instruction merges in
general, it is risky to assign accumulators to variables that have
long life spans. Doing so will make the accumulator unavailable for
any other instructions during that life span, and since we only have a few
accumulators, we can quickly run out and losing our capacity to merge
instructions for large parts of the qpu program.

On the other hand, we also want to avoid the extreme case were we keep
allocating physical registers to the point we run out, even if we have
accumulators available, since accumulators have additional restrictions
and may not be suitable for everything.

This change continues the policy of favoring accumulators, but it only
does so if the life span of the temps is short, to ensure that we can
recycle accumulators often across instructions and avoid running out
for sections of the QPU code, unless we are already running out of
physical registers.

total instructions in shared programs: 13654647 -> 13336921 (-2.33%)
instructions in affected programs: 11015919 -> 10698193 (-2.88%)
helped: 39758
HURT: 17325
Instructions are helped.

total threads in shared programs: 412046 -> 412038 (<.01%)
threads in affected programs: 16 -> 8 (-50.00%)
helped: 0
HURT: 4
Threads are HURT.

total uniforms in shared programs: 3745726 -> 3746003 (<.01%)
uniforms in affected programs: 17296 -> 17573 (1.60%)
helped: 76
HURT: 99
Uniforms are HURT.

total max-temps in shared programs: 2364430 -> 2359942 (-0.19%)
max-temps in affected programs: 109117 -> 104629 (-4.11%)
helped: 2893
HURT: 772
Max-temps are helped.

total spills in shared programs: 5727 -> 5746 (0.33%)
spills in affected programs: 221 -> 240 (8.60%)
helped: 1
HURT: 2

total fills in shared programs: 13121 -> 13139 (0.14%)
fills in affected programs: 466 -> 484 (3.86%)
helped: 1
HURT: 2

total sfu-stalls in shared programs: 33432 -> 34491 (3.17%)
sfu-stalls in affected programs: 18219 -> 19278 (5.81%)
helped: 4459
HURT: 5087
Inconclusive result

total inst-and-stalls in shared programs: 13688079 -> 13371412 (-2.31%)
inst-and-stalls in affected programs: 11030017 -> 10713350 (-2.87%)
helped: 39630
HURT: 17429
Inst-and-stalls are helped.

total nops in shared programs: 335753 -> 333708 (-0.61%)
nops in affected programs: 112659 -> 110614 (-1.82%)
helped: 8726
HURT: 7383
Inconclusive result

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10686>
2021-05-08 13:15:42 +02:00