We could add a nir_def_bit_size() helper but we use ->bit_size about 3x
as often as nir_dest_bit_size() today so that's a major Coccinelle
refactor anyway and this doesn't make it much worse. Most of this
commit was generated byt the following semantic patch:
@@
expression D;
@@
<...
-nir_dest_bit_size(D)
+D.ssa.bit_size
...
Some manual fixup was needed, especially in cpp files where Coccinelle
tends to give up the moment it sees any interesting C++.
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24674>
Do not apply "nir_lower_fragcolor" in the common code.
This fix a crash on agxv side when a frag shader have SSBO writes.
This is caused by "nir_lower_frag_color" assuming that every
"store_deref" will have a variable backing the
output.
Signed-off-by: Mary <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
There's no good reason to allow non-immediate nesting values, and this lets us
use the (smaller) mov_imm instruction without special casing. This matches what
Metal produces, so it seems like a good preference.
total bytes in shared programs: 11720338 -> 11717310 (-0.03%)
bytes in affected programs: 2341580 -> 2338552 (-0.13%)
helped: 1385
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Now that they're in the right blocks, this is easy. Includes an informal proof
and the implementation itself is built around a finite state machine, which
together meant this code worked on its first try :~)
And hey, it's a pointless little instruction saving optimization I've wanted to
do for a while~
Major note is that this HAS to be done after register allocation, since it
doesn't update the control flow graph and would introduce critical edges
if it tried to actually deleted the else block. The intuitive reason for this is
simple: sometimes RA needs to insert instructions into the else block, even if
it was empty in the original NIR, so we always need an else block even if we can
delete it with this pass after RA.
total instructions in shared programs: 1778390 -> 1776725 (-0.09%)
instructions in affected programs: 268459 -> 266794 (-0.62%)
helped: 1013
HURT: 0
Instructions are helped.
total bytes in shared programs: 12185102 -> 12175112 (-0.08%)
bytes in affected programs: 1927524 -> 1917534 (-0.52%)
helped: 1013
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
According to Dougall's pseudocode, else_icmp operates as:
if r0l == 0:
r0l = n
elif r0l == 1:
if cc.compare(A[thread], B[thread]):
r0l = 0
else:
r0l = 1
exec_mask[thread] = (r0l == 0)
Notice that the comparison only happens when r0l == 1, that is, for threads that
are about to enter the else block. Threads that just executed the if body are
still active (r0l = 0) and skip the comparison. As such, the sources of
else_icmp are only read in the else block, and hence the whole instruction
should be placed in the else block for correctness with respect to live range
splitting.
shader-db is a wash, but shows some improvements due to correctly modelling the
liveness of the condition variable.
total instructions in shared programs: 1778376 -> 1778390 (<.01%)
instructions in affected programs: 14753 -> 14767 (0.09%)
helped: 35
HURT: 39
Inconclusive result (value mean confidence interval includes 0).
total bytes in shared programs: 12185018 -> 12185102 (<.01%)
bytes in affected programs: 101522 -> 101606 (0.08%)
helped: 35
HURT: 39
Inconclusive result (value mean confidence interval includes 0).
total halfregs in shared programs: 531174 -> 531032 (-0.03%)
halfregs in affected programs: 2320 -> 2178 (-6.12%)
helped: 40
HURT: 1
Halfregs are helped.
total threads in shared programs: 18909184 -> 18909440 (<.01%)
threads in affected programs: 1792 -> 2048 (14.29%)
helped: 2
HURT: 0
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We have an SR for it, which can save a bit of math. This came up while working
on the spiller.
total instructions in shared programs: 1778396 -> 1778376 (<.01%)
instructions in affected programs: 3036 -> 3016 (-0.66%)
helped: 10
HURT: 3
Instructions are helped.
total bytes in shared programs: 12185182 -> 12185018 (<.01%)
bytes in affected programs: 38640 -> 38476 (-0.42%)
helped: 18
HURT: 2
Bytes are helped.
total halfregs in shared programs: 531218 -> 531174 (<.01%)
halfregs in affected programs: 471 -> 427 (-9.34%)
helped: 6
HURT: 0
Halfregs are helped.
total threads in shared programs: 18909056 -> 18909184 (<.01%)
threads in affected programs: 1280 -> 1408 (10.00%)
helped: 2
HURT: 0
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
These should "just" work, promoting the 8-bit channels to 16-bit registers
internally, allowing us to use our 8-bit stores with 8-bit data vectors packed
in 16-bit registers. All other non-conversion ALU gets lowered by the previous
patch, this is just needed for simple things like nir_op_vec of lowered math
passed to a vectorized store.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We need these fencing intrinsics because our image caches aren't coherent with
memory. Furthermore, we need some sync intrinsics for imageblocks (which are
spicy images). These are a stub of what the final fragment shader interlock
implementation will look like, or what a real Metal-grade imageblock
implementation needs, but this is good enough for handling the sync requirements
with spilled render targets.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Bizarrely, the clamps/wrap modes are respected so we need to set them
appropriately for correct out-of-bounds behaviour (returning all zero). That in
turn means we can't use whatever sampler is already there, instead we need to
allocate a dedicated sampler just for txf. Good news is we have an extra sampler
state register available for the purpose.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Or cache flushes or whatever these actually are. Probably could be optimized
once we understand what the 4 individual instructions are actually doing. Fixes
dEQP-GLES31.functional.image_load_store.2d.qualifiers.*.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Often not needed and makes the NIR harder to read.
shader-db is noise.
total instructions in shared programs: 1752712 -> 1752688 (<.01%)
instructions in affected programs: 8338 -> 8314 (-0.29%)
helped: 21
HURT: 8
Inconclusive result (%-change mean confidence interval includes 0).
total bytes in shared programs: 11943572 -> 11943434 (<.01%)
bytes in affected programs: 56716 -> 56578 (-0.24%)
helped: 21
HURT: 8
Inconclusive result (%-change mean confidence interval includes 0).
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
The GPU ABI requires varyings to be grouped as follows:
- Position
- Smooth shaded fp32
- Flat shaded fp32
- Linear shaded fp32
- Smooth shaded fp16
- Flat shaded fp16
- Linear shaded fp16
- Point size
Use the flat shaded mask info we now have in the vertex shader key to
sort things properly, and pass the counts to the hardware.
FP16 is still TODO.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
We need to propagate shading model metadata from the FS to the VS in
order to correctly lay out the uniforms in the right order. This means
we need VS variants depending on this data.
We could use the existing shader info structure, but that applies to
compiled shaders which would introduce a dependency from the VS compile
to the FS compile. This information does not change with FS variants, so
we can introduce an agx_uncompiled_shader_info structure and gather it
early at precompilation time.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
When lowering discards, it will be convenient to generate the pattern:
(cond ? 255 : 0) ^ 255
Add rules to optimize that to
(cond ? 0 : 255)
This is not part of the main algebraic optimizer since this lowering happens
late.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
The SSA killer feature is that, under an "optimal" allocator, the number of
registers used (register demand) is *equal* to the number of registers required
(register pressure, the maximum number of variables simultaneously live at any
point in the program). I put "optimal" in scare quotes, because we don't need to
use the exact minimum number of registers as long as we don't sacrifice thread
count or introduce spilling, and using a few extra registers when possible can
help coalesce moves. Details-shmetails.
The problem is that, prior to this commit, our register allocator was not
well-behaved in certain circumstances, and would require an arbitrarily large
number of registers. In particular, since different variables have different
sizes and require contiguous allocation, in large programs the register file may
become fragmented, causing the RA to use arbitrarily many registers despite
having lots of registers free.
The solution is vector live range splitting. First, we calculate the register
pressure (the minimum number of registers that it is theoretically possible to
allocate successfully), and round up to the maximum number of registers we will
actually use (to give some wiggle room to coalesce moves). Then, we will treat
this maximum as a *bound*, requiring that we don't use more registers than
chosen. In the event that register file fragmentation prevents us from finding a
contiguous sequence of registers to allocate a variable, rather than giving up
or using registers we don't have, we shuffle the register file around
(defragmenting it) to make room for the new variable. That lets us use a
few moves to avoid sacrificing thread count or introducing spilling, which is
usually a great choice.
Android GLES3.1 shader-db results are as expected: some noise / small
regressions for instruction count, but a bunch of shaders with improved thread
count. The massive increase in register demand may seem weird, but this is the
RA doing exactly what it's supposed to: using more registers if and only if they
would not hurt thread count. Notice that no programs whatsoever are hurt for
thread count, which is the salient part.
total instructions in shared programs: 1781473 -> 1781574 (<.01%)
instructions in affected programs: 276268 -> 276369 (0.04%)
helped: 1074
HURT: 463
Inconclusive result (value mean confidence interval includes 0).
total bytes in shared programs: 12196640 -> 12201670 (0.04%)
bytes in affected programs: 1987322 -> 1992352 (0.25%)
helped: 1060
HURT: 513
Bytes are HURT.
total halfregs in shared programs: 488755 -> 529651 (8.37%)
halfregs in affected programs: 295651 -> 336547 (13.83%)
helped: 358
HURT: 9737
Halfregs are HURT.
total threads in shared programs: 18875008 -> 18885440 (0.06%)
threads in affected programs: 64576 -> 75008 (16.15%)
helped: 82
HURT: 0
Threads are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23832>