Search for code like
if ... {
break
}
and replace with a break_if pseudo-instruction for optimized handling, since the
break_if lowering is better than the original code.
total instructions in shared programs: 1764596 -> 1764350 (-0.01%)
instructions in affected programs: 24540 -> 24294 (-1.00%)
helped: 78
HURT: 0
Instructions are helped.
total bytes in shared programs: 11611196 -> 11610212 (<.01%)
bytes in affected programs: 166458 -> 165474 (-0.59%)
helped: 78
HURT: 0
Bytes are helped.
shader-db probably understates the benefit here, since this optimizes the body
of loops.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
We use it for two different things. Pseudo-instructions are cheap, split it up
for easier optimization passes. This also fixes the schedule classes.. we can
move the cf_begin around if we want, it's inert.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
They're more trouble than they're worth for us. They were originally lifted
unthinkingly from ACO, where I assume they're necessary for software CF
lowering, but they're just an inconvenient convenience for us. Remove em.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
If we bound 4 render targets but we only write to 1 of them, the other 3 need
their contents preserved. This requires either properly configuring HSR to
implement colour masking (TODO) or using the big hammer of setting TRANSLUCENT.
This patch picks the latter for now.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
The debug flags are already plumbed into driver_flags for the disk
cache, so we just need to actually allow some flags instead of bailing
out of the disk cache init.
We only care about no16 for production right now, and it's probably a
good idea to disable disk caching during most debug sessions, so
allowlist only that one.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Reviewed-by: Eric Engestrom <eric@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
There are way too many broken WebGL apps using the wrong precision
qualifiers, which causes anything from jittery geometry to complete
breakage (e.g. QuakeJS and other games).
In addition, a Firefox bug is breaking basic canvas rendering for the
same reason (mozilla bug #1845309).
Let's just disable fp16 for browsers. There is no hope of getting all
this broken stuff fixed.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Eric Engestrom <eric@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
Since we register allocate in SSA, the number of registers required (register
demand) equals to the maximum number of simultaneous live values (register
pressure). So if we can reduce register pressure, we are guaranteed to reduce
register demand. Even an ineffective heuristic like randomly swapping
instructions can only reduce pressure as long as it's conservative.
This implements one such heuristic: in each block, schedule backwards, selecting
the free instruction that looks like it will reduce liveness the most. In other
words, the greedy algorithm to reduce register pressure. At the end of the
block, if we haven't actually reduced pressure, we bail. This isn't optimal, but
it's well-motivated and optimally handles special cases (like 0-source
instructions).
This is based on the scheduler I originally wrote for Mali.
In my Dolphin ubershader branch, this improved performance at native 4K by 10fps
(105fps->115fps) when I measured together with some other optimizations. On top
of my current next (which notably includes nir_opt_sink improvements), this
commit alone goes (53fps->54fps) which is considerably less impressive :-p
shader-db results are a win, but not as large as we might hope. Instruction
count win seems to be from the smaller live ranges being easier on RA (fewer
swaps / moves). The two shaders affected for thread count are from fifa mobile,
which go from 640 threads ->
1024 (full occupancy). In other words... this heuristic does an excellent job in
a small subset of shaders. The Dolphin ubershader win was real, though :~)
Note these shader-db wins are on top of a branch with the nir_opt_sink
improvements. Without that, the stats are much better... The schedulers have
some overlap, but they're better together.
total instructions in shared programs: 1766635 -> 1763496 (-0.18%)
instructions in affected programs: 445855 -> 442716 (-0.70%)
helped: 1963
HURT: 350
Instructions are helped.
total bytes in shared programs: 11597648 -> 11586924 (-0.09%)
bytes in affected programs: 3106230 -> 3095506 (-0.35%)
helped: 2003
HURT: 374
Bytes are helped.
total halfregs in shared programs: 504609 -> 481980 (-4.48%)
halfregs in affected programs: 138322 -> 115693 (-16.36%)
helped: 3405
HURT: 311
Halfregs are helped.
total threads in shared programs: 18839936 -> 18840704 (<.01%)
threads in affected programs: 1280 -> 2048 (60.00%)
helped: 2
HURT: 0
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
Some special registers imply scheduling constraints. We want to have a single
scheduling class per instruction in the IR, so fork off various get_sr variants
depending on what kind of SR we're reading, and validate that we use the right
kind.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
Copying the state below overwrote the ms.sample_locations we set,
so our new_sample_locations was never actually used and we were
accidentally doing a shallow copy. Turnip passes a stack-allocated
old_state, so this resulted in invalid stack pointers.
Fixes: f497cc9d56 ("vk/graphics_state: Add helpers for pre-baking state")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25031>
On a7xx DI_SRC_SEL_AUTO_INDEX is used instead of DI_SRC_SEL_AUTO_XFB.
On a7xx the counter value and offset are shifted right by 2, so
the vertexStride should also be in units of dwords.
CTS doesn't test this though.
Fixes:
dEQP-VK.transform_feedback.simple.draw_indirect_*
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23217>
As r3d_ops emits sysmem draws directly, it needs to be manually
updated to emit the A7XX commands instead of A6XX.
VK-CTS tests success on A630 + A740:
dEQP-VK.api.copy_and_blit.core.blit_image.*
Signed-off-by: Mark Collins <mark@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23217>