nir: Be smarter fusing ffma

If there is a single use of an fmul, and that single use is an fadd, it
makes sense to fuse into an ffma, as we already do. However, if there
are multiple uses, fusing may impede code generation. Consider this
source fragment:

   a = fmul(x, y)
   b = fadd(a, z)
   c = fmin(a, t)
   d = fmax(b, c)

The fmul has two uses. The current ffma fusing is greedy and will
produce the following "optimized" code.

   a = fmul(x, y)
   b = ffma(x, y, z)
   c = fmin(a, t)
   d = fmax(b, c)

Actually, this code is worse! Instead of 1 fmul + 1 fadd, we now have 1
fmul + 1 ffma: in effect, two multiplies (plus a fused add) instead of
one multiply and one add. Depending on the ISA, that can impede
scheduling or increase code size. It can also increase register
pressure by extending live ranges.

It's tempting to gate the fusing on is_used_once, but that would hurt
cases where we really do want to fuse everything, e.g.:

   a = fmul(x, y)
   b = fadd(a, z)
   c = fadd(a, t)

For ISAs that benefit from ffma, we expect 2 ffmas to be faster than 1
fmul + 2 fadds. So what we really want is to fuse into ffma if and only
if the fmul will get deleted. That happens exactly when every use of
the fmul is an fadd that will itself get fused into an ffma, leaving
the fmul to be dead-code eliminated. That's easy to implement with a
new NIR search helper that checks that all uses are fadds.
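To make the criterion concrete, here is a toy sketch of the decision (a hypothetical Python model, not the actual NIR data structures or pass; `Instr`, `uses_of`, and `fuse` are illustrative names): fuse an fadd of an fmul into an ffma only when every use of that fmul is an fadd, then let the now-unused fmul fall to dead-code elimination.

```python
# Toy SSA model of the fusing criterion: fuse fadd(fmul(x, y), z) into
# ffma(x, y, z) only when *every* use of the fmul is an fadd, so the
# fmul is guaranteed to go dead afterwards. Illustrative only.
from collections import namedtuple

Instr = namedtuple("Instr", ["name", "op", "srcs"])

def uses_of(program, name):
    """All instructions that read `name` as a source."""
    return [i for i in program if name in i.srcs]

def is_only_used_by_fadd(program, name):
    users = uses_of(program, name)
    return bool(users) and all(i.op == "fadd" for i in users)

def fuse(program):
    """Rewrite fadd(fmul(...), z) -> ffma(...) when profitable,
    then dead-code eliminate any fmul left with no uses."""
    by_name = {i.name: i for i in program}
    out = []
    for i in program:
        if i.op == "fadd":
            # Simplification: only look at the first source; the real
            # NIR patterns also handle the commuted operand order.
            a = by_name.get(i.srcs[0])
            if a is not None and a.op == "fmul" and \
               is_only_used_by_fadd(program, a.name):
                out.append(Instr(i.name, "ffma", a.srcs + i.srcs[1:]))
                continue
        out.append(i)
    return [i for i in out if not (i.op == "fmul" and not uses_of(out, i.name))]

# Example 1: the fmul has a non-fadd use (fmin), so nothing is fused
# and we keep 1 fmul + 1 fadd rather than 1 fmul + 1 ffma.
prog1 = [
    Instr("a", "fmul", ["x", "y"]),
    Instr("b", "fadd", ["a", "z"]),
    Instr("c", "fmin", ["a", "t"]),
    Instr("d", "fmax", ["b", "c"]),
]
assert [i.op for i in fuse(prog1)] == ["fmul", "fadd", "fmin", "fmax"]

# Example 2: every use is an fadd, so both fuse and the fmul dies.
prog2 = [
    Instr("a", "fmul", ["x", "y"]),
    Instr("b", "fadd", ["a", "z"]),
    Instr("c", "fadd", ["a", "t"]),
]
assert [i.op for i in fuse(prog2)] == ["ffma", "ffma"]
```

The sketch omits NIR details such as exact-math flags, bit sizes, and multi-block use chains; it only demonstrates why the all-uses-are-fadd test is the right profitability gate.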

shader-db results on Mali-G57 [open shader-db + subset of closed]:

total instructions in shared programs: 179491 -> 178991 (-0.28%)
instructions in affected programs: 36862 -> 36362 (-1.36%)
helped: 190
HURT: 27

total cycles in shared programs: 10573.20 -> 10571.75 (-0.01%)
cycles in affected programs: 72.02 -> 70.56 (-2.02%)
helped: 28
HURT: 1

total fma in shared programs: 1590.47 -> 1582.61 (-0.49%)
fma in affected programs: 319.95 -> 312.09 (-2.46%)
helped: 194
HURT: 1

total cvt in shared programs: 812.98 -> 813.03 (<.01%)
cvt in affected programs: 118.53 -> 118.58 (0.04%)
helped: 65
HURT: 81

total quadwords in shared programs: 98968 -> 98840 (-0.13%)
quadwords in affected programs: 2960 -> 2832 (-4.32%)
helped: 20
HURT: 4

total threads in shared programs: 4693 -> 4697 (0.09%)
threads in affected programs: 4 -> 8 (100.00%)
helped: 4
HURT: 0

v2: Update trace checksums for virgl due to numerical differences.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18814>
commit ac2964dfbd (parent 07c654e08f)
Author: Alyssa Rosenzweig
Date: 2022-10-15 13:39:26 -04:00
Committed by: Marge Bot
4 changed files with 27 additions and 9 deletions


@@ -2589,10 +2589,10 @@ late_optimizations = [
    # re-combine inexact mul+add to ffma. Do this before fsub so that a * b - c
    # gets combined to fma(a, b, -c).
-   (('~fadd@16', ('fmul', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma16'),
-   (('~fadd@32', ('fmul', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma32'),
-   (('~fadd@64', ('fmul', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma64'),
-   (('~fadd@32', ('fmulz', a, b), c), ('ffmaz', a, b, c), 'options->fuse_ffma32'),
+   (('~fadd@16', ('fmul(is_only_used_by_fadd)', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma16'),
+   (('~fadd@32', ('fmul(is_only_used_by_fadd)', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma32'),
+   (('~fadd@64', ('fmul(is_only_used_by_fadd)', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma64'),
+   (('~fadd@32', ('fmulz(is_only_used_by_fadd)', a, b), c), ('ffmaz', a, b, c), 'options->fuse_ffma32'),

    # Subtractions get lowered during optimization, so we need to recombine them
    (('fadd@8', a, ('fneg', 'b')), ('fsub', 'a', 'b'), 'options->has_fsub'),


@@ -422,6 +422,24 @@ is_only_used_as_float(const nir_alu_instr *instr)
    return true;
 }
 
+static inline bool
+is_only_used_by_fadd(const nir_alu_instr *instr)
+{
+   nir_foreach_use(src, &instr->dest.dest.ssa) {
+      const nir_instr *const user_instr = src->parent_instr;
+
+      if (user_instr->type != nir_instr_type_alu)
+         return false;
+
+      const nir_alu_instr *const user_alu = nir_instr_as_alu(user_instr);
+      assert(instr != user_alu);
+
+      if (user_alu->op != nir_op_fadd)
+         return false;
+   }
+
+   return true;
+}
+
 static inline bool
 only_lower_8_bits_used(const nir_alu_instr *instr)
 {


@@ -21,7 +21,7 @@ traces:
       checksum: c377f21f7bfaca0c04983612e7c9a7bb
   gputest/pixmark-piano-v2.trace:
     gl-virgl:
-      checksum: a4f3552e26c31a6d143519ee7ad47eea
+      checksum: 495ae47d50672d095854765bdb2eedc5
   gputest/triangle-v2.trace:
     gl-virgl:
       checksum: 5f694874b15bcd7a3689b387c143590b
@@ -30,7 +30,7 @@ traces:
       checksum: 32e8b627a33ad08d416dfdb804920371
   0ad/0ad-v2.trace:
     gl-virgl:
-      checksum: 784d20f0166ef66b4b65f25f2858a5ee
+      checksum: 78007615359981f7035f26c5b7759229
   glmark2/buffer:update-fraction=0.5:update-dispersion=0.9:columns=200:update-method=map:interleave=false-v2.trace:
     gl-virgl:
       checksum: 040232e01e394a967dc3320bb9252870
@@ -126,7 +126,7 @@ traces:
       label: [crash]
   gputest/pixmark-julia-fp32-v2.trace:
     gl-virgl:
-      checksum: fbf5e44a6f46684b84e5bb5ad6d36c67
+      checksum: 0aa3a82a5b849cb83436e52c4e3e95ac
   gputest/pixmark-julia-fp64-v2.trace:
     gl-virgl:
       checksum: 1760aea00af985b8cd902128235b08f6


@@ -123,7 +123,7 @@ traces:
       label: [crash]
   gputest/pixmark-julia-fp32-v2.trace:
     gl-virgl:
-      checksum: 8b3584b1dd8f1d1bb63205564bd78e4e
+      checksum: 25f938c726c68c08a88193f28f7c4474
   gputest/pixmark-julia-fp64-v2.trace:
     gl-virgl:
       checksum: 73ccaff82ea764057fb0f93f0024cf84
@@ -159,7 +159,7 @@ traces:
       checksum: 37780a6eaa38a55700e8207e89009f56
  neverball/neverball-v2.trace:
    gl-virgl:
-      checksum: cc11743f008ccd76adf72695a423436a
+      checksum: 0b8ae7dd4f7f26c3278ded8a5694b983
  pathfinder/canvas_moire-v2.trace:
    gl-virgl:
      checksum: 25ba8f18274126670311bd3ffe058f74