I've been doing this inside of vc4, but vc5 wants it as well and it may be
useful for other drivers (Intel has a related path for pre-gen6 with MRT,
and freedreno had a TGSI path for it at one point).
This required defining a common enum for the standard comparison
functions, but other lowering passes are likely to also want that enum.
v2: Add to meson.build as well.
Acked-by: Rob Clark <robdclark@gmail.com>
By leaving the compiled shader in the context's stage state, the next
compile of a new FS would look in the old compiled FS for figuring out
whether to set various dirty flags for the VS compile. Clear out the
pointer when deleting the program, and make sure that we always mark the
state as dirty if the previous program had been lost. Fixes valgrind
warnings on glsl-max-varyings.
Fixes: 2350569a78 ("vc4: Avoid VS shader recompiles by keeping a set of FS inputs seen so far.")
This allows drivers more freedom in how exactly they want to lower I/O,
e.g. first lowering I/O to temporaries.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Commit e1af20f18a changed the shader_info
from being embedded into being just a pointer. The idea was that
sharing the shader_info between NIR and GLSL would be easier if it were
a pointer pointing to the same shader_info struct. This, however, has
caused a few problems:
1) There are many things which generate NIR without GLSL. This means
we have to support both NIR shaders which come from GLSL and ones
that don't and need to have an info elsewhere.
2) The solution to (1) raises all sorts of ownership issues which have
to be resolved with ralloc_parent checks.
3) Ever since 00620782c9, we've been
using nir_gather_info to fill out the final shader_info. Thanks to
cloning and the above ownership issues, the nir_shader::info may not
point back to the gl_shader anymore and so we have to do a copy of
the shader_info from NIR back to GLSL anyway.
All of these issues go away if we just embed the shader_info in the
nir_shader. There's a little downside of having to copy it back after
calling nir_gather_info but, as explained above, we have to do that
anyway.
Acked-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The NIR story on conversion opcodes is a mess. We've had way too many
of them, naming is inconsistent, and which ones have explicit sizes was
sort-of random. This commit re-organizes things and makes them all
consistent:
- All non-bool conversion opcodes now have the explicit size in the
destination and are named <src_type>2<dst_type><size>.
- Integer <-> integer conversion opcodes now only come in i2i and u2u
forms (i2u and u2i have been removed) since the only difference
between the different integer conversions is whether or not they
sign-extend when up-converting.
- Boolean conversion opcodes all have the explicit size on the bool and
are named <src_type>2<dst_type>.
Making things consistent also allows nir_type_conversion_op to be moved
to nir_opcodes.c and auto-generated using mako. This will make adding
int8, int16, and float16 versions much easier when the time comes.
Reviewed-by: Eric Anholt <eric@anholt.net>
This reduces register pressure in both types of shaders, by reordering the
input loads from the var->data.driver_location order to whatever order
they appear first in the NIR shader. These instructions aren't
reorderable at our QIR scheduling level because the FS takes two in
lockstep to do an interpolation, and the VS takes multiple read
instructions in a row to get a whole vec4-level attribute read.
shader-db impact:
total instructions in shared programs: 76666 -> 76590 (-0.10%)
instructions in affected programs: 42945 -> 42869 (-0.18%)
total max temps in shared programs: 9395 -> 9208 (-1.99%)
max temps in affected programs: 2951 -> 2764 (-6.34%)
Some programs get their max temps hurt, depending on the order that the
load_input intrinsics appear, because we end up being unable to copy
propagate an older VPM read into its only use.
gallium's blitter expects that it can set the sample mask even when the
rasterizer doesn't have the flag on.
Between this and the previous test, 10 new ext_framebuffer_multisample
tests start passing.
The kernel will reject our shader if we emit one here, and having 4, 8, or
12 as the top end of our UBO clamp rare is enough that it's not worth
making the kernel let us.
Fixes piglit fs-const-array-of-struct and
fs-const-array-of-struct-of-array since recent GLSL linking changes made
us get this as an indirect load of a uniform, instead of a tempoary.
Cc: "13.0 17.0" <mesa-stable@lists.freedesktop.org>
shader-db results:
total instructions in shared programs: 92611 -> 91764 (-0.91%)
instructions in affected programs: 27417 -> 26570 (-3.09%)
The star is one shader in glmark2's terrain (drops 16% of its
instructions), but there are also wins in mupen64plus and glb2.7.
git grep -l comparitor | xargs sed -i 's/comparitor/comparator/g'
Just happened to notice this in a patch that was sent and included one
of the tokens in question.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Acked-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This should be a win for most loops, which tend to have uniform control
flow.
More importantly, it exposes important information to live variables: that
the break/continue here means that our jump target may have access to
values that were live on our input. Previously, we were just setting the
exec mask and letting control flow fall through, so an intervening def
between the break and the end of the loop would appear to live variables
as if it screened off the variable, when it didn't actually.
Fixes a regression in glsl-vs-loop-redundant-condition.shader_test when a
perturbing of register allocation caused a live variable to get stomped.
Cc: 13.0 <mesa-stable@lists.freedesktop.org>
Every caller was dereffing the qinst, and this will let us make the number
of sources vary depending on the destination of the qinst so that we can
have general ALU ops that store to tex_[strb] and get an implicit uniform.
Multithreaded fragment shaders let us hide texturing latency by a
hyperthreading-style switch to another fragment shader. This gets us up
to 20% framerate improvements on glmark2 tests.
We don't allow the last thread switch to be inside control flow, to be
sure that we hit the last state exactly once. If the last texturing was
in control flow, fall back to single threaded.
It's much better to just skip the draw call entirely. Getting this
information out of register allocation will also be useful for
implementing threaded fragment shaders, which will need to retry
non-threaded if RA fails.
Cc: <mesa-stable@lists.freedesktop.org>
The 1/W was apparently not accurate enough, and we were getting sparklies
in the distance. The closed driver also did a N-R step here.
Cc: <mesa-stable@lists.freedesktop.org>
No change in behavior. ralloc_size is equivalent to rzalloc_size.
That will change though.
Calls not switched to rzalloc_size:
- ralloc_vasprintf
- glsl_type::name allocation (it's filled with snprintf)
- C++ classes where valgrind didn't show uninitialized values
I switched most of non-glsl stuff to rzalloc without checking whether
it's really needed.
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
When restoring something from shader cache we won't have and don't
want to create a nir_shader this change detaches the two.
There are other advantages such as being able to reuse the
shader info populated by GLSL IR.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Getting stores to NIR regs to not generate new MOVs is tricky, since the
result we're trying to store into the NIR reg may have been from a
conditional update of a temp, or a series of packed writes. The easiest
solution seems to be to require that nir_store_dest()'s arg comes from an
SSA temp.
This causes us to put in a few more temporary MOVs in the NIR SSA dest
case, but copy propagation successfully cleans those up.
The shader-db change is modest:
total instructions in shared programs: 93774 -> 93598 (-0.19%)
instructions in affected programs: 14760 -> 14584 (-1.19%)
total estimated cycles in shared programs: 212135 -> 211946 (-0.09%)
estimated cycles in affected programs: 27005 -> 26816 (-0.70%)
but I was seeing patterns in some register-allocation failures in DEQP
tests that looked like the extra MOVs would increase maximum register
pressure in loops. Some debug code indicates that that's not the case,
though I'm still a bit confused by that result.
Rather than having simulator mode changes scattered around vc4_bufmgr.c
and vc4_screen.c, make vc4_bufmgr.c just call a vc4_simulator_ioctl, which
then dispatches to a corresponding implementation.
This will give the simulator support a centralized place to do tricks like
storing most BOs directly in simulator memory rather than copying in and
out.
This leaves special casing of mmaping BOs and execution, because of the
winsys mapping.
If a conditional assignment is only conditioned on the exec mask, that's
still screening off the value in the executed channels (and, since we're
not storing to the unexcuted channels, we don't care what's in there).
Fixes a bunch of extra register pressure on Processing's Ribbons demo,
which is failing to allocate.
One of NIR's invariants is that control flow lists always start and end
with blocks. There's no good reason why we should return a cf_node from
these functions since we know that it's always a block. Making it a block
lets us remove a bunch of code.
Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
VC4 was running into a major performance regression from enabling control
flow in the glmark2 conditionals test, because of short if statements
containing an ffract.
This pass seems like it was was trying to ensure that we only flattened
IFs that should be entirely a win by guaranteeing that there would be
fewer bcsels than there were MOVs otherwise. However, if the number of
ALU ops is small, we can avoid the overhead of branching (which itself
costs cycles) and still get a win, even if it means moving real
instructions out of the THEN/ELSE blocks.
For now, just turn on aggressive flattening on vc4. i965 will need some
tuning to avoid regressions. It does looks like this may be useful to
replace freedreno code.
Improves glmark2 -b conditionals:fragment-steps=5:vertex-steps=0 from 47
fps to 95 fps on vc4.
vc4 shader-db:
total instructions in shared programs: 101282 -> 99543 (-1.72%)
instructions in affected programs: 17365 -> 15626 (-10.01%)
total uniforms in shared programs: 31295 -> 31172 (-0.39%)
uniforms in affected programs: 3580 -> 3457 (-3.44%)
total estimated cycles in shared programs: 225182 -> 223746 (-0.64%)
estimated cycles in affected programs: 26085 -> 24649 (-5.51%)
v2: Update shader-db output.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> (v1)
In the GLSL-to-NIR conversion of VC4, I had a bit of trouble with what I
was calling the "state uniforms" that I was putting into the NIR fighting
with its other lowering passes. Instead of using magic uniform base
numbers in the backend, follow the lead of load_user_clip_plane and just
define system values for them.
v2: Fix unintended change to channel_num, drop unspecified const_index
value on blend_const_color_r_float.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>