So far the support for R/B swapping in vertex attributes was only for
the generic attributes.
But there are cases, like glSecondaryColorPointer() supporting BGRA
formats, that require R/B swapping to also be allowed for the
non-generic vertex attributes (in this case, the COLOR1 attribute).
v2:
- Don't split line (Iago)
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7196>
It is not used right now, so keeping it adds some noise/confusion.
So far, configuring the Z test is done through CFG_BITS. See
v3dX(emit_state) in v3dx_emit.c for v3d, and pack_cfg_bits in
v3dv_pipeline.c for v3dv. That is where flags like z_updates_enable and
others are filled in.
That key field seems like a leftover coming from using vc4 as
reference, as that driver defines and uses a field with the same name.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7421>
Some shaders that need to spill hundreds of registers can take a very long
time to compile, as each allocation attempt spills a single register and
restarts the allocation process. We can significantly cut down these times if
we allow the compiler to spill in batches, which should be possible when we
are spilling uniforms, which are in fact the kind of spills we do first
because they have a lower cost than TMU spills.
Doing this could cause us to slightly over-spill in some cases (depending on
the chosen batch size), leading to slightly worse performance, so we only
enable this behavior after we have started to spill over a certain threshold,
at which point we assume that performance won't be good and we want to
favor compilation speed instead.
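The batching logic is roughly as follows (a sketch with hypothetical helper
names, not the actual implementation):

   /* Once past the threshold, spill up to a whole batch of uniform
    * candidates per allocation attempt instead of a single one.
    */
   int max_batch = (c->spill_count < SPILL_BATCH_THRESHOLD) ? 1 : 20;
   for (int i = 0; i < max_batch; i++) {
      int node = choose_spill_node(c);           /* hypothetical */
      if (node < 0 || !spill_is_uniform(c, node))
         break;                                  /* only batch uniforms */
      spill_node(c, node);                       /* hypothetical */
   }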
v2:
- Keep it simple and just try to spill a fixed number of registers in a
batch instead of trying to compute this dynamically based on accumulated
spills and current register pressure. (Eric).
v3:
- Check if the node is valid before doing anything with it.
- Drop the environment variable to select batch size and just fix it to 20.
With this we can take this CTS test from 35 minutes down to about 3 minutes:
dEQP-VK.ssbo.layout.random.all_shared_buffer.5
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>
For example, regarding gl_SampleID, the GLSL spec states:
"Any static use of this variable in a fragment shader causes the
entire shader to be evaluated per-sample."
So we need to track in the compiler whether the fragment shader does
anything that implicitly enables per-sample shading, so that the driver
can auto-enable sample rate shading if needed.
v2:
- Instead of tracking reads of gl_SampleID, check SYSTEM_BIT_SAMPLE_ID
and SYSTEM_BIT_SAMPLE_POS as well as the sample layout qualifier like
other drivers are doing to activate this behavior (Eric).
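A minimal sketch of that check, assuming the shader_info field and the
SYSTEM_BIT_* names from Mesa's shader_enums.h:

   bool per_sample_shading =
      (nir->info.system_values_read &
       (SYSTEM_BIT_SAMPLE_ID | SYSTEM_BIT_SAMPLE_POS)) ||
      nir->info.fs.uses_sample_qualifier;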
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> (v1)
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>
We will need this in Vulkan to support the vertex format
VK_FORMAT_B8G8R8A8_UNORM. The hardware doesn't allow swizzling
vertex attribute components, so we need to do it in the shader.
v2:
- Use nir_intrinsic_io_semantics() to retrieve the location instead
of looping through the shader input variables (Eric).
- Assert that we only have one component (Eric).
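With scalarized inputs this boils down to remapping component 0 <-> 2 of
the load. A sketch, where bgra_attr_mask is an assumed per-attribute mask
provided by the driver:

   unsigned location = nir_intrinsic_io_semantics(intr).location;
   unsigned comp = nir_intrinsic_component(intr);
   assert(intr->num_components == 1);
   if ((bgra_attr_mask & (1u << location)) && (comp == 0 || comp == 2))
      nir_intrinsic_set_component(intr, comp ^ 2); /* swap R and B */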
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> (v1)
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>
Vulkan lowers gl_InstanceIndex to load_base_instance +
load_instance_id, so we need to implement loading the base instance in
the compiler.
The base instance is set by the BASE_VERTEX_BASE_INSTANCE command
right before the instanced draw call and it is included in the VPM
payload together with the InstanceID and VertexID if this is requested
by the shader record.
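On the compiler side this mirrors how InstanceID is already handled; a
sketch (the biv field name is an assumption, following the existing iid
used for InstanceID):

   case nir_intrinsic_load_base_instance:
      ntq_store_dest(c, &instr->dest, 0, vir_MOV(c, c->biv));
      break;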
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>
On OpenGL we would need to update values for all the textures used. On
OpenGL that value can always be taken from the context or the NIR
shader, but there are cases on Vulkan where that is not the case, or
where it would force us to recompute it.
Acked-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>
To ask to debug a register allocation failure
(V3D_DEBUG_REGISTER_ALLOCATION seemed too long to me).
When the fallback register allocation algorithm was added, a register
allocation failure would only dump the current VIR with the register
pressure info for the failed fallback. But if we want to debug the
problem, we would be interested in both.
Additionally, it was strange that we got the full VIR dump on failure
even if no debug option was set.
We also add shaderdb-like stats for those failures, to make it easier
to compare one with the other.
v2: keep a small warning message in case both register allocation
algorithms fail (Neil)
Reviewed-by: Neil Roberts <nroberts@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6999>
v3d_compile is now split out into a helper function that gets called a
second time if the first compilation fails with a result reporting that
register allocation failed. The second time it is run with the fallback
scheduler to try to increase the chances of successfully allocating the
registers.
v2: Add a performance debug message when using the fallback scheduler.
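The flow is roughly the following (try_compile and the compilation_result
field are hypothetical names; the status enum itself is described in the
commit below):

   uint64_t *qpu = try_compile(c, false /* fallback scheduler */);
   if (!qpu && c->compilation_result ==
               V3D_COMPILATION_FAILED_REGISTER_ALLOCATION) {
      if (V3D_DEBUG & V3D_DEBUG_PERF)
         fprintf(stderr, "Failed RA, trying fallback scheduler.\n");
      qpu = try_compile(c, true /* fallback scheduler */);
   }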
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5953>
Instead of just having a bool status for the failure, there is now an
enum so that the compilation can report a more detailed status.
Currently this is only used to report whether the failure was due to
failed register allocation. The “failed” bool doesn’t seem to actually
have been used anywhere so this doesn’t really change a lot.
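Roughly, the new status looks like this (a sketch; exact enumerators may
differ):

   enum v3d_compilation_result {
      V3D_COMPILATION_SUCCEEDED,
      V3D_COMPILATION_FAILED_REGISTER_ALLOCATION,
      V3D_COMPILATION_FAILED,
   };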
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5953>
When line smoothing is enabled, the driver now increases the width of
the line so that it can add some semi-transparent pixels to either side
of the line. A lowering pass is added which modifies the alpha component
of every write to fragment output 0 so that if the fragment is outside
the width of the line then the alpha is reduced. It additionally
discards fragments that are completely invisible. It might seem bad to
use discard on a tiled renderer but the assumption is that any bad
effects from using discard will also happen anyway because of enabling
alpha blending.
v2: Disable the line smoothing pass entirely when the framebuffer
contains an integer colour output or one with no alpha channel.
Calculate the coverage once upfront and store in a global variable
instead of calculating each time an output write is modified. Also
do the conditional discard once upfront.
v3: Don’t check whether the output buffer has an alpha channel. Only
look at output 0. Use aa_line_width intrinsic instead of calculating
the real line width in the shader. Clamp the coverage as part of the
global variable, not per output write.
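The shape of the injected code, in GLSL-like pseudocode
(compute_coverage() stands in for the actual distance computation, which
is elided here):

   float coverage = clamp(compute_coverage(), 0.0, 1.0); /* global, once */
   if (coverage == 0.0)
      discard;                                          /* once, upfront */
   ...
   color.a *= coverage;         /* applied to every write to output 0 */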
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5624>
When geometry shaders write a value to gl_Layer that doesn't correspond to
an existing layer in the target framebuffer, the rendering behavior is
undefined according to the spec. However, there are CTS tests that trigger
this scenario on purpose, probably to ensure that nothing terrible happens.
For V3D, this situation is problematic because the binner uses the layer
index to select the offset to write into the tile state data, and we only
allocate tile state for MAX2(num_layers, 1), so we want to make sure we
don't produce values that would lead to out-of-bounds writes. The simulator
has an assert to catch this, and although we haven't observed issues on
actual hardware, it is probably best to play it safe.
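The clamp itself is tiny; a sketch with hypothetical variable names, using
Mesa's MIN2/MAX2 macros:

   layer = MIN2(layer, MAX2(num_layers, 1) - 1);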
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Geometry shaders can output many vertices and thus can have much higher VPM
memory pressure. It is possible for too-wide geometry shader dispatches to
exceed the maximum available VPM output space, in which case we need
to reduce the dispatch width until the VPM memory requirements fit.
Supported dispatch widths for geometry shaders are 16, 8, 4, 1.
There is a limit on the number of VPM output sectors that can be used by a
geometry shader, which we can meet by lowering the dispatch width at compile
time. However, at draw time we need to revisit this number and, together with
other elements that can contribute to total VPM memory requirements, decide
on a configuration that can fit the program into the available VPM memory.
Ideally, we also want to aim for not using more than half of the available
memory so that we can run a pair of bin and render programs in parallel.
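The compile-time part is a simple walk down the supported widths; a sketch
with hypothetical helper and field names:

   static const uint8_t gs_widths[] = { 16, 8, 4, 1 };

   unsigned i = 0;
   while (i < ARRAY_SIZE(gs_widths) - 1 &&
          gs_vpm_output_sectors(c, gs_widths[i]) > max_vpm_sectors)
      i++;
   c->gs_dispatch_width = gs_widths[i];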
v2: fixed language in comment and typo in commit log. (Alejandro)
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Most of the relevant work happens in v3d_nir_lower_io. Since
geometry shaders can write any number of output vertices, this pass
injects a few variables into the shader code to keep track of things
like the number of vertices emitted or the offsets into the VPM
of the current vertex output, etc. This is also where we handle the
EmitVertex() and EndPrimitive() intrinsics.
The geometry shader VPM output layout has a specific structure
with a 32-bit general header, then another 32-bit header slot for
each output vertex, and finally the actual vertex data.
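That is, the VPM output looks like this:

   1 x 32-bit general header
   N x 32-bit per-vertex header (one per output vertex)
   N x vertex data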
When vertex shaders are paired with geometry shaders we also need
to consider the following:
- Only geometry shaders emit fixed function outputs.
- The coordinate shader used for the vertex stage during binning must
not drop varyings other than those used by transform feedback, since
these may be read by the binning GS.
v2:
- Use MAX3 instead of a chain of MAX2 (Alejandro).
- Make all loop variables unsigned in ntq_setup_gs_inputs (Alejandro)
- Update comment in IO lowering so it includes the GS stage (Alejandro)
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Until now this made sense because we always paired vertex shaders
with fragment shaders, but as soon as we implement geometry and
tessellation shaders that will no longer be the case, so rename
this to (num_)used_outputs.
v2: Use 'used_outputs' instead of ns_outputs, which is more explicit (Eric).
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
We set this for any TMU write, both for spills and for general TMU
writes. It is then used as part of v3d_emit_gl_shader_state later.
v2: add a new flag at v3d_compiler instead of dirtying the flag at
v3dx if there is any spill (change suggested by Eric, added by
Alejandro)
v3: set this for anything that is not a load and do it also in
v3d40_vir_emit_image_load_store (Eric)
Reviewed-by: Eric Anholt <eric@anholt.net>
SFU operations have a latency of 2 cycles, so if their results
are used in the cycle following the SFU instruction, the GPU
stalls for an extra cycle until the result is available.
This adds the number of stalls to the shader-db debug mode, and the
sum of instructions + stalls, to evaluate optimizations to schedule
instructions that avoid generating SFU stalls.
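The accounting is a small check while walking the generated instructions;
a sketch (the dependency helper is hypothetical):

   if (v3d_qpu_instr_is_sfu(&prev->qpu) &&
       reads_result_of(inst, prev)) /* result used on the next cycle */
      stalls++;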
v2: Rename v3d_qpu_generates_sfu_stalls to v3d_qpu_instr_is_sfu (Eric)
Reviewed-by: Eric Anholt <eric@anholt.net>
Among other things, this avoids the need to load 1/-1 constants (so
one less operation).
The removed comment suggests the option of adding support in NIR for
inc/dec. Intel just uses an auxiliary method to get which hw operation
is needed, so no lowering is required, and, being so small, it seems
unreasonable to try to add a general lowering to NIR itself. It is
easier to just adapt the method here (which is what the patch does).
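A sketch of the adapted helper (op names are from the v3d TMU op enums as
I remember them; the index of the data operand varies per intrinsic):

   static uint32_t
   v3d_op_for_atomic_add(nir_src *data)
   {
      if (nir_src_is_const(*data)) {
         int64_t v = nir_src_as_int(*data);
         if (v == 1)
            return V3D_TMU_OP_WRITE_AND_READ_INC;
         if (v == -1)
            return V3D_TMU_OP_WRITE_OR_READ_DEC;
      }
      return V3D_TMU_OP_WRITE_ADD_READ_PREFETCH;
   }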
It is worth noting that we are not getting any change in shader-db
stats because, in the usual shader-db set, those methods are only used
by shaders needing GLSL > 4.2, and in general there aren't too many
GLSL ES 3.1 tests.
As an alternative, we captured the GLES3/GLES31/GLES32 shaders used in
vk-gl-cts, even if that is not representative of real-life shader
usage. With those we get the following:
total instructions in shared programs: 1217022 -> 1217013 (<.01%)
instructions in affected programs: 117 -> 108 (-7.69%)
helped: 6
HURT: 0
helped stats (abs) min: 1 max: 2 x̄: 1.50 x̃: 1
helped stats (rel) min: 3.57% max: 10.00% x̄: 8.09% x̃: 9.09%
95% mean confidence interval for instructions value: -2.07 -0.93
95% mean confidence interval for instructions %-change: -10.54% -5.64%
Instructions are helped.
Note that the number of shaders helped is really low because most of
the vk-gl-cts tests using AtomicInc/Dec/Add are compute shaders.
Although right now there is a branch around with CS support, the usual
practice is to gather stats against master.
Reviewed-by: Eric Anholt <eric@anholt.net>
This implements support for OpenGL logic operations by emitting code to read
from the TLB if needed and blending the fragment output accordingly. It is
similar to VC4's blend lowering pass, but exclusive to logic operations, since
blending is otherwise supported in hardware.
The pass doesn't handle MSAA targets yet.
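For example, for GL_XOR on a UNORM8 render target the lowering
conceptually does the following (GLSL-like pseudocode; the pass builds
this with the NIR builder, and the unpack/pack helpers are stand-ins):

   uvec4 dst = unpack_unorm(tlb_color_read());
   uvec4 src = unpack_unorm(fragment_output);
   fragment_output = pack_unorm(src ^ dst);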
Fixes the following piglit tests:
spec/!opengl 1.0/gl-1.0-logicop/*
spec/!opengl 1.1/gl-1.1-xor
spec/!opengl 1.1/gl-1.1-xor-copypixels
It also fixes text cursor rendering in LibreOffice with the GTK+2 theme,
which is rendered via glamor using the XOR logic operation.
v2: fix checks for allowed variable location and maximum render target (Eric)
Reviewed-by: Eric Anholt <eric@anholt.net>
Until now we have always been emitting our scoreboard locks on the last thread
switch to improve parallelism. We did this by emitting our last thread switch
right before our tlb writes at the very end of the program, where we know that
we are outside control flow.
Unfortunately, this strategy is not valid when we have tlb color reads too, as
these will happen before this point in the program and can happen inside
control flow.
To fix this we always emit a thread switch before the first tlb load and if we
see additional thread switches after that point, we change the strategy to lock
on the first thread switch.
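A sketch of the two pieces (lock_scoreboard_on_first_thrsw is the compiler
flag; the other names are assumed):

   /* Always switch threads before the first TLB load: */
   if (is_tlb_load && !c->emitted_tlb_load) {
      vir_emit_thrsw(c);
      c->emitted_tlb_load = true;
   }

   /* Any thread switch after a TLB load means the scoreboard lock must
    * move to the first thread switch:
    */
   if (c->emitted_tlb_load)
      c->lock_scoreboard_on_first_thrsw = true;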
v2: change the solution so it is expected to work in more scenarios (Eric).
Reviewed-by: Eric Anholt <eric@anholt.net>
We need to scale the size of these arrays to consider up to
V3D_MAX_DRAW_BUFFERS render targets and 4 components per color.
v2: we want to store each color component separately, so scale by 4 too.
Reviewed-by: Eric Anholt <eric@anholt.net>
The ldtlbu version will read an implicit uniform with the TLB read
specifier and should be used for the first read in a sequence
of TLB reads (unless the default configuration is valid, in which
case we can use ldtlb). The ldtlb version is used for any subsequent
TLB read in the sequence.
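A sketch of the selection (helper names are hypothetical):

   if (first_tlb_read_in_sequence && !tlb_default_config_ok(c))
      ldtlbu(c, tlb_read_config_uniform(c)); /* implicit uniform */
   else
      ldtlb(c);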
Reviewed-by: Eric Anholt <eric@anholt.net>
We can use the same register spilling infrastructure for our loads/stores
of indirect access of temp variables, instead of doing an if ladder.
Cuts 50% of instructions and max-temps from 2 KSP shaders in shader-db.
Also causes several other KSP shaders with large bodies and large loop
counts to not be force-unrolled.
The change was originally motivated by NOLTIS slightly modifying register
pressure in piglit temp mat4 array read/write tests, triggering register
allocation failures.
While waiting for the CSD UABI to get reviewed, I keep having to rebase
the CS patch. Just land the compiler side for now to keep it from
diverging.
For now this covers just GLES 3.1 compute shaders, not CL kernels.
Our exec masking introduces lots of redundant flags updates, and even
without that there will be cases where NIR comparisons on the same sources
for different reasons may generate the same comparison instruction before
the selection.
total instructions in shared programs: 6492930 -> 6460934 (-0.49%)
total uniforms in shared programs: 2117460 -> 2115106 (-0.11%)
total spills in shared programs: 4983 -> 4987 (0.08%)
total fills in shared programs: 6408 -> 6416 (0.12%)
The idea was that we could skip uploading the constant-indexed uniform
data and just upload the uniforms that are variably-indexed. However,
since the VS bin and render shaders may have a different set of uniforms
used, this meant that we had to upload the UBO for each of them. The
first case is generally a fairly small impact (usually the uniform array
is the most space, other than a couple of FSes in shader-db), while the
second is a larger impact: 3DMMES2 was uploading 38k/frame of uniforms
instead of 18k.
Given that the optimization is of dubious value, has a big downside, and
is quite a bit of code, just drop it. No change in shader-db. No change
on 3DMMES2 (n=15).
We'd end up with the constant offset in the uniform stream anyway, since
they're bigger than small immediates. Avoids the extra uniforms and adds
in the shader in favor of just adding once on the CPU.
shader-db:
total instructions in shared programs: 6496865 -> 6494851 (-0.03%)
total uniforms in shared programs: 2119511 -> 2117243 (-0.11%)
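The mechanism, as I remember the names in the v3d driver (treat this as a
sketch): the compiler packs the UBO index and the constant offset into the
uniform contents, and the uniform upload code does the add once on the CPU:

   /* Compiler: one uniform carries both the index and the offset. */
   vir_uniform(c, QUNIFORM_UBO_ADDR,
               v3d_unit_data_create(index, const_offset));

   /* Uniform upload: resolve base + offset on the CPU. */
   cl_aligned_reloc(&job->indirect, &uniforms, ubo_bo,
                    v3d_unit_data_get_offset(data));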
I want to reuse this for encoding small constant UBO/SSBO offsets into the
uniform stream to reduce the extra uniform loads and adds for the small
constant offsets.
The idea is that for repeated use of the same uniform, we could avoid
loading it on each consumer. The results look pretty good.
total instructions in shared programs: 6413571 -> 6521464 (1.68%)
total threads in shared programs: 154214 -> 154000 (-0.14%)
total uniforms in shared programs: 2393604 -> 2119629 (-11.45%)
total spills in shared programs: 4960 -> 4984 (0.48%)
total fills in shared programs: 6350 -> 6418 (1.07%)
Once we do scheduling at the NIR level, the register pressure (and thus
also instructions) issues we see here will drop back down.