Add some comments explaining what's going on in a more natural flow in
order to solve the actual bug.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Fixes: 2d914ebe81 ("pan/midgard: Fix memory corruption in register spilling")
This allows a write to proceed to an uninitialized part of a buffer
even when the GPU is using the previously-initialized portions.
Such a situation can be triggered with the following API usage example:
glBufferSubData(..., offset, size, data1);
glDrawArrays(...);
// append new vertex data
glBufferSubData(..., offset+size, size, data2);
glDrawArrays(...);
Same is done for freedreno, nouveau and radeon.
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
This is the layout used in the GL API, and maps directly to PIPE
formats with no endianness trickery. As with the LA change, this
fixes big-endian fetching from texbos. Also cleans up some endian
shenanigans in shader images.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Now that Mesa is also using an array format for LA, nothing was using
these. (And, clearly, no HW driver had exposed them).
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The array format is what the GL API wants (fixing texbos on
big-endian), and matches directly to gallium's corresponding array
format. The only driver exposing A8L8 was radeon/r200 in big-endian,
where the HW's underlying format was trying to read as array and we
needed to flip things around to make our packed format come out right
(note that while the radeon format tables had both AL and LA,
ChooseTextureFormat would only pick one of them based on endianness).
v2: Don't make r200/radeon use endian swaps.
v3: Rebase on dropping the r200 _be/_le format table removal patch
v4: reword commit message to explain why we can drop both formats
from radeon.
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
The array format is what the GL API wants (and we made a mistake in
the format returned for texbos on big-endian!), and it's exactly what
the gallium-side PIPE_FORMAT_L16A16 is. The only downside is that
dri_util tries to fall back to sampling RG16 using LA16, which doesn't
have a match for big-endian any more. No HW drivers supported A16L16
anyway.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The first arg to OUT_BATCH_RELOC is ignored, we actually wanted these
in the third arg. They're always 0 so far, so it didn't matter.
v2: Reword commit message that I don't end up using the tile bits, but
keep the commit as a cleanup anyway.
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
No matter what, we deref the texFormat from the table, except for a
mistake in cpp=4 where we pulled a 0 out of the table either way.
v2: Rebase on dropping r200 table deduplication patch.
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
PP stack size should be set to maximum PP stack size, not to stack size of
last shader.
Fixes: 27e7603c34 ("lima: fix ppir spill stack allocation")
Tested-by: Icenowy Zheng <icenowy@aosc.io>
Reviewed-by: Erico Nunes <nunes.erico@gmail.com>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
Similiar to iadd, we can fold an added constant value from an imad24_ir3
into the load_uniform's constant offset. This avoids some cases where
the addition of imad24_ir3 could otherwise be a regression in instr
count.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>
We can't encode immed sources for cat3 (mad) instructions, but we can
use const in first or third src. We handled this case already, but we
weren't considering that we could lower immed to const.
For manhattan:
total instructions in shared programs: 35202 -> 34718 (-1.37%)
instructions in affected programs: 14931 -> 14447 (-3.24%)
helped: 90
HURT: 0
total full in shared programs: 2451 -> 2359 (-3.75%)
full in affected programs: 653 -> 561 (-14.09%)
helped: 69
HURT: 2
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Lower amul to either imul or imul24, depending on whether 24b is enough
bits to calculate an offset within the thing being dereferenced.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Used for address/offset calculation (ie. array derefs), where we can
potentially use less than 32b for the multiply of array idx by element
size. For backends that support `imul24`, this gives a lowering pass
an easy way to find multiplies that potentially can be converted to
`imul24`.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>
Some hardware can do 24b multiply in a single instruction, but not 32b.
However in most cases 24b is sufficient for address/offset calculation.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>
ir3 compiler has a signed integer multiply-add instruction (MAD_S24)
that is used for different offset calculations in the backend.
Since we intend to move some of these calculations to NIR, we need
a new ALU op that can directly represent it.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>
Otherwise, if the base type is (for example) uint32, we would
incorrectly think that PoT optimizations could not apply.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Jason Ekstsrand <jason@jleksrand.net>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>
The pass should run once at the end of shader compilation, for a4xx
onwards. It iterates texture sampling instructions and mark those
eligibile for pre-dispatch by changing the tex op from 'tex' to
'tex_prefetch'. An instruction is eligibile if:
* The coordinate is a vector where all its components come from a
shader input.
* The order of the components match exactly that of the input (no
swizzles).
* The instruction is in the 'main' function, and in the outer
most-block.
The first two restrictions were arrived to empirically, so more
testing could tighten or loosen it.
The 3rd restriction is there to allow moving the instructions
eligible for pre-dispatch to the beginning of the shader, so
that we don't block the registers holding the result for too
long.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
It seems that pre-fs texture fetch only works if ij_pix ends up in r0.x.
I've tried unknown zero bits, to no avail, and blob also seems to force
r0.x when this feature is used.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Useful to see in disassembly listing texture fetches that were moved to
pre-dispatch.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
If the only use of varyings is a pre-shader texture-fetch, we still need
to issue a bary.f with the end-input flag, otherwise we'll block further
VS invocations, as the hw will think varying storage is still busy.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
It is possible that the result of a pre-fs texture fetch is an output
(or partially an output) of the FS. Sine the meta:tex_prefetch
instructions are dropped before the assembler, we need to account for
this when we fixup the register footprint.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Add a placeholder instruction to track texture fetches made prior to FS
shader dispatch. These, like meta:input instructions are scheduled
before any real instructions, so that RA realizes their result values
are live before the first real instruction. And to give legalize a way
to track usage of fetched sample requiring (sy) sync flags.
There is some related special handling for varying texcoord inputs used
for pre-fs-fetch, so that they are not DCE'd and remain in linkage
between FS and previous stage. Note that we could almost avoid this
special handling by giving meta:tex_prefetch real src arguments, except
that in the FS stage, inputs are actual bary.f/ldlv instructions.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
When we enable pre-dispatch texture fetch, we could have a scenario
where the barycentric i/j coord sysval is not used in the shader, but
only used for the varying fetch for the pre-dispatch texture fetch.
In this case we need to take care not to DCE this sysval.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Will be needed for special handling of SYSTEM_VALUE_BARYCENTRIC_PIXEL
(ij_pix) when pre-fs texture fetch is enabled.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Not sure I remember how long this has been unused for. But it's unused
now.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
This is like nir_texop_tex, but signals that the sampling coordinates
are immutable during the shader stage, in a way that allows the HW
that supports pre-dispatching sampling operations to pre-fetch
the result prior to scheduling the shader stage.
This is introduced to support the feature in Freedreno. Adreno HW
from a4xx supports it.
A NIR pass introduced later in this series will detect sampling
operations that are eligible for pre-dispatch, and replace
nir_texop_tex by this new op, to tell the backend to enable
pre-fetch.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>