Add OpenCL kernels implementing the tessellation algorithm on device. This is an
OpenCL C port of the D3D11 reference tessellator, originally written by
Microsoft in C++. There are significant differences compared to the CPU based
reference implementation:
* significant simplifications and clean up. The reference code did a lot of
things in weird ways that would be inefficient on the GPU. I did a *lot* of
work here to get good AGX assembly generated for the tessellation kernels ...
the first attempts were quite bad! Notably, everything is carefully written to
ensure that all private memory access is optimized out in NIR; the resulting
kernels do not use scratch and do not spill on G13.
* prefix sum variants. To implement geom+tess efficiently, we need to first
calculate the count of indices generated by the tessellator, then prefix sum
that, then tessellate using the prefix sum results writing into 1 large index
buffer for a single indirect draw. This isn't too bad, we already have most of
the logic and the guts of the prefix sum kernel is shared with geometry
shaders.
* VDM generation variant. To implement tess alone, it's fastest to generate a
hardware Index List word for each patch, adding an appropriate 32-bit index
bias to the dynamically allocated U16 index buffers. Then from the CPU, we
have the illusion of a single draw to Stream Link with Return to. This
requires packing hardware control words from the tessellator kernel.
Fortunately, we have GenXML available so we just use agx_pack like we would in
the driver.
Along the way, we pick up indirect tess support (this follows on naturally),
which gets rid of the other bit of tessellation-related cheating. Implementing
this requires reworking our internal agx_launch data structures, but that has
the nice side effect of speeding up GS invocations too (by fixing the workgroup
size).
Don't get me wrong. tessellator.cl is the single most unhinged file of my
career, featuring GenXML-based pack macros fed by dynamic memory allocation fed
by the inscrutable tessellation algorithm.
But it works *really* well.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30051>
Ensure the following identities hold to match IEEE-754-2019 and upcoming NIR:
min(-0, +0) = -0
min(+0, -0) = -0
max(-0, +0) = +0
max(+0, -0) = +0
NVK uses this lowering. In a simple compute shader using fmin64 on an SSBO with
signed zero preserve required, testing the effect of this patch, the instruction
count goes from 47->52. Obviously I'm not thrilled by that but I also couldn't
find any obvious way of mitigating the issue. (Maybe NVIDIA has special hardware
support here. By instruction count, lowering all the way to int64 is a loss,
though I don't know how to count cycles on NVIDIA.)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30075>
Ensure the following identities hold to match IEEE-754-2019 and upcoming NIR:
min(-0, +0) = -0
min(+0, -0) = -0
max(-0, +0) = +0
max(+0, -0) = +0
To implement, we specialize a version of flt64_nonnan. The regular flt64 has
extra logic to handle signed zero, so this version is actually simpler. So in
addition to the bug fix, this is an optimization. Compute shaders from
KHR-GL46.gpu_shader_fp64.builtin.max_dvec4 before and after:
before: 136 inst, 122 alu, 122 fscib, 4 ic, 1006 bytes, 39 regs, 28 uniforms
after: 104 inst, 90 alu, 90 fscib, 4 ic, 766 bytes, 39 regs, 28 uniforms
I will happy take a 24% reduction in instruction count as the cost of standards
conformance ^_^
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30075>
SPIR-V strengthened the semantics around signed zero, requiring fmin(-0, +0) =
-0. Since nir_op_fmin is commutative, we must also require fmin(+0, -0) = -0 to
match, although it's unclear if SPIR-V requires that. We must strengthen NIR's
definitions accordingly.
This strengthening is additionally motivated by the existing nir_opt_algebraic
rule like:
(('fmin', a, ('fneg', a)), ('fneg', ('fabs', a))),
With the strengthened new definition, this transform is clearly exact. With the
weaker definition, the transform could change the sign of zero based on
implementation-defined behaviours which ... while, not exactly unsound, is
undesireable semantically.
...
This is probably technically a bug fix, but I'm not convinced it's worth it's
weight in backporting.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30075>
These events were silently truncated in get_counters_for_event.
The integer types in this pass are a bit all over the place, maybe we should
consider using typedefs for clarity or a different solution with type safety.
Fixes: 9e9cabd2fa ("aco/waitcnt: support GFX12 in waitcnt pass")
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30163>
KGSL writes the profiling values asynchronously while we read them
immediately after the IOCTL returns which can result in the struct
not being filled in by the time we read it, this results in AGI not
correctly processing any timestamps from larger submits which take
longer to queue. To fix this, we now busy-wait on until the value
has been written out by KGSL.
Signed-off-by: Mark Collins <mark@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30147>
This fixes a leak that happens when copying a image using blit, and the
image is a compressed one.
In this case a new image view is created that can be re-interpreted to
perform the copy, but was not properly free.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30161>