It was determined that a significant part of queue submission
overhead came from constantly allocating and freeing CSes inside
`tu_autotune_on_submit`. This is addressed by retaining instances of
`tu_submission_data` together with their corresponding CSes, which
eliminates that overhead entirely, since resetting a CS is a very
cheap operation compared to allocating it or even freeing it wholly.
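A minimal sketch of the recycling pattern described above, under a simplified model. `struct cs`, `struct submission_data`, `pool_get()` and `pool_put()` are hypothetical stand-ins, not turnip's actual types or functions. Finished entries go onto a free list with their CS buffer intact. Reusing an entry only rewinds its CS instead of reallocating it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a command stream: a dword buffer. */
struct cs {
   unsigned *buf;
   size_t capacity;
   size_t count;
};

/* Hypothetical stand-in for tu_submission_data: owns one CS. */
struct submission_data {
   struct cs cs;
   struct submission_data *next; /* free-list link */
};

struct submission_pool {
   struct submission_data *free_list;
   unsigned allocations; /* fresh allocations made so far */
};

/* Resetting only rewinds the write position; the buffer is kept. */
static void cs_reset(struct cs *cs)
{
   cs->count = 0;
}

/* Reuse a retained entry if one exists, otherwise allocate a new one. */
static struct submission_data *pool_get(struct submission_pool *pool)
{
   struct submission_data *data = pool->free_list;
   if (data) {
      pool->free_list = data->next;
      cs_reset(&data->cs);
      return data;
   }

   data = calloc(1, sizeof(*data));
   if (!data)
      return NULL;
   data->cs.capacity = 64;
   data->cs.buf = malloc(data->cs.capacity * sizeof(*data->cs.buf));
   if (!data->cs.buf) {
      free(data);
      return NULL;
   }
   pool->allocations++;
   return data;
}

/* Instead of freeing, keep the entry (and its CS) for the next submit. */
static void pool_put(struct submission_pool *pool, struct submission_data *data)
{
   assert(data != NULL);
   data->next = pool->free_list;
   pool->free_list = data;
}

/* Free everything only when the pool itself is torn down. */
static void pool_finish(struct submission_pool *pool)
{
   while (pool->free_list) {
      struct submission_data *data = pool->free_list;
      pool->free_list = data->next;
      free(data->cs.buf);
      free(data);
   }
}
```

In this model, many back-to-back submissions perform a single allocation. Each reuse costs a list pop and a counter reset.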
Signed-off-by: Mark Collins <mark@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18461>
When the workgroup is one-dimensional, simply use a vec3
filled with zeroes and the local invocation index.
This is better than lower_id_to_index + constant folding,
because this way we don't leave behind extra ALU instrs.
Note that this is relevant to mesh shaders on RDNA2 because
it enables us to better detect cross-invocation output
access.
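A hypothetical C sketch of two ways to derive the local invocation ID from the flat local invocation index. The helper names and `struct uvec3` are illustrative, not NIR code. The general path needs a modulo and divisions per component. When only one workgroup dimension is larger than 1, the ID is just the index in that component with zeroes elsewhere, so no arithmetic is generated at all.

```c
#include <assert.h>
#include <stdint.h>

struct uvec3 {
   uint32_t x, y, z;
};

/* General case: a modulo and divisions per component. */
static struct uvec3 id_from_index_general(uint32_t index, uint32_t sx, uint32_t sy)
{
   struct uvec3 id = {
      .x = index % sx,
      .y = (index / sx) % sy,
      .z = index / (sx * sy),
   };
   return id;
}

/* If only one dimension is larger than 1, the ID is the index in that
 * component and zero elsewhere, with no arithmetic needed. */
static struct uvec3 id_from_index(uint32_t index, uint32_t sx, uint32_t sy, uint32_t sz)
{
   assert(index < sx * sy * sz);
   if (sy == 1 && sz == 1)
      return (struct uvec3){ index, 0, 0 };
   if (sx == 1 && sz == 1)
      return (struct uvec3){ 0, index, 0 };
   if (sx == 1 && sy == 1)
      return (struct uvec3){ 0, 0, index };
   return id_from_index_general(index, sx, sy);
}
```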
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18464>
fence_get_fd is required for any kind of surface flush or native fence
sync export on Android. The typical scenarios are:
- eglDupNativeFenceFDANDROID
- eglSwapBuffers*
- eglMakeCurrent
- glFlush/glFinish for front buffer rendering
This change updates zink_flush to handle PIPE_FLUSH_FENCE_FD via a
forced submit to signal an external sync_fd semaphore. fence_get_fd is
implemented to export the sync file from that semaphore.
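A simplified model of the flush decision, using hypothetical names. `FLUSH_FENCE_FD` stands in for PIPE_FLUSH_FENCE_FD, and `struct context`, `flush()` and `submit()` are not zink's real code. It shows one point: a fence fd request forces a submit even when no work is pending, so there is a semaphore signal to export.

```c
#include <stdbool.h>

/* Hypothetical flag standing in for PIPE_FLUSH_FENCE_FD. */
#define FLUSH_FENCE_FD (1u << 0)

/* Hypothetical, heavily simplified context state. */
struct context {
   unsigned pending_cmds;  /* work recorded since the last submit */
   unsigned submits;       /* queue submissions made so far */
   bool sync_fd_signaled;  /* did the last submit signal the sync_fd semaphore? */
};

static void submit(struct context *ctx, bool signal_sync_fd)
{
   ctx->submits++;
   ctx->pending_cmds = 0;
   ctx->sync_fd_signaled = signal_sync_fd;
}

static void flush(struct context *ctx, unsigned flags)
{
   bool want_fd = (flags & FLUSH_FENCE_FD) != 0;

   /* An empty batch normally needs no submit, but a fence fd request
    * forces one so that a semaphore signal exists to export. */
   if (ctx->pending_cmds > 0 || want_fd)
      submit(ctx, want_fd);
}
```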
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-By: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18453>
For in-fence handling, dri2 performs the following sequence:
1. create_fence_fd: import external fence fd
2. fence_server_sync: import the pipe fence into the driver ctx
3. fence_reference: deref the created pipe fence
Before this change, zink pushed the wrapped external semaphore to the
wait semaphores of the next batch, but the subsequent fence_reference
destroyed the imported semaphore immediately. Instead of extending the
lifetime of the pipe fence throughout the batch state, we can simply
transfer ownership of the semaphore to the batch and destroy it upon
batch reset.
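A minimal sketch of the ownership transfer, using hypothetical types. Semaphores are modeled as integer handles, and `fence_server_sync()`, `fence_destroy()` and `batch_reset()` only mirror the roles of the real callbacks. The handle is moved into the batch and cleared in the fence. After that, dropping the fence no longer destroys the semaphore; the batch does so on reset.

```c
#include <stddef.h>

/* Hypothetical: a semaphore is an integer handle, 0 means "none". */
typedef unsigned semaphore;

static unsigned destroyed_count; /* semaphores destroyed so far */

static void semaphore_destroy(semaphore sem)
{
   if (sem)
      destroyed_count++;
}

struct fence {
   semaphore sem; /* imported external semaphore */
};

#define MAX_WAITS 8
struct batch {
   semaphore wait_sems[MAX_WAITS]; /* owned by the batch */
   size_t num_waits;
};

/* Move the semaphore into the batch's wait list. */
static int fence_server_sync(struct batch *batch, struct fence *fence)
{
   if (batch->num_waits == MAX_WAITS)
      return -1;
   batch->wait_sems[batch->num_waits++] = fence->sem;
   fence->sem = 0; /* the fence no longer owns it */
   return 0;
}

/* Dropping the last fence reference destroys only what it still owns. */
static void fence_destroy(struct fence *fence)
{
   semaphore_destroy(fence->sem);
   fence->sem = 0;
}

/* Batch reset destroys the semaphores the batch took over. */
static void batch_reset(struct batch *batch)
{
   for (size_t i = 0; i < batch->num_waits; i++)
      semaphore_destroy(batch->wait_sems[i]);
   batch->num_waits = 0;
}
```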
Fixes: 32597e116d ("zink: implement GL semaphores")
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-By: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18453>
This change fixes the following:
1. Dup the fence fd. External semaphore import takes ownership of the
fd, so without a dup the non-Vulkan code keeps touching an fd it no
longer owns, leading to undefined behavior. This can be hit on
implementations that defer the processing of the passed fd.
2. Use VK_SEMAPHORE_IMPORT_TEMPORARY_BIT for importing, since it is
required for the SYNC_FD handle type because of its copy transference.
A temporary import is also fine for opaque fds in this path.
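A POSIX sketch of the dup in item 1. It assumes a hypothetical `import_semaphore_fd()` that takes ownership of the fd it is given; here it simply closes the fd, to make the transfer observable. Importing a duplicate keeps the caller's fd valid regardless of when the driver consumes its copy. The sketch does not model VK_SEMAPHORE_IMPORT_TEMPORARY_BIT.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <unistd.h>

/* Hypothetical import call that always succeeds in this sketch and
 * owns the fd afterwards; it closes it at once to show the transfer. */
static int import_semaphore_fd(int fd)
{
   close(fd);
   return 0;
}

static int import_fence_fd(int fence_fd)
{
   assert(fence_fd >= 0);

   /* Give the importer its own copy; fence_fd stays owned by the caller. */
   int fd = dup(fence_fd);
   if (fd < 0)
      return -1;

   if (import_semaphore_fd(fd) != 0) {
      /* A failed import leaves ownership with us, so close the copy. */
      close(fd);
      return -1;
   }
   return 0;
}
```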
Fixes: 32597e116d ("zink: implement GL semaphores")
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-By: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18453>
Implement natively by always returning invalid feedback. This is a legal
(but useless) implementation according to the spec.
In the future, I want to return the real feedback values from the host,
but that requires changes to the venus protocol. The protocol does not
know that the VkPipelineCreationFeedback structs in the
VkGraphicsPipelineCreateInfo pNext are output parameters. Before
VK_EXT_pipeline_creation_feedback, the pNext chain was input-only.
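A sketch of the "always invalid" answer. It uses simplified stand-ins for VkPipelineCreationFeedback and VkPipelineCreationFeedbackCreateInfo so it compiles without Vulkan headers. The struct and function names are hypothetical; only the 0x1 value of VK_PIPELINE_CREATION_FEEDBACK_VALID_BIT comes from the spec. Leaving that bit clear, and setting no other bits, tells the application that no feedback is available. This applies to the whole pipeline and to each stage.

```c
#include <stdint.h>

/* Stand-in for VK_PIPELINE_CREATION_FEEDBACK_VALID_BIT (0x1 in Vulkan). */
#define FEEDBACK_VALID_BIT 0x1u

/* Simplified stand-in for VkPipelineCreationFeedback. */
struct creation_feedback {
   uint32_t flags;
   uint64_t duration;
};

/* Simplified stand-in for VkPipelineCreationFeedbackCreateInfo. */
struct creation_feedback_info {
   struct creation_feedback *pipeline;
   uint32_t stage_count;
   struct creation_feedback *stages;
};

/* With VALID_BIT clear, no other flag may be set and the remaining
 * fields are undefined to the app; zeroing duration is just tidy. */
static void fill_invalid_feedback(struct creation_feedback_info *info)
{
   info->pipeline->flags = 0;
   info->pipeline->duration = 0;
   for (uint32_t i = 0; i < info->stage_count; i++) {
      info->stages[i].flags = 0;
      info->stages[i].duration = 0;
   }
}
```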
Tested with `dEQP-VK.pipeline.*.creation_feedback.*`.
The tests in vulkan-cts-1.3.3.0 are buggy. I submitted a fix to dEQP
upstream; see below.
Results with the bug:
Passed: 0/30 ( 0.0%)
Failed: 12/30 (40.0%)
Not supported: 18/30 (60.0%)
Warnings: 0/30 ( 0.0%)
Results with bugfix:
Passed: 12/30 (40.0%)
Failed: 0/30 ( 0.0%)
Not supported: 18/30 (60.0%)
Warnings: 0/30 ( 0.0%)
See: https://gerrit.khronos.org/c/vk-gl-cts/+/10086
See: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/909
Reviewed-by: Yiwei Zhang <zzyiwei@chromium.org>
Signed-off-by: Chad Versace <chadversary@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18035>
This ensures that types which consume more than one slot are tagged
correctly, so that the next stage's inputs are also assigned properly.
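A hypothetical illustration of the slot accounting involved; the struct and helpers are not the driver's code. A location slot holds 128 bits, so a dvec3 (192 bits) spans two slots and a `dvec3[2]` spans four. Marking only the first slot of such a variable is the kind of mismatch that can throw off the next stage's input assignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified description of a varying. */
struct varying_type {
   unsigned components; /* 1..4 */
   unsigned bit_size;   /* 32 or 64 */
   unsigned array_len;  /* 1 for non-arrays */
};

/* A slot holds 128 bits; dvec3/dvec4 spill into a second slot. */
static unsigned slots_per_element(const struct varying_type *t)
{
   unsigned bits = t->components * t->bit_size;
   return (bits + 127) / 128;
}

/* Mark every slot the variable occupies, not just the first one. */
static uint64_t mark_slots(uint64_t used, unsigned location, const struct varying_type *t)
{
   unsigned n = slots_per_element(t) * t->array_len;
   assert(location + n <= 64);
   for (unsigned i = 0; i < n; i++)
      used |= UINT64_C(1) << (location + i);
   return used;
}
```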
fixes:
spec@arb_enhanced_layouts@execution@component-layout@vs-fs-array-dvec3
cc: mesa-stable
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18444>
When loading a TCS or GS input, we generate some code to read the URB
handle for a particular input control point (ICP handle), which often
involves indirect addressing due to a non-constant vertex.
For example:
mov(8) vgrf148+0.0:UW, 76543210V
shl(8) vgrf149:UD, vgrf148+0.0:UW, 2u
shl(8) vgrf150:UD, vgrf145:UD, 5u
add(8) vgrf151:UD, vgrf150:UD, vgrf149:UD
mov_indirect(8) vgrf147:UD, g2:UD, vgrf151:UD, 96u
Unfortunately, the first load with 76543210V is considered a partial
write because the 8 channels of 16-bit UW data don't fill an entire
register, and we can't allocate VGRFs at sub-register granularity.
This causes none of the above math to be CSE'd, even though the first
two instructions are common to *all* input loads, and the rest may be
reused sometimes as well.
To work around this, we stop emitting 76543210V to a temporary, and
instead use nir_system_values[SYSTEM_VALUE_SUBGROUP_INVOCATION], which
already contains this value, and is unconditionally set up for us.
With all input loads using the same register for the sequence, our
CSE pass is able to eliminate the rest of the common math.
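A scalar C model of the address math in the example above. It rests on assumptions not stated in the original: vgrf145 holds each channel's vertex index, and the ICP handles sit one 32-byte register per vertex, one dword per channel. The channel term is the same for every input load. That is why reading it from the SUBGROUP_INVOCATION system value lets CSE share it.

```c
#include <stdint.h>

#define SIMD_WIDTH 8 /* mov_indirect(8) */

/* Hypothetical model: the byte offsets that the shl/shl/add sequence
 * feeds to mov_indirect, one per channel. */
static void icp_offsets(const uint32_t vertex[SIMD_WIDTH], uint32_t offset[SIMD_WIDTH])
{
   for (uint32_t chan = 0; chan < SIMD_WIDTH; chan++) {
      /* chan is what 76543210V (or SUBGROUP_INVOCATION) provides. */
      uint32_t chan_off = chan << 2;         /* dword within a register */
      uint32_t vert_off = vertex[chan] << 5; /* one 32-byte register per vertex */
      offset[chan] = vert_off + chan_off;
   }
}
```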
shader-db results on Tigerlake:
total instructions in shared programs: 20748243 -> 20744844 (-0.02%)
instructions in affected programs: 73410 -> 70011 (-4.63%)
helped: 242 / HURT: 21
helped stats (abs) min: 1 max: 37 x̄: 14.17 x̃: 15
helped stats (rel) min: 0.17% max: 19.58% x̄: 6.13% x̃: 6.32%
HURT stats (abs) min: 1 max: 4 x̄: 1.38 x̃: 1
HURT stats (rel) min: 0.18% max: 1.31% x̄: 0.58% x̃: 0.58%
95% mean confidence interval for instructions value: -13.73 -12.12
95% mean confidence interval for instructions %-change: -6.00% -5.19%
Instructions are helped.
total cycles in shared programs: 785828951 -> 785788480 (<.01%)
cycles in affected programs: 597593 -> 557122 (-6.77%)
helped: 227 / HURT: 13
helped stats (abs) min: 6 max: 624 x̄: 182.19 x̃: 185
helped stats (rel) min: 0.24% max: 18.22% x̄: 7.85% x̃: 7.80%
HURT stats (abs) min: 2 max: 153 x̄: 68.08 x̃: 36
HURT stats (rel) min: 0.03% max: 7.79% x̄: 2.97% x̃: 1.25%
95% mean confidence interval for cycles value: -182.55 -154.71
95% mean confidence interval for cycles %-change: -7.84% -6.69%
Cycles are helped.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18455>
The change didn't make any sense. `s` will always be
`NV50_SHADER_STAGE_COMPUTE`, because it's used to loop over all shader
stages. And the TSC cache on the compute side is already flushed in
`nv50_compute_validate_samplers`.
Fixes spurious `CACHE_ERROR` dmesg messages.
Fixes: ba6ba8c990 ("nv50: adapt texture and constbuf paths for compute shaders")
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: M Henning <drawoc@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18382>
Starting with clang 16, __has_trivial_destructor is deprecated.
Use the replacement __is_trivially_destructible if it
is available.
Fixes new warnings with clang 16 such as:
../src/compiler/glsl/list.h:58:4: warning: builtin __has_trivial_destructor is deprecated; use __is_trivially_destructible instead [-Wdeprecated-builtins]
../src/util/ralloc.h:551:4: note: expanded from macro 'DECLARE_RZALLOC_CXX_OPERATORS'
DECLARE_ALLOC_CXX_OPERATORS_TEMPLATE(type, rzalloc_size)
^
../src/util/ralloc.h:542:12: note: expanded from macro 'DECLARE_ALLOC_CXX_OPERATORS_TEMPLATE'
if (!HAS_TRIVIAL_DESTRUCTOR(TYPE)) \
^
../src/util/macros.h:233:44: note: expanded from macro 'HAS_TRIVIAL_DESTRUCTOR'
Reviewed-by: Eric Engestrom <eric@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18423>
The kernel driver accepts only a limited range of priority values;
submitting any priority value outside these bounds results in
`-EINVAL`. To avoid this, the priority value is now clamped to the
range that the kernel supports.
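A minimal sketch of the clamp. It assumes hypothetical bounds passed in by the caller; the actual minimum and maximum come from the kernel driver and are not given here.

```c
/* min_prio and max_prio are hypothetical parameters; the real bounds
 * come from the kernel driver. */
static int clamp_priority(int prio, int min_prio, int max_prio)
{
   if (prio < min_prio)
      return min_prio;
   if (prio > max_prio)
      return max_prio;
   return prio;
}
```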
Fixes: 0c6fbfca0c
Signed-off-by: Mark Collins <mark@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18389>
The current code doesn't handle this; however, it is easy to make it
work by moving the negate to the presubtract source. This is a minor
win in shader-db, mostly with Unigine shaders.
Shader-db RV530:
total instructions in shared programs: 136382 -> 136236 (-0.11%)
instructions in affected programs: 9911 -> 9765 (-1.47%)
total temps in shared programs: 18939 -> 18942 (0.02%)
temps in affected programs: 37 -> 40 (8.11%)
Reviewed-by: Filip Gawin <filip@gawin.net>
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18289>
eglTerminate() must be serialized against all other EGL calls, but in
most cases other EGL calls do not need to be serialized against each
other. This fits a rwlock rather well.
One would be tempted to simply replace the existing BDL with a rwlock,
but several portability and debuggability limitations of the rwlock
implementation prevent that, as described in the TerminateLock comment
block.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Acked-by: Eric Engestrom <eric@igalia.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18050>