utrace submits can either have a batch or not.
When there is a batch, the utrace vk_sync is signaled by the utrace
batch (because utrace does a timestamp buffer copy using its own
batch). When there is no batch, the utrace vk_sync should be signaled
by the application batch (no timestamp copy required, utrace can read
the timestamps when the application batch has completed).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: fdea48df5e ("anv: Implement Xe version of anv_queue_exec_locked() and queue_exec_trace()")
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24085>
From Bspec 46959, a programming note applicable to Gfx12+:
"Since HZ_OP has to be sent twice (first time set the clear/resolve
state and 2nd time to clear the state), and HW internally flushes the
depth cache on HZ_OP, there is no need to explicitly send a Depth
Cache flush after Clear or Resolve."
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24027>
anv_vma_alloc() returns a canonical address, but explicit_address is a
regular address. This mismatch can potentially cause issues.
So here making bo->offset as always canonical address by converting it
in the explicit case and fixing the only caller that was caling
anv_bo_vma_alloc_or_close() with a canonical address.
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23977>
There is no mention in spec about subtract one of the number of
threads, also Iris and blorp code don't subtract.
Alchemist PRMs: Volume 2a: Command Reference: Instructions: CFE_STATE: Maximum Number of Threads:
Normally set to the maximum number of threads: (# EUs) * (# threads/EU)
Cc: mesa-stable
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23973>
On DG2 I ran into a case where the surface state was not being decoded
with INTEL_DEBUG=bat. This is because the surface states are not part
of a state pool there anymore. Instead BO are allocate manually and
placed in vma heap.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 96c33fb027 ("anv: enable direct descriptors on platforms with extended bindless offset")
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23891>
The Vulkan specification indicates that if memory types have
properties which are a strict subset of another type's, then they
should appear before that memory type. Otherwise the specification
does not require a specific ordering of memory types.
But, it appears that Aztec Ruins and the Vulkan CTS make an assumption
that the first host-accessible memory type is host-coherent and select
it when they expect data written by the CPU to become visible without
calling vkFlushMappedMemoryRanges(), even though flushing is required
by the spec, which leads to misrendering and hangs on MTL platforms.
We found that other drivers also put a host-coherent, but not cached
memory type as the first host-accessible memory type, so let's do the
same in order to match the expectations of such broken applications.
Host-coherent uncached memory types are currently implemented with a
WC CPU map on non-LLC platforms, so there shouldn't be a huge
performance penalty from this: If an application intends to do heavy
R/W CPU access on a memory range it's expected to loop over the
available memory types and select one marked as host-cached -- If an
application fails to do that and simply selects the first available
type it seems more robust to stay on the safe side and give them a
host-coherent type rather than a cached one.
Rework:
* Jordan: Add initial explanation to body of commmit message.
* Curro: Add additional comments to commit message.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22878>
Although the following is based on this observations for OpenGL, we
probably need this for Vulkan as well.
KHR-GL46.texture_buffer.texture_buffer_operations_ssbo_writes writes
to an SSBO in a compute program, then issues a memory-barrier, which
causes us to add a DC-flush. Then a second compute program samples
from the SSBO written by the first compute program.
Although we expected the DC-flush to make the writes available to the
second compute program, on MTL this wasn't the case. Adding the
"Untyped Data-Port Cache Flush" fixes this.
The PRM indicates that compute programs must set "Untyped Data-Port
Cache Flush" to flush some LSC writes when flushing HDC. Although we
are setting DC-flush, and not HDC-flush, it does appear that the
following reference might also apply to DC-flush.
In the Intel(R) Arc(tm) A-Series Graphics and Intel Data Center GPU
Flex Series Open-Source Programmer's Reference Manual, Vol 2a: Command
Reference: Instructions, PIPE_CONTROL, HDC Pipeline Flush (DWord 0,
Bit 9), there is a programming note:
> When the "Pipeline Select" mode is set to "GPGPU", the LSC Untyped
> L1 cache flush is controlled by "Untyped Data-Port Cache Flush" bit
> in the PIPE_CONTROL command.
Ref: a8108f1d44 ("anv: Add missing untyped data port flush on PIPELINE_SELECT")
Ref: bd8e8d204d ("iris: Add missing untyped data port flush on PIPELINE_SELECT")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23176>
In the Intel(R) Arc(tm) A-Series Graphics and Intel Data Center GPU
Flex Series Open-Source Programmer's Reference Manual, Vol 2a: Command
Reference: Instructions, PIPE_CONTROL, HDC Pipeline Flush (DWord 0,
Bit 9), there is a programming note:
> When the "Pipeline Select" mode is set to "GPGPU", the LSC Untyped
> L1 cache flush is controlled by "Untyped Data-Port Cache Flush" bit
> in the PIPE_CONTROL command.
Ref: a8108f1d44 ("anv: Add missing untyped data port flush on PIPELINE_SELECT")
Ref: bd8e8d204d ("iris: Add missing untyped data port flush on PIPELINE_SELECT")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23176>
On Alchemist, the FF_MODE2 documentation says that we must set the
FF_MODE2 timer values for GS and HS to 224. The hardware performance
tuning guide also recommends setting the TDS timer to 4.
On Tigerlake, i915 applies workarounds to set the GS timer to 224
(failing to do so can cause HS/DS unit hangs), and the TDS timer to 4
(for performance). It doesn't currently apply a HS timer there, and
I'm not sure if it's strictly necessary, but given that Alchemist
needed it, and the other two settings matched, let's assume that it
ought to match as well.
Unfortunately, there has been a bug in the i915 workarounds
infrastructure for non-masked context registers where writing one
field of the register zeroes out all the others. So, I believe the
Tigerlake TDS timer value of 4 isn't being applied correctly there,
though the register is also not readable on that platform which
makes it hard to verify. So, this may also speed up tessellation.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9233
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23839>
This enables L3 partial write merging for a number of cases that seem
to be getting accidentally disabled by the kernel, which was causing a
serious performance bottleneck on DG2 and MTL platforms. The
"Compressible Partial Write Merge Enable", "Coherent Partial Write
Merge Enable" and "Cross-Tile Partial Write Merge Enable" bits in
L3SQCREG5 were expected to be enabled by default (and confusingly,
they even read off as enabled if you ran 'intel_reg read 0xb158' on an
idle system), but they are getting clobbered during 3D context
initialization by an i915 workaround.
Enabling L3 partial write merging of compressible surfaces in
particular seems to increase rendering fillrate by over 3x in some
cases (e.g. the
"VulkanFillRate/FillRateGPU/resolution:1[0-3]/format:*/blend:0"
fillrate-bound microbenchmarks). Significant improvements can also be
reproduced in most real-world workloads we've tested so far,
e.g. Counter Strike GO improves by ~11%, Shadow Of the Tomb Raider
improves by ~5.5%, and AztecRuins-VK improves by ~6.5% on DG2-512 --
Thanks a lot to Caleb Callaway for these figures. No regressions have
been observed so far.
Even though this patch might strike as surprisingly simple for such a
large payoff, it's the result of Felix DeGrood and I trying to
root-cause the rendering performance gap of DG2 on Linux vs Windows on
and off during the last year, and some of the OA statistics captured
by Felix early this month were greatly helpful for me to connect the
last few dots, so Felix deserves a big chunk of the credit for this
work.
Cc: mesa-stable
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23783>
The Vulkan CTS started generating the list of valid versions the driver
can report as conformant against based on the active branches, and the
1.3.0 branch we were reporting up to now is no longer valid.
Fixes dEQP-VK.api.driver_properties.conformance_version
Reviewed-by: Mark Janes <markjanes@swizzler.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23784>
In the following sequence :
- write buffer B with a shader
- barrier on buffer from shader-write to transfer
- vkCmdCopyQueryPoolResults to buffer B
The barrier should take care of ordering things between the shader
writes and vkCmdCopyQueryPoolResults.
The problem is that vkCmdCopyQueryPoolResults runs on the command
streamer and that is not coherent or synchronized in the same way as
shaders.
This change marks the barrier has potentially containing pending
buffer writes for queries so that we can insert the necessary flush
for vkCmdCopyQueryPoolResults later.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9013
Cc: mesa-stable
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23675>