The skip_follow_statuses variable, used to check if we need to stay
monitoring the pipeline instead of jumping to the target job traces, is
based on COMPLETED_STATUSES set. But, in Python, we do shallow copies by
default, and changes on skip_follow_statuses reflected on
COMPLETED_STATUSES, which was making manual dependencies stop playing
when --force-manual was not given.
Fixes: 84d401aebf0832741716f947dd7e2e9aac1221ac
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30526>
Before the patch, intel_device_info_get_max_preferred_slm_size()
returns values in kilobytes, but then
intel_device_info_get_max_slm_size() is multiplying it by 1024.
As a result, LNL is reporting maxComputeSharedMemorySize to be
134217728, which is 128mb.
Fix this by making intel_device_info_get_max_slm_size() not multiply
it by 1024.
This should fix at least the following dEQP tests:
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.1
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.128
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.16
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.2
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.4
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.64
Some tests were failing with:
deqp-vk: ../../src/intel/common/intel_compute_slm.c:24: slm_encode_lookup: Assertion `kbytes <= table[table_len - 1].size_in_kb' failed.
while other tests were triggering the OOM.
v2:
- Make everybody return sizes in bytes (José).
v3:
- Rename variable to bytes (José, Jordan).
Fixes: fd368f5521 ("anv: Set maxComputeSharedMemorySize value for Xe2 platforms")
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30541>
This reverts commit d6bb4ddc63.
Fixes: d6bb4ddc63 ("d3d12: Video Encode - Remove PIPE_VIDEO_PROFILE_MPEG4_AVC_BASELINE as not supported")
PIPE_VIDEO_PROFILE_MPEG4_AVC_BASELINE is necessary for some scenarios like the example below
described in https://github.com/microsoft/WSL/issues/11838
gst-launch-1.0 -v videotestsrc num-buffers=250 !
video/x-raw,width=1920,height=1200 !
vaapipostproc !
vaapih264enc !
filesink location=~/wsl_test.h264
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30548>
The PRMs suggest that certain classes of auxiliary surface operations
will automatically synchronize when performed back-to-back:
Any transition from any value in {Clear, Render, Resolve} to a
different value in {Clear, Render, Resolve} requires end of pipe
synchronization.
Make use of this functionality by batching CCS and MCS flushes when
compatible auxiliary surface operations are performed within a command
buffer.
Ref: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11325
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29922>
According to the HSD, this is an alternative option for Wa_14016712196.
Taking this option allows us to combine this workaround with a couple
other depth workarounds. Make sure to execute these workarounds before
the workaround for the depth register mode, so that the stalling flush
is not impacted.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29922>
This flush was introduced with the following commits:
8949d27bb8 ("anv: implement gen9 post sync pipe control workaround")
bcb611361b ("anv: implement gen12 post sync pipe control workaround")
The flush was unsued with the following commit:
e79e1ca304 ("intel: Drop Tigerlake revision 0 workarounds")
This prevents some extra pipecontrols caused by a following patch.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29922>
When filling out ir3_info, any multi-component stp or ldp instruction
should indicate possible straddle across dword boundaries. This indirectly
prevents setting the PERWAVEMEMLAYOUT flag on the SP_VS_PVT_MEM_SIZE
register, enabling proper functioning of three-component 8-bit accesses
with natural alignment.
Signed-off-by: Zan Dobersek <zdobersek@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29875>
a750 improves dp4acc to have support for all dot product variants. The
main difference with dp4acc of previous generations is that the signedness
and packed instruction fields have to be instead interpreted as signedness
of either side of the dot product.
Signed-off-by: Zan Dobersek <zdobersek@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29875>
ir3's lowering of variables to scratch memory has to treat 8-bit values as
16-bit ones when comparing such value's size against the given threshold
since those values are handled through 16-bit half-registers. But those
values can still use natural 8-bit size and alignment for storing inside
scratch memory.
nir_lower_vars_to_scratch now accepts two size-and-alignment functions,
one used for calculating the variable size and the other for calculating
the size and alignment needed for storing inside scratch memory. Non-ir3
uses of this pass can just duplicate the currently-used function. ir3
provides a separate variable-size function that special-cases 8-bit types.
Signed-off-by: Zan Dobersek <zdobersek@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29875>
ir3 8-bit quad-broadcast, quad-swap, scan and reduce instructions only
work correctly when done in 16-bit space. A nir_lower_bit_size pass is
used to upcast the source value and then downcast the result back to 8
bits.
Signed-off-by: Zan Dobersek <zdobersek@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29875>
This is more correct than the previous code and the CL CTS relies on edge
case behavior here, e.g. for 1Dbuffer images.
I think part of that is not actually required by the spec, but whatever.
Fixes: 7b22bc617b ("rusticl/memory: complete rework on how mapping is implemented")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30528>
An application could map and unmap a host ptr allocation multiple times,
but because how the refcounting works, we might never ended up syncing the
written data to the mapped region.
This moves the refcounting out of the event processing.
Fixes: 7b22bc617b ("rusticl/memory: complete rework on how mapping is implemented")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30528>
We usually emit flags right before consuming them but this is
suboptimal from the point of view of register pressure: if an
instruction is only used to generate flags then waiting to emit
it right before reading the flags extends the liveness of the
sources used to generate the flags for no gain. This pass will
check for such instructions and try to move them as early as
possible.
Shader-db results below show this is effective to reduce register
pressure, allowing a few shaders to increase thread counts and/or
reduce spilling:
total instructions in shared programs: 11057173 -> 11057076 (<.01%)
instructions in affected programs: 1955543 -> 1955446 (<.01%)
helped: 4214
HURT: 3905
Inconclusive result (value mean confidence interval includes 0).
total threads in shared programs: 425096 -> 425170 (0.02%)
threads in affected programs: 74 -> 148 (100.00%)
helped: 37
HURT: 0
Threads are helped.
total uniforms in shared programs: 3846275 -> 3845674 (-0.02%)
uniforms in affected programs: 23574 -> 22973 (-2.55%)
helped: 217
HURT: 30
Uniforms are helped.
total max-temps in shared programs: 2222910 -> 2220488 (-0.11%)
max-temps in affected programs: 61904 -> 59482 (-3.91%)
helped: 2145
HURT: 113
Max-temps are helped.
total spills in shared programs: 4294 -> 4280 (-0.33%)
spills in affected programs: 148 -> 134 (-9.46%)
helped: 8
HURT: 0
total fills in shared programs: 6497 -> 6468 (-0.45%)
fills in affected programs: 291 -> 262 (-9.97%)
helped: 8
HURT: 0
total sfu-stalls in shared programs: 14344 -> 14611 (1.86%)
sfu-stalls in affected programs: 1308 -> 1575 (20.41%)
helped: 217
HURT: 335
Inconclusive result (%-change mean confidence interval includes 0).
total inst-and-stalls in shared programs: 11071517 -> 11071687 (<.01%)
inst-and-stalls in affected programs: 1946767 -> 1946937 (<.01%)
helped: 4191
HURT: 3909
Inconclusive result (value mean confidence interval includes 0).
total nops in shared programs: 270628 -> 269829 (-0.30%)
nops in affected programs: 22032 -> 21233 (-3.63%)
helped: 1213
HURT: 571
Inconclusive result (%-change mean confidence interval includes 0).
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30511>
If surface height during fast clear is 16k, as per bspec the height programmed
should be "value - 1" i.e. 0x3FFF. However, HW adds "1" to it but ignores
overflow bit[14]. HW performs OOB check based on bit[13:0] which is 0 and
drops failed transactions.
This patch passes the following failing test on LNL:
"PIGLIT_PLATFORM=gbm PIGLIT_DEFAULT_SIZE=16384x16384
shader_runner fast-slow-clear-interaction.shader_test -auto -fbo"
Signed-off-by: Aditya Swarup <aditya.swarup@intel.com>
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29182>
16-bit SIMD8 sampler writeback messages come with a bit of padding in
them, requiring us to emit a LOAD_PAYLOAD to reorganize the data into
the padding-free format expected by NIR. Additionally, we may reduce
the response length on the sampler messages based on which components
of the (always vec4) NIR destination are actually in use. When we do
that, dest_size > read_size, and the trailing components are all empty
BAD_FILE registers, indicating the contents are undefined.
Unfortunately, we can't ignore those trailing components entirely.
In the past, we left them default-initialized, giving us a BAD_FILE
register with UD type (which didn't matter, since all sampler returns
were 32-bit). But with 16-bit, this was confusing the LOAD_PAYLOAD.
For example, writing RGB and skipping A (without sparse) would produce
read_size = 3 and dest_size = 4 and nir_dest[5] containing:
nir_dest[] = <R:hf, G:hf, B:hf, blank-A:ud, blank-sparse:ud>
We'd then call LOAD_PAYLOAD on the first 4 sources, causing it to see
3 HF's and a UD, and try to copy the full 32-bit value at the end,
instead of 16-bits of pad like we intended. This meant it would
overflow the destination register's size, triggering validation errors.
Thanks to Ian Romanick for noticing this, writing a test, and also
coming up with a nearly identical fix.
Fixes: 0116430d39 ("intel/brw: Handle 16-bit sampler return payloads")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11617
References: https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/152
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30529>