This can be used by both etnaviv and freedreno/a2xx as they are both vec4
architectures with some instructions being scalar-only.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
I don't know why I thought NIR_PASS always set the progress variable.
Derp.
Fixes: d41cdef2a5 ("nir: Use the flrp lowering pass instead of nir_opt_algebraic")
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Coverity CID: 1444996
Coverity CID: 1444995
Coverity CID: 1444994
Coverity CID: 1444993
Coverity CID: 1444991
Coverity CID: 1444989
I tried to be very careful while updating all the various drivers, but I
don't have any of that hardware for testing. :(
i965 is the only platform that sets always_precise = true, and it is
only set true for fragment shaders. Gen4 and Gen5 both set lower_flrp32
only for vertex shaders. For fragment shaders, nir_op_flrp is lowered
during code generation as a(1-c)+bc. On all other platforms 64-bit
nir_op_flrp and on Gen11 32-bit nir_op_flrp are lowered using the old
nir_opt_algebraic method.
No changes on any other Intel platforms.
v2: Add panfrost changes.
Iron Lake and GM45 had similar results. (Iron Lake shown)
total cycles in shared programs: 188647754 -> 188647748 (<.01%)
cycles in affected programs: 5096 -> 5090 (-0.12%)
helped: 3
HURT: 0
helped stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2
helped stats (rel) min: 0.12% max: 0.12% x̄: 0.12% x̃: 0.12%
Reviewed-by: Matt Turner <mattst88@gmail.com>
This enables the remaining capabilities in SPV_EXT_descriptor_indexing.
Fixes: 0e10790558 "radv: Enable VK_EXT_descriptor_indexing."
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Only computeDerivativeGroupLinear is supported for now.
All crucible tests pass.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
When I implemented opt_if_loop_last_continue() I had restricted
this pass from moving other if-statements inside the branch opposite
the continue. At the time it was causing a bunch of spilling in
shader-db for i965.
However Samuel Pitoiset noticed that making this pass more aggressive
significantly improved the performance of Doom on RADV. Below are
the statistics he gathered.
28717 shaders in 14931 tests
Totals:
SGPRS: 1267317 -> 1267549 (0.02 %)
VGPRS: 896876 -> 895920 (-0.11 %)
Spilled SGPRs: 24701 -> 26367 (6.74 %)
Code Size: 48379452 -> 48507880 (0.27 %) bytes
Max Waves: 241159 -> 241190 (0.01 %)
Totals from affected shaders:
SGPRS: 23584 -> 23816 (0.98 %)
VGPRS: 25908 -> 24952 (-3.69 %)
Spilled SGPRs: 503 -> 2169 (331.21 %)
Code Size: 2471392 -> 2599820 (5.20 %) bytes
Max Waves: 586 -> 617 (5.29 %)
The codesize increases is related to Wolfenstein II it seems largely
due to an increase in phis rather than the existing jumps.
This gives +10% FPS with Doom on my Vega56.
Rhys Perry also benchmarked Doom on his VEGA64:
Before: 72.53 FPS
After: 80.77 FPS
v2: disable pass on non-AMD drivers
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> (v1)
Acked-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
This lowering isn't needed for RADV because AMDGCN has two
instructions. It will be disabled for RADV in an upcoming series.
While we are at it, factorize a little bit.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
the naming is a bit confusing no matter how you look at it. Within SPIR-V
"global" memory is memory accessible from all threads. glsl "global" memory
normally refers to shader thread private memory declared at global scope. As
we already use "shared" for memory shared across all thrads of a work group
the solution where everybody could be happy with is to rename "global" to
"private" and use "global" later for memory usually stored within system
accessible memory (be it VRAM or system RAM if keeping SVM in mind).
glsl "local" memory is memory only accessible within a function, while SPIR-V
"local" memory is memory accessible within the same workgroup.
v2: rename local to function as well
v3: rename vtn_variable_mode_local as well
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
For now, it's hidden behind a cap. Hopefully, we can eventually drop
that along with all the manual offset code in spirv_to_nir.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Tested-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Instead of baking in uvec2 for UBO and SSBO pointers and uint for push
constant and shared memory pointers, make it configurable.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We're going to want to do more deref optimizations going forward and
this gives us a central place to do them. Also, cast propagation will
get a bit more complicated with the addition of ptr_as_array derefs.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This workaround has been introduced by 135e4d434f for fixing
DXVK GPU hangs with many games. It is no longer needed since
LLVM r345718.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
On some GPUs, especially older Intel GPUs, some math instructions are
very expensive. On those architectures, don't reduce flow control to a
csel if one of the branches contains one of these expensive math
instructions.
This prevents a bunch of cycle count regressions on pre-Gen6 platforms
with a later patch (intel/compiler: More peephole select for pre-Gen6).
v2: Remove stray #if block. Noticed by Thomas.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
That flow control may be trying to avoid invalid loads. On at least
some platforms, those loads can also be expensive.
No shader-db changes on any Intel platform (even with the later patch
"intel/compiler: More peephole select").
v2: Add a 'indirect_load_ok' flag to nir_opt_peephole_select. Suggested
by Rob. See also the big comment in src/intel/compiler/brw_nir.c.
v3: Use nir_deref_instr_has_indirect instead of deref_has_indirect (from
nir_lower_io_arrays_to_elements.c).
v4: Fix inverted condition in brw_nir.c. Noticed by Lionel.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Helpful for debugging compiler backend problems: this allows us to
easily retrieve the LLVM IR from RenderDoc.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
This implementation should work and potential bugs can be
fixed during the release candidates window anyway.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Totals from affected shaders:
SGPRS: 1112 -> 1112 (0.00 %)
VGPRS: 1492 -> 1196 (-19.84 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 112172 -> 101316 (-9.68 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 93 -> 98 (5.38 %)
Wait states: 0 -> 0 (0.00 %)
All affected shaders are from "Batman: Arkham City" over DXVK.
The pass detects that the temporary array created by DXVK for
storing TCS inputs is a copy of the input arrays and allows
us to avoid copying all of the input data and then indirecting
on it with if-ladders, instead we just do indirect indexing.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This allows NIR to CSE more operations. LLVM does this also so the
impact is limited, however doing this in NIR allows other opts to
make progress. For example in radeonsi more loops are unrolled in
Civilization Beyond Earth.
The actual pipeline-db stats are not overwhelming but even in the
negatively affected shaders the NIR is clearly better. It just
happens that the code shuffling and in some cases calls to max
rather than a flt result in the final output from LLVM not
giving as good numbers.
However this is an incremental opt that further passes build off
so the change should be made IMO.
Totals from affected shaders:
SGPRS: 20192 -> 20184 (-0.04 %)
VGPRS: 19516 -> 19524 (0.04 %)
Spilled SGPRs: 437 -> 444 (1.60 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 1527444 -> 1522276 (-0.34 %) bytes
LDS: 6 -> 6 (0.00 %) blocks
Max Waves: 1018 -> 1016 (-0.20 %)
Wait states: 0 -> 0 (0.00 %)
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
All CTS pass on Polaris/Vega with LLVM 6, 7 and master, so
I think it's safe to enable the feature.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Not sure if this is all wired up. CTS does pass and the Tangrams
demo works fine on Vega. There are corruption issues on Polaris
but not sure if that related to 16-bit support.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
It was very inconsistently handled; the only things that made use of it
were glsl_to_nir, glspirv, and nir_gather_info. In particular,
nir_lower_io completely ignored it so anyone using nir_lower_io on
64-bit vertex attributes was going to be in for a shock. Also, as of
the previous commit, it's set by every driver that supports 64-bit
vertex attributes. There's no longer any reason to have it be an option
so let's just delete it.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
The code sizes return here get passed to the cache shader insert function,
which then memcpy from the code ptr, and causes all sorts of valgrind
errors like:
==6755== Invalid read of size 8
==6755== at 0x4C32FEE: memcpy@GLIBC_2.2.5 (vg_replace_strmem.c:1021)
==6755== by 0x2305D4C7: radv_pipeline_cache_insert_shaders (radv_pipeline_cache.c:416)
==6755== by 0x2305791D: radv_create_shaders (radv_pipeline.c:2158)
==6755== by 0x2305C523: radv_pipeline_init (radv_pipeline.c:3404)
==6755== by 0x2305C890: radv_graphics_pipeline_create (radv_pipeline.c:3515)
==6755== by 0x230188AB: radv_device_init_meta_blit_color (radv_meta_blit.c:871)
==6755== by 0x2301D50E: radv_device_init_meta_blit_state (radv_meta_blit.c:1278)
==6755== by 0x23011893: radv_device_init_meta (radv_meta.c:352)
==6755== by 0x2300744B: radv_CreateDevice (radv_device.c:1576)
==6755== by 0x5187D0F: ??? (in /usr/lib64/libvulkan.so.1.1.77)
==6755== by 0x518F6A3: ??? (in /usr/lib64/libvulkan.so.1.1.77)
==6755== by 0x5192A42: vkCreateDevice (in /usr/lib64/libvulkan.so.1.1.77)
==6755== Address 0x22a58548 is 4 bytes after a block of size 116 alloc'd
==6755== at 0x4C2EBAB: malloc (vg_replace_malloc.c:299)
==6755== by 0x23089DC4: ac_elf_read (ac_binary.c:144)
==6755== by 0x23090A60: ac_compile_module_to_binary (ac_llvm_helper.cpp:162)
==6755== by 0x23053F06: compile_to_memory_buffer (radv_llvm_helper.cpp:58)
==6755== by 0x23053F06: radv_compile_to_binary (radv_llvm_helper.cpp:98)
==6755== by 0x23052769: ac_llvm_compile (radv_nir_to_llvm.c:3394)
==6755== by 0x23052823: ac_compile_llvm_module (radv_nir_to_llvm.c:3418)
==6755== by 0x23053C05: radv_compile_nir_shader (radv_nir_to_llvm.c:3542)
==6755== by 0x23061B4E: shader_variant_create (radv_shader.c:580)
==6755== by 0x23061CFD: radv_shader_variant_create (radv_shader.c:634)
==6755== by 0x23057765: radv_create_shaders (radv_pipeline.c:2123)
==6755== by 0x2305C523: radv_pipeline_init (radv_pipeline.c:3404)
==6755== by 0x2305C890: radv_graphics_pipeline_create (radv_pipeline.c:3515)
Since we are just inserting the code into the cache, we can avoid these
bad reads and data in the cache by just using the binary code size here.
Fixes: 939e5a382 (radv: add padding for the UMR disassembler)
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>