intel/fs: Add support for CS to group invocations in quads

When using quads, instead of mapping the elements to the next 4 local
invocation indices, we map the next two to the "current" row and the
following two to the "next" row.  A side effect is that a thread will
execute the indices in a different order.
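
For illustration, a minimal C sketch of that mapping, assuming a linear
hardware channel index (subgroup_id * subgroup_size + subgroup_invocation)
and a local size whose X and Y dimensions are multiples of 2; the function
and variable names are hypothetical, not the pass's actual code:

    /* Map a linear channel index to (x, y, z) and the corresponding
     * gl_LocalInvocationIndex under derivative_group_quadsNV: channels
     * 4n..4n+3 form one 2x2 quad, two invocations in row y and two in
     * row y+1.
     */
    static void
    quad_mapping(unsigned linear, unsigned size_x, unsigned size_y,
                 unsigned *x, unsigned *y, unsigned *z, unsigned *index)
    {
       const unsigned slice = size_x * size_y;  /* invocations per Z slice */
       const unsigned in_slice = linear % slice;
       const unsigned quad = in_slice / 4;      /* which 2x2 quad */
       const unsigned lane = in_slice % 4;      /* position inside the quad */

       *x = 2 * (quad % (size_x / 2)) + (lane & 1);
       *y = 2 * (quad / (size_x / 2)) + (lane >> 1);
       *z = linear / slice;

       /* The usual gl_LocalInvocationIndex formula, fed the remapped ID. */
       *index = *x + *y * size_x + *z * slice;
    }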

We now lower both the local invocation ID and the local invocation
index together, and no longer rely on the lowering done by
nir_lower_system_values.  That is convenient when doing the math for
quads, because we need X and Y to get the right invocation index.

When the pass makes progress, fold the constants and clean up
afterwards to reduce the noise from the indexing math.
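
A minimal sketch of what that cleanup could look like at the call site;
the exact pass list and the lowering entry point's name and arguments are
assumptions, not taken from this commit:

    /* Run the lowering, then clean up the indexing math only if it
     * changed anything.  brw_nir_lower_cs_intrinsics() here stands in
     * for whatever entry point the backend actually uses.
     */
    bool progress = false;
    NIR_PASS(progress, nir, brw_nir_lower_cs_intrinsics);
    if (progress) {
       NIR_PASS_V(nir, nir_opt_constant_folding);
       NIR_PASS_V(nir, nir_opt_dce);
    }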

This implements the derivative_group_quadsNV semantics from
NV_compute_shader_derivatives.

v2: Take subgroup_id into account, otherwise only values in the first
    subgroup would be used. (Jason)

v3: Calculate invocation index and ID together, to avoid duplicating
    some math in the quads case when both index and ID are used. (Jason)

v4: Don't call cleanup passes as part of the lowering, leave that to
    the call site. (Jason)
    Change the calculation to use fewer instructions. (Jason)

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> (v3)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Author: Caio Marcelo de Oliveira Filho
Date:   2019-03-27 15:07:59 -07:00
parent ef0339d5ea
commit 3ee3024804
3 changed files with 103 additions and 16 deletions


@@ -45,7 +45,6 @@
    .lower_flrp64 = true, \
    .lower_isign = true, \
    .lower_ldexp = true, \
-   .lower_cs_local_id_from_index = true, \
    .lower_device_index_to_zero = true, \
    .native_integers = true, \
    .use_interpolated_input_intrinsics = true, \