intel/compiler/mesh: compactify MUE layout

Instead of using 4 dwords for each output slot, use only the amount
of memory actually needed by each variable.

There are some complications from this "obvious" idea:
- flat and non-flat variables can't be merged into the same vec4 slot,
  because flat inputs mask has vec4 stride
- multi-slot variables can have different layout:
   float[N] requires N 1-dword slots, but
   i64vec3 requires 1 fully occupied 4-dword slot followed by 2-dword slot
- some output variables occur both in single-channel/component split
  and combined variants
- crossing vec4 boundary requires generating more writes, so avoiding them
  if possible is beneficial

This patch fixes some issues with arrays in per-vertex and per-primitive data
(func.mesh.ext.outputs.*.indirect_array.q0 in crucible)
and by reduction in single MUE size it allows spawning more threads at
the same time.

Note: this patch doesn't improve vk_meshlet_cadscene performance because
default layout is already optimal enough.

Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20407>
This commit is contained in:
Marcin Ślusarz
2022-12-21 15:40:07 +01:00
committed by Marge Bot
parent fb765a65c8
commit a252123363
8 changed files with 478 additions and 118 deletions

View File

@@ -2085,6 +2085,21 @@ brw_nir_load_global_const(nir_builder *b, nir_intrinsic_instr *load_uniform,
return sysval;
}
const struct glsl_type *
brw_nir_get_var_type(const struct nir_shader *nir, nir_variable *var)
{
const struct glsl_type *type = var->interface_type;
if (!type) {
type = var->type;
if (nir_is_arrayed_io(var, nir->info.stage) || var->data.per_view) {
assert(glsl_type_is_array(type));
type = glsl_get_array_element(type);
}
}
return type;
}
bool
brw_nir_pulls_at_sample(nir_shader *shader)
{