intel/compiler: Add and use a pass to generate imul_32x16 instructions

Cycle counts improve on Gfx8 and Gfx9 platforms because many
instructions like

    mul(8)          g12<1>D         g10<8,8,1>D     6D

become

    mul(8)          g12<1>D         g10<8,8,1>D     6W

It is the same number of instructions, but the 32x16 multiply is a
little faster.
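
As a rough illustration of why the rewrite is safe (a standalone sketch,
not the pass itself; mul_32x32 and mul_32x16 are hypothetical helper
names that only model the two instruction forms): when the constant
source fits in a signed 16-bit word, sign-extending it back to 32 bits
yields the same wrapped 32-bit product.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the D x D form: low 32 bits of the product.
 * The multiply is done on unsigned values so that wraparound is
 * well defined in C.
 */
static int32_t mul_32x32(int32_t a, int32_t b)
{
   return (int32_t)((uint32_t)a * (uint32_t)b);
}

/* Hypothetical model of the D x W form: the 16-bit source is
 * sign-extended to 32 bits before the multiply.
 */
static int32_t mul_32x16(int32_t a, int16_t b)
{
   return (int32_t)((uint32_t)a * (uint32_t)(int32_t)b);
}
```

For example, mul_32x16(x, 6) and mul_32x32(x, 6) agree for every 32-bit
x, which is the case shown in the assembly above.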

v2: Fix transposed hi and lo in "(hi >= INT16_MIN && lo <= INT16_MAX)".
Noticed by Caio.  Use nir_src_is_const instead of open coding it.
Suggested by Caio.
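
For reference, a minimal sketch of the range test that the v2 fix
concerns, assuming lo and hi are the inclusive bounds of a source's
possible values (the function names are hypothetical and the real pass
may structure the check differently):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Transposed form quoted in the v2 note: it compares the wrong bound
 * against each limit, so it accepts ranges that do not fit.
 */
static bool transposed_range_fits_int16(int64_t lo, int64_t hi)
{
   return hi >= INT16_MIN && lo <= INT16_MAX;
}

/* Intended form: every value in [lo, hi] fits in a signed 16-bit word. */
static bool range_fits_int16(int64_t lo, int64_t hi)
{
   return lo >= INT16_MIN && hi <= INT16_MAX;
}

/* Hypothetical self-check showing where the two forms disagree. */
static void check_range_examples(void)
{
   assert(range_fits_int16(-32768, 32767));
   assert(!range_fits_int16(-40000, 40000));
   assert(transposed_range_fits_int16(-40000, 40000));
}
```

The range [-40000, 40000] shows the difference: the transposed form
accepts it even though neither bound fits in 16 bits.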

Broadwell and Skylake had similar results. (Skylake shown)
total cycles in shared programs: 845748380 -> 845145547 (-0.07%)
cycles in affected programs: 446346348 -> 445743515 (-0.14%)
helped: 6017
HURT: 0
helped stats (abs) min: 2 max: 7380 x̄: 100.19 x̃: 8
helped stats (rel) min: <.01% max: 3.72% x̄: 0.41% x̃: 0.39%
95% mean confidence interval for cycles value: -113.37 -87.00
95% mean confidence interval for cycles %-change: -0.42% -0.41%
Cycles are helped.

Skylake
Cycles in all programs: 8844820715 -> 8828897462 (-0.2%)
Cycles helped: 47914
Cycles hurt: 1

No shader-db or fossil-db changes on any other Intel platform.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
Ian Romanick
2022-02-02 18:49:25 -08:00
committed by Marge Bot
parent 9479e3a19b
commit f90d71055b
4 changed files with 126 additions and 0 deletions


@@ -176,6 +176,8 @@ void brw_nir_analyze_ubo_ranges(const struct brw_compiler *compiler,
bool brw_nir_opt_peephole_ffma(nir_shader *shader);
bool brw_nir_opt_peephole_imul32x16(nir_shader *shader);
void brw_nir_optimize(nir_shader *nir,
const struct brw_compiler *compiler,
bool is_scalar,