intel/compiler: remove branch weight heuristic

As a result of this patch, the compiler chooses SIMD32 shaders more
frequently.

Current logic is designed to avoid regressions from enabling SIMD32 at
all cost, even though the cases where regression can happen are probably
for smaller draw calls (far away from the camera and thus smaller).

In Intel perf CI this patch improves FPS in:
- gfxbench5 alu2: 21.92% (gen9), 23.7% (gen11)
- synmark OglShMapVsm: 3.26% (gen9), 4.52% (gen11)
- gfxbench5 car chase: 1.34% (gen9), 1.32% (gen11)

No observed regressions there.

In my testing, it also improves FPS in:
- The Talos Principle: 2.9% (gen9)

The other 16 games I tested had very minor changes in performance
(2/3 positive, but not significant enough to list here).

Note: this patch harms synmark OglDrvState (which is not in Intel perf CI)
by ~2.9%, but this benchmark renders multiple scenes from other workloads
(including OglShMapVsm, which is helped in standalone mode) in tiny
rectangles. Rendering at such a small size drastically changes branching
statistics, which favors smaller SIMD modes. I assume this matters only in
micro-benchmarks, as in real workloads more expensive draw calls (with
more uniform branching behavior) dominate.

Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Acked-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7137>
2020-10-14 16:32:55 +02:00
parent 06764e0e5d
commit 21ffacff8c
1 changed files with 17 additions and 15 deletions
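
The effect of the removed heuristic can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the code in brw_ir_performance.cpp: the `Op` enum, the `estimate_cycles` function, the `legacy_branch_heuristic` flag and the one-cycle-per-instruction cost are all hypothetical. Only the weight values and the order in which the weight is updated follow the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified opcode set (not the real BRW_OPCODE_* enum).
enum class Op { ALU, IF, ENDIF, DO, WHILE };

// Toy estimate: every instruction costs one cycle, scaled by the weight of
// the control-flow region it sits in. Passing legacy_branch_heuristic = true
// mimics the branch weighting that this commit removes.
float estimate_cycles(const std::vector<Op> &prog, unsigned dispatch_width,
                      bool legacy_branch_heuristic)
{
   assert(dispatch_width == 8 || dispatch_width == 16 || dispatch_width == 32);

   // Values as they appear in the diff.
   const float branch_weight = (dispatch_width > 16 ? 1.0f : 0.5f);
   const float loop_weight = 10.0f;

   float weight = 1.0f;
   float elapsed = 0.0f;

   for (Op op : prog) {
      // ENDIF restores the outer weight before it is charged.
      if (legacy_branch_heuristic && op == Op::ENDIF)
         weight /= branch_weight;

      elapsed += 1.0f * weight;

      // IF and DO change the weight only for the instructions that follow.
      if (legacy_branch_heuristic && op == Op::IF)
         weight *= branch_weight;
      else if (op == Op::DO)
         weight *= loop_weight;
      else if (op == Op::WHILE)
         weight /= loop_weight;
   }

   return elapsed;
}
```

In this model, a program with two instructions inside an `if` is estimated as cheaper at SIMD16 than at SIMD32 when the legacy flag is set, and identical at both widths when it is not. How such estimates feed into the final choice of dispatch width is not modeled here.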
--- a/src/intel/compiler/brw_ir_performance.cpp
+++ b/src/intel/compiler/brw_ir_performance.cpp
@@ -1505,16 +1505,23 @@ namespace {
                            const backend_instruction *),
                         unsigned dispatch_width)
   {
-      /* XXX - Plumbing the trip counts from NIR loop analysis would allow us
-       *       to do a better job regarding the loop weights.  And some branch
-       *       divergence analysis would allow us to do a better job with
-       *       branching weights.
+      /* XXX - Note that the previous version of this code used worst-case
+       *       scenario estimation of branching divergence for SIMD32 shaders,
+       *       but this heuristic was removed to improve performance in common
+       *       scenarios. Wider shader variants are less optimal when divergence
+       *       is high, e.g. when application renders complex scene on a small
+       *       surface. It is assumed that such renders are short, so their
+       *       time doesn't matter and when it comes to the overall performance,
+       *       they are dominated by more optimal larger renders.
+       *
+       *       It's possible that we could do better with divergence analysis
+       *       by isolating branches which are 100% uniform.
+       *
+       *       Plumbing the trip counts from NIR loop analysis would allow us
+       *       to do a better job regarding the loop weights.
       *
       *       In the meantime use values that roughly match the control flow
-       *       weights used elsewhere in the compiler back-end -- Main
-       *       difference is the worst-case scenario branch_weight used for
-       *       SIMD32 which accounts for the possibility of a dynamically
-       *       uniform branch becoming divergent in SIMD32.
+       *       weights used elsewhere in the compiler back-end.
       *
       *       Note that we provide slightly more pessimistic weights on
       *       Gen12+ for SIMD32, since the effective warp size on that
@@ -1523,7 +1530,6 @@ namespace {
       *       previous generations, giving narrower SIMD modes a performance
       *       advantage in several test-cases with non-uniform discard jumps.
       */
-      const float branch_weight = (dispatch_width > 16 ? 1.0 : 0.5);
      const float discard_weight = (dispatch_width > 16 || s->devinfo->gen < 12 ?
                                    1.0 : 0.5);
      const float loop_weight = 10;
@@ -1539,16 +1545,12 @@ namespace {
            issue_instruction(st, s->devinfo, inst);
-            if (inst->opcode == BRW_OPCODE_ENDIF)
-               st.weight /= branch_weight;
-            else if (inst->opcode == FS_OPCODE_PLACEHOLDER_HALT && discard_count)
+            if (inst->opcode == FS_OPCODE_PLACEHOLDER_HALT && discard_count)
               st.weight /= discard_weight;
            elapsed += (st.unit_ready[unit_fe] - clock0) * st.weight;
-            if (inst->opcode == BRW_OPCODE_IF)
-               st.weight *= branch_weight;
-            else if (inst->opcode == BRW_OPCODE_DO)
+            if (inst->opcode == BRW_OPCODE_DO)
               st.weight *= loop_weight;
            else if (inst->opcode == BRW_OPCODE_WHILE)
               st.weight /= loop_weight;