intel/brw: Support CSE of ADD3

This one is a bit more complex in that we need to handle 3-source
commutative opcodes.  But it's also quite useful:

fossil-db results on Alchemist (A770):

   Instrs: 151659750 -> 150164959 (-0.99%); split: -0.99%, +0.01%
   Cycles: 12822686329 -> 12574996669 (-1.93%); split: -2.05%, +0.12%
   Subgroup size: 7589608 -> 7589592 (-0.00%)
   Send messages: 7375047 -> 7375053 (+0.00%); split: -0.00%, +0.00%
   Loop count: 46313 -> 46315 (+0.00%); split: -0.01%, +0.01%
   Spill count: 110184 -> 54670 (-50.38%); split: -50.79%, +0.41%
   Fill count: 213724 -> 104802 (-50.96%); split: -51.43%, +0.47%
   Scratch Memory Size: 9406464 -> 3375104 (-64.12%); split: -64.35%, +0.23%

Our older Shadow of the Tomb Raider fossil is particularly helped with
over a 90% reduction in scratch access (spills, fills, and scratch size).
However, benchmarking in the actual game shows no change in performance.
We're thinking the game's shaders have been updated since our capture.

Ian noted that there was a bug here where we'd accidentally CSE two ADD3
instructions with null destinations and different src[2] that couldn't be
dead code eliminated due to conditional mods.  However, this is only a bug
in the new cse_defs pass so we don't need to nominate this for stable
branches.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29848>
2024-06-16 02:45:53 -07:00
parent e1b1114bc2
commit 7adccbd48d
1 changed files with 11 additions and 2 deletions
--- a/src/intel/compiler/brw_fs_cse.cpp
+++ b/src/intel/compiler/brw_fs_cse.cpp
@@ -66,6 +66,7 @@ is_expression(const fs_visitor *v, const fs_inst *const inst)
   case BRW_OPCODE_FBH:
   case BRW_OPCODE_FBL:
   case BRW_OPCODE_CBIT:
+   case BRW_OPCODE_ADD3:
   case BRW_OPCODE_RNDU:
   case BRW_OPCODE_RNDD:
   case BRW_OPCODE_RNDE:
@@ -208,6 +209,13 @@ operands_match(const fs_inst *a, const fs_inst *b, bool *negate)
         }
      }
      return match;
+   } else if (a->sources == 3) {
+      return (xs[0].equals(ys[0]) && xs[1].equals(ys[1]) && xs[2].equals(ys[2])) ||
+             (xs[0].equals(ys[0]) && xs[1].equals(ys[2]) && xs[2].equals(ys[1])) ||
+             (xs[0].equals(ys[1]) && xs[1].equals(ys[0]) && xs[2].equals(ys[2])) ||
+             (xs[0].equals(ys[1]) && xs[1].equals(ys[2]) && xs[2].equals(ys[0])) ||
+             (xs[0].equals(ys[2]) && xs[1].equals(ys[0]) && xs[2].equals(ys[1])) ||
+             (xs[0].equals(ys[2]) && xs[1].equals(ys[1]) && xs[2].equals(ys[0]));
   } else {
      return (xs[0].equals(ys[0]) && xs[1].equals(ys[1])) ||
             (xs[1].equals(ys[0]) && xs[0].equals(ys[1]));
@@ -319,10 +327,11 @@ hash_inst(const void *v)

      hash = src_hash[0] * src_hash[1];
   } else if (inst->is_commutative()) {
-      /* Commutatively combine both sources */
+      /* Commutatively combine the sources */
      uint32_t hash0 = hash_reg(hash, inst->src[0]);
      uint32_t hash1 = hash_reg(hash, inst->src[1]);
-      hash = hash0 * hash1;
+      uint32_t hash2 = inst->sources > 2 ? hash_reg(hash, inst->src[2]) : 1;
+      hash = hash0 * hash1 * hash2;
   } else {
      /* Just hash all the sources */
      for (int i = 0; i < inst->sources; i++)
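
For readers who want to check the 3-source handling in isolation, here is a minimal standalone sketch of the two ideas in this change: matching sources in any order, and hashing them so that the order does not affect the result. It is not the Mesa code. `Src`, `operands_match_3src`, `hash_src` and `hash_srcs_commutative` are hypothetical names made up for illustration, and the hash mixing is an arbitrary FNV-style choice rather than what `hash_reg()` does. The match uses `std::next_permutation` to walk all six orderings instead of spelling them out by hand.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a source register; not Mesa's register type.
struct Src {
   uint32_t file;
   uint32_t nr;
   bool equals(const Src &o) const { return file == o.file && nr == o.nr; }
};

using Srcs = std::array<Src, 3>;

// Returns true if some ordering of ys equals xs element by element.
// The index array starts sorted, so the loop visits all 3! = 6 orderings.
static bool
operands_match_3src(const Srcs &xs, const Srcs &ys)
{
   std::array<int, 3> p = {0, 1, 2};
   do {
      if (xs[0].equals(ys[p[0]]) &&
          xs[1].equals(ys[p[1]]) &&
          xs[2].equals(ys[p[2]]))
         return true;
   } while (std::next_permutation(p.begin(), p.end()));
   return false;
}

// Illustrative per-source hash (FNV-style mixing); an assumption, not
// Mesa's hash_reg().
static uint32_t
hash_src(const Src &s)
{
   uint32_t h = 2166136261u;
   h = (h ^ s.file) * 16777619u;
   h = (h ^ s.nr) * 16777619u;
   // Force the result odd so one zero hash cannot zero the whole product.
   return h | 1u;
}

// Unsigned multiplication wraps modulo 2^32 and is commutative and
// associative, so the product does not depend on source order.
static uint32_t
hash_srcs_commutative(const Srcs &s)
{
   return hash_src(s[0]) * hash_src(s[1]) * hash_src(s[2]);
}

// Small usage example: the same sources in a different order must both
// match and hash identically, or the hash table lookup would miss.
static void
demo()
{
   const Srcs a = {Src{0, 1}, Src{0, 2}, Src{0, 3}};
   const Srcs b = {Src{0, 3}, Src{0, 1}, Src{0, 2}};
   assert(operands_match_3src(a, b));
   assert(hash_srcs_commutative(a) == hash_srcs_commutative(b));
}
```

The property that matters for CSE is that the hash and the equality check agree: any two source lists that `operands_match_3src` accepts must produce the same hash. A multiplicative combine gives that for free, at the cost of somewhat weaker mixing than an order-sensitive hash.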