third_party_mesa3d

Author	SHA1	Message	Date
Ian Romanick	c262752d74	intel/fs: Make opt_copy_propagation_local file private This annoyed me durning development of this MR. Every time I changed the parameters to this internal function, I had to modify a public header file... and trigger a much large rebuild. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25091>	2023-09-14 22:31:23 +00:00
Ian Romanick	0946108298	intel/fs: Simplify check in can_propagate_from The larger predicate here already requires that inst->opcode must be BRW_OPCODE_MOV, so it can't BRW_OPCODE_SEL. With that removed, the other simplifications are pretty straight forward. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25091>	2023-09-14 22:31:23 +00:00
Ian Romanick	1f15a0f8b2	intel/fs: Don't loop in try_constant_propagate The caller already loops over the sources. This means that the caller must loop over the sources in reverse because constant propagation prefers to propagate into the last sources first. The shader-db and fossil-db changes (below) are all due to SEL instructions. Changing the order sources are visited changes whether a SEL with two immediate sources is (+f0.0) sel g12 IMM_A IMM_B or (-f0.0) sel g12 IMM_B IMM_A The ordering of the sources affects the order the constant combining encounters the values, and the determines which value is "combined" and which value remains an immediate. This affects the results by luck. If there are two instructions: (+f0.0) sel g12 IMM_A IMM_B (+f0.0) sel g13 IMM_A IMM_C Picking IMM_A is advantageous over picking IMM_B and IMM_C. Since the selection algorithm in constant combining is greedy, this case requires the algorithm see the values in just the right order for the right thing to happen. v2: Rebase on many, many changes. Move instruction source fixup reordering out or try_constant_propagate. v3: Rebase on !7698. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25091>	2023-09-14 22:31:23 +00:00
Ian Romanick	ab23d89ade	intel/fs: Move src.file checks out of try_constant_propagate and try_copy_propagate Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25091>	2023-09-14 22:31:23 +00:00
Ian Romanick	b5b2338c5c	intel/fs: Make try_constant_propagate and try_copy_propagate file private This annoyed me durning development of this MR. Every time I changed the parameters to this internal function, I had to modify a public header file... and trigger a much large rebuild. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25091>	2023-09-14 22:31:22 +00:00
Ian Romanick	8665e37960	intel/fs: Don't try to copy propagate into a source again after progress is made If the linked list structure used depended on the list head to know when to terminate, this would be a pretty serious bug. If try_constant_propage or try_copy_propagate make progress, inst->src[i].nr will change. This results in the foreach_in_list using a different list header on later iterations of the loop. This causes two shaders in shader-db and 9 shaders in fossil-db to change. Looking at the code changes, these are cases where there was a copy of a copy that gets propagated. The part that confuses me is the VGRF numbers involved should not hash to the same bucket, so it should be impossible to find the original source from the intermediate VGRF. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25091>	2023-09-14 22:31:22 +00:00
Ian Romanick	c506d7e511	intel/fs: Combine constants for integer instructions too v2: Remove type change for SHR with negation. This was a leftover from a previous attempt to deal with SHR and negation. Now all right-shifts with unsigned parameters are marked as not being able to have source modifiers. v3: Disallow negations on right shifts of unsigned sources by setting the no_negations flag in add_candidate_immediate. This eliminates the need to exclude SHR in can_do_source_mods. Tiger Lake total instructions in shared programs: 21102817 -> 21099443 (-0.02%) instructions in affected programs: 296796 -> 293422 (-1.14%) helped: 92 / HURT: 356 total cycles in shared programs: 790564691 -> 790393358 (-0.02%) cycles in affected programs: 36456886 -> 36285553 (-0.47%) helped: 171 / HURT: 286 total spills in shared programs: 3951 -> 3959 (0.20%) spills in affected programs: 176 -> 184 (4.55%) helped: 0 / HURT: 2 total fills in shared programs: 2631 -> 2639 (0.30%) fills in affected programs: 176 -> 184 (4.55%) helped: 0 / HURT: 2 LOST: 0 GAINED: 4 Ice Lake total instructions in shared programs: 19954204 -> 19949122 (-0.03%) instructions in affected programs: 40301 -> 35219 (-12.61%) helped: 23 / HURT: 2 total cycles in shared programs: 858377735 -> 858462082 (<.01%) cycles in affected programs: 75537286 -> 75621633 (0.11%) helped: 124 / HURT: 319 total spills in shared programs: 6255 -> 6190 (-1.04%) spills in affected programs: 392 -> 327 (-16.58%) helped: 1 / HURT: 2 total fills in shared programs: 7813 -> 7382 (-5.52%) fills in affected programs: 942 -> 511 (-45.75%) helped: 1 / HURT: 2 LOST: 0 GAINED: 3 Skylake total instructions in shared programs: 18049362 -> 18044440 (-0.03%) instructions in affected programs: 48317 -> 43395 (-10.19%) helped: 26 / HURT: 2 total cycles in shared programs: 844884806 -> 844915655 (<.01%) cycles in affected programs: 76137133 -> 76167982 (0.04%) helped: 171 / HURT: 293 total spills in shared programs: 6148 -> 6149 (0.02%) spills in affected programs: 595 -> 596 (0.17%) helped: 4 / HURT: 2 total fills in shared programs: 7484 -> 7067 (-5.57%) fills in affected programs: 1226 -> 809 (-34.01%) helped: 4 / HURT: 2 LOST: 0 GAINED: 8 Broadwell total instructions in shared programs: 17826844 -> 17821805 (-0.03%) instructions in affected programs: 60687 -> 55648 (-8.30%) helped: 28 / HURT: 8 total cycles in shared programs: 905332682 -> 904369499 (-0.11%) cycles in affected programs: 76743509 -> 75780326 (-1.26%) helped: 179 / HURT: 225 total spills in shared programs: 17922 -> 17908 (-0.08%) spills in affected programs: 2495 -> 2481 (-0.56%) helped: 6 / HURT: 8 total fills in shared programs: 26290 -> 25397 (-3.40%) fills in affected programs: 2606 -> 1713 (-34.27%) helped: 8 / HURT: 6 LOST: 1 GAINED: 1 Haswell total instructions in shared programs: 16678878 -> 16674444 (-0.03%) instructions in affected programs: 78458 -> 74024 (-5.65%) helped: 87 / HURT: 6 total cycles in shared programs: 880189381 -> 880301043 (0.01%) cycles in affected programs: 29956463 -> 30068125 (0.37%) helped: 169 / HURT: 163 total spills in shared programs: 14428 -> 14378 (-0.35%) spills in affected programs: 2384 -> 2334 (-2.10%) helped: 8 / HURT: 6 total fills in shared programs: 16975 -> 16881 (-0.55%) fills in affected programs: 1334 -> 1240 (-7.05%) helped: 10 / HURT: 4 Ivy Bridge total instructions in shared programs: 15706048 -> 15706035 (<.01%) instructions in affected programs: 9941 -> 9928 (-0.13%) helped: 13 / HURT: 0 total cycles in shared programs: 433618834 -> 433624637 (<.01%) cycles in affected programs: 12926714 -> 12932517 (0.04%) helped: 52 / HURT: 41 Sandy Bridge total cycles in shared programs: 741223552 -> 741223443 (<.01%) cycles in affected programs: 19814 -> 19705 (-0.55%) helped: 14 / HURT: 0 No changes on Iron Lake or GM45 fossil-db changes: Tiger Lake Instructions in all programs: 156858030 -> 156905532 (+0.0%) Instructions helped: 3915 Instructions hurt: 15411 Cycles in all programs: 7529667771 -> 7532117340 (+0.0%) Cycles helped: 10260 Cycles hurt: 9990 Spills in all programs: 5610 -> 5457 (-2.7%) Spills helped: 18 Fills in all programs: 6274 -> 6091 (-2.9%) Fills helped: 18 Gained: 2 Lost: 16 Ice Lake Instructions in all programs: 141308082 -> 141303083 (-0.0%) Instructions helped: 574 Instructions hurt: 172 Cycles in all programs: 9091361325 -> 9094622766 (+0.0%) Cycles helped: 8764 Cycles hurt: 11702 Spills in all programs: 7531 -> 7385 (-1.9%) Spills helped: 19 Fills in all programs: 8462 -> 8294 (-2.0%) Fills helped: 19 Gained: 22 Lost: 15 Skylake Instructions in all programs: 131872162 -> 131867263 (-0.0%) Instructions helped: 566 Instructions hurt: 172 Cycles in all programs: 8795095440 -> 8799676943 (+0.1%) Cycles helped: 8333 Cycles hurt: 12182 Spills in all programs: 7006 -> 6884 (-1.7%) Spills helped: 13 Fills in all programs: 7696 -> 7552 (-1.9%) Fills helped: 13 Gained: 24 Lost: 1 Tested-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7698>	2023-08-29 19:01:36 +00:00
Ian Romanick	64c251bb3a	intel/fs: Combine constants for SEL instructions too It is very common to have bcsel where the second and third sources are both constants. This results in a situation where we would want to emit a SEL with two constant sources, but that's not allowed. Previously, we would load both constants into registers, then let constant propagation copy the last constant into the SEL instruction. This results in the constant using an entire SIMD register instead of a single channel. Instead, copy propagate both sources, then let the combine-constants pass do its thing. In the worst case, this stores the constant in a single channel of the SIMD register. In the best case, it reuses a value that was loaded into a register to satisfy another instruction. shader-db results: Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown) total instructions in shared programs: 19951549 -> 19948709 (-0.01%) instructions in affected programs: 482795 -> 479955 (-0.59%) helped: 1184 / HURT: 3 total cycles in shared programs: 858584724 -> 858205341 (-0.04%) cycles in affected programs: 356168375 -> 355788992 (-0.11%) helped: 1448 / HURT: 1195 total spills in shared programs: 6569 -> 6255 (-4.78%) spills in affected programs: 912 -> 598 (-34.43%) helped: 58 / HURT: 0 total fills in shared programs: 8218 -> 7813 (-4.93%) fills in affected programs: 1570 -> 1165 (-25.80%) helped: 58 / HURT: 0 LOST: 6 GAINED: 16 Broadwell total instructions in shared programs: 17819660 -> 17819389 (<.01%) instructions in affected programs: 1078129 -> 1077858 (-0.03%) helped: 1067 / HURT: 304 total cycles in shared programs: 904722624 -> 905035016 (0.03%) cycles in affected programs: 362583117 -> 362895509 (0.09%) helped: 1381 / HURT: 1123 total spills in shared programs: 17884 -> 17922 (0.21%) spills in affected programs: 5088 -> 5126 (0.75%) helped: 55 / HURT: 152 total fills in shared programs: 25533 -> 26290 (2.96%) fills in affected programs: 12992 -> 13749 (5.83%) helped: 61 /HURT: 295 LOST: 7 GAINED: 24 Haswell total instructions in shared programs: 16678080 -> 16673976 (-0.02%) instructions in affected programs: 1162893 -> 1158789 (-0.35%) helped: 1584 / HURT: 7 total cycles in shared programs: 880180082 -> 879932525 (-0.03%) cycles in affected programs: 364067522 -> 363819965 (-0.07%) helped: 1226 / HURT: 976 total spills in shared programs: 14937 -> 14428 (-3.41%) spills in affected programs: 7866 -> 7357 (-6.47%) helped: 351 / HURT: 5 total fills in shared programs: 17572 -> 16975 (-3.40%) fills in affected programs: 11028 -> 10431 (-5.41%) helped: 350 / HURT: 3 LOST: 8 GAINED: 16 Ivy Bridge total instructions in shared programs: 15704044 -> 15703158 (<.01%) instructions in affected programs: 304513 -> 303627 (-0.29%) helped: 707 / HURT: 0 total cycles in shared programs: 433560149 -> 433471118 (-0.02%) cycles in affected programs: 19299650 -> 19210619 (-0.46%) helped: 687 / HURT: 395 LOST: 2 GAINED: 9 Sandy Bridge total instructions in shared programs: 13913386 -> 13912884 (<.01%) instructions in affected programs: 195687 -> 195185 (-0.26%) helped: 455 / HURT: 0 total cycles in shared programs: 741156272 -> 741136266 (<.01%) cycles in affected programs: 10934349 -> 10914343 (-0.18%) helped: 578 / HURT: 289 LOST: 9 GAINED: 4 Iron Lake and GM45 had similar results. (Iron Lake shown) total instructions in shared programs: 8364056 -> 8364042 (<.01%) instructions in affected programs: 5178 -> 5164 (-0.27%) helped: 10 / HURT: 0 total cycles in shared programs: 248759794 -> 248757940 (<.01%) cycles in affected programs: 4305246 -> 4303392 (-0.04%) helped: 183 / HURT: 24 fossil-db results: Tiger Lake Instructions in all programs: 156943594 -> 156802601 (-0.1%) Instructions helped: 20595 Instructions hurt: 23248 Cycles in all programs: 7512086950 -> 7528386387 (+0.2%) Cycles helped: 29531 Cycles hurt: 27837 Spills in all programs: 13500 -> 5643 (-58.2%) Spills helped: 394 Spills hurt: 22 Fills in all programs: 18943 -> 6306 (-66.7%) Fills helped: 394 Fills hurt: 11 Gained: 93 Lost: 76 Ice Lake Instructions in all programs: 141395899 -> 141249621 (-0.1%) Instructions helped: 30067 Instructions hurt: 3 Cycles in all programs: 9097127057 -> 9089668235 (-0.1%) Cycles helped: 32268 Cycles hurt: 24315 Spills in all programs: 13695 -> 7564 (-44.8%) Spills helped: 403 Fills in all programs: 18400 -> 8494 (-53.8%) Fills helped: 403 Gained: 114 Lost: 137 Skylake Instructions in all programs: 131948328 -> 131826063 (-0.1%) Instructions helped: 29968 Instructions hurt: 3 Cycles in all programs: 8794778440 -> 8793934844 (-0.0%) Cycles helped: 32705 Cycles hurt: 23575 Spills in all programs: 10526 -> 7039 (-33.1%) Spills helped: 403 Fills in all programs: 11025 -> 7728 (-29.9%) Fills helped: 403 Gained: 102 Lost: 250 Tested-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7698>	2023-08-29 19:01:36 +00:00
Ian Romanick	cb0de0a1d3	intel/fs: Constant fold OR and AND The path taken in fs_visitor::swizzle_nir_scratch_addr for DG2 generates some AND and OR instructions before the SHL. This commit folds those so the whold calculation becomes a constant (like on older platforms). v2: Fix return type of src_as_uint. Noticed by Marcin. shader-db results: DG2 total instructions in shared programs: 23190475 -> 23179540 (-0.05%) instructions in affected programs: 36026 -> 25091 (-30.35%) helped: 7 / HURT: 0 total cycles in shared programs: 841196807 -> 841142563 (<.01%) cycles in affected programs: 1660670 -> 1606426 (-3.27%) helped: 7 / HURT: 0 No shader-db changes on any older Intel platforms. fossil-db results: DG2 Totals: Instrs: 197780372 -> 197773966 (-0.00%) Cycles: 14066410782 -> 14066399378 (-0.00%); split: -0.00%, +0.00% Subgroup size: 8438104 -> 8438112 (+0.00%) Send messages: 8049445 -> 8049446 (+0.00%) Scratch Memory Size: 14263296 -> 14264320 (+0.01%) Totals from 9 (0.00% of 668055) affected shaders: Instrs: 24547 -> 18141 (-26.10%) Cycles: 1984791 -> 1973387 (-0.57%); split: -0.98%, +0.40% Subgroup size: 88 -> 96 (+9.09%) Send messages: 867 -> 868 (+0.12%) Scratch Memory Size: 69632 -> 70656 (+1.47%) No fossil-db changes on any older Intel platforms. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23884>	2023-07-25 22:11:21 +00:00
Ian Romanick	61c786bad5	intel/fs: Constant fold SHL This is a modified version of a commit originally in !7698. This version add the changes to brw_fs_copy_propagation. If the address passed to fs_visitor::swizzle_nir_scratch_addr is a constant, that function will generate SHL with two constant sources. DG2 uses a different path to generate those addresses, so the constant folding can't occur there yet. That will be addressed in the next commit. What follows is the commit change history from that older MR. v2: Previously this commit was after `intel/fs: Combine constants for integer instructions too`. However, this commit can create invalid instructions that are only cleaned up by `intel/fs: Combine constants for integer instructions too`. That would potentially affect the shader-db results of each commit, but I did not collect new data for the reordering. v3: Fix masking for W/UW and for Q/UQ types. Add an assertion for !saturate. Both suggested by Ken. Also add an assertion that B/UB types don't matically come back. v4: Fix sources count. See also `ed3c2f73db` ("intel/fs: fixup sources number from opt_algebraic"). v5: Fix typo in comment added in v3. Noticed by Marcin. Fix a typo in a comment added when pulling this commit out of !7698. Noticed by Ken. shader-db results: DG2 No changes. Tiger Lake, Ice Lake, and Skylake had similar results (Ice Lake shown) total instructions in shared programs: 20655696 -> 20651648 (-0.02%) instructions in affected programs: 23125 -> 19077 (-17.50%) helped: 7 / HURT: 0 total cycles in shared programs: 858436639 -> 858407749 (<.01%) cycles in affected programs: 8990532 -> 8961642 (-0.32%) helped: 7 / HURT: 0 Broadwell and Haswell had similar results. (Broadwell shown) total instructions in shared programs: 18500780 -> 18496630 (-0.02%) instructions in affected programs: 24715 -> 20565 (-16.79%) helped: 7 / HURT: 0 total cycles in shared programs: 946100660 -> 946087688 (<.01%) cycles in affected programs: 5838252 -> 5825280 (-0.22%) helped: 7 / HURT: 0 total spills in shared programs: 17588 -> 17572 (-0.09%) spills in affected programs: 1206 -> 1190 (-1.33%) helped: 2 / HURT: 0 total fills in shared programs: 25192 -> 25156 (-0.14%) fills in affected programs: 156 -> 120 (-23.08%) helped: 2 / HURT: 0 No shader-db changes on any older Intel platforms. fossil-db results: DG2 Totals: Instrs: 197780415 -> 197780372 (-0.00%); split: -0.00%, +0.00% Cycles: 14066412266 -> 14066410782 (-0.00%); split: -0.00%, +0.00% Totals from 16 (0.00% of 668055) affected shaders: Instrs: 16420 -> 16377 (-0.26%); split: -0.43%, +0.17% Cycles: 220133 -> 218649 (-0.67%); split: -0.69%, +0.01% Tiger Lake, Ice Lake and Skylake had similar results. (Ice Lake shown) Totals: Instrs: 153425977 -> 153423678 (-0.00%) Cycles: 14747928947 -> 14747929547 (+0.00%); split: -0.00%, +0.00% Subgroup size: 8535968 -> 8535976 (+0.00%) Send messages: 7697606 -> 7697607 (+0.00%) Scratch Memory Size: 4380672 -> 4381696 (+0.02%) Totals from 6 (0.00% of 662749) affected shaders: Instrs: 13893 -> 11594 (-16.55%) Cycles: 5386074 -> 5386674 (+0.01%); split: -0.42%, +0.43% Subgroup size: 80 -> 88 (+10.00%) Send messages: 675 -> 676 (+0.15%) Scratch Memory Size: 91136 -> 92160 (+1.12%) Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23884>	2023-07-25 22:11:21 +00:00
Ian Romanick	5336cbff3b	intel/fs: Constant propagate into SHADER_OPCODE_SHUFFLE Code already exists to convert SHADER_OPCODE_SHUFFLE into a simple MOV when either source is constant. However... the constants have to actually get into those sources! On a shader that I'm working on that multiplies very large matrices using lots of subgroup operations, -SIMD8 shader: 1378 instructions. 3 loops. 793896 cycles. 0:0 spills:fills, 23 sends, scheduled with mode non-lifo. Promoted 0 constants. Compacted 22048 to 21664 bytes (2%) +SIMD8 shader: 346 instructions. 3 loops. 61742 cycles. 0:0 spills:fills, 23 sends, scheduled with mode top-down. Promoted 0 constants. Compacted 5536 to 5216 bytes (6%) No changes in shader-db or fossil-db on any Intel platform. v2: Merge a bunch of identical cases. Suggested by Ken. Reviewed-by: Caio Oliveira <caio.oliveira@intel.com> [v1] Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23609>	2023-06-21 17:16:57 +00:00
Ian Romanick	7ef45e661f	intel/fs: Add constant propagation for ADD3 v2: Require that the constant value be representable as either uint16_t or int16_t. Suggested by Matt. v3: Remove redundant patterns. Noticed by Matt. shader-db: DG2 total instructions in shared programs: 23103767 -> 23103577 (<.01%) instructions in affected programs: 51822 -> 51632 (-0.37%) helped: 98 / HURT: 15 total cycles in shared programs: 842347714 -> 842380017 (<.01%) cycles in affected programs: 1942595 -> 1974898 (1.66%) helped: 97 / HURT: 32 Nearly all of the affected shaders (around 9,900) are shaders in Cyberpunk 2077. It's about an even split between vertex and fragment shaders. The majority of the remaining affected shaders (3,600) are from Strange Brigade. This was also a nearly even split between fragment and vertex. All but two of the lost shaders are SIMD32 fragment shaders in Cyberpunk 2077. The other two are SIMD32 fragment shaders in Dota2. fossil-db: DG2 Instructions in all programs: 196379107 -> 196248608 (-0.1%) helped: 13467 / HURT: 1210 Cycles in all programs: 13931355281 -> 13929955971 (-0.0%) helped: 11801 / HURT: 2922 Lost: 90 Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23262>	2023-06-06 06:10:53 +00:00
Ian Romanick	7873edee6e	intel/fs: Use specialized version of regions_overlap in opt_copy_propagation Since one of the register must always be either VGRF or FIXED_GRF, much of regions_overlap and reg_offset can be elided. On my Ice Lake laptop (using a locked CPU speed and other measures to prevent thermal throttling, etc.) using a debugoptimized build, improves performance of Vulkan CTS "deqp-vk --deqp-case='dEQP-VK.spir'" by -0.29% ± 0.097% (n = 5, pooled s = 0.361697). Using a release build, improves performance of compiling shaders from batman_arkham_city_goty.foz by -3.3% ± 0.04% (n = 5, pooled s = 0.178312). Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22299>	2023-04-06 19:07:50 +00:00
Ian Romanick	782de1932c	intel/fs: Don't copy propagate from saturate to sel There are already NIR algebraic optimizations (see also `ac6646129f` ("nir: Move fsat outside of fmin/fmax if second arg is 0 to 1.") that will try to remove the saturate from things like fmax(0.5, fsat(x)) This basically reverts `40aeb558ce` ("i965/fs: Allow propagation of instructions with saturate flag to sel"). That commit message had no shader-db information, so it's unclear whether this actually helped anything ever. No shader-db changes on any Intel platform. One shader in Far Cry New Dawn was affected. Cycles in all programs: 10933090738 -> 10933090736 (-0.0%) Cycles helped: 1 Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22169>	2023-03-29 23:48:19 +00:00
Francisco Jerez	d4015bcb38	intel/fs: Fix copy propagation dataflow analysis in presence of force_writemask_all ACP overwrites. This fixes the behavior of copy propagation in cases where either the source or destination of an ACP is overwritten elsewhere in the program by a force_writemask_all instruction, which could cause the overwrite to be executed for an inactive channel under non-uniform control flow, causing the current per-channel dataflow propagation to give incorrect results. This has been reported in cases like: > while (true) { > x = imageSize(img); > if (non_uniform_condition()) { > y = x; > break; > } > } > use(y); Currently the copy propagation pass would propagate copy 'y = x' into 'use(y)', which is invalid since in the example above imageSize() is implemented as a force_writemask_all SEND message, whose result is broadcast to all channels, so when a given channel executes 'y = x' and breaks out of the loop, another divergent channel can execute a subsequent iteration of the loop overwriting 'x' with a different value, hence replacing 'y' with 'x' at 'use(y)' changes the behavior of the program. This patch extends the global dataflow analysis algorithm to determine whether there is any control flow path from a given copy to an overwrite of its source or destination which has force_writemask_all behavior inconsistent with the copy, and in such case prevents copy propagation for that ACP entry at any point of the program which can be reached from the overwrite, even if the copy is statically re-executed along all such control flow paths (as in the example above), since the execution of the overwrite for a given channel i may corrupt other channels j!=i inactive for the subsequently re-executed copy. Note that a simpler solution has been attempted which fully shuts down copy propagation if such a force_writemask_all ACP overwrite is present /anywhere/ in the program regardless of its location in the control flow graph, however that led to large shader-db regressions in some programs from shader-db (like a CS from Car Chase which would emit 53% more instructions). With this solution the only handful of shaders that suffer instruction count regressions seem to be getting misoptimized right now (e.g. some compute shaders from Deus Ex Mankind). This solution doesn't seem to affect the run-time of shader-db significantly, it's less than 1% higher with the fix applied. Reported-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Acked-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21351>	2023-03-17 03:05:20 -07:00
Francisco Jerez	1c1be23497	intel/fs: Track force_writemask_all behavior of copy propagation ACP entries. force_writemask_all determines whether all channels of the copy are actually valid, and may be required to be set for it to be propagated safely in cases where the destination of the copy is used by another force_writemask_all instruction, or when the copy occurs in a divergent control flow block different from its use. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Acked-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21351>	2023-03-17 03:05:18 -07:00
Francisco Jerez	7b5e933629	intel/fs: Fix src and dst types of LOAD_PAYLOAD ACP entries during copy propagation. The ACP entries created by copy propagation to track the implied copies of LOAD_PAYLOAD instructions don't model the behavior of LOAD_PAYLOAD correctly, since (as of `41868bb682`) header moves are implicitly retyped to UD and the destination of non-header copies implicitly uses the same type as the corresponding source, even though the ACP entries created for such copies could incorrectly represent a type conversion, which can lead to mis-optimization of the program. According to Marcin, this fixes the func.mesh.ext.workgroup_id.task.q0 crucible test. Fixes: `41868bb682` ("i965/fs: Rework the fs_visitor LOAD_PAYLOAD instruction") Reported-by: Marcin Ślusarz <marcin.slusarz@intel.com> Tested-by: Marcin Ślusarz <marcin.slusarz@intel.com> Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18980>	2023-01-25 22:22:12 +00:00
Kenneth Graunke	02129eee3a	intel/compiler: Eliminate SHADER_OPCODE_UNTYPED_ATOMIC_FLOAT The only reason for the separate opcode was because of the overlapping BRW_AOP_* enums, making it impossible to tell whether a particular AOP was the integer or float operation. Now that we use the lsc_opcode enums, we can just have the legacy lowering inspect the opcode and select the right descriptor. No need for a separate opcode. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Rohan Garg <rohan.garg@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>	2023-01-19 08:42:22 +00:00
Ian Romanick	9479e3a19b	intel/fs: Allow constant copy prop from DW to W This enables copy propagation of mov(8) g5<1>UD 0x00000180UD mul(8) g10<1>D g2.3<0,1,0>D g5<16,8,2>W into mul(8) g10<1>D g2.3<0,1,0>D 180W This is necessary for any optimization passes that generate imul_32x16 instructions. No fossil-db or shader-db changes on any Intel platform. v2: Fix type size check to (src size != 2) \|\| (dest size != 4). It was previously &&. :( This allowed copying constants into UB sources, and that is invalid. v3: Fix incorrect extraction of upper 16-bits of immediate value when subnr=2. Noticed by Caio. Reviewed-by: Caio Oliveira <caio.oliveira@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>	2022-11-08 00:02:16 +00:00
Ian Romanick	db20412168	intel/fs: Fix constant propagation into 32x16 integer multiplication Don't copy propagate the constant in situations like mov(8) g8<1>D 0x7fffffffD mul(8) g16<1>D g8<8,8,1>D g15<16,8,2>W On platforms that only have a 32x16 multiplier, this will result in lowering the multiply to mul(8) g15<1>D g14<8,8,1>D 0xffffUW mul(8) g16<1>D g14<8,8,1>D 0x7fffUW add(8) g15.1<2>UW g15.1<16,8,2>UW g16<16,8,2>UW On Gfx8 and Gfx9, which have the full 32x32 multiplier, it results in mul(8) g16<1>D g15<16,8,2>W 0x7fffffffD Volume 2a of the Skylake PRM says: When multiplying a DW and any lower precision integer, the DW operand must on src0. See also https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/104. Previous to INTEL_shader_integer_functions2 (in Vulkan or OpenGL), I don't think it would be possible to create a situation where this could occur. I discovered this via some optimizations that can determine that the non-constant source must be able to fit in 16-bits. The case listed above came from piglit's "ext_transform_feedback-order arrays points" with those optimizations in place. No shader-db or fossil-db changes on any Intel platform. Fixes: `de6c0f8487` ("intel/fs: Implement support for NIR opcodes for INTEL_shader_integer_functions2") Reviewed-by: Caio Oliveira <caio.oliveira@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>	2022-11-08 00:02:16 +00:00
Kenneth Graunke	ec2e8bc33f	intel/compiler: Avoid copy propagating large registers into EOT messages EOT messages need to use g112-g127 for their sources. With the new opt_split_sends pass, we may be constructing an EOT message from two different registers, and be able to copy propagate the original values into those SENDs. This can cause problems if we copy propagate from a large register (say an RGBA value which is 4 GRFs in SIMD8 or 8 GRFs in SIMD16), in a situation where the SEND only read a subset of that (say the alpha value out of an RGBA texturing result). g112-127 can only hold 16 registers worth of data, and sometimes we can only use g112-126. So, we can't propagate if the GRFs in question are larger than 15 GRFs. Fixes a shader validation failure in Alan Wake. Thanks to Ian Romanick for catching this! shader-db on Icelake shows that only SIMD32 programs are affected, and the results are pretty negligable: total instructions in shared programs: 19615228 -> 19615269 (<.01%) instructions in affected programs: 10702 -> 10743 (0.38%) helped: 1 / HURT: 43 / largest change: +/- 2 instructions total cycles in shared programs: 852001706 -> 852001566 (<.01%) cycles in affected programs: 767098 -> 766958 (-0.02%) helped: 68 / HURT: 64 / largest change: +/- 774 cycles GAINED: 2 / LOST: 0 Fixes: `589b03d02f` ("intel/fs: Opportunistically split SEND message payloads") Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/6803 Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17390>	2022-07-07 20:20:01 +00:00
Kenneth Graunke	72e9843991	intel/compiler: Introduce a new brw_isa_info structure This structure will contain the opcode mapping tables in the next commit. For now, this is the mechanical change to plumb it into all the necessary places, and it continues simply holding devinfo. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17309>	2022-06-30 23:46:35 +00:00
Ian Romanick	b5fa43952a	intel/fs: Better handle constant sources of FS_OPCODE_PACK_HALF_2x16_SPLIT I noticed that a LOT of fragment shaders in Shadow of the Tomb Raider, for instance, end up with a sequence of NIR like: vec1 32 ssa_2 = load_const (0x00000000 = 0.000000) ... vec1 32 ssa_191 = pack_half_2x16_split ssa_188, ssa_2 vec1 32 ssa_192 = pack_half_2x16_split ssa_189, ssa_2 vec1 32 ssa_193 = pack_half_2x16_split ssa_190, ssa_2 This results in an assembly sequence like: mov(8) g28<1>UD 0x00000000UD mov(8) g21<2>HF g28<8,8,1>F shl(8) g21<1>UD g21<8,8,1>UD 0x00000010UD mov(8) g21<2>HF g25<8,8,1>F mov(8) g19<2>HF g28<8,8,1>F shl(8) g19<1>UD g19<8,8,1>UD 0x00000010UD mov(8) g19<2>HF g23<8,8,1>F mov(8) g20<2>HF g28<8,8,1>F shl(8) g20<1>UD g20<8,8,1>UD 0x00000010UD mov(8) g20<2>HF g24<8,8,1>F After this commit, the generated assembly is: mov(8) g21<1>UD 0x00000000UD mov(8) g21<2>HF g23<8,8,1>F mov(8) g19<1>UD 0x00000000UD mov(8) g19<2>HF g17<8,8,1>F mov(8) g20<1>UD 0x00000000UD mov(8) g20<2>HF g18<8,8,1>F Tiger Lake, Ice Lake, Skylake, and Haswell had similar results. (Ice Lake shown) total instructions in shared programs: 20119086 -> 20119034 (<.01%) instructions in affected programs: 9056 -> 9004 (-0.57%) helped: 8 HURT: 0 helped stats (abs) min: 2 max: 16 x̄: 6.50 x̃: 4 helped stats (rel) min: 0.29% max: 1.75% x̄: 1.00% x̃: 0.98% 95% mean confidence interval for instructions value: -11.01 -1.99 95% mean confidence interval for instructions %-change: -1.56% -0.44% Instructions are helped. total cycles in shared programs: 861019414 -> 861021044 (<.01%) cycles in affected programs: 279862 -> 281492 (0.58%) helped: 4 HURT: 2 helped stats (abs) min: 6 max: 936 x̄: 239.00 x̃: 7 helped stats (rel) min: 0.03% max: 8.13% x̄: 2.09% x̃: 0.09% HURT stats (abs) min: 18 max: 2568 x̄: 1293.00 x̃: 1293 HURT stats (rel) min: 0.36% max: 1.14% x̄: 0.75% x̃: 0.75% 95% mean confidence interval for cycles value: -972.56 1515.89 95% mean confidence interval for cycles %-change: -4.77% 2.49% Inconclusive result (value mean confidence interval includes 0). Broadwell total instructions in shared programs: 17812327 -> 17812263 (<.01%) instructions in affected programs: 9867 -> 9803 (-0.65%) helped: 8 HURT: 0 helped stats (abs) min: 2 max: 28 x̄: 8.00 x̃: 4 helped stats (rel) min: 0.32% max: 1.80% x̄: 1.00% x̃: 0.95% 95% mean confidence interval for instructions value: -15.46 -0.54 95% mean confidence interval for instructions %-change: -1.54% -0.47% Instructions are helped. total cycles in shared programs: 904768620 -> 904773291 (<.01%) cycles in affected programs: 454799 -> 459470 (1.03%) helped: 4 HURT: 4 helped stats (abs) min: 36 max: 586 x̄: 344.50 x̃: 378 helped stats (rel) min: 0.47% max: 4.04% x̄: 2.01% x̃: 1.77% HURT stats (abs) min: 1 max: 5572 x̄: 1512.25 x̃: 238 HURT stats (rel) min: <.01% max: 2.77% x̄: 1.46% x̃: 1.53% 95% mean confidence interval for cycles value: -1122.40 2290.15 95% mean confidence interval for cycles %-change: -2.26% 1.71% Inconclusive result (value mean confidence interval includes 0). total spills in shared programs: 18581 -> 18579 (-0.01%) spills in affected programs: 323 -> 321 (-0.62%) helped: 1 HURT: 0 total fills in shared programs: 24985 -> 24981 (-0.02%) fills in affected programs: 1348 -> 1344 (-0.30%) helped: 1 HURT: 0 Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown) Instructions in all programs: 143585431 -> 143513657 (-0.0%) Instructions helped: 14403 Cycles in all programs: 8439312778 -> 8439371578 (+0.0%) Cycles helped: 10570 Cycles hurt: 3290 Gained: 146 Lost: 74 All of the lost and gained fossil-db shaders are SIMD32 fragment shaders. 14,247 of the affected shaders are from Shadow of the Tomb Raider. 154 are from Batman Arkham Origins, and the remaining two are from Octopath Traveler. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15089>	2022-04-07 18:26:23 +00:00
Topi Pohjolainen	261dd6c8f8	intel/compiler: Add new variant for TXF_CMS_W This allows, for example, fs_inst::components_read() without passing devinfo as extra argument. Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11766>	2021-11-22 21:27:30 -08:00
Topi Pohjolainen	0374b56faa	intel/compiler/fs: Add support for 16-bit sampler msg payload For SIMD8 half float payload, each component takes a full register, so we can use existing LOAD_PAYLOAD infrastruture for required padding by alternating plain 8-wide half float vector and null vector. Also this patch removes an unwanted assertion from opt_copy_propagation_local for LOAD_PAYLOAD. Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11766>	2021-11-22 21:27:30 -08:00
Ian Romanick	e3f502e007	intel/fs: Allow copy propagation between MOVs of mixed sizes This eliminates some spurious, size-converting moves. For example, on Ice Lake this helps dEQP-VK.spirv_assembly.type.vec3.i8.bitwise_xor_frag: SIMD8 shader: 52 instructions. 1 loops. 4164 cycles. 0:0 spills:fills, 5 sends SIMD8 shader: 49 instructions. 1 loops. 4044 cycles. 0:0 spills:fills, 5 sends Unfortunately, this doesn't clean everything up. Here's a subset of the "before" assembly: send(8) g11<1>UW g2<0,1,0>UD 0x02106e02 dp data 1 MsgDesc: ( untyped surface read, Surface = 2, SIMD8, Mask = 0xe) mlen 1 rlen 1 { align1 1Q }; mov(8) g7<4>UB g11<8,8,1>UD { align1 1Q }; mov(8) g12<1>UB g7<32,8,4>UB { align1 1Q }; send(8) g13<1>UW g2<0,1,0>UD 0x02106e03 dp data 1 MsgDesc: ( untyped surface read, Surface = 3, SIMD8, Mask = 0xe) mlen 1 rlen 1 { align1 1Q }; mov(8) g15<1>UW g12<8,8,1>UB { align1 1Q }; mov(8) g8<4>UB g13<8,8,1>UD { align1 1Q }; mov(8) g14<1>UB g8<32,8,4>UB { align1 1Q }; mov(8) g16<1>UW g14<8,8,1>UB { align1 1Q }; xor(8) g17<1>UW g15<8,8,1>UW g16<8,8,1>UW { align1 1Q }; And here's the same subset of the "after" assembly: send(8) g11<1>UW g2<0,1,0>UD 0x02106e02 dp data 1 MsgDesc: ( untyped surface read, Surface = 2, SIMD8, Mask = 0xe) mlen 1 rlen 1 { align1 1Q }; mov(8) g7<4>UB g11<8,8,1>UD { align1 1Q }; send(8) g13<1>UW g2<0,1,0>UD 0x02106e03 dp data 1 MsgDesc: ( untyped surface read, Surface = 3, SIMD8, Mask = 0xe) mlen 1 rlen 1 { align1 1Q }; mov(8) g15<1>UW g7<32,8,4>UB { align1 1Q }; mov(8) g8<4>UB g13<8,8,1>UD { align1 1Q }; mov(8) g16<1>UW g8<32,8,4>UB { align1 1Q }; xor(8) g17<1>UW g15<8,8,1>UW g16<8,8,1>UW { align1 1Q }; There are a lot of regioning and type restrictions in fs_visitor::try_copy_propagate, and I'm a little nervious about messing with them too much. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Suggested-by: Francisco Jerez <currojerez@riseup.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9025>	2021-08-18 22:03:37 +00:00
Anuj Phogat	61e8636557	intel: Rename gen_device prefix to intel_device export SEARCH_PATH="src/intel src/gallium/drivers/iris src/mesa/drivers/dri/i965" grep -E "gen_device" -rIl $SEARCH_PATH \| xargs sed -ie "s/gen_device/intel_device/g" Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10241>	2021-04-20 20:06:33 +00:00
Michel Dänzer	2928c21eb7	Convert most remaining free-form fall-through comments to FALLTHROUGH One exception is src/amd/addrlib/, for which -Wimplicit-fallthrough is explicitly disabled. Reviewed-by: Eric Anholt <eric@anholt.net> Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com> Reviewed-by: Juan A. Suarez <jasuarez@igalia.com> Reviewed-by: Gert Wollny <gert.wollny@collabora.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10220>	2021-04-15 16:01:22 +00:00
Anuj Phogat	e7e55af4d6	intel: Rename GENx keyword to GFXx Commands used to do the changes: export SEARCH_PATH="src/intel src/gallium/drivers/iris src/mesa/drivers/dri/i965" grep -E "GEN[[:digit:]]+" -rIl $SEARCH_PATH \| xargs sed -ie "s/GEN$[[:digit:]]\+$/GFX\1/g" Exclude the changes to modifiers: grep -E "I915_.GFX" -rIl $SEARCH_PATH \| xargs sed -ie "s/$I915_.$GFX/\1GEN/g" Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9936>	2021-04-02 18:33:07 +00:00
Anuj Phogat	abe9a71a09	intel: Rename gen field in gen_device_info struct to ver Commands used to do the changes: export SEARCH_PATH="src/intel src/gallium/drivers/iris src/mesa/drivers/dri/i965" grep -E "info\)(.\|->)gen" -rIl $SEARCH_PATH \| xargs sed -ie "s/info$)$$\.\\|->$gen/info\1\2ver/g" Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9936>	2021-04-02 18:33:07 +00:00
Lionel Landwerlin	200e56f84d	intel/fs: implement another copy propagation restriction We are missing an additional restriction on CHV & upcoming Xe-Hp. v2: Quote BSW PRMs (Curro) Check source is not a scalar (Curro) Fix comment (Marcin) Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9929>	2021-04-01 07:48:06 +00:00
Lionel Landwerlin	aa53665fda	intel/fs/copy_prop: check stride constraints with actual final type In some cases we will change the type of the destination register of an instruction. This is the type we should use to verify that we're allow to do the replacement. Otherwise we can hit restrictions on CHV and upcoming Xe-Hp for instance where the copy propagation transforms this : send(16) (mlen: 2) vgrf10:UD, 0u, 0u, vgrf35:D, null:UD mov(16) vgrf11:UW, vgrf10<2>:UW mov(16) vgrf12:UW, vgrf10+0.2<2>:UW mov(16) vgrf15:HF, \|vgrf11\|:HF mov(16) vgrf16:HF, \|vgrf12\|:HF mov(8) vgrf41<2>:UW, vgrf15+0.0:UW group0 mov(8) vgrf42<2>:UW, vgrf15+0.16:UW group8 mov(8) vgrf45<2>:UW, vgrf16+0.0:UW group0 mov(8) vgrf46<2>:UW, vgrf16+0.16:UW group8 into this : send(16) (mlen: 2) vgrf10:UD, 0u, 0u, vgrf35:D, null:UD mov(8) vgrf41<2>:HF, \|vgrf10+0.0\|<2>:HF group0 mov(8) vgrf42<2>:HF, \|vgrf10+1.0\|<2>:HF group8 mov(8) vgrf45<2>:HF, \|vgrf10+0.2\|<2>:HF group0 mov(8) vgrf46<2>:HF, \|vgrf10+1.2\|<2>:HF group8 Because of the floating point use, stride and offets should be the same. v2: Fix final destination type selection (Curro) v3: constify (Curro) Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Cc: <mesa-stable@lists.freedesktop.org> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9832>	2021-03-29 22:14:45 +00:00
Ian Romanick	52eb47c8d4	intel/compiler: Relax some conditions in try_copy_propagate Previously can_do_source_mods was used to determine whether a value with a source modifier or a value from a scalar source (e.g., a uniform) could be copy propagated. The former is a superset of the latter, so this always produces correct results, but it is overly restrictive. For example, a BFI instruction can't have source modifiers, but it can have scalar sources. This was originally authored to prevent a small number of shader-db regressions in a commit that marked SHR has not being able to have source modifiers. That commit has since been dropped in favor of a different method. v2: Refactor register region restriction detection to a helper function. Suggested by Jason. No fossil-db changes on any Intel platform. All Gen7+ platforms had similar results. (Ice Lake shown) total instructions in shared programs: 20039111 -> 20038943 (<.01%) instructions in affected programs: 31736 -> 31568 (-0.53%) helped: 104 HURT: 0 helped stats (abs) min: 1 max: 9 x̄: 1.62 x̃: 1 helped stats (rel) min: 0.30% max: 0.88% x̄: 0.45% x̃: 0.42% 95% mean confidence interval for instructions value: -2.03 -1.20 95% mean confidence interval for instructions %-change: -0.47% -0.42% Instructions are helped. total cycles in shared programs: 980309750 -> 980308897 (<.01%) cycles in affected programs: 591078 -> 590225 (-0.14%) helped: 70 HURT: 26 helped stats (abs) min: 2 max: 622 x̄: 23.94 x̃: 4 helped stats (rel) min: <.01% max: 2.85% x̄: 0.33% x̃: 0.12% HURT stats (abs) min: 2 max: 520 x̄: 31.65 x̃: 6 HURT stats (rel) min: 0.02% max: 2.45% x̄: 0.34% x̃: 0.15% 95% mean confidence interval for cycles value: -26.41 8.64 95% mean confidence interval for cycles %-change: -0.27% -0.03% Inconclusive result (value mean confidence interval includes 0). No shader-db changes on earlier Intel platforms. Reviewed-by: Anuj Phogat anuj.phogat@gmail.com [v1] Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9237>	2021-02-23 15:11:37 -08:00
Jason Ekstrand	58bcb5401d	intel/fs: QUAD_SWIZZLE requires packed data We could probably support some strides if we tried hard enough but the whole point of this opcode is to accelerate things with crazy Align16 or crazy regions. It's ok if we have to emit an extra MOV to get a packed source. Fixes: `8b4a5e641b` "intel/fs: Add support for subgroup quad operations" Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7329>	2021-01-22 18:38:37 +00:00
Jason Ekstrand	f8117f7051	intel/fs: Allow constant-propagation into SAMPLEINFO and IMAGE_SIZE Without this, we end up with indirect sampler messages all the time because we don't propagate the texture/image BTI. This makes debugging shaders with imageSize or textureSamples in them a pain. Shader-db results on Ice Lake: total instructions in shared programs: 19720612 -> 19720564 (<.01%) instructions in affected programs: 4998 -> 4950 (-0.96%) helped: 12 HURT: 0 All affected shaders were compute shaders in Deus Ex: Mankind Divided. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6794>	2020-10-14 21:35:30 +00:00
Jason Ekstrand	8e8701b43a	intel/fs: Don't copy-propagate stride=0 sources into ddx/ddy This can come up if, for instance, the shader does a derivative of a uniform or flat input. Ideally, NIR would use divergence analysis to get rid of the derivative in this case but it doesn't right now. This fixes a crash in F1 2017. Cc: mesa-stable@lists.freedesktop.org Reported-by: Marcin Ślusarz <marcin.slusarz@intel.com> Tested-by: Marcin Ślusarz <marcin.slusarz@intel.com> Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6564>	2020-09-02 20:31:32 +00:00
Ian Romanick	5afaa407c1	intel/compiler: Only GE and L modifiers are commutative for SEL Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>	2020-04-17 08:21:43 -07:00
Francisco Jerez	ea44de6d8c	intel/compiler/fs: Switch liveness analysis to IR analysis framework This involves wrapping fs_live_variables in a BRW_ANALYSIS object and hooking it up to invalidate_analysis() so it's properly invalidated. Seems like a lot of churn but it's fairly straightforward. The fs_visitor invalidate_ and calculate_live_intervals() methods are no longer necessary after this change. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:57 -08:00
Francisco Jerez	ba73e606f6	intel/compiler: Move all live interval analysis results into fs_live_variables This moves the following methods that are currently defined in fs_visitor (even though they are side products of the liveness analysis computation) and are already implemented in brw_fs_live_variables.cpp: > bool virtual_grf_interferes(int a, int b) const; > int virtual_grf_start; > int virtual_grf_end; It makes sense for them to be part of the fs_live_variables object, because they have the same lifetime as other liveness analysis results and because this will allow some extra validation to happen wherever they are accessed in order to make sure that we only ever use up-to-date liveness analysis results. This shortens the virtual_grf prefix in order to compensate for the slightly increased lexical overhead from the live_intervals pointer dereference. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:43 -08:00
Francisco Jerez	ab6d792986	intel/compiler: Pass detailed dependency classes to invalidate_analysis() Have fun reading through the whole back-end optimizer to verify whether I've missed any dependency flags -- Or alternatively, just trust that any mistake here will trigger an assertion failure during analysis pass validation if it ever poses a problem for the consistency of any of the analysis passes managed by the framework. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:39 -08:00
Francisco Jerez	d966a6b4c4	intel/compiler: Introduce backend_shader method to propagate IR changes to analysis passes The invalidate_analysis() method knows what analysis passes there are in the back-end and calls their invalidate() method to report changes in the IR. For the moment it just calls invalidate_live_intervals() (which will eventually be fully replaced by this function) if anything changed. This makes all optimization passes invalidate DEPENDENCY_EVERYTHING, which is clearly far from ideal -- The dependency classes passed to invalidate_analysis() will be refined in a future commit. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:32 -08:00
Francisco Jerez	54b1b71e73	intel/fs: Allow limited copy propagation of a LOAD_PAYLOAD into another. This is particularly useful in cases where register coalaesce is unlikely to succeed because the LOAD_PAYLOAD isn't a plain copy -- E.g. when a LOAD_PAYLOAD is shuffling the contents of a barycentric vector in order to transform it into the PLN layout. This prevents the following shader-db regressions (including SIMD32 programs) in combination with the interpolation rework part of this series. On SKL: total instructions in shared programs: 18596672 -> 18976097 (2.04%) instructions in affected programs: 7937041 -> 8316466 (4.78%) helped: 39 HURT: 67427 LOST: 466 GAINED: 220 On SNB: total instructions in shared programs: 13993866 -> 14202963 (1.49%) instructions in affected programs: 7611309 -> 7820406 (2.75%) helped: 624 HURT: 52943 LOST: 6 GAINED: 18 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:22:09 -08:00
Francisco Jerez	8eb4f2092a	intel/fs: Add support for copy-propagating a block of multiple FIXED_GRFs. In cases where a LOAD_PAYLOAD instruction copies a single block of sequential GRF registers into the destination (see is_identity_payload()), splitting the block copy into a number of ACP entries (one for each LOAD_PAYLOAD source) is undesirable, because that prevents copy propagation into any instructions which read multiple components at once with the same source (the barycentric source of the LINTERP instruction is going to be the overwhelmingly most common example). Technically it would also be possible to do this for VGRF sources, but there is little benefit from that since register coalesce already covers many of those cases -- There is no way for a block of FIXED_GRFs to be coalesced into a VGRF though. This prevents the following shader-db regressions (including SIMD32 programs) in combination with the interpolation rework part of this series. On SKL: total instructions in shared programs: 18595160 -> 18828562 (1.26%) instructions in affected programs: 13374946 -> 13608348 (1.75%) helped: 7 HURT: 108977 total spills in shared programs: 9116 -> 9106 (-0.11%) spills in affected programs: 404 -> 394 (-2.48%) helped: 7 HURT: 9 total fills in shared programs: 8994 -> 9176 (2.02%) fills in affected programs: 898 -> 1080 (20.27%) helped: 7 HURT: 9 LOST: 469 GAINED: 220 On SNB: total instructions in shared programs: 13996898 -> 14096222 (0.71%) instructions in affected programs: 8088546 -> 8187870 (1.23%) helped: 2 HURT: 66520 total spills in shared programs: 2985 -> 2961 (-0.80%) spills in affected programs: 632 -> 608 (-3.80%) helped: 2 HURT: 0 total fills in shared programs: 3144 -> 3128 (-0.51%) fills in affected programs: 1515 -> 1499 (-1.06%) helped: 2 HURT: 0 LOST: 0 GAINED: 4 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:21:41 -08:00
Francisco Jerez	e328fbd9f8	intel/fs: Add partial support for copy-propagating FIXED_GRFs. This will be useful for eliminating redundant copies from the FS thread payload, particularly in SIMD32 programs. For the moment we only allow FIXED_GRFs with identity strides in order to avoid dealing with composing the arbitrary bidimensional strides that FIXED_GRF regions potentially have, which are rarely used at the IR level anyway. This enables the following commit allowing block-propagation of FIXED_GRF LOAD_PAYLOAD copies, and prevents the following shader-db regressions (including SIMD32 programs) in combination with the interpolation rework part of this series. On ICL: total instructions in shared programs: 20484665 -> 20529650 (0.22%) instructions in affected programs: 6031235 -> 6076220 (0.75%) helped: 5 HURT: 42073 total spills in shared programs: 8748 -> 8925 (2.02%) spills in affected programs: 186 -> 363 (95.16%) helped: 5 HURT: 9 total fills in shared programs: 8663 -> 8960 (3.43%) fills in affected programs: 647 -> 944 (45.90%) helped: 5 HURT: 9 On SKL: total instructions in shared programs: 18937442 -> 19128162 (1.01%) instructions in affected programs: 8378187 -> 8568907 (2.28%) helped: 39 HURT: 68176 LOST: 1 GAINED: 4 On SNB: total instructions in shared programs: 14094685 -> 14243499 (1.06%) instructions in affected programs: 7751062 -> 7899876 (1.92%) helped: 623 HURT: 53586 LOST: 7 GAINED: 25 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:21:33 -08:00
Francisco Jerez	5153d06d92	intel/fs: Extend copy propagation dataflow analysis to copies with FIXED_GRF source. This involves indexing the ACP tables used internally by fs_copy_prop_dataflow::setup_initial_values() by reg_space() instead of register number. Both are nearly equivalent for virtual GRFs (barring the single bit of entropy lost in the hash), and this makes handling FIXED_GRFs straightforward. Because we're only going to support FIXED_GRFs for the source of a copy, this change is only strictly necessary during the second pass that checks for source interference, but we also apply the same change to the first pass for consistency. Note that this shouldn't change the behavior of the copy propagation pass until we start inserting FIXED_GRF entries into the ACP. Even then FIXED_GRF writes are extremely rare so this change will hardly ever have an effect, but they aren't completely non-existing so we need to handle them for correctness. No functional nor shader-db changes. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:21:27 -08:00
Jason Ekstrand	f8bda81887	intel/fs/copy-prop: Don't walk all the ACPs for each instruction In order to set up KILL sets, the dataflow code was walking the entire array of ACPs for every instruction. If you assume the number of ACPs increases roughly with the number of instructions, this is O(n^2). As it turns out, regions_overlap() is not nearly as cheap as one would like and shows up as a significant chunk on perf traces. This commit changes things around and instead first builds an array of exec_lists which it uses like a hash table (keyed off ACP source or destination) similar to what's done in the rest of the copy-prop code. By first walking the list of ACPs and populating the table and then walking instructions and only looking at ACPs which probably have the same VGRF number, we can reduce the complexity to O(n). This takes the execution time of the piglit vs-isnan-dvec test from about 56.4 seconds on an unoptimized debug build (what we run in CI) with NIR_VALIDATE=0 to about 38.7 seconds. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Matt Turner <mattst88@gmail.com>	2019-05-10 09:10:17 -05:00
Jason Ekstrand	20bbc175a4	intel/fs/copy-prop: Purge unused ACPs If the destination of an ACP entry exists only within this block, then there's no need to keep it for dataflow analysis. We can delete it from the out_acp table and avoid growing the bitsets any bigger than we absolutely have to. This reduces the maximum number of global ACP entries in the vs-isnan-dvec with software fp64 on Kaby Lake from 8630 to 3942 and takes the execution time of the piglit vs-isnan-dvec test from about 1:16.2 on an unoptimized debug build (what we run in CI) with NIR_VALIDATE=0 to about 56.4 seconds. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Matt Turner <mattst88@gmail.com>	2019-05-10 09:10:17 -05:00
Jason Ekstrand	0b6da5bac6	intel/fs/copy-prop: Bump the hash table size to 64 While the number of ACPs is generally not huge compared to the number of blocks, 16 does seem a bit small. Bumping it to 64 takes the execution time of the piglit vs-isnan-dvec test from about 1:18.1 on an unoptimized debug build (what we run in CI) with NIR_VALIDATE=0 to about 1:16.2. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Matt Turner <mattst88@gmail.com>	2019-05-10 09:10:17 -05:00
Juan A. Suarez Romero	b06ae53606	Revert "intel/compiler: split is_partial_write() into two variants" This reverts commit `40b3abb4d1`. It is not clear that this commit was entirely correct, and unfortunately it was pushed by error. CC: Jason Ekstrand <jason@jlekstrand.net> Acked-by: Jason Ekstrand <jason@jlekstrand.net>	2019-04-25 09:19:10 +02:00
Iago Toral Quiroga	40b3abb4d1	intel/compiler: split is_partial_write() into two variants This function is used in two different scenarios that for 32-bit instructions are the same, but for 16-bit instructions are not. One scenario is that in which we are working at a SIMD8 register level and we need to know if a register is fully defined or written. This is useful, for example, in the context of liveness analysis or register allocation, where we work with units of registers. The other scenario is that in which we want to know if an instruction is writing a full scalar component or just some subset of it. This is useful, for example, in the context of some optimization passes like copy propagation. For 32-bit instructions (or larger), a SIMD8 dispatch will always write at least a full SIMD8 register (32B) if the write is not partial. The function is_partial_write() checks this to determine if we have a partial write. However, when we deal with 16-bit instructions, that logic disables some optimizations that should be safe. For example, a SIMD8 16-bit MOV will only update half of a SIMD register, but it is still a complete write of the variable for a SIMD8 dispatch, so we should not prevent copy propagation in this scenario because we don't write all 32 bytes in the SIMD register or because the write starts at offset 16B (wehere we pack components Y or W of 16-bit vectors). This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-bit instructions, which lose a number of optimizations because of this, most important of which is copy-propagation. This patch splits is_partial_write() into is_partial_reg_write(), which represents the current is_partial_write(), useful for things like liveness analysis, and is_partial_var_write(), which considers the dispatch size to check if we are writing a full variable (rather than a full register) to decide if the write is partial or not, which is what we really want in many optimization passes. Then the patch goes on and rewrites all uses of is_partial_write() to use one or the other version. Specifically, we use is_partial_var_write() in the following places: copy propagation, cmod propagation, common subexpression elimination, saturate propagation and sel peephole. Notice that the semantics of is_partial_var_write() exactly match the current implementation of is_partial_write() for anything that is 32-bit or larger, so no changes are expected for 32-bit instructions. Tested against ~5000 tests involving 16-bit instructions in CTS produced the following changes in instruction counts: Patched \| Master \| % \| ================================================ SIMD8 \| 621,900 \| 706,721 \| -12.00% \| ================================================ SIMD16 \| 93,252 \| 93,252 \| 0.00% \| ================================================ As expected, the change only affects SIMD8 dispatches. Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>	2019-04-18 11:05:18 +02:00

1 2

60 Commits