third_party_mesa3d

Author	SHA1	Message	Date
Eric Anholt	95d41a3525	ra: Use struct ra_class in the public API. All these unsigned ints are awful to keep track of. Use pointers so we get some type checking. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9437>	2021-06-04 19:08:57 +00:00
Iago Toral Quiroga	f07c797e93	v3dv: implement vkCmdDispatchBase This was added with VK_KHR_device_group and allows users to specify a base offset that will be automatically added to gl_WorkGroupID. Unfortunately, V3D doesn't support this natively, so we need to add the base to the workgroup id generated by hardware manually. For this, we inject add instructions that source from a QUNIFORM that will retrieve the actual dispatch base from the compute job when it is dispatched. Since a compute shader can be dispatched with CmdDispatch and/or CmdDispatchBase, we always need to add these additional add instructions and use a base of (0,0,0) for regular dispatches. Since we don't support any version of OpenGL with this dispatch base functionality we can avoid the extra instructions there. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11037>	2021-05-31 09:06:18 +00:00
Iago Toral Quiroga	d19ce36ff2	broadcom/compiler: refactor compile strategies Until now, if we couldn't compile at 4 threads we would lower thread count with optimizations disabled. However, lowering thread count doubles the number of registers available per thread, which alone is already a big relief for register pressure, so it makes sense to enable optimizations when we do that, and progressively disable them until we enable spilling as a last resort. This can slightly improve performance for some applications. Sponza, for example, gets a ~1.5% boost. I see several UE4 shaders that also get compiled to better code at 2 threads with this, but it is more difficult to assess how much this improves performance in practice due to the large variance in frame times that we observe with UE4 demos. Also, if a compiler strategy disables an optimization that did not make any progress in the previous compile attempt, we would end up re-compiling the exact same shader code and failing again. This patch keeps track of which strategies won't make progress and skips them in that case to save some CPU time during shader compiles. Care should be taken to ensure that we try to compile with the default NIR scheduler at minimum thread count at least once though, so a specific strategy for this is added, to prevent the scenario where no optimizations are used and we skip directly to the fallback scheduler if the default strategy fails at 4 threads. Similarly, we now also explicitly specify which strategies are allowed to do TMU spills and make sure we take this into account when deciding to skip strategies. This prevents the case where no optimizations are used in a shader and we skip directly to the fallback scheduler after failing compilation at 2 threads with the default NIR scheduler but without trying to spill first. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10647>	2021-05-06 12:27:06 +02:00
Iago Toral Quiroga	296fe4daa6	broadcom/compiler: add a compiler strategy to disable loop unrolling Loop unrolling can increase register pressure significantly, leading to lower thread counts and spilling. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10647>	2021-05-06 12:27:06 +02:00
Iago Toral Quiroga	4742300e6b	v3d: move NIR compiler options to GL driver The Vulkan driver was already creating and using its own set of options, so the ones defined in the compiler are only used with GL, which is confusing. Move them to the GL driver. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10647>	2021-05-06 12:27:06 +02:00
Iago Toral Quiroga	f514280524	broadcom/compiler: track if a shader has control barriers in prog_data Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10541>	2021-05-04 15:53:23 +00:00
Iago Toral Quiroga	0a3bfacabb	broadcom/compiler: rename unifa tracking fields The term 'last' may be misleading because the offset represents the current unifa offset, which is the offset used by the last load plus 4 bytes, so rename these to use the term 'current' instead. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>	2021-04-09 10:31:40 +00:00
Iago Toral Quiroga	8998666de7	broadcom/compiler: sort constant UBO loads by index and offset This implements a NIR pass that groups together constant UBO loads for the same UBO index in order of increasing offset when the distance between them is small enough that it enables the "skip unifa write" optimization. This may increase register pressure because it can move UBO loads earlier, so we also add a compiler strategy fallback to disable the optimization if we need to drop thread count to compile the shader with this optimization enabled. total instructions in shared programs: 13557555 -> 13550300 (-0.05%) instructions in affected programs: 814684 -> 807429 (-0.89%) helped: 4485 HURT: 2377 Instructions are helped. total uniforms in shared programs: 3777243 -> 3760990 (-0.43%) uniforms in affected programs: 112554 -> 96301 (-14.44%) helped: 7226 HURT: 36 Uniforms are helped. total max-temps in shared programs: 2318133 -> 2333761 (0.67%) max-temps in affected programs: 63230 -> 78858 (24.72%) helped: 23 HURT: 3044 Max-temps are HURT. total sfu-stalls in shared programs: 32245 -> 32567 (1.00%) sfu-stalls in affected programs: 389 -> 711 (82.78%) helped: 139 HURT: 451 Inconclusive result. total inst-and-stalls in shared programs: 13589800 -> 13582867 (-0.05%) inst-and-stalls in affected programs: 817738 -> 810805 (-0.85%) helped: 4478 HURT: 2395 Inst-and-stalls are helped. total nops in shared programs: 354365 -> 342202 (-3.43%) nops in affected programs: 31000 -> 18837 (-39.24%) helped: 4405 HURT: 265 Nops are helped. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>	2021-04-09 10:31:40 +00:00
Iago Toral Quiroga	fb2214a441	broadcom/compiler: allow compilation strategies to limit minimum thread count This adds a minimum thread count parameter to each compilation strategy with the intention to limit the minimum allowed thread count that can be used to register allocate with that strategy. For now all strategies allow the minimum thread count supported by the hardware, but we will be using this infrastructure to impose a more strict limit in an upcoming optimization. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>	2021-04-09 10:31:40 +00:00
Iago Toral Quiroga	4b244dc64f	broadcom/compiler: add a definition for the unifa skip distance We will be using this distance to setup another optimization in a follow-up patch. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>	2021-04-09 10:31:40 +00:00
Iago Toral Quiroga	f33ca092da	broadcom/compiler: add a NOP count stat to shader-db Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9918>	2021-03-31 05:51:22 +00:00
Alejandro Piñeiro	b71fd5587e	broadcom/compiler: add driver_location_map at vs prog data This maps the nir shader data.location to its final data.driver_location. In general we are using the driver location as index (like vattr_sizes on the same struct), so having this map is useful if what we have is the data.location, and we don't have available the original nir shader. v2: use memset instead of for loop, and nir_foreach_shader_in_variable instead of nir_foreach_variable_with_modes (Iago) Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9403>	2021-03-22 17:10:47 +00:00
Alejandro Piñeiro	2be0c36775	broadcom/compiler: add local_size in v3d_compute_prog_data As we plan to try to get directly the compiled variant from the cache, it would be possible to not have available the nir shaders, so we add this info on prog data. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9403>	2021-03-22 17:10:47 +00:00
Iago Toral Quiroga	947e9e42cc	broadcom/compiler: simplify ldvary pipelining We get optimal ldvary pipelining by doing the following: 1) Carefully merge a paired ldvary into the previous instruction when possible. 2) When the above succeeds, flag the ldvary as scheduled immediately so we can merge one of its children into the current instruction. 3) When scheduling ldvary sequences, only pick up instructions that are part of the sequence to avoid picking up something that prevents successful pipelining. This patch skips 3), accepting some hurt shaders in exchange for better scheduling flexibility during ldvary sequences. Besides eliminating most of the code dedicated to special handling of ldvary sequences, this also usually allows us to produce better code by merging instructions that are unrelated to ldvary sequences into the ldvary sequences, which is particularly effective to fill up the gaps produced when scheduling the first and last ldvary sequences as well as the gaps produced by flat and noperspective varying sequences that don't have both mul and add instructions. Notice that there are some hurt shaders, because sometimes the extra scheduler flexibility can lead to picking up instructions that will break a sequence without compensating for that, typically an ldunif that prevents us from doing the fixup for a follow-up ldvary. We will try to correct some of these cases with the next patch. total instructions in shared programs: 13786037 -> 13760415 (-0.19%) instructions in affected programs: 3201387 -> 3175765 (-0.80%) helped: 16155 HURT: 4146 Instructions are helped. total max-temps in shared programs: 2324834 -> 2322991 (-0.08%) max-temps in affected programs: 22160 -> 20317 (-8.32%) helped: 1340 HURT: 103 Max-temps are helped. total sfu-stalls in shared programs: 30685 -> 31827 (3.72%) sfu-stalls in affected programs: 782 -> 1924 (146.04%) helped: 253 HURT: 1416 Inconclusive result. total inst-and-stalls in shared programs: 13816722 -> 13792242 (-0.18%) inst-and-stalls in affected programs: 3171642 -> 3147162 (-0.77%) helped: 15331 HURT: 4179 Inst-and-stalls are helped. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9471>	2021-03-10 07:52:22 +00:00
Iago Toral Quiroga	839007e490	broadcom/compiler: always restart ldvary pipelining when scheduling ldvary When we were only able to pipeline smooth varyings, if we had to disable ldvary pipelining in the middle of a sequence it would stay disabled for the rest of the program, to prevent us from prioritizing scheduling of ldvary instructions that we would not be able to pipeline effectively. Now that we can pipeline all ldvary sequences we can change this. This change re-enables ldvary pipelining upon finding the next ldvary in the program in the hopes that we can continue pipelining successfully. To do this, we track the number of ldvary instructions we emitted so far and compare that to the number of inputs in the fragment shader we are scheduling. This also allows us to simplify our ldvary tracking at nir to vir time, since that is all now handled in the QPU scheduler. total instructions in shared programs: 13817048 -> 13810783 (-0.05%) instructions in affected programs: 810114 -> 803849 (-0.77%) helped: 4843 HURT: 591 Instructions are helped. total max-temps in shared programs: 2326612 -> 2326300 (-0.01%) max-temps in affected programs: 4689 -> 4377 (-6.65%) helped: 285 HURT: 7 Max-temps are helped. total sfu-stalls in shared programs: 30942 -> 30865 (-0.25%) sfu-stalls in affected programs: 207 -> 130 (-37.20%) helped: 120 HURT: 42 Sfu-stalls are helped. total inst-and-stalls in shared programs: 13847990 -> 13841648 (-0.05%) inst-and-stalls in affected programs: 825378 -> 819036 (-0.77%) helped: 4899 HURT: 590 Inst-and-stalls are helped. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9404>	2021-03-05 10:32:19 +01:00
Iago Toral Quiroga	acbd4881c2	broadcom/compiler: ldvary pipelining tracking and documentation clean-ups Now that we can pipeline all varyings we should not be referring specifically to smooth varyings anywhere. Also, rename the instruction field 'ldvary_pipelining' to 'is_ldvary_sequence', which is more appropriate, since we always set this for any instruction involved with varying setups, independently of whether they end up being pipelined or not. This also does some other minor edits which intend to slightly simplify the code and make it a bit more compact. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9363>	2021-03-02 13:54:14 +01:00
Iago Toral Quiroga	1d021539a2	broadcom/compiler: track pipelineable ldvary sequences If we have two (or more) smooth varyings like this: nop t3; ldvary.rf0 fmul t5, t3, t0 fadd t6, t5, r5 nop t7; ldvary.rf0 fmul t9, t7, t0 fadd t10, t9, r5 nop t11; ldvary.rf0 fmul t13, t11, t0 fadd t14, t13, r5 We may be able to pipeline them like this: nop ; nop ; ldvary.r4 nop ; fmul r0, r4, rf0 ; ldvary.r1 fadd rf13, r0, r5 ; fmul r2, r1, rf0 ; ldvary.r3 fadd rf12, r2, r5 ; fmul r4, r3, rf0 ; ldvary.r0 But in order to do this, we will need to manually tweak the QPU scheduling. This patch tracks information about ldvary sequences that are good candidates for pipelining, and a follow-up patch will use this information to pipeline them when we emit the QPU code. v2 (apinheiro): - Rename the v3d_compile fields to avoid confusion with the qinst fields. - Assert that a sequence's start instruction is not the same as the end. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9304>	2021-03-02 07:56:00 +01:00
Eric Anholt	60573b443b	v3d: Replace driver lowering of GL_CLAMP with mesa/st's. Mesa core can do this logic for us now. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9228>	2021-02-24 18:03:46 +00:00
Iago Toral Quiroga	54c17e45ae	broadcom/compiler: skip unnecessary unifa writes If a new UBO load happens to read exactly at the offset right after the previous UBO load (something that is fairly common, for example when reading a matrix), we can skip the unifa write (with its 3 delay slots) and just continue to call ldunifa to continue reading consecutive addresses. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9128>	2021-02-23 08:08:01 +00:00
Iago Toral Quiroga	e1cf2406da	broadcom/compiler: add a constant alu optimization pass Currently this is useful to clean up after DCEing leading ldunifa instructions, but it can be expanded to handle more cases which may allow to simplify the compiler code in places where we have been trying to optimize manually for similar cases. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9128>	2021-02-23 08:08:01 +00:00
Iago Toral Quiroga	14af7b3085	broadcom/compiler: don't emit redundant ldunif If we emit a new uniform and that uniform has already been emitted in the same block we can just reuse that. There is a balancing game here between reducing ldunif instructions and not increasing register pressure too much though, so we put a limit to how far back we are willing to look for a previous definition of the uniform. Based on shader-db results, 20 instructions produces best results. total instructions in shared programs: 14928266 -> 14907432 (-0.14%) instructions in affected programs: 6431841 -> 6411007 (-0.32%) helped: 15270 HURT: 10772 Instructions are helped. total uniforms in shared programs: 3944672 -> 3840276 (-2.65%) uniforms in affected programs: 1827184 -> 1722788 (-5.71%) helped: 30423 HURT: 845 Uniforms are helped. total inst-and-stalls in shared programs: 14957813 -> 14936873 (-0.14%) inst-and-stalls in affected programs: 6475349 -> 6454409 (-0.32%) helped: 15287 HURT: 10852 Inst-and-stalls are helped. v2 (Eric): - consider ldunifrf too - check that no other instruction writes to the register Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9077>	2021-02-17 09:01:01 +01:00
Iago Toral Quiroga	f85fcaa494	broadcom/compiler: pass a devinfo to check if an instruction writes to TMU V3D 3.x has V3D_QPU_WADDR_TMU which in V3D 4.x is V3D_QPU_WADDR_UNIFA (which isn't a TMU write address). This change passes a devinfo to any functions that need to do these checks so we can account for the target V3D version correctly. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8980>	2021-02-12 08:24:21 +00:00
Arcady Goldmints-Orlov	9909fe6bac	broadcom/compiler: Skip bool_to_cond where possible This change keeps track of when a boolean temp is loaded into the flags by a comparison instruction and uses that information to skip emitting instructions to set the flags in ntq_emit_bool_to_cond when the flags already have the right contents. total instructions in shared programs: 11116502 -> 11112225 (-0.04%) instructions in affected programs: 631691 -> 627414 (-0.68%) helped: 1591 HURT: 754 helped stats (abs) min: 1 max: 94 x̄: 4.14 x̃: 3 helped stats (rel) min: 0.11% max: 13.46% x̄: 2.10% x̃: 1.58% HURT stats (abs) min: 1 max: 19 x̄: 3.07 x̃: 2 HURT stats (rel) min: 0.13% max: 19.67% x̄: 1.88% x̃: 1.15% 95% mean confidence interval for instructions value: -2.02 -1.63 95% mean confidence interval for instructions %-change: -0.94% -0.71% Instructions are helped. total uniforms in shared programs: 3281555 -> 3281513 (<.01%) uniforms in affected programs: 1754 -> 1712 (-2.39%) helped: 10 HURT: 5 helped stats (abs) min: 1 max: 19 x̄: 7.90 x̃: 5 helped stats (rel) min: 0.56% max: 11.11% x̄: 7.37% x̃: 11.05% HURT stats (abs) min: 1 max: 15 x̄: 7.40 x̃: 3 HURT stats (rel) min: 0.64% max: 9.55% x̄: 5.31% x̃: 3.41% 95% mean confidence interval for uniforms value: -8.57 2.97 95% mean confidence interval for uniforms %-change: -7.35% 1.07% Inconclusive result (value mean confidence interval includes 0). total max-temps in shared programs: 1758419 -> 1758174 (-0.01%) max-temps in affected programs: 7006 -> 6761 (-3.50%) helped: 290 HURT: 14 helped stats (abs) min: 1 max: 8 x̄: 1.13 x̃: 1 helped stats (rel) min: 0.79% max: 22.86% x̄: 6.61% x̃: 4.88% HURT stats (abs) min: 1 max: 13 x̄: 6.00 x̃: 3 HURT stats (rel) min: 1.54% max: 54.17% x̄: 23.99% x̃: 9.12% 95% mean confidence interval for max-temps value: -1.03 -0.58 95% mean confidence interval for max-temps %-change: -6.24% -4.16% Max-temps are helped. total sfu-stalls in shared programs: 23676 -> 23610 (-0.28%) sfu-stalls in affected programs: 1578 -> 1512 (-4.18%) helped: 257 HURT: 252 helped stats (abs) min: 1 max: 3 x̄: 1.37 x̃: 1 helped stats (rel) min: 11.11% max: 100.00% x̄: 46.70% x̃: 40.00% HURT stats (abs) min: 1 max: 2 x̄: 1.14 x̃: 1 HURT stats (rel) min: 0.00% max: 200.00% x̄: 41.65% x̃: 25.00% 95% mean confidence interval for sfu-stalls value: -0.25 -0.01 95% mean confidence interval for sfu-stalls %-change: -8.24% 2.33% Inconclusive result (%-change mean confidence interval includes 0). total inst-and-stalls in shared programs: 11140178 -> 11135835 (-0.04%) inst-and-stalls in affected programs: 633972 -> 629629 (-0.69%) helped: 1581 HURT: 755 helped stats (abs) min: 1 max: 94 x̄: 4.26 x̃: 3 helped stats (rel) min: 0.11% max: 13.46% x̄: 2.12% x̃: 1.59% HURT stats (abs) min: 1 max: 17 x̄: 3.17 x̃: 2 HURT stats (rel) min: 0.05% max: 19.67% x̄: 1.93% x̃: 1.20% 95% mean confidence interval for inst-and-stalls value: -2.06 -1.66 95% mean confidence interval for inst-and-stalls %-change: -0.93% -0.70% Inst-and-stalls are helped. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8933>	2021-02-12 07:05:33 +00:00
Arcady Goldmints-Orlov	8762f29e9c	broadcom/compiler: Add a v3d_compile argument to vir_set_[pu]f Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8933>	2021-02-12 07:05:33 +00:00
Eric Anholt	bcb5f9f94a	v3d: Stop advertising support for flat shading. The GL frontend can lower this weird GL feature away for us. This should fix redeclaration of the gl_Color/SecondaryColor as centroid, since that case had been missed in the !flat special case here. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8601>	2021-02-09 20:06:48 -08:00
Eric Anholt	ff805f8ac7	v3d: Stop advertising support for PIPE_CAP_*_COLOR_CLAMPED. The GL frontend can lower away this deprecated GL feature for us. Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8601>	2021-02-09 20:06:48 -08:00
Eric Anholt	2992dc7386	v3d: Stop advertising support for PIPE_CAP_TWO_SIDED_COLOR. The GL frontend can lower away this deprecated GL feature for us. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8601>	2021-02-09 20:06:48 -08:00
Eric Anholt	5ddc2f916f	v3d: Clean up vestiges of alpha test lowering. We had an unnecessary case in our uniforms upload switch statement, since we no longer advertise the cap. Fixes: 8ad931808e ("v3d: do not report alpha-test as supported") Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8601>	2021-02-09 20:06:48 -08:00
Iago Toral Quiroga	6630825dcf	broadcom/compiler: let QPUs stall on TMU input/config overflows We have been trying to avoid this by tracking fifo usages in the driver and flushing all outstanding TMU sequences if we overflowed any of these, however, this is actually not the most efficient strategy. Instead, we would like to flush only enough operations to get things going again, which is better for pipelining. Doing that in the driver would require some additional work, but thankfully, it is not required, since this seems to be what the hardware does automatically, so we can just remove overflow tracking for these two fifos and enjoy the benefits. This also further improves shader-db stats: total instructions in shared programs: 8975062 -> 8955145 (-0.22%) instructions in affected programs: 1637624 -> 1617707 (-1.22%) helped: 4050 HURT: 2241 Instructions are helped. total threads in shared programs: 236802 -> 237042 (0.10%) threads in affected programs: 252 -> 492 (95.24%) helped: 122 HURT: 2 Threads are helped. total sfu-stalls in shared programs: 19901 -> 19592 (-1.55%) sfu-stalls in affected programs: 4744 -> 4435 (-6.51%) helped: 1248 HURT: 1051 Sfu-stalls are helped. total inst-and-stalls in shared programs: 8994963 -> 8974737 (-0.22%) inst-and-stalls in affected programs: 1636184 -> 1615958 (-1.24%) helped: 4050 HURT: 2239 Inst-and-stalls are helped. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8825>	2021-02-04 10:33:10 +00:00
Iago Toral Quiroga	e18d6bbf2f	broadcom/compiler: disable TMU pipelining if we fail to register allocate TMU pipelining can severely reduce our capacity to emit TMU spills, causing us to fail to compile a shader we may otherwise be able to compile. This is because pipelining extends the liveness of TMU sequences by postponing the thread switch and LDTMU until a result is needed, and we can't emit TMU spills while in the middle of a TMU sequence. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8825>	2021-02-04 10:33:10 +00:00
Iago Toral Quiroga	be45960d3e	broadcom/compiler: support pipelining of tex instructions This follows the same idea as for TMU general instructions of reusing the existing infrastructure to first count required register writes and flush outstanding TMU dependencies, and then emit the actual writes, which requires that we split the code that decides about register writes to a helper. We also need to start using a component mask instead of the number of components that we need to read with a particular TMU operation. v2: update tmu_writes for V3D_QPU_WADDR_TMUOFF Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8825>	2021-02-04 10:33:10 +00:00
Iago Toral Quiroga	197090a3fc	broadcom/compiler: implement pipelining for general TMU operations This creates the basic infrastructure to implement TMU pipelining and applies it to general TMU. Follow-up patches will expand this to texture and image load/store operations. TMU pipelining means that we don't immediately end TMU sequences, and instead, we postpone the thread switch and LDTMU (for loads) or TMUWT (for stores) until we really need to do them. For loads, we may need to flush them if another instruction reads the result of a load operation. We can detect this because in that case ntq_get_src() will not find the definition for that ssa/reg (since we have not emitted the LDTMU instructions for it yet), so when that happens, we flush all pending TMU operations and then try again to find the definition for the source. We also need to flush pending TMU operations when we reach the end of a control flow block, to prevent the case where we emit a TMU operation in a block, but then we read the result in another block possibly under control flow. It is also required to flush across barriers and discards to honor their semantics. Since this change doesn't implement pipelining for texture and image load/store, we also need to flush outstanding TMU operations if we ever have to emit one of these. This will be corrected with follow-up patches. Finally, the TMU has 3 fifos where it can queue TMU operations. These fifos have limited capacity, depending on the number of threads used to compile the shader, so we also need to ensure that we don't have too many outstanding TMU requests and flush pending TMU operations if a new TMU operation would overflow any of these fifos. While overflowing the Input and Config fifos only leads to stalls (which we want to avoid anyway), overflowing the Output fifo is incorrect and would end up with a broken shader. This means that we need to know how many TMU register writes are required to emit a TMU operation and use that information to decide if we need to flush pending TMU operations before we emit any register writes for the new TMU operation. v2: fix TMU flushing for NIR registers reads (jasuarez) Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8825>	2021-02-04 10:33:10 +00:00
Arcady Goldmints-Orlov	79bde75131	broadcom/compiler: Emit uniform loops using uniform control flow Similarly to if statements, uniform loops are now emitted without predication, using simple branches for breaks and continues. The uniformity of the loop is determined by running the nir_divergence_analysis pass. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7726>	2021-02-01 08:11:48 +00:00
Alejandro Piñeiro	429c336412	broadcom/compiler: separate texture/sampler info from v3d_key So far the v3d compiler has them combined, as for OpenGL both are the same. This change is intended to fit the v3d compiler better with Vulkan, where they are separate concepts. Note that NIR has had them separate for a long time, both on nir_variable and on some NIR lowerings. v2: (from Iago feedback) * Use key->num_tex/sampler_used to iterate through the array * Fill up num_samplers_used on v3d, assert that it is the same as num_tex_used if possible. v3: (Iago) * Assert num_tex/samplers_used is smaller than tex/sampler array size. v4: Update assert mentioned on v3 to use <= instead of < (detected by CI) Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7545>	2020-11-14 15:59:02 +00:00
Juan A. Suarez Romero	1e723745dd	v3d/compiler: extend swapping R/B support to all vertex attributes So far, R/B swapping in vertex attributes was supported only for the generic attributes. But there are cases like glSecondaryColorPointer() supporting BGRA formats that require the R/B swapping to be also allowed in the non-generic vertex attributes (in this case, in the COLOR1 attribute). v2: - Don't split line (Iago) Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7196>	2020-11-05 12:15:28 +00:00
Alejandro Piñeiro	09b2bd1df9	broadcom/compiler: remove v3d_fs_key depth_enabled field. It is not used right now, so keeping it adds some noise/confusion. So far, configuring the Z test is done through the CFG_BITS. See v3dX(emit_state) at v3dx_emit.c for v3d, and pack_cfg_bits at v3dv_pipeline.c for v3dv. There, flags like z_updates_enable and others are filled in. That key field seems like a leftover from using vc4 as a reference, as that driver defines and uses a field with the same name. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7421>	2020-11-03 10:55:08 +00:00
Iago Toral Quiroga	3ec165bce9	broadcom/compiler: track partially interpolated fragment inputs We will need these to implement GLSL's interpolateAt*() functions where we are required to perform interpolation in the shader at arbitrary offsets. Acked-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7155>	2020-10-15 02:04:04 +02:00
Iago Toral Quiroga	7eb8eb10f6	v3d/compiler: allow to batch spills Some shaders that need to spill hundreds of registers can take very long times to compile as each allocation attempt spills a single register and restarts the allocation process. We can significantly cut down these times if we allow the compiler to spill in batches, which should be possible if we are spilling uniforms, which is in fact the kind of spills that we do first because they have lower cost than TMU spills. Doing this could cause us to slightly over spill in some cases (depending on the chosen batch size) leading to slightly worse performance, so we only enable this behavior after we have started to spill over a certain threshold, at which point we assume that performance won't be good and we want to favor compilation speed instead. v2: - Keep it simple and just try to spill a fixed amount of registers in a batch instead of trying to compute this dynamically based on accumulated spills and current register pressure. (Eric). v3: - Check if the node is valid before doing anything with it. - Drop the environment variable to select batch size and just fix it to 20. With this we can take this CTS test from 35 minutes down to about 3 minutes: dEQP-VK.ssbo.layout.random.all_shared_buffer.5 Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:33 +00:00
Iago Toral Quiroga	23c727dd67	v3d/compiler: add a lowering pass for robust buffer access Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:33 +00:00
Iago Toral Quiroga	4401dde0e9	broadcom/compiler: rename QUNIFORM_GET_BUFFER_SIZE to QUNIFORM_GET_SSBO_SIZE Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:33 +00:00
Iago Toral Quiroga	d93d903a37	v3d/compiler: implement nir_intrinsic_get_ubo_size Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:33 +00:00
Iago Toral Quiroga	644a15e69e	v3dv: implement nir_texop_texture_samples Fixes: dEQP-VK.glsl.texture_functions.query.texturesamples.* Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:32 +00:00
Iago Toral Quiroga	1c4c7d95f7	broadcom/compiler: track if the fragment shader forces per-sample MSAA For example, regarding gl_SampleID, the GLSL spec states: "Any static use of this variable in a fragment shader causes the entire shader to be evaluated per-sample." So we need to track if the fragment shader does anything that implicitly enables per-sample shading in the compiler for the driver to auto-enable sample rate shading if needed. v2: - Instead of tracking reads of gl_SampleID, check SYSTEM_BIT_SAMPLE_ID and SYSTEM_BIT_SAMPLE_POS as well as the sample layout qualifier like other drivers are doing to activate this behavior (Eric). Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> (v1) Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:32 +00:00
Iago Toral Quiroga	5a2ef59963	v3d/compiler: support swapping R/B channels in vertex attributes. We will need this in Vulkan to support vertex format VK_FORMAT_B8G8R8A8_UNORM. The hardware doesn't allow to swizzle vertex attribute components, so we need to do it in the shader. v2: - Use nir_intrinsic_io_semantics() to retrieve the location instead of looping through the shader input variables (Eric). - Assert that we only have one component (Eric). Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> (v1) Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:31 +00:00
Iago Toral Quiroga	f41857eb48	v3d/compiler: implement nir_intrinsic_load_base_instance Vulkan lowers gl_InstanceIndex to load_base_instance + load_instance_id, so we need to implement loading the base instance in the compiler. The base instance is set by the BASE_VERTEX_BASE_INSTANCE command right before the instanced draw call and it is included in the VPM payload together with the InstanceID and VertexID if this is requested by the shader record. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:29 +00:00
Iago Toral Quiroga	1f41a128e0	v3d/compiler: implement nir_op_fquantize2f16 Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:28 +00:00
Alejandro Piñeiro	c8212731e7	v3d/compiler: handle GL/Vulkan differences in uniform handling This also adds a v3d_execution_environment, so the compiler can know whether it is generating code for OpenGL or Vulkan. Reviewed-by: Iago Toral <itoral@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:27 +00:00
Alejandro Piñeiro	62ca997476	v3d/compiler: num_tex_used on v3d_key We would need on OpenGL to update values for all the textures used. On OpenGL that value can always be taken from the context or the NIR shader, but there are cases on Vulkan where that is not possible, or where it would force us to recompute it. Acked-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6766>	2020-10-13 21:21:25 +00:00
Alejandro Piñeiro	8de380d26a	broadcom/compiler: add V3D_DEBUG_RA option To ask to debug a register allocation failure (V3D_DEBUG_REGISTER_ALLOCATION seemed too long to me). When a fallback register allocation algorithm was added, if the register allocation fails, it only dumps the current vir with the register pressure info with the failed fallback. But if we want to debug the problem, we would be interested in both. Additionally, it was strange that we got the full vir dump with the failure even if no debug option was set. Additionally, we add shaderdb-like stats for those failures, to make it easier to compare one and the other. v2: keep a small warning message in case both register allocation algorithms fail (Neil) Reviewed-by: Neil Roberts <nroberts@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6999>	2020-10-07 20:21:17 +00:00
Arcady Goldmints-Orlov	bd87cdad18	broadcom/compiler: support nir_intrinsic_load_sample_id This adds support for the intrinsic as well as the vir_SAMPID instruction that corresponds to it in vir. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Reviewed-by: Eric Anholt <eric@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6356>	2020-08-20 20:14:14 +00:00


152 Commits