broadcom/compiler: implement pipelining for general TMU operations

This creates the basic infrastructure to implement TMU pipelining and applies it to general TMU operations. Follow-up patches will expand this to texture and image load/store operations.

TMU pipelining means that we don't immediately end TMU sequences, and instead, we postpone the thread switch and LDTMU (for loads) or TMUWT (for stores) until we really need to do them.

For loads, we may need to flush them if another instruction reads the result of a load operation. We can detect this because in that case ntq_get_src() will not find the definition for that ssa/reg (since we have not emitted the LDTMU instructions for it yet), so when that happens, we flush all pending TMU operations and then try again to find the definition for the source.

We also need to flush pending TMU operations when we reach the end of a control flow block, to prevent the case where we emit a TMU operation in a block, but then we read the result in another block possibly under control flow.

It is also required to flush across barriers and discards to honor their semantics.

Since this change doesn't implement pipelining for texture and image load/store, we also need to flush outstanding TMU operations if we ever have to emit one of these. This will be corrected with follow-up patches.

Finally, the TMU has 3 fifos where it can queue TMU operations. These fifos have limited capacity, depending on the number of threads used to compile the shader, so we also need to ensure that we don't have too many outstanding TMU requests and flush pending TMU operations if a new TMU operation would overflow any of these fifos. While overflowing the Input and Config fifos only leads to stalls (which we want to avoid anyway), overflowing the Output fifo is incorrect and would end up with a broken shader. This means that we need to know how many TMU register writes are required to emit a TMU operation and use that information to decide if we need to flush pending TMU operations before we emit any register writes for the new TMU operation.

v2: fix TMU flushing for NIR register reads (jasuarez)

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8825>
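
The fifo accounting described above can be illustrated with a small standalone sketch. This is not the code from this patch: the `sketch_*` names, the `tmu_sketch` struct and the capacity of 8 entries per fifo are assumptions made for the example (the real limits depend on the hardware and on the thread count), and the flush here only resets counters where the compiler would emit the thread switch and LDTMU/TMUWT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capacities, for illustration only. The real limits depend
 * on the hardware and on the number of threads used to compile the shader.
 */
#define SKETCH_INPUT_FIFO_CAP  8
#define SKETCH_CONFIG_FIFO_CAP 8
#define SKETCH_OUTPUT_FIFO_CAP 8

/* Simplified stand-in for the tracking state; not the real v3d_compile. */
struct tmu_sketch {
        uint32_t input_fifo_size;
        uint32_t config_fifo_size;
        uint32_t output_fifo_size;
        uint32_t flush_count;   /* Operations queued since the last flush. */
        uint32_t total_flushes; /* Bookkeeping for this sketch only. */
};

/* Stand-in for a flush: the compiler would end the TMU sequence here
 * (thread switch, then LDTMU or TMUWT). The sketch only resets counters.
 */
static void
sketch_flush_tmu(struct tmu_sketch *t)
{
        if (t->flush_count == 0)
                return;

        t->input_fifo_size = 0;
        t->config_fifo_size = 0;
        t->output_fifo_size = 0;
        t->flush_count = 0;
        t->total_flushes++;
}

/* True if queueing an operation with these fifo requirements would
 * exceed the capacity of any of the three fifos.
 */
static bool
sketch_would_overflow(const struct tmu_sketch *t,
                      uint32_t in, uint32_t cfg, uint32_t out)
{
        return t->input_fifo_size + in > SKETCH_INPUT_FIFO_CAP ||
               t->config_fifo_size + cfg > SKETCH_CONFIG_FIFO_CAP ||
               t->output_fifo_size + out > SKETCH_OUTPUT_FIFO_CAP;
}

/* Queue one TMU operation. The fifo requirements must be known before
 * any register write is emitted, so that the flush (if needed) happens
 * first. Returns true if pending operations had to be flushed.
 */
static bool
sketch_queue_tmu_op(struct tmu_sketch *t,
                    uint32_t in, uint32_t cfg, uint32_t out)
{
        /* A single operation must fit in empty fifos. */
        assert(in <= SKETCH_INPUT_FIFO_CAP);
        assert(cfg <= SKETCH_CONFIG_FIFO_CAP);
        assert(out <= SKETCH_OUTPUT_FIFO_CAP);

        bool flushed = false;
        if (sketch_would_overflow(t, in, cfg, out)) {
                sketch_flush_tmu(t);
                flushed = true;
        }

        t->input_fifo_size += in;
        t->config_fifo_size += cfg;
        t->output_fifo_size += out;
        t->flush_count++;
        return flushed;
}
```

With a zero-initialized `struct tmu_sketch`, two calls to `sketch_queue_tmu_op(&t, 2, 1, 4)` fill the output fifo without flushing, and a third operation that needs even one output entry triggers a flush before it is queued. Checking before emitting anything keeps an operation from being split across a flush.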
2021-01-26 12:18:43 +01:00
parent 0e96f0f8cd
commit 197090a3fc
5 changed files with 390 additions and 126 deletions
--- a/src/broadcom/compiler/v3d_compiler.h
+++ b/src/broadcom/compiler/v3d_compiler.h
@@ -566,6 +566,24 @@ struct v3d_compile {
        struct qinst **defs;
        uint32_t defs_array_size;

+        /* TMU pipelining tracking */
+        struct {
+                /* NIR registers that have been updated with a TMU operation
+                 * that has not been flushed yet.
+                 */
+                struct set *outstanding_regs;
+
+                uint32_t input_fifo_size;
+                uint32_t config_fifo_size;
+                uint32_t output_fifo_size;
+
+                struct {
+                        nir_dest *dest;
+                        uint32_t num_components;
+                } flush[8]; /* 16 entries / 2 threads for input/output fifos */
+                uint32_t flush_count;
+        } tmu;
+
        /**
         * Inputs to the shader, arranged by TGSI declaration order.
         *
@@ -918,6 +936,7 @@ uint8_t vir_channels_written(struct qinst *inst);
 struct qreg ntq_get_src(struct v3d_compile *c, nir_src src, int i);
 void ntq_store_dest(struct v3d_compile *c, nir_dest *dest, int chan,
                    struct qreg result);
+void ntq_flush_tmu(struct v3d_compile *c);
 void vir_emit_thrsw(struct v3d_compile *c);

 void vir_dump(struct v3d_compile *c);