radeonsi: rewrite the prefix sum computation for shader culling

Instead of storing the vertex mask per wave into LDS and then computing the prefix sum, store 8-bit bitcounts (vertex counts) of the vertex masks into LDS. This allows us to compute the sum using v_sad_u8, which sums the 4 components of an i8vec4 in one instruction.

Each i8vec4 of vertex counts is loaded in parallel threads (one dword per thread) instead of all being loaded in thread 0, and readlane copies them to SGPRs instead of readfirstlane.

LDS is no longer initialized before culling. Instead, the counts for inactive waves are masked with AND later.

Incorrect old comments are also fixed.

This change removes 80 bytes from the code size and allows increasing the workgroup size from 128 to 256, which is the main motivation for it.

Changing the workgroup size with wave64 now has no effect on the code size. Switching to wave32 with 8 waves even generates slightly smaller code than wave64 with 4 waves.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10813>
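
The byte-packed summation described in the commit message can be modeled on the CPU roughly as follows. This is only an illustrative sketch, not the LLVM IR builder code from this commit: `sad_u8` is a hypothetical software model of what v_sad_u8 does when its second operand is 0, and `count_before_wave`, the little-endian byte packing, and the serial loop over dwords are assumptions made for the example (the shader loads one dword per thread and uses readlane instead of a loop).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical software model of v_sad_u8: the sum of absolute differences
 * of the four bytes of a and b, added to acc. With b == 0 this is simply
 * the sum of the four bytes of a. */
static uint32_t sad_u8(uint32_t a, uint32_t b, uint32_t acc)
{
   for (int i = 0; i < 4; i++) {
      int x = (int)((a >> (8 * i)) & 0xff);
      int y = (int)((b >> (8 * i)) & 0xff);
      acc += (uint32_t)abs(x - y);
   }
   return acc;
}

/* packed[] holds one 8-bit vertex count per wave, four waves per dword
 * (wave 0 in the lowest byte; an assumption of this sketch).
 * Returns the number of vertices in all waves before wave_id. Counts of
 * waves that must not contribute are cleared with AND before summing, so
 * stale bytes in those slots do not matter. */
static uint32_t count_before_wave(const uint32_t *packed, unsigned num_dwords,
                                  unsigned wave_id)
{
   uint32_t sum = 0;

   for (unsigned d = 0; d < num_dwords; d++) {
      unsigned first = d * 4; /* first wave stored in this dword */
      uint32_t mask;

      if (wave_id <= first)
         mask = 0;                 /* no wave of this dword contributes */
      else if (wave_id >= first + 4)
         mask = 0xffffffffu;       /* all four waves contribute */
      else
         mask = (1u << (8 * (wave_id - first))) - 1;

      sum = sad_u8(packed[d] & mask, 0, sum);
   }
   return sum;
}
```

With 8 waves, two dwords hold all counts, so the whole prefix sum takes two masked `sad_u8` steps per wave. An 8-bit slot is sufficient because a wave64 vertex mask has at most 64 bits set.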
2021-05-08 02:41:52 -04:00
parent 27c9e77c6a
commit 13acbaecd8
3 changed files with 133 additions and 151 deletions
--- a/src/amd/llvm/ac_llvm_build.h
+++ b/src/amd/llvm/ac_llvm_build.h
@@ -607,9 +607,6 @@ LLVMValueRef ac_build_main(const struct ac_shader_args *args, struct ac_llvm_con
                           LLVMTypeRef ret_type, LLVMModuleRef module);
 void ac_build_s_endpgm(struct ac_llvm_context *ctx);

-LLVMValueRef ac_prefix_bitcount(struct ac_llvm_context *ctx, LLVMValueRef mask, LLVMValueRef index);
-LLVMValueRef ac_prefix_bitcount_2x64(struct ac_llvm_context *ctx, LLVMValueRef mask[2],
-                                     LLVMValueRef index);
 void ac_build_triangle_strip_indices_to_triangle(struct ac_llvm_context *ctx, LLVMValueRef is_odd,
                                                 LLVMValueRef flatshade_first,
                                                 LLVMValueRef index[3]);