radv: Add scratch stack to reduce LDS stack in RT traversal.

The current stack size is a significant limiter for occupancy, and
hence we need smaller stacks in LDS.

Rhys earlier had a patch that just put the N entries closest to the
root in LDS and the rest in scratch. However, this is not ideal for
performance as most of the activity is happening away from the root,
near the leaves. Of course we can't just switch it around, as the
leaf activity likely isn't happening all the way at the end of the
stack.

So what we do is make the LDS stack kinda a ringbuffer by always
accessing it using the stack index modulo the buffer size (always
a power of two so we can efficiently mask). If we then do not have
free space in this buffer we evict the entries closest to the root
to scratch and if we hit the "bottom" of the LDS space we load from
scratch.

Some rough perf numbers for indication with Q2RTX:

| evicting | LDS entries | perf |
|----------|-------------|------|
|       no |          76 |  55% |
|       no |          32 | 100% |
|       no |          24 | 105% |
|      yes |          32 |  95% |
|      yes |          16 | 100% |
|      yes |           8 |  90% |
|      yes |           4 |  75% |

(For the case with 4 entries we need to do some extra accounting as
 a full batch may not be available to evict)

So an obvious choice is to use a stack of 16 entries.

One might wonder if Q2RTX perf is mainly good due to BVHs with very
little geometry and hence low depth, so I also did some profiling
with control. This is done with RGP instruction timing, so this is
instructions executed not weighted for enabled masks, i.e. divergence
effects included.

| game    | LDS entries | scratch action | fraction of iterations |
|---------|-------------|----------------|------------------------|
| Control |           8 |          store |                  10.3% |
| Control |           8 |          load  |                  34.8% |
| Control |          16 |          store |                  0.58% |
| Control |          16 |          load  |                  2.62% |
| Q2RTX   |          16 |          store |                  1.00% |
| Q2RTX   |          16 |          load  |                  3.07% |

So Q2RTX doesn't seem like an unreasonably good case for this
algorithm.

On the implementation side, we can always place the scratch stack at
address 0 by just reserving the scratch space, and in the case of fixed
callstack size moving that up. In the dynamic case the dynamic stack
base already takes any reserved scratch space into account.

Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18541>
This commit is contained in:
Bas Nieuwenhuizen
2022-09-11 23:31:58 +02:00
committed by Marge Bot
parent 7d26fafacf
commit b2972cf410
2 changed files with 62 additions and 13 deletions

View File

@@ -68,6 +68,8 @@ nir_ssa_def *create_bvh_descriptor(nir_builder *b);
* + 1 instance node. Furthermore, when processing a box node, worst case we actually
* push all 4 children and remove one, so the DFS stack depth is box nodes * 3 + 2.
*/
#define MAX_STACK_ENTRY_COUNT 76
#define MAX_STACK_ENTRY_COUNT 76
#define MAX_STACK_LDS_ENTRY_COUNT 16
#define MAX_STACK_SCRATCH_ENTRY_COUNT (MAX_STACK_ENTRY_COUNT - MAX_STACK_LDS_ENTRY_COUNT)
#endif