
The front-page of the docs is currently fairly intimidating, by diving into details rather abruptly. Let's try to make it a bit easier to navigate t by moving the details to their own articles, but linking them from the front-page. Acked-by: Daniel Stone <daniels@collabora.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28953>
113 lines
6.0 KiB
ReStructuredText
113 lines
6.0 KiB
ReStructuredText
Instancing
|
|
==========
|
|
|
|
The attribute descriptor lets the attribute unit compute the address of an
|
|
attribute given the vertex and instance ID. Unfortunately, the way this works is
|
|
rather complicated when instancing is enabled.
|
|
|
|
To explain this, first we need to explain how compute and vertex threads are
|
|
dispatched. When a quad is dispatched, it receives a single, linear index.
|
|
However, we need to translate that index into a (vertex id, instance id) pair.
|
|
One option would be to do:
|
|
|
|
.. math::
|
|
\text{vertex id} = \text{linear id} \% \text{num vertices}
|
|
|
|
\text{instance id} = \text{linear id} / \text{num vertices}
|
|
|
|
but this involves a costly division and modulus by an arbitrary number.
|
|
Instead, we could pad num_vertices. We dispatch padded_num_vertices *
|
|
num_instances threads instead of num_vertices * num_instances, which results
|
|
in some "extra" threads with vertex_id >= num_vertices, which we have to
|
|
discard. The more we pad num_vertices, the more "wasted" threads we
|
|
dispatch, but the division is potentially easier.
|
|
|
|
One straightforward choice is to pad num_vertices to the next power of two,
|
|
which means that the division and modulus are just simple bit shifts and
|
|
masking. But the actual algorithm is a bit more complicated. The thread
|
|
dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
|
|
to dividing by a power of two. As a result, padded_num_vertices can be
|
|
1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
|
|
since we need less padding.
|
|
|
|
padded_num_vertices is picked by the hardware. The driver just specifies the
|
|
actual number of vertices. Note that padded_num_vertices is a multiple of four
|
|
(presumably because threads are dispatched in groups of 4). Also,
|
|
padded_num_vertices is always at least one more than num_vertices, which seems
|
|
like a quirk of the hardware. For larger num_vertices, the hardware uses the
|
|
following algorithm: using the binary representation of num_vertices, we look at
|
|
the most significant set bit as well as the following 3 bits. Let n be the
|
|
number of bits after those 4 bits. Then we set padded_num_vertices according to
|
|
the following table:
|
|
|
|
========== =======================
|
|
high bits padded_num_vertices
|
|
========== =======================
|
|
1000 :math:`9 \cdot 2^n`
|
|
1001 :math:`5 \cdot 2^{n+1}`
|
|
101x :math:`3 \cdot 2^{n+2}`
|
|
110x :math:`7 \cdot 2^{n+1}`
|
|
111x :math:`2^{n+4}`
|
|
========== =======================
|
|
|
|
For example, if num_vertices = 70 is passed to glDraw(), its binary
|
|
representation is 1000110, so n = 3 and the high bits are 1000, and
|
|
therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
|
|
|
|
The attribute unit works in terms of the original linear_id. if
|
|
num_instances = 1, then they are the same, and everything is simple.
|
|
However, with instancing things get more complicated. There are four
|
|
possible modes, two of them we can group together:
|
|
|
|
1. Use the linear_id directly. Only used when there is no instancing.
|
|
|
|
2. Use the linear_id modulo a constant. This is used for per-vertex
|
|
attributes with instancing enabled by making the constant equal
|
|
padded_num_vertices. Because the modulus is always padded_num_vertices, this
|
|
mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
|
|
The shift field specifies the power of two, while the extra_flags field
|
|
specifies the odd number. If shift = n and extra_flags = m, then the modulus
|
|
is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
|
|
computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
|
|
extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
|
|
algorithm used to get padded_num_vertices in order to correctly implement
|
|
per-vertex attributes.
|
|
|
|
3. Divide the linear_id by a constant. In order to correctly implement
|
|
instance divisors, we have to divide linear_id by padded_num_vertices times
|
|
to user-specified divisor. So first we compute padded_num_vertices, again
|
|
following the exact same algorithm that the hardware uses, then multiply it
|
|
by the GL-level divisor to get the hardware-level divisor. This case is
|
|
further divided into two more cases. If the hardware-level divisor is a
|
|
power of two, then we just need to shift. The shift amount is specified by
|
|
the shift field, so that the hardware-level divisor is just
|
|
:math:`2^\text{shift}`.
|
|
|
|
If it isn't a power of two, then we have to divide by an arbitrary integer.
|
|
For that, we use the well-known technique of multiplying by an approximation
|
|
of the inverse. The driver must compute the magic multiplier and shift
|
|
amount, and then the hardware does the multiplication and shift. The
|
|
hardware and driver also use the "round-down" optimization as described in
|
|
https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
|
|
The hardware further assumes the multiplier is between :math:`2^{31}` and
|
|
:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
|
|
to 0 by the driver -- presumably this simplifies the hardware multiplier a
|
|
little. The hardware first multiplies linear_id by the multiplier and
|
|
takes the high 32 bits, then applies the round-down correction if
|
|
extra_flags = 1, then finally shifts right by the shift field.
|
|
|
|
There are some differences between ridiculousfish's algorithm and the Mali
|
|
hardware algorithm, which means that the reference code from ridiculousfish
|
|
doesn't always produce the right constants. Mali does not use the pre-shift
|
|
optimization, since that would make a hardware implementation slower (it
|
|
would have to always do the pre-shift, multiply, and post-shift operations).
|
|
It also forces the multiplier to be at least :math:`2^{31}`, which means
|
|
that the exponent is entirely fixed, so there is no trial-and-error.
|
|
Altogether, given the divisor d, the algorithm the driver must follow is:
|
|
|
|
1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
|
|
2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`.
|
|
3. If :math:`e <= 2^{shift}`, then we need to use the round-down algorithm. Set
|
|
magic_divisor = m - 1 and extra_flags = 1. 4. Otherwise, set magic_divisor =
|
|
m and extra_flags = 0.
|