docs/panfrost: Move description of instancing

Connor Abbott wrote a nice explanation of how instance divisors work on Mali. Let's add it to the driver docs instead of letting it languish in a forgotten header file. This is mostly pasted from the existing header in tree, with a few local changes applied.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20445>
2022-12-28 15:26:45 -05:00
parent 07b43d6231
commit e0752673be
2 changed files with 111 additions and 123 deletions
--- a/docs/drivers/panfrost.rst
+++ b/docs/drivers/panfrost.rst
@@ -175,3 +175,114 @@ Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
 should be used instead where possible. However, not all formats are
 compressible, so u-interleaved tiling remains an important fallback on Panfrost.
 Instancing
 ----------
 The attribute descriptor lets the attribute unit compute the address of an
 attribute given the vertex and instance ID. Unfortunately, the way this works is
 rather complicated when instancing is enabled.
 To explain this, first we need to explain how compute and vertex threads are
 dispatched.  When a quad is dispatched, it receives a single, linear index.
 However, we need to translate that index into a (vertex id, instance id) pair.
 One option would be to do:
 .. math::
   \text{vertex id} = \text{linear id} \% \text{num vertices}
   \text{instance id} = \text{linear id} / \text{num vertices}
 but this involves a costly division and modulus by an arbitrary number.
 Instead, we could pad num_vertices. We dispatch padded_num_vertices *
 num_instances threads instead of num_vertices * num_instances, which results
 in some "extra" threads with vertex_id >= num_vertices, which we have to
 discard.  The more we pad num_vertices, the more "wasted" threads we
 dispatch, but the division is potentially easier.
 One straightforward choice is to pad num_vertices to the next power of two,
 which means that the division and modulus are just simple bit shifts and
 masking. But the actual algorithm is a bit more complicated. The thread
 dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
 to dividing by a power of two. As a result, padded_num_vertices can be
 1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
 since we need less padding.
 padded_num_vertices is picked by the hardware. The driver just specifies the
 actual number of vertices. Note that padded_num_vertices is a multiple of four
 (presumably because threads are dispatched in groups of 4). Also,
 padded_num_vertices is always at least one more than num_vertices, which seems
 like a quirk of the hardware. For larger num_vertices, the hardware uses the
 following algorithm: using the binary representation of num_vertices, we look at
 the most significant set bit as well as the following 3 bits. Let n be the
 number of bits after those 4 bits. Then we set padded_num_vertices according to
 the following table:
 ==========  =======================
 high bits   padded_num_vertices
 ==========  =======================
 1000        :math:`9 \cdot 2^n`
 1001        :math:`5 \cdot 2^{n+1}`
 101x        :math:`3 \cdot 2^{n+2}`
 110x        :math:`7 \cdot 2^{n+1}`
 111x        :math:`2^{n+4}`
 ==========  =======================
 For example, if num_vertices = 70 is passed to glDraw(), its binary
 representation is 1000110, so n = 3 and the high bits are 1000, and
 therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
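The table lookup above can be sketched in C as follows. This is only an illustration of the table, not the driver's code: ``padded_from_table`` is a hypothetical name, and the sketch assumes num_vertices is at least 8 so that four high bits exist. The text only says the table applies to "larger" counts, so smaller counts (and the multiple-of-four rule for them) are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a Mesa function: applies the table literally.
 * Assumes num_vertices >= 8; smaller counts are not covered by the table
 * and are not modelled here. */
static uint64_t
padded_from_table(uint32_t num_vertices)
{
   assert(num_vertices >= 8);

   /* Position of the most significant set bit */
   unsigned msb = 31;
   while (!(num_vertices >> msb))
      msb--;

   /* n = number of bits below the four high bits */
   unsigned n = msb - 3;
   unsigned high = num_vertices >> n; /* always in [0x8, 0xF] */

   switch (high) {
   case 0x8:
      return 9ull << n;
   case 0x9:
      return 5ull << (n + 1);
   case 0xA:
   case 0xB:
      return 3ull << (n + 2);
   case 0xC:
   case 0xD:
      return 7ull << (n + 1);
   default: /* 0xE, 0xF */
      return 1ull << (n + 4);
   }
}
```

Calling ``padded_from_table(70)`` yields 72, matching the example above. The return type is 64-bit only so that the ``111x`` row cannot overflow for very large inputs.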
 The attribute unit works in terms of the original linear_id. If
 num_instances = 1, then they are the same, and everything is simple.
 However, with instancing things get more complicated. There are four
 possible modes, two of which we can group together:
 1. Use the linear_id directly. Only used when there is no instancing.
 2. Use the linear_id modulo a constant. This is used for per-vertex
 attributes with instancing enabled by making the constant equal
 padded_num_vertices. Because the modulus is always padded_num_vertices, this
 mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
 The shift field specifies the power of two, while the extra_flags field
 specifies the odd number. If shift = n and extra_flags = m, then the modulus
 is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
 computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
 extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
 algorithm used to get padded_num_vertices in order to correctly implement
 per-vertex attributes.
 3. Divide the linear_id by a constant. In order to correctly implement
 instance divisors, we have to divide linear_id by padded_num_vertices times
 the user-specified divisor. So first we compute padded_num_vertices, again
 following the exact same algorithm that the hardware uses, then multiply it
 by the GL-level divisor to get the hardware-level divisor. This case is
 further divided into two more cases. If the hardware-level divisor is a
 power of two, then we just need to shift. The shift amount is specified by
 the shift field, so that the hardware-level divisor is just 2^shift.
 If it isn't a power of two, then we have to divide by an arbitrary integer.
 For that, we use the well-known technique of multiplying by an approximation
 of the inverse. The driver must compute the magic multiplier and shift
 amount, and then the hardware does the multiplication and shift. The
 hardware and driver also use the "round-down" optimization as described in
 http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
 The hardware further assumes the multiplier is between 2^31 and 2^32, so the
 high bit is implicitly set to 1 even though it is set to 0 by the driver --
 presumably this simplifies the hardware multiplier a little. The hardware
 first multiplies linear_id by the multiplier and takes the high 32 bits,
 then applies the round-down correction if extra_flags = 1, then finally
 shifts right by the shift field.
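Returning to the modulus mode (mode 2), the relationship between padded_num_vertices and the shift and extra_flags fields can be sketched as below. The struct and function names are hypothetical and chosen for illustration; only the formula :math:`(2m + 1) \cdot 2^n` comes from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two fields described in the text */
struct modulus_fields {
   unsigned shift;       /* power of two: n */
   unsigned extra_flags; /* encodes the odd factor: m, where odd = 2m + 1 */
};

/* Split a padded vertex count into (shift, extra_flags).
 * Assumes the input is 1, 3, 5, 7, or 9 times a power of two. */
static struct modulus_fields
encode_modulus(uint32_t padded_num_vertices)
{
   struct modulus_fields f = {0, 0};

   assert(padded_num_vertices != 0);

   /* Strip factors of two to find the shift */
   while (!(padded_num_vertices & 1)) {
      padded_num_vertices >>= 1;
      f.shift++;
   }

   /* What remains is the odd factor, which must be at most 9 */
   assert(padded_num_vertices <= 9);
   f.extra_flags = padded_num_vertices >> 1;

   return f;
}

/* Inverse mapping: modulus = (2m + 1) * 2^n */
static uint32_t
decode_modulus(struct modulus_fields f)
{
   return (2 * f.extra_flags + 1) << f.shift;
}
```

For padded_num_vertices = 72 this gives shift = 3 and extra_flags = 4, as in the example for mode 2.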
 There are some differences between ridiculousfish's algorithm and the Mali
 hardware algorithm, which means that the reference code from ridiculousfish
 doesn't always produce the right constants. Mali does not use the pre-shift
 optimization, since that would make a hardware implementation slower (it
 would have to always do the pre-shift, multiply, and post-shift operations).
 It also forces the multiplier to be at least 2^31, which means that the
 exponent is entirely fixed, so there is no trial-and-error. Altogether,
 given the divisor d, the algorithm the driver must follow is:
 1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
 2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} \% d`.
 3. If :math:`e \le 2^{shift}`, then we need to use the round-down algorithm. Set
    magic_divisor = m - 1 and extra_flags = 1.
 4. Otherwise, set magic_divisor = m and extra_flags = 0.
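The four steps can be sketched in C as below, together with a software model of the hardware side for checking the constants. Both function names are hypothetical. The model's handling of the round-down correction (adding the multiplier once to the 64-bit product, which is equivalent to incrementing the dividend) is an interpretation based on the ridiculousfish paper rather than something confirmed for the hardware. The sketch assumes d is not a power of two, since that case uses a plain shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical driver-side helper following steps 1-4.
 * Returns the magic divisor with its top bit cleared, since the text says
 * the hardware sets that bit implicitly. */
static uint32_t
compute_magic_divisor(uint32_t d, unsigned *shift, unsigned *extra_flags)
{
   assert(d >= 3 && (d & (d - 1)) != 0);

   /* Step 1: shift = floor(log2(d)) */
   unsigned s = 0;
   while ((d >> (s + 1)) != 0)
      s++;

   /* Step 2: m = ceil(2^(shift + 32) / d), e = 2^(shift + 32) % d */
   uint64_t t = 1ull << (s + 32);
   uint64_t m = (t + d - 1) / d;
   uint64_t e = t % d;

   /* Steps 3 and 4: choose round-down or round-up */
   uint64_t magic = m;
   *extra_flags = 0;
   if (e <= (1ull << s)) {
      magic = m - 1;
      *extra_flags = 1;
   }

   /* The multiplier lies in [2^31, 2^32), so the top bit is always set */
   assert(magic & (1ull << 31));
   *shift = s;
   return (uint32_t)(magic & ~(1ull << 31));
}

/* Software model of the hardware divide (an assumption, see lead-in) */
static uint32_t
emulate_hw_divide(uint32_t linear_id, uint32_t magic_field, unsigned shift,
                  unsigned extra_flags)
{
   /* Restore the implicit top bit */
   uint64_t multiplier = (uint64_t)magic_field | (1ull << 31);
   uint64_t product = (uint64_t)linear_id * multiplier;

   /* Round-down correction: equivalent to using linear_id + 1 */
   if (extra_flags)
      product += multiplier;

   /* Take the high 32 bits, then shift right by the shift field */
   return (uint32_t)((product >> 32) >> shift);
}
```

For d = 3 this produces shift = 1, extra_flags = 1, and a stored magic divisor of 0x2AAAAAAA (0xAAAAAAAA once the implicit top bit is restored). Comparing ``emulate_hw_divide`` against ordinary integer division over a range of inputs is a cheap way to sanity-check the constants.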
--- a/src/panfrost/include/panfrost-job.h
+++ b/src/panfrost/include/panfrost-job.h
@@ -42,129 +42,6 @@ typedef uint64_t mali_ptr;
 #define MALI_EXTRACT_TYPE(fmt) ((fmt)&0xe0)
 #define MALI_EXTRACT_INDEX(pixfmt) (((pixfmt) >> 12) & 0xFF)
 /*
 * Mali Attributes
 *
 * This structure lets the attribute unit compute the address of an attribute
 * given the vertex and instance ID. Unfortunately, the way this works is
 * rather complicated when instancing is enabled.
 *
 * To explain this, first we need to explain how compute and vertex threads are
 * dispatched. This is a guess (although a pretty firm guess!) since the
 * details are mostly hidden from the driver, except for attribute instancing.
 * When a quad is dispatched, it receives a single, linear index. However, we
 * need to translate that index into a (vertex id, instance id) pair, or a
 * (local id x, local id y, local id z) triple for compute shaders (although
 * vertex shaders and compute shaders are handled almost identically).
 * Focusing on vertex shaders, one option would be to do:
 *
 * vertex_id = linear_id % num_vertices
 * instance_id = linear_id / num_vertices
 *
 * but this involves a costly division and modulus by an arbitrary number.
 * Instead, we could pad num_vertices. We dispatch padded_num_vertices *
 * num_instances threads instead of num_vertices * num_instances, which results
 * in some "extra" threads with vertex_id >= num_vertices, which we have to
 * discard.  The more we pad num_vertices, the more "wasted" threads we
 * dispatch, but the division is potentially easier.
 *
 * One straightforward choice is to pad num_vertices to the next power of two,
 * which means that the division and modulus are just simple bit shifts and
 * masking. But the actual algorithm is a bit more complicated. The thread
 * dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
 * to dividing by a power of two. This is possibly using the technique
 * described in patent US20170010862A1. As a result, padded_num_vertices can be
 * 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
 * since we need less padding.
 *
 * padded_num_vertices is picked by the hardware. The driver just specifies the
 * actual number of vertices. At least for Mali G71, the first few cases are
 * given by:
 *
 * num_vertices	| padded_num_vertices
 * 3		| 4
 * 4-7		| 8
 * 8-11		| 12 (3 * 4)
 * 12-15	| 16
 * 16-19	| 20 (5 * 4)
 *
 * Note that padded_num_vertices is a multiple of four (presumably because
 * threads are dispatched in groups of 4). Also, padded_num_vertices is always
 * at least one more than num_vertices, which seems like a quirk of the
 * hardware. For larger num_vertices, the hardware uses the following
 * algorithm: using the binary representation of num_vertices, we look at the
 * most significant set bit as well as the following 3 bits. Let n be the
 * number of bits after those 4 bits. Then we set padded_num_vertices according
 * to the following table:
 *
 * high bits	| padded_num_vertices
 * 1000		| 9 * 2^n
 * 1001		| 5 * 2^(n+1)
 * 101x		| 3 * 2^(n+2)
 * 110x		| 7 * 2^(n+1)
 * 111x		| 2^(n+4)
 *
 * For example, if num_vertices = 70 is passed to glDraw(), its binary
 * representation is 1000110, so n = 3 and the high bits are 1000, and
 * therefore padded_num_vertices = 9 * 2^3 = 72.
 *
 * The attribute unit works in terms of the original linear_id. if
 * num_instances = 1, then they are the same, and everything is simple.
 * However, with instancing things get more complicated. There are four
 * possible modes, two of them we can group together:
 *
 * 1. Use the linear_id directly. Only used when there is no instancing.
 *
 * 2. Use the linear_id modulo a constant. This is used for per-vertex
 * attributes with instancing enabled by making the constant equal
 * padded_num_vertices. Because the modulus is always padded_num_vertices, this
 * mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
 * The shift field specifies the power of two, while the extra_flags field
 * specifies the odd number. If shift = n and extra_flags = m, then the modulus
 * is (2m + 1) * 2^n. As an example, if num_vertices = 70, then as computed
 * above, padded_num_vertices = 9 * 2^3, so we should set extra_flags = 4 and
 * shift = 3. Note that we must exactly follow the hardware algorithm used to
 * get padded_num_vertices in order to correctly implement per-vertex
 * attributes.
 *
 * 3. Divide the linear_id by a constant. In order to correctly implement
 * instance divisors, we have to divide linear_id by padded_num_vertices times
 * to user-specified divisor. So first we compute padded_num_vertices, again
 * following the exact same algorithm that the hardware uses, then multiply it
 * by the GL-level divisor to get the hardware-level divisor. This case is
 * further divided into two more cases. If the hardware-level divisor is a
 * power of two, then we just need to shift. The shift amount is specified by
 * the shift field, so that the hardware-level divisor is just 2^shift.
 *
 * If it isn't a power of two, then we have to divide by an arbitrary integer.
 * For that, we use the well-known technique of multiplying by an approximation
 * of the inverse. The driver must compute the magic multiplier and shift
 * amount, and then the hardware does the multiplication and shift. The
 * hardware and driver also use the "round-down" optimization as described in
 * http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
 * The hardware further assumes the multiplier is between 2^31 and 2^32, so the
 * high bit is implicitly set to 1 even though it is set to 0 by the driver --
 * presumably this simplifies the hardware multiplier a little. The hardware
 * first multiplies linear_id by the multiplier and takes the high 32 bits,
 * then applies the round-down correction if extra_flags = 1, then finally
 * shifts right by the shift field.
 *
 * There are some differences between ridiculousfish's algorithm and the Mali
 * hardware algorithm, which means that the reference code from ridiculousfish
 * doesn't always produce the right constants. Mali does not use the pre-shift
 * optimization, since that would make a hardware implementation slower (it
 * would have to always do the pre-shift, multiply, and post-shift operations).
 * It also forces the multplier to be at least 2^31, which means that the
 * exponent is entirely fixed, so there is no trial-and-error. Altogether,
 * given the divisor d, the algorithm the driver must follow is:
 *
 * 1. Set shift = floor(log2(d)).
 * 2. Compute m = ceil(2^(shift + 32) / d) and e = 2^(shift + 32) % d.
 * 3. If e <= 2^shift, then we need to use the round-down algorithm. Set
 * magic_divisor = m - 1 and extra_flags = 1.
 * 4. Otherwise, set magic_divisor = m and extra_flags = 0.
 */
 /* Purposeful off-by-one in width, height fields. For example, a (64, 64)
 * texture is stored as (63, 63) in these fields. This adjusts for that.
 * There's an identical pattern in the framebuffer descriptor. Even vertex