From 0277d0321a3649aeacc14898d156b754e7ddf851 Mon Sep 17 00:00:00 2001
From: Erik Faye-Lund
Date: Tue, 25 Jun 2024 20:40:15 +0200
Subject: [PATCH] docs/panfrost: quote identifiers

Part-of:
---
 docs/drivers/panfrost/instancing.rst | 62 ++++++++++++++--------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/docs/drivers/panfrost/instancing.rst b/docs/drivers/panfrost/instancing.rst
index 581484160d6..fcbf9ccd20e 100644
--- a/docs/drivers/panfrost/instancing.rst
+++ b/docs/drivers/panfrost/instancing.rst
@@ -16,33 +16,33 @@ One option would be to do:
    \text{instance id} = \text{linear id} / \text{num vertices}
 
 but this involves a costly division and modulus by an arbitrary number.
-Instead, we could pad num_vertices. We dispatch
+Instead, we could pad ``num_vertices``. We dispatch
 :math:`\text{padded_num_vertices} \cdot \text{num_instances}` threads instead
 of :math:`\text{num_vertices} \cdot \text{num_instances}`, which results
 in some "extra" threads with :math:`\text{vertex_id} \geq \text{num_vertices}`,
-which we have to discard. The more we pad num_vertices, the more "wasted"
+which we have to discard. The more we pad ``num_vertices``, the more "wasted"
 threads we dispatch, but the division is potentially easier.
 
-One straightforward choice is to pad num_vertices to the next power of two,
-which means that the division and modulus are just simple bit shifts and
-masking. But the actual algorithm is a bit more complicated. The thread
+One straightforward choice is to pad ``num_vertices`` to the next power
+of two, which means that the division and modulus are just simple bit shifts
+and masking. But the actual algorithm is a bit more complicated. The thread
 dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
-to dividing by a power of two. As a result, padded_num_vertices can be
-1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
+to dividing by a power of two. As a result, ``padded_num_vertices`` can
+be 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
 since we need less padding.
 
-padded_num_vertices is picked by the hardware. The driver just specifies the
-actual number of vertices. Note that padded_num_vertices is a multiple of four
-(presumably because threads are dispatched in groups of 4). Also,
-padded_num_vertices is always at least one more than num_vertices, which seems
-like a quirk of the hardware. For larger num_vertices, the hardware uses the
-following algorithm: using the binary representation of num_vertices, we look at
-the most significant set bit as well as the following 3 bits. Let n be the
-number of bits after those 4 bits. Then we set padded_num_vertices according to
-the following table:
+``padded_num_vertices`` is picked by the hardware. The driver just specifies
+the actual number of vertices. Note that ``padded_num_vertices`` is a multiple
+of four (presumably because threads are dispatched in groups of 4). Also,
+``padded_num_vertices`` is always at least one more than ``num_vertices``,
+which seems like a quirk of the hardware. For larger ``num_vertices``, the
+hardware uses the following algorithm: using the binary representation of
+``num_vertices``, we look at the most significant set bit as well as the
+following 3 bits. Let n be the number of bits after those 4 bits. Then we
+set ``padded_num_vertices`` according to the following table:
 
 ========== =======================
-high bits  padded_num_vertices
+high bits  ``padded_num_vertices``
 ========== =======================
 1000       :math:`9 \cdot 2^n`
 1001       :math:`5 \cdot 2^{n+1}`
@@ -56,32 +56,32 @@ For example, if :math:`\text{num_vertices} = 70` is passed to
 and the high bits are 1000, and therefore
 :math:`\text{padded_num_vertices} = 9 \cdot 2^3 = 72`.
 
-The attribute unit works in terms of the original linear_id. if
+The attribute unit works in terms of the original ``linear_id``. if
 :math:`\text{num_instances} = 1`, then they are the same, and everything
 is simple. However, with instancing things get more complicated. There are
 four possible modes, two of them we can group together:
 
-1. Use the linear_id directly. Only used when there is no instancing.
+1. Use the ``linear_id`` directly. Only used when there is no instancing.
 
-2. Use the linear_id modulo a constant. This is used for per-vertex
+2. Use the ``linear_id`` modulo a constant. This is used for per-vertex
 attributes with instancing enabled by making the constant equal
-padded_num_vertices. Because the modulus is always padded_num_vertices, this
-mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
-The shift field specifies the power of two, while the extra_flags field
-specifies the odd number. If :math:`\text{shift} = n` and
+``padded_num_vertices``. Because the modulus is always ``padded_num_vertices``,
+this mode only supports a modulus that is a power of 2 times 1, 3, 5, 7,
+or 9. The shift field specifies the power of two, while the ``extra_flags``
+field specifies the odd number. If :math:`\text{shift} = n` and
 :math:`\text{extra_flags} = m`, then the modulus is
 :math:`(2m + 1) \cdot 2^n`. As an example, if
 :math:`\text{num_vertices} = 70`, then as computed above,
 :math:`\text{padded_num_vertices} = 9 \cdot 2^3`, so we should set
 :math:`\text{extra_flags} = 4` and :math:`\text{shift} = 3`. Note that we
-must exactly follow the hardware algorithm used to get padded_num_vertices
+must exactly follow the hardware algorithm used to get ``padded_num_vertices``
 in order to correctly implement per-vertex attributes.
 
-3. Divide the linear_id by a constant. In order to correctly implement
-instance divisors, we have to divide linear_id by padded_num_vertices times
-to user-specified divisor. So first we compute padded_num_vertices, again
-following the exact same algorithm that the hardware uses, then multiply it
-by the GL-level divisor to get the hardware-level divisor. This case is
+3. Divide the ``linear_id`` by a constant. In order to correctly implement
+instance divisors, we have to divide ``linear_id`` by ``padded_num_vertices``
+times to user-specified divisor. So first we compute ``padded_num_vertices``,
+again following the exact same algorithm that the hardware uses, then multiply
+it by the GL-level divisor to get the hardware-level divisor. This case is
 further divided into two more cases. If the hardware-level divisor is a
 power of two, then we just need to shift. The shift amount is specified
 by the shift field, so that the hardware-level divisor is just
@@ -96,7 +96,7 @@ https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
 The hardware further assumes the multiplier is between :math:`2^{31}` and
 :math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
 to 0 by the driver -- presumably this simplifies the hardware multiplier a
-little. The hardware first multiplies linear_id by the multiplier and
+little. The hardware first multiplies ``linear_id`` by the multiplier and
 takes the high 32 bits, then applies the round-down correction if
 :math:`\text{extra_flags} = 1`, then finally shifts right by the shift field.