docs/asahi: Document varying interpolation
Varying interpolation is quite involved in the hardware. Now that I understand how it works, add some documentation. This documentation is too long and uses too much fancy formatting to put inline with the XML, so put in our external documentation space. Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17198>
This commit is contained in:
@@ -37,3 +37,138 @@ The library is only built if ``-Dtools=asahi`` is passed. It builds a single
|
||||
For example, to trace an app ``./app``, run:
|
||||
|
||||
DYLD_INSERT_LIBRARIES=~/mesa/build/src/asahi/lib/libwrap.dylib ./app
|
||||
|
||||
Hardware varyings
|
||||
-----------------
|
||||
|
||||
At an API level, vertex shader outputs need to be interpolated to become
|
||||
fragment shader inputs. This process is logically pipelined in AGX, with a value
|
||||
travelling from a vertex shader to remapping hardware to coefficient register
|
||||
setup to the fragment shader to the iterator hardware. Each stage is described
|
||||
below.
|
||||
|
||||
Vertex shader
|
||||
`````````````
|
||||
|
||||
A vertex shader (running on the Unified Shader Cores) outputs varyings with the
|
||||
``st_var`` instruction. ``st_var`` takes a *vertex output index* and a 32-bit
|
||||
value. The maximum number of *vertex outputs* is specified as the "output count"
|
||||
of the shader in the "Bind Vertex Pipeline" packet. The value may be interpreted
|
||||
consist of a single 32-bit value or an aligned 16-bit register pair, depending
|
||||
on whether interpolation should happen at 32-bit or 16-bit. Vertex outputs are
|
||||
indexed starting from 0, with the *vertex position* always coming first, the
|
||||
32-bit user varyings coming next, then 16-bit user varyings, and finally *point
|
||||
size* at the end if present.
|
||||
|
||||
.. list-table:: Ordering of vertex outputs with all outputs used
|
||||
:widths: 25 75
|
||||
:header-rows: 1
|
||||
|
||||
* - Index
|
||||
- Value
|
||||
* - 0
|
||||
- Vertex position
|
||||
* - 4
|
||||
- 32-bit varying 0
|
||||
* -
|
||||
- ...
|
||||
* - 4 + m
|
||||
- 32-bit varying m
|
||||
* - 4 + m + 1
|
||||
- Packed pair of 16-bit varyings 0
|
||||
* -
|
||||
- ...
|
||||
* - 4 + m + 1 + n
|
||||
- Packed pair of 16-bit varyings n
|
||||
* - 4 + m + 1 + n + 1
|
||||
- Point size
|
||||
|
||||
Remapping
|
||||
`````````
|
||||
|
||||
Vertex outputs are remapped to varying slots to be interpolated.
|
||||
The output of remapping consists of the following items: the *W* fragment
|
||||
coordinate, the *Z* fragment coordinate, user varyings in the vertex
|
||||
output order. *Z* may be omitted, but *W* may not be. This remapping is
|
||||
configured by the "Linkage" packet.
|
||||
|
||||
.. list-table:: Ordering of remapped slots
|
||||
:widths: 25 75
|
||||
:header-rows: 1
|
||||
|
||||
* - Index
|
||||
- Value
|
||||
* - 0
|
||||
- Fragment coord W
|
||||
* - 1
|
||||
- Fragment coord Z
|
||||
* - 2
|
||||
- 32-bit varying 0
|
||||
* -
|
||||
- ...
|
||||
* - 2 + m
|
||||
- 32-bit varying m
|
||||
* - 2 + m + 1
|
||||
- Packed pair of 16-bit varyings 0
|
||||
* -
|
||||
- ...
|
||||
* - 2 + m + n + 1
|
||||
- Packed pair of 16-bit varyings n
|
||||
|
||||
Coefficient registers
|
||||
`````````````````````
|
||||
|
||||
The fragment shader does not see the physical slots.
|
||||
Instead, it references varyings through *coefficient registers*. A coefficient
|
||||
register is a register allocated constant for all fragment shader invocations in
|
||||
a given polygon. Physically, it contains the values output by the vertex shader
|
||||
for each vertex of the polygon. Coefficient registers are preloaded with values
|
||||
from varying slots. This preloading appears to occur in fixed function hardware,
|
||||
a simplifcation from PowerVR which requires a specialized program for the
|
||||
programmable data sequencer to do the preload.
|
||||
|
||||
The "Bind fragment pipeline" packet points to coefficient register bindings,
|
||||
preceded by a header. The header contains the number of 32-bit varying slots. As
|
||||
the *W* slot is always present, this field is always nonzero. Slots whose index
|
||||
is below this count are treated as 32-bit. The remaining slots are treated as
|
||||
16-bits.
|
||||
|
||||
The header also contains the total number of coefficient registers bound.
|
||||
|
||||
Each binding that follows maps a (vector of) varying slots to a (consecutive)
|
||||
coefficient registers. Some details about the varying (perspective
|
||||
interpolation, flat shading, point sprites) are configured here.
|
||||
|
||||
Coefficient registers may be ordered the same as the internal varying slots.
|
||||
However, this may be inconvenient for some APIs that require a separable shader
|
||||
model. For these APIs, the flexibility to mix-and-match slots and coefficient
|
||||
registers allows mixing shaders without shader variants. In that case, the
|
||||
bindings should be generated outside of the compiler. For simple APIs where the
|
||||
bindings are fixed and known at compile-time, the bindings could be generated
|
||||
within the compiler.
|
||||
|
||||
Fragment shader
|
||||
```````````````
|
||||
|
||||
In the fragment shader, coefficient registers, identified by the prefix `cf`
|
||||
followed by a decimal index, act as opaque handles to varyings. For flat
|
||||
shading, coefficient registers may be loaded into general registers with the
|
||||
`ldcf` instruction. For smooth shading, the coefficient register corresponding
|
||||
to the desired varying is passed as an argument to the "iterate" instruction
|
||||
`iter` in order to "iterate" (interpolate) a varying. As perspective correct
|
||||
interpolation also requires the W component of the fragment coordinate, the
|
||||
coefficient register for W is passed as a second argument. As an example, if
|
||||
there's a single varying to interpolate, an instruction like `iter r0, cf1, cf0`
|
||||
is used.
|
||||
|
||||
Iterator
|
||||
````````
|
||||
|
||||
To actually interpolate varyings, AGX provides fixed-function iteration hardware
|
||||
to multiply the specified coefficient registers with the required barycentrics,
|
||||
producing an interpolated value, hence the name "coefficient register". This
|
||||
operation is purely mathematical and does not require any memory access, as
|
||||
the required coefficients are preloaded before the shader begins execution.
|
||||
That means the iterate instruction executes in constant time, does not signal
|
||||
a data fence, and does not require the shader to wait on a data fence before
|
||||
using the value.
|
||||
|
Reference in New Issue
Block a user