Freedreno
=========

Freedreno is a GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
OpenGL ES 3.2 and desktop OpenGL 4.5.

See the `Freedreno Wiki
<https://gitlab.freedesktop.org/freedreno/freedreno/-/wikis/home>`__ for more
details.

Turnip
======

Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.

The current set of specific chip versions supported can be found in
:file:`src/freedreno/common/freedreno_devices.py`. The current set of features
supported can be found rendered at `Mesa Matrix <https://mesamatrix.net/>`__.

There are no plans to port to a5xx or earlier GPUs.

Hardware architecture
---------------------

Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
("gmem") and render directly to system memory ("sysmem"). It is UMA, using
mostly write-combined memory but with the ability to map some buffers as cache
coherent with the CPU.

.. toctree::
   :glob:

   freedreno/hw/*

Hardware acronyms
^^^^^^^^^^^^^^^^^

.. glossary::

   Cluster
      A group of hardware registers, often with multiple copies to allow
      pipelining. There is an M:N relationship between hardware blocks that do
      work and the clusters of registers for the state that hardware blocks
      use.

   CP
      Command Processor. Reads the stream of state changes and draw commands
      generated by the driver.

   PFP
      Prefetch Parser. Adreno 2xx-4xx CP component.

   ME
      Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4
      commands.

   SQE
      a6xx+ replacement for PFP/ME. This is the microcontroller that runs the
      microcode (loaded from Linux) which actually processes the command stream
      and writes to the hardware registers. See `afuc
      <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/afuc/README.rst>`__.

   ROQ
      DMA engine used by the SQE for reading memory, with some prefetch
      buffering. Mostly reads in the command stream, but also serves for
      ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.

   SP
      Shader Processor. Unified, scalar shader engine. One or more, depending
      on GPU and tier.

   TP
      Texture Processor.

   UCHE
      Unified L2 Cache. 32KB on A330, unclear how big now.

   CCU
      Color Cache Unit.

   VSC
      Visibility Stream Compressor.

   PVS
      Primitive Visibility Stream.

   FE
      Front End? Index buffer and vertex attribute fetch cluster. Includes PC,
      VFD, VPC.

   VFD
      Vertex Fetch and Decode.

   VPC
      Varying/Position Cache? Hardware block that stores shaded vertex data for
      primitive assembly.

   HLSQ
      High Level Sequencer. Manages state for the SPs, batches up PS
      invocations between primitives, is involved in preemption.

   PC_VS
      Cluster where varyings are read from VPC and assembled into primitives to
      feed GRAS.

   VS
      Vertex Shader. Responsible for generating VS/GS/tess invocations.

   GRAS
      Rasterizer. Responsible for generating PS invocations from primitives,
      also does LRZ.

   PS
      Pixel Shader.

   RB
      Render Backend. Performs both early and late Z testing, blending, and
      attachment stores of the output of the PS.

   GMEM
      Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
      attachments during tiled rendering.

   LRZ
      Low Resolution Z. A low resolution area of the depth buffer that can be
      initialized during the binning pass to contain the worst-case (farthest)
      Z values in a block, and then used to early reject fragments during
      rasterization.

Cache hierarchy
^^^^^^^^^^^^^^^

The a6xx GPUs have two main caches: CCU and UCHE.

UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and
flushes access system memory.

The CCU is the separate cache used by 2D blits and sysmem render target access
(and also for resolves to system memory when in GMEM mode). Its memory comes
from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
reserved based on whether we're in a render pass using GMEM for attachment
storage, or we're doing sysmem rendering. Cache entries have the attachment
number and layer mixed into the cache tag in some way, likely so that a
fragment's access is spread through the cache even if the attachments have the
same size and alignment in address space. This means that the cache must be
flushed and invalidated between memory being used for one attachment and
another (notably depth vs color, but also MRT color).
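
To make that last requirement concrete, here is a small sketch of the ordering
a driver has to emit when attachment memory is reused; the event names and the
``emit_event()`` helper are placeholders invented for this illustration, not
the real a6xx event encodings or Mesa helpers.

.. code-block:: c

   #include <stdio.h>

   /* Placeholder events; the real driver emits the a6xx CCU events via
    * CP_EVENT_WRITE packets. */
   enum ccu_event { CCU_FLUSH, CCU_INVALIDATE };

   /* Stand-in for appending an event to the command stream. */
   static void emit_event(enum ccu_event ev)
   {
      printf("emit CCU event %d\n", ev);
   }

   /* Ordering required when memory the CCU cached for one attachment is
    * reused as a different attachment (e.g. depth vs color, or another MRT
    * color): write back dirty lines, then drop the now-stale tags. */
   static void switch_attachment_memory(void)
   {
      emit_event(CCU_FLUSH);      /* write back dirty CCU lines */
      emit_event(CCU_INVALIDATE); /* so the new use doesn't hit stale data */
   }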

The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
unclear how big now) before accessing UCHE. This cache is used for normal
sampling like ``sam`` and ``isam`` (and the compiler will make read-only
storage image accesses through it as well). It is not coherent with UCHE (you
may get stale results when you ``sam`` after ``stib``), but it must get flushed
per draw or something, because you don't need a manual invalidate between draws
storing to an image and draws sampling from a texture.

The command processor (CP) does not read from either of these caches, and
instead uses FIFOs in the ROQ to avoid stalls reading from system memory.

Draw states
^^^^^^^^^^^

The SQE is not a fast processor, and tiled rendering means that many draws
won't even be used in many bins, so since a5xx, state updates can be batched up
into "draw states" that point to a fragment of CP packets. At draw time, if the
draw call is going to actually execute (some primitive is visible in the
current tile), the SQE goes through the ``GROUP_ID``\s and, for any with an
update since the last time they were executed, executes the corresponding
fragment.

Starting with a6xx, states can be tagged with whether they should be executed
at draw time for any of sysmem, binning, or tile rendering. This allows a
single command stream to be generated which can be executed in any of the
modes, unlike pre-a6xx where we had to generate separate command lists for the
binning and rendering phases.

Note that this means that the generated draw state has to always update all of
the state you have chosen to pack into that ``GROUP_ID``, since any of your
previous state changes in a previous draw state command may have been skipped.
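
As an illustration of that replay rule, here is a minimal C model of what the
text above describes; it is not Mesa code, and every name in it is invented for
this sketch.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define NUM_GROUP_IDS 32

   /* One "draw state": a GROUP_ID pointing at a fragment of CP packets. */
   struct draw_state_group {
      const uint32_t *packets;
      uint32_t size_dw;
      bool dirty;      /* updated since the last time it was executed */
   };

   /* Stand-in for the SQE replaying a fragment of CP packets. */
   static void execute_packets(const uint32_t *packets, uint32_t size_dw)
   {
      (void)packets;
      (void)size_dw;
   }

   /* Skipped draws leave their groups dirty; executed draws replay only the
    * groups that changed since they last ran. */
   static void sqe_draw(struct draw_state_group groups[NUM_GROUP_IDS],
                        bool visible_in_this_bin)
   {
      if (!visible_in_this_bin)
         return;

      for (unsigned i = 0; i < NUM_GROUP_IDS; i++) {
         if (groups[i].dirty) {
            execute_packets(groups[i].packets, groups[i].size_dw);
            groups[i].dirty = false;
         }
      }
   }

The model also shows why each draw state has to carry all of the state for its
``GROUP_ID``: only the most recent fragment for a group is ever replayed, so
anything missing from it is simply lost.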

Pipelining (a6xx+)
^^^^^^^^^^^^^^^^^^

Most CP commands write to registers. In a6xx+, the registers are located in
clusters corresponding to the stage of the pipeline they are used from (see
``enum tu_stage`` for a list). To pipeline state updates and drawing, registers
generally have two copies ("contexts") in their cluster, so previous draws can
be working on the previous set of register state while the next draw's state is
being set up. You can find what registers go into which clusters by looking at
:command:`crashdec` output in the ``regs-name: CP_MEMPOOL`` section.

As the SQE processes register writes in the command stream, it sends them into
a per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages
to process their stream of register updates and events independent of each
other (so even with just 2 contexts in a stage, earlier stages can proceed on
to later draws before later stages have caught up).

Each cluster has a per-context bit indicating that the context is done/free.
Register writes will stall on the context being done.

During a 3D draw command, the SQE generates several internal events that flow
through the pipeline:

- ``CP_EVENT_START`` clears the done bit for the context when written to the
  cluster.
- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off
  the actual event/drawing.
- ``CONTEXT_DONE`` event completes after the event/draw is complete and sets
  the done flag.
- ``CP_EVENT_END`` waits for the done flag on the next context, then copies all
  the registers that were dirtied in this context to that one.

The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``, and
``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
rollover.
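
The handshake above can be restated as a small toy model; nothing here is real
hardware or Mesa code, it just puts the event sequence in one place.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define NUM_REGS 64

   struct reg_context {
      bool done;               /* set by CONTEXT_DONE, cleared by CP_EVENT_START */
      uint32_t regs[NUM_REGS]; /* this context's copy of the banked registers */
      uint64_t dirty;          /* registers written while this context was current */
   };

   struct cluster {
      struct reg_context ctx[2];
      unsigned cur;
   };

   static void cp_event_start(struct cluster *c)
   {
      c->ctx[c->cur].done = false;  /* this context now has work in flight */
   }

   static void context_done(struct cluster *c)
   {
      c->ctx[c->cur].done = true;   /* the draw/event finished in this context */
   }

   static void cp_event_end(struct cluster *c)
   {
      unsigned next = c->cur ^ 1;

      while (!c->ctx[next].done)
         ;                          /* stall until the next context is free */

      /* copy registers dirtied in this context into the next one */
      for (unsigned i = 0; i < NUM_REGS; i++) {
         if (c->ctx[c->cur].dirty & (1ull << i))
            c->ctx[next].regs[i] = c->ctx[c->cur].regs[i];
      }
      c->ctx[c->cur].dirty = 0;

      c->cur = next;                /* roll over to the next context */
   }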

Because the clusters proceed independently of each other even across draws, if
you need to synchronize an earlier cluster to the output of a later one, then
you will need to ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
necessary caches.

Also, note that some registers are not banked at all, and will require a
``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.

In a2xx-a4xx, there weren't per-stage clusters, and instead there were two
register banks that were flipped between per draw.

Bindless/Bindful Descriptors (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with a6xx, cat5 (texture) and cat6 (image/ssbo/ubo) instructions are
extended to support bindless descriptors.

In the old bindful model, descriptors are separate for textures, samplers,
UBOs, and IBOs (combined descriptor for images and SSBOs), with separate
registers for the memory containing the array of descriptors, and/or different
``STATE_TYPE`` and ``STATE_BLOCK`` for ``CP_LOAD_STATE``/``_FRAG``/``_GEOM``
to pre-load the descriptors into cache.

- textures - per-shader-stage

  - registers: ``SP_xS_TEX_CONST``/``SP_xS_TEX_COUNT``
  - state-type: ``ST6_CONSTANTS``
  - state-block: ``SB6_xS_TEX``

- samplers - per-shader-stage

  - registers: ``SP_xS_TEX_SAMP``
  - state-type: ``ST6_SHADER``
  - state-block: ``SB6_xS_TEX``

- UBOs - per-shader-stage

  - registers: none
  - state-type: ``ST6_UBO``
  - state-block: ``SB6_xS_SHADER``

- IBOs - global across 3d shader stages, separate for the compute shader

  - registers: ``SP_IBO``/``SP_IBO_COUNT`` or ``SP_CS_IBO``/``SP_CS_IBO_COUNT``
  - state-type: ``ST6_SHADER``
  - state-block: ``ST6_IBO`` or ``ST6_CS_IBO`` for compute shaders
  - Note: unlike per-shader-stage descriptors, ``CP_LOAD_STATE6`` is used,
    as opposed to ``CP_LOAD_STATE6_GEOM`` or ``CP_LOAD_STATE6_FRAG``
    depending on shader stage.

.. note::
   For the per-shader-stage registers and state-blocks, the ``xS`` notation
   refers to per-shader-stage names, e.g. ``SP_FS_TEX_CONST`` or ``SB6_DS_TEX``.

Textures and IBOs (images) use *basically* the same 64-byte descriptor format,
with some exceptions (for example, for IBOs cubemaps are handled as 2d arrays).
SSBOs are just untyped buffers, but otherwise use the same descriptors and
instructions as images. Samplers use a 16-byte descriptor, and UBOs use an
8-byte descriptor which packs the size in the upper 15 bits of the UBO address.
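
As a purely hypothetical reading of that UBO layout, the 8-byte descriptor
could be packed as below; the bit positions follow from "size in the upper 15
bits", but the units of the size field (and the helper itself) are assumptions
for illustration, so defer to the register definitions rather than this sketch.

.. code-block:: c

   #include <stdint.h>

   /* Pack a UBO descriptor: a 64-bit buffer address in the low 49 bits and
    * the size in the top 15 bits.  Size units are an assumption here. */
   static inline uint64_t pack_ubo_descriptor(uint64_t buffer_iova, uint32_t size)
   {
      const uint64_t addr_mask = (1ull << 49) - 1;

      return (buffer_iova & addr_mask) | ((uint64_t)(size & 0x7fff) << 49);
   }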

In the bindless model, descriptors are split into 5 descriptor sets, which are
global across shader stages (but as with bindful IBO descriptors, separate for
3d stages vs the compute stage). Each hw descriptor set is an array of
descriptors of configurable size (each descriptor set can be configured for a
descriptor pitch of 8 bytes or 64 bytes). Each descriptor can be of arbitrary
format (ie. UBOs/IBOs/textures/samplers interleaved); its interpretation by the
hw is determined by the instruction that references the descriptor. Each
descriptor set can contain at least 2^16 descriptors.

The hw is configured with the base address of the descriptor set via an array
of "BINDLESS_BASE" registers, ie ``SP_BINDLESS_BASE[n]``/``HLSQ_BINDLESS_BASE[n]``
for 3d shader stages, or ``SP_CS_BINDLESS_BASE[n]``/``HLSQ_CS_BINDLESS_BASE[n]``
for compute shaders, with the descriptor pitch encoded in the low bits.
Which of the descriptor sets is referenced is encoded via three bits in the
instruction. The address of the descriptor is calculated as::

   descriptor_addr = (BINDLESS_BASE[n] & ~0x3) +
                     (idx * 4 * (2 << (BINDLESS_BASE[n] & 0x3)))
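
Written out as C, with ``bindless_base`` being the 64-bit register value and
``idx`` the descriptor index from the instruction, this mirrors the formula
above:

.. code-block:: c

   #include <stdint.h>

   /* The low two bits of the BINDLESS_BASE value select the descriptor
    * pitch (8 or 64 bytes in practice); the rest is the base address of
    * the descriptor set. */
   static inline uint64_t descriptor_addr(uint64_t bindless_base, uint32_t idx)
   {
      uint64_t base  = bindless_base & ~(uint64_t)0x3;
      uint64_t pitch = 4 * (2ull << (bindless_base & 0x3));

      return base + (uint64_t)idx * pitch;
   }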

.. note::
   Turnip reserves one descriptor set for internal use and exposes the other
   four for the application via the Vulkan API.

Software Architecture
---------------------

Freedreno and Turnip use a shared core for shader compiler, image layout, and
register and command stream definitions. They implement separate state
management and command stream generation.

.. toctree::
   :glob:

   freedreno/*

GPU hang debugging
^^^^^^^^^^^^^^^^^^

A kernel message from DRM of "gpu fault" can mean any sort of error reported by
the GPU (including its internal hang detection). If a fault in GPU address
space happened, you should expect to find a message from the iommu, with the
faulting address and a hardware unit involved:

.. code-block:: console

   *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)

On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
``/sys/devices/virtual/devcoredump/**/data``. You can cp that file to a
:file:`crash.devcore` to save it, otherwise the kernel will expire it
eventually. Echo 1 to the file to free the core early, as another core won't be
taken until then.

Once you have your core file, you can use :command:`crashdec -f crash.devcore`
to decode it. The output will have ``ESTIMATED CRASH LOCATION`` where we
estimate the CP to have stopped. Note that it is expected that this will be
some distance past whatever state triggered the fault, given GPU pipelining,
and will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
event. You can try running the workload with ``TU_DEBUG=flushall`` or
``FD_MESA_DEBUG=flush`` to try to close in on the failing commands.

You can also find what commands were queued up to each cluster in the
``regs-name: CP_MEMPOOL`` section.

Command Stream Capture
^^^^^^^^^^^^^^^^^^^^^^

During Mesa development, it's often useful to look at the command streams we
send to the kernel. Mesa itself doesn't implement a way to stream them out
(though it maybe should!). Instead, we have an interface for the kernel to
capture all submitted command streams:

.. code-block:: console

   cat /sys/kernel/debug/dri/0/rd > cmdstream &

By default, command stream capture does not capture texture/vertex/etc. data.
You can enable capturing all the BOs with:

.. code-block:: console

   echo Y > /sys/module/msm/parameters/rd_full

Note that, since all command streams get captured, it is easy to run the system
out of memory doing this, so you probably don't want to enable it during play
of a heavyweight game. Instead, to capture a command stream within a game, you
probably want to cause a crash in the GPU during a frame of interest so that a
single GPU core dump is generated. Emitting ``0xdeadbeef`` in the CS should be
enough to cause a fault.
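
For example, from inside Turnip you can append a bogus dword at the point of
interest (a quick hack, not something to leave in the driver); ``tu_cs_emit()``
appends a single dword to a command stream, so this sketch assumes you are in a
function that already has a ``struct tu_cs *cs`` in hand:

.. code-block:: c

   /* 0xdeadbeef should not decode as a valid packet, so the SQE faults close
    * to the commands you care about, producing a devcoredump to decode with
    * crashdec (see the GPU hang debugging section above). */
   tu_cs_emit(cs, 0xdeadbeef);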