docs: Move the gallium driver documentation to the top level.
I actually had never found these, buried under Developer Topics ->
Gallium -> Drivers. Given that the driver documentation contains not just
gallium driver documentation but also end-user information, bring it to a
much more prominent location between User Topics and Developer Topics at
the top level.

Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7174>

Drivers
=======

Driver specific documentation.

.. toctree::
   :glob:

   drivers/*

Freedreno
=========

Freedreno driver specific docs.

.. toctree::
   :glob:

   freedreno/*

IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader
ISA introduced with adreno a3xx.  The same shader ISA is present, with
some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a
"simple" scalar instruction set.  However, the compiler is responsible,
in most cases, for scheduling the instructions.  The hardware does not
try to hide the shader core pipeline stages.  For a common example, a
common (cat2) ALU instruction takes four cycles, so a subsequent cat2
instruction which uses the result must have three intervening
instructions (or nops).  When operating on vec4's, the corresponding
scalar instructions for operating on the remaining three components
could typically fit.  But that results in a lot of edge cases where
things fall over, like:

::

    ADD TEMP[0], TEMP[1], TEMP[2]
    MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of
scalar instructions in the wrong order, resulting in not enough
instruction spots between the ``add r0.w, r1.w, r2.w`` and the
``mul r0.x, r1.x, r0.w``.  This is why the original (old) compiler,
which merely translated nearly literally from TGSI to ir3, had a strong
tendency to fall over.

So the current compiler instead, in the frontend, generates a
directed-acyclic-graph of instructions and basic blocks, which go
through various additional passes to eventually schedule and do register
assignment.

For additional documentation about the hardware, see wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
   A single vertex/fragment/etc shader from the gallium perspective (ie.
   maps to a single TGSI shader), and manages a set of shader variants
   which are generated on demand based on the shader key.

``ir3_shader_key``
   The configuration key that identifies a shader variant.  Ie. based
   on other GL state (two-sided-color, render-to-alpha, etc) or render
   stages (binning-pass vertex shader), different shader variants are
   generated.

``ir3_shader_variant``
   The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
   Compiler frontend which generates ir3 and runs the various backend
   stages to schedule and do register assignment.

The IR
------

The ir3 IR maps quite directly to the hardware: instruction opcodes map
directly to hardware opcodes, and dst/src register(s) map directly to
the hardware dst/src register(s).  But there are a few extensions, in
the form of meta_ instructions.  Additionally, for normal (non-const,
etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr``
points to the source instruction which produced that value.  So, for
example, the following TGSI shader:

::

    VERT
    DCL IN[0]
    DCL IN[1]
    DCL OUT[0], POSITION
    DCL TEMP[0], LOCAL
      1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
      2: MOV OUT[0], TEMP[0].xxxx
      3: END

eventually generates:

.. graphviz::

   digraph G {
      rankdir=RL;
      nodesep=0.25;
      ranksep=1.5;
      subgraph clusterdce198 {
         label="vert";
         inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
         instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
         instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
         inputdce198:<in2>:w -> instrdcedd0:<src0>
         inputdce198:<in6>:w -> instrdcedd0:<src1>
         instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
         inputdce198:<in1>:w -> instrdcec30:<src0>
         inputdce198:<in5>:w -> instrdcec30:<src1>
         instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
         inputdce198:<in0>:w -> instrdceb60:<src0>
         inputdce198:<in4>:w -> instrdceb60:<src1>
         instrdceb60:<dst0> -> instrdcec30:<src2>
         instrdcec30:<dst0> -> instrdcedd0:<src2>
         instrdcedd0:<dst0> -> instrdcf348:<src0>
         instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
         instrdcedd0:<dst0> -> instrdcf400:<src0>
         instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
         instrdcedd0:<dst0> -> instrdcf4b8:<src0>
         outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
         instrdcf348:<dst0> -> outputdce198:<out0>:e
         instrdcf400:<dst0> -> outputdce198:<out1>:e
         instrdcf4b8:<dst0> -> outputdce198:<out2>:e
         instrdcedd0:<dst0> -> outputdce198:<out3>:e
      }
   }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
   Represents a basic block.

   TODO: currently blocks are nested, but I think I need to change that
   to a more conventional arrangement before implementing proper flow
   control.  Currently the only flow control handled is if/else, which
   gets flattened out and results chosen with ``sel`` instructions.

``ir3_instruction``
   Represents a machine instruction or meta_ instruction.  Has pointers
   to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
   as needed.

``ir3_register``
   Represents a src or dst register; flags indicate const/relative/etc.
   If ``IR3_REG_SSA`` is set on a src register, the actual register
   number (name) has not been assigned yet, and instead the ``instr``
   field points to the src instruction.

In addition there are various util macros/functions to simplify
manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
   Iterate each instruction's source ``ir3_register``\s.

``foreach_src_n(srcreg, n, instr)``
   Like ``foreach_src``, also setting ``n`` to the source number
   (starting with ``0``).

``foreach_ssa_src(srcinstr, instr)``
   Iterate each instruction's SSA source ``ir3_instruction``\s.  This
   skips non-SSA sources (consts, etc), but includes virtual sources
   (such as the address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
   Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

   foreach_ssa_src_n(src, i, instr) {
      unsigned d = delay_calc_srcn(ctx, src, instr, i);
      delay = MAX2(delay, d);
   }

TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
   Used for shader inputs (registers configured in the command-stream
   to hold particular input values, written by the shader core before
   start of execution).  Also used for connecting up values within a
   basic block to an output of a previous block.

**output**
   Used to hold outputs of a basic block.

**flow**
   TODO

**phi**
   TODO

**fanin**
   Groups registers which need to be assigned to consecutive scalar
   registers, for example ``sam`` (texture fetch) src instructions (see
   `register groups`_) or array element dereference (see `relative
   addressing`_).

**fanout**
   The counterpart to **fanin**: when an instruction such as ``sam``
   writes multiple components, it splits the result into individual
   scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO

.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume
multiple consecutive scalar registers via a single src register encoded
in the instruction, and/or write multiple consecutive scalar registers.
In the simplest example:

::

    sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, this would read ``r0.zw`` to get the coordinate, and
write ``r2.xyz``.

Before register assignment, to group the two components of the texture
src together:

.. graphviz::

   digraph G {
      { rank=same;
        fanin;
      };
      { rank=same;
        coord_x;
        coord_y;
      };
      sam -> fanin [label="regs[1]"];
      fanin -> coord_x [label="regs[1]"];
      fanin -> coord_y [label="regs[2]"];
      coord_x -> coord_y [label="right",style=dotted];
      coord_y -> coord_x [label="left",style=dotted];
      coord_x [label="coord.x"];
      coord_y [label="coord.y"];
   }

The frontend sets up the SSA ptrs from the ``sam`` source register to
the ``fanin`` meta instruction, which in turn points to the instructions
producing the ``coord.x`` and ``coord.y`` values.  And the grouping_
pass sets up the ``left`` and ``right`` neighbor pointers to the
``fanin``\'s sources, used later by the `register assignment`_ pass to
assign blocks of scalar registers.

And likewise, for the consecutive scalar registers for the destination:

.. graphviz::

   digraph {
      { rank=same;
        A;
        B;
        C;
      };
      { rank=same;
        fanout_0;
        fanout_1;
        fanout_2;
      };
      A -> fanout_0;
      B -> fanout_1;
      C -> fanout_2;
      fanout_0 [label="fanout\noff=0"];
      fanout_0 -> sam;
      fanout_1 [label="fanout\noff=1"];
      fanout_1 -> sam;
      fanout_2 [label="fanout\noff=2"];
      fanout_2 -> sam;
      fanout_0 -> fanout_1 [label="right",style=dotted];
      fanout_1 -> fanout_0 [label="left",style=dotted];
      fanout_1 -> fanout_2 [label="right",style=dotted];
      fanout_2 -> fanout_1 [label="left",style=dotted];
      sam;
   }

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support indirect addressing (relative to the address
register) into the const or gpr register file in some or all of their
src/dst registers.  In this case the register accessed is taken from
``r<a0.x + n>`` or ``c<a0.x + n>``, ie. the address register (``a0.x``)
value plus ``n``, where ``n`` is encoded in the instruction (rather
than the absolute register number).

Note that cat5 (texture sample) instructions are the notable exception,
not supporting relative addressing of src or dst.

Relative addressing of the const file (for example, a uniform array) is
relatively simple.  We don't do register assignment of the const file,
so all that is required is to schedule things properly.  Ie. the
instruction that writes the address register must be scheduled first,
and we cannot have two different address register values live at one
time.

But relative addressing of the gpr file (which can be as src or dst)
has additional restrictions on register assignment (ie. the array
elements must be assigned to consecutive scalar registers).  And in the
case of a relative dst, subsequent instructions now depend on both the
relative write, as well as the previous instruction which wrote that
register, since we do not know at compile time which actual register
was written.

Each instruction has an optional ``address`` pointer, to capture the
dependency on the address register value when relative addressing is
used for any of the src/dst register(s).  This behaves as an additional
virtual src register, ie. ``foreach_ssa_src()`` will also iterate the
address register (last).

Note that ``nop``\'s for timing constraints, type specifiers (ie.
``add.f`` vs ``add.u``), etc, are omitted for brevity in examples:

::

    mova a0.x, hr1.y
    sub r1.y, r2.x, r3.x
    add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

   digraph {
      rankdir=LR;
      sub;
      const [label="const file"];
      add;
      mova;
      add -> mova;
      add -> sub;
      add -> const [label="off=2"];
   }

The scheduling pass has some smarts to schedule things such that only a
single ``a0.x`` value is used at any one time.

To implement variable arrays, values are stored in consecutive scalar
registers.  This has some overlap with `register groups`_, in that
``fanin`` and ``fanout`` are used to help group things for the
`register assignment`_ pass.

Using a variable array as a src register is a slight variation of what
is done for a const array src.  The instruction src is a ``fanin``
instruction that groups all the array members:

::

    mova a0.x, hr1.y
    sub r1.y, r2.x, r3.x
    add r0.x, r1.y, r<a0.x + 2>

results in:

.. graphviz::

   digraph {
      a0 [label="r0.z"];
      a1 [label="r0.w"];
      a2 [label="r1.x"];
      a3 [label="r1.y"];
      sub;
      fanin;
      mova;
      add;
      add -> sub;
      add -> fanin [label="off=2"];
      add -> mova;
      fanin -> a0;
      fanin -> a1;
      fanin -> a2;
      fanin -> a3;
   }

TODO better describe how the actual deref offset is derived, ie. based
on the array base register.

To do an indirect write to a variable array, a ``fanout`` is used.  Say
the array was assigned to registers ``r0.z`` through ``r1.y`` (hence
the constant offset of 2).  Note that only cat1 (mov) can do an
indirect write:

::

    mova a0.x, hr1.y
    min r2.x, r2.x, c0.x
    mov r<a0.x + 2>, r2.x
    mul r0.x, r0.z, c0.z

In this case, the ``mov`` instruction does not write all elements of
the array (compared to usage of ``fanout`` for ``sam`` instructions in
grouping_).  But the ``mov`` instruction does need an additional
dependency (via ``fanin``) on the instructions that last wrote the
array element members, to ensure that they get scheduled before the
``mov`` in the scheduling_ stage (which also serves to group the array
elements for the `register assignment`_ stage).

.. graphviz::

   digraph {
      a0 [label="r0.z"];
      a1 [label="r0.w"];
      a2 [label="r1.x"];
      a3 [label="r1.y"];
      min;
      mova;
      mov;
      mul;
      fanout [label="fanout\noff=0"];
      mul -> fanout;
      fanout -> mov;
      fanin;
      fanin -> a0;
      fanin -> a1;
      fanin -> a2;
      fanin -> a3;
      mov -> min;
      mov -> mova;
      mov -> fanin;
   }

Note that there would in fact be ``fanout`` nodes generated for each
array element (although only the reachable ones will be scheduled, etc).

Shader Passes
-------------

After the frontend has generated the use-def graph of instructions,
the instructions are run through various passes, which include
scheduling_ and `register assignment`_.  Because inserting ``mov``
instructions after scheduling would also require inserting additional
``nop`` instructions (since it is too late to reschedule to try and
fill the bubbles), the earlier stages try to ensure that (at least
given an infinite supply of registers) `register assignment`_ after
scheduling_ cannot fail.

Note that we essentially have ~256 scalar registers in the architecture
(although larger register usage will at some thresholds limit the
number of threads which can run in parallel).  And at some point we
will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block,
with ``phi`` nodes converted into ``sel`` instructions.  The a3xx ISA
has very few predicated instructions, and we would prefer not to use
branches for simple if/else.

.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because
certain categories of instructions have limitations about const regs as
sources.  And the CP pass simply removes all simple ``mov``\s (ie.
src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the frontend inserting no
``mov``\s and CP legalizing things.

.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for
``fanin``\s, etc) have their ``left`` / ``right`` neighbor pointers set
up.  In cases where there is a conflict (ie. one instruction cannot
have two unique left or right neighbors), an additional ``mov``
instruction is inserted.  This ensures that there is some possible
valid `register assignment`_ at the later stages.

.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node
within its basic block.  The depth is the sum of the required cycles
(delay slots needed between two instructions plus one) of each
instruction, plus the max depth of any of its source instructions
(meta_ instructions don't add to the depth).  As an instruction's depth
is calculated, it is inserted into a per-block list sorted by deepest
instruction.  Unreachable instructions and inputs are marked.

TODO: we should probably calculate both hard and soft depths (?) to
try to coax additional instructions to fit in places where we need
to use sync bits, such as after a texture fetch or SFU.

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or
remove.  Each basic block is scheduled starting from the deepest node
in the depth-sorted list created by the depth_ pass, recursively trying
to schedule each instruction after its source instructions plus delay
slots.  ``nop``\s are inserted as required.

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO

Gallium LLVMpipe Driver
=======================

Introduction
------------

The Gallium llvmpipe driver is a software rasterizer that uses LLVM to
do runtime code generation.  Shaders, point/line/triangle rasterization
and vertex processing are implemented with LLVM IR, which is translated
to x86, x86-64, or ppc64le machine code.  Also, the driver is
multithreaded to take advantage of multiple CPU cores (up to 8 at this
time).  It's the fastest software rasterizer for Mesa.

Requirements
------------

-  For x86 or amd64 processors, 64-bit mode is recommended.  Support
   for SSE2 is strongly encouraged.  Support for SSE3 and SSE4.1 will
   yield the most efficient code.  The fewer features the CPU has, the
   more likely it is that you will run into underperforming, buggy, or
   incomplete code.

   For ppc64le processors, use of the Altivec feature (the Vector
   Facility) is recommended if supported; use of the VSX feature (the
   Vector-Scalar Facility) is recommended if supported AND Mesa is
   built with LLVM version 4.0 or later.

   See ``/proc/cpuinfo`` to know what your CPU supports.

-  Unless otherwise stated, LLVM version 3.4 is recommended; 3.3 or
   later is required.

   For Linux, on a recent Debian based distribution do:

   .. code-block:: console

      aptitude install llvm-dev

   If you want development snapshot builds of LLVM for Debian and
   derived distributions like Ubuntu, you can use the APT repository at
   `apt.llvm.org <https://apt.llvm.org/>`__, which is maintained by
   Debian's LLVM maintainer.

   For a RPM-based distribution do:

   .. code-block:: console

      yum install llvm-devel

   For Windows you will need to build LLVM from source with MSVC or
   MINGW (either natively or through cross compilers) and CMake, and
   set the ``LLVM`` environment variable to the directory you installed
   it to.  LLVM will be statically linked, so when building on MSVC it
   needs to be built with a CRT matching Mesa's, and you'll need to
   pass ``-DLLVM_USE_CRT_xxx=yyy`` as described below.

   +-----------------+----------------------------------------------------------------+
   | LLVM build-type | Mesa build-type                                                |
   |                 +--------------------------------+-------------------------------+
   |                 | debug,checked                  | release,profile               |
   +=================+================================+===============================+
   | Debug           | ``-DLLVM_USE_CRT_DEBUG=MTd``   | ``-DLLVM_USE_CRT_DEBUG=MT``   |
   +-----------------+--------------------------------+-------------------------------+
   | Release         | ``-DLLVM_USE_CRT_RELEASE=MTd`` | ``-DLLVM_USE_CRT_RELEASE=MT`` |
   +-----------------+--------------------------------+-------------------------------+

   You can build only the x86 target by passing
   ``-DLLVM_TARGETS_TO_BUILD=X86`` to cmake.

-  scons (optional)

Building
--------

To build everything on Linux invoke scons as:

.. code-block:: console

   scons build=debug libgl-xlib

Alternatively, you can build it with meson with:

.. code-block:: console

   mkdir build
   cd build
   meson -D glx=gallium-xlib -D gallium-drivers=swrast
   ninja

but the rest of these instructions assume that scons is used.  For
Windows the procedure is similar except the target:

.. code-block:: console

   scons platform=windows build=debug libgl-gdi

Using
-----

Linux
~~~~~

On Linux, building will create a drop-in alternative for ``libGL.so``
into

::

   build/foo/gallium/targets/libgl-xlib/libGL.so

or

::

   lib/gallium/libGL.so

To use it set the ``LD_LIBRARY_PATH`` environment variable accordingly.

For performance evaluation pass ``build=release`` to scons, and use the
corresponding lib directory without the ``-debug`` suffix.

Windows
~~~~~~~

On Windows, building will create
``build/windows-x86-debug/gallium/targets/libgl-gdi/opengl32.dll``,
which is a drop-in alternative for the system's ``opengl32.dll``.  To
use it, put it in the same directory as your application.  It can also
be used by replacing the native ICD driver, but that is quite an
advanced usage, so if you need to ask, don't even try it.

There is however an easy way to replace the OpenGL software renderer
that comes with Microsoft Windows 7 (or later) with llvmpipe (that is,
on systems without any OpenGL drivers):

-  copy
   ``build/windows-x86-debug/gallium/targets/libgl-gdi/opengl32.dll``
   to ``C:\Windows\SysWOW64\mesadrv.dll``

-  load these registry settings:

   ::

      REGEDIT4

      ; https://technet.microsoft.com/en-us/library/cc749368.aspx
      ; https://www.msfn.org/board/topic/143241-portable-windows-7-build-from-winpe-30/page-5#entry942596
      [HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Windows NT\CurrentVersion\OpenGLDrivers\MSOGL]
      "DLL"="mesadrv.dll"
      "DriverVersion"=dword:00000001
      "Flags"=dword:00000001
      "Version"=dword:00000002

-  Ditto for 64 bits drivers if you need them.

Profiling
---------

To profile llvmpipe you should build as:

::

   scons build=profile <same-as-before>

This will ensure that frame pointers are used both in C and JIT
functions, and that no tail call optimizations are done by gcc.

Linux perf integration
~~~~~~~~~~~~~~~~~~~~~~

On Linux, it is possible to have symbol resolution of JIT code with
`Linux perf <https://perf.wiki.kernel.org/>`__:

::

   perf record -g /my/application
   perf report

When run inside Linux perf, llvmpipe will create a
``/tmp/perf-XXXXX.map`` file with a symbol address table.  It also
dumps assembly code to ``/tmp/perf-XXXXX.map.asm``, which can be used
by the ``bin/perf-annotate-jit.py`` script to produce disassembly of
the generated code annotated with the samples.

You can obtain a call graph via
`Gprof2Dot <https://github.com/jrfonseca/gprof2dot#linux-perf>`__.

Unit testing
------------

Building will also create several unit tests in
``build/linux-???-debug/gallium/drivers/llvmpipe``:

-  ``lp_test_blend``: blending
-  ``lp_test_conv``: SIMD vector conversion
-  ``lp_test_format``: pixel unpacking/packing

Some of these tests can output results and benchmarks to a
tab-separated file for later analysis, e.g.:

::

   build/linux-x86_64-debug/gallium/drivers/llvmpipe/lp_test_blend -o blend.tsv

Development Notes
-----------------

-  When looking at this code for the first time, start in
   ``lp_state_fs.c``, and then skim through the ``lp_bld_*`` functions
   called there, and the comments at the top of the ``lp_bld_*.c``
   files.
-  The driver-independent parts of the LLVM / Gallium code are found in
   ``src/gallium/auxiliary/gallivm/``.  The filenames and function
   prefixes need to be renamed from ``lp_bld_`` to something else
   though.
-  We use LLVM-C bindings for now.  They are not documented, but follow
   the C++ interfaces very closely, and appear to be complete enough
   for code generation.  See `this stand-alone
   example <https://npcontemplation.blogspot.com/2008/06/secret-of-llvm-c-bindings.html>`__.
   See the ``llvm-c/Core.h`` file for reference.

.. _recommended_reading:

Recommended Reading
-------------------

-  Rasterization

   -  `Triangle Scan Conversion using 2D Homogeneous
      Coordinates <https://www.cs.unc.edu/~olano/papers/2dh-tri/>`__
   -  `Rasterization on
      Larrabee <http://www.drdobbs.com/parallel/rasterization-on-larrabee/217200602>`__
      (`DevMaster
      copy <http://devmaster.net/posts/2887/rasterization-on-larrabee>`__)
   -  `Rasterization using half-space
      functions <http://devmaster.net/posts/6133/rasterization-using-half-space-functions>`__
   -  `Advanced
      Rasterization <http://devmaster.net/posts/6145/advanced-rasterization>`__
   -  `Optimizing Software Occlusion
      Culling <https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/>`__

-  Texture sampling

   -  `Perspective Texture
      Mapping <http://chrishecker.com/Miscellaneous_Technical_Articles#Perspective_Texture_Mapping>`__
   -  `Texturing As In
      Unreal <https://www.flipcode.com/archives/Texturing_As_In_Unreal.shtml>`__
   -  `Run-Time MIP-Map
      Filtering <http://www.gamasutra.com/view/feature/3301/runtime_mipmap_filtering.php>`__
   -  `Will "brilinear" filtering
      persist? <http://alt.3dcenter.org/artikel/2003/10-26_a_english.php>`__
   -  `Trilinear
      filtering <http://ixbtlabs.com/articles2/gffx/nv40-rx800-3.html>`__
   -  `Texture
      Swizzling <http://devmaster.net/posts/12785/texture-swizzling>`__

-  SIMD

   -  `Whole-Function
      Vectorization <http://www.cdl.uni-saarland.de/projects/wfv/#header4>`__

-  Optimization

   -  `Optimizing Pixomatic For Modern x86
      Processors <http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-proc/184405807>`__
   -  `Intel 64 and IA-32 Architectures Optimization Reference
      Manual <http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html>`__
   -  `Software optimization
      resources <http://www.agner.org/optimize/>`__
   -  `Intel Intrinsics
      Guide <https://software.intel.com/en-us/articles/intel-intrinsics-guide>`__

-  LLVM

   -  `LLVM Language Reference
      Manual <http://llvm.org/docs/LangRef.html>`__
   -  `The secret of LLVM C
      bindings <https://npcontemplation.blogspot.co.uk/2008/06/secret-of-llvm-c-bindings.html>`__

-  General

   -  `A trip through the Graphics
      Pipeline <https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/>`__
   -  `WARP Architecture and
      Performance <https://msdn.microsoft.com/en-us/library/gg615082.aspx#architecture>`__

OpenSWR
=======

The Gallium OpenSWR driver is a high performance, highly scalable
software renderer targeted towards visualization workloads.  For such
geometry heavy workloads there is a considerable speedup over llvmpipe,
which is to be expected, as the geometry frontend of llvmpipe is single
threaded.

This rasterizer is x86 specific and requires AVX or above.  The driver
fits into the gallium framework, and reuses gallivm for doing the TGSI
to vectorized llvm-IR conversion of the shader kernels.

.. toctree::
   :glob:

   openswr/usage
   openswr/faq
   openswr/profiling
   openswr/knobs
@@ -1,141 +0,0 @@

FAQ
===

Why another software rasterizer?
--------------------------------

Good question, given there are already three (swrast, softpipe,
llvmpipe) in the Mesa tree. Two important reasons for this:

* Architecture - given our focus on scientific visualization, our
  workloads are much different than the typical game; we have heavy
  vertex load and relatively simple shaders. In addition, the core
  counts of machines we run on are much higher. These parameters led
  to design decisions much different than llvmpipe.

* Historical - Intel had developed a high performance software
  graphics stack for internal purposes. Later we adapted this
  graphics stack for use in visualization and decided to move forward
  with Mesa to provide a high quality API layer while at the same
  time benefiting from the excellent performance the software
  rasterizer gives us.

What's the architecture?
------------------------

SWR is a tile-based immediate mode renderer with a sort-free threading
model which is arranged as a ring of queues. Each entry in the ring
represents a draw context that contains all of the draw state and work
queues. An API thread sets up each draw context, and worker threads
will execute both the frontend (vertex/geometry processing) and
backend (fragment) work as required. The ring allows backend
threads to pull work in order. Large draws are split into chunks to
allow vertex processing to happen in parallel, with the backend work
pickup preserving draw ordering.

Our pipeline uses just-in-time compiled code for the fetch shader that
does vertex attribute gathering and AOS to SOA conversions, the vertex
shader and fragment shaders, streamout, and fragment blending. SWR
core also supports geometry and compute shaders but we haven't exposed
them through our driver yet. The fetch shader, streamout, and blend are
built internally to swr core using LLVM directly, while for the vertex
and pixel shaders we reuse bits of llvmpipe from
``gallium/auxiliary/gallivm`` to build the kernels, which we wrap
differently than llvmpipe's ``auxiliary/draw`` code.

What's the performance?
-----------------------

For the types of high-geometry workloads we're interested in, we are
significantly faster than llvmpipe. This is to be expected, as
llvmpipe only threads the fragment processing and not the geometry
frontend. The performance advantage over llvmpipe roughly scales
linearly with the number of cores available.

While our current performance is quite good, we know there is more
potential in this architecture. When we switched from a prototype
OpenGL driver to Mesa we regressed performance severely, some due to
interface issues that need tuning, some due to differences in shader
code generation, and some due to conformance and feature additions to
the core swr. We are looking to recover most of this performance.

What's the conformance?
-----------------------

The major applications we are targeting are all based on the
Visualization Toolkit (VTK), and as such our development efforts have
been focused on making sure these work as well as possible. Our
current code passes VTK's rendering tests with their new "OpenGL2"
(really OpenGL 3.2) backend at 99%.

piglit testing shows a much lower pass rate, roughly 80% at the time
of writing. Core SWR undergoes rigorous unit testing and we are quite
confident in the rasterizer, and understand the areas where it
currently has issues (example: line rendering is done with triangles,
so doesn't match the strict line rendering rules). The majority of
the piglit failures are errors in our driver layer interfacing Mesa
and SWR. Fixing these issues is one of our major future development
goals.

Why are you open sourcing this?
-------------------------------

* Our customers prefer open source, and allowing them to simply
  download the Mesa source and enable our driver makes life much
  easier for them.

* The internal gallium APIs are not stable, so we'd like our driver
  to be visible for changes.

* It's easier to work with the Mesa community when the source we're
  working with can be used as reference.

What are your development plans?
--------------------------------

* Performance - see the performance section earlier for details.

* Conformance - see the conformance section earlier for details.

* Features - core SWR has a lot of functionality we have yet to
  expose through our driver, such as MSAA, geometry shaders, compute
  shaders, and tessellation.

* AVX512 support

What is the licensing of the code?
----------------------------------

* All code is under the normal Mesa MIT license.

Will this work on AMD?
----------------------

* If using an AMD processor with AVX or AVX2, it should work, though
  we don't have that hardware around to test. Patches if needed
  would be welcome.

Will this work on ARM, MIPS, POWER, <other non-x86 architecture>?
-----------------------------------------------------------------

* Not without a lot of work. We make extensive use of AVX and AVX2
  intrinsics in our code and the in-tree JIT creation. It is not the
  intention for this codebase to support non-x86 architectures.

What hardware do I need?
------------------------

* Any x86 processor with at least AVX (introduced in the Intel
  SandyBridge and AMD Bulldozer microarchitectures in 2011) will
  work.

* You don't need a fire-breathing Xeon machine to work on SWR - we do
  day-to-day development with laptops and desktop CPUs.

Does one build work on both AVX and AVX2?
-----------------------------------------

Yes. The build system creates two shared libraries, ``libswrAVX.so`` and
``libswrAVX2.so``, and ``swr_create_screen()`` loads the appropriate one at
runtime.

@@ -1,114 +0,0 @@

Knobs
=====

OpenSWR has a number of environment variables which control its
operation, in addition to the normal Mesa and gallium controls.

.. envvar:: KNOB_ENABLE_ASSERT_DIALOGS <bool> (true)

   Use dialogs when asserts fire. Asserts are only enabled in debug
   builds.

.. envvar:: KNOB_SINGLE_THREADED <bool> (false)

   If enabled, will perform all rendering on the API thread. This is
   useful mainly for debugging purposes.

.. envvar:: KNOB_DUMP_SHADER_IR <bool> (false)

   Dumps shader LLVM IR at various stages of JIT compilation.

.. envvar:: KNOB_USE_GENERIC_STORETILE <bool> (false)

   Always use the generic function for performing StoreTile. Will be
   slightly slower than using the optimized (jitted) path.

.. envvar:: KNOB_FAST_CLEAR <bool> (true)

   Replace 3D primitive execute with a SWRClearRT operation and defer
   clear execution to the first backend op on the hottile, or the
   hottile store.

.. envvar:: KNOB_MAX_NUMA_NODES <uint32_t> (0)

   Maximum # of NUMA nodes per system used for worker threads.
   0 == ALL NUMA nodes in the system. N == Use at most N NUMA nodes
   for rendering.

.. envvar:: KNOB_MAX_CORES_PER_NUMA_NODE <uint32_t> (0)

   Maximum # of cores per NUMA node used for worker threads.
   0 == ALL non-API thread cores per NUMA node. N == Use at most N
   cores per NUMA node.

.. envvar:: KNOB_MAX_THREADS_PER_CORE <uint32_t> (1)

   Maximum # of (hyper)threads per physical core used for worker
   threads. 0 == ALL hyper-threads per core. N == Use at most N
   hyper-threads per physical core.

.. envvar:: KNOB_MAX_WORKER_THREADS <uint32_t> (0)

   Maximum worker threads to spawn. IMPORTANT: If this is non-zero,
   no worker threads will be bound to specific HW threads. They will
   all be "floating" SW threads. In this case, the above 3 knobs will
   be ignored.

.. envvar:: KNOB_BUCKETS_START_FRAME <uint32_t> (1200)

   Frame from which to start saving buckets data. NOTE:
   KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have
   an effect.

.. envvar:: KNOB_BUCKETS_END_FRAME <uint32_t> (1400)

   Frame at which to stop saving buckets data. NOTE:
   KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have
   an effect.

.. envvar:: KNOB_WORKER_SPIN_LOOP_COUNT <uint32_t> (5000)

   Number of spin-loop iterations worker threads will perform before
   going to sleep when waiting for work.

.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT <uint32_t> (160)

   Maximum number of draws outstanding before the API thread blocks.

.. envvar:: KNOB_MAX_PRIMS_PER_DRAW <uint32_t> (2040)

   Maximum primitives in a single Draw(). Larger draws are split into
   smaller Draw calls. Should be a multiple of (3 * vectorWidth).

.. envvar:: KNOB_MAX_TESS_PRIMS_PER_DRAW <uint32_t> (16)

   Maximum primitives in a single Draw() with tessellation enabled.
   Larger draws are split into smaller Draw calls. Should be a
   multiple of (vectorWidth).

.. envvar:: KNOB_MAX_FRAC_ODD_TESS_FACTOR <float> (63.0f)

   (DEBUG) Maximum tessellation factor for fractional-odd
   partitioning.

.. envvar:: KNOB_MAX_FRAC_EVEN_TESS_FACTOR <float> (64.0f)

   (DEBUG) Maximum tessellation factor for fractional-even
   partitioning.

.. envvar:: KNOB_MAX_INTEGER_TESS_FACTOR <uint32_t> (64)

   (DEBUG) Maximum tessellation factor for integer partitioning.

.. envvar:: KNOB_BUCKETS_ENABLE_THREADVIZ <bool> (false)

   Enable threadviz output.

.. envvar:: KNOB_TOSS_DRAW <bool> (false)

   Disable per-draw/dispatch execution.

.. envvar:: KNOB_TOSS_QUEUE_FE <bool> (false)

   Stop per-draw execution at the worker FE. NOTE: Requires
   KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.

.. envvar:: KNOB_TOSS_FETCH <bool> (false)

   Stop per-draw execution at vertex fetch. NOTE: Requires
   KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.

.. envvar:: KNOB_TOSS_IA <bool> (false)

   Stop per-draw execution at the input assembler. NOTE: Requires
   KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.

.. envvar:: KNOB_TOSS_VS <bool> (false)

   Stop per-draw execution at the vertex shader. NOTE: Requires
   KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.

.. envvar:: KNOB_TOSS_SETUP_TRIS <bool> (false)

   Stop per-draw execution at primitive setup. NOTE: Requires
   KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.

.. envvar:: KNOB_TOSS_BIN_TRIS <bool> (false)

   Stop per-draw execution at primitive binning. NOTE: Requires
   KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.

.. envvar:: KNOB_TOSS_RS <bool> (false)

   Stop per-draw execution at the rasterizer. NOTE: Requires
   KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
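Knobs are plain environment variables, so they can be set in the shell
before launching an application. A hypothetical debugging run might look
like this (``./my_app`` is a placeholder, not a real binary):

```shell
# Hypothetical debugging run: render on the API thread only and dump the
# LLVM IR of each shader.  "./my_app" is a placeholder application.
export KNOB_SINGLE_THREADED=true
export KNOB_DUMP_SHADER_IR=true
# ./my_app
echo "single-threaded: $KNOB_SINGLE_THREADED"
```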
@@ -1,67 +0,0 @@

Profiling
=========

OpenSWR contains built-in profiling which can be enabled
at build time to provide insight into performance tuning.

To enable this, uncomment the following line in ``rasterizer/core/knobs.h`` and rebuild: ::

    //#define KNOB_ENABLE_RDTSC

Running an application will result in a ``rdtsc.txt`` file being
created in the current working directory. This file contains profile
information captured between ``KNOB_BUCKETS_START_FRAME`` and
``KNOB_BUCKETS_END_FRAME`` (see the knobs section).

The resulting file will contain sections for each thread with a
hierarchical breakdown of the time spent in the various operations.
For example: ::

    Thread 0 (API)
      %Tot   %Par       Cycles    CPE  NumEvent  CPE2  NumEvent2  Bucket
      0.00   0.00        28370   2837        10     0          0  APIClearRenderTarget
      0.00  41.23        11698   1169        10     0          0   |-> APIDrawWakeAllThreads
      0.00  18.34         5202    520        10     0          0   |-> APIGetDrawContext
     98.72  98.72  12413773688  29957    414380     0          0  APIDraw
      0.36   0.36     44689364    107    414380     0          0   |-> APIDrawWakeAllThreads
     96.36  97.62  12117951562   9747   1243140     0          0   |-> APIGetDrawContext
      0.00   0.00        19904    995        20     0          0  APIStoreTiles
      0.00   7.88         1568     78        20     0          0   |-> APIDrawWakeAllThreads
      0.00  25.28         5032    251        20     0          0   |-> APIGetDrawContext
      1.28   1.28    161344902     64   2486370     0          0  APIGetDrawContext
      0.00   0.00        50368   2518        20     0          0  APISync
      0.00   2.70         1360     68        20     0          0   |-> APIDrawWakeAllThreads
      0.00  65.27        32876   1643        20     0          0   |-> APIGetDrawContext


    Thread 1 (WORKER)
      %Tot   %Par       Cycles      CPE  NumEvent  CPE2  NumEvent2  Bucket
     83.92  83.92  13198987522    96411    136902     0          0  FEProcessDraw
     24.91  29.69   3918184840      167  23410158     0          0   |-> FEFetchShader
     11.17  13.31   1756972646       75  23410158     0          0   |-> FEVertexShader
      8.89  10.59   1397902996       59  23410161     0          0   |-> FEPAAssemble
     19.06  22.71   2997794710      384   7803387     0          0   |-> FEClipTriangles
     11.67  61.21   1834958176      235   7803387     0          0    |-> FEBinTriangles
      0.00   0.00            0        0    187258     0          0     |-> FECullZeroAreaAndBackface
      0.00   0.00            0        0  60051033     0          0     |-> FECullBetweenCenters
      0.11   0.11     17217556  2869592         6     0          0  FEProcessStoreTiles
     15.97  15.97   2511392576    73665     34092     0          0  WorkerWorkOnFifoBE
     14.04  87.95   2208687340     9187    240408     0          0   |-> WorkerFoundWork
      0.06   0.43      9390536    13263       708     0          0    |-> BELoadTiles
      0.00   0.01       293020      182      1609     0          0    |-> BEClear
     12.63  89.94   1986508990      949   2093014     0          0    |-> BERasterizeTriangle
      2.37  18.75    372374596      177   2093014     0          0     |-> BETriangleSetup
      0.42   3.35     66539016       31   2093014     0          0     |-> BEStepSetup
      0.00   0.00            0        0     21766     0          0     |-> BETrivialReject
      1.05   8.33    165410662       79   2071248     0          0     |-> BERasterizePartial
      6.06  48.02    953847796     1260    756783     0          0     |-> BEPixelBackend
      0.20   3.30     31521202       41    756783     0          0      |-> BESetup
      0.16   2.69     25624304       33    756783     0          0      |-> BEBarycentric
      0.18   2.92     27884986       36    756783     0          0      |-> BEEarlyDepthTest
      0.19   3.20     30564174       41    744058     0          0      |-> BEPixelShader
      0.26   4.30     41058646       55    744058     0          0      |-> BEOutputMerger
      1.27  20.94    199750822       32   6054264     0          0      |-> BEEndTile
      0.33   2.34     51758160    23687      2185     0          0    |-> BEStoreTiles
      0.20  60.22     31169500    28807      1082     0          0     |-> B8G8R8A8_UNORM
      0.00   0.00       302752   302752         1     0          0  WorkerWaitForThreadEvent
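Because child buckets are marked with ``|->``, the capture is easy to
filter with standard tools. As a hypothetical sketch (the two sample rows
below are fabricated for illustration, not real profile data), the
top-level buckets can be pulled out like this:

```shell
# Hypothetical sketch: list only the top-level buckets (rows whose Bucket
# column is not a "|->" child) from an rdtsc.txt-style capture.
# The two sample rows are made up for illustration.
printf '98.72 98.72 12413773688 29957 414380 0 0 APIDraw\n0.36 0.36 44689364 107 414380 0 0 |-> APIDrawWakeAllThreads\n' > sample.txt
grep -v -- '|->' sample.txt
```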
@@ -1,44 +0,0 @@

Usage
=====

Requirements
^^^^^^^^^^^^

* An x86 processor with AVX or above
* LLVM version 3.9 or later
* A C++14 capable compiler

Building
^^^^^^^^

To build with GNU automake, select the swr driver at configure time,
for example: ::

    configure --with-gallium-drivers=swrast,swr

Using
^^^^^

On Linux, building with autotools will create a drop-in alternative
for libGL.so in::

    lib/gallium/libGL.so
    lib/gallium/libswrAVX.so
    lib/gallium/libswrAVX2.so

Alternatively, building with SCons will produce::

    build/linux-x86_64/gallium/targets/libgl-xlib/libGL.so
    build/linux-x86_64/gallium/drivers/swr/libswrAVX.so
    build/linux-x86_64/gallium/drivers/swr/libswrAVX2.so

To use it, set the LD_LIBRARY_PATH environment variable accordingly.

**IMPORTANT:** Mesa defaults to llvmpipe or softpipe as its software
renderer. To select the OpenSWR driver instead, set the GALLIUM_DRIVER
environment variable: ::

    GALLIUM_DRIVER=swr

To verify OpenSWR is being used, check that a message like the
following is printed when the application starts: ::

    SWR detected AVX2
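Putting the pieces together, a hypothetical launch looks like this (the
library path assumes the autotools layout above, and glxinfo stands in
for any GL application):

```shell
# Hypothetical launch: point the dynamic loader at the gallium build and
# select OpenSWR.  The library path assumes the autotools layout above.
export LD_LIBRARY_PATH="lib/gallium:$LD_LIBRARY_PATH"
export GALLIUM_DRIVER=swr
# glxinfo | grep "renderer string"   # should mention SWR
echo "driver: $GALLIUM_DRIVER"
```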
@@ -1,102 +0,0 @@

Zink
====

Overview
--------

The Zink driver is a Gallium driver that emits Vulkan API calls instead
of targeting a specific GPU architecture. This can be used to get full
desktop OpenGL support on devices that only support Vulkan.

Features
--------

The feature level of Zink depends on two things: what's implemented in
Zink, and the features of the underlying Vulkan driver.

OpenGL 2.1
^^^^^^^^^^

OpenGL 2.1 is the minimum version Zink can support, and will always be
exposed, given Vulkan support. There are a few features that are
required for correct behavior, but not all of these are validated; if
they are missing you'll see rendering issues and likely validation
errors, or even crashes.

Here's a list of those requirements:

* Vulkan 1.0
* ``VkPhysicalDeviceFeatures``:

  * ``logicOp``
  * ``fillModeNonSolid``
  * ``wideLines``
  * ``largePoints``
  * ``alphaToOne``
  * ``shaderClipDistance``

* ``VkPhysicalDeviceLimits``:

  * ``maxClipDistances`` ≥ 6

* Instance extensions:

  * `VK_KHR_get_physical_device_properties2`_
  * `VK_KHR_external_memory_capabilities`_

* Device extensions:

  * `VK_KHR_maintenance1`_
  * `VK_KHR_external_memory`_

OpenGL 3.0
^^^^^^^^^^

For OpenGL 3.0 support, the following additional device extensions are
required to be exposed and fully supported:

* `VK_EXT_transform_feedback`_
* `VK_EXT_conditional_rendering`_

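One way to check the extension lists above against a real device is to
filter the output of ``vulkaninfo`` (a hypothetical sketch; it assumes
the Vulkan tools are installed, so the live query is left commented out
and the grep is demonstrated on sample input instead):

```shell
# Hypothetical check: does the Vulkan driver expose the device extensions
# Zink needs for OpenGL 3.0?  Live query (requires vulkaninfo):
# vulkaninfo | grep -E 'VK_EXT_transform_feedback|VK_EXT_conditional_rendering'
# Demonstrated here on sample input; counts the matching extension names.
printf 'VK_EXT_transform_feedback\nVK_EXT_conditional_rendering\n' | grep -c 'VK_EXT_'   # prints 2
```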
Debugging
---------

There are a few tools that are useful for debugging Zink, like this
environment variable:

.. envvar:: ZINK_DEBUG <flags> ("")

   ``nir``
      Print the NIR form of all shaders to stderr.
   ``spirv``
      Write the binary SPIR-V form of all compiled shaders to a file in
      the current directory, and print a message with the filename to
      stderr.
   ``tgsi``
      Print the TGSI form of TGSI shaders to stderr.

Vulkan Validation Layers
^^^^^^^^^^^^^^^^^^^^^^^^

Another useful tool for debugging is the `Vulkan Validation Layers
<https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/master/README.md>`_.

The validation layers effectively insert extra checking between Zink and the
Vulkan driver, pointing out incorrect usage of the Vulkan API. The layers can
be enabled by setting the environment variable :envvar:`VK_INSTANCE_LAYERS` to
"VK_LAYER_KHRONOS_validation". You can read more about the Validation Layers
in the link above.

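As a hypothetical sketch, the debug switches and the validation layer
combine like this (the comma-separated flag syntax and ``./my_gl_app``
are illustrative assumptions):

```shell
# Hypothetical debugging session: dump NIR and SPIR-V from Zink and run the
# Khronos validation layer.  The flag separator and "./my_gl_app" are
# illustrative assumptions.
export ZINK_DEBUG=nir,spirv
export VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation
# ./my_gl_app 2> zink-debug.log
echo "ZINK_DEBUG=$ZINK_DEBUG"
```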
IRC
---

In order to make things a bit easier to follow, we have decided to create our
own IRC channel. If you're interested in contributing, or have any technical
questions, don't hesitate to visit `#zink on FreeNode
<irc://irc.freenode.net/zink>`_ and say hi!


.. _VK_KHR_get_physical_device_properties2: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_get_physical_device_properties2.html
.. _VK_KHR_external_memory_capabilities: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_external_memory_capabilities.html
.. _VK_KHR_maintenance1: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_maintenance1.html
.. _VK_KHR_external_memory: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_external_memory.html
.. _VK_EXT_transform_feedback: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_transform_feedback.html
.. _VK_EXT_conditional_rendering: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_conditional_rendering.html
@@ -15,7 +15,6 @@ Contents:

   context
   cso
   distro
   drivers
   postprocess
   glossary
