docs: Move the gallium driver documentation to the top level.

I actually had never found these, buried under Developer Topics -> Gallium
-> Drivers.  Given that driver documentation contains not just gallium
driver documentation but also end-user information, bring it to a much
more prominent location between User Topics and Developer Topics at the
top level.

Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7174>
Eric Anholt
2020-10-16 10:35:17 -07:00
committed by Marge Bot
parent 9a644d7017
commit 313f951f1b
14 changed files with 23 additions and 14 deletions


@@ -1,9 +0,0 @@
Drivers
=======

Driver specific documentation.

.. toctree::
   :glob:

   drivers/*


@@ -1,9 +0,0 @@
Freedreno
=========

Freedreno driver specific docs.

.. toctree::
   :glob:

   freedreno/*


@@ -1,432 +0,0 @@
IR3 NOTES
=========
Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx. The same shader ISA is present, with some small differences, in adreno a4xx.
Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set. However, the compiler is responsible, in most cases, for scheduling the instructions. The hardware does not try to hide the shader core pipeline stages. For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or nops). When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots, although that leaves a lot of edge cases where things fall over, like:
::
ADD TEMP[0], TEMP[1], TEMP[2]
MUL TEMP[0], TEMP[1], TEMP[0].wzyx
Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and the ``mul r0.x, r1.x, r0.w``. This is why the original (old) compiler, which translated nearly literally from TGSI to ir3, had a strong tendency to fall over.
So the current compiler instead generates, in the frontend, a directed acyclic graph of instructions and basic blocks, which then goes through various additional passes to eventually schedule instructions and do register assignment.
For additional documentation about the hardware, see wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.
External Structure
------------------
``ir3_shader``
A single vertex/fragment/etc shader from the gallium perspective (ie.
maps to a single TGSI shader), and manages a set of shader variants
which are generated on demand based on the shader key.
``ir3_shader_key``
The configuration key that identifies a shader variant. Ie. based
on other GL state (two-sided-color, render-to-alpha, etc) or render
stages (binning-pass vertex shader) different shader variants are
generated.
``ir3_shader_variant``
The actual hw shader generated based on input TGSI and shader key.
``ir3_compiler``
Compiler frontend which generates ir3 and runs the various backend
stages to schedule and do register assignment.
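The relationship between these objects can be sketched roughly as follows.
Only the type names come from the description above; the fields are
illustrative assumptions, not the actual ir3 definitions:

.. code-block:: c

   /* Illustrative sketch only -- field names are assumptions. */
   struct ir3_shader_key {
      /* bits of GL state / render stage that select a variant
       * (two-sided-color, binning-pass, etc) */
      unsigned flags;
   };

   struct ir3_shader_variant {
      struct ir3_shader_key key;       /* the state this variant was built for */
      struct ir3_shader_variant *next; /* next variant of the same shader */
      /* ... generated instructions, register footprint, etc ... */
   };

   struct ir3_shader {
      const struct tgsi_token *tokens;     /* the gallium-level (TGSI) source */
      struct ir3_shader_variant *variants; /* variants generated on demand */
   };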
The IR
------
The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s). But there are a few extensions, in the form of meta_ instructions. And additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value. So, for example, the following TGSI shader:
::
VERT
DCL IN[0]
DCL IN[1]
DCL OUT[0], POSITION
DCL TEMP[0], LOCAL
1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
2: MOV OUT[0], TEMP[0].xxxx
3: END
eventually generates:
.. graphviz::
digraph G {
rankdir=RL;
nodesep=0.25;
ranksep=1.5;
subgraph clusterdce198 {
label="vert";
inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
inputdce198:<in2>:w -> instrdcedd0:<src0>
inputdce198:<in6>:w -> instrdcedd0:<src1>
instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
inputdce198:<in1>:w -> instrdcec30:<src0>
inputdce198:<in5>:w -> instrdcec30:<src1>
instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
inputdce198:<in0>:w -> instrdceb60:<src0>
inputdce198:<in4>:w -> instrdceb60:<src1>
instrdceb60:<dst0> -> instrdcec30:<src2>
instrdcec30:<dst0> -> instrdcedd0:<src2>
instrdcedd0:<dst0> -> instrdcf348:<src0>
instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
instrdcedd0:<dst0> -> instrdcf400:<src0>
instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
instrdcedd0:<dst0> -> instrdcf4b8:<src0>
outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
instrdcf348:<dst0> -> outputdce198:<out0>:e
instrdcf400:<dst0> -> outputdce198:<out1>:e
instrdcf4b8:<dst0> -> outputdce198:<out2>:e
instrdcedd0:<dst0> -> outputdce198:<out3>:e
}
}
(after scheduling, etc, but before register assignment).
Internal Structure
~~~~~~~~~~~~~~~~~~
``ir3_block``
Represents a basic block.
TODO: currently blocks are nested, but I think I need to change that
to a more conventional arrangement before implementing proper flow
control. Currently the only flow control handled is if/else, which
gets flattened out with the results chosen via ``sel`` instructions.
``ir3_instruction``
Represents a machine instruction or meta_ instruction. Has pointers
to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
as needed.
``ir3_register``
Represents a src or dst register, flags indicate const/relative/etc.
If ``IR3_REG_SSA`` is set on a src register, the actual register
number (name) has not been assigned yet, and instead the ``instr``
field points to the src instruction.
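Roughly, the shapes of these structures look like the following sketch.
Only the fields named above are meaningful; the exact types and anything
else shown are illustrative assumptions:

.. code-block:: c

   /* Sketch only -- not the real ir3 definitions. */
   struct ir3_register {
      unsigned flags;      /* IR3_REG_SSA, const, relative, abs/neg, ... */
      union {
         unsigned num;                  /* register name, once assigned */
         struct ir3_instruction *instr; /* producing instruction, if IR3_REG_SSA */
      };
   };

   struct ir3_instruction {
      unsigned opc;               /* machine or meta opcode (placeholder) */
      unsigned regs_count;        /* number of entries in regs[] */
      struct ir3_register **regs; /* regs[0] = dst, regs[1..n] = srcs */
   };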
In addition there are various util macros/functions to simplify manipulation/traversal of the graph:
``foreach_src(srcreg, instr)``
Iterate each instruction's source ``ir3_register``\s
``foreach_src_n(srcreg, n, instr)``
Like ``foreach_src``, also setting ``n`` to the source number (starting
with ``0``).
``foreach_ssa_src(srcinstr, instr)``
Iterate each instruction's SSA source ``ir3_instruction``\s. This skips
non-SSA sources (consts, etc), but includes virtual sources (such as the
address register if `relative addressing`_ is used).
``foreach_ssa_src_n(srcinstr, n, instr)``
Like ``foreach_ssa_src``, also setting ``n`` to the source number.
For example:
.. code-block:: c

   foreach_ssa_src_n(src, i, instr) {
      unsigned d = delay_calc_srcn(ctx, src, instr, i);
      delay = MAX2(delay, d);
   }
TODO probably other helper/util stuff worth mentioning here
.. _meta:
Meta Instructions
~~~~~~~~~~~~~~~~~
**input**
Used for shader inputs (registers configured in the command-stream
to hold particular input values, written by the shader core before
start of execution). Also used for connecting up values within a
basic block to an output of a previous block.
**output**
Used to hold outputs of a basic block.
**flow**
TODO
**phi**
TODO
**fanin**
Groups registers which need to be assigned to consecutive scalar
registers, for example `sam` (texture fetch) src instructions (see
`register groups`_) or array element dereference
(see `relative addressing`_).
**fanout**
The counterpart to **fanin**: when an instruction such as `sam`
writes multiple components, it splits the result into individual
scalar components to be consumed by other instructions.
.. _`flow control`:
Flow Control
~~~~~~~~~~~~
TODO
.. _`register groups`:
Register Groups
~~~~~~~~~~~~~~~
Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers. In the simplest example:
::
sam (f32)(xyz)r2.x, r0.z, s#0, t#0
for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.
Before register assignment, to group the two components of the texture src together:
.. graphviz::
digraph G {
{ rank=same;
fanin;
};
{ rank=same;
coord_x;
coord_y;
};
sam -> fanin [label="regs[1]"];
fanin -> coord_x [label="regs[1]"];
fanin -> coord_y [label="regs[2]"];
coord_x -> coord_y [label="right",style=dotted];
coord_y -> coord_x [label="left",style=dotted];
coord_x [label="coord.x"];
coord_y [label="coord.y"];
}
The frontend sets up the SSA pointers from the ``sam`` source register to the ``fanin`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values. And the grouping_ pass sets up the ``left`` and ``right`` neighbor pointers of the ``fanin``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.
And likewise, for the consecutive scalar registers for the destination:
.. graphviz::
digraph {
{ rank=same;
A;
B;
C;
};
{ rank=same;
fanout_0;
fanout_1;
fanout_2;
};
A -> fanout_0;
B -> fanout_1;
C -> fanout_2;
fanout_0 [label="fanout\noff=0"];
fanout_0 -> sam;
fanout_1 [label="fanout\noff=1"];
fanout_1 -> sam;
fanout_2 [label="fanout\noff=2"];
fanout_2 -> sam;
fanout_0 -> fanout_1 [label="right",style=dotted];
fanout_1 -> fanout_0 [label="left",style=dotted];
fanout_1 -> fanout_2 [label="right",style=dotted];
fanout_2 -> fanout_1 [label="left",style=dotted];
sam;
}
.. _`relative addressing`:
Relative Addressing
~~~~~~~~~~~~~~~~~~~
Most instructions support addressing indirectly (relative to address register) into const or gpr register file in some or all of their src/dst registers. In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, ie. address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).
Note that cat5 (texture sample) instructions are the notable exception, not
supporting relative addressing of src or dst.
Relative addressing of the const file (for example, a uniform array) is relatively simple. We don't do register assignment of the const file, so all that is required is to schedule things properly. Ie. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.
But relative addressing of the gpr file (which can be used for src or dst) has additional restrictions on register assignment (ie. the array elements must be assigned to consecutive scalar registers). And in the case of a relative dst, subsequent instructions now depend on both the relative write, as well as the previous instruction which wrote that register, since we do not know at compile time which actual register was written.
Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s). This behaves as an additional virtual src register, ie. ``foreach_ssa_src()`` will also iterate the address register (last).
Note that ``nop``\'s for timing constraints, type specifiers (ie.
``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples
::
mova a0.x, hr1.y
sub r1.y, r2.x, r3.x
add r0.x, r1.y, c<a0.x + 2>
results in:
.. graphviz::
digraph {
rankdir=LR;
sub;
const [label="const file"];
add;
mova;
add -> mova;
add -> sub;
add -> const [label="off=2"];
}
The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.
To implement variable arrays, values are stored in consecutive scalar registers. This has some overlap with `register groups`_, in that ``fanin`` and ``fanout`` are used to help group things for the `register assignment`_ pass.
Using a variable array as a src register is a slight variation of what is done for a const array src. The instruction src is a `fanin` instruction that groups all the array members:
::
mova a0.x, hr1.y
sub r1.y, r2.x, r3.x
add r0.x, r1.y, r<a0.x + 2>
results in:
.. graphviz::
digraph {
a0 [label="r0.z"];
a1 [label="r0.w"];
a2 [label="r1.x"];
a3 [label="r1.y"];
sub;
fanin;
mova;
add;
add -> sub;
add -> fanin [label="off=2"];
add -> mova;
fanin -> a0;
fanin -> a1;
fanin -> a2;
fanin -> a3;
}
TODO better describe how actual deref offset is derived, ie. based on array base register.
To do an indirect write to a variable array, a ``fanout`` is used. Say the array was assigned to registers ``r0.z`` through ``r1.y`` (hence the constant offset of 2):
Note that only cat1 (mov) can do an indirect write.
::
mova a0.x, hr1.y
min r2.x, r2.x, c0.x
mov r<a0.x + 2>, r2.x
mul r0.x, r0.z, c0.z
In this case, the ``mov`` instruction does not write all elements of the array (compared to the usage of ``fanout`` for ``sam`` instructions in grouping_). But the ``mov`` instruction does need an additional dependency (via ``fanin``) on the instructions that last wrote the array element members, to ensure that they get scheduled before the ``mov`` in the scheduling_ stage (which also serves to group the array elements for the `register assignment`_ stage).
.. graphviz::
digraph {
a0 [label="r0.z"];
a1 [label="r0.w"];
a2 [label="r1.x"];
a3 [label="r1.y"];
min;
mova;
mov;
mul;
fanout [label="fanout\noff=0"];
mul -> fanout;
fanout -> mov;
fanin;
fanin -> a0;
fanin -> a1;
fanin -> a2;
fanin -> a3;
mov -> min;
mov -> mova;
mov -> fanin;
}
Note that there would in fact be ``fanout`` nodes generated for each array element (although only the reachable ones will be scheduled, etc).
Shader Passes
-------------
After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling_ and `register assignment`_. Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.
Note that we essentially have ~256 scalar registers in the
architecture (although larger register usage will at some thresholds
limit the number of threads which can run in parallel). And at some
point we will have to deal with spilling.
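The overall pass pipeline can be summarized with the following pseudo-C.
The function names are placeholders rather than the real ir3 entry points;
only the order, which follows the sections below, is meaningful:

.. code-block:: c

   /* Placeholder names -- only the pass order matters here. */
   static void compile_block(struct ir3_block *block)
   {
      flatten(block);          /* if/else flattened into sel's */
      copy_propagate(block);   /* remove simple mov's */
      group(block);            /* set up left/right neighbor pointers */
      calc_depth(block);       /* depth-sort instructions */
      schedule(block);         /* order instructions, insert nop's */
      assign_registers(block); /* map SSA values onto scalar registers */
   }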
.. _flatten:
Flatten
~~~~~~~
In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions. The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.
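In C terms (purely illustrative, not compiler code), the transformation
replaces the branch with unconditional evaluation of both sides plus a
select:

.. code-block:: c

   /* Before flattening: a branchy if/else. */
   static float before(int cond, float a, float b, float d)
   {
      float x;
      if (cond)
         x = a + b;
      else
         x = d;
      return x;
   }

   /* After flattening: both sides are computed, and the ?: below becomes
    * a sel instruction. */
   static float after(int cond, float a, float b, float d)
   {
      float t0 = a + b; /* "then" side, always computed */
      float t1 = d;     /* "else" side, always computed */
      return cond ? t0 : t1;
   }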
.. _`copy propagation`:
Copy Propagation
~~~~~~~~~~~~~~~~
Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources. And the CP pass simply removes all simple ``mov``\s (ie. src-type is same as dst-type, no abs/neg flags, etc).
The eventual plan is to invert that, with the front-end inserting no ``mov``\s and the CP pass legalizing things.
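The "simple mov" test might look roughly like this. ``is_mov()`` is a
placeholder, and the ``cat1`` type fields and the ``IR3_REG_ABS`` /
``IR3_REG_NEGATE`` flags are assumptions about the ir3 structures rather
than verified field names:

.. code-block:: c

   /* Sketch of the check the CP pass performs before removing a mov. */
   static bool is_eliminable_mov(struct ir3_instruction *instr)
   {
      if (!is_mov(instr))                               /* placeholder */
         return false;
      if (instr->cat1.src_type != instr->cat1.dst_type) /* type conversion */
         return false;
      if (instr->regs[1]->flags & (IR3_REG_ABS | IR3_REG_NEGATE))
         return false;                                  /* src modifiers */
      return true;
   }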
.. _grouping:
Grouping
~~~~~~~~
In the grouping pass, instructions which need to be grouped (for ``fanin``\s, etc) have their ``left`` / ``right`` neighbor pointers set up. In cases where there is a conflict (ie. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted. This ensures that there is some possible valid `register assignment`_ at the later stages.
.. _depth:
Depth
~~~~~
In the depth pass, a depth is calculated for each instruction node within its basic block. The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions (meta_ instructions don't add to the depth). As an instruction's depth is calculated, it is inserted into a per-block list sorted by deepest instruction. Unreachable instructions and inputs are marked.
TODO: we should probably calculate both hard and soft depths (?) to
try to coax additional instructions to fit in places where we need
to use sync bits, such as after a texture fetch or SFU.
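A minimal sketch of the calculation described above, reusing the
``foreach_ssa_src`` iterator and ``MAX2`` seen earlier (``is_meta()`` and
``delay_slots()`` are placeholders for the real helpers):

.. code-block:: c

   /* Depth of an instruction = its own cost (delay slots + 1, zero for
    * meta instructions) plus the max depth of its SSA sources. */
   static unsigned instr_depth(struct ir3_instruction *instr)
   {
      struct ir3_instruction *src;
      unsigned depth = 0;

      foreach_ssa_src(src, instr)
         depth = MAX2(depth, instr_depth(src));

      if (!is_meta(instr))
         depth += delay_slots(instr) + 1;

      return depth;
   }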
.. _scheduling:
Scheduling
~~~~~~~~~~
After the grouping_ pass, there are no more instructions to insert or remove. Each basic block is scheduled starting from the deepest node in the depth-sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots. ``nop``\s are inserted as required.
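Schematically, scheduling one instruction looks something like the sketch
below; all helper names are placeholders, not the real ir3 functions:

.. code-block:: c

   /* Schedule 'instr' after its sources, padding with nop's for any
    * remaining delay slots.  The outer loop (not shown) walks the
    * depth-sorted list from the deepest instruction down. */
   static void sched_instr(struct ir3_block *block, struct ir3_instruction *instr)
   {
      struct ir3_instruction *src;

      foreach_ssa_src(src, instr)
         if (!is_scheduled(src))
            sched_instr(block, src);

      unsigned delay = delay_to_sources(block, instr);
      while (delay--)
         emit_nop(block);
      emit(block, instr);
   }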
.. _`register assignment`:
Register Assignment
~~~~~~~~~~~~~~~~~~~
TODO


@@ -1,284 +0,0 @@
Gallium LLVMpipe Driver
=======================
Introduction
------------
The Gallium llvmpipe driver is a software rasterizer that uses LLVM to
do runtime code generation. Shaders, point/line/triangle rasterization
and vertex processing are implemented with LLVM IR which is translated
to x86, x86-64, or ppc64le machine code. Also, the driver is
multithreaded to take advantage of multiple CPU cores (up to 8 at this
time). It's the fastest software rasterizer for Mesa.
Requirements
------------
- For x86 or amd64 processors, 64-bit mode is recommended. Support for
SSE2 is strongly encouraged. Support for SSE3 and SSE4.1 will yield
the most efficient code. The fewer features the CPU has, the more
likely it is that you will run into underperforming, buggy, or
incomplete code.
For ppc64le processors, use of the Altivec feature (the Vector
Facility) is recommended if supported; use of the VSX feature (the
Vector-Scalar Facility) is recommended if supported AND Mesa is built
with LLVM version 4.0 or later.
See ``/proc/cpuinfo`` to know what your CPU supports.
- Unless otherwise stated, LLVM version 3.4 is recommended; 3.3 or
later is required.
For Linux, on a recent Debian based distribution do:
.. code-block:: console

   aptitude install llvm-dev
If you want development snapshot builds of LLVM for Debian and
derived distributions like Ubuntu, you can use the APT repository at
`apt.llvm.org <https://apt.llvm.org/>`__, which is maintained by
Debian's LLVM maintainer.
For an RPM-based distribution do:

.. code-block:: console

   yum install llvm-devel
For Windows you will need to build LLVM from source with MSVC or
MINGW (either natively or through cross compilers) and CMake, and set
the ``LLVM`` environment variable to the directory you installed it
to. LLVM will be statically linked, so when building on MSVC it needs
to be built with the same CRT as Mesa, and you'll need to pass
``-DLLVM_USE_CRT_xxx=yyy`` as described below.
+-----------------+----------------------------------------------------------------+
| LLVM build-type | Mesa build-type |
| +--------------------------------+-------------------------------+
| | debug,checked | release,profile |
+=================+================================+===============================+
| Debug | ``-DLLVM_USE_CRT_DEBUG=MTd`` | ``-DLLVM_USE_CRT_DEBUG=MT`` |
+-----------------+--------------------------------+-------------------------------+
| Release | ``-DLLVM_USE_CRT_RELEASE=MTd`` | ``-DLLVM_USE_CRT_RELEASE=MT`` |
+-----------------+--------------------------------+-------------------------------+
You can build only the x86 target by passing
``-DLLVM_TARGETS_TO_BUILD=X86`` to cmake.
- scons (optional)
Building
--------
To build everything on Linux invoke scons as:
.. code-block:: console

   scons build=debug libgl-xlib
Alternatively, you can build it with meson with:
.. code-block:: console

   mkdir build
   cd build
   meson -D glx=gallium-xlib -D gallium-drivers=swrast
   ninja
but the rest of these instructions assume that scons is used. For
Windows the procedure is similar except the target:
.. code-block:: console

   scons platform=windows build=debug libgl-gdi
Using
-----
Linux
~~~~~
On Linux, building will create a drop-in alternative for ``libGL.so``
into
::
build/foo/gallium/targets/libgl-xlib/libGL.so
or
::
lib/gallium/libGL.so
To use it set the ``LD_LIBRARY_PATH`` environment variable accordingly.
For performance evaluation pass ``build=release`` to scons, and use the
corresponding lib directory without the ``-debug`` suffix.
Windows
~~~~~~~
On Windows, building will create
``build/windows-x86-debug/gallium/targets/libgl-gdi/opengl32.dll`` which
is a drop-in alternative for the system's ``opengl32.dll``. To use it, put it
in the same directory as your application. It can also be used by
replacing the native ICD driver, but that is quite advanced usage, so if
you need to ask, don't even try it.
There is however an easy way to replace the OpenGL software renderer
that comes with Microsoft Windows 7 (or later) with llvmpipe (that is,
on systems without any OpenGL drivers):
- copy
``build/windows-x86-debug/gallium/targets/libgl-gdi/opengl32.dll`` to
``C:\Windows\SysWOW64\mesadrv.dll``
- load these registry settings:
::
REGEDIT4
; https://technet.microsoft.com/en-us/library/cc749368.aspx
; https://www.msfn.org/board/topic/143241-portable-windows-7-build-from-winpe-30/page-5#entry942596
[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Windows NT\CurrentVersion\OpenGLDrivers\MSOGL]
"DLL"="mesadrv.dll"
"DriverVersion"=dword:00000001
"Flags"=dword:00000001
"Version"=dword:00000002
- Ditto for 64-bit drivers if you need them.
Profiling
---------
To profile llvmpipe you should build as
::
scons build=profile <same-as-before>
This will ensure that frame pointers are used both in C and JIT
functions, and that no tail call optimizations are done by gcc.
Linux perf integration
~~~~~~~~~~~~~~~~~~~~~~
On Linux, it is possible to have symbol resolution of JIT code with
`Linux perf <https://perf.wiki.kernel.org/>`__:
::
perf record -g /my/application
perf report
When run inside Linux perf, llvmpipe will create a
``/tmp/perf-XXXXX.map`` file with the symbol address table. It also dumps
assembly code to ``/tmp/perf-XXXXX.map.asm``, which can be used by the
``bin/perf-annotate-jit.py`` script to produce a disassembly of the
generated code annotated with the samples.
You can obtain a call graph via
`Gprof2Dot <https://github.com/jrfonseca/gprof2dot#linux-perf>`__.
Unit testing
------------
Building will also create several unit tests in
``build/linux-???-debug/gallium/drivers/llvmpipe``:
- ``lp_test_blend``: blending
- ``lp_test_conv``: SIMD vector conversion
- ``lp_test_format``: pixel unpacking/packing
Some of these tests can output results and benchmarks to a tab-separated
file for later analysis, e.g.:
::
build/linux-x86_64-debug/gallium/drivers/llvmpipe/lp_test_blend -o blend.tsv
Development Notes
-----------------
- When looking at this code for the first time, start in lp_state_fs.c,
and then skim through the ``lp_bld_*`` functions called there, and
the comments at the top of the ``lp_bld_*.c`` functions.
- The driver-independent parts of the LLVM / Gallium code are found in
``src/gallium/auxiliary/gallivm/``. The filenames and function
prefixes need to be renamed from ``lp_bld_`` to something else
though.
- We use LLVM-C bindings for now. They are not documented, but follow
the C++ interfaces very closely, and appear to be complete enough for
code generation. See `this stand-alone
example <https://npcontemplation.blogspot.com/2008/06/secret-of-llvm-c-bindings.html>`__.
See the ``llvm-c/Core.h`` file for reference; a small illustrative
example follows this list.
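For a quick flavour of the C bindings (a stand-alone sketch, not code from
the driver), building the IR for a function that adds two floats looks
roughly like this:

.. code-block:: c

   #include <llvm-c/Core.h>

   /* Builds LLVM IR equivalent to: float add(float a, float b) { return a + b; } */
   static LLVMModuleRef build_example_module(void)
   {
      LLVMModuleRef mod = LLVMModuleCreateWithName("example");
      LLVMTypeRef params[] = { LLVMFloatType(), LLVMFloatType() };
      LLVMTypeRef fn_type = LLVMFunctionType(LLVMFloatType(), params, 2, 0);
      LLVMValueRef fn = LLVMAddFunction(mod, "add", fn_type);

      LLVMBuilderRef b = LLVMCreateBuilder();
      LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
      LLVMValueRef sum = LLVMBuildFAdd(b, LLVMGetParam(fn, 0),
                                       LLVMGetParam(fn, 1), "sum");
      LLVMBuildRet(b, sum);
      LLVMDisposeBuilder(b);
      return mod;
   }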
.. _recommended_reading:
Recommended Reading
-------------------
- Rasterization
- `Triangle Scan Conversion using 2D Homogeneous
Coordinates <https://www.cs.unc.edu/~olano/papers/2dh-tri/>`__
- `Rasterization on
Larrabee <http://www.drdobbs.com/parallel/rasterization-on-larrabee/217200602>`__
(`DevMaster
copy <http://devmaster.net/posts/2887/rasterization-on-larrabee>`__)
- `Rasterization using half-space
functions <http://devmaster.net/posts/6133/rasterization-using-half-space-functions>`__
- `Advanced
Rasterization <http://devmaster.net/posts/6145/advanced-rasterization>`__
- `Optimizing Software Occlusion
Culling <https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/>`__
- Texture sampling
- `Perspective Texture
Mapping <http://chrishecker.com/Miscellaneous_Technical_Articles#Perspective_Texture_Mapping>`__
- `Texturing As In
Unreal <https://www.flipcode.com/archives/Texturing_As_In_Unreal.shtml>`__
- `Run-Time MIP-Map
Filtering <http://www.gamasutra.com/view/feature/3301/runtime_mipmap_filtering.php>`__
- `Will "brilinear" filtering
persist? <http://alt.3dcenter.org/artikel/2003/10-26_a_english.php>`__
- `Trilinear
filtering <http://ixbtlabs.com/articles2/gffx/nv40-rx800-3.html>`__
- `Texture
Swizzling <http://devmaster.net/posts/12785/texture-swizzling>`__
- SIMD
- `Whole-Function
Vectorization <http://www.cdl.uni-saarland.de/projects/wfv/#header4>`__
- Optimization
- `Optimizing Pixomatic For Modern x86
Processors <http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-proc/184405807>`__
- `Intel 64 and IA-32 Architectures Optimization Reference
Manual <http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html>`__
- `Software optimization
resources <http://www.agner.org/optimize/>`__
- `Intel Intrinsics
Guide <https://software.intel.com/en-us/articles/intel-intrinsics-guide>`__
- LLVM
- `LLVM Language Reference
Manual <http://llvm.org/docs/LangRef.html>`__
- `The secret of LLVM C
bindings <https://npcontemplation.blogspot.co.uk/2008/06/secret-of-llvm-c-bindings.html>`__
- General
- `A trip through the Graphics
Pipeline <https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/>`__
- `WARP Architecture and
Performance <https://msdn.microsoft.com/en-us/library/gg615082.aspx#architecture>`__


@@ -1,21 +0,0 @@
OpenSWR
=======
The Gallium OpenSWR driver is a high performance, highly scalable
software renderer targeted towards visualization workloads. For such
geometry heavy workloads there is a considerable speedup over llvmpipe,
which is to be expected as the geometry frontend of llvmpipe is single
threaded.
This rasterizer is x86 specific and requires AVX or above. The driver
fits into the gallium framework, and reuses gallivm for doing the TGSI
to vectorized llvm-IR conversion of the shader kernels.
.. toctree::
   :glob:

   openswr/usage
   openswr/faq
   openswr/profiling
   openswr/knobs


@@ -1,141 +0,0 @@
FAQ
===
Why another software rasterizer?
--------------------------------
Good question, given there are already three (swrast, softpipe,
llvmpipe) in the Mesa tree. Two important reasons for this:
* Architecture - given our focus on scientific visualization, our
workloads are much different than the typical game; we have heavy
vertex load and relatively simple shaders. In addition, the core
counts of machines we run on are much higher. These parameters led
to design decisions much different than llvmpipe.
* Historical - Intel had developed a high performance software
graphics stack for internal purposes. Later we adapted this
graphics stack for use in visualization and decided to move forward
with Mesa to provide a high quality API layer while at the same
time benefiting from the excellent performance the software
rasterizer gives us.
What's the architecture?
------------------------
SWR is a tile based immediate mode renderer with a sort-free threading
model which is arranged as a ring of queues. Each entry in the ring
represents a draw context that contains all of the draw state and work
queues. An API thread sets up each draw context and worker threads
will execute both the frontend (vertex/geometry processing) and
backend (fragment) work as required. The ring allows for backend
threads to pull work in order. Large draws are split into chunks to
allow vertex processing to happen in parallel, with the backend work
pickup preserving draw ordering.
Our pipeline uses just-in-time compiled code for the fetch shader that
does vertex attribute gathering and AOS to SOA conversions, the vertex
shader and fragment shaders, streamout, and fragment blending. SWR
core also supports geometry and compute shaders but we haven't exposed
them through our driver yet. The fetch shader, streamout, and blend are
built internally to swr core using LLVM directly, while for the vertex
and pixel shaders we reuse bits of llvmpipe from
``gallium/auxiliary/gallivm`` to build the kernels, which we wrap
differently than llvmpipe's ``auxiliary/draw`` code.
What's the performance?
-----------------------
For the types of high-geometry workloads we're interested in, we are
significantly faster than llvmpipe. This is to be expected, as
llvmpipe only threads the fragment processing and not the geometry
frontend. The performance advantage over llvmpipe roughly scales
linearly with the number of cores available.
While our current performance is quite good, we know there is more
potential in this architecture. When we switched from a prototype
OpenGL driver to Mesa we regressed performance severely, some due to
interface issues that need tuning, some due to differences in shader code
generation, and some due to conformance and feature additions to the
core swr. We are looking to recover most of this performance.
What's the conformance?
-----------------------
The major applications we are targeting are all based on the
Visualization Toolkit (VTK), and as such our development efforts have
been focused on making sure these work as well as possible. Our
current code passes VTK's rendering tests with their new "OpenGL2"
(really OpenGL 3.2) backend at 99%.
piglit testing shows a much lower pass rate, roughly 80% at the time
of writing. Core SWR undergoes rigorous unit testing and we are quite
confident in the rasterizer, and understand the areas where it
currently has issues (example: line rendering is done with triangles,
so doesn't match the strict line rendering rules). The majority of
the piglit failures are errors in our driver layer interfacing Mesa
and SWR. Fixing these issues is one of our major future development
goals.
Why are you open sourcing this?
-------------------------------
* Our customers prefer open source, and allowing them to simply
download the Mesa source and enable our driver makes life much
easier for them.
* The internal gallium APIs are not stable, so we'd like our driver
to be visible for changes.
* It's easier to work with the Mesa community when the source we're
working with can be used as reference.
What are your development plans?
--------------------------------
* Performance - see the performance section earlier for details.
* Conformance - see the conformance section earlier for details.
* Features - core SWR has a lot of functionality we have yet to
expose through our driver, such as MSAA, geometry shaders, compute
shaders, and tessellation.
* AVX512 support
What is the licensing of the code?
----------------------------------
* All code is under the normal Mesa MIT license.
Will this work on AMD?
----------------------
* If using an AMD processor with AVX or AVX2, it should work, though
we don't have that hardware around to test. Patches if needed
would be welcome.
Will this work on ARM, MIPS, POWER, <other non-x86 architecture>?
-------------------------------------------------------------------------
* Not without a lot of work. We make extensive use of AVX and AVX2
intrinsics in our code and the in-tree JIT creation. It is not the
intention for this codebase to support non-x86 architectures.
What hardware do I need?
------------------------
* Any x86 processor with at least AVX (introduced in the Intel
SandyBridge and AMD Bulldozer microarchitectures in 2011) will
work.
* You don't need a fire-breathing Xeon machine to work on SWR - we do
day-to-day development with laptops and desktop CPUs.
Does one build work on both AVX and AVX2?
-----------------------------------------
Yes. The build system creates two shared libraries, ``libswrAVX.so`` and
``libswrAVX2.so``, and ``swr_create_screen()`` loads the appropriate one at
runtime.


@@ -1,114 +0,0 @@
Knobs
=====
OpenSWR has a number of environment variables which control its
operation, in addition to the normal Mesa and gallium controls.
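The knobs are ordinary environment variables of the listed type and
default. As an illustration only (this is not OpenSWR's implementation),
a boolean knob of this style could be read in C like so:

.. code-block:: c

   #include <stdlib.h>
   #include <string.h>
   #include <strings.h>

   /* Illustrative only: read a boolean knob, falling back to a default
    * when the variable is unset or empty. */
   static int knob_bool(const char *name, int def_value)
   {
      const char *v = getenv(name);
      if (!v || !*v)
         return def_value;
      return strcmp(v, "0") != 0 && strcasecmp(v, "false") != 0;
   }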
.. envvar:: KNOB_ENABLE_ASSERT_DIALOGS <bool> (true)
Use dialogs when asserts fire. Asserts are only enabled in debug builds
.. envvar:: KNOB_SINGLE_THREADED <bool> (false)
If enabled will perform all rendering on the API thread. This is useful mainly for debugging purposes.
.. envvar:: KNOB_DUMP_SHADER_IR <bool> (false)
Dumps shader LLVM IR at various stages of jit compilation.
.. envvar:: KNOB_USE_GENERIC_STORETILE <bool> (false)
Always use generic function for performing StoreTile. Will be slightly slower than using optimized (jitted) path
.. envvar:: KNOB_FAST_CLEAR <bool> (true)
Replace 3D primitive execute with a SWRClearRT operation and defer clear execution to first backend op on hottile, or hottile store
.. envvar:: KNOB_MAX_NUMA_NODES <uint32_t> (0)
Maximum # of NUMA nodes per system used for worker threads. 0 == use ALL NUMA nodes in the system; N == use at most N NUMA nodes for rendering.
.. envvar:: KNOB_MAX_CORES_PER_NUMA_NODE <uint32_t> (0)
Maximum # of cores per NUMA node used for worker threads. 0 == use ALL non-API thread cores per NUMA node; N == use at most N cores per NUMA node.
.. envvar:: KNOB_MAX_THREADS_PER_CORE <uint32_t> (1)
Maximum # of (hyper)threads per physical core used for worker threads. 0 == use ALL hyper-threads per core; N == use at most N hyper-threads per physical core.
.. envvar:: KNOB_MAX_WORKER_THREADS <uint32_t> (0)
Maximum worker threads to spawn. IMPORTANT: if this is non-zero, no worker threads will be bound to specific HW threads; they will all be "floating" SW threads. In this case, the three knobs above are ignored.
.. envvar:: KNOB_BUCKETS_START_FRAME <uint32_t> (1200)
Frame from when to start saving buckets data. NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
.. envvar:: KNOB_BUCKETS_END_FRAME <uint32_t> (1400)
Frame at which to stop saving buckets data. NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
.. envvar:: KNOB_WORKER_SPIN_LOOP_COUNT <uint32_t> (5000)
Number of spin-loop iterations worker threads will perform before going to sleep when waiting for work
.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT <uint32_t> (160)
Maximum number of draws outstanding before API thread blocks.
.. envvar:: KNOB_MAX_PRIMS_PER_DRAW <uint32_t> (2040)
Maximum primitives in a single Draw(). Larger primitives are split into smaller Draw calls. Should be a multiple of (3 * vectorWidth).
.. envvar:: KNOB_MAX_TESS_PRIMS_PER_DRAW <uint32_t> (16)
Maximum primitives in a single Draw() with tessellation enabled. Larger primitives are split into smaller Draw calls. Should be a multiple of (vectorWidth).
.. envvar:: KNOB_MAX_FRAC_ODD_TESS_FACTOR <float> (63.0f)
(DEBUG) Maximum tessellation factor for fractional-odd partitioning.
.. envvar:: KNOB_MAX_FRAC_EVEN_TESS_FACTOR <float> (64.0f)
(DEBUG) Maximum tessellation factor for fractional-even partitioning.
.. envvar:: KNOB_MAX_INTEGER_TESS_FACTOR <uint32_t> (64)
(DEBUG) Maximum tessellation factor for integer partitioning.
.. envvar:: KNOB_BUCKETS_ENABLE_THREADVIZ <bool> (false)
Enable threadviz output.
.. envvar:: KNOB_TOSS_DRAW <bool> (false)
Disable per-draw/dispatch execution
.. envvar:: KNOB_TOSS_QUEUE_FE <bool> (false)
Stop per-draw execution at worker FE. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
.. envvar:: KNOB_TOSS_FETCH <bool> (false)
Stop per-draw execution at vertex fetch. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
.. envvar:: KNOB_TOSS_IA <bool> (false)
Stop per-draw execution at input assembler. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
.. envvar:: KNOB_TOSS_VS <bool> (false)
Stop per-draw execution at vertex shader. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
.. envvar:: KNOB_TOSS_SETUP_TRIS <bool> (false)
Stop per-draw execution at primitive setup. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
.. envvar:: KNOB_TOSS_BIN_TRIS <bool> (false)
Stop per-draw execution at primitive binning. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
.. envvar:: KNOB_TOSS_RS <bool> (false)
Stop per-draw execution at rasterizer. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h


@@ -1,67 +0,0 @@
Profiling
=========
OpenSWR contains built-in profiling which can be enabled
at build time to provide insight into performance tuning.
To enable this, uncomment the following line in ``rasterizer/core/knobs.h`` and rebuild: ::
//#define KNOB_ENABLE_RDTSC
Running an application will result in a ``rdtsc.txt`` file being
created in the current working directory. This file contains profile
information captured between ``KNOB_BUCKETS_START_FRAME`` and
``KNOB_BUCKETS_END_FRAME`` (see the knobs section).
The resulting file will contain sections for each thread with a
hierarchical breakdown of the time spent in the various operations.
For example: ::
Thread 0 (API)
%Tot %Par Cycles CPE NumEvent CPE2 NumEvent2 Bucket
0.00 0.00 28370 2837 10 0 0 APIClearRenderTarget
0.00 41.23 11698 1169 10 0 0 |-> APIDrawWakeAllThreads
0.00 18.34 5202 520 10 0 0 |-> APIGetDrawContext
98.72 98.72 12413773688 29957 414380 0 0 APIDraw
0.36 0.36 44689364 107 414380 0 0 |-> APIDrawWakeAllThreads
96.36 97.62 12117951562 9747 1243140 0 0 |-> APIGetDrawContext
0.00 0.00 19904 995 20 0 0 APIStoreTiles
0.00 7.88 1568 78 20 0 0 |-> APIDrawWakeAllThreads
0.00 25.28 5032 251 20 0 0 |-> APIGetDrawContext
1.28 1.28 161344902 64 2486370 0 0 APIGetDrawContext
0.00 0.00 50368 2518 20 0 0 APISync
0.00 2.70 1360 68 20 0 0 |-> APIDrawWakeAllThreads
0.00 65.27 32876 1643 20 0 0 |-> APIGetDrawContext
Thread 1 (WORKER)
%Tot %Par Cycles CPE NumEvent CPE2 NumEvent2 Bucket
83.92 83.92 13198987522 96411 136902 0 0 FEProcessDraw
24.91 29.69 3918184840 167 23410158 0 0 |-> FEFetchShader
11.17 13.31 1756972646 75 23410158 0 0 |-> FEVertexShader
8.89 10.59 1397902996 59 23410161 0 0 |-> FEPAAssemble
19.06 22.71 2997794710 384 7803387 0 0 |-> FEClipTriangles
11.67 61.21 1834958176 235 7803387 0 0 |-> FEBinTriangles
0.00 0.00 0 0 187258 0 0 |-> FECullZeroAreaAndBackface
0.00 0.00 0 0 60051033 0 0 |-> FECullBetweenCenters
0.11 0.11 17217556 2869592 6 0 0 FEProcessStoreTiles
15.97 15.97 2511392576 73665 34092 0 0 WorkerWorkOnFifoBE
14.04 87.95 2208687340 9187 240408 0 0 |-> WorkerFoundWork
0.06 0.43 9390536 13263 708 0 0 |-> BELoadTiles
0.00 0.01 293020 182 1609 0 0 |-> BEClear
12.63 89.94 1986508990 949 2093014 0 0 |-> BERasterizeTriangle
2.37 18.75 372374596 177 2093014 0 0 |-> BETriangleSetup
0.42 3.35 66539016 31 2093014 0 0 |-> BEStepSetup
0.00 0.00 0 0 21766 0 0 |-> BETrivialReject
1.05 8.33 165410662 79 2071248 0 0 |-> BERasterizePartial
6.06 48.02 953847796 1260 756783 0 0 |-> BEPixelBackend
0.20 3.30 31521202 41 756783 0 0 |-> BESetup
0.16 2.69 25624304 33 756783 0 0 |-> BEBarycentric
0.18 2.92 27884986 36 756783 0 0 |-> BEEarlyDepthTest
0.19 3.20 30564174 41 744058 0 0 |-> BEPixelShader
0.26 4.30 41058646 55 744058 0 0 |-> BEOutputMerger
1.27 20.94 199750822 32 6054264 0 0 |-> BEEndTile
0.33 2.34 51758160 23687 2185 0 0 |-> BEStoreTiles
0.20 60.22 31169500 28807 1082 0 0 |-> B8G8R8A8_UNORM
0.00 0.00 302752 302752 1 0 0 WorkerWaitForThreadEvent


@@ -1,44 +0,0 @@
Usage
=====
Requirements
^^^^^^^^^^^^
* An x86 processor with AVX or above
* LLVM version 3.9 or later
* C++14 capable compiler
Building
^^^^^^^^
To build with GNU automake, select building the swr driver at
configure time, for example: ::
configure --with-gallium-drivers=swrast,swr
Using
^^^^^
On Linux, building with autotools will create a drop-in alternative
for libGL.so into::
lib/gallium/libGL.so
lib/gallium/libswrAVX.so
lib/gallium/libswrAVX2.so
Alternatively, building with SCons will produce::
build/linux-x86_64/gallium/targets/libgl-xlib/libGL.so
build/linux-x86_64/gallium/drivers/swr/libswrAVX.so
build/linux-x86_64/gallium/drivers/swr/libswrAVX2.so
To use it set the LD_LIBRARY_PATH environment variable accordingly.
**IMPORTANT:** Mesa will use llvmpipe or softpipe as the default software renderer. To select the OpenSWR driver, set the GALLIUM_DRIVER environment variable appropriately: ::
GALLIUM_DRIVER=swr
To verify OpenSWR is being used, check to see if a message like the following is printed when the application is started: ::
SWR detected AVX2


@@ -1,102 +0,0 @@
Zink
====
Overview
--------
The Zink driver is a Gallium driver that emits Vulkan API calls instead
of targeting a specific GPU architecture. This can be used to get full
desktop OpenGL support on devices that only support Vulkan.
Features
--------
The feature level of Zink depends on two things: what's implemented in Zink,
as well as the features of the Vulkan driver.
OpenGL 2.1
^^^^^^^^^^
OpenGL 2.1 is the minimum version Zink can support, and will always be
exposed, given Vulkan support. There are a few features that are required
for correct behavior, but not all of these are validated; if they are
missing you'll see rendering issues, likely validation errors, or even
crashes. Here's a list of those requirements (a minimal capability-check
sketch follows the list):
* Vulkan 1.0
* ``VkPhysicalDeviceFeatures``:
* ``logicOp``
* ``fillModeNonSolid``
* ``wideLines``
* ``largePoints``
* ``alphaToOne``
* ``shaderClipDistance``
* ``VkPhysicalDeviceLimits``:
* ``maxClipDistances`` ≥ 6
* Instance extensions:
* `VK_KHR_get_physical_device_properties2`_
* `VK_KHR_external_memory_capabilities`_
* Device extensions:
* `VK_KHR_maintenance1`_
* `VK_KHR_external_memory`_
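The core-feature part of this list translates into a straightforward
capability query. The sketch below uses the standard Vulkan API but is not
Zink's actual code:

.. code-block:: c

   #include <stdbool.h>
   #include <vulkan/vulkan.h>

   /* Returns true when a physical device meets the core feature/limit part
    * of the OpenGL 2.1 requirements above (extension checks omitted). */
   static bool meets_gl21_core_requirements(VkPhysicalDevice pdev)
   {
      VkPhysicalDeviceFeatures feats;
      VkPhysicalDeviceProperties props;

      vkGetPhysicalDeviceFeatures(pdev, &feats);
      vkGetPhysicalDeviceProperties(pdev, &props);

      return feats.logicOp && feats.fillModeNonSolid && feats.wideLines &&
             feats.largePoints && feats.alphaToOne && feats.shaderClipDistance &&
             props.limits.maxClipDistances >= 6;
   }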
OpenGL 3.0
^^^^^^^^^^
For OpenGL 3.0 support, the following additional device extensions are
required to be exposed and fully supported:
* `VK_EXT_transform_feedback`_
* `VK_EXT_conditional_rendering`_
Debugging
---------
There are a few tools that are useful for debugging Zink, like this environment
variable:
.. envvar:: ZINK_DEBUG <flags> ("")
``nir``
Print the NIR form of all shaders to stderr.
``spirv``
Write the binary SPIR-V form of all compiled shaders to a file in the
current directory, and print a message with the filename to stderr.
``tgsi``
Print the TGSI form of TGSI shaders to stderr.
Vulkan Validation Layers
^^^^^^^^^^^^^^^^^^^^^^^^
Another useful tool for debugging is the `Vulkan Validation Layers
<https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/master/README.md>`_.
The validation layers effectively insert extra checking between Zink and the
Vulkan driver, pointing out incorrect usage of the Vulkan API. The layers can
be enabled by setting the environment variable :envvar:`VK_INSTANCE_LAYERS` to
"VK_LAYER_KHRONOS_validation". You can read more about the Validation Layers
in the link above.
IRC
---
In order to make things a bit easier to follow, we have decided to create our
own IRC channel. If you're interested in contributing, or have any technical
questions, don't hesitate to visit `#zink on FreeNode
<irc://irc.freenode.net/zink>`_ and say hi!
.. _VK_KHR_get_physical_device_properties2: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_get_physical_device_properties2.html
.. _VK_KHR_external_memory_capabilities: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_external_memory_capabilities.html
.. _VK_KHR_maintenance1: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_maintenance1.html
.. _VK_KHR_external_memory: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_external_memory.html
.. _VK_EXT_transform_feedback: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_transform_feedback.html
.. _VK_EXT_conditional_rendering: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_conditional_rendering.html


@@ -15,7 +15,6 @@ Contents:
context
cso
distro
drivers
postprocess
glossary