For NIR-to-TGSI, we don't want to revectorize 64-bit ops that we split to
scalar beyond vec2 width. We even have some ops that we would rather
retain as scalar due to TGSI opcodes being scalar, or having more unusual
requirements.
This could be used to do the vectorize_vec2_16bit filtering, but that
shader compiler option is also used in algebraic so leave it in place for
now.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6567>
This allows us to do API specific checks before removing variable
without filling nir_remove_dead_variables() with API specific code.
In the following patches we will use this to support the removal
of dead uniforms in GLSL.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4797>
Move the duplicate consts step to a nir pass.
This makes the nir representation closer to what ppir will have in the
result.
Additionally, it handles the case where a const is used multiple times
by a single node (which can happen in instructions like fcsel). The new
implementation will only emit a single load const for that case.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4535>
Move the duplicate uniform and varying steps to a nir pass, along with
some changes in the duplicating strategy.
Node duplication is now done per user of the varying/uniform. This is
inspired by what the offline shader compiler seems to usually do, and as
usual aims to reduce register pressure and better utilize the ld_uni and
ld_var instruction slots.
It is worth noting that due to a bug/feature, ppir was already
duplicating uniforms per successor in ppir_node_add_src even if the
comment indicated it was meant to be per-block.
Additionally, ppir was duplicating load uniform nodes twice for nodes
that use the same uniform in more than one source, resulting in one
unnecessary (and unpipelineable) load. This new implementation in nir
only creates one load in that case.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4535>
Otherwise we may lower some fdot to fdph which is not implemented in pp.
Fixes#2126
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Fix memory leak on allocation for nir shader, reported by valgrind.
3,502 (480 direct, 3,022 indirect) bytes in 1 blocks are definitely lost in loss record 77 of 84
at 0x48483F8: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
by 0x5750817: ralloc_size (ralloc.c:119)
by 0x5750977: rzalloc_size (ralloc.c:151)
by 0x575C173: nir_shader_create (nir.c:45)
by 0x5763ACB: nir_shader_clone (nir_clone.c:728)
by 0x55D5003: st_create_fp_variant (st_program.c:1242)
by 0x55D789F: st_get_fp_variant (st_program.c:1522)
by 0x55D789F: st_get_fp_variant (st_program.c:1507)
by 0x56400C3: st_update_fp (st_atom_shader.c:163)
by 0x563D333: st_validate_state (st_atom.c:261)
by 0x55D07CB: prepare_draw (st_draw.c:132)
by 0x55D08DF: st_draw_vbo (st_draw.c:184)
by 0x55576CB: _mesa_draw_arrays (draw.c:374)
by 0x55576CB: _mesa_draw_arrays (draw.c:351)
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
GP handles gl_PointSize similar to gl_Position, i.e. it needs
separate buffer and it has special type in varying descriptors, also
for indexed draw we need to emit special PLBU command to pass
address of gl_PointSize buffer.
Blob also clamps gl_PointSize to 1 .. 100 (as well as line width),
so let's do the same.
Reviewed-by: Andreas Baierl <ichgeh@imkreisrum.de>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
NIR may emit a single instrinsic to load several packed varyings,
but that's suboptimal for Utgard PP for several reasons:
- varyings that are used as sampler inputs can be passed using
pipeline register with increased precision
- we have small number of regs, so using a vec4 regs for storing
two vec2 varyings increases reg pressure.
Add NIR pass to split a single load into several loads and utilize
it in lima.
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
Allocating BOs is expensive, so we should avoid doing that by caching
freed BOs.
BO cache is modelled after one in v3d driver and works as follows:
- in lima_bo_create() check if we have matching BO in cache and return
it if there's one, allocate new BO otherwise.
- in lima_bo_unreference() (renamed from lima_bo_free()): put BO in
cache instead of freeing it and remove all stale BOs from cache
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
Utgard PP is vec4 architecture, so lowering phis to scalars
increases instruction count and potentially interferes with
spilling.
Tested-by: Andreas Baierl <ichgeh@imkreisrum.de>
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
Utgard PP has vector fcsel operation, but its condition is scalar. Add
filtering callback that checks whether {b,f}csel condition is not scalar
to lower {b,f}csel to scalar only in this case.
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
Set of opcodes doesn't have enough flexibility in certain cases. E.g.
Utgard PP has vector conditional select operation, but condition is always
scalar. Lowering all the vector selects to scalar increases instruction
number, so we need a way to filter only those ops that can't be handled
in hardware.
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
pp has vector units and some operations can be optimized when bundled
together.
Benchmarking this with piglit shaders shows that the instruction count
can be greatly reduced on many examples with vectorize.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
nir vec4 fcsel assumes that each component of the condition will be used
to select the same component from the options, but pp can't implement
that since it only has 1 component for the condition.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
The previous spill stack was fixed and too small, and caused instability
in programs requiring spilling for roughly more than one value.
This patch adds a dynamic calculation of the buffer size based on stack
utilization and switches it to a separate allocation at flush time that
will fit the shader that requires the largest buffer.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Very basic summary, loops and gpir spills:fills are not updated yet and
are only there to comply with the strings to shader-db report.py regex.
For now it can be used to analyze the impact of changes in instruction
count in both gpir and ppir.
The LIMA_DEBUG=shaderdb setting can be useful to output stats on
applications other than shader-db.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
nir_lower_int_to_float is currently only meant to run once, and some ops
must be lowered after being converted from int ops to be implementable,
so re-run nir_opt_algebraic after lowering ints to floats.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Utgard PP is vec4, but some operations are scalar, utilize
NIR vec to scalar lowering pass and indicate operations that we
want to lower.
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
The mali pp doesn't support integers and some nir_algebraic
optimizations may result in ops that are not easily lowerable to floats,
so disable optimizations resulting in bitops.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Jonathan Marek <jonathan@marek.ca>
Optimizations that insert bitshift or bitwise operations should not be
applied on GPUs that don't support integer operations.
The .lower_bitshift could be used to control the bitshift related ones,
but there was also one bitwise optimization uncovered.
Since only lima and freedreno use this option and the use case is that
no bit operations are wanted, let's rename it to .lower_bitops and use
it to control all bitops related optimizations.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Jonathan Marek <jonathan@marek.ca>
Now that we have fsum in nir, we can move fdot lowering there.
This helps reduce ppir complexity and enables the lowered ops to be part
of other nir optimizations in the optimization loop.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Lower texture projection in ppir using nir_lower_tex and nir_lower_tex.
This will insert a mul with the coordinate division before the load
varying.
Even though the lima pp supports projection in the load varying
instruction while loading the coordinates (from a register or a
varying), it requires that both the coordinates and projector be
components in a single register.
nir currently handles them in separate ssa, and attempting to merge them
manually may end up in worse code than just doing the coordinate
division manually. So for now let's just lower the projection to add
support for it in lima.
In the future, an optimization pass may be implemented in lima to ensure
that both coords and projector come in the same register, then this
lowering may be disabled and in this case lima may use the built-in
projection and save the mul instruction from lowering.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Remove an unnecessary nir_lower_regs_to_ssa as that should be done by
the state tracker, and add a missing DCE pass after running copy
propagation in order to remove the dead copies. This shouldn't fix
anything but the second part will reduce shader sizes.
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Since we cannot handle ffma in ppir, lower it on nir level already.
Signed-off-by: Andreas Baierl <ichgeh@imkreisrum.de>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Since commit 4f3c82c72c fmod is no longer being lowered in nir, and
ends up crashing lima programs with "unsupported nir_op: fmod" in both
ppir and gpir.
There seems to be no mod operation in hardware in utgard and there is an
optimization in nir to lower fmod to instructions that lima already
implements, so let's use that.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Removes the bool_to_float logic from the int_to_float pass, so that both
can be used separately. By having separate passes we have better validation
and it makes it possible to use with the lower_ftrunc option (int lowering
generates ftrunc, but lower_ftrunc generates bools, ftrunc lowering should
probably be reworked). For now we always expect lower_bool to come after
lower_int.
Also fixes f2i32 to become ftrunc and adds u2f/f2u cases.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Add a "lower_bitshift" option, which disables optimizations introducing
bitshifts and lowers ishl by constant to a multiply, so that we don't have
to deal with bitshifts in int_to_float lowering.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This can be used by both etnaviv and freedreno/a2xx as they are both vec4
architectures with some instructions being scalar-only.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Driver which do not support native integers should use a lowering
pass to go from integers to floats.
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Neither GP nor PP in Mali4x0 support integers, so utilize new pass
and set native_integers to true for now until this flag is dropped.
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>