Commit Graph

25 Commits

Author SHA1 Message Date
Eric Anholt
88b41239f9 vc4: Rework scheduling of thread switch to cut one more NOP.
Jonas's patch got us most of the benefit of scheduling instructions into
the delay slots of thread switch, but if there had been nothing to pair
the thrsw with, it would move the thrsw up and leave a NOP where the thrsw
was.

Instead, don't pair anything with thrsw through the normal scheduling
path, and have a separate helper function that inserts the thrsw earlier
if possible and inserts any necessary NOPs.

total instructions in shared programs: 93027 -> 92643 (-0.41%)
instructions in affected programs:     14952 -> 14568 (-2.57%)
2016-12-29 15:22:54 -08:00
Jonas Pfeil
d82dbc4cde vc4: Fill thread switching delay slots
Scan for instructions without a signal set in front of the switching
instruction and move the signal up there.

shader-db results:

total instructions in shared programs: 94494 -> 93027 (-1.55%)
instructions in affected programs:     23545 -> 22078 (-6.23%)

v2: Fix re-emitting of the instruction in the loop trying to emit NOPs,
    drop a scheduling change from branch delay slots. (by anholt)

Signed-off-by: Jonas Pfeil <pfeiljonas@gmx.de>
2016-12-29 14:41:09 -08:00
Eric Anholt
51244859e3 vc4: Avoid false scheduling dependencies for LOAD_IMMs.
Noticed in shaders with branching, where we ended up scheduling delay
slots near the start of a block for the uniforms reset setup.

total instructions in shared programs: 93970 -> 93951 (-0.02%)
instructions in affected programs:     3117 -> 3098 (-0.61%)

3DMMES performance +0.423087% +/- 0.133521% (n=9,10)
2016-11-30 19:58:09 -08:00
Eric Anholt
d40a3212ae vc4: Add a note for the future about texture latency calculation.
Debugging a shader-db reported cycle count regression from the tex
coalescing, I eventually figured out that the texture latencies were
totally bogus.  Really fixing it will probably involve mirroring
vc4_qir_schedule.c's texture fifo management here.
2016-11-29 09:01:23 -08:00
Eric Anholt
1ee503c74d vc4: Add support for QPU scheduling of thread switch instructions.
This is vaguely based off of Jonas Pfeil's thread switch support branch.
2016-11-12 18:46:35 -08:00
Eric Anholt
e887341d3f vc4: Don't pair up TLB scoreboard locking instructions early in QPU sched.
Jonas Pfeil noticed that we were putting passthrough tlb_z writes early in
the shader, despite QIR and QPU scheduling both trying to delay scoreboard
locking for as long as possible.

The problem was that when trying to pair up QPU instructions, at some
point the passthrough tlb_z would be the last one available and it would
get paired, even if the other half would open up other instructions to be
scheduled and we could have paired tlb_z with something later in the
program.  Also, since passthrough z is just a mov, it pairs up really
easily.

The proper fix would probably be to flip the order of scheduling
instructions so we went from bottom to top (also relevant for branch delay
slot scheduling).

However, we can do a quick fix here to just not schedule a TLB lock until
there's nothing but TLB left in the program, at a slight instruction cost
(est .61% cycle count in shader-db) but a major fragment shader
parallelism win.

glmark2 results:
  texture:texture-filter=linear: +1.24481% +/- 0.626117% (n=15)
  bump:bump-render=height: 1.24991% +/- 0.154793% (n=136,133 -- screensaver
    outliers removed)
2016-11-09 15:33:56 -08:00
Eric Anholt
3da4e38f48 vc4: Add QPU scheduling to handle MUL rotate sources.
We need MUL rotates to do ddx/ddy support.
2016-08-25 17:24:11 -07:00
Eric Anholt
9194473dd2 vc4: Emit resets of the uniform stream at the starts of blocks.
If a block might be entered from multiple locations, then the uniform
stream will (probably) be at different points, and we need to make sure
that it's pointing where we expect it to be.  The kernel also enforces
that any block reading a uniform resets uniforms, to prevent reading
outside of the uniform stream by using looping.
2016-07-13 23:54:15 -07:00
Eric Anholt
44df061aaa vc4: Add support for scheduling of branch instructions.
For now we don't fill the delay slots, and instead just drop in NOPs.
2016-07-13 23:54:15 -07:00
Eric Anholt
a59da513d3 vc4: Move the QPU instructions to schedule into each block.
We'll want to schedule them individually, to handle delay slots.
2016-07-13 23:54:15 -07:00
Eric Anholt
44df374a9c vc4: Fix a pasteo in scheduling condition flag usage.
Noticed by code inspection.  This hasn't been too big of a deal, because
our cond usages all start out as adder ops, either MOVs or the FTOI for Z
writes.  MOVs *can* get converted to mul ops during scheduling, but
apparently we hadn't hit this.
2016-07-04 16:33:22 -07:00
Eric Anholt
5278c64de5 vc4: Fix latency handling for QPU texture scheduling.
There's only high latency between a complete texture fetch setup and
collecting its result, not between each step of setting up the texture
fetch request.
2015-12-18 17:09:03 -08:00
Eric Anholt
960f48809f vc4: Keep sample mask writes from being reordered after TLB writes
Fixes a regression I noticed after introducing scheduling on the QIR.

Cc: "11.1" <mesa-stable@lists.freedesktop.org>
2015-12-18 17:09:03 -08:00
Eric Anholt
5989ef2b0f vc4: Add debugging of the estimated time to run the shader to shader-db. 2015-12-11 12:21:22 -08:00
Eric Anholt
74c4b3b80c vc4: Add support for storing sample mask.
From the API perspective, writing 1 bits can't turn on pixels that were
off, so we AND it with the sample mask from the payload.
2015-12-04 09:23:55 -08:00
Eric Anholt
78c773bb36 vc4: Convert from simple_list.h to list.h
list.h is a nicer and more familiar set of list functions/macros.
2015-05-29 22:09:53 -07:00
Eric Anholt
e473fbe469 vc4: Add support for turning constant uniforms into small immediates.
Small immediates have the downside of taking over the raddr B field, so
you might have less chance to pack instructions together thanks to raddr B
conflicts.  However, it also reduces some register pressure since it lets
you load 2 "uniform" values in one instruction (avoiding a previous load
of the constant value to a register), and increases some pairing for the
same reason.

total uniforms in shared programs: 16231 -> 13374 (-17.60%)
uniforms in affected programs:     10280 -> 7423 (-27.79%)
total instructions in shared programs: 40795 -> 41168 (0.91%)
instructions in affected programs:     25551 -> 25924 (1.46%)

In a previous version of this patch I had a reduction in instruction count
by forcing the other args alongside a SMALL_IMM to be in the A file or
accumulators, but that increases register pressure and had a bug in
handling FRAG_Z.  In this patch is I just use raddr conflict resolution,
which is more expensive.  I think I'd rather tweak allocation to have some
way to slightly prefer good choices for files in general, rather than risk
failing to register allocate by forcing things into register classes.
2014-12-17 19:35:13 -08:00
Eric Anholt
8812dc503e vc4: Do QPU scheduling across uniform loads.
This means another pass of reordering the uniform data store, but it lets
us pair up a lot more instructions.

total instructions in shared programs: 44639 -> 43176 (-3.28%)
instructions in affected programs:     36938 -> 35475 (-3.96%)
2014-12-09 21:19:11 -08:00
Eric Anholt
c5b544403f vc4: Populate the delay field better, and schedule high delay first.
This is a standard scheduling heuristic, and clearly helps.

total instructions in shared programs: 46418 -> 44467 (-4.20%)
instructions in affected programs:     42531 -> 40580 (-4.59%)
2014-12-09 18:32:36 -08:00
Eric Anholt
45a8923771 vc4: Skip raddr dependencies for 32-bit immediate loads.
These don't have raddr fields.
2014-12-09 18:32:36 -08:00
Eric Anholt
f431b4f110 vc4: Mark VPM read setup as impacting VPM reads, not writes.
Fixes assertion failures if we adjust scheduling priorities to emphasize
VPM reads more.
2014-12-09 18:32:36 -08:00
Eric Anholt
6f32deb538 vc4: Add separate write-after-read dependency tracking for pairing.
If an operation is the last one to read a register, the instruction
containing it can also include the op that has the next write to that
register.

total instructions in shared programs: 57486 -> 56995 (-0.85%)
instructions in affected programs:     43004 -> 42513 (-1.14%)
2014-12-05 10:53:53 -08:00
Eric Anholt
042962df2d vc4: Fix inverted priority of instructions for QPU scheduling.
We were scheduling TLB operations as early as possible, and texture setup
as late as possible.  When I introduced prioritization, I visually
inspected that an independent operation got moved above texture results
collection, which tricked me into thinking it was working (but it was just
because texture setup was being pushed late).

total instructions in shared programs: 57651 -> 57486 (-0.29%)
instructions in affected programs:     18532 -> 18367 (-0.89%)
2014-12-05 10:43:14 -08:00
Eric Anholt
29c7cf2b2b vc4: Pair up QPU instructions when scheduling.
We've got two mostly-independent operations in each QPU instruction, so
try to pack two operations together.  This is fairly naive (doesn't track
read and write separately in instructions, doesn't convert ADD-based MOVs
into MUL-based movs, doesn't reorder across uniform loads), but does show
a decent improvement on shader-db-2.

total instructions in shared programs: 59583 -> 57651 (-3.24%)
instructions in affected programs:     47361 -> 45429 (-4.08%)
2014-12-01 22:29:42 -08:00
Eric Anholt
3fe4d8e1e3 vc4: Introduce scheduling of QPU instructions.
This doesn't reschedule much currently, just tries to fit things into the
regfile A/B write-versus-read slots (the cause of the improvements in
shader-db), and hide texture fetch latency by scheduling setup early and
results collection late (haven't performance tested it).  This
infrastructure will be important for doing instruction pairing, though.

shader-db2 results:
total instructions in shared programs: 61874 -> 59583 (-3.70%)
instructions in affected programs:     50677 -> 48386 (-4.52%)
2014-12-01 11:00:23 -08:00