Commit Graph

1357 Commits

Author SHA1 Message Date
D Scott Phillips
65b05ebdda anv,iris: Fix input vertex max for tcs on gen12
gen12 does away with the single patch dispatch mode for tcs, and
increases some limits so that 8_patch mode can always work. Make the
necessary changes so we don't try to fall back to single patch mode.

Fixes KHR-GL46.tessellation_shader.single.max_patch_vertices and others

Fixes: 44754279ac ("intel/fs/gen12: Use TCS 8_PATCH mode.")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4843>
2020-05-01 16:49:11 +00:00
D Scott Phillips
7bd15135a6 intel/fs: Update location of Render Target Array Index for gen12
Render Target Array Index has moved from R0.0[26:16] to
R1.1[26:16] on gen12.

Fixes dEQP-VK.multiview.input_attachments.*

Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4836>
2020-05-01 08:48:22 -07:00
Jason Ekstrand
4985e380dd intel/eu: Use non-coherent mode (BTI=253) for stateless A64 messages
We don't care about full IA coherency since we always have the
opportunity in GL or Vulkan to flush the data cache.  Using IA-coherent
mode is likely just making A64 access slower than it needs to be.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4819>
2020-04-30 14:45:50 +00:00
Francisco Jerez
0842758ec0 intel/ir: Update performance analysis parameters for memory fence codegen changes.
The SFID field of the SHADER_OPCODE_MEMORY_FENCE and
SHADER_OPCODE_INTERLOCK instructions now indicates the target function
of the memory fence.  Account the cycle-count cost to the right shared
unit.

Fixes: f858fa26b4 ("intel/fs,vec4: Pull stall logic for memory fences up into the IR")
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4817>
2020-04-29 23:40:36 +00:00
Jason Ekstrand
e581ddeeee intel/fs: Don't delete coalesced MOVs if they have a cmod
Shader-db results on ICL:

    total instructions in shared programs: 17133088 -> 17133287 (<.01%)
    instructions in affected programs: 61300 -> 61499 (0.32%)
    helped: 0
    HURT: 199

This means it's likely fixing 199 bugs. :-)  All the changed shaders are
in Mad Max.  It's surprisingly difficult to get the back-end compiler to
generate a pattern that hits this we don't tend to emit a lot coalescable
MOVs.  The pattern in Mad Max that's able to hit is fsign(fsat(x)) under
the right conditions.

Closes: #2820
Cc: mesa-stable@lists.freedesktop.org
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4773>
2020-04-29 16:45:51 +00:00
Caio Marcelo de Oliveira Filho
a3cba3c771 intel/fs: Only stall after sending all memory fence messages
In Gen11+, when emitting a fence for both L3 and SLM, the generated
code would look like

    SEND, MOV (for stall), SEND, MOV (for stall)

This commit change that so two SENDs are emitted before the MOVs for
stall.  This is similar to the approach used in Ivy Bridge for the
render fence.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
2020-04-29 07:17:27 +00:00
Caio Marcelo de Oliveira Filho
f858fa26b4 intel/fs,vec4: Pull stall logic for memory fences up into the IR
Instead of emitting the stall MOV "inside" the
SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when
creating the IR.

For IvyBridge, every (data cache) fence is accompained by a render
cache fence, that now is explicit in the IR, two
SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).

Because Begin and End interlock intrinsics are effectively memory
barriers, move its handling alongside the other memory barrier
intrinsics.  The SHADER_OPCODE_INTERLOCK is still used to distinguish
if we are going to use a SENDC (for Begin) or regular SEND (for End).

This change is a preparation to allow emitting both SENDs in Gen11+
before we can stall on them.

Shader-db results for IVB (i965):

    total instructions in shared programs: 11971190 -> 11971200 (<.01%)
    instructions in affected programs: 11482 -> 11492 (0.09%)
    helped: 0
    HURT: 8
    HURT stats (abs)   min: 1 max: 3 x̄: 1.25 x̃: 1
    HURT stats (rel)   min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10%
    95% mean confidence interval for instructions value: 0.66 1.84
    95% mean confidence interval for instructions %-change: 0.01% 0.27%
    Instructions are HURT.

  Unlike the previous code, that used the `mov g1 g2` trick to force
  both `g1` and `g2` to stall, the scheduling fence will generate `mov
  null g1` and `mov null g2`.  During review it was decided it was not
  worth keeping the special codepath for the small effect will have.

Shader-db results for HSW (i965), BDW and SKL don't have a change
on instruction count, but do report changes in cycles count, showing
SKL results below

    total cycles in shared programs: 341738444 -> 341710570 (<.01%)
    cycles in affected programs: 7240002 -> 7212128 (-0.38%)
    helped: 46
    HURT: 5
    helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154
    helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95%
    HURT stats (abs)   min: 2 max: 1768 x̄: 646.40 x̃: 362
    HURT stats (rel)   min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08%
    95% mean confidence interval for cycles value: -777.71 -315.38
    95% mean confidence interval for cycles %-change: -1.42% -0.83%
    Cycles are helped.

  This seems to be the effect of allocating two registers separatedly
  instead of a single one with size 2, which causes different register
  allocation, affecting the cycle estimates.

while ICL also has not change on instruction count but report changes
negative changes in cycles

    total cycles in shared programs: 352665369 -> 352707484 (0.01%)
    cycles in affected programs: 9608288 -> 9650403 (0.44%)
    helped: 4
    HURT: 104
    helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101
    helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49%
    HURT stats (abs)   min: 2 max: 2016 x̄: 408.36 x̃: 48
    HURT stats (rel)   min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45%
    95% mean confidence interval for cycles value: 256.67 523.24
    95% mean confidence interval for cycles %-change: 0.63% 1.03%
    Cycles are HURT.

  AFAICT this is the result of the case above.

Shader-db results for TGL have similar cycles result as ICL, but also
affect instructions

    total instructions in shared programs: 17690586 -> 17690597 (<.01%)
    instructions in affected programs: 64617 -> 64628 (0.02%)
    helped: 55
    HURT: 32
    helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3
    helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74%
    HURT stats (abs)   min: 1 max: 65 x̄: 7.44 x̃: 2
    HURT stats (rel)   min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69%
    95% mean confidence interval for instructions value: -2.03 2.28
    95% mean confidence interval for instructions %-change: -0.41% 0.15%
    Inconclusive result (value mean confidence interval includes 0).

  Now that more is done in the IR, more dependencies are visible and
  more SWSB annotations are emitted.  Mixed with different register
  allocation decisions like above, some shaders will see more `sync
  nops` while others able to avoid them.

  Most of the new `sync nops` are also redundant and could be dropped,
  which will be fixed in a separate change.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
2020-04-29 07:17:27 +00:00
Caio Marcelo de Oliveira Filho
0e96b0d6dd intel/fs: Allow FS_OPCODE_SCHEDULING_FENCE stall on registers
It will generate the MOVs (or SYNC_NOP in Gen12+) needed for stall.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
2020-04-29 07:17:27 +00:00
Francisco Jerez
5e2a7e11b4 intel/ir: Remove scheduling-based cycle count estimates.
The cycle count estimation logic part of the scheduler is now
redundant with the shader performance modeling pass, and the estimates
can be consolidated into the brw::performance analysis result object
instead of being part of the CFG, which guarantees that the estimates
cannot be accessed without previously calling the
performance_analysis::require() method, which makes sure that the
right analysis pass is executed at the right time if we don't already
have up-to-date cached results.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
486f3b04a5 intel/ir: Pass block cycle count information explicitly to disassembler.
So we can eventually remove the cycle count estimates from the CFG
data structure and consolidate performance information in the
brw::performance object.

It would be cleaner to pass the brw::performance object directly to
the disassembler but that isn't straightforward since the disassembler
is built as a plain C file unlike the rest of the compiler back-end.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
6579f562c3 intel/ir: Use brw::performance object instead of CFG cycle counts for codegen stats.
These should be more accurate than the current cycle counts, since
among other things they consider the effect of post-scheduling passes
like the software scoreboard on TGL.  In addition it will enable us to
clean up some of the now redundant cycle-count estimation
functionality in the instruction scheduler.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
65342be3ae intel/fs: Add INTEL_DEBUG=no32 debugging flag.
This is useful in order to identify codegen issues caused by SIMD32.
It doesn't currently have any effect on compute shaders since SIMD32
dispatch is only enabled for CS when it's strictly necessary to do so
in order to support the workgroup size requested for the shader --
That might change in the future though when we hook up the SIMD32
heuristic to CS compilation.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
14f0a5cf64 intel/fs: Implement performance analysis-based SIMD32 heuristic for fragment shaders.
The heuristic enables the SIMD32 fragment shader based on whether the
IR performance modeling pass predicts it to have greater throughput
than the SIMD16 and SIMD8 variants of the same shader.  It would be
straightforward to do the same thing in order to control whether
SIMD16 dispatch is enabled, but it's pending additional performance
evaluation.

The INTEL_DEBUG=do32 option is left around in order to force the
SIMD32 shader to be used regardless of the result of the heuristic,
since it's useful as a debugging aid e.g. in order to identify
SIMD32-specific codegen issues which may be masked by the SIMD32
heuristic, or cases where the heuristic is incorrectly disabling
SIMD32 shaders that offer a performance advantage.

Currently this is only enabled on Gen6+, since SIMD32 codegen support
is incomplete on earlier platforms.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
d6aa0c261f intel/fs: Heap-allocate fs_visitors in brw_compile_fs().
This makes brw_compile_fs() look a bit more similar to
brw_compile_cs().  It saves us three v*_shader_stats local variables,
and will save us additional triplicated declarations as we start
tracking IR performance analysis results.

The triplicated cfg pointers are left around because they're set to
NULL to mark specific dispatch modes as disabled (e.g. in order to
enforce hardware restrictions).  Doing the same thing with the visitor
pointers would cause data leaks.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
188a3659ae intel/ir: Import shader performance analysis pass.
This introduces an analysis pass intended to estimate several
performance statistics of the shader, including cycle count latency
and throughput values, based on static modeling.  It has instruction
performance information more comprehensive than the current scheduling
pass for all platforms between Gen4-11, and works on both the FS and
VEC4 back-end.

The most immediate purpose of this pass is to implement a heuristic
meant to determine whether using SIMD32 dispatch for a fragment shader
can be expected to help more than it hurts.  In addition this will
allow the effect of passes run after scheduling (e.g. the TGL software
scoreboard pass and the VEC4 dependency control pass) to be visible in
shader-db statistics.

But that isn't the end of the story, other potential applications of
this pass (not part of this MR) I've been playing around with are:

 - Implement a similar SIMD16 heuristic allowing the identification of
   inefficient SIMD16 fragment shaders.

 - Implement similar SIMD16 and SIMD32 heuristics for the compute
   shader stage -- Currently compute shader builds always use the
   SIMD16 shader if available and never use the SIMD32 shader unless
   strictly necessary, which is suboptimal under certain conditions.

 - Hook up to the instruction scheduler in order to improve the
   accuracy of its timing information.

 - Use as heuristic in order to drive the selection of scheduling
   modes (Matt was experimenting with that).

 - Plug to the TGL software scoreboard pass in order to implement a
   more effective SBID token allocation algorithm, since in general
   the optimal token allocation depends on the timings of all
   instructions in the program.

 - Use its bottleneck detection functionality in order to implement a
   heuristic computing a more optimal bound for the number of fragment
   shader threads executed in parallel (by adjusting the
   MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).

As a follow-up I'm planning to submit updated timing information for
Gen12 platforms -- Everything else required to support Gen12 like SWSB
handling is already included in this patch, but there were some IP
concerns regarding the TGL timing parameters since they cannot
currently be obtained with the documentation and hardware which is
publicly available.  The timing parameters for any previous Gen7-11
platforms can be obtained by anyone by sampling the timestamp register
using e.g. shader_time, though I have some more convenient
instrumentation coming up.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:03 -07:00
Francisco Jerez
c8ce1cfc9c intel/vec4: Fix constness of vec4_instruction::reads_flag() and ::writes_flag().
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
bda1d72dd9 intel/fs: Replace fs_visitor::bank_conflict_cycles() with stand-alone function.
This will be re-usable by the IR performance analysis pass.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
d2ed740795 intel/fs: Fix constness of argument of fs_instruction_scheduler::is_compressed().
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
6310a05f68 intel/fs: Rename half() helpers to quarter(), allow index up to 3.
Makes more sense considering SIMD32.  Relaxing the assertion in
brw_ir_fs.h will be required in order to avoid assertion failures on
SNB with SIMD32 fragment shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
bdad7f429a intel/ir: Add missing initialization of backend_reg::offset during construction.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
e549e4f6c0 intel/fs/gen12: Fix Render Target Read header setup for new thread payload layout.
In Gen12 the Poly 0 Info DWORD containing the Viewport Index and
Render Target Index fields were moved from r0.0 to r1.1 in order to
make room for dual-polygon dispatch.  The render target message format
was updated to expect that information in the same location, so we
didn't need to make any changes for framebuffer fetch to work with
SIMD8 and SIMD16 dispatch.  Unfortunately that won't work with SIMD32,
since the render target message header is assembled from r0 and r2
instead of r1, and the r2 thread payload wasn't updated with an
additional copy of the same information.  We need to fix things up
manually instead.  This avoids a handful of
EXT_shader_framebuffer_fetch regressions in combination with SIMD32
fragment shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
72324035fb intel/fs/gen12: Work around dual-source blending hangs in combination with SIMD32.
This applies the same work-around I commited as b84fa0b31e
"intel/fs/gen11: Work around dual-source blending hangs in combination
with SIMD32." to Gen12, which seems to suffer from the same hardware
bug found empirically.  The failure mode seems to be identical.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:28 -07:00
Francisco Jerez
d6ae079771 intel/fs/gen12: Fix hangs with per-sample SIMD32 fragment shader dispatch.
The Gen12 docs are rather contradictory regarding the dispatch
configurations supported by the fragment shader -- The same table
present in previous generations seems to imply that only one dispatch
mode can be enabled when doing per-sample shading, but a restriction
documented in the 3DSTATE_PS_BODY page implies the opposite: That
SIMD32 can only be used in combination with some other dispatch mode.

The latter seems to match the behavior of real hardware as I could
tell from my testing: A bunch of multisample test-cases that do
per-sample shading hang if we only provide a SIMD32 shader.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:28 -07:00
Rhys Perry
32d871b48f nir/algebraic: don't undo lowering of 8/16-bit comparisons to 32-bit
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4387>
2020-04-23 10:57:38 +00:00
Kenneth Graunke
259cae4442 intel/compiler: Don't create 64-bit src1 immediates in opt_peephole_sel
64-bit immediates are only allowed as src0.  Long ago, we decided to
avoid constructing such illegal situations in the IR, rather than
allowing them in the IR but then promoting bogus immediates to GRFs
later.  So, we need to fix opt_peephole_sel to not put 64-bit immediates
as src1 of the new SEL instruction.

Fixes: a4b36cd3dd ("intel/fs: Coalesce when the src live range is contained in the dst")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2816
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4692>
2020-04-23 00:53:14 +00:00
Kenneth Graunke
4459a70a6e intel/compiler: Delete abs/neg handling in fsign code
This should have gone away when removing source modifiers.  They won't
be set any longer, so this is simply dead code.

Fixes: b7c47c4f7c ("intel/compiler: Drop nir_lower_to_source_mods() and related handling.")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4691>
2020-04-22 17:04:37 -07:00
Kenneth Graunke
220f0e10d8 intel/compiler: Don't copy prop source mods into PICK_HIGH_32BIT
VEC4_OPCODE_PICK_HIGH_32BIT performs 32-bit UD access on a 64-bit DF
value.  abs and negate make sense on DF, but break entirely when
trying to access pieces of the value as unsigned integer dwords.

Fixes an fsign Piglit test on Ivybridge:
tests/spec/arb_gpu_shader_fp64/execution/built-in-functions/vs-sign-neg-abs

It had regressed when I removed nir_lower_to_source_modifiers, as that
caused us to start generating different code which provoked this bug.

Fixes: b7c47c4f7c ("intel/compiler: Drop nir_lower_to_source_mods() and related handling.")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2817
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4691>
2020-04-22 17:03:18 -07:00
Kenneth Graunke
902c8731f4 intel/compiler: Put back saturate on [iu]add_sat opcodes
I deleted one too many inst->saturate = ... lines.  This one must stay.

Fixes: b7c47c4f7c ("intel/compiler: Drop nir_lower_to_source_mods() and related handling.")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4669>
2020-04-22 00:47:40 +00:00
Kenneth Graunke
b7c47c4f7c intel/compiler: Drop nir_lower_to_source_mods() and related handling.
I think we're unanimous in wanting to drop nir_lower_to_source_mods.
It's a bit of complexity to handle in the backend, but perhaps more
importantly, would be even more complexity to handle in nir_search.

And, it turns out that since we made other compiler improvements in the
last few years, they no longer appear to buy us anything of value.
Summarizing the results from shader-db from this patch:

 - Icelake (scalar mode)

   Instruction counts:

   - 411 helped, 598 hurt (out of 139,470 shaders)
   - 99.2% of shaders remain unaffected.  The average increase in
     instruction count in hurt programs is 1.78 instructions.
   - total instructions in shared programs: 17214951 -> 17215206 (<.01%)
   - instructions in affected programs: 1143879 -> 1144134 (0.02%)

   Cycles:

   - 1042 helped, 1357 hurt
   - total cycles in shared programs: 365613294 -> 365882263 (0.07%)
   - cycles in affected programs: 138155497 -> 138424466 (0.19%)

 - Haswell (both scalar and vector modes)

   Instruction counts:

   - 73 helped, 1680 hurt (out of 139,470 shaders)
   - 98.7% of shaders remain unaffected.  The average increase in
     instruction count in hurt programs is 1.9 instructions.
   - total instructions in shared programs: 14199527 -> 14202262 (0.02%)
   - instructions in affected programs: 446499 -> 449234 (0.61%)

   Cycles:

   - 5253 helped, 5559 hurt
   - total cycles in shared programs: 359996545 -> 360038731 (0.01%)
   - cycles in affected programs: 155897127 -> 155939313 (0.03%)

Given that ~99% of shader-db remains unaffected, and the affected
programs are hurt by about 1-2 instructions - which are all cheap
ALU instructions - this is unlikely to be measurable in terms of
any real performance impact that would affect users.

So, drop them and simplify the backend, and hopefully enable other
future simplifications in NIR.

Reviewed-by: Eric Anholt <eric@anholt.net> [v1]
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4616>
2020-04-21 21:42:21 +00:00
Dylan Baker
8e3696137f remove final imports.h and imports.c bits
This moves the fi_types to a new mesa_private.h and removes the
imports.c file. The vast majority of this patch is just removing
pound includes of imports.h and fixing up the recursive includes.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>
2020-04-21 11:09:04 -07:00
Dylan Baker
f8e4542bad replace _mesa_logbase2 with util_logbase2
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>
2020-04-21 11:09:03 -07:00
Dylan Baker
e533fad182 replace _mesa_is_pow_two with util_is_power_of_two_*
Mostly this uses util_is_power_of_two_or_zero, which has the same
behavior as _mesa_is_pow_two when the input is zero. In cases where the
value is known to be != 0 ahead of time I used the _nonzero variant as
it may be faster on some platforms.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>
2020-04-21 11:09:03 -07:00
Jason Ekstrand
7c43b8ce1b nir: Delete the fnoise opcodes
As of the previous commit, they are never used.

Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4624>
2020-04-21 06:16:13 +00:00
Jason Ekstrand
a4b36cd3dd intel/fs: Coalesce when the src live range is contained in the dst
Consider the following case:

    // g119-123 are written somewhere above
    mul.sat(16)   g67<1>F    g6.4<0,1,0>F   g125<8,8,1>F
    mul.sat(16)   g69<1>F    g6.5<0,1,0>F   g125<8,8,1>F
    mul.sat(16)   g71<1>F    g6.6<0,1,0>F   g125<8,8,1>F
    mov(16)       g119<1>F   g67<8,8,1>F
    mov(16)       g121<1>F   g69<8,8,1>F
    mov(16)       g123<1>F   g71<8,8,1>F

We should be able to coalesce it into

    mul.sat(16)   g119<1>F   g6.4<0,1,0>F   g125<8,8,1>F
    mul.sat(16)   g121<1>F   g6.5<0,1,0>F   g125<8,8,1>F
    mul.sat(16)   g123<1>F   g6.6<0,1,0>F   g125<8,8,1>F

What's stopping us is an overly conservative check for writes to the two
registers being coalesced.  The check walks over the intersection of
their live ranges and checks for no writes to either one.  However,
because the register which starts the live range (the mul.sat in this
case) is inside that intersection, we flag it as a write in the
intersection and don't coalesce.  However, this case is safe because the
destination register of the copy is never read after the source is
written.

Shader-db changes on ICL:

    total instructions in shared programs: 16043613 -> 16042610 (<.01%)
    instructions in affected programs: 43036 -> 42033 (-2.33%)
    helped: 226
    HURT: 0
    helped stats (abs) min: 1 max: 30 x̄: 4.44 x̃: 4
    helped stats (rel) min: 0.09% max: 26.67% x̄: 4.89% x̃: 3.43%
    95% mean confidence interval for instructions value: -4.86 -4.02
    95% mean confidence interval for instructions %-change: -5.57% -4.22%
    Instructions are helped.

    total cycles in shared programs: 334766372 -> 334710124 (-0.02%)
    cycles in affected programs: 617548 -> 561300 (-9.11%)
    helped: 214
    HURT: 2
    helped stats (abs) min: 15 max: 1512 x̄: 263.21 x̃: 212
    helped stats (rel) min: 0.30% max: 75.36% x̄: 25.30% x̃: 21.58%
    HURT stats (abs)   min: 40 max: 40 x̄: 40.00 x̃: 40
    HURT stats (rel)   min: 0.15% max: 0.15% x̄: 0.15% x̃: 0.15%
    95% mean confidence interval for cycles value: -277.91 -242.90
    95% mean confidence interval for cycles %-change: -27.58% -22.55%
    Cycles are helped.

No spill/fill changes or gained/lost

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4627>
2020-04-21 01:00:24 +00:00
Jason Ekstrand
14b8d979db intel/fs: Rename block to scan_block in can_coalesce_vars
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4627>
2020-04-21 01:00:24 +00:00
Caio Marcelo de Oliveira Filho
c76f2292b5 intel/fs,vec4: Properly account SENDs in IVB memory fence
Change brw_memory_fence to return the number of messages emitted, and
use that to update the send_count statistic in code generation.

This will fix the book-keeping for IVB since the memory fences will
result in two SEND messages.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4646>
2020-04-20 09:29:09 -07:00
Ian Romanick
f7d620f47d intel/compiler: Fixup operands in fs_builder::emit() that takes array
The versions that take a specific number of operands will do various
fixups depending on the platform and the opcode.  However, the version
that takes an array of sources did not.  This makes all version operate
similarly.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>
2020-04-17 08:21:47 -07:00
Ian Romanick
39ad0c2af8 intel/compiler: CSEL can do saturate
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>
2020-04-17 08:21:46 -07:00
Ian Romanick
5afaa407c1 intel/compiler: Only GE and L modifiers are commutative for SEL
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>
2020-04-17 08:21:43 -07:00
Ian Romanick
a80e44902f intel/compiler: Silence unused parameter warning in update_inst_scoreboard
src/intel/compiler/brw_fs_scoreboard.cpp: In function ‘void {anonymous}::update_inst_scoreboard(const fs_visitor*, const ordered_address*, const fs_inst*, unsigned int, {anonymous}::scoreboard&)’:
src/intel/compiler/brw_fs_scoreboard.cpp:793:45: warning: unused parameter ‘shader’ [-Wunused-parameter]
  793 |    update_inst_scoreboard(const fs_visitor *shader, const ordered_address *jps,
      |                           ~~~~~~~~~~~~~~~~~~^~~~~~

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>
2020-04-17 08:21:42 -07:00
Ian Romanick
c836295dfd intel/compiler: Silence unused parameter warning in fs_live_variables::setup_one_read
src/intel/compiler/brw_fs_live_variables.cpp: In member function ‘void brw::fs_live_variables::setup_one_read(brw::fs_live_variables::block_data*, fs_inst*, int, const fs_reg&)’:
src/intel/compiler/brw_fs_live_variables.cpp:56:67: warning: unused parameter ‘inst’ [-Wunused-parameter]
   56 | fs_live_variables::setup_one_read(struct block_data *bd, fs_inst *inst,
      |                                                          ~~~~~~~~~^~~~

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>
2020-04-17 08:21:40 -07:00
Ian Romanick
62f70a353f intel/compiler: Silence unused parameter warnings in vec4_tcs_visitor
In file included from src/intel/compiler/brw_vec4_tcs.cpp:31:
src/intel/compiler/brw_vec4_tcs.h: In member function ‘virtual void brw::vec4_tcs_visitor::emit_urb_write_header(int)’:
src/intel/compiler/brw_vec4_tcs.h:74:43: warning: unused parameter ‘mrf’ [-Wunused-parameter]
   74 |    virtual void emit_urb_write_header(int mrf) {}
      |                                       ~~~~^~~
src/intel/compiler/brw_vec4_tcs.h: In member function ‘virtual brw::vec4_instruction* brw::vec4_tcs_visitor::emit_urb_write_opcode(bool)’:
src/intel/compiler/brw_vec4_tcs.h:75:57: warning: unused parameter ‘complete’ [-Wunused-parameter]
   75 |    virtual vec4_instruction *emit_urb_write_opcode(bool complete) { return NULL; }
      |                                                    ~~~~~^~~~~~~~

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4582>
2020-04-17 08:21:37 -07:00
Jason Ekstrand
d0d039a4d3 anv: Emit pushed UBO bounds checking code in the back-end compiler
This commit fixes performance regressions introduced by e03f965280
in which we started bounds checking our push constants.  This added a
LOT of shader code to shaders which use the robustBufferAccess feature
and led to substantial spilling.  The checking we just added to the FS
back-end is far more efficient for two reasons:

 1. It can be done at a whole register granularity rather than per-
    scalar and so we emit one SIMD8 SEL per 32B GRF rather than one
    SIMD16 SEL (executed as two SELs) for each component loaded.

 2. Because we do it with NoMask instructions, we can do it on whole
    pushed GRFs without splatting them out to SIMD8 or SIME16 values.
    This means that robust buffer access no longer explodes our register
    pressure for no good reason.

As a tiny side-benefit, we're now using can use AND instead of SEL which
means no need for the flag and better scheduling.

Vulkan pipeline database results on ICL:

    Instructions in all programs: 293586059 -> 238009118 (-18.9%)
    SENDs in all programs: 13568515 -> 13568515 (+0.0%)
    Loops in all programs: 149720 -> 149720 (+0.0%)
    Cycles in all programs: 88499234498 -> 84348917496 (-4.7%)
    Spills in all programs: 1229018 -> 184339 (-85.0%)
    Fills in all programs: 1348397 -> 246061 (-81.8%)

This also improves the performance of a few apps:

 - Shadow of the Tomb Raider: +4%
 - Witcher 3: +3.5%
 - UE4 Shooter demo: +2%

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4447>
2020-04-17 14:48:06 +00:00
Jason Ekstrand
eb5a10ff63 intel/cfg: Add first/last_block helpers
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4447>
2020-04-17 14:48:06 +00:00
Jason Ekstrand
e003104605 intel: Add _const versions of prog_data cast helpers
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4597>
2020-04-16 17:26:16 +00:00
Jason Ekstrand
b2e4157143 anv: Advertise SEND count through VK_EXT_pipeline_executable_properties
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4578>
2020-04-15 21:51:55 +00:00
Caio Marcelo de Oliveira Filho
db74ad0696 intel/compiler: Remove cs_prog_data->threads
At this point all drivers are doing this math on their own -- since
most of them need to cover the variable group size case, in which at
compile time the group size (and number of threads) is not defined.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
2020-04-09 19:23:20 -07:00
Plamena Manolova
c77dc51203 intel/compiler: Add support for variable workgroup size
Add new builtin parameters that are used to keep track of the group
size.  This will be used to implement ARB_compute_variable_group_size.

The compiler will use the maximum group size supported to pick a
suitable SIMD variant.  A later improvement will be to keep all SIMD
variants (like FS) so the driver can select the best one at dispatch
time.

When variable workgroup size is used, the small workgroup optimization
is disabled as it we can't prove at compile time that the barriers
won't be needed.

Extracted from original i965 patch with additional changes by
Caio Marcelo de Oliveira Filho.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
2020-04-09 19:23:12 -07:00
Caio Marcelo de Oliveira Filho
c54fc0d07b intel/compiler: Replace cs_prog_data->push.total with a helper
The push.total field had three values but only one was directly
used (size).  Replace it with a helper function that explicitly takes
the cs_prog_data and the number of threads -- and use that in the
drivers.

This is a preparation for ARB_compute_variable_group_size where the
number of threads (hence the total size for push constants) is not
defined at compile time (not cs_prog_data->threads).

Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
2020-04-09 19:23:12 -07:00
Caio Marcelo de Oliveira Filho
395de69b1f intel/fs: Allow multiple slots for position
Change brw_compute_vue_map() to also take the number of pos slots.  If
more than one slot is used, the VARYING_SLOT_POS is treated as an
array.

When using Primitive Replication, instead of a single position, the
VUE must contain an array of positions.  Padding might be
necessary (after clip distance) to ensure rest of attributes start
aligned.

v2: Add note about array in the commit message and assert that
    pos_slots >= 1 to make clear 0 is invalid. (Jason)
    Move padding to be after the clip distance.

v3: Apply the correct offset when gathering the sources from outputs.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> [v2]
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/2313>
2020-04-07 17:16:09 +00:00