Commit Graph

157 Commits

Author SHA1 Message Date
Jason Ekstrand
0d462dbee5 intel/fs: Add an alignment to VARYING_PULL_CONSTANT_LOAD_LOGICAL
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3932>
2020-10-08 01:14:46 -05:00
Marcin Ślusarz
40b964dc8f intel/compiler: remove unused fs_validator::param_size
Found by Coverity as an uninitialized variable.

Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6667>
2020-09-10 12:16:58 +00:00
Jason Ekstrand
90b6745bc8 intel/fs,vec4: Stuff the constant data from NIR in the end of the program
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6244>
2020-09-02 19:48:44 +00:00
Matt Turner
66111bc95a intel/compiler: Drop opt_sampler_eot()
Gen9 and Cherryview have the ability to mark texture instructions with
the End-of-thread bit under some conditions, which allows the texture
result to be written to the render target directly, rather than
returning to the EU.

In order to handle overlapping primitives correctly, we have to use the
'sendc' instruction which stalls until other threads potentially writing
to the same locations in the render target are retired. Unfortunately,
this stall happens before the texture is sampled (rather than in
parallel with the stall), so for some literal edge cases (like the diagonal
edge between two triangles forming a rectangle) there can be a
performance penalty. As a result, it's probably not a good idea to use
this optimization in general.

I had planned to leave it enabled only for BLORP, where we use rectangle
primitives and are typically clearing/blitting an entire render target
without any overlapping primitives, but I noticed that the optimization
wasn't applied in some normal cases anyway. For example, in the piglit
test tests/shaders/glsl-fs-texture2d-bias.shader_test it is applied to
one BLORP-blit shader but not another due to some kind of mishandling of
register types (the destination register type of the texture operation
is UD while the color source of the render target write is F).

Additionally, the instruction scheduler assumed that the combined texture
and render target write operation took 0 cycles, leading to cycle
estimates that are wildly inaccurate. Since the optimization was not
implemented for SIMD32, and our decision whether to use the SIMD32
program is made by comparing its estimated performance with that of the
SIMD16 shader, we wrongly threw out a bunch of SIMD32 programs that were
likely profitable.

   total cycles in shared programs: 472807891 -> 473784245 (0.21%)
   cycles in affected programs: 108277 -> 1084631 (901.72%)
   helped: 0
   HURT: 1290

   total sends in shared programs: 998955 -> 1000245 (0.13%)
   sends in affected programs: 1400 -> 2690 (92.14%)
   helped: 0
   HURT: 1290

   LOST:   0
   GAINED: 33

This patch shows no performance changes in Intel's Mesa performance CI.

Given the problems, the lack of evidence that the pass improves
performance, and the fact that the hardware feature was removed from
subsequent GPU generations, I think that the pass is not valuable and
should be removed.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5412>
2020-06-12 19:01:26 +00:00
Caio Marcelo de Oliveira Filho
10d0f39beb intel/fs: Remove min_dispatch_width spilling decision from RA
Move the decision one level up, let brw_compile_*() functions use the
spilling information to decide whether or not a certain width
compilation can spill (passed via run_*() functions).

The min_dispatch_width was used to compare with the dispatch_width and
decide whether "a previous shader is already available, so don't
accept spill".

This is replaced by:

- Not calling run_*() functions if it is known beforehand that a smaller
  width already spilled -- since the larger width will spill and fail;

- Explicitly passing whether or not a shader is allowed to spill.  For
  the cases where the smaller width is available and hasn't spilled,
  the larger width will be compiled but is only useful if it won't
  spill.

Moving the decision to this level will be useful later for variable
group size, which is a case where we want all the widths to be allowed
to spill.
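
A sketch of the resulting flow in a brw_compile_*() function (simplified
pseudo-code, not the actual Mesa sources; arguments abbreviated):

   fs_visitor v8(..., 8 /* dispatch_width */);
   bool simd8_spilled = false;
   if (v8.run_fs(true /* allow_spilling */))
      simd8_spilled = v8.spilled_any_registers;

   /* Skip SIMD16 entirely if SIMD8 already spilled: the wider program
    * needs at least as many registers, so it would spill too.  If SIMD8
    * succeeded without spilling, still compile SIMD16 but forbid it
    * from spilling, since a spilling SIMD16 program is never preferable
    * to the non-spilling SIMD8 one. */
   if (!simd8_spilled) {
      fs_visitor v16(..., 16);
      v16.run_fs(false /* allow_spilling */);
   }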

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
2020-05-27 18:16:31 -07:00
Francisco Jerez
6579f562c3 intel/ir: Use brw::performance object instead of CFG cycle counts for codegen stats.
These should be more accurate than the current cycle counts, since
among other things they consider the effect of post-scheduling passes
like the software scoreboard on TGL.  In addition it will enable us to
clean up some of the now redundant cycle-count estimation
functionality in the instruction scheduler.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
188a3659ae intel/ir: Import shader performance analysis pass.
This introduces an analysis pass intended to estimate several
performance statistics of the shader, including cycle count latency
and throughput values, based on static modeling.  It has instruction
performance information more comprehensive than the current scheduling
pass for all platforms between Gen4-11, and works on both the FS and
VEC4 back-ends.

The most immediate purpose of this pass is to implement a heuristic
meant to determine whether using SIMD32 dispatch for a fragment shader
can be expected to help more than it hurts.  In addition this will
allow the effect of passes run after scheduling (e.g. the TGL software
scoreboard pass and the VEC4 dependency control pass) to be visible in
shader-db statistics.

But that isn't the end of the story, other potential applications of
this pass (not part of this MR) I've been playing around with are:

 - Implement a similar SIMD16 heuristic allowing the identification of
   inefficient SIMD16 fragment shaders.

 - Implement similar SIMD16 and SIMD32 heuristics for the compute
   shader stage -- Currently compute shader builds always use the
   SIMD16 shader if available and never use the SIMD32 shader unless
   strictly necessary, which is suboptimal under certain conditions.

 - Hook up to the instruction scheduler in order to improve the
   accuracy of its timing information.

 - Use as heuristic in order to drive the selection of scheduling
   modes (Matt was experimenting with that).

 - Plug to the TGL software scoreboard pass in order to implement a
   more effective SBID token allocation algorithm, since in general
   the optimal token allocation depends on the timings of all
   instructions in the program.

 - Use its bottleneck detection functionality in order to implement a
   heuristic computing a more optimal bound for the number of fragment
   shader threads executed in parallel (by adjusting the
   MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).

As a follow-up I'm planning to submit updated timing information for
Gen12 platforms -- Everything else required to support Gen12 like SWSB
handling is already included in this patch, but there were some IP
concerns regarding the TGL timing parameters since they cannot
currently be obtained with the documentation and hardware which is
publicly available.  The timing parameters for any previous Gen7-11
platforms can be obtained by anyone by sampling the timestamp register
using e.g. shader_time, though I have some more convenient
instrumentation coming up.
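
As a concrete example, the SIMD32 fragment shader heuristic this pass
enables amounts to a comparison along these lines (a sketch, under the
assumption that the analysis result is a brw::performance object
exposing a throughput estimate):

   /* Sketch only -- keep the SIMD32 variant only when its estimated
    * throughput beats that of the SIMD16 variant. */
   const brw::performance &perf16 = v16->performance_analysis.require();
   const brw::performance &perf32 = v32->performance_analysis.require();
   const bool use_simd32 = perf32.throughput > perf16.throughput;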

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:03 -07:00
Francisco Jerez
bda1d72dd9 intel/fs: Replace fs_visitor::bank_conflict_cycles() with stand-alone function.
This will be re-usable by the IR performance analysis pass.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Plamena Manolova
c77dc51203 intel/compiler: Add support for variable workgroup size
Add new builtin parameters that are used to keep track of the group
size.  This will be used to implement ARB_compute_variable_group_size.
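
A sketch of the mechanism (the builtin enum names are illustrative, as
this series appears to introduce them; see the patch for the real ones):

   /* The group size becomes three ordinary builtin uniforms... */
   param[0] = BRW_PARAM_BUILTIN_WORK_GROUP_SIZE_X;
   param[1] = BRW_PARAM_BUILTIN_WORK_GROUP_SIZE_Y;
   param[2] = BRW_PARAM_BUILTIN_WORK_GROUP_SIZE_Z;
   /* ...which the driver fills at dispatch time, so the compiled
    * shader reads the actual group size like any other push constant. */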

The compiler will use the maximum group size supported to pick a
suitable SIMD variant.  A later improvement will be to keep all SIMD
variants (like FS) so the driver can select the best one at dispatch
time.

When variable workgroup size is used, the small workgroup optimization
is disabled, as we can't prove at compile time that the barriers
won't be needed.

Extracted from original i965 patch with additional changes by
Caio Marcelo de Oliveira Filho.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
2020-04-09 19:23:12 -07:00
Mathias Fröhlich
630154e77b i965: Move down genX_upload_sbe in profiles.
Avoid looping over all VARYING_SLOT_MAX urb_setup array
entries in genX_upload_sbe. Prepare an array indirection
to the active entries of urb_setup already at compile
time. On upload, only walk the active entries.
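
The shape of the indirection (a sketch; the field names are my guess at
the patch's, see the actual change for the real ones):

   /* Compile step: record the active slots once. */
   prog_data->urb_setup_attribs_count = 0;
   for (int attr = 0; attr < VARYING_SLOT_MAX; attr++) {
      if (prog_data->urb_setup[attr] >= 0) {
         prog_data->urb_setup_attribs[prog_data->urb_setup_attribs_count++] =
            attr;
      }
   }

   /* Upload step: genX_upload_sbe() walks only the active entries. */
   for (int i = 0; i < prog_data->urb_setup_attribs_count; i++) {
      const int attr = prog_data->urb_setup_attribs[i];
      /* ... emit the SBE setup for attr ... */
   }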

v2: Use uint8_t to store the attribute numbers.
v3: Change loop to build up the array indirection.
v4: Rebase.
v5: Style fix.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/308>
2020-03-10 14:28:36 +00:00
Matt Turner
bb3e7b0fe3 intel/compiler: Pass shader_stats for each SIMD mode
Passing shader_stats to the fs_generator constructor means that the
SIMD8 shader stats from the visitor (such as the scheduler mode) will be
reported out for the SIMD16/SIMD32 versions as well.

As you can see, we are now passing 'shader_stats' and 'stats' to
generate_code(), which is obviously odd looking. Ian rebased and
committed an old patch of mine which added the shader_stats struct on
July 30 in commit dabb5d4bee (i965/fs: Add a shader_stats struct.) and
shortly after on August 12 Jason added the brw_compile_stats struct in
commit 134607760a (intel/compiler: Fill a compiler statistics struct).

I'd like to combine the two, but I'm not sure how. shader_stats is an
input to generate_code() while brw_compile_stats is an output and is
only used by the Vulkan driver. Leave it as is for now...

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>
2020-03-09 04:44:12 +00:00
Matt Turner
75a33e268e intel/compiler: Mark some methods and parameters const
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>
2020-03-09 04:44:11 +00:00
Francisco Jerez
e5e4d016b9 intel/compiler: Move register pressure calculation into IR analysis object
This defines a new BRW_ANALYSIS object which wraps the register
pressure computation code along with its result.  For the rationale
see the previous commits converting the liveness and dominance
analysis passes to the IR analysis framework.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:21:10 -08:00
Francisco Jerez
2878817197 intel/compiler: Drop invalidate_live_intervals()
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:21:01 -08:00
Francisco Jerez
ea44de6d8c intel/compiler/fs: Switch liveness analysis to IR analysis framework
This involves wrapping fs_live_variables in a BRW_ANALYSIS object and
hooking it up to invalidate_analysis() so it's properly invalidated.
Seems like a lot of churn but it's fairly straightforward.  The
fs_visitor invalidate_ and calculate_live_intervals() methods are no
longer necessary after this change.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:57 -08:00
Francisco Jerez
ba73e606f6 intel/compiler: Move all live interval analysis results into fs_live_variables
This moves the following methods that are currently defined in
fs_visitor (even though they are side products of the liveness
analysis computation) and are already implemented in
brw_fs_live_variables.cpp:

> bool virtual_grf_interferes(int a, int b) const;
> int *virtual_grf_start;
> int *virtual_grf_end;

It makes sense for them to be part of the fs_live_variables object,
because they have the same lifetime as other liveness analysis results
and because this will allow some extra validation to happen wherever
they are accessed in order to make sure that we only ever use
up-to-date liveness analysis results.

This shortens the virtual_grf prefix in order to compensate for the
slightly increased lexical overhead from the live_intervals pointer
dereference.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:43 -08:00
Francisco Jerez
3ceb496cdf intel/compiler: Mark virtual_grf_interferes and vars_interfere as const
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:41 -08:00
Francisco Jerez
d966a6b4c4 intel/compiler: Introduce backend_shader method to propagate IR changes to analysis passes
The invalidate_analysis() method knows what analysis passes there are
in the back-end and calls their invalidate() method to report changes
in the IR.  For the moment it just calls invalidate_live_intervals()
(which will eventually be fully replaced by this function) if anything
changed.

This makes all optimization passes invalidate DEPENDENCY_EVERYTHING,
which is clearly far from ideal -- The dependency classes passed to
invalidate_analysis() will be refined in a future commit.
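
In its initial form the method is essentially (sketch):

   void
   backend_shader::invalidate_analysis(brw::analysis_dependency_class c)
   {
      /* For now any reported change tears down the live intervals;
       * finer-grained invalidation comes in later commits. */
      invalidate_live_intervals();
   }

with every IR-mutating pass ending in something like:

   if (progress)
      invalidate_analysis(DEPENDENCY_EVERYTHING);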

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:32 -08:00
Francisco Jerez
a6fc88e91b intel/compiler: Reverse inclusion dependency between brw_fs_live_variables.h and brw_fs.h
brw_fs.h (in particular fs_visitor) is logically a user of the live
variables analysis pass, not the other way around.
brw_fs_live_variables.h requires the definition of some FS IR data
structures to compile, but those can be obtained directly from
brw_ir_fs.h without including brw_fs.h.  The dependency of
fs_live_variables on fs_visitor is rather accidental and will be
removed in a future commit; a forward declaration is enough for the
moment.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:26 -08:00
Francisco Jerez
a792e11f5c intel/fs/gen7+: Swap sample mask flag register and FIND_LIVE_CHANNEL temporary.
FIND_LIVE_CHANNEL was using f1.0-f1.1 as temporary flag register on
Gen7, instead use f0.0-f0.1.  In order to avoid collision with the
discard sample mask, move the latter to f1.0-f1.1.  This makes room
for keeping track of the sample mask of the second half of SIMD32
programs that use discard.

Note that some MOVs of the sample mask into f1.0 become redundant now
in lower_surface_logical_send() and lower_a64_logical_send().

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
083fd96a97 intel/fs: Use helper for discard sample mask flag subregister number.
Use it instead of hard-coding f0.1 for the sample mask of programs
that use discard.  This will make the task easier when we replace f0.1
with another flag register location in order to support discard with
SIMD32 shaders.
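
The helper is essentially a one-liner (sketch; at this point it still
encodes the old f0.1 location, i.e. flag subregister 1):

   static inline unsigned
   sample_mask_flag_subreg(const fs_visitor *shader)
   {
      return 1;
   }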

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
a8ac0bd759 intel/fs/gen12: Workaround unwanted SEND execution due to broken NoMask control flow.
This is a less invasive alternative to the workaround documented in
the hardware spec for GEN:BUG:1407528679, which doesn't involve
disabling structured control flow (it's unlikely that switching to
GOTO/JOIN would have actually fixed the problem anyway).

Under some conditions Gen12 hardware can end up executing a BB with
all channels disabled, which will lead to the execution of any NoMask
instructions in it, even though any execution-masked instructions will
be correctly shot down.  This may break assumptions of some NoMask
SEND messages whose descriptor depends on data generated by live
invocations of the shader.

This avoids the problem by predicating certain instructions on an ANY
horizontal predicate that makes sure that their execution is omitted
when all channels of the program are disabled.  The shader-db impact
of this patch seems to be minimal:

total instructions in shared programs: 17169833 -> 17169913 (0.00%)
instructions in affected programs: 30663 -> 30743 (0.26%)
helped: 0
HURT: 42

total cycles in shared programs: 336966176 -> 336968568 (0.00%)
cycles in affected programs: 2367290 -> 2369682 (0.10%)
helped: 0
HURT: 13
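
Schematically, the lowering does something like this for an affected
SEND (a sketch; emit_mask_to_flag() is a hypothetical placeholder for
materializing the execution mask into the flag register):

   emit_mask_to_flag(bld, /* f0 */ 0);
   send_inst->predicate = BRW_PREDICATE_ALIGN1_ANY32H;
   send_inst->flag_subreg = 0;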

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: 20.0 <mesa-stable@lists.freedesktop.org>
2020-02-14 14:31:48 -08:00
Ian Romanick
58907568ec intel/fs: Add SHADER_OPCODE_[IU]SUB_SAT pseudo-ops
v2: Add a big comment explaining the [IU]SUB_SAT lowering.  Suggested by
Caio.

v3: Use get_fpu_lowered_simd_width in get_lowered_simd_width.  Suggested
by Ken on IRC.

v4: Fix a typo in a comment.  Noticed by Caio.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
2020-01-23 00:18:57 +00:00
Caio Marcelo de Oliveira Filho
ff5b74ef32 intel/fs: Add workgroup_size() helper
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
2020-01-21 23:41:35 +00:00
Francisco Jerez
b54b67e067 intel/fs: Switch to standard vector layout for barycentrics at optimization time.
This involves permuting the registers of barycentric vectors to have
the standard X[0-n] Y[0-n] layout at NIR translation time.
Barycentrics are converted to the format expected by the PLN
instruction in the lower_barycentrics() pass run after the
optimization loop.

Main reason is correctness of SIMD32 fragment shaders.  The
shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
during NIR translation are busted for SIMD32.  This leads to serious
corruption at present with INTEL_DEBUG=do32, especially on Gen11+
where these helpers are hit more frequently due to the lack of a
hardware PLN instruction.

Of course one could have chosen to fix those helpers instead, but
there is another far more subtle issue that was reported during review
of the SIMD32 fragment shader codegen changes: The SIMD splitting pass
currently handles SIMD32 barycentric vectors as if they had the
standard X[0-n] Y[0-n] layout, even though they are interleaved for
the PLN instruction, which causes incorrect execution masks to be
applied to the MOVs unzipping barycentric vectors in cases where a
LINTERP instruction occurs under non-uniform control flow.

I'm not aware of any conformance regressions due to the latter issue
at present, but for our peace of mind let's move the conversion to the
PLN layout into the lower_barycentrics() pass run after
lower_simd_width().
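
For reference, the two layouts of a SIMD16 barycentric vector:

   standard vector layout:   X[0-15] Y[0-15]
   PLN (interleaved) layout: X[0-7] Y[0-7] X[8-15] Y[8-15]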

This leads to the following shader-db improvements (including SIMD32
shaders) in combination with the previous back-end preparation changes
-- Without them (especially the copy propagation changes) this would
lead to a massive number of regressions.  On ICL:

   total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
   instructions in affected programs: 10538474 -> 10343061 (-1.85%)
   helped: 68775
   HURT: 6

   total spills in shared programs: 8938 -> 8748 (-2.13%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8965 -> 8663 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   146
   GAINED: 43

On SKL:

   total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
   instructions in affected programs: 3876590 -> 3765635 (-2.86%)
   helped: 27492
   HURT: 2

   LOST:   191
   GAINED: 417

On SNB:

   total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
   instructions in affected programs: 5199074 -> 4606107 (-11.41%)
   helped: 29998
   HURT: 0

   LOST:   21
   GAINED: 30

Results are somewhat less impressive but still significant without
SIMD32 fragment shaders enabled.  On ICL:

   total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
   instructions in affected programs: 6114788 -> 6027719 (-1.42%)
   helped: 42046
   HURT: 6

   total spills in shared programs: 8218 -> 8028 (-2.31%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8953 -> 8651 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   0
   GAINED: 3

On SKL:

   total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
   instructions in affected programs: 168850 -> 167594 (-0.74%)
   helped: 711
   HURT: 2

On SNB:

   total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
   instructions in affected programs: 2702172 -> 2666037 (-1.34%)
   helped: 17818
   HURT: 0

All of the hurt shaders are either spilling slightly more or emitting
additional NOP instructions due to the SIMD16 POW workaround for
Gen8-9 combined with differences in scheduling.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:23:12 -08:00
Francisco Jerez
79bd252d6e intel/fs: Introduce barycentric layout lowering pass.
The goal is to represent barycentrics with the standard vector layout
during optimization and particularly SIMD lowering.  Instead of
emitting the barycentric layout conversions at NIR translation time,
do it later as a lowering pass.  For the moment this is only applied
to PI messages, but we'll give the same treatment to LINTERP
instructions too.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:22:59 -08:00
Francisco Jerez
44d7d66adc intel/fs: Split fetch_payload_reg() into separate helper for barycentrics.
We're about to change the layout of barycentric vectors, which will
involve permuting the GRFs of barycentrics fetched from the thread
payload.  Make room for this in a function separate from the generic
fetch_payload_reg(), since the permutation will only be applicable to
barycentric vectors.  This allows simplifying fetch_payload_reg(),
since there was no need for handling multiple-component payload
registers except for barycentrics.

This causes some minor shader-db noise due to the new helper emitting
a LOAD_PAYLOAD instruction unconditionally, but it will be cleaned up
shortly.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:22:51 -08:00
Francisco Jerez
c20dc9b836 intel/fs: Make implied_mrf_writes() an fs_inst method.
This will be convenient in a later commit enabling SIMD32 fragment
shaders, and happens to fix the calculation for MATH instructions
which is currently inaccurate for SIMD-lowered instructions on Gen4-5
platforms (all of them on Gen4 in SIMD16 mode), since it was based on
the shader's dispatch width rather than on the actual execution size
of the instruction.

This causes some shader-db noise on Gen4 due to the more compact
register allocation interacting with the SEND dependency workarounds,
but otherwise no major changes.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-10 11:02:30 -08:00
Francisco Jerez
591f146fd2 intel/fs/cse: Fix non-deterministic behavior due to inaccurate liveness calculation.
The liveness calculation done by the local CSE pass in order to prune
AEB entries whose sources are no longer live is currently inaccurate,
because the live intervals are calculated once at the beginning of the
pass, so they don't take into account any of the copy instructions
inserted by the CSE pass as it makes progress.  However the IP counter
used in that calculation is based on the start_ip of the basic block,
which is updated automatically whenever any instructions are inserted
into the CFG.  This causes the IP counter and liveness intervals to
get out of sync in programs with multiple basic blocks, causing the
CSE pass to toss AEB entries prematurely, which can lead to missed
optimization opportunities rather non-deterministically.

On BDW this leads to the following shader-db changes:

 total instructions in shared programs: 14952488 -> 14951763 (-0.00%)
 instructions in affected programs: 45416 -> 44691 (-1.60%)
 helped: 40
 HURT: 4

 total spills in shared programs: 20989 -> 20970 (-0.09%)
 spills in affected programs: 103 -> 84 (-18.45%)
 helped: 3
 HURT: 0

 total fills in shared programs: 24981 -> 24926 (-0.22%)
 fills in affected programs: 127 -> 72 (-43.31%)
 helped: 3
 HURT: 0

In addition it avoids a number of regressions in combination with some
of the optimization changes I'm working on for SIMD32, which would have
made CSE more effective, astonishingly causing it to be less effective
elsewhere in the program.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-10 11:02:06 -08:00
Jason Ekstrand
2fca325ea6 Revert "i965/fs: Merge CMP and SEL into CSEL on Gen8+"
This reverts commit 52c7df1643.  The pass,
while clearly useful for some shaders, has at least three bugs that I
was able to find fairly quickly:

 1. It doesn't work for type-converting MOVs because f > 0 is not the
    same as f2i(f) > 0

 2. CSEL is a 3src instruction and only supports one source type; it
    doesn't take this into account and tries to create instructions
    which do a F compare and a D select.  This is especially nasty to
    debug because you don't see that in the dumped assembly because we
    don't properly assert that types are the same in codegen.

 3. While you can handle 2, in theory, by reinterpreting types, you
    can't do that in the presence of source modifiers.  This pass
    doesn't even attempt to detect that.

Those are just the ones I found with the one almost trivial shader I was
debugging.  There may very well be more.  The best thing to do for now
is to just shut it off until someone has the time to figure out how to do
this properly and write tests to ensure it's correct.

Fixes: 3cb085e6d61a "i965/fs: Merge CMP and SEL into CSEL on Gen8+"
Reviewed-by: Brian Paul <brianp@vmware.com>
2019-11-20 20:47:32 +00:00
Jason Ekstrand
53bfcdeecf intel/fs: Implement the new load/store_scratch intrinsics
This commit fills in a number of different pieces:

 1. We add support to brw_nir_lower_mem_access_bit_sizes to handle the
    new intrinsics.  This involves simple plumbing work as well as a
    tiny bit of extra logic to always scalarize scratch intrinsics

 2. Add code to brw_fs_nir.cpp to turn nir_load/store_scratch intrinsics
    into byte/dword scattered read/write messages which use the A32
    stateless model.

 3. Add code to lower_surface_logical_send to handle dword scattered
    messages and the A32 stateless model.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-11-11 17:17:02 +00:00
Francisco Jerez
265c7c8971 intel/fs/gen12: Introduce software scoreboard lowering pass.
Gen12+ hardware lacks the register scoreboard logic that used to
guarantee data coherency between register reads and writes in previous
generations.  This lowering pass runs after register allocation in
order to make up for it.

It works by performing global dataflow analysis in order to determine
the set of potential dependencies of every instruction in the shader,
and then inserts any required SWSB annotations and additional SYNC
instructions in order to guarantee data coherency.

v2: Drop unnecessary _safe list iteration (Caio).

v3: Temporarily workaround potential WaR hazard between FPU
    instruction and subsequent out-of-order write, pending
    clarification from the hardware team.  Drop redundant tracking of
    implicit access of acc0-1, since the hardware guarantees coherency
    of these (but not the other accumulators...).
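
As a hand-written illustration (not actual compiler output), the pass
turns unannotated Gen12 code into something like:

   send (8)  r10 ... {$1}         // out-of-order op allocates SBID $1
   add (8)   r30 r10 r40 {$1.dst} // wait for $1's write of r10 to land
   mov (8)   r50 r30 {@1}         // in-order dep, register distance 1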

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-10-11 12:24:16 -07:00
Francisco Jerez
b2ae65c7d9 intel/fs: Fix constness of implied_mrf_writes() argument.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-10-11 12:24:16 -07:00
Samuel Iglesias Gonsálvez
9da56ffc52 i965/fs: add emit_shader_float_controls_execution_mode() and aux functions
We need this function to emit code that sets up the control register
later with the defined execution mode for the shader. Therefore, we
emit it as the first instruction.

v2:
- Fix bug in setting the default mode mask in brw_rnd_mode_from_nir().
- Fix support for rounding modes in brw_rnd_mode_from_nir().

v3:
- Updated to renamed shader info member and enum values (Andres).

v4:
- Add actual emission as first instruction of emit_nir_code (Caio).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-17 23:39:19 +03:00
Kenneth Graunke
86a63b1098 intel/compiler: Refactor FB write message control setup into a helper.
This will be used by visitor code to convert directly to SEND in a bit.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-08-27 14:20:07 -07:00
Jason Ekstrand
f58e0405b6 intel/fs: Drop the gl_program from fs_visitor
It's not used by anything anymore now that so much lowering has been
moved into NIR.  Sadly, we still need one in brw_compile_gs() for
geometry shaders on Sandy Bridge.  Short of a lot of pointless work,
that one's probably not going away.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-08-25 01:02:52 -05:00
Jason Ekstrand
134607760a intel/compiler: Fill a compiler statistics struct
This commit is all annoying plumbing work which just adds support for a
new brw_compile_stats struct.  This struct provides a binary driver
readable form of the same statistics we dump out to stderr when
INTEL_DEBUG is set with a shader stage.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2019-08-12 22:56:07 +00:00
Paulo Zanoni
866bb775de intel/fs: add 64 bit integer multiplication lowering
While NIR's lower_imul64() solves the case of 64-bit integer multiplications
generated early, we don't have a way to lower such instructions when they are
generated by our own backend, as with the scan/reduce intrinsics. We'll need
this soon, so implement it now.

An easy way to test this is to simply disable nir_lower_imul64 to let
those operations reach the backend.

v2:
  - Fix Q/UQ copy/paste errors (Caio).
  - Transform an 'if' into 'else if' (Caio).
  - Add an extra comment to clarify the need for 64b = 32b * 32b
    (Caio).
  - Make private functions private (Caio).
v3:
  - Remove ambiguity with 'b' and 'd' variables (Caio).
  - Allocate potentially less regs for the dwords (Caio).
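
For reference, the lowering follows the usual schoolbook decomposition
of a 64-bit product into 32-bit partial products:

   a * b = (a_hi * 2^32 + a_lo) * (b_hi * 2^32 + b_lo)
         = a_lo * b_lo                            (full 64-bit product)
         + ((a_lo * b_hi + a_hi * b_lo) << 32)    (low 32 bits needed)

The a_hi * b_hi term shifts entirely out of the low 64 bits, so it is
never computed.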

Cc: Jason Ekstrand <jason.ekstrand@intel.com>
Cc: Matt Turner <matt.turner@intel.com>
Cc: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-08-12 15:16:23 -07:00
Paulo Zanoni
6ba4717924 intel/compiler: don't instantiate a builder for each instruction
Don't instantiate a builder for each instruction during
lower_integer_multiplication(). Instantiate one only when needed.

On the other hand, these unneeded builders don't seem to cost much to
init, so I don't expect any significant difference in performance:
this is mostly about code organization.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-08-12 15:16:23 -07:00
Paulo Zanoni
75b3868dcc intel/compiler: extract subfunctions of lower_integer_multiplication()
The lower_integer_multiplication() function is already a little too
big. I want to add more to it, so let's reorganize the existing code
first. Let's start with just extracting the current code to
subfunctions. Later we'll change them a little more.

v2: Make private functions private (Caio).
v3: Fix typo (Caio).

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-08-12 15:16:23 -07:00
Jason Ekstrand
d03ec807a4 intel/fs: Drop all of the 64-bit varying code
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-31 18:14:09 -05:00
Matt Turner
46a3ea06be i965/fs: Print the scheduler mode.
Line wrap some awfully long lines while we are here.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
2019-07-30 14:35:43 -07:00
Matt Turner
dabb5d4bee i965/fs: Add a shader_stats struct.
It'll grow further, and we'd like to avoid adding an additional
parameter to fs_generator() for each new piece of data.

v2 (idr): Rebase on 17 months.  Track a visitor instead of a cfg.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-30 14:35:43 -07:00
Jason Ekstrand
e401303597 intel/fs: Remove calculate_urb_setup from fs_visitor
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-07-29 23:30:26 +00:00
Kenneth Graunke
517005b4cf i965: Use NIR to lower legacy userclipping.
This allows us to drop legacy userclip plane handling in both the vec4
and FS backends, and simplifies a few interfaces.

v2 (Jason Ekstrand):
 - Move brw_nir_lower_legacy_clipping to brw_nir_uniforms.cpp because
   it's i965-specific.
 - Handle adding the params in brw_nir_lower_legacy_clipping
 - Call brw_nir_lower_legacy_clipping from brw_codegen_vs_prog

Co-authored-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-07-24 18:00:13 +00:00
Jason Ekstrand
14781e2122 intel/compiler: Add a "base class" for program keys
Right now, all keys have two things in common: a program string ID and a
sampler_prog_key_data.  I'd like to add another thing or two and need a
place to put it.  This commit adds a new brw_base_prog_key struct which
contains those two common bits.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-07-10 19:35:55 +00:00
Jason Ekstrand
fa869f45c8 intel/fs: Use nir_lower_interpolation on gen11+
On gen11, they removed the PLN instruction, so we have to emit a pile of
MADs to emulate it.  We may as well do that in NIR so we can optimize and
later schedule it.
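
Roughly, each PLN computing dst = a*x + b*y + c turns into two MADs
(a sketch; Gen MAD computes src0 + src1 * src2):

   tmp = MAD(c,   a, bary.x)   // tmp = c + a * bary.x
   dst = MAD(tmp, b, bary.y)   // dst = tmp + b * bary.y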

Shader-db results on Ice Lake:

    total instructions in shared programs: 17145644 -> 16556440 (-3.44%)
    instructions in affected programs: 11507454 -> 10918250 (-5.12%)
    helped: 35763
    HURT: 42085
    helped stats (abs) min: 1 max: 140 x̄: 19.09 x̃: 18
    helped stats (rel) min: 0.04% max: 37.93% x̄: 15.40% x̃: 14.49%
    HURT stats (abs)   min: 1 max: 248 x̄: 2.22 x̃: 2
    HURT stats (rel)   min: 0.05% max: 50.00% x̄: 5.00% x̃: 2.47%
    95% mean confidence interval for instructions value: -7.67 -7.47
    95% mean confidence interval for instructions %-change: -4.46% -4.29%
    Instructions are helped.

    total loops in shared programs: 4370 -> 4370 (0.00%)
    loops in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total cycles in shared programs: 360624645 -> 368220857 (2.11%)
    cycles in affected programs: 269631244 -> 277227456 (2.82%)
    helped: 15583
    HURT: 65874
    helped stats (abs) min: 1 max: 28561 x̄: 78.45 x̃: 32
    helped stats (rel) min: <.01% max: 67.81% x̄: 5.38% x̃: 2.44%
    HURT stats (abs)   min: 1 max: 238638 x̄: 133.87 x̃: 20
    HURT stats (rel)   min: <.01% max: 306.25% x̄: 5.81% x̃: 3.97%
    95% mean confidence interval for cycles value: 67.42 119.09
    95% mean confidence interval for cycles %-change: 3.61% 3.73%
    Cycles are HURT.

    total spills in shared programs: 8943 -> 8981 (0.42%)
    spills in affected programs: 1925 -> 1963 (1.97%)
    helped: 44
    HURT: 14

    total fills in shared programs: 21815 -> 21925 (0.50%)
    fills in affected programs: 3511 -> 3621 (3.13%)
    helped: 41
    HURT: 18

    LOST:   70
    GAINED: 14

Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-02 16:15:25 +00:00
Ian Romanick
a288708506 intel/fs: Add need_dest parameter to fs_visitor::nir_emit_alu
This is the same as the need_dest parameter to
prepare_alu_destination_and_sources.  This allows us to not change the
register that is expected to hold a result if an instruction is
re-emitted.  This is particularly a problem if the re-emitted
instruction is a partial write.  A later patch will use this feature.

No shader-db changes on any Intel platform.

v2: Don't do the Boolean resolve when there is no destination.  If the
ALU instruction didn't write a register, there's nothing to resolve.
This replaces an earlier patch "intel/fs: Allocate dummy destination
register when need_dest is false".

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-06-05 17:04:08 -07:00
Kenneth Graunke
646924cfa1 intel/compiler: Implement TCS 8_PATCH mode and INTEL_DEBUG=tcs8
Our tessellation control shaders can be dispatched in several modes.

- SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
  channel corresponding to a different patch vertex.  PATCHLIST_N will
  launch (N / 8) threads.  If N is less than 8, some channels will be
  disabled, leaving some untapped hardware capabilities.  Conditionals
  based on gl_InvocationID are non-uniform, which means that they'll
  often have to execute both paths.  However, if there are fewer than
  8 vertices, all invocations will happen within a single thread, so
  barriers can become no-ops, which is nice.  We also burn a maximum
  of 4 registers for ICP handles, so we can compile without regard for
  the value of N.  It also works in all cases.

- DUAL_PATCH mode processes up to two patches at a time, where the first
  four channels come from patch 1, and the second group of four come
  from patch 2.  This tries to provide better EU utilization for small
  patches (N <= 4).  It cannot be used in all cases.

- 8_PATCH mode processes 8 patches at a time, with a thread launched per
  vertex in the patch.  Each channel corresponds to the same vertex, but
  in each of the 8 patches.  This utilizes all channels even for small
  patches.  It also makes conditions on gl_InvocationID uniform, leading
  to proper jumps.  Barriers, unfortunately, become real.  Worse, for
  PATCHLIST_N, the thread payload burns N registers for ICP handles.
  This can burn up to 32 registers, or 1/4 of our register file, for
  URB handles.  For Vulkan (and DX), we know the number of vertices at
  compile time, so we can limit the amount of waste.  In GL, the patch
  dimension is dynamic state, so we either would have to waste all 32
  (not reasonable) or guess (badly) and recompile.  This is unfortunate.
  Because we can only spawn 16 thread instances, we can only use this
  mode for PATCHLIST_16 and smaller.  The rest must use SINGLE_PATCH.

This patch implements the new 8_PATCH TCS mode, but leaves us using
SINGLE_PATCH by default.  A new INTEL_DEBUG=tcs8 flag will switch to
using 8_PATCH mode for testing and benchmarking purposes.  We may
want to consider using 8_PATCH mode in Vulkan in some cases.

The data I've seen shows that 8_PATCH mode can be more efficient in
some cases, but SINGLE_PATCH mode (the one we use today) is faster
in other cases.  Ultimately, the TES matters much more than the TCS
for performance, so the decision may not matter much.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-05-14 13:16:30 -07:00
Kenneth Graunke
076159b40b intel/compiler: Move ICP handle fetching into a helper function.
This will be significantly different in 8_PATCH mode.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-05-14 13:16:28 -07:00