third_party_mesa3d

Author	SHA1	Message	Date
Danylo Piliaiev	c8abe03f3b	i965,iris,anv: Make alpha to coverage work with sample mask From "Alpha Coverage" section of SKL PRM Volume 7: "If Pixel Shader outputs oMask, AlphaToCoverage is disabled in hardware, regardless of the state setting for this feature." From OpenGL spec 4.6, "15.2 Shader Execution": "The built-in integer array gl_SampleMask can be used to change the sample coverage for a fragment from within the shader." From OpenGL spec 4.6, "17.3.1 Alpha To Coverage": "If SAMPLE_ALPHA_TO_COVERAGE is enabled, a temporary coverage value is generated where each bit is determined by the alpha value at the corresponding sample location. The temporary coverage value is then ANDed with the fragment coverage value to generate a new fragment coverage value." Similar wording could be found in Vulkan spec 1.1.100 "25.6. Multisample Coverage" Thus we need to compute alpha to coverage dithering manually in shader and replace sample mask store with the bitwise-AND of sample mask and alpha to coverage dithering. The following formula is used to compute final sample mask: m = int(16.0 * clamp(src0_alpha, 0.0, 1.0)) dither_mask = 0x1111 * ((0xfea80 >> (m & ~3)) & 0xf) | 0x0808 * (m & 2) | 0x0100 * (m & 1) sample_mask = sample_mask & dither_mask Credits to Francisco Jerez <currojerez@riseup.net> for creating it. It gives a number of ones proportional to the alpha for 2, 4, 8 or 16 least significant bits of the result. GEN6 hardware does not have issue with simultaneous usage of sample mask and alpha to coverage however due to the wrong sending order of oMask and src0_alpha it is still affected by it. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109743 Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net>	2019-03-25 13:54:55 -07:00
Ian Romanick	7725d60938	intel/fs: Emit better code for b2f(inot(a)) and b2i(inot(a)) Since Boolean values are either -1 (true) or 0 (false), b2f(inot(a)) maps -1 => 0.0 and 0 => 1.0. This is equivalent to 1.0 + float(boolBitsToInt(a)). On Intel GPUs, ADD is one of the few instructions that can type-convert during write to destination, so we can achieve this in a single instruction: add g47F, g26D, 1D v2: Fix swizzles. v3: Fix typos in comments. Noticed by Ken. All Gen6+ platforms had similar results. (Skylake shown) Skylake total instructions in shared programs: 15185583 -> 15184683 (<.01%) instructions in affected programs: 239389 -> 238489 (-0.38%) helped: 899 HURT: 1 helped stats (abs) min: 1 max: 2 x̄: 1.00 x̃: 1 helped stats (rel) min: 0.15% max: 1.85% x̄: 0.49% x̃: 0.44% HURT stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2 HURT stats (rel) min: 0.09% max: 0.09% x̄: 0.09% x̃: 0.09% 95% mean confidence interval for instructions value: -1.01 -0.99 95% mean confidence interval for instructions %-change: -0.51% -0.48% Instructions are helped. total cycles in shared programs: 370964249 -> 370961508 (<.01%) cycles in affected programs: 1487586 -> 1484845 (-0.18%) helped: 420 HURT: 268 helped stats (abs) min: 1 max: 232 x̄: 22.41 x̃: 6 helped stats (rel) min: 0.05% max: 22.60% x̄: 1.30% x̃: 0.41% HURT stats (abs) min: 1 max: 230 x̄: 24.90 x̃: 10 HURT stats (rel) min: <.01% max: 21.60% x̄: 1.45% x̃: 0.52% 95% mean confidence interval for cycles value: -7.61 -0.36 95% mean confidence interval for cycles %-change: -0.44% -0.02% Cycles are helped. No changes on Iron Lake or GM45. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2019-03-01 12:42:14 -08:00
Ian Romanick	8eb36c9129	intel/fs: Emit logical-not of operands on Gen8+ On Gen8+ specifying negation of a logical operation such as AND actually performs a logical-not. Take advantage of this to generate fewer instructions. v2: Major rebase. Use nir_src_as_alu_instr. Fix swizzle handling. No changes on any pre-Gen8 platform. Skylake and Broadwell had similar results. (Broadwell shown) total instructions in shared programs: 15466902 -> 15466274 (<.01%) instructions in affected programs: 1262953 -> 1262325 (-0.05%) helped: 682 HURT: 4 helped stats (abs) min: 1 max: 5 x̄: 1.02 x̃: 1 helped stats (rel) min: 0.03% max: 2.40% x̄: 0.18% x̃: 0.04% HURT stats (abs) min: 1 max: 62 x̄: 17.50 x̃: 3 HURT stats (rel) min: 0.03% max: 1.89% x̄: 0.53% x̃: 0.10% 95% mean confidence interval for instructions value: -1.10 -0.73 95% mean confidence interval for instructions %-change: -0.19% -0.15% Instructions are helped. total cycles in shared programs: 410996093 -> 410950440 (-0.01%) cycles in affected programs: 144389048 -> 144343395 (-0.03%) helped: 519 HURT: 51 helped stats (abs) min: 1 max: 1060 x̄: 104.46 x̃: 140 helped stats (rel) min: 0.01% max: 10.98% x̄: 0.34% x̃: 0.03% HURT stats (abs) min: 1 max: 4060 x̄: 167.90 x̃: 22 HURT stats (rel) min: <.01% max: 8.20% x̄: 0.96% x̃: 0.25% 95% mean confidence interval for cycles value: -97.16 -63.02 95% mean confidence interval for cycles %-change: -0.32% -0.13% Cycles are helped. total spills in shared programs: 95311 -> 95329 (0.02%) spills in affected programs: 881 -> 899 (2.04%) helped: 0 HURT: 4 total fills in shared programs: 93629 -> 93634 (<.01%) fills in affected programs: 794 -> 799 (0.63%) helped: 1 HURT: 2 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2019-03-01 12:42:14 -08:00
Ian Romanick	06eaaf2de9	intel/fs: Refactor ALU source and destination handling to a separate function Other places will need to do this soon to properly handle source swizzles. The patch looks a little odd, but the change is pretty straight forward. All of the swizzle and mask handling is moved out, but the code for handling move instructions and vecN instructions remains in nir_emit_alu. I'm not terribly pleased with the "need_dest" parameter, but get_nir_dest is (somewhat surprisingly) destructive. I am open to suggestions of alternatives. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2019-03-01 12:42:14 -08:00
Jason Ekstrand	e644ed468f	intel/fs: Implement nir_intrinsic_global_atomic_* Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2019-02-01 16:11:00 -06:00
Jason Ekstrand	eab1c55590	intel/fs: Support SENDS in SHADER_OPCODE_SEND Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-29 18:43:55 +00:00
Jason Ekstrand	b284d222db	intel/fs: Use SHADER_OPCODE_SEND for varying UBO pulls on gen7+ Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-29 18:43:55 +00:00
Jason Ekstrand	8514eba693	intel/fs: Use SHADER_OPCODE_SEND for texturing on gen7+ Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-29 18:43:55 +00:00
Jason Ekstrand	7f1cf046cd	intel/fs: Add a generic SEND opcode Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-29 18:43:55 +00:00
Matt Turner	7e4e9da90d	intel/compiler: Prevent warnings in the following patch The next patch replaces an unsigned bitfield with a plain unsigned, which triggers gcc to begin warning on signed/unsigned comparisons. Keeping this patch separate from the actual move allows bisectability and generates no additional warnings temporarily. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2019-01-09 16:42:41 -08:00
Francisco Jerez	230a8a541d	intel/fs: Remove FS_OPCODE_UNPACK_HALF_2x16_SPLIT opcodes. These are broken on a future platform, but it turns out we don't need to fix them, since they're just type-converting moves with strided source. Kill them. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-09 12:03:09 -08:00
Francisco Jerez	2c99c7a56c	intel/fs: Remove existing lower_conversions pass. It's redundant with the functionality provided by lower_regioning now. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-09 12:03:09 -08:00
Francisco Jerez	efa4e4bc5f	intel/fs: Introduce regioning lowering pass. This legalization pass is meant to handle situations where the source or destination regioning controls of an instruction are unsupported by the hardware and need to be lowered away into separate instructions. This should be more reliable and future-proof than the current approach of handling CHV/BXT restrictions manually all over the visitor. The same mechanism is leveraged to lower unsupported type conversions easily, which obsoletes the lower_conversions pass. v2: Give conditional modifiers the same treatment as predicates for SEL instructions in lower_dst_modifiers() (Iago). Special-case a couple of other instructions with inconsistent conditional mod semantics in lower_dst_modifiers() (Curro). Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-09 12:03:09 -08:00
Francisco Jerez	812ede088f	intel/fs: Implement quad swizzles on ICL+. Align16 is no longer a thing, so a new implementation is provided using Align1 instead. Not all possible swizzles can be represented as a single Align1 region, but some fast paths are provided for frequently used swizzles that can be represented efficiently in Align1 mode. Fixes ~90 subgroup quad swap Vulkan CTS tests. Cc: mesa-stable@lists.freedesktop.org Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-09 12:03:08 -08:00
Francisco Jerez	c5f9c0009d	intel/fs: Handle source modifiers in lower_integer_multiplication(). lower_integer_multiplication() implements 32x32-bit multiplication on some platforms by bit-casting one of the 32-bit sources into two 16-bit unsigned integer portions. This can give incorrect results if the original instruction specified a source modifier. Fix it by emitting an additional MOV instruction implementing the source modifiers where necessary. Cc: mesa-stable@lists.freedesktop.org Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2019-01-09 12:03:08 -08:00
Jason Ekstrand	6b2918709a	intel/fs,vec4: Clean up a repeated pattern with SSBOs Everywhere we handle SSBO intrinsics, we have exactly the same pattern for computing the index so we may as well make a helper for it. We also add a get_nir_src_imm to vec4 and use it for SSBO offsets. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-11-08 10:09:06 -06:00
Jason Ekstrand	09f1de97a7	anv,i965: Lower away image derefs in the driver Previously, the back-end compiler turned image access into magic uniform reads and there was a complex contract between back-end compiler and driver about setting up and filling out those params. As of this commit, both drivers now lower image_deref_load_param_intel intrinsics to load_uniform intrinsics controlled by the driver and lower the other image_deref_* intrinsics to image_* intrinsics which take an actual binding table index. There are still "magic" uniforms but they are now added and controlled entirely by the driver and that contract no longer spans components. This also has the side-effect of making most image use compile-time binding table indices. Previously, all image access pulled the binding table index from a uniform. Part of the reason for this was that the magic uniforms made it difficult to decouple binding table indices from the uniforms and, since they are indexed completely differently (especially in Vulkan), it was hard to pull them apart. Now that the driver is handling both, it's trivial to decouple the two and provide actual binding table indices. Shader-db results on Kaby Lake: total instructions in shared programs: 15166872 -> 15164293 (-0.02%) instructions in affected programs: 115834 -> 113255 (-2.23%) helped: 191 HURT: 0 total cycles in shared programs: 571311495 -> 571196465 (-0.02%) cycles in affected programs: 4757115 -> 4642085 (-2.42%) helped: 73 HURT: 67 total spills in shared programs: 10951 -> 10926 (-0.23%) spills in affected programs: 742 -> 717 (-3.37%) helped: 7 HURT: 0 total fills in shared programs: 22226 -> 22201 (-0.11%) fills in affected programs: 1146 -> 1121 (-2.18%) helped: 7 HURT: 0 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-08-29 14:04:03 -05:00
Ian Romanick	d515c75463	intel/compiler: Implement untyped atomic float min, max, and compare-swap dataport messages v2: Split changes to the message type field to another patch. Suggested by Caio. Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>	2018-08-22 20:31:32 -07:00
Iago Toral Quiroga	7e6c8b0cb7	intel/compiler: add setup_imm_(u)b helpers The hardware doesn't support byte immediates, so similar to setup_imm_df() for doubles, these helpers work by loading the constant value into a VGRF. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-08-01 08:08:15 +02:00
Iago Toral Quiroga	81ca08e030	intel/compiler: remove unused function Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>	2018-07-09 13:21:48 +02:00
Francisco Jerez	f6c4aace22	intel/fs: Extend thread payload layout to SIMD32 And handle 32-wide payload register reads in fetch_payload_reg(). v2 (Jason Ekstrand); - Fix some whitespace and brace placement Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Francisco Jerez	8f143f70d6	intel/fs: Wrap FS payload register look-up in a helper function. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Francisco Jerez	5b6e91dd35	intel/fs: Remove program key argument from generator. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Jason Ekstrand	71cd9ebed9	intel/fs: Use image_deref intrinsics instead of image_var Since we had to rewrite the deref walking loop anyway, I took the opportunity to make it a bit clearer and more efficient. In particular, in the AoA case, we will now emit one minmax instead of one per array level. Acked-by: Rob Clark <robdclark@gmail.com> Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Acked-by: Dave Airlie <airlied@redhat.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-06-22 20:54:00 -07:00
Jose Maria Casanova Crespo	b8e099e7d5	intel/fs: shuffle_64bit_data_for_32bit_write is not used anymore Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-06-16 22:39:08 +02:00
Jose Maria Casanova Crespo	a4d445b93c	intel/fs: shuffle_32bit_load_result_to_64bit_data is not used anymore Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-06-16 22:39:08 +02:00
Jose Maria Casanova Crespo	c2297bdf19	intel/fs: Remove old 16-bit shuffle/unshuffle functions Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-06-16 22:39:08 +02:00
Jose Maria Casanova Crespo	22c654941b	intel/fs: New shuffle_for_32bit_write and shuffle_from_32bit_read These new shuffle functions deal with the shuffle/unshuffle operations needed for read/write operations using 32-bit components when the read/written components have a different bit-size (8, 16, 64-bits). Shuffle from 32-bit to 32-bit becomes a simple MOV. shuffle_src_to_dst takes care of doing a shuffle when source type is smaller than destination type and an unshuffle when source type is bigger than destination. So these new read/write functions just need to call shuffle_src_to_dst assuming that writes use a 32-bit destination and reads use a 32-bit source. As shuffle_for_32bit_write/from_32bit_read take components in units of the source/destination types and shuffle_src_to_dst takes units of the smallest type component, we adjust the components and first_component parameters. For these new functions to work, there must be no source/destination overlap in the case of shuffle_from_32bit_read. That never happens in shuffle_for_32bit_write because it allocates a new destination register, as shuffle_64bit_data_for_32bit_write did. v2: Reword commit log and add comments to explain why first_component and components parameters are adjusted. (Jason Ekstrand) Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-06-16 22:39:08 +02:00
Francisco Jerez	39de901a96	intel/fs: Use the ATTR file for FS inputs This replaces the special magic opcodes which implicitly read inputs with explicit use of the ATTR file. v2 (Jason Ekstrand): - Break into multiple patches - Change the units of the FS ATTR to be in logical scalars Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-05-29 15:44:50 -07:00
Ian Romanick	52c7df1643	i965/fs: Merge CMP and SEL into CSEL on Gen8+ v2: Fix several problems handling inverted predicates. Add a much bigger comment around the BRW_CONDITIONAL_NZ case. v3: Allow uniforms and shader inputs as sources for the original SEL and CMP instructions. This enables a LOT more shaders to receive CSEL merging (5816 vs 8564 on SKL). v4: Report progress. Broadwell and Skylake had similar results. (Broadwell shown) helped: 8527 HURT: 0 helped stats (abs) min: 1 max: 27 x̄: 2.44 x̃: 1 helped stats (rel) min: 0.03% max: 17.80% x̄: 1.12% x̃: 0.70% 95% mean confidence interval for instructions value: -2.51 -2.36 95% mean confidence interval for instructions %-change: -1.15% -1.10% Instructions are helped. total cycles in shared programs: 559442317 -> 558288357 (-0.21%) cycles in affected programs: 372699860 -> 371545900 (-0.31%) helped: 6748 HURT: 1450 helped stats (abs) min: 1 max: 32000 x̄: 182.41 x̃: 12 helped stats (rel) min: <.01% max: 66.08% x̄: 3.42% x̃: 0.70% HURT stats (abs) min: 1 max: 2538 x̄: 53.08 x̃: 14 HURT stats (rel) min: <.01% max: 96.72% x̄: 3.32% x̃: 0.90% 95% mean confidence interval for cycles value: -179.01 -102.51 95% mean confidence interval for cycles %-change: -2.37% -2.08% Cycles are helped. LOST: 0 GAINED: 6 No changes on earlier platforms. Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> [v1] Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> [v3] Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-03-08 15:26:26 -08:00
Jason Ekstrand	90c9f29518	i965/fs: Add support for nir_intrinsic_shuffle Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2018-03-07 12:13:47 -08:00
Kenneth Graunke	9fa95359df	intel: Drop program size pointer from vec4/fs assembly getters. These days, we're just passing a pointer to a prog_data field, which we already have access to. We can just use it directly. (In the past, it was a pointer to a separate value.) Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2018-03-02 14:20:22 -08:00
Jose Maria Casanova Crespo	2dd94f462b	i965/fs: shuffle_32bit_load_result_to_16bit_data now skips components This helper, used to load 16-bit components from 32-bit reads, now allows skipping components with the new parameter first_component. The semantics now skip components until we reach the first_component, and then reads the number of components passed to the function. All previous uses of the helper are updated to use 0 as first_component. This allows reading 16-bit components when the first one is not 32-bit aligned, enabling more uses of untyped_reads with 16-bit types. v2: (Jason Ekstrand) Change parameters order to first_component, num_components Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-02-28 21:37:40 -08:00
Matt Turner	bed0267ff6	intel/compiler/fs: Pass fs_inst to generate_ddx/ddy instead of opcode In a future patch, generate_ddy will want to inspect inst->exec_size. Change generate_ddx as well for consistency. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-02-28 11:15:47 -08:00
Matt Turner	b5d8781e19	intel/compiler/fs: Return multiple_instructions_emitted from generate_linterp If multiple instructions are emitted, special handling of things like conditional mod and NoDDClr/NoDDChk need to be performed. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-02-28 11:15:47 -08:00
Francisco Jerez	acab52f520	intel/fs/bank_conflicts: Don't touch Gen7 MRF hack registers. Fixes: `af2c320190` "intel/fs: Implement GRF bank conflict mitigation pass." Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104199 Reported-by: Darius Spitznagel <d.spitznagel@goodbytez.de> Reviewed-by: Matt Turner <mattst88@gmail.com>	2017-12-12 12:05:45 -08:00
Francisco Jerez	acf98ff933	intel/fs: Teach instruction scheduler about GRF bank conflict cycles. This should allow the post-RA scheduler to do a slightly better job at hiding latency in presence of instructions incurring bank conflicts. The main purpose of this patch is not to improve performance though, but to get conflict cycles to show up in shader-db statistics in order to make sure that regressions in the bank conflict mitigation pass don't go unnoticed. Acked-by: Matt Turner <mattst88@gmail.com>	2017-12-07 15:56:49 -08:00
Francisco Jerez	af2c320190	intel/fs: Implement GRF bank conflict mitigation pass. Unnecessary GRF bank conflicts increase the issue time of ternary instructions (the overwhelmingly most common of which is MAD) by roughly 50%, leading to reduced ALU throughput. This pass attempts to minimize the number of bank conflicts by rearranging the layout of the GRF space post-register allocation. It's in general not possible to eliminate all of them without introducing extra copies, which are typically more expensive than the bank conflict itself. In a shader-db run on SKL this helps roughly 46k shaders: total conflicts in shared programs: 1008981 -> 600461 (-40.49%) conflicts in affected programs: 816222 -> 407702 (-50.05%) helped: 46234 HURT: 72 The running time of shader-db itself on SKL seems to be increased by roughly 2.52%±1.13% with n=20 due to the additional work done by the compiler back-end. On earlier generations the pass is somewhat less effective in relative terms because the hardware incurs a bank conflict anytime the last two sources of the instruction are duplicate (e.g. while trying to square a value using MAD), which is impossible to avoid without introducing copies. E.g. for a shader-db run on SNB: total conflicts in shared programs: 944636 -> 623185 (-34.03%) conflicts in affected programs: 853258 -> 531807 (-37.67%) helped: 31052 HURT: 19 And on BDW: total conflicts in shared programs: 1418393 -> 987539 (-30.38%) conflicts in affected programs: 1179787 -> 748933 (-36.52%) helped: 47592 HURT: 70 On SKL GT4e this improves performance of GpuTest Volplosion by 3.64% ±0.33% with n=16. NOTE: This patch intentionally disregards some i965 coding conventions for the sake of reviewability. This is addressed by the next squash patch which introduces an amount of (for the most part boring) boilerplate that might distract reviewers from the non-trivial algorithmic details of the pass. The following patch is squashed in: SQUASH: intel/fs/bank_conflicts: Roll back to the nineties. Acked-by: Matt Turner <mattst88@gmail.com>	2017-12-07 15:56:06 -08:00
Jose Maria Casanova Crespo	3db31c0b06	i965/fs: Helpers for un/shuffle 16-bit pairs in 32-bit components These helpers are used to load/store 16-bit types from/to 32-bit components. The functions shuffle_32bit_load_result_to_16bit_data and shuffle_16bit_data_for_32bit_write are implemented similarly to the analogous functions for handling 64-bit types. v1: Explain need of temporary in shuffle operations. (Jason Ekstrand) Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	d038deaa40	i965/fs: Add remove_extra_rounding_modes optimization Although from the SPIR-V point of view rounding modes are attached to the operation/destination, on i965 it is a status, so we don't need to explicitly set the rounding mode if the one we want is already set. Taking into account that the default mode is RTE, one possible optimization would be to optimize out the first RTE set for each block. For that to work, we would need to take into account block interrelationships. At this point, it is not worth complicating the optimization for such a small gain. v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate with the rounding mode (Curro) v3: Reset optimization for every block. (Jason Ekstrand) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jason Ekstrand	295605c930	intel/cs: Push subgroup ID instead of base thread ID We're going to want subgroup ID for SPIR-V subgroups eventually anyway. We really only want to push one and calculate the other from it. It makes a bit more sense to push the subgroup ID because it's simpler to calculate and because it's a real API thing. The only advantage to pushing the base thread ID is to avoid a single SHL in the shader. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	80ddfab2f5	intel/cs: Rework the way thread local ID is handled Previously, brw_nir_lower_intrinsics added the param and then emitted a load_uniform intrinsic to load it directly. This commit switches things over to use a specific NIR intrinsic for the thread id. The one thing I don't like about this approach is that we have to copy thread_local_id over to the new visitor in import_uniforms. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	1077981eb5	intel/fs: Remove min_dispatch_width from fs_visitor It's 8 for everything except compute shaders. For compute shaders, there's no need to duplicate the computation and it's just a possible source of error. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	030d2b5016	i965/fs: Return a fs_reg from shuffle_64bit_data_for_32bit_write All callers of this function allocate a fs_reg expressly to pass into it. It's much easier if we just let the helper allocate the register. While we're here, we switch it to doing the MOVs with an integer type so that we don't accidentally canonicalize floats on half of a double. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	2975e4c56a	intel: Rewrite the world of push/pull params This moves us away from the array of pointers model and onto a model where each param is represented by a generic uint32_t handle. We reserve 2^16 of these handles for builtins that get generated by somewhere inside the compiler and have well-defined meanings. Generic params have handles whose meanings are defined by the driver. The primary downside to this new approach is that it moves a little bit of the work that we would normally do at compile time to draw time. On my laptop this hurts OglBatch6 by no more than 1% and doesn't seem to have any measurable effect on OglBatch7. So, while this may come back to bite us, it doesn't look too bad. Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-10-12 22:39:29 -07:00
Kenneth Graunke	b2da123801	i965: Use pushed UBO data in the scalar backend. This actually takes advantage of the newly pushed UBO data, avoiding pull loads. Improves performance in GLBenchmark Manhattan 3.1 by: HSW: ~1%, BDW/SKL/KBL GT2: 3-4%, SKL GT4: 7-8%, APL: 4-5%. (thanks to Eero Tamminen for these numbers) shader-db results on Skylake, ignoring programs with spill/fill changes: total instructions in shared programs: 13963994 -> 13651893 (-2.24%) instructions in affected programs: 4250328 -> 3938227 (-7.34%) helped: 28527 HURT: 0 total cycles in shared programs: 179808608 -> 172535170 (-4.05%) cycles in affected programs: 79720410 -> 72446972 (-9.12%) helped: 26951 HURT: 1248 LOST: 46 GAINED: 21 Many "Deus Ex: Mankind Divided" shaders which already spilled end up spilling a lot more (about 240 programs hurt, 9 helped). The cycle estimator suggests this is still overall a win (-0.23% in cycle counts) presumably because we trade pull loads for fills. v2: Drop "PULL" environment variable left in for initial debugging (caught by Matt). Reviewed-by: Matt Turner <mattst88@gmail.com>	2017-07-13 20:18:54 -07:00
Kenneth Graunke	c9ef27e77b	i965: Factor out push locations. With UBOs, the answer of "have we decided to push this uniform" gets a bit more complicated - for one, we have multiple surfaces. This patch refactors things so we can add the new code in a single place. Reviewed-by: Matt Turner <mattst88@gmail.com>	2017-07-13 20:18:54 -07:00
Jason Ekstrand	ca4d192802	i965/fs: Lower gl_VertexID and friends to inputs at the NIR level NIR calls these system values but they come in from the VF unit as vertex data. It's terribly convenient to just be able to treat them as such in the back-end. Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-05-09 15:07:47 -07:00
Samuel Iglesias Gonsálvez	af6fc3a8ea	i965/fs: rename lower_d2x to lower_conversions v2: - Change the name to lower_conversions. Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net>	2017-04-14 14:56:07 -07:00
Emil Velikov	2438c0a236	intel/compiler: consistently use ifndef guards over pragma once Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com> Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Acked-by: Vedran Miletić <vedran@miletic.net> Acked-by: Juha-Pekka Heikkila <juhapekka.heikkila@gmail.com> Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>	2017-03-22 16:55:22 +00:00
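
The dither-mask formula quoted in commit c8abe03f3b can be checked outside a shader. The following is a minimal Python sketch of that formula only, not the driver's actual implementation; the function names `dither_mask`, `apply_alpha_to_coverage`, and `popcount` are hypothetical and chosen for illustration. It shows that, over the full 16 bits, the number of set bits equals the quantized alpha value `m`.

```python
def dither_mask(alpha: float) -> int:
    """Return the 16-bit dither mask from the commit message formula."""
    # Quantize alpha to an integer m in 0..16 after clamping to [0, 1].
    m = int(16.0 * min(max(alpha, 0.0), 1.0))
    # Coarse pattern picked by the top bits of m, refined by its low two bits.
    return (0x1111 * ((0xFEA80 >> (m & ~3)) & 0xF)
            | 0x0808 * (m & 2)
            | 0x0100 * (m & 1))


def apply_alpha_to_coverage(sample_mask: int, alpha: float) -> int:
    """AND the shader-written sample mask with the dither mask."""
    return sample_mask & dither_mask(alpha)


def popcount(value: int) -> int:
    """Count the set bits of a non-negative integer."""
    return bin(value).count("1")


if __name__ == "__main__":
    # Print the mask and its bit count for every quantized alpha step.
    for step in range(17):
        mask = dither_mask(step / 16.0)
        print(f"m={step:2d} mask=0x{mask:04x} ones={popcount(mask)}")
```

With 4x multisampling only the low four bits of the mask matter; for example, an alpha of 0.5 leaves two of the four samples covered.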


151 Commits
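
The identity behind commit 7725d60938, that b2f(inot(a)) equals 1.0 + float(a) when Booleans are stored as -1 (true) or 0 (false), can be verified with a small Python sketch. The helper names below mirror the NIR opcode names from the commit message but are hypothetical stand-ins written for illustration, not Mesa functions.

```python
TRUE, FALSE = -1, 0  # Boolean encoding described in the commit message


def inot(a: int) -> int:
    """Bitwise NOT, which maps -1 to 0 and 0 to -1."""
    return ~a


def b2f(a: int) -> float:
    """Boolean to float: any non-zero value becomes 1.0."""
    return 1.0 if a != 0 else 0.0


def b2i(a: int) -> int:
    """Boolean to integer: any non-zero value becomes 1."""
    return 1 if a != 0 else 0


def b2f_inot_fused(a: int) -> float:
    """Single-ADD form: convert the Boolean bits to float and add 1.0."""
    return 1.0 + float(a)


def b2i_inot_fused(a: int) -> int:
    """Integer counterpart of the fused form."""
    return 1 + a


if __name__ == "__main__":
    for value in (TRUE, FALSE):
        print(value, b2f(inot(value)), b2f_inot_fused(value))
```

The fused form only holds for the two canonical Boolean encodings; other integer inputs would give different results.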