path: root/llvm/test/Transforms/SLPVectorizer/X86
Commit message | Author | Age | Files | Lines changed
...
* [X86][SLP] Show example of failure to uniformly commute splats for 'alt' shuffles (Simon Pilgrim, 2019-03-23, 1 file, -0/+38)
  If either of the main/alt opcodes isn't commutable we may end up with the splats not correctly commuted to the same side.
  llvm-svn: 356837
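  As an illustration, a minimal IR sketch of the 'alt' shuffle shape involved (function and value names are hypothetical, not from the commit): fadd is commutable but fsub is not, so a splatted operand may end up on different sides of the two opcodes.

      define <4 x float> @addsub_splat(<4 x float> %a, <4 x float> %b) {
        ; splat the first lane of %b
        %s   = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> zeroinitializer
        %add = fadd <4 x float> %a, %s
        %sub = fsub <4 x float> %a, %s
        ; 'alt' shuffle: even lanes from the fadd, odd lanes from the fsub
        %r   = shufflevector <4 x float> %add, <4 x float> %sub, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
        ret <4 x float> %r
      }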
* [SLPVectorizer] Add test related to SLP Throttling support, NFCI. (Dinar Temirbulatov, 2019-03-22, 1 file, -0/+37)
  llvm-svn: 356754
* [SLP] Fix invalid triple in X86 tests (Florian Hahn, 2019-03-05, 2 files, -30/+37)
  x86-64 is an invalid architecture in triples. Changing it to the correct triple (x86_64) changes some tests, because SLP is not deemed profitable any more.
  Reviewers: ABataev, RKSimon, spatel
  Reviewed By: RKSimon
  Differential Revision: https://reviews.llvm.org/D58931
  llvm-svn: 355420
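  A sketch of the shape of the fix (illustrative, not the exact RUN lines from the commit): with the malformed arch the triple parses to an unknown target, so the X86 cost model never kicks in and SLP decisions silently change.

      ; before: 'x86-64' is not a recognised architecture
      ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86-64-unknown-linux | FileCheck %s
      ; after:
      ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux | FileCheck %s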
* [Vectorizer] Add vectorization support for fixed smul/umul intrinsics (Simon Pilgrim, 2019-02-25, 1 file, -972/+730)
  This requires a couple of tweaks to existing vectorization functions, as they were assuming that only the second call argument (ctlz/cttz/powi) could ever be the 'always scalar' argument, but for smul.fix + umul.fix it's the third argument.
  Differential Revision: https://reviews.llvm.org/D58616
  llvm-svn: 354790
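  For reference, the fixed-point multiply signatures per the LangRef: the scale is the third operand and stays a scalar i32 even in the vector overloads, which is exactly the 'always scalar' argument the vectorizer has to skip.

      declare i32 @llvm.smul.fix.i32(i32 %a, i32 %b, i32 %scale)
      declare <4 x i32> @llvm.smul.fix.v4i32(<4 x i32> %a, <4 x i32> %b, i32 %scale)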
* [SLPVectorizer][X86] Add fixed smul/umul tests (Simon Pilgrim, 2019-02-25, 1 file, -0/+2007)
  Baseline tests - fixed mul intrinsics aren't flagged as vectorizable yet.
  llvm-svn: 354783
* [SLPVectorizer][X86] Add add/sub/mul overflow tests (Simon Pilgrim, 2019-02-20, 6 files, -0/+7524)
  Baseline tests - overflow intrinsics aren't flagged as vectorizable yet.
  llvm-svn: 354454
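  These are the arith-overflow intrinsics in question; each returns the result together with an overflow flag as a struct, which is part of what makes them awkward to vectorize (scalar forms shown as a sketch):

      declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)
      declare { i32, i1 } @llvm.usub.with.overflow.i32(i32, i32)
      declare { i32, i1 } @llvm.umul.with.overflow.i32(i32, i32)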
* Temporarily Revert "[X86][SLP] Enable SLP vectorization for 128-bit horizontal X86 instructions (add, sub)" (Eric Christopher, 2019-02-20, 27 files, -806/+758)
  As this has broken the LTO bootstrap build for 3 days and is showing a significant regression on the Dither_benchmark results (from the LLVM benchmark suite) -- specifically on BENCHMARK_FLOYD_DITHER_128, BENCHMARK_FLOYD_DITHER_256, and BENCHMARK_FLOYD_DITHER_512; the others are unchanged. These have regressed by about 28% on Skylake, 34% on Haswell, and over 40% on Sandybridge.
  This reverts commit r353923.
  llvm-svn: 354434
* [X86][SLP] Enable SLP vectorization for 128-bit horizontal X86 instructions (add, sub) (Anton Afanasyev, 2019-02-13, 27 files, -758/+806)
  Try to use 64-bit SLP vectorization. In addition to horizontal instrs this change triggers optimizations for partial vector operations (for instance, using the low halves of 128-bit registers xmm0 and xmm1 to multiply <2 x float> by <2 x float>).
  Fixes llvm.org/PR32433
  llvm-svn: 353923
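  A hypothetical sketch of the kind of input that benefits (names invented): with the minimum SLP width dropped to 64 bits, this pair of scalar fmuls can become a single fmul <2 x float> living in the low half of an xmm register.

      define void @mul2xf32(float* %a, float* %b, float* %c) {
        %a1 = getelementptr inbounds float, float* %a, i64 1
        %b1 = getelementptr inbounds float, float* %b, i64 1
        %c1 = getelementptr inbounds float, float* %c, i64 1
        %x0 = load float, float* %a
        %x1 = load float, float* %a1
        %y0 = load float, float* %b
        %y1 = load float, float* %b1
        ; SLP can now merge these two scalar multiplies into one <2 x float> op
        %m0 = fmul float %x0, %y0
        %m1 = fmul float %x1, %y1
        store float %m0, float* %c
        store float %m1, float* %c1
        ret void
      }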
* [TTI] Add generic SADDSAT/SSUBSAT costs (Simon Pilgrim, 2019-01-27, 2 files, -200/+344)
  Add generic cost calculations for the SADDSAT/SSUBSAT intrinsics: this uses the generic costs for sadd_with_overflow/ssub_with_overflow, plus an extra sign comparison and selects based on the sign/overflow.
  This completes PR40316
  Differential Revision: https://reviews.llvm.org/D57239
  llvm-svn: 352315
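  A rough IR sketch of the expansion whose cost is being modeled (a hypothetical scalar example, not code from the patch):

      declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)

      define i32 @saddsat_sketch(i32 %a, i32 %b) {
        %ov  = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %a, i32 %b)
        %sum = extractvalue { i32, i1 } %ov, 0
        %ovf = extractvalue { i32, i1 } %ov, 1
        ; the extra sign comparison picks the saturation constant:
        ; negative overflow clamps to INT_MIN, positive to INT_MAX
        %neg = icmp slt i32 %a, 0
        %sat = select i1 %neg, i32 -2147483648, i32 2147483647
        ; the overflow flag selects between the raw sum and the saturated value
        %res = select i1 %ovf, i32 %sat, i32 %sum
        ret i32 %res
      }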
* [TTI] Add generic UADDSAT/USUBSAT costs (Simon Pilgrim, 2019-01-24, 2 files, -276/+168)
  Add generic cost calculations for the UADDSAT/USUBSAT intrinsics; this falls back to using the generic costs for uadd_with_overflow/usub_with_overflow plus a select.
  Differential Revision: https://reviews.llvm.org/D56907
  llvm-svn: 352044
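  The unsigned case is simpler, since there is only one saturation value; a hypothetical scalar sketch:

      declare { i32, i1 } @llvm.uadd.with.overflow.i32(i32, i32)

      define i32 @uaddsat_sketch(i32 %a, i32 %b) {
        %ov  = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)
        %sum = extractvalue { i32, i1 } %ov, 0
        %c   = extractvalue { i32, i1 } %ov, 1
        ; on unsigned overflow the result clamps to UINT_MAX (i.e. -1)
        %res = select i1 %c, i32 -1, i32 %sum
        ret i32 %res
      }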
* [CostModel][X86] Add explicit vector select costs (Simon Pilgrim, 2019-01-20, 2 files, -94/+120)
  Prior to SSE41 (and sometimes on AVX1), vector select has to be performed as a ((X & C)|(Y & ~C)) bit select.
  Exposes a couple of issues with the min/max reduction costs (which only go down to SSE42 for some reason). The increased pre-SSE41 select costs also prevent a couple of tests from firing any longer, so I've either tweaked the target or added AVX tests as well to the existing SSE2 tests.
  llvm-svn: 351685
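  The ((X & C)|(Y & ~C)) expansion in IR form (a sketch with invented names; %c is the usual all-ones/all-zeroes per-lane condition mask):

      define <4 x i32> @bitselect(<4 x i32> %c, <4 x i32> %x, <4 x i32> %y) {
        %t0 = and <4 x i32> %c, %x
        %nc = xor <4 x i32> %c, <i32 -1, i32 -1, i32 -1, i32 -1>
        %t1 = and <4 x i32> %nc, %y
        %r  = or <4 x i32> %t0, %t1
        ret <4 x i32> %r
      }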
* [CostModel][X86] Add explicit fcmp costs for pre-SSE42 targets (Simon Pilgrim, 2019-01-20, 1 file, -297/+73)
  Typical throughputs: cmpss/cmpps = 1cy and cmpsd/cmppd = 2cy before the Core2 era.
  llvm-svn: 351684
* [SLP] Fix PR40310: The reduction nodes should stay scalar. (Alexey Bataev, 2019-01-16, 2 files, -98/+186)
  Summary: Sometimes the SLP vectorizer tries to vectorize the horizontal reduction nodes during regular vectorization. This may happen inside of loops, when there are some vectorizable PHIs. The patch fixes this by checking whether the node is the reduction node, in which case it must not be vectorized but gathered instead.
  Reviewers: RKSimon, spatel, hfinkel, fedor.sergeev
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D56783
  llvm-svn: 351349
* [SLP] Added test for PR40310, NFC. (Alexey Bataev, 2019-01-15, 1 file, -0/+97)
  llvm-svn: 351240
* Reapply "[CodeGen][X86] Expand USUBSAT to UMAX+SUB, also for vectors" (Nikita Popov, 2019-01-15, 1 file, -100/+260)
  Related to https://bugs.llvm.org/show_bug.cgi?id=40123. Rather than scalarizing, expand a vector USUBSAT into UMAX+SUB, which produces much better code for X86. Reapplied with updated SLPVectorizer tests.
  Differential Revision: https://reviews.llvm.org/D56636
  llvm-svn: 351219
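  The identity being used is usubsat(a, b) == umax(a, b) - b: when a > b this yields a - b, otherwise 0. A sketch in IR, with umax spelled as compare+select:

      define <8 x i16> @usubsat_expand(<8 x i16> %a, <8 x i16> %b) {
        ; umax(a, b)
        %c = icmp ugt <8 x i16> %a, %b
        %m = select <8 x i1> %c, <8 x i16> %a, <8 x i16> %b
        ; umax(a, b) - b == a - b when a > b, and 0 otherwise
        %r = sub <8 x i16> %m, %b
        ret <8 x i16> %r
      }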
* [SLP][X86] Split prefer-256-bit 'AVX256BW' tests from AVX2 checks (Simon Pilgrim, 2019-01-15, 4 files, -28/+28)
  Fixes SLP test issue with D56636.
  llvm-svn: 351199
* [SLP] Moved NVPTX test under NVPTX directory, NFC. (Alexey Bataev, 2019-01-11, 1 file, -57/+0)
  llvm-svn: 350969
* [SLP] Update test checks for the SLP vectorizer, NFC. (Alexey Bataev, 2019-01-11, 66 files, -558/+3009)
  llvm-svn: 350967
* [SLPVectorizer] Flag ADD/SUB SSAT/USAT intrinsics trivially vectorizable (PR40123) (Simon Pilgrim, 2019-01-03, 4 files, -1544/+300)
  Enables SLP vectorization for the SSE2 PADDS/PADDUS/PSUBS/PSUBUS style intrinsics.
  llvm-svn: 350300
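  Once SLP builds the vector form, the saturating intrinsic maps straight onto the SSE2 instruction; a sketch (function name invented):

      declare <8 x i16> @llvm.sadd.sat.v8i16(<8 x i16>, <8 x i16>)

      define <8 x i16> @paddsw_style(<8 x i16> %a, <8 x i16> %b) {
        ; lowers to the SSE2 PADDSW instruction
        %r = call <8 x i16> @llvm.sadd.sat.v8i16(<8 x i16> %a, <8 x i16> %b)
        ret <8 x i16> %r
      }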
* [SLPVectorizer][X86] Add ADD/SUB SSAT/USAT tests (PR40123) (Simon Pilgrim, 2019-01-03, 4 files, -0/+4056)
  llvm-svn: 350297
* [CostModel][X86][AArch64] Adjust cost of the scalarization part of min/max reduction. (Craig Topper, 2018-12-10, 1 file, -36/+54)
  Summary: The comment says we need 3 extracts and a select at the end. But didn't we just account for the select in the vector cost above? Aren't we just extracting the single element after taking the min/max in the vector register?
  Reviewers: RKSimon, spatel, ABataev
  Reviewed By: RKSimon
  Subscribers: javed.absar, kristof.beyls, llvm-commits
  Differential Revision: https://reviews.llvm.org/D55480
  llvm-svn: 348739
* [CostModel][X86] Fix overcounting arithmetic cost for illegal types in getArithmeticReductionCost/getMinMaxReductionCost (Craig Topper, 2018-12-07, 2 files, -344/+83)
  We were overcounting the number of arithmetic operations needed at each level before we reach a legal type. We were using the full vector type for that level, but we are going to split the input vector at that level in half, so the effective arithmetic operation cost at that level is half the width.
  For example, take v8i32 on an SSE target. We were calculating the cost of a v8i32 op, which is likely 2 for basic integer. Then after the loop we count 2 more v4i32 ops, for a total arith cost of 4. But if you look at the assembly there would only be 3 arithmetic ops.
  There are still more bugs in this code that I'm going to work on next. The non-pairwise code shouldn't count extract subvectors in the loop - there are no extracts, the types are split in registers. For pairwise we need to use 2 two-src permute shuffles.
  Differential Revision: https://reviews.llvm.org/D55397
  llvm-svn: 348621
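  A sketch of the v8i32 add-reduction ladder being costed (hypothetical IR; the splits become register halves, not real extracts). Note the 3 adds, and that only a single final extractelement is needed, which is also the point of the D54585 change below:

      define i32 @reduce_add_v8i32(<8 x i32> %v) {
        ; split the illegal v8i32 in half: the op at this level is v4i32, not v8i32
        %lo = shufflevector <8 x i32> %v, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
        %hi = shufflevector <8 x i32> %v, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
        %s0 = add <4 x i32> %lo, %hi   ; op 1
        %t1 = shufflevector <4 x i32> %s0, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
        %s1 = add <4 x i32> %s0, %t1   ; op 2
        %t2 = shufflevector <4 x i32> %s1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
        %s2 = add <4 x i32> %s1, %t2   ; op 3
        ; the only scalarization cost: extracting the 0'th element at the end
        %r  = extractelement <4 x i32> %s2, i32 0
        ret i32 %r
      }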
* Add common check prefix. NFCI. (Simon Pilgrim, 2018-12-04, 1 file, -177/+89)
  llvm-svn: 348265
* [TTI] Reduction costs only need to include a single extract element cost (REAPPLIED) (Simon Pilgrim, 2018-12-01, 4 files, -301/+188)
  We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type. For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.
  Fixes PR37731
  Rebased and reapplied after being reverted in rL347541 due to PR39774 - which was fixed by D54955/rL347759 and D55017/rL347997
  Differential Revision: https://reviews.llvm.org/D54585
  llvm-svn: 348076
* [X86] Split skylake-avx512 run lines in SLP vectorizer tests to cover -mprefer-vector-width=256 and -mprefer-vector-width=512 (Craig Topper, 2018-11-30, 14 files, -368/+493)
  This will make these tests immune if we ever change the default behavior of -march=skylake-avx512 to prefer 256-bit vectors.
  llvm-svn: 348046
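  For illustration only (the tests may pin the preference differently, e.g. via -mattr): one way an IR test can fix the preferred vector width is the "prefer-vector-width" function attribute.

      ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mcpu=skylake-avx512 | FileCheck %s
      define void @f() #0 {
        ret void
      }
      ; restricts codegen to 256-bit vectors even though AVX-512 is available
      attributes #0 = { "prefer-vector-width"="256" }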
* [SLP] PR39774: Update references of the replaced external instructions. (Alexey Bataev, 2018-11-30, 4 files, -6/+95)
  Summary: An additional fix for PR39774. Need to update the references for the ReductionRoot instruction when it is replaced during the vectorization phase, to avoid a compiler crash on reduction vectorization.
  Reviewers: RKSimon, spatel
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D55017
  llvm-svn: 347997
* [SLP] Fix PR39774: Set ReductionRoot if the original instruction is vectorized. (Alexey Bataev, 2018-11-28, 1 file, -0/+108)
  Summary: If the original reduction root instruction was vectorized, it might be removed from the tree. This means the insertion point may become invalidated, and the whole vectorization of the reduction then produces an incorrect result. The ReductionRoot instruction must be marked as externally used so it cannot be removed. Otherwise it might cause an inconsistency with the cost model, and we may end up with a too-optimistic optimization.
  Reviewers: RKSimon, spatel, hfinkel, mkuper
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D54955
  llvm-svn: 347759
* Revert "[TTI] Reduction costs only need to include a single extract element cost" (Fedor Sergeev, 2018-11-26, 3 files, -133/+282)
  This reverts commit r346970. It was causing PR39774, a crash in slp-vectorizer on a rather simple loop with just a bunch of 'and's in the body.
  llvm-svn: 347541
* [TTI] Reduction costs only need to include a single extract element cost (Simon Pilgrim, 2018-11-15, 3 files, -282/+133)
  We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type. For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.
  Fixes PR37731
  Differential Revision: https://reviews.llvm.org/D54585
  llvm-svn: 346970
* [SLPVectorizer][X86] Regenerate reduction minmax tests and cleanup check prefixes (Simon Pilgrim, 2018-11-15, 1 file, -659/+272)
  llvm-svn: 346965
* [SLPVectorizer][X86] Regenerate reduction tests and add PR37731 test (Simon Pilgrim, 2018-11-15, 1 file, -210/+196)
  Cleanup check prefixes.
  llvm-svn: 346964
* [CostModel][X86] SK_ExtractSubvector is free if the subvector is at the start of the source vector (Simon Pilgrim, 2018-11-09, 1 file, -1/+1)
  llvm-svn: 346538
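  A sketch of the shuffle pattern this models (names invented): extracting the low subvector needs no instruction on x86, since lanes 0-3 are just the xmm part of the ymm register.

      define <4 x i32> @extract_lo(<8 x i32> %v) {
        %lo = shufflevector <8 x i32> %v, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
        ret <4 x i32> %lo
      }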
* [CostModel][X86] Add realistic vXi64 uitofp vXf64 costs (Simon Pilgrim, 2018-10-25, 1 file, -138/+39)
  Match codegen improvements from D53649/rL345256.
  llvm-svn: 345263
* [CostModel][X86] Add realistic i64 uitofp f64 scalar costs (Simon Pilgrim, 2018-10-25, 1 file, -43/+73)
  llvm-svn: 345261
* [SLPVectorizer] Add basic support for mul/and/or/xor horizontal reductions (Simon Pilgrim, 2018-10-23, 1 file, -51/+54)
  Expand arithmetic reduction to include mul/and/or/xor instructions.
  This patch just fixes the SLPVectorizer - the effective reduction costs for AVX1+ are still poor (see rL344846) and will need to be improved before SLP sees this as a valid transform - but we can already see the effect on SSE2 tests.
  This partially helps PR37731, but doesn't fix it all as it still falls over on the extraction/reduction order for some reason.
  Differential Revision: https://reviews.llvm.org/D53473
  llvm-svn: 345037
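  A hypothetical sketch of the scalar chain SLP can now recognise as a horizontal reduction (here xor; mul/and/or have the same shape):

      define i32 @reduce_xor4(i32* %p) {
        %g1 = getelementptr inbounds i32, i32* %p, i64 1
        %g2 = getelementptr inbounds i32, i32* %p, i64 2
        %g3 = getelementptr inbounds i32, i32* %p, i64 3
        %l0 = load i32, i32* %p
        %l1 = load i32, i32* %g1
        %l2 = load i32, i32* %g2
        %l3 = load i32, i32* %g3
        ; a linear xor chain over 4 adjacent loads: a v4i32 xor reduction
        %x0 = xor i32 %l0, %l1
        %x1 = xor i32 %x0, %l2
        %x2 = xor i32 %x1, %l3
        ret i32 %x2
      }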
* [SLPVectorizer][X86] Add mul/and/or/xor unrolled reduction tests (Simon Pilgrim, 2018-10-20, 1 file, -7/+351)
  We miss arithmetic reduction for everything but Add/FAdd (presumably because those are the only cases for which x86 has horizontal ops).
  llvm-svn: 344849
* [SLPVectorizer] Check that lowered type is floating point before calling isFabsFree (Sam Clegg, 2018-10-09, 1 file, -0/+29)
  In the case of soft-fp (e.g. fp128 under wasm) the result of getTypeLegalizationCost() can be an integer type even if the input is floating point (see LegalizeTypeAction::TypeSoftenFloat). Before calling isFabsFree() (which asserts if given a non-fp type) we need to check that the result is fp. This is safe since fabs is certainly not free in the soft-fp case.
  Fixes PR39168
  Differential Revision: https://reviews.llvm.org/D52899
  llvm-svn: 344069
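  Roughly the shape of input that can trip this (a hypothetical sketch, not the PR39168 reproducer itself):

      declare fp128 @llvm.fabs.f128(fp128)

      define fp128 @fabs_f128(fp128 %x) {
        ; on targets without native fp128, type legalization softens fp128
        ; to an integer type, which used to reach the isFabsFree assertion
        %r = call fp128 @llvm.fabs.f128(fp128 %x)
        ret fp128 %r
      }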
* [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform (Roman Tereshin, 2018-07-25, 1 file, -0/+162)
  as well as sext(C + x + ...) -> (D + sext(C-D + x + ...))<nuw><nsw>, similar to the equivalent transformation for zext's, if the top-level addition in (D + (C-D + x * n)) could be proven to not wrap, where the choice of D also maximizes the number of trailing zeroes of (C-D + x * n), ensuring homogeneous behaviour of the transformation and better canonicalization of such AddRecs (indeed, there are 2^(2w) different expressions in `B1 + ext(B2 + Y)` form for the same Y, but only 2^(2w - k) different expressions in the resulting `B3 + ext((B4 * 2^k) + Y)` form, where w is the bit width of the integral type).
  This patch generalizes the sext(C1 + C2*X) --> sext(C1) + sext(C2*X) and sext{C1,+,C2} --> sext(C1) + sext{0,+,C2} transformations added in r209568, relaxing the requirements the following way:
  1. C2 doesn't have to be a power of 2; it's enough if it's divisible by 2 a sufficient number of times.
  2. C1 doesn't have to be less than C2; instead of extracting the entire C1 we can split it into 2 terms (00...0XXX + YY...Y000), keep the second one, which may cause wrapping within the extension operator, and move the first one, which doesn't affect wrapping, out of the extension operator, enabling further simplifications.
  3. C1 and C2 don't have to be positive; splitting C1 as shown above produces a sum that is guaranteed to not wrap, signed or unsigned.
  4. In the AddExpr case there could be more than 2 terms; the 2nd and following terms of an AddExpr, and the Step component of an AddRecExpr, don't have to be in C2*X form or constant (respectively) - they just need to have enough trailing zeros, which could be guaranteed by means other than arithmetic, e.g. by a pointer alignment.
  5. The extension operator doesn't have to be a sext; the same transformation works and is profitable for zext's as well.
  Apparently, optimizations like SLPVectorizer currently fail to vectorize even rather trivial cases like the following:

      double bar(double *a, unsigned n) {
        double x = 0.0;
        double y = 0.0;
        for (unsigned i = 0; i < n; i += 2) {
          x += a[i];
          y += a[i + 1];
        }
        return x * y;
      }

  If compiled with `clang -std=c11 -Wpedantic -Wall -O3 main.c -S -o - -emit-llvm` (clang version 7.0.0 (trunk 337339) (llvm/trunk 337344)) it produces scalar code with the loop not unrolled with the unsigned `n` and `i` (as shown above), but a vectorized and unrolled loop with signed `n` and `i`. With the changes made in this commit the unsigned version will be vectorized (though not unrolled, for unclear reasons).
  How it all works: say we have an AddExpr that looks like (C + x + y + ...), where C is a constant and x, y, ... are arbitrary SCEVs. Let's compute the minimum number of trailing zeroes guaranteed of that sum w/o the constant term: (x + y + ...). If, for example, those terms look as follows:

               i
      XXXX...X000
      YYYY...YY00
          ...
      ZZZZ...0000

  then the rightmost non-guaranteed-zero bit (a potential one at the i-th position above) can change the bits of the sum to the left (and at the i-th position itself), but it cannot possibly change the bits to the right. So we can compute the number of trailing zeroes by taking a minimum between the numbers of trailing zeroes of the terms.
  Now let's say that our original sum with the constant is effectively just C + X, where X = x + y + .... Let's also say that we've got 2 guaranteed trailing zeros for X:

                j
      CCCC...CCCC
      XXXX...XX00   // this is X = (x + y + ...)

  Any bit of C to the left of j may in the end cause the C + X sum to wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not affect wrapping in any way. If the upper bits cause a wrap, it will be a wrap regardless of the values of the 2 least significant bits of C. If the upper bits do not cause a wrap, it won't be a wrap regardless of the values of the 2 bits on the right (again). So let's split C into 2 constants as follows:

      0000...00CC = D
      CCCC...CC00 = (C - D)

  and represent the whole sum as D + (C - D + X). The second term of this new sum looks like this:

      CCCC...CC00
      XXXX...XX00
      -----------   // let's add them up
      YYYY...YY00

  The sum above (let's call it Y) may or may not wrap, we don't know, so we need to keep it under a sext/zext. Adding D to that sum, though, will never wrap, signed or unsigned, whether performed on the original bit width or the extended one, because all the final add does is set the 2 least significant bits of Y to the bits of D:

      YYYY...YY00 = Y
      0000...00CC = D
      -----------   <nuw><nsw>
      YYYY...YYCC

  This means we can safely move that D out of the sext or zext and claim that the top-level sum neither sign-wraps nor unsigned-wraps.
  Let's run an example. Say we're working in i8's and the original expression (the zext's or sext's operand) is 21 + 12x + 8y. It goes like this:

      0001 0101   // 21
      XXXX XX00   // 12x
      YYYY Y000   // 8y

      0001 0101   // 21
      ZZZZ ZZ00   // 12x + 8y

      0000 0001   // D
      0001 0100   // 21 - D = 20
      ZZZZ ZZ00   // 12x + 8y

      0000 0001   // D
      WWWW WW00   // 21 - D + 12x + 8y = 20 + 12x + 8y

  therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>.
  This approach could be improved if we move away from using trailing zeroes and use KnownBits instead. For instance, with KnownBits we could have the following picture:

             i
      10 1110...0011   // this is C
      XX X1XX...XX00   // this is X = (x + y + ...)

  Notice that some of the bits of X are known ones, and that the known bits of X are interspersed with unknown bits, not grouped on the right or left. We can see at position i that C(i) and X(i) are both known ones, therefore the (i + 1)th carry bit is guaranteed to be 1 regardless of the bits of C to the right of i. For instance, the C(i - 1) bit only affects the bits of the sum at positions i - 1 and i, and does not influence whether the sum is going to wrap. Therefore we could split the constant C the following way:

             i
      00 0010...0011 = D
      10 1100...0000 = (C - D)

  Let's compute the KnownBits of (C - D) + X:

      XX1 1               // 1 = carry bit, blanks stand for known zeroes
       10 1100...0000 = (C - D)
       XX X1XX...XX00 = X
      ---------------
       XX X0XX...XX00

  Whether this add wraps essentially depends on the bits of X. Adding D to this sum, however, is guaranteed not to wrap:

      0  X
      00 0010...0011 = D
      sX X0XX...XX00 = (C - D) + X
      --------------
      sX XXXX...XX11

  As can be seen above, adding D preserves the sign bit of (C - D) + X, if any, and has a guaranteed 0 carry-out, as expected. The more bits of (C - D) we constrain, the better the transformations introduced here canonicalize expressions, as it leaves less freedom for what values the constant part of ((C - D) + x + y + ...) can take.
  Reviewed By: mzolotukhin, efriedma
  Differential Revision: https://reviews.llvm.org/D48853
  llvm-svn: 337943
* [SLPVectorizer] Don't attempt horizontal reduction on pointer types (PR38191) (Simon Pilgrim, 2018-07-17, 1 file, -0/+128)
  TTI::getMinMaxReductionCost typically can't handle pointer types - until this is changed it's better to limit horizontal reduction to integer/float vector types only.
  llvm-svn: 337280
* [SLPVectorizer] Add initial alternate opcode support for cast instructions. (REAPPLIED-2) (Simon Pilgrim, 2018-07-13, 1 file, -76/+289)
  We currently only support binary instructions in the alternate opcode shuffles. This patch is an initial attempt at adding cast instructions as well. This raises several issues that we probably want to address as we continue to generalize the alternate mechanism:
  1 - Duplication of cost determination - we should probably add scalar/vector cost helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
  2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
  3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
  4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.
  Reapplied with fix to only accept 2 different casts if they come from the same source type (PR38154).
  Differential Revision: https://reviews.llvm.org/D49135
  llvm-svn: 336989
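  A hypothetical sketch, in the spirit of the tests, of the alternate-cast shape this enables: two different cast opcodes from the same source type (per the reapply fix), interleaved over the lanes.

      define <4 x float> @sitofp_uitofp(<4 x i32> %a) {
        %a0 = extractelement <4 x i32> %a, i32 0
        %a1 = extractelement <4 x i32> %a, i32 1
        %a2 = extractelement <4 x i32> %a, i32 2
        %a3 = extractelement <4 x i32> %a, i32 3
        ; alternating cast opcodes, both from i32
        %c0 = sitofp i32 %a0 to float
        %c1 = uitofp i32 %a1 to float
        %c2 = sitofp i32 %a2 to float
        %c3 = uitofp i32 %a3 to float
        %r0 = insertelement <4 x float> undef, float %c0, i32 0
        %r1 = insertelement <4 x float> %r0, float %c1, i32 1
        %r2 = insertelement <4 x float> %r1, float %c2, i32 2
        %r3 = insertelement <4 x float> %r2, float %c3, i32 3
        ret <4 x float> %r3
      }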
* Revert "[SLPVectorizer] Add initial alternate opcode support for cast instructions. (REAPPLIED)" (Martin Storsjo, 2018-07-12, 1 file, -154/+76)
  This reverts commit r336812, which broke compilation of a number of projects, see PR38154.
  llvm-svn: 336949
* [SLPVectorizer] Add initial alternate opcode support for cast instructions. (REAPPLIED) (Simon Pilgrim, 2018-07-11, 1 file, -76/+154)
  We currently only support binary instructions in the alternate opcode shuffles. This patch is an initial attempt at adding cast instructions as well. This raises several issues that we probably want to address as we continue to generalize the alternate mechanism:
  1 - Duplication of cost determination - we should probably add scalar/vector cost helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
  2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
  3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
  4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.
  Reapplied with fix to only accept 2 different casts if they come from the same source type.
  Differential Revision: https://reviews.llvm.org/D49135
  llvm-svn: 336812
* [SLPVectorizer] Ensure alternate/passthrough doesn't vectorize sdiv with undef elts (Simon Pilgrim, 2018-07-11, 1 file, -0/+81)
  llvm-svn: 336809
* [SLPVectorizer] Add some additional alternate cast tests (Simon Pilgrim, 2018-07-11, 1 file, -0/+107)
  The initial attempt at D49135 failed as we weren't correctly handling casts with different source types.
  llvm-svn: 336808
* Revert rL336804: [SLPVectorizer] Add initial alternate opcode support for cast instructions. (Simon Pilgrim, 2018-07-11, 1 file, -151/+52)
  Reverting due to buildbot failures.
  llvm-svn: 336806
* [SLPVectorizer] Add initial alternate opcode support for cast instructions. (Simon Pilgrim, 2018-07-11, 1 file, -52/+151)
  We currently only support binary instructions in the alternate opcode shuffles. This patch is an initial attempt at adding cast instructions as well. This raises several issues that we probably want to address as we continue to generalize the alternate mechanism:
  1 - Duplication of cost determination - we should probably add scalar/vector cost helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
  2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
  3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
  4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.
  Differential Revision: https://reviews.llvm.org/D49135
  llvm-svn: 336804
* [SLPVectorizer][X86] Begin adding alternate tests for call operators (Simon Pilgrim, 2018-07-02, 1 file, -0/+65)
  Alternate opcode handling only supports binary operators; these tests demonstrate a missed opportunity to vectorize ceil/floor calls.
  llvm-svn: 336125
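  A sketch of the missed pattern (hypothetical names): alternating intrinsic calls that would map nicely onto roundpd with different rounding modes if the alternate mechanism handled calls.

      declare double @llvm.ceil.f64(double)
      declare double @llvm.floor.f64(double)

      define <2 x double> @ceil_floor(<2 x double> %a) {
        %a0 = extractelement <2 x double> %a, i32 0
        %a1 = extractelement <2 x double> %a, i32 1
        ; alternating call opcodes - not vectorizable via the binary-only alternate path
        %c0 = call double @llvm.ceil.f64(double %a0)
        %c1 = call double @llvm.floor.f64(double %a1)
        %r0 = insertelement <2 x double> undef, double %c0, i32 0
        %r1 = insertelement <2 x double> %r0, double %c1, i32 1
        ret <2 x double> %r1
      }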
* [SLPVectorizer] Fix alternate opcode + shuffle cost function to correctly handle SK_Select patterns (Simon Pilgrim, 2018-07-02, 1 file, -3/+24)
  We were always using the opcodes of the first 2 scalars for the costs of the alternate opcode + shuffle. This made sense when we used SK_Alternate and opcodes were guaranteed to be alternating, but this fails for the more general SK_Select case.
  This fix exposes an issue demonstrated by the fmul_fdiv_v4f32_const test - the SLM model has v4f32 fdiv costs which are more than twice those of the f32 scalar cost, meaning that the cost model determines that the vectorization is not performant. Unfortunately it completely ignores the fact that the fdiv by a constant will be changed into an fmul by InstCombine for a much lower cost vectorization. But at least we're seeing this now...
  llvm-svn: 336095
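  The fdiv-by-constant rewrite the cost model misses, sketched (hypothetical names; the reciprocal transform applies when 1/C is exact, as with powers of 2):

      define <4 x float> @fdiv_by_const(<4 x float> %x) {
        %d = fdiv <4 x float> %x, <float 2.0, float 2.0, float 2.0, float 2.0>
        ret <4 x float> %d
      }
      ; InstCombine rewrites the divide as a multiply by the exact reciprocal,
      ; which is far cheaper than SLM's fdiv:
      ;   %m = fmul <4 x float> %x, <float 0.5, float 0.5, float 0.5, float 0.5>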
* [SLPVectorizer][X86] Add some alternate tests for cast operators (Simon Pilgrim, 2018-07-01, 1 file, -0/+169)
  Alternate opcode handling only supports binary operators; these tests demonstrate missed opportunities to vectorize some sitofp/uitofp and fptosi/fptoui style casts, as well as some (successful) float bit manipulations.
  llvm-svn: 336060
* [SLPVectorizer] Recognise non uniform power of 2 constants (Simon Pilgrim, 2018-06-26, 1 file, -32/+53)
  Since D46637 we are better at handling uniform/non-uniform constant Pow2 detection; this patch tweaks the SLP argument handling to support them. As SLP works with arrays of values I don't think we can easily use the pattern match helpers here.
  Differential Revision: https://reviews.llvm.org/D48214
  llvm-svn: 335621
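  An illustrative sketch of a non-uniform power-of-2 pattern (invented names): each lane multiplies by a different power of 2, so the operation can be treated as a cheap non-uniform shift rather than a general multiply.

      define <4 x i32> @mul_nonuniform_pow2(<4 x i32> %x) {
        %m = mul <4 x i32> %x, <i32 2, i32 4, i32 8, i32 16>
        ret <4 x i32> %m
      }
      ; equivalent shift form with non-uniform amounts:
      ;   %s = shl <4 x i32> %x, <i32 1, i32 2, i32 3, i32 4>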