path: root/llvm/test/Transforms/LoopVectorize
Commit message | Author | Age | Files | Lines
* Check bool attribute value in getOptionalBoolLoopAttribute. | Alina Sbirlea | 2019-01-29 | 1 | -0/+95
  Summary: Check the bool value of the attribute in
  getOptionalBoolLoopAttribute, not just its existence. Eliminates the
  warning noise generated when vectorization is explicitly disabled.
  Reviewers: Meinersbur, hfinkel, dmgreen
  Subscribers: jlebar, sanjoy, llvm-commits
  Differential Revision: https://reviews.llvm.org/D57260
  llvm-svn: 352555
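  For illustration, the distinction is between an absent attribute and one
  that is present with value false. A minimal sketch of the metadata shape
  (not taken from the commit's tests):

    br i1 %done, label %exit, label %loop, !llvm.loop !0

    !0 = distinct !{!0, !1}
    ; present with value i1 false: vectorization was explicitly disabled,
    ; so the pass should stay quiet rather than warn that it failed to
    ; vectorize
    !1 = !{!"llvm.loop.vectorize.enable", i1 false}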
* [ValueTracking] Look through casts when determining non-nullness | Johannes Doerfert | 2019-01-26 | 2 | -63/+63
  Bitcast and certain Ptr2Int/Int2Ptr instructions will not alter the value
  of their operand and can therefore be looked through when we determine
  non-nullness.
  Differential Revision: https://reviews.llvm.org/D54956
  llvm-svn: 352293
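  A minimal sketch of the pattern this enables (illustrative IR, not from
  the changed tests):

    define i1 @f(i32* nonnull %p) {
      %b = bitcast i32* %p to i8*   ; the cast preserves the pointer value,
      %c = icmp eq i8* %b, null     ; so %b is known non-null and the
      ret i1 %c                     ; compare can fold to false
    }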
* [CostModel][X86] Add explicit vector select costs | Simon Pilgrim | 2019-01-20 | 2 | -55/+82
  Prior to SSE41 (and sometimes on AVX1), vector select has to be performed
  as a ((X & C)|(Y & ~C)) bit select.
  Exposes a couple of issues with the min/max reduction costs (which only
  go down to SSE42 for some reason).
  The increased pre-SSE41 select costs also prevent a couple of tests from
  firing any longer, so I've either tweaked the target or added AVX tests
  as well to the existing SSE2 tests.
  llvm-svn: 351685
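  In IR terms, the select below has no single pre-SSE41 instruction and
  must be expanded into the bit-select form (a sketch of the expansion,
  written as IR comments for illustration):

    %r = select <4 x i1> %c, <4 x i32> %x, <4 x i32> %y
    ; expands to roughly:
    ;   %m  = sext <4 x i1> %c to <4 x i32>                      ; C
    ;   %t0 = and <4 x i32> %x, %m                               ; X & C
    ;   %nm = xor <4 x i32> %m, <i32 -1, i32 -1, i32 -1, i32 -1> ; ~C
    ;   %t1 = and <4 x i32> %y, %nm                              ; Y & ~C
    ;   %r  = or  <4 x i32> %t0, %t1                             ; bit select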
* [LoopVectorizer] give more advice in remark about failure to vectorize call | Sanjay Patel | 2019-01-12 | 1 | -0/+52
  Something like this is requested by:
  https://bugs.llvm.org/show_bug.cgi?id=40265
  ...and it seems like a common enough case that we should acknowledge it.
  Differential Revision: https://reviews.llvm.org/D56551
  llvm-svn: 351010
* [LAA] Avoid generating RT checks for known deps preventing vectorization. | Florian Hahn | 2018-12-20 | 1 | -2/+4
  If we found unsafe dependences other than 'unknown', we already know at
  compile time that they are unsafe and the runtime checks should always
  fail. So we can avoid generating them in those cases.
  This should have no negative impact on performance as the runtime checks
  that would be created previously should always fail. As a sanity check,
  I measured the test-suite, spec2k and spec2k6 and there were no
  regressions.
  Reviewers: Ayal, anemet, hsaito
  Reviewed By: Ayal
  Differential Revision: https://reviews.llvm.org/D55798
  llvm-svn: 349794
* Introduce llvm.loop.parallel_accesses and llvm.access.group metadata. | Michael Kruse | 2018-12-20 | 7 | -62/+74
  The current llvm.mem.parallel_loop_access metadata has a problem in that
  it uses LoopIDs. LoopID unfortunately is not a loop identifier. It is
  neither unique (there's even a regression test assigning the same LoopID
  to multiple loops; it can otherwise happen if passes such as
  LoopVersioning make copies of entire loops) nor persistent (every time a
  property is removed/added from a LoopID's MDNode, it will also receive a
  new LoopID; this happens e.g. when calling
  Loop::setLoopAlreadyUnrolled()). Since most loop transformation passes
  change the loop attributes (even if it is just to mark that a loop
  should not be processed again, as llvm.loop.isvectorized does for the
  versioned and unversioned loop), the parallel access information is lost
  for any subsequent pass.
  This patch unlinks LoopIDs and parallel accesses. The
  llvm.mem.parallel_loop_access metadata on instructions is replaced by
  llvm.access.group metadata. llvm.access.group points to a distinct
  MDNode with no operands (avoiding the need to ever add/remove operands),
  called an "access group". Alternatively, it can point to a list of
  access groups. The LoopID then has an attribute
  llvm.loop.parallel_accesses listing all the access groups that are
  parallel (no dependencies carried by this loop).
  This intentionally avoids any kind of "ID". Loops that are cloned or
  have their attributes modified retain the llvm.loop.parallel_accesses
  attribute. Access instructions that are cloned point to the same access
  group. It is not necessary for each access to have its own "ID" MDNode;
  memory access instructions with the same behavior can be grouped
  together.
  The behavior of llvm.mem.parallel_loop_access is not changed by this
  patch, but should be considered deprecated.
  Differential Revision: https://reviews.llvm.org/D52116
  llvm-svn: 349725
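  A minimal sketch of the new scheme (illustrative IR, not taken from the
  commit's tests):

    %v = load i32, i32* %src, !llvm.access.group !2
    store i32 %v, i32* %dst, !llvm.access.group !2
    br i1 %done, label %exit, label %body, !llvm.loop !0

    !0 = distinct !{!0, !1}
    ; all accesses in group !2 carry no dependence across this loop
    !1 = !{!"llvm.loop.parallel_accesses", !2}
    !2 = distinct !{}   ; the access group: a distinct, operand-less MDNode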
* [LAA] Introduce enum for vectorization safety status (NFC). | Florian Hahn | 2018-12-18 | 1 | -0/+40
  This patch adds a VectorizationSafetyStatus enum, which will be extended
  in a follow up patch to distinguish between 'safe with runtime checks'
  and 'known unsafe' dependences.
  Reviewers: anemet, anna, Ayal, hsaito
  Reviewed By: Ayal
  Differential Revision: https://reviews.llvm.org/D54892
  llvm-svn: 349556
* [LoopVectorize] auto-generate complete checks; NFC | Sanjay Patel | 2018-12-18 | 1 | -9/+132
  The first test claims to show that the vectorizer will generate a vector
  load/loop, but then this file runs other passes which might scalarize
  that op. I'm removing instcombine from the RUN line here to break that
  dependency. Also, I'm generating full checks to make it clear exactly
  what the vectorizer has done.
  llvm-svn: 349554
* [Unroll/UnrollAndJam/Vectorizer/Distribute] Add followup loop attributes. | Michael Kruse | 2018-12-12 | 6 | -17/+117
  When multiple loop transformations are defined in a loop's metadata,
  their order of execution is defined by the order of their respective
  passes in the pass pipeline. For instance,

    #pragma clang loop unroll_and_jam(enable)
    #pragma clang loop distribute(enable)

  is the same as

    #pragma clang loop distribute(enable)
    #pragma clang loop unroll_and_jam(enable)

  and will try to loop-distribute before Unroll-And-Jam, because the
  LoopDistribute pass is scheduled after the UnrollAndJam pass.
  UnrollAndJamPass only supports one inner loop, i.e. it will necessarily
  fail after loop distribution. It is not possible to specify another
  execution order. Also, the order of passes in the pipeline is subject to
  change between versions of LLVM, optimization options, and which pass
  manager is used.
  This patch adds 'followup' attributes to various loop transformation
  passes. These attributes define which attributes the resulting loop of a
  transformation should have. For instance,

    !0 = !{!0, !1, !2}
    !1 = !{!"llvm.loop.unroll_and_jam.enable"}
    !2 = !{!"llvm.loop.unroll_and_jam.followup_inner", !3}
    !3 = !{!"llvm.loop.distribute.enable"}

  defines a loop ID (!0) to be unrolled-and-jammed (!1) and then the
  attribute !3 to be added to the jammed inner loop, which contains the
  instruction to distribute the inner loop.
  Currently, in both pass managers, pass execution is in a fixed order and
  UnrollAndJamPass will not execute again after LoopDistribute. We hope to
  fix this in the future by allowing pass managers to run passes until a
  fixpoint is reached, using Polly to perform these transformations, or
  adding a loop transformation pass which takes the order issue into
  account.
  For mandatory/forced transformations (e.g. those declared by #pragma omp
  simd), the user must be notified when a transformation could not be
  performed. It is not possible for the responsible pass to emit such a
  warning, because the transformation might be 'hidden' in a followup
  attribute when it is executed, or it might not be present in the
  pipeline at all. For this reason, this patch introduces a
  WarnMissedTransformations pass to warn about orphaned transformations.
  Since this changes the user-visible diagnostic message when a
  transformation is applied, two test cases in the clang repository need
  to be updated.
  To ensure that no other transformation is executed before the intended
  one, the attribute `llvm.loop.disable_nonforced` can be added, which
  should disable transformation heuristics before the intended
  transformation is applied. E.g. it would be surprising if a loop is
  distributed before a #pragma unroll_and_jam is applied.
  With more supported code transformations (loop fusion, interchange,
  stripmining, offloading, etc.), transformations can be used as building
  blocks for more complex transformations (e.g.
  stripmining+stripmining+interchange -> tiling).
  Reviewed By: hfinkel, dmgreen
  Differential Revision: https://reviews.llvm.org/D49281
  Differential Revision: https://reviews.llvm.org/D55288
  llvm-svn: 348944
* Reapply "[DemandedBits][BDCE] Support vectors of integers"Nikita Popov2018-12-071-0/+20
| | | | | | | | | | | | | | | | | | | | | | DemandedBits and BDCE currently only support scalar integers. This patch extends them to also handle vector integer operations. In this case bits are not tracked for individual vector elements, instead a bit is demanded if it is demanded for any of the elements. This matches the behavior of computeKnownBits in ValueTracking and SimplifyDemandedBits in InstCombine. Unlike the previous iteration of this patch, getDemandedBits() can now again be called on arbirary (sized) instructions, even if they don't have integer or vector of integer type. (For vector types the size of the returned mask will now be the scalar size in bits though.) The added LoopVectorize test case shows a case which triggered an assertion failure with the previous attempt, because getDemandedBits() was called on a pointer-typed instruction. Differential Revision: https://reviews.llvm.org/D55297 llvm-svn: 348602
* Revert r347934 "[SCEV] Guard movement of insertion point for loop-invariants"David L. Jones2018-12-051-65/+0
| | | | | | | | This change caused SEGVs in instcombine. (The r347934 change seems to me to be a precipitating cause, not a root cause. Details are on the llvm-commits thread for r347934.) llvm-svn: 348426
* [X86][LoopVectorize] Replace -mcpu=skylake-avx512 with -mattr=avx512f in some tests that failed when experimenting with defaulting to -mprefer-vector-width=256 for skylake-avx512. | Craig Topper | 2018-12-01 | 3 | -3/+3
  llvm-svn: 348063
* Add a new reduction pattern match | Renato Golin | 2018-11-30 | 1 | -0/+821
  Adding a new reduction pattern match for vectorizing code similar to
  TSVC s3111:

    for (int i = 0; i < N; i++)
      if (a[i] > b)
        sum += a[i];

  This patch adds support for fadd, fsub and fmul, as well as multiple
  branches and different (but compatible) instructions (e.g. add+sub) in
  different branches.
  The difference from the previous patch
  (https://reviews.llvm.org/D49168) is as follows:
  - Added a check of the fast-math property of the fp instruction to the
    previous patch
  - Fixed/added some patterns for if-reduction.ll
  Differential Revision: https://reviews.llvm.org/D54464
  Patch by Takahiro Miyoshi <takahiro.miyoshi@linaro.org>
  and Masakazu Ueno <masakazu.ueno@linaro.org>
  llvm-svn: 347989
* [SCEV] Guard movement of insertion point for loop-invariants | Warren Ristow | 2018-11-30 | 1 | -0/+65
  r320789 suppressed moving the insertion point of SCEV expressions with
  div/rem operations to the loop header in non-loop-invariant situations.
  This, and similar, hoisting is also unsafe in the loop-invariant case,
  since there may be a guard against a zero denominator. This is an
  adjustment to the fix of r320789 to suppress the movement even in the
  loop-invariant case.
  This fixes PR30806.
  Differential Revision: https://reviews.llvm.org/D54713
  llvm-svn: 347934
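  A sketch of why the loop-invariant case is still unsafe (illustrative C,
  not the PR30806 reproducer):

    void f(int *a, int n, int x, int d) {
      for (int i = 0; i < n; ++i) {
        if (d != 0)       /* guard: x and d are loop-invariant, but       */
          a[i] = x / d;   /* hoisting x/d above the guard may divide by 0 */
      }
    }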
* Revert "[LICM] Enable control flow hoisting by default" and "[LICM] Reapply ↵Martin Storsjo2018-11-291-11/+9
| | | | | | | | | | | r347190 "Make LICM able to hoist phis" with fix" This reverts commits r347776 and r347778. The first one, r347776, caused significant compile time regressions for certain input files, see PR39836 for details. llvm-svn: 347867
* [LICM] Enable control flow hoisting by default | John Brawn | 2018-11-28 | 1 | -9/+11
  Differential Revision: https://reviews.llvm.org/D54949
  llvm-svn: 347778
* Revert unapproved commit | Joel Jones | 2018-11-24 | 1 | -1077/+0
  llvm-svn: 347511
* [AArch64] Enable libm vectorized functions via SLEEF | Joel Jones | 2018-11-24 | 1 | -0/+1077
  This changeset is modeled after Intel's submission for SVML. It enables
  vectorization of trigonometry functions via SLEEF: http://sleef.org/.
  * A new vectorization library enum is added to TargetLibraryInfo.h:
    SLEEF.
  * A new option is added to TargetLibraryInfoImpl - ClVectorLibrary:
    SLEEF.
  * A comprehensive test case is included in this changeset.
  * In a separate changeset (for clang), a new vectorization library
    argument is added to -fveclib: -fveclib=SLEEF.
  Trigonometry functions that are vectorized by SLEEF: acos, asin, atan,
  atanh, cos, cosh, exp, exp2, exp10, lgamma, log10, log2, log, sin, sinh,
  sqrt, tan, tanh, tgamma.
  Patch by Stefan Teleman
  Differential Revision: https://reviews.llvm.org/D53927
  llvm-svn: 347510
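  With the clang side in place, a loop like the following becomes a
  candidate for calling SLEEF's vector sin (a sketch assuming
  -fveclib=SLEEF is passed on AArch64; the exact vector callee name comes
  from the TLI mapping, not from this example):

    #include <math.h>
    void f(double *restrict a, const double *restrict b, int n) {
      for (int i = 0; i < n; ++i)
        a[i] = sin(b[i]);   /* widened into a call to a SLEEF vector sin */
    }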
* Revert "[LICM] Make LICM able to hoist phis"Benjamin Kramer2018-11-191-11/+9
| | | | | | This reverts commit r347190. llvm-svn: 347225
* [LV] Avoid vectorizing unsafe dependencies in uniform address | Anna Thomas | 2018-11-19 | 1 | -0/+42
  Summary: Currently, when vectorizing stores to uniform addresses, the
  only case in which we prevent vectorization is if there are multiple
  stores to the same uniform address causing an unsafe dependency.
  This patch teaches LAA to avoid vectorizing loops that have an unsafe
  cross-iteration dependency between a load and a store to the same
  uniform address.
  Fixes PR39653.
  Reviewers: Ayal, efriedma
  Subscribers: rkruppe, llvm-commits
  Differential Revision: https://reviews.llvm.org/D54538
  llvm-svn: 347220
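  The rejected shape, roughly (an illustrative sketch, not the PR39653
  reproducer):

    void f(int *a, int *b, int *p, int n) {
      for (int i = 0; i < n; ++i) {
        b[i] = *p;   /* load from the uniform address *p ...            */
        *p = a[i];   /* ... and a store to it: the load in iteration i  */
      }              /* reads the value stored in iteration i-1         */
    }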
* [LICM] Make LICM able to hoist phis | John Brawn | 2018-11-19 | 1 | -9/+11
  The general approach taken is to make note of loop invariant branches,
  then when we see something conditional on that branch, such as a phi,
  we create a copy of the branch and (empty versions of) its successors
  and hoist using that.
  This has no impact by itself that I've been able to see, as LICM
  typically doesn't see such phis as they will have been converted into
  selects by the time LICM is run, but once we start doing phi-to-select
  conversion later it will be important.
  Differential Revision: https://reviews.llvm.org/D52827
  llvm-svn: 347190
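  The shape this targets, before any phi-to-select conversion
  (illustrative C, not from the patch):

    void f(int *a, int n, int inv) {
      for (int i = 0; i < n; ++i) {
        int x;
        if (inv) x = 1;  /* 'inv' never changes inside the loop, so the  */
        else     x = 2;  /* branch, its blocks, and the phi merging 'x'  */
        a[i] += x;       /* can all be hoisted out of the loop           */
      }
    }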
* [CostModel] Add more realistic SK_InsertSubvector generic costs. | Simon Pilgrim | 2018-11-12 | 1 | -55/+58
  Instead of defaulting to a cost = 1, expand to element extract/insert
  like we do for other shuffles.
  llvm-svn: 346662
* [VectorUtils] add funnel-shifts to the list of vectorizable intrinsics | Sanjay Patel | 2018-11-12 | 1 | -2/+2
  This just identifies the intrinsics as candidates for vectorization. It
  does not mean we will attempt to vectorize under normal conditions (the
  test file is forcing vectorization).
  The cost model must be fixed to show that the transform is profitable
  in general.
  Allowing vectorization with these intrinsics is required to avoid
  potential regressions from canonicalizing to the intrinsics from
  generic IR:
  https://bugs.llvm.org/show_bug.cgi?id=37417
  llvm-svn: 346661
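  The intrinsics in question, in their scalar form (a minimal illustrative
  example; the vectorizer would widen the call to vector types):

    declare i32 @llvm.fshl.i32(i32, i32, i32)

    define i32 @rotl(i32 %x, i32 %s) {
      ; funnel shift with both inputs the same value is a rotate-left
      %r = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 %s)
      ret i32 %r
    }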
* [LoopVectorize] add tests for funnel shifts; NFC | Sanjay Patel | 2018-11-12 | 1 | -0/+52
  llvm-svn: 346658
* [SystemZ] Rework getInterleavedMemoryOpCost() | Jonas Paulsson | 2018-11-02 | 1 | -0/+149
  Model this function more closely after the BasicTTIImpl version, with
  separate handling of loads and stores. For loads, the set of actually
  loaded vectors is checked. This makes it more readable and just
  slightly more accurate generally.
  Review: Ulrich Weigand
  https://reviews.llvm.org/D53071
  llvm-svn: 345998
* [LV] Avoid vectorizing loops under opt for size that involve SCEV checks | Ayal Zaks | 2018-11-02 | 2 | -0/+114
  Fix PR39417, PR39497.
  The loop vectorizer may generate runtime SCEV checks for overflow and
  stride==1 cases, leading to execution of the original scalar loop. The
  latter is forbidden when optimizing for size. An assert introduced in
  r344743 triggered the above PRs, showing it does happen. This patch
  fixes this behavior by preventing vectorization in such cases.
  Differential Revision: https://reviews.llvm.org/D53612
  llvm-svn: 345959
* [LV] Support vectorization of interleave-groups that require an epilog under optsize using masked wide loads | Dorit Nuzman | 2018-10-31 | 1 | -70/+235
  Under Opt for Size, the vectorizer does not vectorize interleave-groups
  that have gaps at the end of the group (such as a loop that reads only
  the even elements: a[2*i]) because that implies that we'll require a
  scalar epilogue (which is not allowed under Opt for Size). This patch
  extends the support for masked-interleave-groups (introduced by D53011
  for conditional accesses) to also cover the case of gaps in a group of
  loads; Targets that enable the masked-interleave-group feature don't
  have to invalidate interleave-groups of loads with gaps; they could now
  use masked wide-loads and shuffles (if that's what the cost model
  selects).
  Reviewers: Ayal, hsaito, dcaballe, fhahn
  Reviewed By: Ayal
  Differential Revision: https://reviews.llvm.org/D53668
  llvm-svn: 345705
* [LoopVectorizer] Fix for cost values of memory accesses. | Jonas Paulsson | 2018-10-30 | 2 | -0/+55
  This commit is a combination of two patches:
  * "Fix in getScalarizationOverhead()"
    If the target returns false in TTI.prefersVectorizedAddressing(), it
    means the address registers will not need to be extracted. Therefore,
    there should be no operands scalarization overhead for a load
    instruction.
  * "Don't pass the instruction pointer from getMemInstScalarizationCost."
    Since VF is always > 1, this is a cost query for an instruction in
    the vectorized loop and it should not be evaluated within the scalar
    context of the instruction.
  Review: Ulrich Weigand, Hal Finkel
  https://reviews.llvm.org/D52351
  https://reviews.llvm.org/D52417
  llvm-svn: 345603
* Revert r344172: [LV] Add a new reduction pattern match | Renato Golin | 2018-10-27 | 1 | -666/+0
  This patch has caused fast-math issues in the reduction pattern. Will
  re-work and land again.
  llvm-svn: 345465
* [CostModel][X86] Add realistic vXi64 uitofp vXf64 costs | Simon Pilgrim | 2018-10-25 | 1 | -2/+3
  Match codegen improvements from D53649/rL345256.
  llvm-svn: 345263
* [LV] Don't have fold-tail under optsize invalidate interleave-groups when masked-interleaving is enabled | Dorit Nuzman | 2018-10-24 | 1 | -0/+387
  Enable interleave-groups under the fold-tail scenario for opt-for-size
  compilation; D50480 added support for vectorizing loops of arbitrary
  trip-count without a remainder, which in turn makes everything in the
  loop conditional, including interleave-groups if any. It therefore
  invalidated all interleave-groups because we didn't have support for
  vectorizing predicated interleave-groups at the time. In the meantime,
  D53011 introduced this support, so we don't have to invalidate
  interleave-groups when masked-interleaved support is enabled.
  Reviewers: Ayal, hsaito, dcaballe, fhahn
  Reviewed By: hsaito
  Differential Revision: https://reviews.llvm.org/D53559
  llvm-svn: 345115
* [InstCombine] use 'match' to handle vectors and simplify code | Sanjay Patel | 2018-10-23 | 1 | -2/+1
  This is another step towards completely removing the fake binop queries
  for not/neg/fneg.
  llvm-svn: 345036
* [IAI,LV] Avoid creating a scalar epilogue due to gaps in interleave-groups when optimizing for size | Dorit Nuzman | 2018-10-22 | 1 | -2/+112
  LV is careful to respect -Os and not to create a scalar epilog in all
  cases (runtime tests, trip-counts that require a remainder loop) except
  for peeling due to gaps in interleave-groups. This patch fixes that;
  -Os will now have us invalidate such interleave-groups and vectorize
  without an epilog.
  The patch also removes a related FIXME comment that is now obsolete,
  and was also inaccurate: "FIXME: return None if loop
  requiresScalarEpilog(<MaxVF>), or look for a smaller MaxVF that does
  not require a scalar epilog." (requiresScalarEpilog() has nothing to do
  with VF.)
  Reviewers: Ayal, hsaito, dcaballe, fhahn
  Reviewed By: Ayal
  Differential Revision: https://reviews.llvm.org/D53420
  llvm-svn: 344883
* [LoopVectorize] Loop vectorization for minimum and maximum | Thomas Lively | 2018-10-19 | 1 | -0/+56
  Summary: Depends on D52766.
  Reviewers: aheejin, dschuff
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D52767
  llvm-svn: 344816
* [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt for size | Ayal Zaks | 2018-10-18 | 3 | -22/+282
  When optimizing for size, a loop is vectorized only if the resulting
  vector loop completely replaces the original scalar loop. This holds if
  no runtime guards are needed, if the original trip-count TC does not
  overflow, and if TC is a known constant that is a multiple of the VF.
  The last two TC-related conditions can be overcome by
  1. rounding the trip-count of the vector loop up from TC to a multiple
     of VF;
  2. masking the vector body under a newly introduced "if (i <= TC-1)"
     condition.
  The patch allows loops with arbitrary trip counts to be vectorized
  under -Os, subject to the existing cost model considerations. It also
  applies to loops with small trip counts (under -O2) which are currently
  handled as if under -Os.
  The patch does not handle loops with reductions, live-outs, or w/o a
  primary induction variable, and disallows interleave groups.
  (Third, final and main part of -)
  Differential Revision: https://reviews.llvm.org/D50480
  llvm-svn: 344743
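  Conceptually, for VF = 4 (a scalar model of the masked vector body; TC,
  a, b and c are placeholders, and this is a sketch rather than the
  generated IR):

    /* round TC up to the next multiple of 4, then mask off extra lanes */
    int rounded = (TC + 3) & ~3;
    for (int i = 0; i < rounded; i += 4)
      for (int j = i; j < i + 4; ++j)   /* one vector iteration, 4 lanes */
        if (j <= TC - 1)                /* the fold-tail mask            */
          a[j] = b[j] + c[j];           /* the original loop body        */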
* [LV] Teach vectorizer about variant value store into uniform address | Anna Thomas | 2018-10-16 | 3 | -12/+470
  Summary: Teach vectorizer about vectorizing variant value stores to
  uniform address. Similar to rL343028, we do not allow vectorization if
  we have multiple stores to the same uniform address.
  Cost model already has the change for considering the extract
  instruction cost for a variant value store. See added test cases for
  how vectorization is done.
  The patch also contains changes to the ORE messages.
  Reviewers: Ayal, mkuper, anemet, hsaito
  Subscribers: rkruppe, llvm-commits
  Differential Revision: https://reviews.llvm.org/D52656
  llvm-svn: 344613
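  The now-supported shape (illustrative C): a loop-varying value stored to
  a single invariant location; after vectorization only the last lane of
  the final vector needs to reach memory:

    void f(int *p, const int *a, int n) {
      for (int i = 0; i < n; ++i)
        *p = a[i];   /* *p must end up holding a[n-1] */
    }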
* [LV] Add test checks when vectorizing loops under opt for size; NFC | Ayal Zaks | 2018-10-16 | 3 | -19/+253
  Landing this as a separate part of https://reviews.llvm.org/D50480,
  recording current behavior more accurately, to clarify the subsequent
  diff ([LV] Vectorizing loops of arbitrary trip count without remainder
  under opt for size).
  llvm-svn: 344606
* recommit 344472 after fixing build failure on ARM and PPC. | Dorit Nuzman | 2018-10-14 | 3 | -0/+387
  llvm-svn: 344475
* revert 344472 due to failures. | Dorit Nuzman | 2018-10-14 | 3 | -387/+0
  llvm-svn: 344473
* [IAI,LV] Add support for vectorizing predicated strided accesses using masked interleave-group | Dorit Nuzman | 2018-10-14 | 3 | -0/+387
  The vectorizer currently does not attempt to create interleave-groups
  that contain predicated loads/stores; predicated strided accesses can
  currently be vectorized only using masked gather/scatter or
  scalarization.
  This patch makes predicated loads/stores candidates for forming
  interleave-groups during the Loop-Vectorizer's analysis, and adds the
  proper support for masked-interleave-groups to the Loop-Vectorizer's
  planning and transformation stages. The patch also extends the TTI API
  to allow querying the cost of masked interleave groups (which each
  target can control); Targets that support masked vector loads/stores
  may choose to enable this feature and allow vectorizing predicated
  strided loads/stores using masked wide loads/stores and shuffles.
  Reviewers: Ayal, hsaito, dcaballe, fhahn, javed.absar
  Reviewed By: Ayal
  Differential Revision: https://reviews.llvm.org/D53011
  llvm-svn: 344472
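  A predicated strided access of the kind this targets (illustrative C,
  not from the patch's tests):

    void f(int *out, const int *in, const int *c, int n) {
      for (int i = 0; i < n; ++i)
        if (c[i])                         /* predicated ...               */
          out[i] = in[2*i] + in[2*i+1];   /* ... stride-2 loads: now one
                                             masked wide load plus shuffles
                                             instead of gather/scalarization */
    }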
* [InstCombine] reverse 'trunc X to <N x i1>' canonicalization; 2nd try | Sanjay Patel | 2018-10-10 | 1 | -176/+160
  Re-trying r344082 because it unintentionally included extra diffs.
  Original commit message:
  icmp ne (and X, 1), 0 --> trunc X to N x i1
  Ideally, we'd do the same for scalars, but there will likely be
  regressions unless we add more trunc folds as we're doing here for
  vectors.
  The motivating vector case is from PR37549:
  https://bugs.llvm.org/show_bug.cgi?id=37549

    define <4 x float> @bitwise_select(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {
      %c = fcmp ole <4 x float> %x, %y
      %s = sext <4 x i1> %c to <4 x i32>
      %s1 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>
      %s2 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 2, i32 2, i32 3, i32 3>
      %cond = or <4 x i32> %s1, %s2
      %condtr = trunc <4 x i32> %cond to <4 x i1>
      %r = select <4 x i1> %condtr, <4 x float> %z, <4 x float> %w
      ret <4 x float> %r
    }

  Here's a sampling of the vector codegen for that case using mask+icmp
  (current behavior) vs. trunc (with this patch):
  AVX before:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vandps LCPI0_0(%rip), %xmm0, %xmm0
    vxorps %xmm1, %xmm1, %xmm1
    vpcmpeqd %xmm1, %xmm0, %xmm0
    vblendvps %xmm0, %xmm3, %xmm2, %xmm0
  AVX after:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vblendvps %xmm0, %xmm2, %xmm3, %xmm0
  AVX512f before:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vpbroadcastd LCPI0_0(%rip), %xmm1 ## xmm1 = [1,1,1,1]
    vptestnmd %zmm1, %zmm0, %k1
    vblendmps %zmm3, %zmm2, %zmm0 {%k1}
  AVX512f after:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vpslld $31, %xmm0, %xmm0
    vptestmd %zmm0, %zmm0, %k1
    vblendmps %zmm2, %zmm3, %zmm0 {%k1}
  AArch64 before:
    fcmge v0.4s, v1.4s, v0.4s
    zip1 v1.4s, v0.4s, v0.4s
    zip2 v0.4s, v0.4s, v0.4s
    orr v0.16b, v1.16b, v0.16b
    movi v1.4s, #1
    and v0.16b, v0.16b, v1.16b
    cmeq v0.4s, v0.4s, #0
    bsl v0.16b, v3.16b, v2.16b
  AArch64 after:
    fcmge v0.4s, v1.4s, v0.4s
    zip1 v1.4s, v0.4s, v0.4s
    zip2 v0.4s, v0.4s, v0.4s
    orr v0.16b, v1.16b, v0.16b
    bsl v0.16b, v2.16b, v3.16b
  PowerPC-le before:
    xvcmpgesp 34, 35, 34
    vspltisw 0, 1
    vmrglw 3, 2, 2
    vmrghw 2, 2, 2
    xxlor 0, 35, 34
    xxlxor 35, 35, 35
    xxland 34, 0, 32
    vcmpequw 2, 2, 3
    xxsel 34, 36, 37, 34
  PowerPC-le after:
    xvcmpgesp 34, 35, 34
    vmrglw 3, 2, 2
    vmrghw 2, 2, 2
    xxlor 0, 35, 34
    xxsel 34, 37, 36, 0
  Differential Revision: https://reviews.llvm.org/D52747
  llvm-svn: 344181
* revert r344082: [InstCombine] reverse 'trunc X to <N x i1>' canonicalization | Sanjay Patel | 2018-10-10 | 1 | -160/+176
  This commit accidentally included the diffs from D53057.
  llvm-svn: 344178
* [LV] Add a new reduction pattern match | Renato Golin | 2018-10-10 | 1 | -0/+666
  Adding a new reduction pattern match for vectorizing code similar to
  TSVC s3111:

    for (int i = 0; i < N; i++)
      if (a[i] > b)
        sum += a[i];

  This patch adds support for fadd, fsub and fmul, as well as multiple
  branches and different (but compatible) instructions (e.g. add+sub) in
  different branches.
  I have forwarded this to trunk and added fsub and fmul functionality
  and additional tests, but the credit goes to Takahiro, who did most of
  the actual work.
  Differential Revision: https://reviews.llvm.org/D49168
  Patch by Takahiro Miyoshi <takahiro.miyoshi@linaro.org>.
  llvm-svn: 344172
* [LV] Move test for r343954 into x86 subdirectory | Justin Bogner | 2018-10-09 | 1 | -0/+0
  This test uses an x86 triple, so it needs to be in the x86 specific
  test directory.
  llvm-svn: 344087
* [InstCombine] reverse 'trunc X to <N x i1>' canonicalization | Sanjay Patel | 2018-10-09 | 1 | -176/+160
  icmp ne (and X, 1), 0 --> trunc X to N x i1
  Ideally, we'd do the same for scalars, but there will likely be
  regressions unless we add more trunc folds as we're doing here for
  vectors.
  The motivating vector case is from PR37549:
  https://bugs.llvm.org/show_bug.cgi?id=37549

    define <4 x float> @bitwise_select(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {
      %c = fcmp ole <4 x float> %x, %y
      %s = sext <4 x i1> %c to <4 x i32>
      %s1 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>
      %s2 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 2, i32 2, i32 3, i32 3>
      %cond = or <4 x i32> %s1, %s2
      %condtr = trunc <4 x i32> %cond to <4 x i1>
      %r = select <4 x i1> %condtr, <4 x float> %z, <4 x float> %w
      ret <4 x float> %r
    }

  Here's a sampling of the vector codegen for that case using mask+icmp
  (current behavior) vs. trunc (with this patch):
  AVX before:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vandps LCPI0_0(%rip), %xmm0, %xmm0
    vxorps %xmm1, %xmm1, %xmm1
    vpcmpeqd %xmm1, %xmm0, %xmm0
    vblendvps %xmm0, %xmm3, %xmm2, %xmm0
  AVX after:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vblendvps %xmm0, %xmm2, %xmm3, %xmm0
  AVX512f before:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vpbroadcastd LCPI0_0(%rip), %xmm1 ## xmm1 = [1,1,1,1]
    vptestnmd %zmm1, %zmm0, %k1
    vblendmps %zmm3, %zmm2, %zmm0 {%k1}
  AVX512f after:
    vcmpleps %xmm1, %xmm0, %xmm0
    vpermilps $80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
    vpermilps $250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
    vorps %xmm0, %xmm1, %xmm0
    vpslld $31, %xmm0, %xmm0
    vptestmd %zmm0, %zmm0, %k1
    vblendmps %zmm2, %zmm3, %zmm0 {%k1}
  AArch64 before:
    fcmge v0.4s, v1.4s, v0.4s
    zip1 v1.4s, v0.4s, v0.4s
    zip2 v0.4s, v0.4s, v0.4s
    orr v0.16b, v1.16b, v0.16b
    movi v1.4s, #1
    and v0.16b, v0.16b, v1.16b
    cmeq v0.4s, v0.4s, #0
    bsl v0.16b, v3.16b, v2.16b
  AArch64 after:
    fcmge v0.4s, v1.4s, v0.4s
    zip1 v1.4s, v0.4s, v0.4s
    zip2 v0.4s, v0.4s, v0.4s
    orr v0.16b, v1.16b, v0.16b
    bsl v0.16b, v2.16b, v3.16b
  PowerPC-le before:
    xvcmpgesp 34, 35, 34
    vspltisw 0, 1
    vmrglw 3, 2, 2
    vmrghw 2, 2, 2
    xxlor 0, 35, 34
    xxlxor 35, 35, 35
    xxland 34, 0, 32
    vcmpequw 2, 2, 3
    xxsel 34, 36, 37, 34
  PowerPC-le after:
    xvcmpgesp 34, 35, 34
    vmrglw 3, 2, 2
    vmrghw 2, 2, 2
    xxlor 0, 35, 34
    xxsel 34, 37, 36, 0
  Differential Revision: https://reviews.llvm.org/D52747
  llvm-svn: 344082
* [LV] Do not create SCEVs on broken IR in emitTransformedIndex. PR39160 | Max Kazantsev | 2018-10-08 | 4 | -19/+112
  At the point when we perform emitTransformedIndex, we have broken IR
  (in particular, we have Phis for which not every incoming value is
  properly set). On such IR, it is illegal to create SCEV expressions,
  because their internal simplification process may try to prove some
  predicates and break when it stumbles across the broken IR.
  The only purpose of using SCEV in this particular place is an attempt
  to simplify the generated code slightly. It seems that the result
  isn't worth it, because some trivial cases (like addition of zero and
  multiplication by 1) can be handled separately if needed, but more
  generally InstCombine is able to achieve the goals we want to achieve
  by using SCEV.
  This patch fixes a functional crash described in PR39160, and as a
  side effect it also generates slightly smarter code in some simple
  cases. It also may cause some optimality loss (i.e. we will now
  generate `mul` by a power of 2 instead of a shift, etc.), but there is
  nothing InstCombine could not handle later. In case of dire need, we
  can support more trivial cases just in place.
  Note that this patch only fixes one particular case of the general
  problem that LV misuses SCEV, attempting to create SCEVs or prove
  predicates on invalid IR. The general solution, however, seems complex
  enough.
  Differential Revision: https://reviews.llvm.org/D52881
  Reviewed By: fhahn, hsaito
  llvm-svn: 343954
* [IAI,LV] Avoid creating interleave-groups for predicated accesses | Dorit Nuzman | 2018-10-07 | 2 | -0/+102
  This patch fixes PR39099.
  When strided loads are predicated, each of them will form an
  interleaved-group (with gaps). However, subsequent stages of
  vectorization (planning and transformation) assume that if a load is
  part of an interleave-group it is not predicated, resulting in wrong
  code - unmasked wide loads are created.
  The interleaving analysis does take care not to have conditional
  interleave groups of size > 1, but until we extend the planning and
  transformation stages to support masked-interleave-groups we should
  also avoid having them for size == 1.
  Reviewers: Ayal, hsaito, dcaballe, fhahn
  Reviewed By: Ayal
  Differential Revision: https://reviews.llvm.org/D52682
  llvm-svn: 343931
* [LV][LAA] Vectorize loop invariant values stored into loop invariant address | Anna Thomas | 2018-09-25 | 3 | -1/+396
  Summary: We are overly conservative in the loop vectorizer with respect
  to stores to loop-invariant addresses. More details in
  https://bugs.llvm.org/show_bug.cgi?id=38546
  This is the first part of the fix, where we start with vectorizing
  loop-invariant values stored to loop-invariant addresses.
  This also includes changes to ORE for stores to invariant addresses.
  Reviewers: anemet, Ayal, mkuper, mssimpso
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D50665
  llvm-svn: 343028
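  The first-part shape now allowed (illustrative C, not the PR38546
  reproducer): both the stored value and the address are invariant in the
  loop:

    void f(int *b, const int *a, int *p, int k, int n) {
      for (int i = 0; i < n; ++i) {
        b[i] = a[i] + k;
        *p = k;       /* loop-invariant value to a loop-invariant address */
      }
    }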
* [Loop Vectorizer] Abandon vectorization when no integer IV found | Warren Ristow | 2018-09-21 | 2 | -0/+54
  Support for vectorizing loops with secondary floating-point induction
  variables was added in r276554. A primary integer IV is still required
  for vectorization to be done. If an FP IV was found, but no integer IV
  was found at all (primary or secondary), the attempt to vectorize still
  went forward, causing a compiler crash. This change abandons that
  attempt when no integer IV is found. (Vectorizing FP-only cases like
  this, rather than bailing out, is discussed as possible future work in
  D52327.)
  See PR38800 for more information.
  Differential Revision: https://reviews.llvm.org/D52327
  llvm-svn: 342786
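  The previously-crashing shape (illustrative C, not the PR38800
  reproducer): the loop's only induction variable is floating-point, so
  there is no integer IV to drive the vector loop:

    float f(float n) {
      float s = 0.0f;
      for (float x = 0.0f; x < n; x += 1.0f)  /* FP induction only */
        s += x;
      return s;
    }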
* [VPlan] Implement initial vector code generation support for simple outer loops. | Hideki Saito | 2018-09-14 | 2 | -0/+194
  Summary: [VPlan] Implement vector code generation support for simple
  outer loops.
  Context: Patch Series #1 for outer loop vectorization support in LV
  using VPlan. (RFC:
  http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html).
  This patch introduces vector code generation support for simple outer
  loops that are currently supported in the VPlanNativePath. Changes here
  essentially do the following:
  - force vector code generation using explicit vectorize_width
  - add conservative early returns in cost model and other places for
    VPlanNativePath
  - add code for setting up outer loop inductions
  - support for widening non-induction PHIs that can result from inner
    loops and uniform conditional branches
  - support for generating uniform inner branches
  We plan to add a handful of C outer loop executable tests once the
  initial code generation support is committed.
  This patch is expected to be NFC for the inner loop vectorizer path.
  Since we are moving in the direction of supporting outer loop
  vectorization in LV, it may also be time to rename classes such as
  InnerLoopVectorizer.
  Reviewers: fhahn, rengolin, hsaito, dcaballe, mkuper, hfinkel, Ayal
  Reviewed By: fhahn, hsaito
  Subscribers: dmgreen, bollu, tschuett, rkruppe, rogfer01, llvm-commits
  Differential Revision: https://reviews.llvm.org/D50820
  llvm-svn: 342197
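  The kind of loop nest this path handles when outer-loop vectorization
  is forced with an explicit width (illustrative C; the pragma mirrors
  the commit's "force vector code generation using explicit
  vectorize_width", and the loop body is invented for illustration):

    void f(float *a, const float *b, const float *c, int n, int m) {
    #pragma clang loop vectorize(enable) vectorize_width(4)
      for (int i = 0; i < n; ++i)     /* the outer loop is vectorized */
        for (int j = 0; j < m; ++j)
          a[i*m + j] = b[i] * c[j];
    }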