| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
Don't overwrite existing target-cpu attributes.
I've often found the replacement behavior annoying, and this is
inconsistent with how the fast math command line flags interact with
the function attributes.
Does not yet change target-features, since I think that should behave
as a concatenation.
|
|
|
|
| |
"frame-pointer"="none" as cleanups after D56351
|
|
|
|
| |
as cleanups after D56351
|
|
|
|
|
|
|
|
|
|
|
|
| |
A sequence of additions or multiplications that is known not to wrap, may wrap
if it's order is changed (i.e., reassociated). Therefore when vectorizing
integer sum or product reductions, their no-wrap flags need to be removed.
Fixes PR43828
Patch by Denis Antrushin
Differential Revision: https://reviews.llvm.org/D69563
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix PR40816: avoid considering scalar-with-predication instructions as also
uniform-after-vectorization.
Instructions identified as "scalar with predication" will be "vectorized" using
a replicating region. If such instructions are also optimized as "uniform after
vectorization", namely when only the first of VF lanes is used, such a
replicating region becomes erroneous - only the first instance of the region can
and should be formed. Fix such cases by not considering such instructions as
"uniform after vectorization".
Differential Revision: https://reviews.llvm.org/D70298
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I'm not sure what the effect of this change will be on all of the affected
tests or a larger benchmark, but it fixes the horizontal add/sub problems
noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc
The costs are based on reciprocal throughput numbers in Agner's tables for
PEXTR*; these appear to be very slow ops on Silvermont.
This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605
Also, it seems likely that insert/extract is the source of perf regressions on
other CPUs (up to 30%) that were cited as part of the reason to revert D59710,
so maybe we'll extend the table-based approach to other subtargets.
Differential Revision: https://reviews.llvm.org/D70607
|
|
|
|
| |
fix non-X86 bots.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Getelementptr has vector type if any of its operands are vectors
(the scalar operands being implicitly broadcast to all vector elements).
Extractelement applied to a vector getelementptr can be folded by
applying the extractelement in turn to all of the vector operands.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69379
|
|
|
|
|
|
|
|
|
|
| |
Currently we may do iterleaving by more than estimated trip count
coming from the profile or computed maximum trip count. The solution is to
use "best known" trip count instead of exact one in interleaving analysis.
Patch by Evgeniy Brevnov.
Differential Revision: https://reviews.llvm.org/D67948
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
separately in loop-vectorize
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374634
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
separately in loop-vectorize"
Also Revert "[LoopVectorize] Fix non-debug builds after rL374017"
This reverts commit 9f41deccc0e648a006c9f38e11919f181b6c7e0a.
This reverts commit 18b6fe07bcf44294f200bd2b526cb737ed275c04.
The patch is breaking PowerPC internal build, checked with author, reverting
on behalf of him for now due to timezone.
llvm-svn: 374091
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in loop-vectorize
In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not
estimate different register pressure for different register class separately(especially for scalar type,
float type should not be on the same position with int type), so it's not accurate. Specifically,
it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.
So we need classify the register classes in IR level, and importantly these are abstract register classes,
and are not the target register class of backend provided in td file. It's used to establish the mapping between
the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.
For example, POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR),
float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled,
and 3 kinds of register class when VSX is NOT enabled.
It runs on POWER target, it makes big(+~30%) performance improvement in one specific bmk(503.bwaves_r) of spec2017 and no other obvious degressions.
Differential revision: https://reviews.llvm.org/D67148
llvm-svn: 374017
|
|
|
|
| |
llvm-svn: 373913
|
|
|
|
|
|
| |
Turns out I'd already added exactly the same test under the name non_unit_stride.
llvm-svn: 371777
|
|
|
|
| |
llvm-svn: 371769
|
|
|
|
| |
llvm-svn: 371747
|
|
|
|
|
|
|
|
|
|
| |
Implement a TODO from rL371452, and handle loop invariant addresses in predicated blocks. If we can prove that the load is safe to speculate into the header, then we can avoid using a masked.load in favour of a normal load.
This is mostly about vectorization robustness. In the common case, it's generally expected that LICM/LoadStorePromotion would have eliminated such loads entirely.
Differential Revision: https://reviews.llvm.org/D67372
llvm-svn: 371745
|
|
|
|
| |
llvm-svn: 371456
|
|
|
|
| |
llvm-svn: 371455
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we're vectorizing a load in a predicated block, check to see if the load can be speculated rather than predicated. This allows us to generate a normal vector load instead of a masked.load.
To do so, we must prove that all bytes accessed on any iteration of the original loop are dereferenceable, and that all loads (across all iterations) are properly aligned. This is equivelent to proving that hoisting the load into the loop header in the original scalar loop is safe.
Note: There are a couple of code motion todos in the code. My intention is to wait about a day - to be sure this sticks - and then perform the NFC motion without furthe review.
Differential Revision: https://reviews.llvm.org/D66688
llvm-svn: 371452
|
|
|
|
| |
llvm-svn: 371260
|
|
|
|
|
|
|
|
|
|
|
|
| |
Allow vectorizing loops that have reductions when tail is folded by masking.
A select is introduced in VPlan, choosing between the last value carried by the
loop-exit/live-out instruction of the reduction, and the penultimate value
carried by the reduction phi, according to the "i < n" mask of fold-tail.
This select replaces the last value as the live-out value of the loop.
Differential Revision: https://reviews.llvm.org/D66720
llvm-svn: 370173
|
|
|
|
| |
llvm-svn: 369959
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
assume_safety implies that loads under "if's" can be safely executed
speculatively (unguarded, unmasked). However this assumption holds only for the
original user "if's", not those introduced by the compiler, such as the
fold-tail "if" that guards us from loading beyond the original loop trip-count.
Currently the combination of fold-tail and assume-safety pragmas results in
ignoring the fold-tail predicate that guards the loads, generating unmasked
loads. This patch fixes this behavior.
Differential Revision: https://reviews.llvm.org/D66106
Reviewers: Ayal, hsaito, fhahn
llvm-svn: 368973
|
|
|
|
|
|
|
|
|
|
|
| |
This is the compiler-flag equivalent of the Predicate pragma
(https://reviews.llvm.org/D65197), to direct the vectorizer to fold the
remainder-loop into the main-loop using predication.
Differential Revision: https://reviews.llvm.org/D66108
Reviewers: Ayal, hsaito, fhahn, SjoerdMeije
llvm-svn: 368801
|
|
|
|
|
|
|
|
|
|
|
|
| |
trip count that is less than VF*interleave
If we know the trip count, we should make sure the interleave factor won't cause the vectorized loop to exceed it.
Improves one of the cases from PR42674
Differential Revision: https://reviews.llvm.org/D65896
llvm-svn: 368215
|
|
|
|
|
|
|
| |
We do end vectorizing the code, but use an interleave factor that
is too high and causes the vector code to be dead.
llvm-svn: 368197
|
|
|
|
| |
llvm-svn: 367666
|
|
|
|
|
|
|
|
| |
subdirectory.
ps4-buildslave1 reported a failure. The test has x86 triple.
llvm-svn: 367659
|
|
|
|
|
|
|
|
|
|
|
| |
This allows folding of the scalar epilogue loop (the tail) into the main
vectorised loop body when the loop is annotated with a "vector predicate"
metadata hint. To fold the tail, instructions need to be predicated (masked),
enabling/disabling lanes for the remainder iterations.
Differential Revision: https://reviews.llvm.org/D65197
llvm-svn: 367592
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit r365260 which broke the following tests:
Clang :: CodeGenCXX/cfi-mfcall.cpp
Clang :: CodeGenObjC/ubsan-nullability.m
LLVM :: Transforms/LoopVectorize/AArch64/pr36032.ll
llvm-svn: 365284
|
|
|
|
|
|
| |
Without this, we have the unfortunate property that tests are dependent on the order of operads passed the CreateOr and CreateAnd functions. In actual usage, we'd promptly optimize them away, but it made tests slightly more verbose than they should have been.
llvm-svn: 365260
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
loop even after completion
Summary:
Bug: https://bugs.llvm.org/show_bug.cgi?id=39024
The bug reports that a vectorized loop is stepped through 4 times and each step through the loop seemed to show a different path. I found two problems here:
A) An incorrect line number on a preheader block (for.body.preheader) instruction causes a step into the loop before it begins.
B) Instructions in the middle block have different line numbers which give the impression of another iteration.
In this patch I give all of the middle block instructions the line number of the scalar loop latch terminator branch. This seems to provide the smoothest debugging experience because the vectorized loops will always end on this line before dropping into the scalar loop. To solve problem A I have altered llvm::SplitBlockPredecessors to accommodate loop header blocks.
I have set up a separate review D61933 for a fix which is required for this patch.
Reviewers: samsonov, vsk, aprantl, probinson, anemet, hfinkel, jmorse
Reviewed By: hfinkel, jmorse
Subscribers: jmorse, javed.absar, eraman, kcc, bjope, jmellorcrummey, hfinkel, gbedwell, hiraditya, zzheng, llvm-commits
Tags: #llvm, #debug-info
Differential Revision: https://reviews.llvm.org/D60831
> llvm-svn: 363046
llvm-svn: 363786
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When considering a loop containing nontemporal stores or loads for
vectorization, suppress the vectorization if the corresponding
vectorized store or load with the aligment of the original scaler
memory op is not supported with the nontemporal hint on the target.
This adds two new functions:
bool isLegalNTStore(Type *DataType, unsigned Alignment) const;
bool isLegalNTLoad(Type *DataType, unsigned Alignment) const;
to TTI, leaving the target independent default implementation as
returning true, but with overriding implementations for X86 that
check the legality based on available Subtarget features.
This fixes https://llvm.org/PR40759
Differential Revision: https://reviews.llvm.org/D61764
llvm-svn: 363581
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Avoid that loop vectorizer creates loads/stores of vectors
with "irregular" types when interleaving. An example of
an irregular type is x86_fp80 that is 80 bits, but that
may have an allocation size that is 96 bits. So an array
of x86_fp80 is not bitcast compatible with a vector
of the same type.
Not sure if interleavedAccessCanBeWidened is the best
place for this check, but it solves the problem seen
in the added test case. And it is the same kind of check
that already exists in memoryInstructionCanBeWidened.
Reviewers: fhahn, Ayal, craig.topper
Reviewed By: fhahn
Subscribers: hiraditya, rkruppe, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63386
llvm-svn: 363547
|
|
|
|
| |
llvm-svn: 363538
|
|
|
|
|
|
|
|
|
|
|
|
| |
InsertBinop now accepts NoWrapFlags, so pass them through when
expanding a simple add expression.
This is the first re-commit of the functional changes from rL362687,
which was previously reverted.
Differential Revision: https://reviews.llvm.org/D61934
llvm-svn: 363364
|
|
|
|
|
|
|
|
|
| |
through loop even after completion"
This reverts commit 1a0f7a2077b70c9864faa476e15b048686cf1ca7.
See phabricator thread for D60831.
llvm-svn: 363132
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
loop even after completion
Summary:
Bug: https://bugs.llvm.org/show_bug.cgi?id=39024
The bug reports that a vectorized loop is stepped through 4 times and each step through the loop seemed to show a different path. I found two problems here:
A) An incorrect line number on a preheader block (for.body.preheader) instruction causes a step into the loop before it begins.
B) Instructions in the middle block have different line numbers which give the impression of another iteration.
In this patch I give all of the middle block instructions the line number of the scalar loop latch terminator branch. This seems to provide the smoothest debugging experience because the vectorized loops will always end on this line before dropping into the scalar loop. To solve problem A I have altered llvm::SplitBlockPredecessors to accommodate loop header blocks.
I have set up a separate review D61933 for a fix which is required for this patch.
Reviewers: samsonov, vsk, aprantl, probinson, anemet, hfinkel, jmorse
Reviewed By: hfinkel, jmorse
Subscribers: jmorse, javed.absar, eraman, kcc, bjope, jmellorcrummey, hfinkel, gbedwell, hiraditya, zzheng, llvm-commits
Tags: #llvm, #debug-info
Differential Revision: https://reviews.llvm.org/D60831
llvm-svn: 363046
|
|
|
|
|
|
| |
This reverts commit r362687. Miscompiles llvm-profdata during selfhost.
llvm-svn: 362699
|
|
|
|
|
|
|
|
|
|
| |
If the given SCEVExpr has no (un)signed flags attached to it, transfer
these to the resulting instruction or use them to find an existing
instruction.
Differential Revision: https://reviews.llvm.org/D61934
llvm-svn: 362687
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A mixture of internal tests and review of the scheduler models indicates we're overestimating the cost of a masked load, which we're estimating at 4x regular memory ops - more realistic values indicates that its closer to 2x. Masked stores costs are a lot more diverse but 8x is roughly in the middle of the range.
e.g. SandyBridge
defm : X86WriteRes<WriteFMaskedLoad, [SBPort23,SBPort05], 8, [1,2], 3>;
defm : X86WriteRes<WriteFMaskedLoadY, [SBPort23,SBPort05], 9, [1,2], 3>;
defm : X86WriteRes<WriteFMaskedStore, [SBPort4,SBPort01,SBPort23], 5, [1,1,1], 3>;
defm : X86WriteRes<WriteFMaskedStoreY, [SBPort4,SBPort01,SBPort23], 5, [1,1,1], 3>;
e.g. Btver2
defm : X86WriteRes<WriteFMaskedLoad, [JLAGU, JFPU01, JFPX], 6, [1, 2, 2], 1>;
defm : X86WriteRes<WriteFMaskedLoadY, [JLAGU, JFPU01, JFPX], 6, [2, 4, 4], 2>;
defm : X86WriteRes<WriteFMaskedStore, [JSAGU, JFPU01, JFPX], 6, [1, 1, 4], 1>;
defm : X86WriteRes<WriteFMaskedStoreY, [JSAGU, JFPU01, JFPX], 6, [2, 2, 4], 2>;
Differential Revision: https://reviews.llvm.org/D61257
llvm-svn: 362338
|
|
|
|
|
|
| |
Differential Revision: https://reviews.llvm.org/D62510
llvm-svn: 362124
|
|
|
|
| |
llvm-svn: 362060
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This test file has a long history of edits from changes outside
of vectorization, and it would happen again with the proposal in
D61726.
End-to-end testing shouldn't be happening in a test file that is
specifically checking for vector masked load/store ops.
Larger-scale testing goes in PhaseOrdering or the test-suite.
I've hopefully preserved the intent by taking what was completely
unoptimized IR in some tests and passing that through the -O1
pipeline. That becomes the input IR, and now we just run the loop
vectorizer and verify that the vector masked ops are produced as
expected.
llvm-svn: 360340
|
|
|
|
| |
llvm-svn: 360190
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
loop even after completion
Summary:
Bug: https://bugs.llvm.org/show_bug.cgi?id=39024
The bug reports that a vectorized loop is stepped through 4 times and each step through the loop seemed to show a different path. I found two problems here:
A) An incorrect line number on a preheader block (for.body.preheader) instruction causes a step into the loop before it begins.
B) Instructions in the middle block have different line numbers which give the impression of another iteration.
In this patch I give all of the middle block instructions the line number of the scalar loop latch terminator branch. This seems to provide the smoothest debugging experience because the vectorized loops will always end on this line before dropping into the scalar loop. To solve problem A I have altered llvm::SplitBlockPredecessors to accommodate loop header blocks.
Reviewers: samsonov, vsk, aprantl, probinson, anemet, hfinkel
Reviewed By: hfinkel
Subscribers: bjope, jmellorcrummey, hfinkel, gbedwell, hiraditya, zzheng, llvm-commits
Tags: #llvm, #debug-info
Differential Revision: https://reviews.llvm.org/D60831
llvm-svn: 360162
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Currently we express umin as `~umax(~x, ~y)`. However, this becomes
a problem for operands in non-integral pointer spaces, because `~x`
is not something we can compute for `x` non-integral. However, since
comparisons are generally still allowed, we are actually able to
express `umin(x, y)` directly as long as we don't try to express is
as a umax. Support this by adding an explicit umin/smin representation
to SCEV. We do this by factoring the existing getUMax/getSMax functions
into a new function that does all four. The previous two functions were
largely identical.
Reviewed By: sanjoy
Differential Revision: https://reviews.llvm.org/D50167
llvm-svn: 360159
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Match NewPassManager behavior: add option for interleaved loops in the
old pass manager, and use that instead of the flag used to disable loop unroll.
No changes in the defaults.
Reviewers: chandlerc
Subscribers: mehdi_amini, jlebar, dmgreen, hsaito, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61030
llvm-svn: 359615
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
When refactoring vectorization flags, vectorization was disabled by default in the new pass manager.
This patch re-enables is for both managers, and changes the assumptions opt makes, based on the new defaults.
Comments in opt.cpp should clarify the intended use of all flags to enable/disable vectorization.
Reviewers: chandlerc, jgorbe
Subscribers: jlebar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61091
llvm-svn: 359167
|