| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a follow-up to D70607 where we made any
extract element on SLM more costly than default. But that is
pessimistic for extract from element 0 because that corresponds
to x86 movd/movq instructions. These generally have >1 cycle
latency, but they are probably implemented as single uop
instructions.
Note that no vectorization tests are affected by this change.
Also, no targets besides SLM are affected because those are
falling through to the default cost of 1 anyway. But this will
become visible/important if we add more specializations via cost
tables.
Differential Revision: https://reviews.llvm.org/D71023
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I'm not sure what the effect of this change will be on all of the affected
tests or a larger benchmark, but it fixes the horizontal add/sub problems
noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc
The costs are based on reciprocal throughput numbers in Agner's tables for
PEXTR*; these appear to be very slow ops on Silvermont.
This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605
Also, it seems likely that insert/extract is the source of perf regressions on
other CPUs (up to 30%) that were cited as part of the reason to revert D59710,
so maybe we'll extend the table-based approach to other subtargets.
Differential Revision: https://reviews.llvm.org/D70607
|
|
|
|
|
|
|
|
| |
This is no longer needed after widening legalization as we
custom legalize v8i8 ourselves.
Added entries to the cost model, but bumped the cost slightly
to account for the truncate shuffle that wasn't costed before.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ShuffleVectorInst::isExtractSubvectorMask, introduced in
[CostModel] Add SK_ExtractSubvector handling to getInstructionThroughput (PR39368)
erroneously thought that
%340 = shufflevector <4 x float> %339, <4 x float> undef, <3 x i32> <i32 2, i32 3, i32 undef>
is a subvector extract, even though it goes off the end of the parent
vector with the undef index. That then caused an assert in
BasicTTIImplBase::getExtractSubvectorOverhead.
This commit fixes that, by not considering the above a subvector
extract.
Differential Revision: https://reviews.llvm.org/D70005
Change-Id: I87b8b00b24bef19ffc9a1b82ef4eca3b8a246eaf
|
|
|
|
| |
As noted on D59710 we weren't handling the high costs of these operations on SLM.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
2*log2(bitwidth)+1 for legal types.
This better represents the kshift+binop we'd get for each stage
before the final extract. Its likely we'll do even better by
doing a kmov and a cmp with a GPR, but this is a good start.
The default handling was costing a worst case single source
permute shuffle of the vector before the binop. This worst
case assumes the shuffle might have to be emulated with
extracts and inserts. But since we know we're doing a reduction
we can assume we'll get kshift lowering.
There's still some room for improvement here, but this is
much better than it was.
|
|
|
|
|
|
|
|
| |
Add specific scalar costs for CTLZ instructions, we can't discriminate between CTLZ and CTLZ_ZERO_UNDEF so we have to assume the worst. Given how BSR is often a microcoded nightmare on some older targets we might still be underestimating it.
For targets supporting LZCNT (Intel Haswell+ or AMD Fam10+), we provide overrides that assume 1cy costs.
llvm-svn: 374786
|
|
|
|
|
|
|
|
| |
Add specific scalar costs for ctpop instructions, these are based on the llvm-mca's SLM throughput numbers (the oldest model we have).
For targets supporting POPCNT, we provide overrides that assume 1cy costs.
llvm-svn: 374775
|
|
|
|
|
|
|
|
| |
I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2.
I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674.
llvm-svn: 374655
|
|
|
|
|
|
| |
indices
llvm-svn: 374161
|
|
|
|
|
|
| |
element indices
llvm-svn: 374160
|
|
|
|
|
|
|
|
| |
SLM is 2 x slower for <2 x i64> comparison ops than other vector types, we should account for this like we do for SLM <2 x i64> add/sub/mul costs.
This should remove some of the SLM codegen diffs in D43582
llvm-svn: 372954
|
|
|
|
|
|
| |
The AVX512 cases still need some work to correct recognise the PMOV truncation cases.
llvm-svn: 372514
|
|
|
|
|
|
|
|
| |
We are missing costs for a lot of truncation cases, I'm hoping to address all the 'zero cost' cases in trunc.ll
I thought this was a vector widening side effect, but even before this we had some interesting LV decisions (notably over indvars) being made due to these zero costs.
llvm-svn: 372498
|
|
|
|
| |
llvm-svn: 370684
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
As disscussed in https://reviews.llvm.org/D65148#1606412,
`extractvalue` don't actually generate any code,
so we should treat them as free.
Reviewers: craig.topper, RKSimon, jnspaulsson, greened, asb, t.p.northover, jmolloy, dmgreen
Reviewed By: jmolloy
Subscribers: javed.absar, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66098
llvm-svn: 370339
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
legalization.
I don't really understand the costs we're using for fp_to_sint,
but prior to widening legalization we used 20 as the cost for this
via the v2i64->v2f64 entry. That number seems better than the 40
we got with widening legalization. So now we need either a
v2i32->v2f64 entry or a v4i32->v2f64 entry depending on whether
AVX is enabled or not since we skip the first SSE2 table look up
under AVX.
llvm-svn: 369628
|
|
|
|
|
|
|
|
| |
Now that we're using widening legalization. We need to improve our extract_subvector cost model for these types. This patch begins by modeling these as a subvector extract followed by a permute. I've left FIXMEs in the code for future improvements.
Differential Revision: https://reviews.llvm.org/D65892
llvm-svn: 369022
|
|
|
|
|
|
|
|
|
|
|
|
| |
128-bit inputs
Now that we legalize by widening, the element types here won't change. Previously these were modeled as the elements being widened and then the instruction might become an AND or SHL/ASHR pair. But now they'll become something like a ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG.
For AVX2, when the destination type is legal its clear the cost should be 1 since we have extend instructions that can produce 256 bit vectors from less than 128 bit vectors. I'm a little less sure about AVX1 costs, but I think the ones I changed were definitely too high, but they might still be too high.
Differential Revision: https://reviews.llvm.org/D66169
llvm-svn: 368858
|
|
|
|
|
|
| |
These tests don't cover many cases where the subvectors don't start on aligned indices, but that can be added later.
llvm-svn: 368839
|
|
|
|
|
|
| |
We don't have full 512-bit test coverage yet - but there's enough to help test D65892
llvm-svn: 368716
|
|
|
|
| |
llvm-svn: 368595
|
|
|
|
|
|
|
| |
In https://reviews.llvm.org/D65148 it is suggested that
it should have zero cost, always.
llvm-svn: 368548
|
|
|
|
|
|
| |
smaller element sizes and smaller than 128-bit vectors."
llvm-svn: 368185
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
-x86-experimental-vector-widening-legalization by default."
The assert that caused this to be reverted should be fixed now.
Original commit message:
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 368183
|
|
|
|
|
|
|
|
|
|
|
| |
element sizes and smaller than 128-bit vectors."
This reverts commit fc33e33776b7a7ce22e539f0ec2e3bfdb09ad361.
This commit depends on the rolled back commit rL367901, and thus needs
to be rolled back.
llvm-svn: 368109
|
|
|
|
|
|
|
|
|
| |
This reverts commit 3de33245d2c992c9e0af60372043540b60f3a810.
This commit broke the MSan buildbots. See
https://reviews.llvm.org/rL367901 for more information.
llvm-svn: 368107
|
|
|
|
|
|
|
|
|
| |
and smaller than 128-bit vectors.
With the switch to widening legalization, we need to a better
job of costing extractions of less than 128-bits.
llvm-svn: 368081
|
|
|
|
|
|
|
|
|
| |
test/Analysis/CostModel/X86/
This flag is now the default behavior so we don't need separate
tests.
llvm-svn: 368080
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 367901
|
|
|
|
| |
llvm-svn: 363538
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch uses the mechanism from D62995 to strengthen the
definitions of the reduction intrinsics by letting the scalar
result/accumulator type be overloaded from the vector element type.
For example:
; The LLVM LangRef specifies that the scalar result must equal the
; vector element type, but this is not checked/enforced by LLVM.
declare i32 @llvm.experimental.vector.reduce.or.i32.v4i32(<4 x i32> %a)
This patch changes that into:
declare i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> %a)
Which has the type-constraint more explicit and causes LLVM to check
the result type with the vector element type.
Reviewers: RKSimon, arsenm, rnk, greened, aemerson
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D62996
llvm-svn: 363240
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A mixture of internal tests and review of the scheduler models indicates we're overestimating the cost of a masked load, which we're estimating at 4x regular memory ops - more realistic values indicates that its closer to 2x. Masked stores costs are a lot more diverse but 8x is roughly in the middle of the range.
e.g. SandyBridge
defm : X86WriteRes<WriteFMaskedLoad, [SBPort23,SBPort05], 8, [1,2], 3>;
defm : X86WriteRes<WriteFMaskedLoadY, [SBPort23,SBPort05], 9, [1,2], 3>;
defm : X86WriteRes<WriteFMaskedStore, [SBPort4,SBPort01,SBPort23], 5, [1,1,1], 3>;
defm : X86WriteRes<WriteFMaskedStoreY, [SBPort4,SBPort01,SBPort23], 5, [1,1,1], 3>;
e.g. Btver2
defm : X86WriteRes<WriteFMaskedLoad, [JLAGU, JFPU01, JFPX], 6, [1, 2, 2], 1>;
defm : X86WriteRes<WriteFMaskedLoadY, [JLAGU, JFPU01, JFPX], 6, [2, 4, 4], 2>;
defm : X86WriteRes<WriteFMaskedStore, [JSAGU, JFPU01, JFPX], 6, [1, 1, 4], 1>;
defm : X86WriteRes<WriteFMaskedStoreY, [JSAGU, JFPU01, JFPX], 6, [2, 2, 4], 2>;
Differential Revision: https://reviews.llvm.org/D61257
llvm-svn: 362338
|
|
|
|
| |
llvm-svn: 362083
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
FNeg instruction.
Summary:
This reuses the getArithmeticInstrCost, but passes dummy values of the second
operand flags.
The X86 costs are wrong and can be improved in a follow up. I just wanted to
stop it from reporting an unknown cost first.
Reviewers: RKSimon, spatel, andrew.w.kaylor, cameron.mcinally
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62444
llvm-svn: 361788
|
|
|
|
| |
llvm-svn: 361745
|
|
|
|
|
|
|
|
| |
The original costs stopped at SSE42, I've added conservative estimates for everything down to SSE1/SSE2 and moved some of the SSE42 costs to SSE41 (really only the addition of PCMPGT makes any difference).
I've also added missing vXi8 costs (we use PHMINPOSUW for i8/i16 for scarily quick results) and 256-bit vector costs for AVX1.
llvm-svn: 360528
|
|
|
|
|
|
|
|
| |
On pre-AVX512 targets we can use MOVMSK to extract reduced boolean results. This is properly optimized, annoyingly AVX512 isn't and produces code that is almost as bad as the (unchanged) costs suggest......
Differential Revision: https://reviews.llvm.org/D60403
llvm-svn: 358574
|
|
|
|
|
|
| |
load/store/gather/scatter/expand/compress cost tests
llvm-svn: 357838
|
|
|
|
|
|
| |
Based off smul/umul fixed costs and the implementation in TargetLowering::expandMULO.
llvm-svn: 354784
|
|
|
|
|
|
|
|
| |
Based on an IR equivalent of target lowering's generic expansion - target specific costs will typically be lower (IR doesn't have a good mull/mulh equivalent) but we need a baseline.
Differential Revision: https://reviews.llvm.org/D57925
llvm-svn: 354774
|
|
|
|
| |
llvm-svn: 353153
|
|
|
|
|
|
|
|
|
| |
Followup to D56636, this time handling the UADDSAT case by expanding
uadd.sat(a, b) to umin(a, ~b) + b.
Differential Revision: https://reviews.llvm.org/D56869
llvm-svn: 352409
|
|
|
|
|
|
|
|
|
|
| |
Add generic costs calculation for SADDSAT/SSUBSAT intrinsics, this uses generic costs for sadd_with_overflow/ssub_with_overflow, an extra sign comparison + a selects based on the sign/overflow.
This completes PR40316
Differential Revision: https://reviews.llvm.org/D57239
llvm-svn: 352315
|
|
|
|
| |
llvm-svn: 352046
|
|
|
|
|
|
| |
Added x86 scalar sadd_with_overflow/ssub_with_overflow costs.
llvm-svn: 352045
|
|
|
|
|
|
|
|
| |
Add generic costs calculation for UADDSAT/USUBSAT intrinsics, this fallbacks to using generic costs for uadd_with_overflow/usub_with_overflow + a select.
Differential Revision: https://reviews.llvm.org/D56907
llvm-svn: 352044
|
|
|
|
|
|
|
|
| |
Added x86 scalar uadd_with_overflow/usub_with_overflow costs.
Differential Revision: https://reviews.llvm.org/D56907
llvm-svn: 352043
|