summaryrefslogtreecommitdiffstats
path: root/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
Commit message (Collapse)AuthorAgeFilesLines
...
* [CostModel][X86] Add UDIV/UREM by pow2 costsSimon Pilgrim2018-07-051-15/+29
| | | | | | Normally InstCombine would have simplified these to SRL/AND instructions but we may still see these during SLP vectorization etc. llvm-svn: 336371
* [X86][AVX] Reduce v4f64/v4i64 shuffle costs (PR37882)Simon Pilgrim2018-06-211-4/+4
| | | | | | These were being over cautious for costs for one/two op general shuffles - VSHUFPD doesn't have to replicate the same shuffle in both lanes like VSHUFPS does. llvm-svn: 335216
* [CostModel] Replace ShuffleKind::SK_Alternate with ShuffleKind::SK_Select ↵Simon Pilgrim2018-06-121-24/+24
| | | | | | | | | | | | | | | | | | (PR33744) As discussed on PR33744, this patch relaxes ShuffleKind::SK_Alternate which requires shuffle masks to only match an alternating pattern from its 2 sources: e.g. v4f32: <0,5,2,7> or <4,1,6,3> This seems far too restrictive as most SIMD hardware which will implement it using a general blend/bit-select instruction, so replaces it with SK_Select, permitting elements from either source as long as they are inline: e.g. v4f32: <0,5,2,7>, <4,1,6,3>, <0,1,6,7>, <4,1,2,3> etc. This initial patch just updates the name and cost model shuffle mask analysis, later patch reviews will update SLP to better utilise this - it still limits itself to SK_Alternate style patterns. Differential Revision: https://reviews.llvm.org/D47985 llvm-svn: 334513
* [TTI] Add uniform/non-uniform constant Pow2 detection to ↵Simon Pilgrim2018-05-221-3/+4
| | | | | | | | | | | | | | | | TargetTransformInfo::getInstructionThroughput This enables us to detect more fast path sdiv cases under cost analysis. This patch also enables us to handle non-uniform-constant pow2 cases for X86 SDIV costs. Found while working on D46276 Future patches can then extend the vectorizers to more fully support non-uniform pow2 cases. Differential Revision: https://reviews.llvm.org/D46637 llvm-svn: 332969
* Remove \brief commands from doxygen comments.Adrian Prantl2018-05-011-1/+1
| | | | | | | | | | | | | | | | We've been running doxygen with the autobrief option for a couple of years now. This makes the \brief markers into our comments redundant. Since they are a visual distraction and we don't want to encourage more \brief markers in new code either, this patch removes them all. Patch produced by for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done Differential Revision: https://reviews.llvm.org/D46290 llvm-svn: 331272
* [CostModel][X86] Remove hard coded SDIV/UDIV vector costsSimon Pilgrim2018-04-251-37/+13
| | | | | | Algorithmically compute the 'x20' SDIV/UDIV vector costs - this is necessary for PR36550 when DIV costs will be driven from the scheduler models. llvm-svn: 330870
* [CostModel][X86] Recursive call for cost of imul for packed v16i16 constant ↵Simon Pilgrim2018-04-251-1/+3
| | | | | | | | shift left. Don't just assume cost = 1. llvm-svn: 330834
* [CostModel][X86] Fix v32i16/v64i8 SETCC costs on AVX512BW targetsSimon Pilgrim2018-04-071-0/+9
| | | | llvm-svn: 329498
* [X86] Update cost model for Goldmont. Add fsqrt costs for SilvermontCraig Topper2018-03-251-15/+48
| | | | | | | | | | | | | | Add fdiv costs for Goldmont using table 16-17 of the Intel Optimization Manual. Also add overrides for FSQRT for Goldmont and Silvermont. Reviewers: RKSimon Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D44644 llvm-svn: 328451
* [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280)Simon Pilgrim2018-02-261-0/+30
| | | | | | | | | | Agner's tables indicate that for SSE42+ targets (Core2 and later) we can reduce the FADD/FSUB/FMUL costs down to 1, which should fix the Himeno benchmark. Note: the AVX512 FDIV costs look rather dodgy, but this isn't part of this patch. Differential Revision: https://reviews.llvm.org/D43733 llvm-svn: 326133
* [X86][SSE] Increase PMULLD costs to better match hardwareSimon Pilgrim2018-02-101-3/+5
| | | | | | Until Skylake, most hardware could only issue a PMULLD op every other cycle llvm-svn: 324823
* [LoopStrengthReduce, x86] don't add cost for a cmp that will be macro-fused ↵Sanjay Patel2018-02-051-0/+4
| | | | | | | | | | | | | | | (PR35681) In the motivating case from PR35681 and represented by the macro-fuse-cmp test: https://bugs.llvm.org/show_bug.cgi?id=35681 ...there's a 37 -> 31 byte size win for the loop because we eliminate the big base address offsets. SPEC2017 on Ryzen shows no significant perf difference. Differential Revision: https://reviews.llvm.org/D42607 llvm-svn: 324289
* Spelling mistake in comment. NFCI.Simon Pilgrim2018-01-301-1/+1
| | | | llvm-svn: 323752
* [X86] Add support for passing 'prefer-vector-width' function attribute into ↵Craig Topper2018-01-201-4/+5
| | | | | | | | | | | | X86Subtarget and exposing via X86's getRegisterWidth TTI interface. This will cause the vectorizers to do some limiting of the vector widths they create. This is not a strict limit. There are reasons I know of that the loop vectorizer will generate larger vectors for. I've written this in such a way that the interface will only return a properly supported width(0/128/256/512) even if the attribute says something funny like 384 or 10. This has been split from D41895 with the remainder in a follow up commit. llvm-svn: 323015
* [COST]Fix PR35865: Fix cost model evaluation for shuffle on X86.Alexey Bataev2018-01-091-1/+2
| | | | | | | | | | | | | | Summary: If the vector type is transformed to non-vector single type, the compile may crash trying to get vector information about non-vector type. Reviewers: RKSimon, spatel, mkuper, hfinkel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D41862 llvm-svn: 322106
* [X86] Simplify the TTI code for getInterleavedMemoryOpCost around for ↵Craig Topper2017-12-061-9/+4
| | | | | | | | | | AVX512BW. NFCI Previously the lambda for AVX512 passed out a flag that indicated whether AVX512BW was required and that was checked against the AVX512BW subtarget flag outside. This patch changes the interface to pass the AVX512BW subtarget bit in and return its value if we detect 16 or 8 bit types. llvm-svn: 319919
* [PartiallyInlineLibCalls][x86] add TTI hook to allow sqrt inlining to depend ↵Sanjay Patel2017-11-271-0/+4
| | | | | | | | | | | on arg rather than result This should fix PR31455: https://bugs.llvm.org/show_bug.cgi?id=31455 Differential Revision: https://reviews.llvm.org/D28314 llvm-svn: 319094
* [X86] Don't report gather is legal on Skylake CPUs when AVX2/AVX512 is ↵Craig Topper2017-11-251-3/+5
| | | | | | | | | | | | | | | | | | | disabled. Allow gather on SKX/CNL/ICL when AVX512 is disabled by using AVX2 instructions. Summary: This adds a new fast gather feature bit to cover all CPUs that support fast gather that we can use independent of whether the AVX512 feature is enabled. I'm only using this new bit to qualify AVX2 codegen. AVX512 is still implicitly assuming fast gather to keep tests working and to match the scatter behavior. Test command lines have been added for these two cases. Reviewers: magabari, delena, RKSimon, zvi Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D40282 llvm-svn: 318983
* [X86] Spell penryn correctly in some comments. NFCCraig Topper2017-11-221-3/+3
| | | | llvm-svn: 318855
* [LV][X86] Support of AVX2 Gathers code generation and update the LV with thisMohammed Agabaria2017-11-201-6/+13
| | | | | | | | | | | | | | | This patch depends on: https://reviews.llvm.org/D35348 Support of pattern selection of masked gathers of AVX2 (X86\AVX2 code gen) Update LoopVectorize to generate gathers for AVX2 processors. Reviewers: delena, zvi, RKSimon, craig.topper, aaboud, igorb Reviewed By: delena, RKSimon Differential Revision: https://reviews.llvm.org/D35772 llvm-svn: 318641
* Fix a bunch more layering of CodeGen headers that are in TargetDavid Blaikie2017-11-171-2/+2
| | | | | | | | All these headers already depend on CodeGen headers so moving them into CodeGen fixes the layering (since CodeGen depends on Target, not the other way around). llvm-svn: 318490
* [TTI][X86] update costs of interleaved load\store of i64\doubleMohammed Agabaria2017-11-161-0/+6
| | | | | | | | | | | | This patch contains more accurate cost of interelaved load\store of stride 2 for the types int64\double on AVX2. Reviewers: delena, RKSimon, craig.topper, dorit Reviewed By: dorit Differential Revision: https://reviews.llvm.org/D40008 llvm-svn: 318385
* [X86] Update TTI to report that v1iX/v1fX types aren't legal for masked ↵Craig Topper2017-11-161-2/+10
| | | | | | | | gather/scatter/load/store. The type legalizer will try to scalarize these operations if it sees them, but there is no handling for scalarizing them. This leads to a fatal error. With this change they will now be scalarized by the mem intrinsic scalarizing pass before SelectionDAG. llvm-svn: 318380
* [SLP] Fix PR35047: Fix default cost model for cast op in X86.Alexey Bataev2017-11-071-1/+1
| | | | | | | | | | | | | | | Summary: The cost calculation for default case on X86 target does not always follow correct wayt because of missing 4-th argument in `BaseT::getCastInstrCost()` call. Added this missing parameter. Reviewers: hfinkel, mkuper, RKSimon, spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D39687 llvm-svn: 317576
* [LV][X86] update the cost of interleaving mem. access of floatsMohammed Agabaria2017-11-061-1/+4
| | | | | | | | | | Recommit: This patch contains update of the costs of interleaved loads of v8f32 of stride 3 and 8. fixed the location of the lit test it works with make check-all. Differential Revision: https://reviews.llvm.org/D39403 llvm-svn: 317471
* [REVERT][LV][X86] update the cost of interleaving mem. access of floatsMohammed Agabaria2017-11-051-4/+1
| | | | | | | | | reverted my changes will be committed later after fixing the failure This patch contains update of the costs of interleaved loads of v8f32 of stride 3 and 8. Differential Revision: https://reviews.llvm.org/D39403 llvm-svn: 317433
* [LV][X86] update the cost of interleaving mem. access of floatsMohammed Agabaria2017-11-051-1/+4
| | | | | | | | This patch contains update of the costs of interleaved loads of v8f32 of stride 3 and 8. Differential Revision: https://reviews.llvm.org/D39403 llvm-svn: 317432
* [CodeGen][ExpandMemcmp] Allow memcmp to expand to vector loads (2).Clement Courbet2017-10-301-4/+29
| | | | | | | | | | | | - Targets that want to support memcmp expansions now return the list of supported load sizes. - Expansion codegen does not assume that all power-of-two load sizes smaller than the max load size are valid. For examples, this is not the case for x86(32bit)+sse2. Fixes PR34887. llvm-svn: 316905
* [AVX512][AVX2]Cost calculation for interleave load/store patterns ↵Michael Zuckerman2017-10-181-7/+43
| | | | | | | | | | | | | | | | | | | {v8i8,v16i8,v32i8,v64i8} This patch adds accurate instructions cost. The formula presents two cases(stride 3 and stride 4) and calculates the cost according to the VF and stride. Reviewers: 1. delena 2. Farhana 3. zvi 4. dorit 5. Ayal Differential Revision: https://reviews.llvm.org/D38762 Change-Id: If4cfbd4ac0e63694e8144cb78c7fa34850647ff7 llvm-svn: 316072
* [CodeGenPrepare][NFC] Rename TargetTransformInfo::expandMemCmp -> ↵Clement Courbet2017-09-251-1/+1
| | | | | | | | | | | | | | | | TargetTransformInfo::enableMemCmpExpansion. Summary: Right now there are two functions with the same name, one does the work and the other one returns true if expansion is needed. Rename TargetTransformInfo::expandMemCmp to make it more consistent with other members of TargetTransformInfo. Remove the unused Instruction* parameter. Differential Revision: https://reviews.llvm.org/D38165 llvm-svn: 314096
* [DivRempairs] add a pass to optimize div/rem pairs (PR31028)Sanjay Patel2017-09-091-0/+5
| | | | | | | | | | | | | | | | | | This is intended to be a superset of the functionality from D31037 (EarlyCSE) but implemented as an independent pass, so there's no stretching of scope and feature creep for an existing pass. I also proposed a weaker version of this for SimplifyCFG in D30910. And I initially had almost this same functionality as an addition to CGP in the motivating example of PR31028: https://bugs.llvm.org/show_bug.cgi?id=31028 The advantage of positioning this ahead of SimplifyCFG in the pass pipeline is that it can allow more flattening. But it needs to be after passes (InstCombine) that could sink a div/rem and undo the hoisting that is done here. Decomposing remainder may allow removing some code from the backend (PPC and possibly others). Differential Revision: https://reviews.llvm.org/D37121 llvm-svn: 312862
* [SLP] Support for horizontal min/max reduction.Alexey Bataev2017-09-081-0/+146
| | | | | | | | | | | | | SLP vectorizer supports horizontal reductions for Add/FAdd binary operations. Patch adds support for horizontal min/max reductions. Function getReductionCost() is split to getArithmeticReductionCost() for binary operation reductions and getMinMaxReductionCost() for min/max reductions. Patch fixes PR26956. Differential revision: https://reviews.llvm.org/D27846 llvm-svn: 312791
* X86: Improve AVX512 fptoui loweringZvi Rackover2017-09-071-0/+4
| | | | | | | | | | | | | | | | | Summary: Add patterns for fptoui <16 x float> to <16 x i8> fptoui <16 x float> to <16 x i16> Reviewers: igorb, delena, craig.topper Reviewed By: craig.topper Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D37505 llvm-svn: 312704
* Model cache size and associativity in TargetTransformInfoTobias Grosser2017-08-241-0/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: We add the precise cache sizes and associativity for the following Intel architectures: - Penry - Nehalem - Westmere - Sandy Bridge - Ivy Bridge - Haswell - Broadwell - Skylake - Kabylake Polly uses since several months a performance model for BLAS computations that derives optimal cache and register tile sizes from cache and latency information (based on ideas from "Analytical Modeling Is Enough for High-Performance BLIS", by Tze Meng Low published at TOMS 2016). While bootstrapping this model, these target values have been kept in Polly. However, as our implementation is now rather mature, it seems time to teach LLVM itself about cache sizes. Interestingly, L1 and L2 cache sizes are pretty constant across micro-architectures, hence a set of architecture specific default values seems like a good start. They can be expanded to more target specific values, in case certain newer architectures require different values. For now a set of Intel architectures are provided. Just as a little teaser, for a simple gemm kernel this model allows us to improve performance from 1.2s to 0.27s. For gemm kernels with less optimal memory layouts even larger speedups can be reported. Reviewers: Meinersbur, bollu, singam-sanjay, hfinkel, gareevroman, fhahn, sebpop, efriedma, asb Reviewed By: fhahn, asb Subscribers: lsaba, asb, pollydev, llvm-commits Differential Revision: https://reviews.llvm.org/D37051 llvm-svn: 311647
* Changed basic cost of store operation on X86Elena Demikhovsky2017-08-201-0/+15
| | | | | | | | | Store operation takes 2 UOps on X86 processors. The exact cost calculation affects several optimization passes including loop unroling. This change compensates performance degradation caused by https://reviews.llvm.org/D34458 and shows improvements on some benchmarks. Differential Revision: https://reviews.llvm.org/D35888 llvm-svn: 311285
* [CostModel][X86][XOP] Improve costs for XOP shufflesSimon Pilgrim2017-08-161-0/+22
| | | | | | VPPERM/VPERMIL2PD/VPERMIL2PS all provide more effective 2-input shuffles than regular AVX instructions llvm-svn: 311005
* [CostModel][X86] Add SSE2 two-src shuffle costsSimon Pilgrim2017-08-101-0/+2
| | | | llvm-svn: 310654
* [CostModel][X86] Add avx1 two-src shuffle costsSimon Pilgrim2017-08-101-0/+9
| | | | llvm-svn: 310650
* [CostModel][X86] Add avx2 two-src shuffle costsSimon Pilgrim2017-08-101-2/+11
| | | | llvm-svn: 310645
* [CostModel][X86] Improve single src shuffle costsSimon Pilgrim2017-08-101-11/+36
| | | | | | Add missing SK_PermuteSingleSrc costs for AVX2 targets and earlier, also added some of the simpler SK_PermuteTwoSrc costs to support splitting of SK_PermuteSingleSrc shuffles llvm-svn: 310632
* Reapply fix PR23384 (part 3 of 3) r304824 (was reverted in r305720).Evgeny Stupachenko2017-08-071-0/+11
| | | | | | | | | | | | | | | | The root cause of reverting was fixed - PR33514. Summary: The patch makes instruction count the highest priority for LSR solution for X86 (previously registers had highest priority). Reviewers: qcolombet Differential Revision: http://reviews.llvm.org/D30562 From: Evgeny Stupachenko <evstupac@gmail.com> <evgeny.v.stupachenko@intel.com> llvm-svn: 310289
* Strip trailing whitespace. NFCI.Simon Pilgrim2017-07-311-7/+7
| | | | llvm-svn: 309584
* [Cost] Rename getReductionCost() to getArithmeticReductionCost(), NFC.Alexey Bataev2017-07-311-3/+3
| | | | llvm-svn: 309563
* [X86][CM] update add\sub costs of vectors of 64 in X86\SLM archMohammed Agabaria2017-07-021-4/+9
| | | | | | | | | this patch updates the cost of addq\subq (add\subtract of vectors of 64bits) based on the performance numbers of SLM arch. Differential Revision: https://reviews.llvm.org/D33983 llvm-svn: 306974
* [AVX2] [TTI CostModel] Add cost of interleaved loads/stores for AVX2Dorit Nuzman2017-06-251-0/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | The cost of an interleaved access was only implemented for AVX512. For other X86 targets an overly conservative Base cost was returned, resulting in avoiding vectorization where it is actually profitable to vectorize. This patch starts to add costs for AVX2 for most prominent cases of interleaved accesses (stride 3,4 chars, for now). Note1: Improvements of up to ~4x were observed in some of EEMBC's rgb workloads; There is also a known issue of 15-30% degradations on some of these workloads, associated with an interleaved access followed by type promotion/widening; the resulting shuffle sequence is currently inefficient and will be improved by a series of patches that extend the X86InterleavedAccess pass (such as D34601 and more to follow). Note 2: The costs in this patch do not reflect port pressure penalties which can be very dominant in the case of interleaved accesses since most of the shuffle operations are restricted to a single port. Further tuning, that may incorporate these considerations, will be done on top of the upcoming improved shuffle sequences (that is, along with the abovementioned work to extend X86InterleavedAccess pass). Differential Revision: https://reviews.llvm.org/D34023 llvm-svn: 306238
* [x86] enable CGP memcmp() expansion for 2/4/8 byte sizesSanjay Patel2017-06-201-0/+6
| | | | | | | | | There are a couple of potential improvements as seen in the IR and asm: 1. We're unnecessarily extending to a larger type to compare values. 2. The codegen for (select cond, 1, -1) could avoid a cmov. (or we could change the order of the compares, so we have a select with 0 operand) llvm-svn: 305802
* Revert r304824 "Fix PR23384 (part 3 of 3)"Hans Wennborg2017-06-191-11/+0
| | | | | | | | | | | | | | | | | This seems to be interacting badly with ASan somehow, causing false reports of heap-buffer overflows: PR33514. > Summary: > The patch makes instruction count the highest priority for > LSR solution for X86 (previously registers had highest priority). > > Reviewers: qcolombet > > Differential Revision: http://reviews.llvm.org/D30562 > > From: Evgeny Stupachenko <evstupac@gmail.com> llvm-svn: 305720
* Fix PR23384 (part 3 of 3)Evgeny Stupachenko2017-06-061-0/+11
| | | | | | | | | | | | | Summary: The patch makes instruction count the highest priority for LSR solution for X86 (previously registers had highest priority). Reviewers: qcolombet Differential Revision: http://reviews.llvm.org/D30562 From: Evgeny Stupachenko <evstupac@gmail.com> llvm-svn: 304824
* [Atomics][LoopIdiom] Recognize unordered atomic memcpyAnna Thomas2017-06-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | Summary: Expanding the loop idiom test for memcpy to also recognize unordered atomic memcpy. The only difference for recognizing an unordered atomic memcpy and instead of a normal memcpy is that the loads and/or stores involved are unordered atomic operations. Background: http://lists.llvm.org/pipermail/llvm-dev/2017-May/112779.html Patch by Daniel Neilson! Reviewers: reames, anna, skatkov Reviewed By: reames, anna Subscribers: llvm-commits, mzolotukhin Differential Revision: https://reviews.llvm.org/D33243 llvm-svn: 304806
* [X86][AVX512] Add 512-bit vector ctpop costs + testsSimon Pilgrim2017-05-181-0/+6
| | | | llvm-svn: 303342
OpenPOWER on IntegriCloud