path: root/llvm/test/CodeGen/X86/avx512-shuffles
* [X86][AVX] Use lowerShuffleAsLanePermuteAndSHUFP to lower binary v4f64 shuffles. (Simon Pilgrim, 2020-01-12; 1 file, -52/+46)
  Only perform this if we are shuffling lower and upper lane elements across the lanes (otherwise splitting to lower xmm shuffles would be better). This is a regression if we shuffle build_vectors, due to getVectorShuffle canonicalizing 'blend of splat' build vectors; for now I've set this not to shuffle build_vector nodes at all to avoid it.
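  As an illustrative aside (my own reduced example, not taken from the commit or the test file), this is the kind of binary v4f64 shuffle the lowering targets: both 128-bit lanes of the result pull elements from the opposite lanes of both inputs, so a lane permute followed by a SHUFPD-style blend can be cheaper than splitting into xmm shuffles.

      ; Hypothetical example: a lane-crossing binary v4f64 shuffle.
      define <4 x double> @binary_crosslane_v4f64(<4 x double> %a, <4 x double> %b) {
        ; Result = [a2, b3, a0, b1]: the low lane reads the upper lanes of both
        ; inputs and the high lane reads the lower lanes.
        %r = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 2, i32 7, i32 0, i32 5>
        ret <4 x double> %r
      }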
* [X86] Teach lowerV4I32Shuffle to only use broadcasts if the mask has more than one undef element. Prioritize shifts over broadcast in lowerV8I16Shuffle. (Craig Topper, 2019-08-19; 1 file, -9/+7)
  The motivating case is the changes in vector-reduce-add.ll where we were doing extra work in the scalar domain instead of shuffling. There may be a one-use check that needs to be looked into there, but this patch sidesteps the issue by avoiding broadcasts that aren't really broadcasting.
  Differential Revision: https://reviews.llvm.org/D66071
  llvm-svn: 369287
* [X86][AVX] Combine vpermi(bitcast(x)) -> bitcast(vpermi(x)) (Simon Pilgrim, 2019-07-03; 1 file, -2/+2)
  iff the number of elements doesn't change. This gets around an issue with combineX86ShuffleChain not being able to hint which domain is preferred for shuffles that can be done with either. Fixes regression introduced in rL365041.
  llvm-svn: 365044
* [X86][AVX] combineX86ShuffleChainWithExtract - add number of non-zero extract_subvectors to the combine depth (Simon Pilgrim, 2019-07-03; 1 file, -2/+2)
  This better accounts for the cost/benefit of removing extract_subvectors from the shuffle and will be more useful in future patches. The vpermq predicate regression will be fixed shortly.
  llvm-svn: 365041
* [SelectionDAG] Fold insert_subvector(undef, extract_subvector(v, c), c) -> v in getNode (Simon Pilgrim, 2019-06-17; 1 file, -27/+21)
  This is already done in DAGCombiner::visitINSERT_SUBVECTOR, but this helps a number of shuffles across different vector widths recognise when they come from the same source.
  llvm-svn: 363542
* [X86] CombineShuffleWithExtract - handle cases with different vector extract sources (Simon Pilgrim, 2019-06-16; 1 file, -18/+12)
  Insert the shorter vector source into an undef vector of the longer vector source's type.
  llvm-svn: 363507
* [X86][AVX] Handle lane-crossing shuffle(extract_subvector(x,c1),extract_subvector(y,c2),m1) shuffles (Simon Pilgrim, 2019-06-15; 1 file, -259/+227)
  Pull out the existing (non-)lane-crossing fold into a helper lambda and use it for lane-crossing unary shuffles as well. Fixes PR34380.
  llvm-svn: 363500
* [X86][AVX] Decode constant bits from insert_subvector(c1, c2, c3) (Simon Pilgrim, 2019-06-15; 1 file, -4/+2)
  This mostly happens due to SimplifyDemandedVectorElts reducing a vector to insert_subvector(undef, c1, 0).
  llvm-svn: 363499
* [X86][AVX] combineX86ShuffleChain - combine shuffle(extractsubvector(x),extractsubvector(y)) (Simon Pilgrim, 2019-06-05; 1 file, -13/+12)
  We already handle the case where we combine shuffle(extractsubvector(x),extractsubvector(x)); this relaxes the requirement to permit different sources as long as they have the same value type. This causes a couple of cases where the VPERMV3 binary shuffles occur at a wider width than before, which I intend to improve in future commits, but as only the subvector's mask indices are defined, these will broadcast so we don't see any increase in constant size.
  llvm-svn: 362599
* [X86][AVX] Combine non-lane crossing binary shuffles using X86ISD::VPERMV3 (Simon Pilgrim, 2019-04-28; 1 file, -152/+144)
  Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results.
  llvm-svn: 359400
* [X86][AVX] Merge mask select with shuffles across extract_subvector (PR40332) (Simon Pilgrim, 2019-04-27; 1 file, -122/+117)
  Fixes PR40332 in the limited case where we're selecting between a target shuffle and a zero vector. We can extend this in the future to handle more opcodes and non-zero selections.
  llvm-svn: 359378
* [X86][AVX] Fold extract_subvector(broadcast(x)) -> broadcast(x) iff x has one use (Simon Pilgrim, 2019-04-26; 1 file, -1/+1)
  llvm-svn: 359332
* [X86][AVX] Combine shuffles extracted from a common vector (Simon Pilgrim, 2019-04-26; 1 file, -58/+54)
  A small step towards combining shuffles across vector sizes: this recognizes when a shuffle's operands are all extracted from the same larger source and tries to combine to a unary shuffle of that source instead. Fixes one of the test cases from PR34380.
  Differential Revision: https://reviews.llvm.org/D60512
  llvm-svn: 359292
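  A hypothetical reduced example (not from the actual test file) of the pattern this combine recognizes: both operands of the final shuffle are halves extracted from the same wider source, so the whole sequence can be rewritten as a single unary shuffle of %src.

      define <8 x float> @shuffle_of_common_extracts(<16 x float> %src) {
        ; Extract the low and high 256-bit halves of the 512-bit source.
        %lo = shufflevector <16 x float> %src, <16 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
        %hi = shufflevector <16 x float> %src, <16 x float> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
        ; Interleave even elements of both halves.
        %r  = shufflevector <8 x float> %lo, <8 x float> %hi, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
        ret <8 x float> %r
      }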
* [X86][AVX] X86ISD::PERMV/PERMV3 node types can never fold index ops (Simon Pilgrim, 2019-04-16; 1 file, -143/+148)
  Improves codegen demonstrated by D60512: instructions represented by X86ISD::PERMV/PERMV3 can never memory fold the operand used for their index register.
  This patch updates the 'isUseOfShuffle' helper into the more capable 'isFoldableUseOfShuffle', which recognises that the op is used for an X86ISD::PERMV/PERMV3 index mask and can't be folded, allowing us to use broadcast/subvector-broadcast ops to reduce the size of the mask constant pool data.
  Differential Revision: https://reviews.llvm.org/D60562
  llvm-svn: 358516
* [X86][AVX] Add PR34380 shuffle test cases (Simon Pilgrim, 2019-04-08; 1 file, -0/+28)
  llvm-svn: 357914
* [X86] Prefer VPBLENDD for v2i64/v4i64 blends with AVX2. (Craig Topper, 2019-03-03; 1 file, -4/+4)
  We were using VPBLENDW for v2i64 and VBLENDPD for v4i64. VPBLENDD has better throughput than VPBLENDW on some CPUs, so it makes sense to use it when possible. VBLENDPD will probably become VPBLENDD during execution domain fixing, but we might as well use integer in isel while we can.
  This should work around some issues with the domain fixing pass preferring PBLENDW when we start with PBLENDW. There may still be some v8i16 cases that could use PBLENDD.
  llvm-svn: 355281
* [X86][SSE] Use pblendw for v4i32/v2i64 during isel. (Craig Topper, 2019-02-24; 1 file, -4/+4)
  Summary: Previously we used BLENDPS/BLENDPD, but that puts the blend in the FP domain. Under optsize, the two-address instruction pass can cause blendps/blendpd to commute to blendps/blendpd. But we probably shouldn't do that if the original type was an integer. So use pblendw instead.
  Reviewers: spatel, RKSimon
  Reviewed By: RKSimon
  Subscribers: jdoerfert, llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D58574
  llvm-svn: 354755
* Recommit r354363 "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" (Craig Topper, 2019-02-23; 1 file, -9/+8)
  And its follow-ups r354511, r354640. A follow-up patch will fix the issue that caused it to be reverted.
  llvm-svn: 354737
* [X86][AVX] Shuffle->Permute+Blend if we have one v4f64/v4i64 shuffle input in place (Simon Pilgrim, 2019-02-23; 1 file, -12/+10)
  Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 it's just a single VPERMPD), followed by a BLENDPD.
  llvm-svn: 354729
* Revert r354363 & co "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" (Reid Kleckner, 2019-02-23; 1 file, -8/+9)
  r354363 caused https://crbug.com/934963#c1, which has a plain C reduced test case.
  I also had to revert some dependent changes:
  - r354648
  - r354647
  - r354640
  - r354511
  llvm-svn: 354713
* [X86] Add more load folding patterns for blend instructions as a follow-up to r354363. (Craig Topper, 2019-02-20; 1 file, -10/+10)
  This avoids depending on the peephole pass to do load folding. Also adds some load folding for some insert_subvector patterns that use blend.
  All of this was found by temporarily adding TB_NO_FORWARD to the blend immediate entries in the load folding tables. I've added -disable-peephole to some of the affected tests from that experiment to ensure we're testing isel patterns.
  llvm-svn: 354511
* [X86][SSE] Generalize X86ISD::BLENDI support to more value types (Simon Pilgrim, 2019-02-19; 1 file, -13/+12)
  D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
  With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly, and use isel patterns to lower to the equivalent float vector instructions. This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive, as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
  I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold; there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
  The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
  Reapplied after reversion at rL353699; an AVX2 isel fix was applied at rL354358, with additional tests at rL354360/rL354361.
  Differential Revision: https://reviews.llvm.org/D57888
  llvm-svn: 354363
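  A hypothetical illustration (my own, not from the commit) of the kind of integer blend this generalization helps with: with BLENDI supported on vXi64 directly, a pattern like the one below no longer needs to round-trip through float bitcasts before selecting an immediate blend.

      ; An immediate blend of two v4i64 vectors: elements 0 and 2 from %a,
      ; elements 1 and 3 from %b. This can stay in the integer domain.
      define <4 x i64> @blend_v4i64(<4 x i64> %a, <4 x i64> %b) {
        %r = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
        ret <4 x i64> %r
      }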
* Revert "[X86][SSE] Generalize X86ISD::BLENDI support to more value types" (Sam McCall, 2019-02-11; 1 file, -12/+13)
  This reverts commit r353610. It causes a miscompile visible in macro expansion in a bootstrapped clang.
  http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190211/626590.html
  llvm-svn: 353699
* [X86][SSE] Generalize X86ISD::BLENDI support to more value types (Simon Pilgrim, 2019-02-09; 1 file, -13/+12)
  D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.
  With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly, and use isel patterns to lower to the equivalent float vector instructions. This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive, as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.
  I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold; there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).
  The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.
  Differential Revision: https://reviews.llvm.org/D57888
  llvm-svn: 353610
* [x86] split more 256/512-bit shuffles in lowering (Sanjay Patel, 2019-02-07; 1 file, -33/+22)
  This is intentionally a small step because it's hard to know exactly where we might introduce a conflicting transform with the code that tries to form wider shuffles. But I think this is safe: if we have a wide shuffle with 2 operands, then we should do better with an extract + narrow shuffle.
  Differential Revision: https://reviews.llvm.org/D57867
  llvm-svn: 353427
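  A hedged illustration (my own example, not from the patch): when each 256-bit half of a 512-bit shuffle result only needs one of the inputs, splitting can replace a single two-input 512-bit shuffle with two independent 256-bit permutes.

      ; The low half of the result only uses %a and the high half only uses %b,
      ; so this can be split into two narrow shuffles instead of one wide
      ; two-operand shuffle.
      define <8 x double> @splittable_shuffle(<8 x double> %a, <8 x double> %b) {
        %r = shufflevector <8 x double> %a, <8 x double> %b, <8 x i32> <i32 3, i32 2, i32 1, i32 0, i32 11, i32 10, i32 9, i32 8>
        ret <8 x double> %r
      }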
* [X86][AVX] Support shuffle combining for VBROADCAST with smaller vector sources (Simon Pilgrim, 2019-02-03; 1 file, -13/+8)
  getTargetShuffleMask can only do this safely if we're extracting the lowest subvector from a vector of the same result type.
  llvm-svn: 352999
* [X86][AVX] More aggressively simplify BROADCAST source operand (Simon Pilgrim, 2019-02-03; 1 file, -26/+9)
  Aim to use scalar source or lowest 128-bit vector directly. We're still missing some VZMOVL_LOAD combines.
  llvm-svn: 352994
* [X86][AVX] Enable INSERT_SUBVECTOR(SRC0, SHUFFLE(SRC1)) shuffle combining (Simon Pilgrim, 2019-02-02; 2 files, -37/+37)
  Push the insert_subvector up through the shuffle operands to help find more cross-lane shuffles.
  This exposes a couple of minor issues that will be fixed shortly:
  - Missed broadcast folds: we have a mixture of vzext_load lengths that need cleaning up.
  - combine-sdiv.ll: AVX1 SimplifyDemandedVectorElts failure (hits max depth due to a couple of extra bitcasts).
  llvm-svn: 352963
* [X86][AVX] Add VMOVDDUP-VPBROADCASTQ execution domain mapping (Simon Pilgrim, 2019-02-01; 3 files, -6/+6)
  Noticed in D57514.
  Differential Revision: https://reviews.llvm.org/D57519
  llvm-svn: 352922
* [DAGCombiner] fold extract_subvector of extract_subvector (Sanjay Patel, 2019-01-29; 1 file, -27/+31)
  This is the sibling fold for insert-of-insert that was added with D56604. Now that we have x86 shuffle narrowing (D57156), this change shows improvements for lots of AVX512 reduction code (not sure that we would ever expect extract-of-extract otherwise).
  There's a small regression in some of the partial-permute tests (extracting followed by splat). That is tracked by PR40500: https://bugs.llvm.org/show_bug.cgi?id=40500
  Differential Revision: https://reviews.llvm.org/D57336
  llvm-svn: 352528
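  A hypothetical IR-level sketch (my own, not from the patch) of a shape that can produce nested extract_subvector nodes once legalized; the fold collapses the two extracts into one.

      ; Extract the high 256-bit half of a 512-bit vector, then the low 128 bits
      ; of that half; after the DAG fold this is a single extract of elements
      ; 8..11 of %v.
      define <4 x i32> @extract_of_extract(<16 x i32> %v) {
        %half = shufflevector <16 x i32> %v, <16 x i32> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
        %quarter = shufflevector <8 x i32> %half, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
        ret <4 x i32> %quarter
      }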
* [x86] lower shuffle of extracts to AVX2 vperm instructions (Sanjay Patel, 2019-01-16; 1 file, -90/+75)
  I was trying to prevent shuffle regressions while matching more horizontal ops and ended up here:
    shuf (extract X, 0), (extract X, 4), Mask --> extract (shuf X, undef, Mask'), 0
  The affected tests were added for https://bugs.llvm.org/show_bug.cgi?id=34380. This patch won't change the examples in the bug report itself, but we should be able to extend this to catch more types.
  Differential Revision: https://reviews.llvm.org/D56756
  llvm-svn: 351346
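  A concrete instance of the transform above (illustrative only, not from the test file): with AVX2, the interleave of the two 128-bit halves below can become a single vpermd/vpermps of %x followed by an extract, rather than two extracts plus a shuffle.

      define <4 x i32> @shuffle_of_extract_halves(<8 x i32> %x) {
        %lo = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
        %hi = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
        ; Interleave the low elements of both halves: a lane-crossing pattern.
        %r = shufflevector <4 x i32> %lo, <4 x i32> %hi, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
        ret <4 x i32> %r
      }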
* [x86] allow vector load narrowing with multi-use values (Sanjay Patel, 2018-11-10; 1 file, -602/+419)
  This is a long-awaited follow-up suggested in D33578. Since then, we've picked up even more opportunities for vector narrowing from changes like D53784, so there are a lot of test diffs. Apart from 2-3 strange cases, these are all wins.
  I've structured this to be no-functional-change-intended for any target except for x86, because I couldn't tell if AArch64, ARM, and AMDGPU would improve or not. All of those targets have existing regression tests (4, 4, 10 files respectively) that would be affected. Also, Hexagon overrides the shouldReduceLoadWidth() hook, but doesn't show any regression test diffs.
  The trade-off is deciding if an extra vector load is better than a single wide load + extract_subvector. For x86, this is almost always better (on paper at least) because we often can fold loads into subsequent ops and not increase the official instruction count. There's also some unknown -- but potentially large -- benefit from using narrower vector ops if wide ops are implemented with multiple uops and/or frequency throttling is avoided.
  Differential Revision: https://reviews.llvm.org/D54073
  llvm-svn: 346595
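  A small hypothetical example (my own, written in the typed-pointer IR syntax of that era) of the multi-use situation this enables: the wide load has two uses, but both only need its low 128 bits, so a narrower load can be used.

      define <4 x float> @narrow_multi_use_load(<8 x float>* %p) {
        %wide = load <8 x float>, <8 x float>* %p, align 32
        ; Two separate uses, but each only reads the low 128-bit half.
        %lo  = shufflevector <8 x float> %wide, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
        %dup = shufflevector <8 x float> %wide, <8 x float> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>
        %r   = fadd <4 x float> %lo, %dup
        ret <4 x float> %r
      }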
* [X86][SSE] Move 2-input limit up from getFauxShuffleMask to resolveTargetShuffleInputs (reapplied) (Simon Pilgrim, 2018-11-01; 1 file, -10/+11)
  Reapplying an updated version of rL345395 (reverted in rL345451), now that the issues noticed in PR39483 have been fixed. This patch allows resolveTargetShuffleInputs to remove UNDEF inputs from cases where we have more than 2 inputs.
  llvm-svn: 345824
* [X86][AVX] getFauxShuffleMask - add support for INSERT_SUBVECTOR subvector shuffles (Simon Pilgrim, 2018-10-05; 1 file, -13/+9)
  Decode subvector shuffles from INSERT_SUBVECTOR(SRC0, SHUFFLE(EXTRACT_SUBVECTOR(SRC1))). This was found necessary while investigating PR39161.
  llvm-svn: 343853
* [X86] Remove all the vector NOP bitcast patterns. Use a few lines of code in the Select method in X86ISelDAGToDAG.cpp instead. (Craig Topper, 2018-08-03; 1 file, -8/+8)
  There are a lot of permutations of types here generating a lot of patterns in the isel table. It's more efficient to just ReplaceUses and RemoveDeadNode from the Select function.
  The test changes are because we have some shuffle patterns that have a bitcast as their root node. But the behavior is identical to another instruction whose pattern doesn't start with a bitcast. So this isn't a functional change.
  llvm-svn: 338824
* [X86] Add custom execution domain fixing for 128/256-bit integer logic operations with AVX512F, but not AVX512DQ. (Craig Topper, 2018-07-15; 9 files, -928/+928)
  AVX512F only has integer domain logic instructions. AVX512DQ added FP domain logic instructions. Execution domain fixing runs before EVEX->VEX. So if we have AVX512F and not AVX512DQ we fail to do execution domain switching of the logic operations. This leads to mismatches in execution domain and more test differences.
  This patch adds custom domain fixing that switches EVEX integer logic operations to VEX fp logic operations if XMM16-31 are not used.
  llvm-svn: 337137
* [X86] Fix a subtle bug in the custom execution domain fixing for blends. (Craig Topper, 2018-07-14; 1 file, -52/+52)
  The code tried to find the immediate by using getNumOperands() on the MachineInstr, but there might be implicit-defs after the immediate that get counted. Instead use getNumOperands() from the instruction description, which will only count the operands that are defined in the td file.
  llvm-svn: 337088
* [X86] Prefer blendi over movss/sd when avx512 is enabled unless optimizing for size. (Craig Topper, 2018-07-14; 1 file, -6/+6)
  AVX512 doesn't have an immediate-controlled blend instruction, but blend throughput is still better than movss/sd on SKX. This commit changes AVX512 to use the AVX blend instructions instead of MOVSS/MOVSD. This constrains the register allocation since it won't be able to use XMM16-31, but hopefully the increased throughput and reduced port 5 pressure make up for that.
  llvm-svn: 337083
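  For illustration (my own reduced example, not from the test file), this is the classic movss-style pattern the change affects; with AVX512 enabled, isel now prefers an immediate blend such as vblendps over vmovss.

      ; Take element 0 from %b and elements 1..3 from %a.
      define <4 x float> @movss_style_blend(<4 x float> %a, <4 x float> %b) {
        %r = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
        ret <4 x float> %r
      }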
* [X86] Rewrite printMasking code in X86InstComments to use TSFlags to determine whether the instruction is masked. (Craig Topper, 2018-03-10; 1 file, -8/+8)
  This should have been NFC, but it looks like we were missing PUNPCKHQDQ/PUNPCKLQDQ instructions in there.
  llvm-svn: 327200
* [X86] Remove X86ISD::SHUF128 from combineBitcastForMaskedOp. Use isel patterns instead. (Craig Topper, 2018-02-05; 1 file, -8/+8)
  We always created X86ISD::SHUF128 with a 64-bit element type, so we can use isel patterns to detect a bitconvert to 32-bit to handle masking. The test changes are because we also match the bitconvert even if there is no masking. This leads to an unnecessary isel pattern, but it requires more multiclass hackery in tablegen to get rid of it.
  llvm-svn: 324205
* Followup on Proposal to move MIR physical register namespace to '$' sigil. (Puyan Lotfi, 2018-01-31; 1 file, -5/+5)
  Discussed here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120320.html
  In preparation for adding support for named vregs we are changing the sigil for physical registers in MIR to '$' from '%'. This will prevent name clashes of named physical registers with named vregs.
  llvm-svn: 323922
* [X86] Use vptestm/vptestnm for comparisons with zero to avoid creating a zero vector. (Craig Topper, 2018-01-27; 6 files, -1584/+792)
  We can use the same input for both operands to get a free compare with zero. We already use this trick in a couple of places where we explicitly create PTESTM with the same input twice. This generalizes it.
  I'm hoping to remove the ISD opcodes and move this to isel patterns like we do for scalar cmp/test.
  llvm-svn: 323605
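  A hedged sketch (my own example) of the kind of pattern this targets: the comparison against zero can be implemented as vptestnmd of %x against itself, so no zero vector needs to be materialised for a vpcmpeqd.

      ; %m is set where %x is zero; vptestnmd %x, %x computes (x & x) == 0,
      ; which is the same predicate.
      define <16 x i32> @select_on_zero(<16 x i32> %x, <16 x i32> %a, <16 x i32> %b) {
        %m = icmp eq <16 x i32> %x, zeroinitializer
        %r = select <16 x i1> %m, <16 x i32> %a, <16 x i32> %b
        ret <16 x i32> %r
      }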
* [X86][SSE] Simplify demanded elements from BROADCAST shuffle source. (Simon Pilgrim, 2018-01-27; 1 file, -43/+25)
  If broadcasting from another shuffle, attempt to simplify it. We can probably generalize this a lot more (embedding it in combineX86ShufflesRecursively), but BROADCAST is one of the more troublesome ones as it accepts inputs of different sizes to the result.
  llvm-svn: 323602
* [X86] Remove isel patterns for using unmasked vmovdqa32/vmovdqu32 for integer vector loads. (Craig Topper, 2018-01-18; 3 files, -42/+42)
  These patterns were just looking for a vXi64 bitcasted to vXi32, but there is no advantage to using vmovdqa32 over vmovdqa64.
  llvm-svn: 322819
* [X86] Remove Windows line endings from a test file. NFC (Craig Topper, 2018-01-18; 1 file, -93/+93)
  llvm-svn: 322817
* [X86] Don't mutate shuffle arguments after early-out for AVX512 (Benjamin Kramer, 2018-01-17; 1 file, -0/+40)
  The match* functions have the annoying behavior of modifying their inputs. Save and restore the inputs, just in case the early out for AVX512 is hit. This is still not great, and it's only a matter of time before this kind of bug happens again, but I couldn't come up with a better pattern without rewriting significant chunks of this code. Fixes PR35977.
  llvm-svn: 322644
* [X86][SSE] Add custom execution domain fixing for BLENDPD/BLENDPS/PBLENDD/PBLENDW (PR34873) (Simon Pilgrim, 2018-01-15; 1 file, -93/+93)
  Add support for custom execution domain fixing and implement support for BLENDPD/BLENDPS/PBLENDD/PBLENDW.
  Differential Revision: https://reviews.llvm.org/D42042
  llvm-svn: 322524
* X86 Tests: Update more isel tests with FastVariableShuffle feature (Zvi Rackover, 2018-01-09; 2 files, -341/+358)
  Summary: Added the FastVariableShuffle feature to cases that resembled processors for which this feature is on. For AVX2 there are processors both with and without this feature enabled. For AVX512, only KNL enables this feature, so cases which only have +avx512f were left without FastVariableShuffle enabled.
  Reviewers: RKSimon, craig.topper
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D41851
  llvm-svn: 322090
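  For context (an illustrative sketch only; the triple, feature string, and check prefix here are assumptions, not copied from the updated test files), tests opt into the feature through the -mattr RUN-line flags, roughly like this:

      ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2,+fast-variable-shuffle | FileCheck %s --check-prefix=AVX2-FAST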
* [X86] Call lowerShuffleAsRepeatedMaskAndLanePermute from lowerV4I64VectorShuffle. (Craig Topper, 2018-01-06; 1 file, -49/+40)
  llvm-svn: 321929
* [X86] Run dos2unix on a test file. NFC (Craig Topper, 2018-01-06; 1 file, -40/+40)
  llvm-svn: 321928