path: root/llvm/test/CodeGen/X86
Commit message | Author | Age | Files | Lines
* [X86] Merge ISD::ADD/SUB nodes into X86ISD::ADD/SUB equivalents (PR40483)
  Simon Pilgrim | 2019-02-25 | 2 files | -61/+28

  Avoid ADD/SUB instruction duplication by reusing the X86ISD::ADD/SUB results.
  Includes ADD commutation - I tried to include NEG+SUB SUB commutation as well, but this causes regressions as we don't have good combine coverage to simplify X86ISD::SUB.

  Differential Revision: https://reviews.llvm.org/D58597
  llvm-svn: 354771
* [X86] Add PR40483 test cases
  Simon Pilgrim | 2019-02-24 | 2 files | -0/+187

  Demonstrate failure to merge ISD::ADD(x,y)/X86ISD::ADD(x,y) + ISD::SUB(x,y)/X86ISD::SUB(x,y) equivalent ops.

  llvm-svn: 354758
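  A hypothetical reduction of the kind of pattern these tests cover (an illustration, not the committed test): the with.overflow intrinsic lowers to an X86ISD::ADD producing both a value and EFLAGS, while the plain IR add becomes an equivalent ISD::ADD of the same operands, so the addition is emitted twice unless the nodes are merged.

    declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)

    define i32 @dup_add(i32 %a, i32 %b) {
      ; Becomes X86ISD::ADD (value + EFLAGS) during lowering.
      %t = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %a, i32 %b)
      %ov = extractvalue { i32, i1 } %t, 1
      ; The same addition expressed as a plain ISD::ADD of the same operands.
      %sum = add i32 %a, %b
      %sel = select i1 %ov, i32 0, i32 %sum
      ret i32 %sel
    }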
* [X86] Combine zext(packus(x),packus(y)) -> concat(x,y) (PR39637)
  Simon Pilgrim | 2019-02-24 | 4 files | -50/+26

  It's proving tricky to combine shuffles across multiple vector sizes, so for now I'm adding this more specific combine - the pattern is common enough to be worth it as a first step.

  llvm-svn: 354757
* [X86] Fix tls variable lowering issue with large code model
  Craig Topper | 2019-02-24 | 1 file | -0/+66

  Summary: The problem here is the lowering of a TLS variable. Below is the DAG for the code:

    SelectionDAG has 11 nodes:
      t0: ch = EntryToken
      t8: i64,ch = load<(load 8 from `i8 addrspace(257)* null`, addrspace 257)> t0, Constant:i64<0>, undef:i64
      t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]
      t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64
      t12: i64 = add t8, t11
      t4: i32,ch = load<(dereferenceable load 4 from @x)> t0, t12, undef:i64
      t6: ch = CopyToReg t0, Register:i32 %0, t4

  When mcmodel is large, the nodes below can NOT be folded:

    t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]
    t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64

  So "t11: i64,ch = load<(load 8 from got)> t0, t10, undef:i64" is lowered to "Morphed node: t11: i64,ch = MOV64rm<Mem:(load 8 from got)> t10, TargetConstant:i8<1>, Register:i64 $noreg, TargetConstant:i32<0>, Register:i32 $noreg, t0". When LLVM then starts to lower "t10: i64 = X86ISD::WrapperRIP TargetGlobalTLSAddress:i64<i32* @x> 0 [TF=10]", it fails.

  The patch is to fold the load and the X86ISD::WrapperRIP.

  Fixes PR26906

  Patch by LuoYuanke

  Reviewers: craig.topper, rnk, annita.zhang, wxiao3
  Reviewed By: rnk
  Subscribers: llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D58336
  llvm-svn: 354756
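  A minimal IR reproducer along these lines (a hypothetical sketch, not the committed test) would be a load of a thread-local global compiled with something like llc -mtriple=x86_64-unknown-linux-gnu -relocation-model=pic -code-model=large:

    @x = thread_local global i32 0, align 4

    define i32 @get_x() {
    entry:
      ; Address of @x goes through the GOT load that could not be folded
      ; with the WrapperRIP under the large code model.
      %v = load i32, i32* @x, align 4
      ret i32 %v
    }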
* [X86][SSE] Use pblendw for v4i32/v2i64 during isel.
  Craig Topper | 2019-02-24 | 9 files | -169/+79

  Summary: Previously we used BLENDPS/BLENDPD but that puts the blend in the FP domain. Under optsize, the two address instruction pass can cause blendps/blendpd to commute to blendps/blendpd. But we probably shouldn't do that if the original type was an integer. So use pblendw instead.

  Reviewers: spatel, RKSimon
  Reviewed By: RKSimon
  Subscribers: jdoerfert, llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D58574
  llvm-svn: 354755
* [LegalizeTypes][AArch64][X86] Make type legalization of vector (S/U)ADD/SUB/MULO follow getSetCCResultType for the overflow bits. Make UnrollVectorOverflowOp properly convert from scalar boolean contents to vector boolean contents
  Craig Topper | 2019-02-24 | 6 files | -294/+234

  Summary: When promoting the overflow vector for these ops we should use the target's desired setcc result type. This way a v8i32 result type will use a v8i32 overflow vector instead of a v8i16 overflow vector. A v8i16 overflow vector will cause LegalizeDAG/LegalizeVectorOps to have to use v8i32 and truncate to v8i16 in its expansion. By doing this in type legalization instead, we get the truncate into the DAG earlier and give DAG combine more of a chance to optimize it.

  We also have to fix unrolling to use the scalar setcc result type for the scalarized operation, and convert it to the required vector element type after the scalar operation. We have to observe the vector boolean contents when doing this conversion. The previous code was just taking the scalar result and putting it in the vector. But for X86 and AArch64 that would have put the boolean value only in bit 0 of the element and left all other bits of the element 0. We need to ensure all bits in the element are the same. I'm using a select with constants here because that's what setcc unrolling in LegalizeVectorOps used.

  Reviewers: spatel, RKSimon, nikic
  Reviewed By: nikic
  Subscribers: javed.absar, kristof.beyls, dmgreen, llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D58567
  llvm-svn: 354753
* [CGP] add special-cases to form unsigned add with overflow (PR40486)
  Sanjay Patel | 2019-02-24 | 2 files | -24/+14

  There's likely a missed IR canonicalization for at least 1 of these patterns. Otherwise, we wouldn't have needed the pattern-matching enhancement in D57516.

  Note that -- unlike usubo added with D57789 -- the TLI hook for this transform defaults to 'on'. So if there's any perf fallout from this, targets should look at how they're lowering the uaddo node in SDAG and/or override that hook.

  The x86 diffs suggest that there's some missing pattern-matching for forming inc/dec.

  This should fix the remaining known problems in:
  https://bugs.llvm.org/show_bug.cgi?id=40486
  https://bugs.llvm.org/show_bug.cgi?id=31754

  llvm-svn: 354746
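  One increment-style shape from the PR40486 family (a hedged sketch, assuming the overflow check is written as a compare against -1 rather than against the result):

    define i32 @inc_overflow(i32 %x, i32* %p) {
      %a = add i32 %x, 1
      %ov = icmp eq i32 %x, -1     ; true exactly when x + 1 wraps
      store i32 %a, i32* %p
      %r = zext i1 %ov to i32
      ret i32 %r
    }

  CGP can match this to a single llvm.uadd.with.overflow call; the x86 diffs then track how well SDAG turns that into an add/inc plus a flag read.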
* [TwoAddressInstructionPass] After commuting an instruction and before trying to look for more commutable operands, resample the number of operands.
  Craig Topper | 2019-02-23 | 1 file | -1/+27

  The new instruction might have fewer operands than the original instruction. If we don't resample, the next loop iteration might read an operand that doesn't exist.

  X86 can commute blends to movss/movsd, which reduces from 4 operands to 3. This happened in the test case that caused r354363 & company to be reverted. A reduced version of that has been committed here.

  Really this whole checking for more commutable operands is a little fragile. It assumes that the new instruction's operands are in the same order and positions as the original's, except for the pair that was swapped. I don't know of anything that breaks this assumption today, but I've left a fixme. Fixing this will likely require an interface change.

  llvm-svn: 354738
* Recommit r354363 "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
  Craig Topper | 2019-02-23 | 22 files | -313/+246

  And its follow-ups r354511 and r354640. A follow-up patch will fix the issue that caused it to be reverted.

  llvm-svn: 354737
* Recommit r354647 and r354648 "[LegalizeTypes] When promoting the result of EXTRACT_SUBVECTOR, also check if the input needs to be promoted. Use that to determine the element type to extract"
  Craig Topper | 2019-02-23 | 3 files | -15/+9

  r354648 was a follow up to fix a regression: "[X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit."

  These were reverted in r354713 as their context depended on other patches that were reverted for a bug.

  llvm-svn: 354734
* [X86][AVX] concat_vectors(scalar_to_vector(x),scalar_to_vector(x)) --> broadcast(x)
  Simon Pilgrim | 2019-02-23 | 2 files | -42/+13

  For AVX1, limit this to i32/f32/i64/f64 loading cases only.

  llvm-svn: 354730
* [X86][AVX] Shuffle->Permute+Blend if we have one v4f64/v4i64 shuffle input in place
  Simon Pilgrim | 2019-02-23 | 2 files | -26/+20

  Even on AVX1 we can pretty cheaply (VPERM2F128+VSHUFPD) permute a single v4f64/v4i64 input (on AVX2 it's just a single VPERMPD), followed by a BLENDPD.

  llvm-svn: 354729
* [X86] Sign extend the 8-bit immediate when commuting blend instructions to match isel.
  Craig Topper | 2019-02-23 | 2 files | -32/+18

  Conversion from ConstantSDNode to MachineInstr sign extends immediates from their APInt representation to int64_t. This commit makes sure we do the same for commuting. The test changes show how this improves CSE. This issue was made worse by MachineCSE using commuteInstruction to undo a commute, so we virtually guarantee the sign extend from isel would be lost.

  The improved CSE also occurred with r354363, but that was reverted. I'm working to undo the revert, but wanted to get this fix in while it was easy to see the results.

  llvm-svn: 354724
* Revert r354363 & co "[X86][SSE] Generalize X86ISD::BLENDI support to more value types"
  Reid Kleckner | 2019-02-23 | 25 files | -271/+358

  r354363 caused https://crbug.com/934963#c1, which has a plain C reduced test case.

  I also had to revert some dependent changes:
  - r354648
  - r354647
  - r354640
  - r354511

  llvm-svn: 354713
* [X86] Enable custom splitting of v8i64/v16i32 sext/zext for avx/avx2 when input type will be promoted by the type legalizer to 128-bits.
  Craig Topper | 2019-02-23 | 10 files | -338/+270

  If the input type will be promoted to 128 bits, it's better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register.

  Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization.

  llvm-svn: 354709
* [X86] Add a few test cases for a v8i64 sext/zext from an illegal type that needs to be promoted to 128 bits.
  Craig Topper | 2019-02-23 | 4 files | -0/+758

  If v8i64 isn't a legal type but v4i64 is, these will be split and then each half will get their input promoted and become an any_extend_vector_inreg/punpckhwd + any_extend + and/sign_extend_inreg. If we instead recognize the input will be promoted, we can emit the and/sign_extend_inreg first in a 128 bit register. Then we can sign_extend/zero_extend one half and pshufd+sign_extend/zero_extend the other half.

  llvm-svn: 354708
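  A hypothetical example of the shape being described (assuming v8i8 as the illegal input type, which the type legalizer promotes to a 128-bit v8i16 before the v8i64 result is split):

    define <8 x i64> @sext_v8i8_to_v8i64(<8 x i8> %x) {
      %e = sext <8 x i8> %x to <8 x i64>
      ret <8 x i64> %e
    }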
* [CGP] add tests for uaddo increment/decrement; NFC
  Sanjay Patel | 2019-02-22 | 1 file | -0/+64

  llvm-svn: 354699
* [MBP] Factor out function hasViableTopFallthrough and enhancement
  Guozhi Wei | 2019-02-22 | 1 file | -0/+53

  This patch factors out the function hasViableTopFallthrough from rotateLoop and also enhances it. The original code checks only whether there is a block that can be placed before the current loop top. This patch also checks whether the loop top is the most likely successor of its predecessor. The attached test case shows its effect.

  Differential Revision: https://reviews.llvm.org/D58393
  llvm-svn: 354682
* [DAGCombine] Fold overlapping constant stores
  Nirav Dave | 2019-02-22 | 1 file | -2/+1

  Fold a smaller constant store into larger constant stores immediately preceding it.

  Reviewers: rnk, courbet
  Subscribers: javed.absar, hiraditya, llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D58468
  llvm-svn: 354676
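  A hedged sketch of the kind of pattern the fold targets: a larger constant store followed immediately by a smaller constant store into part of the same memory, which can be merged into a single constant store (names are illustrative, not from the committed test).

    define void @merge_stores(i32* %p) {
      store i32 0, i32* %p             ; larger constant store
      %p8 = bitcast i32* %p to i8*
      store i8 1, i8* %p8              ; smaller constant store overlapping it
      ret void
    }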
* [x86] allow narrowing of vector UINT_TO_FP
  Sanjay Patel | 2019-02-22 | 2 files | -24/+8

  As discussed in D56864 and D58197, always use the narrow (128-bit) instruction when possible. We already had the signed int version of this transform.

  llvm-svn: 354675
* [LegalizeVectorOps] Improve the placement of ANDs in the ExpandLoad path for non-byte-sized loads.
  Craig Topper | 2019-02-22 | 2 files | -130/+174

  When we need to merge two adjacent loads, the AND mask for the low piece was still sized for the full src element size. But we didn't have that many bits. The upper bits are already zero due to the SRL. So we can skip the AND if we're going to combine with the high bits.

  We do need an AND to clear out any bits from the high part. We were ANDing the high part before combining with the low part, but it looks like ANDing after the OR gets better results. So we can just emit the final AND after the optional concatenation is done. That will handle skipping the AND before the OR and getting rid of extra high bits after the OR.

  llvm-svn: 354655
* [X86] Add test cases to cover the path in VectorLegalizer::ExpandLoad for non-byte sized loads where bits from two loads need to be concatenated.
  Craig Topper | 2019-02-22 | 4 files | -0/+518

  If the scalar type doesn't divide evenly into the WideVT, then the code will need to take some bits from adjacent scalar loads and combine them. But most of our testing is for the i1 element type, which always divides evenly.

  llvm-svn: 354653
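  A hypothetical element type that exercises this path (the committed tests use their own types; this is only an illustration): i7 does not divide evenly into the wider loaded chunks, so one element's bits can straddle two adjacent scalar loads and need to be concatenated.

    define <3 x i7> @load_v3i7(<3 x i7>* %p) {
      %v = load <3 x i7>, <3 x i7>* %p
      ret <3 x i7> %v
    }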
* [X86] Add a DAG combine for (aext_vector_inreg (aext_vector_inreg X)) -> (aext_vector_inreg X) to fix a regression from my previous commit.
  Craig Topper | 2019-02-22 | 1 file | -4/+3

  Type legalization is causing two nodes to be created here, but we can use a single node to extend from v8i16 to v2i64.

  llvm-svn: 354648
* [LegalizeTypes] When promoting the result of EXTRACT_SUBVECTOR, also check if the input needs to be promoted. Use that to determine the element type to extract.
  Craig Topper | 2019-02-22 | 3 files | -18/+13

  Otherwise we end up creating extract_vector_elts that then each need to have their input promoted. This can lead to truncates needing to be emitted for each of those. But we already emitted any_extends when we legalized the extract_subvector, so now we have pairs of any_extend+trunc that partially cancel, and depending on how DAGCombiner visits them we can get weird results.

  By promoting the input at the same time, we can create only a single any_extend or truncate.

  There's one regression in the vector-narrow-binop.ll case, but that looks easy to fix with a follow-up patch.

  llvm-svn: 354647
* [X86] Fix some copy/paste mistakes that caused a VR128 to be used as the address of a load in an isel pattern
  Craig Topper | 2019-02-22 | 1 file | -0/+17

  This was introduced in r354511.

  Fixes PR40811.

  llvm-svn: 354640
* [X86] Remove hasSideEffects=1 from the X87 pseudos with folded load.
  Craig Topper | 2019-02-21 | 1 file | -1/+1

  This was done in r321424 to prevent scheduling from reordering things. But now that we model FPCW as a dependency, I don't think the same scheduling we were trying to prevent can occur.

  llvm-svn: 354628
* [x86] vectorize more cast ops in lowering to avoid register file transfers
  Sanjay Patel | 2019-02-21 | 2 files | -28/+69

  This is a follow-up to D56864.

  If we're extracting from a non-zero index before casting to FP, then shuffle the vector and optionally narrow the vector before doing the cast:

    cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0

  This might be enough to close PR39974: https://bugs.llvm.org/show_bug.cgi?id=39974

  Differential Revision: https://reviews.llvm.org/D58197
  llvm-svn: 354619
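  A hypothetical IR reduction of that pattern (not necessarily one of the committed tests): extracting a non-zero lane before converting to FP.

    define float @cvt_lane2(<4 x i32> %v) {
      %e = extractelement <4 x i32> %v, i32 2
      %f = sitofp i32 %e to float
      ret float %f
    }

  With the shuffle done first, the conversion can be performed on the vector register and the result read from lane 0, instead of bouncing the extracted integer through a GPR.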
* [DAGCombiner] prevent infinite looping by truncating 'and' (PR40793)
  Sanjay Patel | 2019-02-21 | 1 file | -0/+26

  This fold can occur during legalization, so it can fight with promotion to the larger type. It apparently takes a special sequence and subtarget to avoid more basic simplifications that would hide the problem.

  But there's a bigger question raised here: why does distributeTruncateThroughAnd() even exist? It duplicates functionality from a more minimal pattern that we already have. But getting rid of this function requires some preliminary steps.

  https://bugs.llvm.org/show_bug.cgi?id=40793
  llvm-svn: 354594
* [x86] regenerate checks; NFC
  Sanjay Patel | 2019-02-21 | 1 file | -8/+8

  llvm-svn: 354589
* [X86] Fix copy-paste error in @ccz flag.
  Nirav Dave | 2019-02-21 | 1 file | -2/+2

  The @ccz operand should be equivalent to @cce.

  llvm-svn: 354588
* [X86] Add test cases to show missed opportunities to remove AND mask from BTC/BTS/BTR instructions when LHS of AND has known zeros.
  Craig Topper | 2019-02-20 | 1 file | -0/+187

  We can currently remove the mask if the immediate has all ones in the LSBs, but if the LHS of the AND has known zero bits, then the immediate might have had bits removed.

  A similar issue also occurs with shifts and rotates. I'm preparing a common fix for all of them.

  llvm-svn: 354520
* [CGP] match a special-case of unsigned subtract overflow
  Sanjay Patel | 2019-02-20 | 1 file | -5/+3

  This is the 'sub0' (negate) pattern from PR31754:
  https://bugs.llvm.org/show_bug.cgi?id=31754

  llvm-svn: 354519
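  A hedged sketch of that negate shape: the compare is exactly the borrow of the subtraction 0 - x, so the pair can be matched to a single usubo.

    define i32 @neg_usubo(i32 %x, i32* %p) {
      %cmp = icmp ne i32 %x, 0         ; borrow of (0 - x)
      %neg = sub i32 0, %x
      store i32 %neg, i32* %p
      %r = zext i1 %cmp to i32
      ret i32 %r
    }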
* [DAGCombine] Generalize Dead Store to overlapping stores.
  Nirav Dave | 2019-02-20 | 1 file | -1/+0

  Summary: Remove stores that are immediately overwritten by larger stores.

  Reviewers: courbet, rnk
  Reviewed By: rnk
  Subscribers: javed.absar, hiraditya, llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D58467
  llvm-svn: 354518
* [SelectionDAG] Teach GetDemandedBits to look at the known zeros of the LHS when handling ISD::AND
  Craig Topper | 2019-02-20 | 1 file | -2/+0

  If the LHS has known zeros, then the RHS immediate mask might have been simplified to remove those bits.

  This patch adds a call to computeKnownBits to get the known zeroes to handle that possibility. I left an early out to skip the call if all of the demanded bits are set in the mask.

  Differential Revision: https://reviews.llvm.org/D58464
  llvm-svn: 354514
* [SDAG] Support vector UMULO/SMULO
  Nikita Popov | 2019-02-20 | 2 files | -0/+5403

  Second part of https://bugs.llvm.org/show_bug.cgi?id=40442.

  This adds an extra UnrollVectorOverflowOp() method to SDAG, because the general UnrollOverflowOp() method can't deal with multiple results.

  Additionally we need to expand UMULO/SMULO during vector op legalization, as it may result in unrolling, which may need additional type legalization.

  Differential Revision: https://reviews.llvm.org/D57997
  llvm-svn: 354513
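  The IR-level entry point is the vector form of the with.overflow intrinsics; a small sketch (not one of the committed tests):

    declare { <4 x i32>, <4 x i1> } @llvm.umul.with.overflow.v4i32(<4 x i32>, <4 x i32>)

    define <4 x i32> @vec_umulo(<4 x i32> %a, <4 x i32> %b) {
      %t = call { <4 x i32>, <4 x i1> } @llvm.umul.with.overflow.v4i32(<4 x i32> %a, <4 x i32> %b)
      %mul = extractvalue { <4 x i32>, <4 x i1> } %t, 0
      %ov = extractvalue { <4 x i32>, <4 x i1> } %t, 1
      ; Zero out the lanes that overflowed.
      %r = select <4 x i1> %ov, <4 x i32> zeroinitializer, <4 x i32> %mul
      ret <4 x i32> %r
    }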
* [X86] Add more load folding patterns for blend instructions as a follow up to r354363.
  Craig Topper | 2019-02-20 | 6 files | -32/+32

  This avoids depending on the peephole pass to do load folding. Also adds some load folding for some insert_subvector patterns that use blend.

  All of this was found by temporarily adding TB_NO_FORWARD to the blend immediate entries in the load folding tables. I've added -disable-peephole to some of the affected tests from that experiment to ensure we're testing isel patterns.

  llvm-svn: 354511
* Fix testcase.
  Nirav Dave | 2019-02-20 | 1 file | -0/+1

  llvm-svn: 354508
* Add test case.
  Nirav Dave | 2019-02-20 | 1 file | -0/+71

  llvm-svn: 354504
* [X86] Add test case to show missed opportunity to remove an explicit AND on the bit position from BT when it has known zeros. NFC
  Craig Topper | 2019-02-20 | 1 file | -0/+31

  If the bit position has known zeros in it, then the AND immediate will likely be optimized to remove bits. This can prevent GetDemandedBits from recognizing that the AND is unnecessary.

  llvm-svn: 354501
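  A hypothetical illustration (not the committed test): because the bit position below has a known-zero low bit, earlier combines can shrink the mask from 31 to 30, which then hides from GetDemandedBits the fact that BT ignores the upper bits of the position anyway.

    define i1 @bt_masked(i32 %x, i32 %n) {
      %pos2 = shl i32 %n, 1            ; bit position with a known-zero LSB
      %pos = and i32 %pos2, 30         ; mask already shrunk from 31 to 30
      %bit = shl i32 1, %pos
      %m = and i32 %x, %bit
      %cmp = icmp ne i32 %m, 0         ; test bit 'pos' of x -> BT
      ret i1 %cmp
    }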
* Revert r354498 "[X86] Add test case to show missed opportunity to remove an explicit AND on the bit position from BT when it has known zeros."
  Craig Topper | 2019-02-20 | 1 file | -29/+0

  I accidentally committed more than just the test.

  llvm-svn: 354499
* [X86] Add test case to show missed opportunity to remove an explicit AND on the bit position from BT when it has known zeros.
  Craig Topper | 2019-02-20 | 1 file | -0/+29

  If the bit position has known zeros in it, then the AND immediate will likely be optimized to remove bits. This can prevent GetDemandedBits from recognizing that the AND is unnecessary.

  llvm-svn: 354498
* [CGP][x86] add tests for usubo special-case; NFC
  Sanjay Patel | 2019-02-20 | 1 file | -9/+26

  This is another example from PR31754:
  https://bugs.llvm.org/show_bug.cgi?id=31754

  llvm-svn: 354475
* [X86] Remove FeatureSlowIncDec from Sandy Bridge and later Intel Core CPUs
  Craig Topper | 2019-02-20 | 2 files | -3/+3

  Summary: Inc and Dec were at one point slow on Intel CPUs due to their tendency to cause partial flag stalls on P6 derived CPU cores. This is because these instructions are defined to preserve the carry flag. This partial flag stall issue persisted until Sandy Bridge, when flag merging was changed to be handled as a data dependency instead of as a stall until retirement. Sandy Bridge and later CPUs rename the C flag separately from OSPAZ, so there is no flag merge needed on INC/DEC to preserve the C flag.

  Given these improvements, I don't know why INC/DEC was ever considered slow on Sandy Bridge. If anything, they should have been disabled on the earlier CPUs instead.

  Note: after this patch, INC/DEC are still considered slow on Silvermont, Goldmont, Knights Landing and our generic "x86-64" CPU.

  Reviewers: spatel, RKSimon, chandlerc
  Reviewed By: chandlerc
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D58412
  llvm-svn: 354436
* [X86] Mark FP32_TO_INT16_IN_MEM/FP32_TO_INT32_IN_MEM/FP32_TO_INT64_IN_MEM as clobbering EFLAGS to prevent mis-scheduling during conversion from SelectionDAG to MIR.
  Craig Topper | 2019-02-19 | 2 files | -52/+52

  After r354178, these instructions expand to a sequence that uses an OR instruction. That OR clobbers EFLAGS, so we need to state that to avoid accidentally using the clobbered flags.

  Our tests show the bug, but I didn't notice because the SETcc instructions didn't move after r354178 since it used to be safe to do the fp->int conversion first.

  We should probably convert this whole sequence to SelectionDAG instead of a custom inserter to avoid mistakes like this.

  Fixes PR40779

  llvm-svn: 354395
* [X86][SSE] Generalize X86ISD::BLENDI support to more value types
  Simon Pilgrim | 2019-02-19 | 17 files | -325/+227

  D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.

  With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.

  This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.

  I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).

  The vector-reduce-smin/smax regressions will be fixed in a future improvement to SimplifyDemandedBits to peek through bitcasts and support X86ISD::BLENDV.

  Reapplied after reversion at rL353699 - AVX2 isel fix was applied at rL354358, additional test at rL354360/rL354361.

  Differential Revision: https://reviews.llvm.org/D57888
  llvm-svn: 354363
* Fix stupid assembly comment typo
  Simon Pilgrim | 2019-02-19 | 1 file | -2/+2

  llvm-svn: 354361
* [X86][SSE] Add pblendw commuted load test case
  Simon Pilgrim | 2019-02-19 | 1 file | -3/+15

  Reduced test case for the regression caused in D57888/rL353610.

  llvm-svn: 354360
* [SDAG] Use shift amount type in MULO promotion; NFC
  Nikita Popov | 2019-02-19 | 1 file | -0/+9

  Directly use the correct shift amount type if it is possible, and future-proof the code against vectors. The added test makes sure that bitwidths that do not fit into the shift amount type do not assert.

  Split out from D57997.

  llvm-svn: 354359
* [X86][AVX] Update VBROADCAST folds to always use v2i64 X86vzload
  Simon Pilgrim | 2019-02-19 | 3 files | -6/+3

  The VBROADCAST combines and SimplifyDemandedVectorElts improvements mean that we now more consistently use shorter (128-bit) X86vzload input operands.

  Follow-up to D58053.

  llvm-svn: 354346
* [X86][AVX] EltsFromConsecutiveLoads - Add BROADCAST lowering support
  Simon Pilgrim | 2019-02-19 | 10 files | -142/+59

  This patch adds scalar/subvector BROADCAST handling to EltsFromConsecutiveLoads.

  It mainly shows codegen changes to 32-bit code which failed to handle i64 loads, although 64-bit code is also using this new path to more efficiently combine to a broadcast load.

  Differential Revision: https://reviews.llvm.org/D58053
  llvm-svn: 354340