path: root/llvm/lib/Target/X86/X86ISelLowering.cpp
Commit message · Author · Age · Files · Lines
...
* [X86] Pull out combineOrShiftToFunnelShift helper. NFCI. [Simon Pilgrim, 2019-10-29, 1 file, -51/+64]

* [X86] Add a DAG combine to turn (and (bitcast (vXi1 (concat_vectors (vYi1 setcc), undef,))), C) into (bitcast (vXi1 (concat_vectors (vYi1 setcc), zero,))) [Craig Topper, 2019-10-28, 1 file, -0/+68]
  The legalization of v2i1->i2 or v4i1->i4 bitcasts followed by a setcc can create an and after the bitcast. If we're lucky enough that the input to the bitcast is a concat_vectors whose first operand is a setcc that can natively zero all the upper bits of a k-register, then we should replace the other operands of the concat_vectors with zero in order to remove the AND. With the AND removed we might be able to use a kortest on the result.
  Differential Revision: https://reviews.llvm.org/D69205

* [X86] Fix 48/96 byte memcmp code gen [David Zarzycki, 2019-10-28, 1 file, -2/+21]
  Detect scalar ISD::ZERO_EXTEND generated by memcmp lowering and convert it to ISD::INSERT_SUBVECTOR.
  https://reviews.llvm.org/D69464

* [X86] Prefer KORTEST on Knights Landing or later for memcmp() [David Zarzycki, 2019-10-26, 1 file, -17/+50]
  PTEST and especially the MOVMSK instructions are slow on Knights Landing or later. As a bonus, this patch increases instruction parallelism by emitting:
    KORTEST(PCMPNEQ(a, b), PCMPNEQ(c, d)) == 0
  instead of:
    KORTEST(AND(PCMPEQ(a, b), PCMPEQ(c, d))) == ~0
  https://reviews.llvm.org/D69157
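The equivalence the KORTEST commit relies on can be sketched outside LLVM. The following Python model is a hedged illustration only (helper names mirror the instruction mnemonics but are invented here, and real k-register semantics are richer): it checks that testing the OR of the PCMPNEQ masks for all-zero agrees with testing the AND of the PCMPEQ masks for all-ones.

```python
import random

def pcmpeq(a, b):
    # Per-lane equality: True where lanes match (models PCMPEQ writing a mask).
    return [x == y for x, y in zip(a, b)]

def pcmpneq(a, b):
    # Per-lane inequality mask.
    return [x != y for x, y in zip(a, b)]

def kortest_is_zero(k1, k2):
    # KORTEST sets ZF when the OR of both masks is all-zero.
    return not any(k1) and not any(k2)

def kortest_is_allones(k1, k2):
    # KORTEST sets CF when the OR of both masks is all-ones.
    return all(x or y for x, y in zip(k1, k2))

def memcmp_eq_new(a, b, c, d):
    # New sequence: buffers equal iff no lane differs in either half.
    return kortest_is_zero(pcmpneq(a, b), pcmpneq(c, d))

def memcmp_eq_old(a, b, c, d):
    # Old sequence: AND the per-half equality masks, then check for all-ones.
    k = [x and y for x, y in zip(pcmpeq(a, b), pcmpeq(c, d))]
    return kortest_is_allones(k, k)

random.seed(0)
for _ in range(1000):
    a, b, c, d = ([random.randrange(3) for _ in range(4)] for _ in range(4))
    assert memcmp_eq_new(a, b, c, d) == memcmp_eq_old(a, b, c, d)
print("sequences agree")
```

Note that the new form also lets the two PCMPNEQ halves execute independently, which is the instruction-parallelism benefit the message describes.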
* [X86] Add a check for SSE2 to the top of combineReductionToHorizontal. [Craig Topper, 2019-10-25, 1 file, -0/+4]
  Without this, we can create a PSADBW node that isn't legal.

* [X86] combineX86ShufflesRecursively - assert the root mask is legal. NFCI. [Simon Pilgrim, 2019-10-23, 1 file, -0/+3]

* [X86][SSE] Add OR(EXTRACTELT(X,0),OR(EXTRACTELT(X,1))) -> MOVMSK+CMP reduction combine [Simon Pilgrim, 2019-10-21, 1 file, -0/+18]
  llvm-svn: 375463
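As a sketch of why an OR-of-extracts reduction can become MOVMSK+CMP when the lanes hold lane-wide booleans (all-ones or all-zero, as vector compares produce), here is a small model; `movmsk` collects sign bits the way MOVMSKPS does, and all names are illustrative, not LLVM code:

```python
import itertools

def movmsk(v, bits=32):
    # Gather the sign bit of each lane into a scalar mask.
    m = 0
    for i, lane in enumerate(v):
        m |= ((lane >> (bits - 1)) & 1) << i
    return m

def or_reduce(v):
    # OR(EXTRACTELT(X,0), OR(EXTRACTELT(X,1), ...)) flattened.
    r = 0
    for lane in v:
        r |= lane
    return r

ALL = 0xFFFFFFFF  # a compare result lane is all-ones or all-zero
for lanes in itertools.product([0, ALL], repeat=4):
    # The scalar OR is nonzero exactly when the movmsk mask is nonzero.
    assert (or_reduce(lanes) != 0) == (movmsk(lanes) != 0)
print("OR-reduction matches MOVMSK+CMP")
```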
* [X86] Rename matchBitOpReduction to matchScalarReduction. NFCI. [Simon Pilgrim, 2019-10-21, 1 file, -4/+4]
  This doesn't need to be just for bitops, but the ops do need to be fully associative.
  llvm-svn: 375445

* [X86] Check Subtarget.hasSSE3() before calling shouldUseHorizontalOp and emitting X86ISD::FHADD in LowerUINT_TO_FP_i64. [Craig Topper, 2019-10-20, 1 file, -1/+1]
  This was a regression from r375341. Fixes PR43729.
  llvm-svn: 375381

* [X86] Pulled out helper to decode target shuffle element sentinel values to 'Zeroable' known undef/zero bits. NFCI. [Simon Pilgrim, 2019-10-19, 1 file, -13/+22]
  Renamed 'resolveTargetShuffleAndZeroables' to 'resolveTargetShuffleFromZeroables' to match.
  llvm-svn: 375348

* [X86][SSE] lowerV16I8Shuffle - tryToWidenViaDuplication - undef unpack args [Simon Pilgrim, 2019-10-19, 1 file, -1/+9]
  tryToWidenViaDuplication lowers using the shuffle_v8i16(unpack_v16i8(shuffle_v8i16(x),shuffle_v8i16(x))) pattern, but the unpack only needs the even/odd v16i8 args if the original v16i8 shuffle mask references the even/odd elements - which isn't true for many extension-style shuffles.
  llvm-svn: 375342

* [X86][SSE] LowerUINT_TO_FP_i64 - only use HADDPD for size/fast-hops [Simon Pilgrim, 2019-10-19, 1 file, -12/+11]
  We were always generating a single-source HADDPD, but really we should only do this if shouldUseHorizontalOp says it's a good idea.
  Differential Revision: https://reviews.llvm.org/D69175
  llvm-svn: 375341

* [X86] combineX86ShufflesRecursively - pull out isTargetShuffleVariableMask. NFCI. [Simon Pilgrim, 2019-10-18, 1 file, -1/+2]
  llvm-svn: 375253

* [X86] Emit KTEST when possible [David Zarzycki, 2019-10-18, 1 file, -8/+23]
  https://reviews.llvm.org/D69111
  llvm-svn: 375197
* [DAGCombine][ARM] Enable extending masked loads [Sam Parker, 2019-10-17, 1 file, -0/+3]
  Add a generic DAG combine for extending masked loads. This allows us to generate sext/zext masked loads which can access v4i8, v8i8 and v4i16 memory to produce v4i32, v8i16 and v4i32 respectively.
  Differential Revision: https://reviews.llvm.org/D68337
  llvm-svn: 375085

* [X86] combineX86ShufflesRecursively - split the getTargetShuffleInputs call from the resolveTargetShuffleAndZeroables call. [Simon Pilgrim, 2019-10-15, 1 file, -3/+9]
  Exposes an issue in getFauxShuffleMask where the OR(SHUFFLE,SHUFFLE) decode should always resolve zero/undef elements.
  Part of the fix for PR43024, where ideally we shouldn't call resolveTargetShuffleAndZeroables for Depth == 0.
  llvm-svn: 374928

* [X86] Make memcmp() use PTEST if possible and also enable AVX1 [David Zarzycki, 2019-10-15, 1 file, -17/+46]
  llvm-svn: 374922

* [X86] Resolve KnownUndef/KnownZero bits into target shuffle masks in helper. NFCI. [Simon Pilgrim, 2019-10-15, 1 file, -9/+18]
  llvm-svn: 374878

* [X86] Don't check for VBROADCAST_LOAD being a user of the source of a VBROADCAST when trying to share broadcasts. [Craig Topper, 2019-10-15, 1 file, -3/+1]
  The only things a VBROADCAST_LOAD uses are an address and a chain node. It has no vector inputs. So if it's a user of the source of another broadcast, that could only mean one of two things: the other broadcast is broadcasting the address of the broadcast_load, or the source is a load and the use we're seeing is the chain result from that load. Neither of these cases makes sense to combine here.
  This issue was reported post-commit r373871. Test case has not been reduced yet.
  llvm-svn: 374862

* [X86] Teach EmitTest to handle ISD::SSUBO/USUBO in order to use the Z flag from the subtract directly during isel. [Craig Topper, 2019-10-14, 1 file, -0/+7]
  This prevents isel from emitting a TEST instruction that optimizeCompareInstr will need to remove later.
  In some of the modified tests, the SUB gets duplicated due to the flags being needed in two places and being clobbered in between. optimizeCompareInstr was able to optimize away the TEST that was using the result of one of them, but optimizeCompareInstr doesn't know to turn SUB into CMP after removing the TEST. It only knows how to turn SUB into CMP if the result was already dead.
  With this change the TEST never exists, so optimizeCompareInstr doesn't have to remove it. Then it can just turn the SUB into CMP immediately.
  Fixes PR43649.
  llvm-svn: 374755
* [X86] getTargetShuffleInputs - Control KnownUndef mask element resolution as well as KnownZero. [Simon Pilgrim, 2019-10-13, 1 file, -11/+11]
  We were already controlling whether the KnownZero elements were being written to the target mask; this extends it to the KnownUndef elements as well, so we can prevent the target shuffle mask being manipulated at all.
  llvm-svn: 374732

* [X86] Enable use of avx512 saturating truncate instructions in more cases. [Craig Topper, 2019-10-13, 1 file, -35/+42]
  This enables use of the saturating truncate instructions when the result type is less than 128 bits. It also enables the use of saturating truncate instructions on KNL when the input is less than 512 bits. We can do this by widening the input and then extracting the result.
  llvm-svn: 374731

* [X86] SimplifyMultipleUseDemandedBitsForTargetNode - use getTargetShuffleInputs with KnownUndef/Zero results. [Simon Pilgrim, 2019-10-13, 1 file, -12/+11]
  llvm-svn: 374725

* [X86] getTargetShuffleInputs - add KnownUndef/Zero output support [Simon Pilgrim, 2019-10-13, 1 file, -25/+25]
  Adjust SimplifyDemandedVectorEltsForTargetNode to use the known elts masks instead of recomputing them locally.
  llvm-svn: 374724

* [X86] Add a one use check on the setcc to the min/max canonicalization code in combineSelect. [Craig Topper, 2019-10-13, 1 file, -0/+1]
  This seems to improve std::midpoint code where we have a min and a max with the same condition. If we split the setcc we can end up with two compares if one of the operands is a constant, since we aggressively canonicalize compares with constants. For non-constants it can interfere with our ability to share control flow if we need to expand cmovs into control flow.
  I'm also not sure I understand this min/max canonicalization code. The motivating case talks about comparing with 0, but we don't check for 0 explicitly.
  Removes one instruction from the codegen for PR43658.
  llvm-svn: 374706

* [X86] Enable v4i32->v4i16 and v8i16->v8i8 saturating truncates to use pack instructions with avx512. [Craig Topper, 2019-10-13, 1 file, -0/+1]
  llvm-svn: 374705
* [X86] scaleShuffleMask - use size_t Scale to avoid overflow warnings [Simon Pilgrim, 2019-10-12, 1 file, -3/+3]
  llvm-svn: 374674

* Replace for-loop of SmallVector::push_back with SmallVector::append. NFCI. [Simon Pilgrim, 2019-10-12, 1 file, -4/+2]
  llvm-svn: 374669

* Fix cppcheck shadow variable name warnings. NFCI. [Simon Pilgrim, 2019-10-12, 1 file, -6/+6]
  llvm-svn: 374668

* [X86] Use any_of/all_of patterns in shuffle mask pattern recognisers. NFCI. [Simon Pilgrim, 2019-10-12, 1 file, -24/+13]
  llvm-svn: 374667

* [X86][SSE] Avoid unnecessary PMOVZX in v4i8 sum reduction [Simon Pilgrim, 2019-10-12, 1 file, -7/+18]
  This should go away once D66004 has landed and we can simplify shuffle chains using demanded elts.
  llvm-svn: 374658

* [X86] Use pack instructions for packus/ssat truncate patterns when 256-bit is the largest legal vector and the result type is at least 256 bits. [Craig Topper, 2019-10-12, 1 file, -1/+4]
  Since the input type is larger than 256 bits, we'll need to do some concatenating to reassemble the results. The pack instructions' ability to concatenate while packing makes this a shorter/faster sequence.
  llvm-svn: 374643
* [X86] Fold a VTRUNCS/VTRUNCUS+store into a saturating truncating store. [Craig Topper, 2019-10-12, 1 file, -13/+11]
  We already did this for VTRUNCUS with a specific combination of types. This extends it to VTRUNCS and handles any types where a truncating store is legal.
  llvm-svn: 374615

* [X86][SSE] Add support for v4i8 add reduction [Simon Pilgrim, 2019-10-11, 1 file, -2/+7]
  llvm-svn: 374579

* [X86] isFNEG - add recursion depth limit [Simon Pilgrim, 2019-10-11, 1 file, -5/+9]
  Now that it's used by isNegatibleForFree we should try to avoid costly deep recursion.
  llvm-svn: 374534

* [X86] Add a DAG combine to turn v16i16->v16i8 VTRUNCUS+store into a saturating truncating store. [Craig Topper, 2019-10-11, 1 file, -0/+13]
  llvm-svn: 374509

* [X86] Improve the AVX512 bailout in combineTruncateWithSat to allow pack instructions in more situations. [Craig Topper, 2019-10-11, 1 file, -2/+9]
  If we don't have VLX we won't end up selecting a saturating truncate for 256-bit or smaller vectors, so we should just use the pack lowering.
  llvm-svn: 374487

* [X86] Guard against leaving a dangling node in combineTruncateWithSat. [Craig Topper, 2019-10-10, 1 file, -4/+13]
  When handling the packus pattern for i32->i8 we do a two-step process using a packss to i16 followed by a packus to i8. If the final i8 type has fewer than 64 bits the packus step will return SDValue(), but the i32->i16 step might have succeeded. This leaves the nodes from the middle step dangling.
  Guard against this by pre-checking that the number of elements is at least 8 before doing the middle step.
  With that check in place, this should mean the only other case where the middle step itself can fail is when SSE2 is disabled. So add an early SSE2 check, then just assert that neither the middle nor the final step ever fails.
  llvm-svn: 374460
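The two-step lowering described above is exact at the value level: a signed-saturating i32->i16 narrow followed by an unsigned-saturating i16->u8 narrow clamps to [0, 255], the same as a direct i32->u8 packus. A sketch that models only per-element saturation (the real pack instructions also interleave and concatenate lanes, which this deliberately ignores):

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def packss_i32_to_i16(x):
    # Signed-saturating narrow: i32 -> i16.
    return clamp(x, -32768, 32767)

def packus_i16_to_i8(x):
    # Unsigned-saturating narrow: i16 -> u8.
    return clamp(x, 0, 255)

def packus_i32_to_i8(x):
    # The pattern being matched: clamp a signed i32 straight to [0, 255].
    return clamp(x, 0, 255)

# Sparse sweep of the full i32 range plus a dense check near the boundaries.
for x in range(-(1 << 31), 1 << 31, 1 << 20):
    assert packus_i16_to_i8(packss_i32_to_i16(x)) == packus_i32_to_i8(x)
for x in range(-300, 300):
    assert packus_i16_to_i8(packss_i32_to_i16(x)) == packus_i32_to_i8(x)
print("two-step packss+packus matches direct i32->i8 packus")
```

The key design point is that the intermediate i16 range [-32768, 32767] contains [0, 255], so the first clamp never discards a value the second clamp would have kept.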
* [X86] Use packusdw+vpmovuswb to implement v16i32->v16i8 that clamps signed inputs to be between 0 and 255 when zmm registers are disabled on SKX. [Craig Topper, 2019-10-10, 1 file, -0/+15]
  If we've disabled zmm registers, the v16i32 will need to be split. This split will propagate through the min/max and the truncate. This creates two sequences that need to be concatenated back to v16i8. We can instead use packusdw to do part of the clamping, truncating, and concatenating all at once. Then we can use a vpmovuswb to finish off the clamp.
  Differential Revision: https://reviews.llvm.org/D68763
  llvm-svn: 374431
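At the element level, the claim here is that an unsigned-saturating i32->u16 narrow (packusdw) followed by an unsigned-saturating u16->u8 truncate (vpmovuswb) clamps a signed input to [0, 255]. A small model of just the saturation math, ignoring lane placement; the function names echo the mnemonics but are invented for this sketch:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def packusdw(x):
    # Unsigned-saturating narrow per element: signed i32 -> u16.
    return clamp(x, 0, 65535)

def vpmovuswb(x):
    # Unsigned-saturating truncate per element: u16 -> u8.
    return clamp(x, 0, 255)

# Composing the two saturations gives the min/max+truncate clamp directly.
for x in range(-70000, 70000, 7):
    assert vpmovuswb(packusdw(x)) == clamp(x, 0, 255)
print("packusdw+vpmovuswb clamps signed inputs to [0, 255]")
```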
* [X86] combineFMA - Convert to use isNegatibleForFree/GetNegatedExpression. [Simon Pilgrim, 2019-10-10, 1 file, -7/+84]
  Split off from D67557.
  llvm-svn: 374356

* [X86] combineFMADDSUB - Convert to use isNegatibleForFree/GetNegatedExpression. [Simon Pilgrim, 2019-10-10, 1 file, -10/+18]
  Split off from D67557; fixes the compile-time regression mentioned in rL372756.
  llvm-svn: 374351

* [DAG][X86] Add isNegatibleForFree/GetNegatedExpression override placeholders. NFCI. [Simon Pilgrim, 2019-10-10, 1 file, -0/+16]
  Continuing to undo the rL372756 reversion.
  Differential Revision: https://reviews.llvm.org/D67557
  llvm-svn: 374345

* Conservatively add volatility and atomic checks in a few places [Philip Reames, 2019-10-09, 1 file, -3/+5]
  As background, starting in D66309, I'm working on supporting unordered atomics analogous to volatile flags on normal LoadSDNode/StoreSDNodes for X86. As part of that, I spent some time going through usages of LoadSDNode and StoreSDNode looking for cases where we might have missed a volatility check or need an atomic check. I couldn't find any cases that clearly miscompile - i.e. no test cases - but a couple of pieces of code look suspicious, though I can't figure out how to exercise them.
  This patch adds defensive checks and asserts in the places my manual audit found. If anyone has any ideas on how to either a) disprove any of the checks, or b) hit the bug they might be fixing, I welcome suggestions.
  Differential Revision: https://reviews.llvm.org/D68419
  llvm-svn: 374261

* [X86] Shrink zero extends of gather indices from types smaller than i32 to i32 rather than i64. [Craig Topper, 2019-10-07, 1 file, -44/+26]
  Gather instructions can use i32 or i64 elements for indices. If the index is zero extended from a type smaller than i32 to i64, we can shrink the extend to just extend to i32.
  llvm-svn: 373982
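The reasoning behind shrinking the extend can be modeled simply: an index zero-extended from u8 or u16 always fits in 32 bits, so computing gather addresses with i32 index lanes yields the same addresses as i64 lanes. A sketch with invented helper names, not LLVM code:

```python
def zext(x, from_bits):
    # Zero extension never changes the numeric value of an unsigned input.
    assert 0 <= x < (1 << from_bits)
    return x

def gather_addresses(base, indices, from_bits, index_bits, scale=4):
    # Model a gather: each address is base + zext(index) * scale, with the
    # extended index held in an index_bits-wide lane.
    return [base + (zext(i, from_bits) & ((1 << index_bits) - 1)) * scale
            for i in indices]

indices = [0, 1, 17, 255]  # u8 indices
a64 = gather_addresses(0x1000, indices, 8, 64)
a32 = gather_addresses(0x1000, indices, 8, 32)
assert a64 == a32
print("i32 index lanes address the same elements as i64 lanes")
```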
* [X86] Add new calling convention that guarantees tail call optimization [Reid Kleckner, 2019-10-07, 1 file, -8/+11]
  When the target option GuaranteedTailCallOpt is specified, calls with the fastcc calling convention will be transformed into tail calls if they are in tail position. This diff adds a new calling convention, tailcc, currently supported only on X86, which behaves the same way as fastcc, except that the GuaranteedTailCallOpt flag does not need to be enabled in order to enable tail call optimization.
  Patch by Dwight Guth <dwight.guth@runtimeverification.com>!
  Reviewed By: lebedev.ri, paquette, rnk
  Differential Revision: https://reviews.llvm.org/D67855
  llvm-svn: 373976

* [X86][SSE] getTargetShuffleInputs - move VT.isSimple/isVector checks inside. NFCI. [Simon Pilgrim, 2019-10-07, 1 file, -4/+11]
  Stop all the callers from having to check the value type before calling getTargetShuffleInputs.
  llvm-svn: 373915

* [X86][AVX] Access a scalar float/double as a free extract from a broadcast load (PR43217) [Simon Pilgrim, 2019-10-06, 1 file, -11/+24]
  If an fp scalar is loaded and then used as both a scalar and a vector broadcast, perform the load as a broadcast and then extract the scalar for 'free' from the 0th element.
  This involved switching the order of the X86ISD::BROADCAST combines so we only convert to X86ISD::BROADCAST_LOAD once all other canonicalizations have been attempted.
  Adds a DAGCombinerInfo::recursivelyDeleteUnusedNodes wrapper.
  Fixes PR43217
  Differential Revision: https://reviews.llvm.org/D68544
  llvm-svn: 373871

* Fix signed/unsigned warning. NFCI [Simon Pilgrim, 2019-10-06, 1 file, -1/+1]
  llvm-svn: 373870

* [X86][SSE] Remove resolveTargetShuffleInputs and use getTargetShuffleInputs directly. [Simon Pilgrim, 2019-10-06, 1 file, -42/+22]
  Move the resolveTargetShuffleInputsAndMask call to after the shuffle mask combine, before the undef/zero constant fold, instead.
  llvm-svn: 373868

* [X86][SSE] Don't merge known undef/zero elements into target shuffle masks. [Simon Pilgrim, 2019-10-06, 1 file, -30/+50]
  Replaces setTargetShuffleZeroElements with getTargetShuffleAndZeroables, which reports the Zeroable elements but doesn't merge them into the decoded target shuffle mask (the merging has been moved up into getTargetShuffleInputs until we can get rid of it entirely).
  This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs but instead we cache them in a parallel Zeroable mask.
  llvm-svn: 373867