path: root/llvm/lib/Target/X86
Commit message | Author | Age | Files | Lines
...
* [X86] getTargetShuffleInputs - add KnownUndef/Zero output support (Simon Pilgrim, 2019-10-13, 1 file, -25/+25)
    Adjust SimplifyDemandedVectorEltsForTargetNode to use the known element masks instead of recomputing them locally. llvm-svn: 374724
* [X86] Add a one use check on the setcc to the min/max canonicalization code in combineSelect (Craig Topper, 2019-10-13, 1 file, -0/+1)
    This seems to improve std::midpoint code where we have a min and a max with the same condition. If we split the setcc we can end up with two compares if one of the operands is a constant, since we aggressively canonicalize compares with constants. For non-constants it can interfere with our ability to share control flow if we need to expand cmovs into control flow. I'm also not sure I understand this min/max canonicalization code. The motivating case talks about comparing with 0, but we don't check for 0 explicitly. Removes one instruction from the codegen for PR43658. llvm-svn: 374706
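    For context, a minimal C++ sketch (hypothetical, not taken from the patch) of the kind of
    std::midpoint-style source where one comparison feeds both a min-style and a max-style
    select; the one-use check above keeps that comparison shared instead of splitting it:

        #include <cstdint>

        // One compare whose result selects both the smaller and the larger value.
        // The exact arithmetic is illustrative, not copied from libstdc++.
        uint32_t midpoint_sketch(uint32_t a, uint32_t b) {
          bool a_le_b = a <= b;          // single compare
          uint32_t lo = a_le_b ? a : b;  // min-style select on that compare
          uint32_t hi = a_le_b ? b : a;  // max-style select on the same compare
          return lo + (hi - lo) / 2;     // midpoint without overflow
        }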
* [X86] Enable v4i32->v4i16 and v8i16->v8i8 saturating truncates to use pack instructions with avx512 (Craig Topper, 2019-10-13, 1 file, -0/+1)
    llvm-svn: 374705
* [X86] scaleShuffleMask - use size_t Scale to avoid overflow warnings (Simon Pilgrim, 2019-10-12, 2 files, -7/+7)
    llvm-svn: 374674
* Replace for-loop of SmallVector::push_back with SmallVector::append. NFCI. (Simon Pilgrim, 2019-10-12, 1 file, -4/+2)
    llvm-svn: 374669
* Fix cppcheck shadow variable name warnings. NFCI. (Simon Pilgrim, 2019-10-12, 1 file, -6/+6)
    llvm-svn: 374668
* [X86] Use any_of/all_of patterns in shuffle mask pattern recognisers. NFCI. (Simon Pilgrim, 2019-10-12, 1 file, -24/+13)
    llvm-svn: 374667
* [X86][SSE] Avoid unnecessary PMOVZX in v4i8 sum reduction (Simon Pilgrim, 2019-10-12, 1 file, -7/+18)
    This should go away once D66004 has landed and we can simplify shuffle chains using demanded elts. llvm-svn: 374658
* [CostModel][X86] Improve sum reduction costs. (Simon Pilgrim, 2019-10-12, 1 file, -22/+23)
    I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2. I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674. llvm-svn: 374655
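    A minimal sketch (assumed example, not from the patch) of the byte sum reduction these vXi8
    costs model; on x86 the vectorized accumulation can be lowered with PSADBW against a zero
    vector:

        #include <cstddef>
        #include <cstdint>

        // Plain byte sum reduction; the vector body is the shape whose cost the
        // new vXi8 reduction entries describe.
        uint32_t sum_bytes(const uint8_t *data, size_t n) {
          uint32_t sum = 0;
          for (size_t i = 0; i < n; ++i)
            sum += data[i];
          return sum;
        }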
* [X86] Use pack instructions for packus/ssat truncate patterns when 256-bit is the largest legal vector and the result type is at least 256 bits (Craig Topper, 2019-10-12, 1 file, -1/+4)
    Since the input type is larger than 256 bits we'll need to do some concatenating to reassemble the results. The pack instructions' ability to concatenate while packing makes this a shorter/faster sequence. llvm-svn: 374643
* recommit: [LoopVectorize][PowerPC] Estimate int and float register pressure separately in loop-vectorize (Zi Xuan Wu, 2019-10-12, 2 files, -2/+3)
    In loop-vectorize, the interleave count and vectorization factor depend on the number of target registers. Currently it does not estimate register pressure separately for each register class (in particular for scalar types, float types should not be counted in the same bucket as int types), so the estimate is not accurate. Specifically, this causes too much interleaving/unrolling, resulting in too many register spills in the loop body and hurting performance. So we need to classify the register classes at the IR level; importantly, these are abstract register classes, not the target register classes the backend provides in its td files. They are used to establish the mapping between the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types. For example, on the POWER target the register count is special when VSX is enabled: the number of int scalar registers is 32 (GPR) and float is 64 (VSR), but int and float vector registers are both 64 (VSR). So there should be 2 kinds of register class when VSX is enabled, and 3 kinds when VSX is NOT enabled. On the POWER target this gives a big (+~30%) performance improvement in one specific benchmark (503.bwaves_r) of SPEC2017 and no other obvious regressions. Differential revision: https://reviews.llvm.org/D67148 llvm-svn: 374634
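    As an illustrative sketch (not taken from the patch), a loop that keeps both integer and
    floating-point values live at once; estimating pressure for the int and float register
    classes separately, instead of from one shared budget, is what this change is after:

        #include <cstddef>

        // Both a float accumulator and an integer accumulator stay live across
        // the loop; interleaving decisions should account for float (VSR) and
        // int (GPR) register pressure separately.
        void mixed_reduce(const float *a, const int *b, float *fsum, long *isum,
                          size_t n) {
          float facc = 0.0f;
          long iacc = 0;
          for (size_t i = 0; i < n; ++i) {
            facc += a[i] * 2.0f;
            iacc += b[i];
          }
          *fsum = facc;
          *isum = iacc;
        }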
* [X86] Fold a VTRUNCS/VTRUNCUS+store into a saturating truncating store. (Craig Topper, 2019-10-12, 1 file, -13/+11)
    We already did this for VTRUNCUS with a specific combination of types. This extends it to VTRUNCS and handles any types where a truncating store is legal. llvm-svn: 374615
* [X86][SSE] Add support for v4i8 add reduction (Simon Pilgrim, 2019-10-11, 1 file, -2/+7)
    llvm-svn: 374579
* [X86] isFNEG - add recursion depth limit (Simon Pilgrim, 2019-10-11, 1 file, -5/+9)
    Now that it's used by isNegatibleForFree we should try to avoid costly deep recursion. llvm-svn: 374534
* [X86] Add a DAG combine to turn v16i16->v16i8 VTRUNCUS+store into a saturating truncating store (Craig Topper, 2019-10-11, 1 file, -0/+13)
    llvm-svn: 374509
* [X86] Improve the AVX512 bailout in combineTruncateWithSat to allow pack instructions in more situations (Craig Topper, 2019-10-11, 1 file, -2/+9)
    If we don't have VLX we won't end up selecting a saturating truncate for 256-bit or smaller vectors, so we should just use the pack lowering. llvm-svn: 374487
* [X86] Guard against leaving a dangling node in combineTruncateWithSat. (Craig Topper, 2019-10-10, 1 file, -4/+13)
    When handling the packus pattern for i32->i8 we do a two step process using a packss to i16 followed by a packus to i8. If the final i8 type has fewer than 64 bits the packus step will return SDValue(), but the i32->i16 step might have succeeded. This leaves the nodes from the middle step dangling. Guard against this by pre-checking that the number of elements is at least 8 before doing the middle step. With that check in place, the only other case where the middle step itself can fail is when SSE2 is disabled. So add an early SSE2 check and then just assert that neither the middle nor the final step ever fails. llvm-svn: 374460
* [X86] Use packusdw+vpmovuswb to implement v16i32->v16i8 that clamps signed inputs to be between 0 and 255 when zmm registers are disabled on SKX (Craig Topper, 2019-10-10, 1 file, -0/+15)
    If we've disabled zmm registers, the v16i32 will need to be split. This split will propagate through the min/max and the truncate, creating two sequences that need to be concatenated back to v16i8. We can instead use packusdw to do part of the clamping, truncating, and concatenating all at once. Then we can use a vpmovuswb to finish off the clamp. Differential Revision: https://reviews.llvm.org/D68763 llvm-svn: 374431
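    The scalar shape of the clamp being lowered, as a hedged illustration (the names are
    invented): each signed 32-bit lane is clamped to [0, 255] and narrowed to a byte, which is
    the v16i32->v16i8 pattern handled above with packusdw + vpmovuswb:

        #include <cstdint>

        // Clamp a signed 32-bit value into [0, 255] and narrow it to 8 bits.
        static inline uint8_t clamp_to_u8(int32_t v) {
          if (v < 0)   v = 0;
          if (v > 255) v = 255;
          return (uint8_t)v;
        }

        // Applied across 16 lanes this is the v16i32 -> v16i8 clamping truncate.
        void clamp16(const int32_t in[16], uint8_t out[16]) {
          for (int i = 0; i < 16; ++i)
            out[i] = clamp_to_u8(in[i]);
        }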
* [X86] combineFMA - Convert to use isNegatibleForFree/GetNegatedExpression. (Simon Pilgrim, 2019-10-10, 1 file, -7/+84)
    Split off from D67557. llvm-svn: 374356
* [X86] combineFMADDSUB - Convert to use isNegatibleForFree/GetNegatedExpression. (Simon Pilgrim, 2019-10-10, 1 file, -10/+18)
    Split off from D67557; fixes the compile time regression mentioned in rL372756. llvm-svn: 374351
* [DAG][X86] Add isNegatibleForFree/GetNegatedExpression override placeholders. NFCI. (Simon Pilgrim, 2019-10-10, 2 files, -0/+27)
    Continuing to undo the rL372756 reversion. Differential Revision: https://reviews.llvm.org/D67557 llvm-svn: 374345
* Conservatively add volatility and atomic checks in a few places (Philip Reames, 2019-10-09, 1 file, -3/+5)
    As background, starting in D66309, I'm working on supporting unordered atomics analogous to volatile flags on normal LoadSDNode/StoreSDNodes for X86. As part of that, I spent some time going through usages of LoadSDNode and StoreSDNode looking for cases where we might have missed a volatility check or need an atomic check. I couldn't find any cases that clearly miscompile - i.e. no test cases - but a couple of pieces of code look suspicious, though I can't figure out how to exercise them. This patch adds defensive checks and asserts in the places my manual audit found. If anyone has any ideas on how to either a) disprove any of the checks, or b) hit the bug they might be fixing, I welcome suggestions. Differential Revision: https://reviews.llvm.org/D68419 llvm-svn: 374261
* Revert "[LoopVectorize][PowerPC] Estimate int and float register pressure ↵Jinsong Ji2019-10-082-3/+2
| | | | | | | | | | | | | | separately in loop-vectorize" Also Revert "[LoopVectorize] Fix non-debug builds after rL374017" This reverts commit 9f41deccc0e648a006c9f38e11919f181b6c7e0a. This reverts commit 18b6fe07bcf44294f200bd2b526cb737ed275c04. The patch is breaking PowerPC internal build, checked with author, reverting on behalf of him for now due to timezone. llvm-svn: 374091
* [DebugInfo][If-Converter] Update call site info during the optimization (Nikola Prica, 2019-10-08, 1 file, -1/+1)
    During the If-Converter optimization, pay attention when copying or deleting call instructions in order to keep call site information in a valid state. Reviewers: aprantl, vsk, efriedma Reviewed By: vsk, efriedma Differential Revision: https://reviews.llvm.org/D66955 llvm-svn: 374068
* [LoopVectorize][PowerPC] Estimate int and float register pressure separately in loop-vectorize (Zi Xuan Wu, 2019-10-08, 2 files, -2/+3)
    In loop-vectorize, the interleave count and vectorization factor depend on the number of target registers. Currently it does not estimate register pressure separately for each register class (in particular for scalar types, float types should not be counted in the same bucket as int types), so the estimate is not accurate. Specifically, this causes too much interleaving/unrolling, resulting in too many register spills in the loop body and hurting performance. So we need to classify the register classes at the IR level; importantly, these are abstract register classes, not the target register classes the backend provides in its td files. They are used to establish the mapping between the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types. For example, on the POWER target the register count is special when VSX is enabled: the number of int scalar registers is 32 (GPR) and float is 64 (VSR), but int and float vector registers are both 64 (VSR). So there should be 2 kinds of register class when VSX is enabled, and 3 kinds when VSX is NOT enabled. On the POWER target this gives a big (+~30%) performance improvement in one specific benchmark (503.bwaves_r) of SPEC2017 and no other obvious regressions. Differential revision: https://reviews.llvm.org/D67148 llvm-svn: 374017
* [X86] Shrink zero extends of gather indices from types less than i32 to types larger than i32 (Craig Topper, 2019-10-07, 1 file, -44/+26)
    Gather instructions can use i32 or i64 elements for indices. If the index is zero extended from a type smaller than i32 to i64, we can shrink the extend to just extend to i32. llvm-svn: 373982
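    An illustrative gather-style loop (hypothetical, not from the commit) with 16-bit indices;
    once vectorized, the zero extend this combine shrinks is the u16 -> i64 extend feeding the
    gather, which only needs 32-bit indices here:

        #include <cstddef>
        #include <cstdint>

        // Indexed loads through a table of u16 indices.
        void gather_u16(const float *table, const uint16_t *idx, float *out, size_t n) {
          for (size_t i = 0; i < n; ++i)
            out[i] = table[idx[i]];  // idx[i] is zero extended before the gather
        }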
* [X86] Add new calling convention that guarantees tail call optimization (Reid Kleckner, 2019-10-07, 5 files, -12/+23)
    When the target option GuaranteedTailCallOpt is specified, calls with the fastcc calling convention will be transformed into tail calls if they are in tail position. This diff adds a new calling convention, tailcc, currently supported only on X86, which behaves the same way as fastcc, except that the GuaranteedTailCallOpt flag does not need to be enabled in order to enable tail call optimization. Patch by Dwight Guth <dwight.guth@runtimeverification.com>! Reviewed By: lebedev.ri, paquette, rnk Differential Revision: https://reviews.llvm.org/D67855 llvm-svn: 373976
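    Purely as an illustration of why a guaranteed-tail-call convention matters (this C++ is not
    from the patch and does not itself use tailcc): mutually recursive code like the following
    only runs in bounded stack space if its calls really are tail calls, which is the guarantee
    a front end gets by emitting the calls with tailcc:

        #include <cstdint>

        bool is_odd(uint64_t n);

        // With guaranteed tail calls this pair runs in constant stack space;
        // without the guarantee it can overflow the stack for large n.
        bool is_even(uint64_t n) {
          if (n == 0) return true;
          return is_odd(n - 1);   // must be a tail call
        }

        bool is_odd(uint64_t n) {
          if (n == 0) return false;
          return is_even(n - 1);  // likewise
        }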
* [X86][SSE] getTargetShuffleInputs - move VT.isSimple/isVector checks inside. NFCI. (Simon Pilgrim, 2019-10-07, 1 file, -4/+11)
    Stop all the callers from having to check the value type before calling getTargetShuffleInputs. llvm-svn: 373915
* [X86] Support LEA64_32r in processInstrForSlow3OpLEA and use INC/DEC when possible (Craig Topper, 2019-10-07, 1 file, -80/+110)
    Move the erasing and iterator updating inside to match the other slow LEA function. I've adapted code from optTwoAddrLEA and basically rebuilt the implementation here. We do lose the kill flags now, just like optTwoAddrLEA, but this runs late enough in the pipeline that it shouldn't really be a problem. llvm-svn: 373877
* [X86][AVX] Access a scalar float/double as a free extract from a broadcast load (PR43217) (Simon Pilgrim, 2019-10-06, 1 file, -11/+24)
    If an fp scalar is loaded and then used as both a scalar and a vector broadcast, perform the load as a broadcast and then extract the scalar for 'free' from the 0th element. This involved switching the order of the X86ISD::BROADCAST combines so we only convert to X86ISD::BROADCAST_LOAD once all other canonicalizations have been attempted. Adds a DAGCombinerInfo::recursivelyDeleteUnusedNodes wrapper. Fixes PR43217. Differential Revision: https://reviews.llvm.org/D68544 llvm-svn: 373871
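    A hypothetical source pattern with the dual use described above (names are illustrative):
    the loaded scalar is splatted across every lane of the vectorized loop and also consumed as
    an ordinary scalar, so one broadcast load plus a 'free' extract of lane 0 covers both uses:

        void scale_all(const double *coeff, const double *in, double *out,
                       double *coeff_plus_one, int n) {
          double c = *coeff;           // single load
          for (int i = 0; i < n; ++i)
            out[i] = in[i] * c;        // broadcast use after vectorization
          *coeff_plus_one = c + 1.0;   // scalar use of the same value
        }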
* Fix signed/unsigned warning. NFCI (Simon Pilgrim, 2019-10-06, 1 file, -1/+1)
    llvm-svn: 373870
* [X86][SSE] Remove resolveTargetShuffleInputs and use getTargetShuffleInputs directly (Simon Pilgrim, 2019-10-06, 1 file, -42/+22)
    Move the resolveTargetShuffleInputsAndMask call to after the shuffle mask combine, before the undef/zero constant fold instead. llvm-svn: 373868
* [X86][SSE] Don't merge known undef/zero elements into target shuffle masks. (Simon Pilgrim, 2019-10-06, 1 file, -30/+50)
    Replaces setTargetShuffleZeroElements with getTargetShuffleAndZeroables, which reports the Zeroable elements but doesn't merge them into the decoded target shuffle mask (the merging has been moved up into getTargetShuffleInputs until we can get rid of it entirely). This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs but instead we cache them in a parallel Zeroable mask. llvm-svn: 373867
* [X86] Add custom type legalization for v16i64->v16i8 truncate and v8i64->v8i8 truncate when v8i64 isn't legal (Craig Topper, 2019-10-06, 1 file, -3/+23)
    Summary: The default legalization for v16i64->v16i8 tries to create a multiple stage truncate, concatenating after each stage and truncating again. But avx512 implements truncates with multiple uops. So it should be better to truncate all the way to the desired element size and then concatenate the pieces using unpckl instructions. This minimizes the number of 2-uop truncates. The unpcks are all single uop instructions. I tried to handle this by just custom splitting the v16i64->v16i8 shuffle, and hoped that the DAG combiner would leave the two halves in the state needed to make D68374 do the job for each half. This worked for the first half, but the second half got messed up. So I've implemented custom handling for v8i64->v8i8 when v8i64 needs to be split to produce the VTRUNCs directly. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68428 llvm-svn: 373864
* [X86][SSE] resolveTargetShuffleInputs - call getTargetShuffleInputs instead of using setTargetShuffleZeroElements directly. NFCI. (Simon Pilgrim, 2019-10-06, 1 file, -5/+4)
    llvm-svn: 373855
* [X86][AVX] combineExtractSubvector - merge duplicate variables. NFCI. (Simon Pilgrim, 2019-10-06, 1 file, -18/+17)
    llvm-svn: 373849
* [X86][SSE] matchVectorShuffleAsBlend - use Zeroable element mask directly. (Simon Pilgrim, 2019-10-06, 1 file, -34/+13)
    We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly. This allows us to remove createTargetShuffleMask. This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask. llvm-svn: 373846
* [X86] Enable AVX512BW for memcmp() (David Zarzycki, 2019-10-06, 1 file, -2/+7)
    llvm-svn: 373845
* [X86][AVX] Push sign extensions of comparison bool results through bitops (PR42025) (Simon Pilgrim, 2019-10-05, 1 file, -6/+26)
    As discussed on PR42025, with more complex boolean math we can end up with many truncations/extensions of the comparison results through each bitop. This patch handles the cases introduced in combineBitcastvxi1 by pushing the sign extension through the AND/OR/XOR ops so it's just the original SETCC ops that get extended. Differential Revision: https://reviews.llvm.org/D68226 llvm-svn: 373834
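    An assumed example of the PR42025-style boolean math (not taken from the report): several
    compares combined with AND/OR before the result is used per element; pushing the sign
    extension through the bitops means only the compares are extended, rather than
    extending/truncating around every bitop:

        #include <cstdint>

        void select_in_range(const float *a, const float *b, const int32_t *x,
                             int32_t *out, int n) {
          for (int i = 0; i < n; ++i) {
            // Three compares feed AND/OR before the per-element select.
            bool keep = (a[i] > 0.0f && a[i] < 1.0f) || (b[i] == 0.0f);
            out[i] = keep ? x[i] : 0;
          }
        }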
* [X86] lowerShuffleAsLanePermuteAndRepeatedMask - variable renames. NFCI. (Simon Pilgrim, 2019-10-05, 1 file, -27/+27)
    Rename some variables to match lowerShuffleAsRepeatedMaskAndLanePermute - prep work toward adding some equivalent sublane functionality. llvm-svn: 373832
* [X86] Remove isel patterns for mask vpcmpgt/vpcmpeq. Switch vpcmp to these based on the immediate in MCInstLower (Craig Topper, 2019-10-04, 2 files, -146/+207)
    The immediate form of VPCMP can represent these completely. The vpcmpgt/eq are just shorter encodings. This patch removes the isel patterns and just swaps the opcodes and removes the immediate in MCInstLower. This matches where we do some other encoding tricks. Removes over 10K bytes from the isel table. Differential Revision: https://reviews.llvm.org/D68446 llvm-svn: 373766
* [X86] Add DAG combine to form saturating VTRUNCUS/VTRUNCS from VTRUNC (Craig Topper, 2019-10-04, 1 file, -0/+14)
    We already do this for ISD::TRUNCATE, but we can do the same for X86ISD::VTRUNC. Differential Revision: https://reviews.llvm.org/D68432 llvm-svn: 373765
* [X86] Enable inline memcmp() to use AVX512 (David Zarzycki, 2019-10-04, 1 file, -2/+1)
    llvm-svn: 373706
* [X86] Add v32i8 shuffle lowering strategy to recognize two v4i64 vectors truncated to v4i8 and concatenated into the lower 8 bytes with undef/zero upper bytes (Craig Topper, 2019-10-03, 1 file, -0/+44)
    This patch recognizes the shuffle pattern we get from a v8i64->v8i8 truncate when v8i64 isn't a legal type. With VLX we can use two VTRUNCs, unpckldq, and an insert_subvector. Differential Revision: https://reviews.llvm.org/D68374 llvm-svn: 373645
* [X86] matchShuffleWithSHUFPD - use Zeroable element mask directly. NFCI. (Simon Pilgrim, 2019-10-03, 1 file, -7/+7)
    We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly. This only leaves one user of createTargetShuffleMask, which we can hopefully get rid of in a similar manner. This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask. llvm-svn: 373641
* [X86] Add DAG combine to turn (bitcast (vbroadcast_load)) into just a vbroadcast_load if the scalar size is the same (Craig Topper, 2019-10-03, 2 files, -103/+17)
    This improves broadcast load folding of i64 elements on 32-bit targets where i64 isn't legal. Previously we had to represent these as vXf64 vbroadcast_loads and a bitcast to vXi64, but we didn't have any isel patterns looking for that. This also allows us to remove or simplify some isel patterns that were looking for bitcasted vbroadcast_loads. llvm-svn: 373566
* [X86] Add broadcast load folding patterns to NoVLX VPMULLQ/VPMAXSQ/VPMAXUQ/VPMINSQ/VPMINUQ patterns (Craig Topper, 2019-10-03, 1 file, -7/+31)
    More fixes for PR36191. llvm-svn: 373560
* [X86] Remove a couple redundant isel patterns that look to have been copy/pasted from right above them. NFC (Craig Topper, 2019-10-03, 1 file, -17/+0)
    llvm-svn: 373559
* [X86] Rewrite the vXi1 subvector insertion code to not rely on the value of bits that might be undef (Craig Topper, 2019-10-02, 1 file, -14/+26)
    The previous code tried to do a trick where we would extract the subvector from the location we were inserting at. Then xor that with the new value. Take the xored value and clear out the bits above the subvector size. Then shift that xored subvector to the insert location. And finally xor that with the original vector. Since the old subvector was used in both xors, this would leave just the new subvector at the inserted location. Since the surrounding bits had been zeroed, no other bits of the original vector would be modified. Unfortunately, if the old subvector came from undef we might aggressively propagate the undef. Then we end up with the XORs not cancelling because they aren't using the same value for the two uses of the old subvector. @bkramer gave me a case that demonstrated this, but we haven't reduced it enough to make it easily readable to see what's happening. This patch uses a safer, but more costly approach. It isolates the bits above the insertion and the bits below the insert point and ORs those together, leaving 0 at the insertion location. Then it widens the subvector with 0s in the upper bits, shifts it into position with 0s in the lower bits, and then we do another OR. Differential Revision: https://reviews.llvm.org/D68311 llvm-svn: 373495
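    A scalar sketch of the safer scheme described above, using a uint64_t as a stand-in for the
    mask register; this is not the backend code, just the same bit manipulation spelled out
    (for example, insert_subvector_bits(v, s, 8, 16) replaces bits [16, 24) of v with the low
    8 bits of s):

        #include <cstdint>

        uint64_t insert_subvector_bits(uint64_t vec, uint64_t sub,
                                       unsigned sub_bits, unsigned pos) {
          // Isolate the bits below the insert point and the bits above the inserted
          // subvector, and OR them together, leaving zeros at the insertion location.
          uint64_t low_mask  = (pos == 0) ? 0 : (~0ULL >> (64 - pos));
          unsigned hi_start  = pos + sub_bits;
          uint64_t high_mask = (hi_start >= 64) ? 0 : (~0ULL << hi_start);
          uint64_t kept      = vec & (low_mask | high_mask);
          // Widen the subvector with zeros in its upper bits, shift it into place
          // with zeros below, then OR it back in.
          uint64_t widened   = (sub_bits >= 64) ? sub : (sub & ((1ULL << sub_bits) - 1));
          return kept | (widened << pos);
        }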
* [X86] Add broadcast load folding patterns to the NoVLX compare patterns. (Craig Topper, 2019-10-02, 1 file, -16/+138)
    These patterns use zmm registers for 128/256-bit compares when the VLX instructions aren't available. Previously we only supported registers, but as PR36191 notes, we can fold broadcast loads but not regular loads. llvm-svn: 373423