path: root/llvm/lib/Target/X86/X86ISelLowering.cpp
Commit message (Author, Date, Files changed, Lines -/+)
...
* [x86] don't require a zext when forming ADC/SBB (Sanjay Patel, 2017-03-04, 1 file, -24/+29)
  The larger goal is to move the ADC/SBB transforms currently in combineX86SetCC() to combineAddOrSubToADCOrSBB() because we're creating ADC/SBB in lots of places where we shouldn't. This was intended to be an NFC change, but avx-512 has something strange going on. It doesn't seem like any of the affected tests should really be using SET+TEST or ADC; a simple ADD could replace several instructions. But that's another bug... llvm-svn: 296978
* [X86][SSE] Enable post-legalize vXi64 shuffle combining on 32-bit targets (Simon Pilgrim, 2017-03-04, 1 file, -5/+0)
  Long ago (2010 according to svn blame), combineShuffle probably needed to prevent the accidental creation of illegal i64 types, but there no longer appear to be any combines that can cause this, as they all have their own legality checks. Differential Revision: https://reviews.llvm.org/D30213 llvm-svn: 296966
* X86ISelLowering: Only perform copy elision on legal types. (Matthias Braun, 2017-03-04, 1 file, -33/+37)
  This fixes cases where i1 types were not properly legalized yet and led to the creation of 0-sized stack slots. This fixes http://llvm.org/PR32136 llvm-svn: 296950
* [x86] check for commuted add pattern to find ADC/SBB (Sanjay Patel, 2017-03-04, 1 file, -4/+11)
  llvm-svn: 296933
* [x86] refactor combineAddOrSubToADCOrSBB(); NFCI (Sanjay Patel, 2017-03-03, 1 file, -21/+25)
  The comments were wrong, and this is not an obvious transform. This hopefully makes it clearer that we're missing the commuted patterns for adds. It's less clear that this is actually a good transform for all micro-architectures. This is prep work for trying to clean up the current adc/sbb codegen because it's definitely not happening optimally. llvm-svn: 296918
* [x86] clean up materializeSBB(); NFCI (Sanjay Patel, 2017-03-03, 1 file, -20/+14)
  This is producing SBB where it is obviously not necessary, so it needs to be limited. llvm-svn: 296894
* [x86] fix formatting; NFC (Sanjay Patel, 2017-03-03, 1 file, -3/+2)
  llvm-svn: 296875
* Use APInt::getHighBitsSet instead of APInt::getBitsSet for upper bit mask creation (Simon Pilgrim, 2017-03-03, 1 file, -1/+1)
  llvm-svn: 296874
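  A minimal sketch (illustrative values, not from the commit) of the equivalence this cleanup relies on: a mask of the top NumHiBits bits can be built either with getBitsSet over the upper half-open range or, more directly, with getHighBitsSet.

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    // Mask of the upper NumHiBits bits of a BitWidth-wide value, two ways.
    APInt upperMaskViaGetBitsSet(unsigned BitWidth, unsigned NumHiBits) {
      // getBitsSet sets the half-open range [loBit, hiBit).
      return APInt::getBitsSet(BitWidth, BitWidth - NumHiBits, BitWidth);
    }

    APInt upperMaskViaGetHighBitsSet(unsigned BitWidth, unsigned NumHiBits) {
      // getHighBitsSet states the intent directly.
      return APInt::getHighBitsSet(BitWidth, NumHiBits);
    }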
* [X86][MMX] Fixed i32 extraction on 32-bit targets (Simon Pilgrim, 2017-03-02, 1 file, -0/+10)
  MMX extraction often ends up as extract_i32(bitcast_v2i32(extract_i64(bitcast_v1i64(x86mmx v), 0)), 0), which fails to simplify on 32-bit targets as i64 isn't legal. llvm-svn: 296782
* Elide argument copies during instruction selection (Reid Kleckner, 2017-03-01, 1 file, -20/+62)
  Summary: Avoids tons of prologue boilerplate when arguments are passed in memory and left in memory. This can happen in a debug build or in a release build when an argument alloca is escaped. This will dramatically affect the code size of x86 debug builds, because X86 fast isel doesn't handle arguments passed in memory at all. It only handles the x86_64 case of up to 6 basic register parameters. This is implemented by analyzing the entry block before ISel to identify copy elision candidates. A copy elision candidate is an argument that is used to fully initialize an alloca before any other possibly escaping uses of that alloca. If an argument is a copy elision candidate, we set a flag on the InputArg. If the target generates loads from a fixed stack object that matches the size and alignment requirements of the alloca, the SelectionDAG builder will delete the stack object created for the alloca and replace it with the fixed stack object. The load is left behind to satisfy any remaining uses of the argument value. The store is now dead and is therefore elided. The fixed stack object is also marked as mutable, as it may now be modified by the user, and it would be invalid to rematerialize the initial load from it. Supersedes D28388 Fixes PR26328
  Reviewers: chandlerc, MatzeB, qcolombet, inglorion, hans
  Subscribers: igorb, llvm-commits
  Differential Revision: https://reviews.llvm.org/D29668 llvm-svn: 296683
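  As a rough, hypothetical source-level illustration (not taken from the patch) of the pattern the elision targets: an argument passed in memory whose alloca is fully initialized from the incoming argument before any other use, so the prologue copy can be dropped.

    // Hypothetical example: 'Big' is passed in memory on common ABIs. At -O0,
    // or whenever the parameter's address escapes, 'v' gets its own alloca
    // initialized by copying the incoming argument; copy elision lets
    // SelectionDAG reuse the caller's fixed stack object for that alloca
    // instead of emitting the copy.
    struct Big { int data[16]; };

    int sumBig(Big v) {
      int s = 0;
      for (int i = 0; i < 16; ++i)
        s += v.data[i];
      return s;
    }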
* [X86][SSE] Attempt to extract vector elements through target shuffles (Simon Pilgrim, 2017-02-27, 1 file, -0/+97)
  DAGCombiner already supports peeking through shuffles to improve vector element extraction, but legalization often leaves us in situations where we need to extract vector elements after shuffles have already been lowered. This patch adds support for VECTOR_EXTRACT_ELEMENT/PEXTRW/PEXTRB instructions to attempt to handle target shuffles as well. I've covered some basic scenarios, including handling shuffle mask scaling and the implicit zero-extension of PEXTRW/PEXTRB; there is more that could be done here (mentioned in TODOs) but I haven't found many cases where it's worth it. Differential Revision: https://reviews.llvm.org/D30176 llvm-svn: 296381
* [X86] Use APInt instead of SmallBitVector for tracking undef elements from getTargetConstantBitsFromNode and getConstVector (Craig Topper, 2017-02-27, 1 file, -25/+25)
  Summary: SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here use more than that number of elements and therefore cause a malloc. APInt, on the other hand, supports up to 64 bits without a malloc. That's the maximum number of bits we need here, so we can avoid a malloc in all cases by using APInt.
  Reviewers: RKSimon
  Reviewed By: RKSimon
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D30392 llvm-svn: 296355
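  A minimal sketch (hypothetical helper, not from the patch) of the replacement idea: per-element flags kept in an APInt stay inline as long as the element count is at most 64, so no heap allocation occurs.

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    // Track which of NumElts vector elements are undef without heap
    // allocation: an APInt of up to 64 bits keeps its value inline, unlike
    // SmallBitVector, which mallocs past 58 bits (64-bit hosts) or 27 bits
    // (32-bit hosts).
    APInt collectUndefElts(unsigned NumElts, const bool *IsUndef) {
      APInt UndefElts(NumElts, 0);      // all bits clear
      for (unsigned i = 0; i != NumElts; ++i)
        if (IsUndef[i])
          UndefElts.setBit(i);          // mark element i as undef
      return UndefElts;
    }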
* [X86] Use APInt instead of SmallBitVector for tracking Zeroable elements in shuffle lowering (Craig Topper, 2017-02-27, 1 file, -63/+57)
  Summary: SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here use more than that number of elements and therefore cause a malloc. APInt, on the other hand, supports up to 64 bits without a malloc. That's the maximum number of bits we need here, so we can avoid a malloc in all cases by using APInt.
  Reviewers: RKSimon
  Reviewed By: RKSimon
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D30390 llvm-svn: 296354
* [X86] Check for less than 0 rather than explicit compare with -1. NFC (Craig Topper, 2017-02-27, 1 file, -2/+3)
  llvm-svn: 296321
* [APInt] Add APInt::extractBits() method to extract APInt subrange (reapplied) (Simon Pilgrim, 2017-02-25, 1 file, -4/+4)
  The current pattern for extracting bits in a range is typically: Mask.lshr(BitOffset).trunc(SubSizeInBits); which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable. This is another of the compile time issues identified in PR32037 (see also D30265). This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation. Differential Revision: https://reviews.llvm.org/D30336 llvm-svn: 296272
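  A small sketch (wrapper functions are made up for illustration) contrasting the old two-step pattern with the new helper:

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    // Extract SubSizeInBits bits starting at BitOffset from a wide mask.
    APInt extractOld(const APInt &Mask, unsigned BitOffset, unsigned SubSizeInBits) {
      // Old pattern: the shift produces a full-width temporary APInt, which
      // heap-allocates whenever the mask is wider than 64 bits.
      return Mask.lshr(BitOffset).trunc(SubSizeInBits);
    }

    APInt extractNew(const APInt &Mask, unsigned BitOffset, unsigned SubSizeInBits) {
      // New helper: reads the sub-range directly, no temporary allocation.
      return Mask.extractBits(SubSizeInBits, BitOffset);
    }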
* Revert: r296141 [APInt] Add APInt::extractBits() method to extract APInt subrange (Simon Pilgrim, 2017-02-24, 1 file, -4/+4)
  The current pattern for extracting bits in a range is typically: Mask.lshr(BitOffset).trunc(SubSizeInBits); which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable. This is another of the compile time issues identified in PR32037 (see also D30265). This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation. Differential Revision: https://reviews.llvm.org/D30336 llvm-svn: 296147
* [APInt] Add APInt::extractBits() method to extract APInt subrange (Simon Pilgrim, 2017-02-24, 1 file, -4/+4)
  The current pattern for extracting bits in a range is typically: Mask.lshr(BitOffset).trunc(SubSizeInBits); which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable. This is another of the compile time issues identified in PR32037 (see also D30265). This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation. Differential Revision: https://reviews.llvm.org/D30336 llvm-svn: 296141
* [X86][SSE] Target shuffle combine can try to combine up to 16 vectors (Simon Pilgrim, 2017-02-24, 1 file, -6/+6)
  Noticed while profiling PR32037: the target shuffle ops were being stored in SmallVector<*,8> types but the combiner could store as many as 16 ops at maximum depth (2 per depth). llvm-svn: 296130
* [x86] use DAG.getAllOnesConstant(); NFCI (Sanjay Patel, 2017-02-24, 1 file, -18/+11)
  llvm-svn: 296128
* [APInt] Add APInt::setBits() method to set all bits in range (Simon Pilgrim, 2017-02-24, 1 file, -7/+5)
  The current pattern for setting bits in a range is typically: Mask |= APInt::getBitsSet(MaskSizeInBits, LoPos, HiPos); which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable. This is one of the key compile time issues identified in PR32037. This patch adds the APInt::setBits() helper method which avoids the temporary memory allocation completely. This first implementation uses setBit() internally but already significantly reduces the regression in PR32037 (~10% drop); additional optimization may be possible. I investigated whether there is a need for APInt::clearBits() and APInt::flipBits() equivalents but haven't seen these patterns to be particularly common; reusing the code would be trivial, though. Differential Revision: https://reviews.llvm.org/D30265 llvm-svn: 296102
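  A minimal sketch (wrapper functions are made up for illustration) of the difference:

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    // Set bits [LoPos, HiPos) of a wide mask.
    void setRangeOld(APInt &Mask, unsigned LoPos, unsigned HiPos) {
      // Old pattern: getBitsSet materializes a full-width temporary mask,
      // which heap-allocates whenever the mask is wider than 64 bits.
      Mask |= APInt::getBitsSet(Mask.getBitWidth(), LoPos, HiPos);
    }

    void setRangeNew(APInt &Mask, unsigned LoPos, unsigned HiPos) {
      // New helper: sets the half-open range [loBit, hiBit) in place.
      Mask.setBits(LoPos, HiPos);
    }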
* [AVX-512] Separate the fadd/fsub/fmul/fdiv/fmax/fmin with rounding mode ISD opcodes into separate packed and scalar opcodes. (Craig Topper, 2017-02-24, 1 file, -0/+6)
  This is more consistent with the rest of the ISD opcodes. NFC llvm-svn: 296094
* [Fuchsia] Use thread-pointer ABI slots for stack-protector and safe-stack (Petr Hosek, 2017-02-24, 1 file, -24/+36)
  The Fuchsia ABI defines slots from the thread pointer where the stack-guard value for stack-protector, and the unsafe stack pointer for safe-stack, are stored. This parallels the Android ABI support. Patch by Roland McGrath Differential Revision: https://reviews.llvm.org/D30237 llvm-svn: 296081
* Disable TLS for stack protector on Android API<17. (Evgeniy Stepanov, 2017-02-23, 1 file, -5/+11)
  The TLS slot did not exist back then. llvm-svn: 296014
* [X86][SSE] getTargetConstantBitsFromNode - insert constant bits directly into masks. (Simon Pilgrim, 2017-02-22, 1 file, -18/+15)
  Minor optimization: don't create temporary mask APInts that are just going to be OR'd into the accumulator masks - insert directly instead. llvm-svn: 295848
* [X86][SSE] Use APInt::getBitsSet() instead of APInt::getLowBitsSet().shl() separately. NFCI. (Simon Pilgrim, 2017-02-22, 1 file, -3/+4)
  llvm-svn: 295845
* [AVX-512] Allow legacy scalar min/max intrinsics to select EVEX instructions when available (Craig Topper, 2017-02-22, 1 file, -0/+10)
  This patch introduces new X86ISD::FMAXS and X86ISD::FMINS opcodes. The legacy intrinsics now lower to this node, as do the AVX-512 masked intrinsics when the rounding mode is CUR_DIRECTION. I've merged a copy of the tablegen multiclass avx512_fp_scalar into avx512_fp_scalar_sae. avx512_fp_scalar still needs to support CUR_DIRECTION appearing as a rounding mode for X86ISD::FADD_ROUND and others. Differential Revision: https://reviews.llvm.org/D30186 llvm-svn: 295810
* [CodeGenPrepare] Sink and duplicate more 'and' instructions. (Geoff Berry, 2017-02-21, 1 file, -0/+5)
  Summary: Rework the code that was sinking/duplicating (icmp and, 0) sequences into blocks where they were being used by conditional branches to form more tbz instructions on AArch64. The new code is more general in that it just looks for 'and's that have all icmp 0's as users, with a target hook used to select which subset of 'and' instructions to consider. This change also enables 'and' sinking for X86, where it is more widely beneficial than on AArch64. The 'and' sinking/duplicating code is moved into the optimizeInst phase of CodeGenPrepare, where it can take advantage of the fact that OptimizeCmpExpression has already sunk/duplicated any icmps into the blocks where they are used. One minor complication from this change is that optimizeLoadExt needed to be updated to always mark 'and's it has determined should be in the same block as their feeding load in the InsertedInsts set, to avoid an infinite loop of hoisting and sinking the same 'and'. This change fixes a regression on X86 in the tsan runtime caused by moving GVNHoist to a later place in the optimization pipeline (see PR31382).
  Reviewers: t.p.northover, qcolombet, MatzeB
  Subscribers: aemerson, mcrosier, sebpop, llvm-commits
  Differential Revision: https://reviews.llvm.org/D28813 llvm-svn: 295746
* [X86] EltsFromConsecutiveLoads SDLoc argument should be const&. (Simon Pilgrim, 2017-02-21, 1 file, -1/+1)
  There appears never to have been a time that the reference was updated. llvm-svn: 295739
* [X86][SSE] Prefer to combine shuffles to VZEXT over VZEXT_MOVL. (Simon Pilgrim, 2017-02-21, 1 file, -9/+9)
  This matches what is already done during shuffle lowering and helps prevent the need for a zero-vector in cases where shuffles match both patterns. llvm-svn: 295723
* [AVX512] Fix EXTRACT_VECTOR_ELT for v2i1/v4i1/v32i1/v64i1 with variable index. (Igor Breger, 2017-02-21, 1 file, -3/+7)
  Differential Revision: https://reviews.llvm.org/D30189 llvm-svn: 295718
* [X86] Fix formatting. NFC (Craig Topper, 2017-02-21, 1 file, -1/+1)
  llvm-svn: 295695
* Add a wrapper around copy_if in STLExtras; NFC (Sanjoy Das, 2017-02-21, 1 file, -8/+5)
  I will add one more use for this in a later change. llvm-svn: 295685
* [X86] Tidyup combineExtractVectorElt. NFCI. (Simon Pilgrim, 2017-02-20, 1 file, -8/+9)
  Pull out repeated code for the extraction index operand and source vector value type. Use the isNullConstant helper to check for a zero extraction index. llvm-svn: 295670
* [X86] Fix EXTRACT_VECTOR_ELT with variable index from v32i16 and v64i8 vectors. (Igor Breger, 2017-02-20, 1 file, -18/+29)
  It's more profitable to go through memory (1 cycle throughput) than to use a VMOVD + VPERMV/PSHUFB sequence (2/3 cycles throughput) to implement EXTRACT_VECTOR_ELT with a variable index. The IACA tool was used to get the performance estimates (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer). For example, for the var_shuffle_v16i8_v16i8_xxxxxxxxxxxxxxxx_i8 test from vector-shuffle-variable-128.ll I get 26 cycles vs 79 cycles. Removing the VINSERT node, we don't need it any more. Differential Revision: https://reviews.llvm.org/D29690 llvm-svn: 295660
* [X86][AVX512] Add support for ASHR v2i64/v4i64 without VLX (Simon Pilgrim, 2017-02-20, 1 file, -1/+1)
  Use v8i64 ASHR instructions if we don't have VLX. Differential Revision: https://reviews.llvm.org/D28537 llvm-svn: 295656
* [X86] Use peekThroughOneUseBitcasts helper. NFCI. (Simon Pilgrim, 2017-02-19, 1 file, -10/+5)
  llvm-svn: 295618
* [X86][SSE] Use getTargetConstantBitsFromNode to find zeroable shuffle elements. (Simon Pilgrim, 2017-02-19, 1 file, -35/+31)
  Replaces the existing approach that could only search BUILD_VECTOR nodes. Requires getTargetConstantBitsFromNode to discriminate cases with all/partial UNDEF bits in each element - this should also be useful when we get around to supporting getTargetShuffleMaskIndices with UNDEF elements. llvm-svn: 295613
* [X86][SSE] Enable initial support for domain crossing at high shuffle combine depths. (Simon Pilgrim, 2017-02-19, 1 file, -3/+3)
  As discussed on D27692, this permits another domain to be used to combine a shuffle at high depths. We currently set the required depth at 4 or more combined shuffles; this is probably too high for most targets but is a good starting point and already helps avoid a number of costly variable shuffles. llvm-svn: 295608
* [X86][SSE] Generalize INSERTPS/SHUFPS/SHUFPD combines across domains.Simon Pilgrim2017-02-191-14/+23
| | | | | | Relax the INSERTPS/SHUFPS/SHUFPD combines to support integer inputs if permitted. llvm-svn: 295606
* [X86][SSE] Add domain crossing support for target shuffle combines.Simon Pilgrim2017-02-191-36/+48
| | | | | | | | Add the infrastructure to flag whether float and/or int domains are permitable. A future patch will enable domain crossing based off shuffle depth and the value types of the source vectors. llvm-svn: 295604
* Fix signed/unsigned comparison warning.Simon Pilgrim2017-02-181-2/+2
| | | | llvm-svn: 295580
* [X86] Fix enumeral/non-enumeral comparison warning.Simon Pilgrim2017-02-181-1/+1
| | | | | | gcc only allows you to mix enums / ints if they have the same signedness. llvm-svn: 295576
* [X86][SSE] Avoid repeated calls to SDValue::getValueType.Simon Pilgrim2017-02-181-4/+7
| | | | | | Added assertion to check input type of X86ISD::VZEXT during target known bits calculation. llvm-svn: 295575
* [x86] fold sext (xor Bool, -1) --> sub (zext Bool), 1Sanjay Patel2017-02-181-0/+10
| | | | | | | This is the same transform that is current used for: select Bool, 0, -1 llvm-svn: 295568
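  A quick standalone sanity check of the identity (not part of the patch): for a boolean b, sext(xor b, -1) and sub(zext b, 1) produce the same value (-1 when b is false, 0 when b is true).

    #include <cassert>
    #include <cstdint>

    int main() {
      for (bool B : {false, true}) {
        // sext(xor B, -1): sign-extend the negated i1 value.
        int64_t SExtNot = B ? 0 : -1;
        // sub(zext B, 1): zero-extend B and subtract 1.
        int64_t SubForm = static_cast<int64_t>(B) - 1;
        assert(SExtNot == SubForm);
      }
      return 0;
    }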
* [X86] Simplify by pulling out value type. NFCI. (Simon Pilgrim, 2017-02-17, 1 file, -2/+2)
  llvm-svn: 295502
* [X86] Remove local areOnlyUsersOf helper and use SDNode::areOnlyUsersOf instead. (Simon Pilgrim, 2017-02-16, 1 file, -9/+1)
  llvm-svn: 295326
* [X86][SSE] Don't call EltsFromConsecutiveLoads if any element is missing. (Simon Pilgrim, 2017-02-15, 1 file, -4/+11)
  Minor performance speedup - if any call to getShuffleScalarElt fails to get a result, don't bother calling for the remaining elements as EltsFromConsecutiveLoads will fail anyhow. llvm-svn: 295235
* [X86][SSE] Propagate undef upper elements from scalar_to_vector during shuffle combining (Simon Pilgrim, 2017-02-15, 1 file, -1/+7)
  Only do this for integer types currently - for float types (in particular insertps) load folding often fails with this. llvm-svn: 295208
* [X86][SSE] Allow matchVectorShuffleWithUNPCK to recognise ZERO inputs (Simon Pilgrim, 2017-02-15, 1 file, -11/+46)
  Add support for specifying an UNPCK input as ZERO; particularly improves ZEXT cases with non-zero offsets. llvm-svn: 295169
* [X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types (Craig Topper, 2017-02-15, 1 file, -2/+18)
  Summary: We don't seem to have great rules on what a valid VBROADCAST node looks like, and as a consequence we end up with a lot of patterns to try to catch everything. We have patterns with scalar inputs, 128-bit vector inputs, 256-bit vector inputs, and 512-bit vector inputs. As you can see from the things improved here, we are currently missing patterns for 128-bit loads being extended to 256-bit before the vbroadcast. I'd like to propose that VBROADCAST should always take a 128-bit vector type as input. As a first step towards that, this patch adds an EXTRACT_SUBVECTOR in front of VBROADCAST when the input is 256 or 512 bits. In the future I would like to add scalar_to_vector around all the scalar operations, and maybe we should consider adding a VBROADCAST+load node to avoid separating loads from the broadcasting operation when the load itself isn't foldable. This requires an additional change in target shuffle combining to look for the extract subvector and look through it to find the original operand. I'm sure this change isn't perfect but it was enough to fix a few test failures that were occurring. Another interesting thing I noticed is that the changes in masked_gather_scatter.ll show cases where we don't remove a useless insert into element 1 before broadcasting element 0.
  Reviewers: delena, RKSimon, zvi
  Reviewed By: zvi
  Subscribers: igorb, llvm-commits
  Differential Revision: https://reviews.llvm.org/D28747 llvm-svn: 295155