path: root/llvm/lib/Target/X86/X86ISelLowering.cpp
Commit message | Author | Date | Files | Lines
* Fix spelling in comments. NFCI. | Simon Pilgrim | 2017-07-06 | 1 | -3/+3
    llvm-svn: 307288
* [X86][SSE4A] Add support for shuffle combining to INSERTQI. | Simon Pilgrim | 2017-07-06 | 1 | -0/+16
    llvm-svn: 307268
* [X86][SSE] combineX86ShuffleChain - merge duplicate creations of integer mask types | Simon Pilgrim | 2017-07-06 | 1 | -20/+12
    llvm-svn: 307257
* [X86][SSE] combineX86ShuffleChain - merge duplicate 'Zeroable' element masks | Simon Pilgrim | 2017-07-06 | 1 | -20/+12
    llvm-svn: 307255
* [X86][SSE4A] Add support for shuffle combining to EXTRQ. | Simon Pilgrim | 2017-07-06 | 1 | -1/+28
    llvm-svn: 307254
* [X86][SSE4A] Split EXTRQ/INSERTQ shuffle matching from lowering. NFCI. | Simon Pilgrim | 2017-07-06 | 1 | -99/+112
    First step toward supporting shuffle combining to EXTRQ/INSERTQ.
    llvm-svn: 307250
* [X86][SSE4A] Add support for combining from non-v16i8 EXTRQI/INSERTQI shuffles | Simon Pilgrim | 2017-07-04 | 1 | -3/+3
    With the improved shuffle decoding we can now combine EXTRQI/INSERTQI shuffles from non-v16i8 vector types.
    llvm-svn: 307099
* [X86][SSE4A] Generalized EXTRQI/INSERTQI shuffle decodes | Simon Pilgrim | 2017-07-04 | 1 | -2/+2
    The existing decodes only worked for v16i8 vectors; this adds support for any 128-bit vector.
    llvm-svn: 307095
* [X86][SSE4A] Add support for combining from EXTRQI/INSERTQI shuffles | Simon Pilgrim | 2017-07-03 | 1 | -0/+22
    llvm-svn: 307048
* [X86][AVX512VPOPCNTDQ] Improve support for v16i8/v8i16/v16i16/ CTPOP | Simon Pilgrim | 2017-07-02 | 1 | -0/+14
    Zero extend to v16i32/v8i64, use VPOPCNTDQ instructions and truncate back.
    llvm-svn: 306990
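A minimal intrinsics sketch of the widening scheme this commit describes, assuming an AVX512F + AVX512VPOPCNTDQ target; the function name is illustrative, not LLVM's code:

```cpp
#include <immintrin.h>

// v16i8 CTPOP via the wider VPOPCNTD: zero-extend each byte to 32 bits,
// count bits per 32-bit lane, then truncate the counts back to bytes.
__m128i popcnt_v16i8(__m128i v) {
  __m512i wide = _mm512_cvtepu8_epi32(v);    // zero extend v16i8 -> v16i32
  __m512i cnt  = _mm512_popcnt_epi32(wide);  // VPOPCNTD on each i32 lane
  return _mm512_cvtepi32_epi8(cnt);          // truncate v16i32 -> v16i8
}
```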
* [X86][SSE] Attempt to combine 64-bit and 32-bit shuffles to unary shuffles before bit shifts | Simon Pilgrim | 2017-07-02 | 1 | -32/+21
    We were combining shuffles to bit shifts before unary permutes, which means we can't fold loads, plus the destination register is destructive.
    llvm-svn: 306978
* [X86][SSE] Attempt to combine 64-bit and 16-bit shuffles to unary shuffles before bit shifts | Simon Pilgrim | 2017-07-02 | 1 | -64/+56
    We were combining shuffles to bit shifts before unary permutes, which means we can't fold loads, plus the destination register is destructive.
    The 32-bit shuffles are a bit tricky and will be dealt with in a later patch.
    llvm-svn: 306977
* [X86][SSE] Pulled common variables to top of matchUnaryPermuteVectorShuffle. NFCI. | Simon Pilgrim | 2017-06-30 | 1 | -5/+4
    llvm-svn: 306847
* Recommitting rL305465 after fixing bug in TableGen in rL306251 & rL306371 | Ayman Musa | 2017-06-27 | 1 | -0/+95
    [X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions).
    AVX512 compare instructions return v*i1 types. In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type. Later on it is replaced with CONCAT_VECTORS, which is then lowered to many DAG nodes including insert/extract element and shift right/left nodes.
    The fact that AVX512 compare instructions put the result in a k register and zero all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
    When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one avx512 cmp instruction.
    Differential Revision: https://reviews.llvm.org/D33188
    llvm-svn: 306402
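For reference, a small intrinsics-level view of the pattern involved, assuming AVX512VL is available; the compare writes a k register whose upper bits are already zero, which is what lets the mask widening become a plain register-class copy:

```cpp
#include <immintrin.h>

// A 4-lane compare produces a 4-bit result inside the 8-bit __mmask8
// type; the upper mask bits are already zero, so no shifts are needed.
__mmask8 cmp4_lanes(__m128i a, __m128i b) {
  return _mm_cmpeq_epi32_mask(a, b);  // one VPCMPEQD into a k register
}
```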
* Fixed the warning introduced by r306289 to make ubuntu-gcc7.1-werror bot green. | Galina Kistanova | 2017-06-27 | 1 | -1/+1
    llvm-svn: 306369
* [x86] transform vector inc/dec to use -1 constant (PR33483) | Sanjay Patel | 2017-06-26 | 1 | -0/+32
    Convert vector increment or decrement to sub/add with an all-ones constant:
        add X, <1, 1...> --> sub X, <-1, -1...>
        sub X, <1, 1...> --> add X, <-1, -1...>
    The all-ones vector constant can be materialized using a pcmpeq instruction that is commonly recognized as an idiom (has no register dependency), so that's better than loading a splat 1 constant. AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better way to produce 512 one-bits.
    The general advantages of this lowering are:
    1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables, so in theory this could be better for perf, but...
    2. That seems unlikely to affect any OOO implementation, and I can't measure any real perf difference from this transform on Haswell or Jaguar, but...
    3. It doesn't look like it from the diffs, but this is an overall size win because we eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting a scalar load (which might itself be a bug), then we're replacing a scalar constant load + broadcast with a single cheap op, so that should always be smaller/better too.
    4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1 and psub x, -1, so we should use that form for +1 too because we can. If there's some reason to favor a constant load on some CPU, let's make the reverse transform for all of these cases (either here in the DAG or in a later machine pass).
    This should fix: https://bugs.llvm.org/show_bug.cgi?id=33483
    Differential Revision: https://reviews.llvm.org/D34336
    llvm-svn: 306289
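A hedged SSE2 sketch of the identity being exploited; pcmpeq of a register with itself yields all-ones without loading a constant:

```cpp
#include <immintrin.h>

// add X, <1,1,1,1> == sub X, <-1,-1,-1,-1>, and the all-ones operand is
// a recognized register idiom (pcmpeqd x, x) rather than a memory load.
__m128i increment_v4i32(__m128i x) {
  __m128i all_ones = _mm_cmpeq_epi32(x, x);  // every lane = 0xFFFFFFFF
  return _mm_sub_epi32(x, all_ones);         // x - (-1) == x + 1 per lane
}
```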
* [x86] fix value types for SBB transform (PR33560) | Sanjay Patel | 2017-06-23 | 1 | -8/+13
    I'm not sure yet why this wouldn't fail in the simple case, but clearly I used the wrong value type with https://reviews.llvm.org/rL306040, and the bug manifests with https://bugs.llvm.org/show_bug.cgi?id=33560.
    llvm-svn: 306139
* Remove trailing whitespace. NFCI. | Simon Pilgrim | 2017-06-23 | 1 | -1/+1
    llvm-svn: 306121
* [x86] add/sub (X==0) --> sbb(cmp X, 1) | Sanjay Patel | 2017-06-22 | 1 | -5/+17
    This is very similar to the transform in https://reviews.llvm.org/rL306040, but in this case we use cmp X, 1 to set the carry bit as needed.
    Again, we can show that all of these are logically equivalent (although InstCombine currently canonicalizes to a form not seen here), and if we believe IACA, then this is the smallest/fastest code. Eg, with SNB:
    | Num Of |         Ports pressure in cycles          |    |
    |  Uops  | 0 - DV | 1   | 2 - D | 3 - D | 4   | 5   |    |
    ---------------------------------------------------------------------
    |   1    |  1.0   |     |       |       |     |     |    | cmp edi, 0x1
    |   2    |        | 1.0 |       |       |     | 1.0 | CP | sbb eax, eax
    The larger motivation is to clean up all select-of-constants combining/lowering because we're missing some common cases.
    llvm-svn: 306072
* [x86] add/sub (X==0) --> sbb(neg X) | Sanjay Patel | 2017-06-22 | 1 | -3/+19
    Our handling of select-of-constants is lumpy in IR (https://reviews.llvm.org/D24480), lumpy in DAGCombiner, and lumpy in X86ISelLowering. That's why we only had the 'sbb' codegen in 1 out of the 4 tests. This is a step towards smoothing that out.
    First, show that all of these IR forms are equivalent: http://rise4fun.com/Alive/mx
    Second, show that the 'sbb' version is faster/smaller. IACA output for SandyBridge (later Intel and AMD chips are similar based on Agner's tables):
    This is the "obvious" x86 codegen (what gcc appears to produce currently):
    | Num Of |         Ports pressure in cycles          |    |
    |  Uops  | 0 - DV | 1   | 2 - D | 3 - D | 4   | 5   |    |
    ---------------------------------------------------------------------
    |   1*   |        |     |       |       |     |     |    | xor eax, eax
    |   1    |  1.0   |     |       |       |     |     | CP | test edi, edi
    |   1    |        |     |       |       |     | 1.0 | CP | setnz al
    |   1    |        | 1.0 |       |       |     |     | CP | neg eax
    This is the adc version:
    |   1*   |        |     |       |       |     |     |    | xor eax, eax
    |   1    |  1.0   |     |       |       |     |     | CP | cmp edi, 0x1
    |   2    |        | 1.0 |       |       |     | 1.0 | CP | adc eax, 0xffffffff
    And this is sbb:
    |   1    |  1.0   |     |       |       |     |     |    | neg edi
    |   2    |        | 1.0 |       |       |     | 1.0 | CP | sbb eax, eax
    If IACA is trustworthy, then sbb became a single uop in Broadwell, so this will be clearly better than the alternatives going forward.
    llvm-svn: 306040
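A scalar C++ sketch of the value the IACA tables above are computing; compilers lower this to the neg/sbb idiom shown in the last table:

```cpp
#include <cstdint>

// -(x != 0): "neg edi" sets the carry flag iff x is nonzero, and
// "sbb eax, eax" computes 0 - 0 - CF, smearing the borrow into a mask.
int32_t nonzero_mask(uint32_t x) {
  return x != 0 ? -1 : 0;
}
```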
* [X86] Add support for "probe-stack" attribute | whitequark | 2017-06-22 | 1 | -1/+21
    This commit adds prologue code emission for stack probe function calls.
    Reviewed By: majnemer
    Differential Revision: https://reviews.llvm.org/D34387
    llvm-svn: 306010
* AVX-512: Lowering Masked Gather intrinsic - fixed a bug | Elena Demikhovsky | 2017-06-22 | 1 | -0/+52
    Masked gather for vector length 2 was lowered incorrectly for element type i32: the type <2 x i32> was automatically extended to <2 x i64> and we generated VPGATHERQQ instead of VPGATHERQD. The type <2 x float> is extended to <4 x float>, so there is no bug for that type, but the sequence may be more optimal.
    In this patch I'm fixing the <2 x i32> bug and optimizing the <2 x float> sequence for GATHERs only. The same fix should be done for Scatters as well.
    Differential Revision: https://reviews.llvm.org/D34343
    llvm-svn: 305987
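The element-wise semantics of the gather being lowered, as an illustrative scalar loop (all names here are hypothetical):

```cpp
#include <cstdint>

// Masked gather for <2 x i32>: active lanes load base[index[i]] (this is
// where VPGATHERQD vs. VPGATHERQQ matters), inactive lanes keep passthru.
void masked_gather_2xi32(int32_t dst[2], const int32_t *base,
                         const int64_t index[2], const bool mask[2],
                         const int32_t passthru[2]) {
  for (int i = 0; i < 2; ++i)
    dst[i] = mask[i] ? base[index[i]] : passthru[i];
}
```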
* [x86] fix formatting; NFC | Sanjay Patel | 2017-06-21 | 1 | -15/+13
    llvm-svn: 305914
* [x86] enable CGP memcmp() expansion for 2/4/8 byte sizes | Sanjay Patel | 2017-06-20 | 1 | -0/+6
    There are a couple of potential improvements as seen in the IR and asm:
    1. We're unnecessarily extending to a larger type to compare values.
    2. The codegen for (select cond, 1, -1) could avoid a cmov (or we could change the order of the compares, so we have a select with a 0 operand).
    llvm-svn: 305802
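A rough sketch of what a 4-byte expansion amounts to, assuming a little-endian target (the byte swap makes an unsigned word compare match memcmp's byte-wise order); this mirrors the idea, not the exact IR CGP emits:

```cpp
#include <cstdint>
#include <cstring>

// memcmp(a, b, 4) without the call: load both sides as u32, byte-swap to
// big-endian so lexicographic byte order equals unsigned integer order.
int memcmp4(const void *a, const void *b) {
  uint32_t x, y;
  std::memcpy(&x, a, 4);
  std::memcpy(&y, b, 4);
  x = __builtin_bswap32(x);
  y = __builtin_bswap32(y);
  return x < y ? -1 : (x > y ? 1 : 0);
}
```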
* [X86][SSE] Relax 0/-1 vector element insertion to work for any vector with >= 16-bit elements | Simon Pilgrim | 2017-06-20 | 1 | -1/+2
    Shuffle lowering/combining now does a good job for 256/512-bit vectors - we don't need to prevent this.
    llvm-svn: 305801
* [X86][SSE] Dropped old INSERT_VECTOR_ELT lowering TODO | Simon Pilgrim | 2017-06-20 | 1 | -2/+0
    Target shuffle combining now supports the matching of INSERT_VECTOR_ELT/PINSRW/PINSRB for merging multiple insertions into shuffles/bitmasks.
    llvm-svn: 305788
* Remove trailing whitespace. NFCI. | Simon Pilgrim | 2017-06-15 | 1 | -1/+1
    llvm-svn: 305476
* [X86][AVX2] Fix issue in lowerV8I16GeneralSingleInputVectorShuffle that was assuming v8i16 vectors | Simon Pilgrim | 2017-06-15 | 1 | -3/+4
    We can use this with v16i16/v32i16 as well. Found during fuzz testing.
    llvm-svn: 305472
* Revert r305465: [X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). | Simon Pilgrim | 2017-06-15 | 1 | -95/+0
    This is causing Windows buildbot failures.
    llvm-svn: 305470
* [X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). | Ayman Musa | 2017-06-15 | 1 | -0/+95
    AVX512 compare instructions return v*i1 types. In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type. Later on it is replaced with CONCAT_VECTORS, which is then lowered to many DAG nodes including insert/extract element and shift right/left nodes.
    The fact that AVX512 compare instructions put the result in a k register and zero all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
    When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one avx512 cmp instruction.
    Differential Revision: https://reviews.llvm.org/D33188
    llvm-svn: 305465
* [x86] avoid unnecessary shuffle mask math in combineX86ShufflesRecursively() | Sanjay Patel | 2017-06-14 | 1 | -6/+7
    This is a follow-up to https://reviews.llvm.org/D34174 / https://reviews.llvm.org/rL305398. We mentioned replacing the multiplies with shifts, but the real win seems to be in bypassing the extra ops in the common case when the RootRatio and OpRatio are one.
    This gives us another 1-2% overall win for the test in PR32037: https://bugs.llvm.org/show_bug.cgi?id=32037
    llvm-svn: 305414
* [x86] replace div/rem with shift/mask for better shuffle combining perf | Sanjay Patel | 2017-06-14 | 1 | -13/+31
    We know that shuffle masks are power-of-2 sizes, but there's no way (?) for LLVM to know that, so hack combineX86ShufflesRecursively() to be much faster by replacing div/rem with shift/mask.
    This makes the motivating compile-time bug in PR32037 (https://bugs.llvm.org/show_bug.cgi?id=32037) about 9% faster overall.
    Differential Revision: https://reviews.llvm.org/D34174
    llvm-svn: 305398
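The arithmetic identity behind the speedup, sketched for a power-of-two divisor (function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// For power-of-two n: x / n == x >> log2(n) and x % n == x & (n - 1),
// so the shuffle-mask recursion can avoid integer div/rem entirely.
uint32_t shift_div(uint32_t x, uint32_t n) {
  assert(n != 0 && (n & (n - 1)) == 0 && "n must be a power of two");
  return x >> __builtin_ctz(n);  // __builtin_ctz(n) == log2(n) here
}

uint32_t mask_rem(uint32_t x, uint32_t n) {
  return x & (n - 1);
}
```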
* Strip UTF8 BOM that got added in rL305091 | Simon Pilgrim | 2017-06-13 | 1 | -1/+0
    Seems my recent move to VS2017 has resulted in a few text editor issues.....
    llvm-svn: 305285
* [X86][SSE] Refactor getTargetConstantBitsFromNode to avoid large APInts (PR32037) | Simon Pilgrim | 2017-06-13 | 1 | -36/+66
    Much of PR32037's compile-time regression is due to getTargetConstantBitsFromNode always creating large (>64-bit) APInts during the bitcasting from the source data to the destination bitwidth.
    This commit avoids this bitcast stage if the data is already at the correct bitwidth.
    llvm-svn: 305284
* [DAG] add helper to bind memop chains; NFCI | Sanjay Patel | 2017-06-12 | 1 | -34/+3
    This step is just intended to reduce code duplication rather than change any functionality. A follow-up would be to replace PPCTargetLowering::spliceIntoChain() usage with this new helper.
    Differential Revision: https://reviews.llvm.org/D33649
    llvm-svn: 305192
* [x86] use vperm2f128 rather than vinsertf128 when there's a chance to fold a 32-byte load | Sanjay Patel | 2017-06-11 | 1 | -9/+13
    I was looking closer at the x86 test diffs in D33866, and the first change seems like it shouldn't happen in the first place. So this patch will resolve that.
    Using Agner's tables and AMD docs, vperm2f128 and vinsertf128 have identical timing for any given CPU model, so we should be able to interchange those without affecting perf. But as we can see in some of the diffs here, using vperm2f128 allows load folding, so we should take that opportunity to reduce code size and register pressure.
    A secondary advantage is making AVX1 and AVX2 codegen more similar. Given that vperm2f128 was introduced with AVX1, we should be selecting it in all of the same situations that we would with AVX2. If there's some reason that an AVX1 CPU would not want to use this instruction, that should be fixed up in a later pass.
    Differential Revision: https://reviews.llvm.org/D33938
    llvm-svn: 305171
* [X86][SSE] Add support for PACKSS nodes to faux shuffle extraction | Simon Pilgrim | 2017-06-09 | 1 | -6/+32
    If the inputs won't saturate during packing then we can treat the PACKSS as a truncation shuffle.
    llvm-svn: 305091
* This patch closes PR28513: an optimization of multiplication by different constants. | Andrew V. Tischenko | 2017-06-08 | 1 | -1/+81
    The initial patch was rejected; I fixed the issue and re-applied it.
    llvm-svn: 304972
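An illustration of the general kind of decomposition such optimizations rely on; for example, a multiply by 9 can be rewritten as a shift plus an add (a single LEA on x86):

```cpp
#include <cstdint>

// x * 9 == (x << 3) + x: cheaper than imul on targets where the
// multiplier has higher latency than a shift-and-add (or LEA).
uint32_t mul9(uint32_t x) {
  return (x << 3) + x;
}
```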
* [x86] avoid flipping sign bits for vector icmp by using known bits | Sanjay Patel | 2017-06-07 | 1 | -1/+7
    If we know that both operands of an unsigned integer vector comparison are non-negative, then it's safe to directly use a signed-compare-greater-than instruction (the only non-equality integer vector compare predicate provided by SSE/AVX).
    We're intentionally not changing the condition code to signed in order to preserve the existing transforms that use min/max/psubus below here.
    This should solve PR33276: https://bugs.llvm.org/show_bug.cgi?id=33276
    Differential Revision: https://reviews.llvm.org/D33862
    llvm-svn: 304909
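The underlying fact is easy to state in scalar code: when both values have a clear sign bit, the unsigned and signed orderings agree, which is what lets the lowering use the signed pcmpgt directly:

```cpp
#include <cstdint>

// Precondition (what known-bits analysis proves): a >= 0 && b >= 0.
// Then the unsigned comparison gives the same answer as the signed one.
bool ugt_nonnegative(int32_t a, int32_t b) {
  return static_cast<uint32_t>(a) > static_cast<uint32_t>(b);  // == (a > b)
}
```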
* [X86][SSE] Fix an issue with PEXTRW/PEXTRB indices during shuffle combining | Simon Pilgrim | 2017-06-07 | 1 | -3/+6
    We were checking that the index was in range of the destination vector type, not the (larger) source vector type.
    llvm-svn: 304894
* [X86][AVX1] Split 256-bit vector non-temporal loads to keep it non-temporal (PR32744) | Simon Pilgrim | 2017-06-05 | 1 | -6/+18
    Differential Revision: https://reviews.llvm.org/D33728
    llvm-svn: 304718
* [X86][SSE] Change BUILD_VECTOR interleaving ordering to improve coalescing/combine opportunities | Simon Pilgrim | 2017-06-04 | 1 | -18/+14
    We currently generate BUILD_VECTOR as a tree of UNPCKL shuffles of the same type, e.g. for v4f32:
        Step 1: unpcklps 0, 2 ==> X: <?, ?, 2, 0>
              : unpcklps 1, 3 ==> Y: <?, ?, 3, 1>
        Step 2: unpcklps X, Y ==> <3, 2, 1, 0>
    The issue is that because we are not placing sequential vector elements together early enough, we fail to recognise many combinable patterns - consecutive scalar loads, extractions etc.
    Instead, this patch unpacks progressively larger sequential vector elements together, e.g. for v4f32:
        Step 1: unpcklps 0, 2 ==> X: <?, ?, 1, 0>
              : unpcklps 1, 3 ==> Y: <?, ?, 3, 2>
        Step 2: unpcklpd X, Y ==> <3, 2, 1, 0>
    This does mean that we are creating UNPCKL shuffles of different value types, but the relevant combines that benefit from this are quite capable of handling the additional BITCASTs that are now included in the shuffle tree.
    Differential Revision: https://reviews.llvm.org/D33864
    llvm-svn: 304688
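A hedged intrinsics rendering of the new scheme for v4f32, where the final unpack happens at 64-bit granularity (the BITCASTs the message mentions appear here as _mm_castps_pd/_mm_castpd_ps):

```cpp
#include <immintrin.h>

// Build <e3, e2, e1, e0>: unpcklps pairs sequential scalars, then one
// unpcklpd joins the two 64-bit pairs, keeping neighbors adjacent early.
__m128 build_v4f32(float e0, float e1, float e2, float e3) {
  __m128 x = _mm_unpacklo_ps(_mm_set_ss(e0), _mm_set_ss(e1));  // <?, ?, e1, e0>
  __m128 y = _mm_unpacklo_ps(_mm_set_ss(e2), _mm_set_ss(e3));  // <?, ?, e3, e2>
  __m128d r = _mm_unpacklo_pd(_mm_castps_pd(x), _mm_castps_pd(y));
  return _mm_castpd_ps(r);                                     // <e3, e2, e1, e0>
}
```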
* [X86][SSE] Add SCALAR_TO_VECTOR(PEXTRW/PEXTRB) support to faux shuffle combining | Simon Pilgrim | 2017-06-03 | 1 | -9/+31
    Generalized existing SCALAR_TO_VECTOR(EXTRACT_VECTOR_ELT) code to support AssertZext + PEXTRW/PEXTRB cases as well.
    llvm-svn: 304659
* [x86] simplify code for vector icmp pred transforms; NFCI | Sanjay Patel | 2017-06-02 | 1 | -31/+19
    Organizing by transform is smaller and easier to read than a squashed switch with fall-throughs.
    llvm-svn: 304611
* [X86] Correctly broadcast NaN-like integers as float on AVX. | Ahmed Bougacha | 2017-06-02 | 1 | -11/+13
    Since r288804, we try to lower build_vectors on AVX using broadcasts of float/double. However, when we broadcast integer values that happen to have a NaN float bitpattern, we lose the NaN payload, thereby changing the integer value being broadcast.
    This is caused by ConstantFP::get, to which we pass the splat i32 as a float (by bitcasting it using bitsToFloat). ConstantFP::get takes a double parameter, so we end up lossily converting a single-precision NaN to double precision.
    Instead, avoid any kind of conversion by directly building an APFloat from the splatted APInt.
    Note that this also fixes another piece of code (broadcast of subvectors) that currently isn't susceptible to the same problem.
    Also note that we could really just use APInt and ConstantInt throughout: the constant pool type doesn't matter much. Still, for consistency, use the appropriate type.
    llvm-svn: 304590
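A standalone demonstration of the payload loss being fixed, assuming IEEE-754 semantics; round-tripping a signaling-NaN float pattern through double sets the quiet bit, so the integer bits come back different:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  uint32_t bits = 0x7FA00000;  // an i32 splat that is a float sNaN pattern
  float f;
  std::memcpy(&f, &bits, sizeof f);
  double d = f;                      // float -> double quiets the NaN
  float g = static_cast<float>(d);
  uint32_t out;
  std::memcpy(&out, &g, sizeof out);
  std::printf("%08X -> %08X\n", bits, out);  // typically 7FA00000 -> 7FE00000
  return 0;
}
```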
* [x86] fix formatting; NFCI | Sanjay Patel | 2017-06-02 | 1 | -8/+9
    llvm-svn: 304576
* Tidy up a bit of r304516, use SmallVector::assign rather than for loop | David Blaikie | 2017-06-02 | 1 | -2/+2
    This might give a few better opportunities to optimize these to memcpy rather than loops - also a few minor cleanups (StringRef-izing, templating (to avoid std::function indirection), etc.).
    The SmallVector::assign(iter, iter) could be improved with the use of SFINAE, but the (iter, iter) ctor and append(iter, iter) need it too and don't have it - so, work around it for now rather than bothering with the added complexity.
    (Also, as noted in the added FIXME, these assign ops could potentially be optimized better, at least for non-trivially-copyable types.)
    llvm-svn: 304566
* Remove ADDC, ADDE, SUBC, SUBE and SETCCE support from the X86 backend, use the CARRY ops instead | Amaury Sechet | 2017-06-01 | 1 | -61/+0
    Summary: As per title. This cleans up some technical debt.
    Depends on D33374
    Reviewers: jyknight, nemanjai, mkuper, spatel, RKSimon, zvi, bkramer
    Subscribers: llvm-commits
    Differential Revision: https://reviews.llvm.org/D33390
    llvm-svn: 304435
* [X86] Match bitcast of vxi1 to pmovmsk | Zvi Rackover | 2017-06-01 | 1 | -1/+107
    Summary: Add an early combine to match patterns such as:
        (i16 bitcast (v16i1 x)) -> (i16 movmsk (v16i8 sext (v16i1 x)))
    This combine needs to happen early enough, before type legalization scalarizes the result of the setcc.
    Reviewers: igorb, craig.topper, RKSimon
    Subscribers: delena, llvm-commits
    Differential Revision: https://reviews.llvm.org/D33311
    llvm-svn: 304406
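An SSE2 sketch of the pattern's right-hand side: the compare result is already a sign-extended v16i1, and PMOVMSKB collects the 16 sign bits into a scalar:

```cpp
#include <immintrin.h>

// Each compared byte lane is 0xFF or 0x00 (a sext of an i1), so
// _mm_movemask_epi8 packs the lane sign bits into the low 16 bits.
int bitcast_v16i1_to_i16(__m128i a, __m128i b) {
  __m128i eq = _mm_cmpeq_epi8(a, b);  // v16i8: all-ones where equal
  return _mm_movemask_epi8(eq);       // PMOVMSKB -> i16-sized mask
}
```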
* Do not legalize large setcc with setcce, introduce setcccarry and do it with usubo/setcccarry | Amaury Sechet | 2017-06-01 | 1 | -0/+26
    Summary: This is a continuation of the work started in D29872. Passing the carry down as a value rather than as glue allows for further optimizations. Introducing setcccarry makes the use of addc/subc unnecessary, and we can start the removal process.
    This patch only introduces the optimization strictly required to get the same level of optimization as was available before, nothing more.
    Reviewers: jyknight, nemanjai, mkuper, spatel, RKSimon, zvi, bkramer
    Subscribers: llvm-commits
    Differential Revision: https://reviews.llvm.org/D33374
    llvm-svn: 304404
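A portable sketch of the shape this gives a wide subtraction: the low half yields both a difference and a borrow as an ordinary value (usubo), which the high half then consumes, with no glue between the nodes:

```cpp
#include <cstdint>

struct U128 { uint64_t lo, hi; };

// 128-bit subtract split the way the DAG now models it: usubo produces
// (lo difference, borrow bit), and the high half consumes the borrow.
U128 sub128(U128 a, U128 b) {
  U128 r;
  r.lo = a.lo - b.lo;
  uint64_t borrow = a.lo < b.lo ? 1 : 0;  // the usubo overflow output
  r.hi = a.hi - b.hi - borrow;            // carry passed as a value
  return r;
}
```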