path: root/llvm/lib/Target/X86
* Fix -Wself-assign from r289955
  Hans Wennborg, 2016-12-16 (1 file, -7/+8)
  llvm-svn: 289962
* [X86] Fold (setcc (cmp (atomic_load_add x, -C) C), COND) to
  (setcc (LADD x, -C), COND) (PR31367)
  Hans Wennborg, 2016-12-16 (1 file, -9/+20)

  atomic_load_add returns the value before addition, but sets EFLAGS
  based on the result of the addition. That means it's setting the flags
  based on effectively subtracting C from the value at x, which is also
  what the outer cmp does.

  This targets a pattern that occurs frequently with reference counting
  pointers:

    void decrement(long volatile *ptr) {
      if (_InterlockedDecrement(ptr) == 0)
        release();
    }

  Clang would previously compile it (for 32-bit at -Os) as:

    00000000 <?decrement@@YAXPCJ@Z>:
       0: 8b 44 24 04          mov    0x4(%esp),%eax
       4: 31 c9                xor    %ecx,%ecx
       6: 49                   dec    %ecx
       7: f0 0f c1 08          lock xadd %ecx,(%eax)
       b: 83 f9 01             cmp    $0x1,%ecx
       e: 0f 84 00 00 00 00    je     14 <?decrement@@YAXPCJ@Z+0x14>
      14: c3                   ret

  and with this patch it becomes:

    00000000 <?decrement@@YAXPCJ@Z>:
       0: 8b 44 24 04          mov    0x4(%esp),%eax
       4: f0 ff 08             lock decl (%eax)
       7: 0f 84 00 00 00 00    je     d <?decrement@@YAXPCJ@Z+0xd>
       d: c3                   ret

  (Equivalent variants with _InterlockedExchangeAdd, std::atomic<>'s
  fetch_add or pre-decrement operator generate the same code.)

  Differential Revision: https://reviews.llvm.org/D27781
  llvm-svn: 289955
* [X86][AVX] Call lowerVectorShuffleWithSHUFPS directly instead of
  calling DAG.getVectorShuffle (PR27885)
  Simon Pilgrim, 2016-12-16 (1 file, -3/+4)
  We've already done the hard work of ensuring the mask is safe for
  'SHUFPS'.
  llvm-svn: 289950
* [X86][AVX512] use a single shufps for 512-bit vectors when it can
  save instructions
  Simon Pilgrim, 2016-12-16 (1 file, -1/+13)
  This is the 512-bit counterpart to the 128-bit transform checked in
  here: https://reviews.llvm.org/rL289837
  This patch is based on the draft by @sroland (Roland Scheidegger)
  that is attached to PR27885: https://llvm.org/bugs/show_bug.cgi?id=27885
  llvm-svn: 289946
* [X86][SSE] Combine shuffles to MOVSS/MOVSD whatever the domain.
  Simon Pilgrim, 2016-12-16 (1 file, -9/+9)
  We already do the same thing in shuffle lowering; but don't do it if
  we have SSE41 (PBLEND) instead.
  llvm-svn: 289937
* [GlobalISel] Drop workaround for Legalizer member/class sharing a
  name. NFC.
  Ahmed Bougacha, 2016-12-15 (1 file, -1/+1)
  MachineLegalizer used to be the name of both the class and the
  member, causing GCC errors. r276522 fixed that by renaming the member
  to just 'Legalizer'. The 'class' workaround isn't necessary anymore;
  drop it.
  llvm-svn: 289848
* [x86] use a single shufps for 256-bit vectors when it can save
  instructions
  Sanjay Patel, 2016-12-15 (1 file, -1/+13)
  This is the 256-bit counterpart to the 128-bit transform checked in
  here: https://reviews.llvm.org/rL289837
  This patch is based on the draft by @sroland (Roland Scheidegger)
  that is attached to PR27885: https://llvm.org/bugs/show_bug.cgi?id=27885
  llvm-svn: 289846
* [x86] use a single shufps when it can save instructions
  Sanjay Patel, 2016-12-15 (1 file, -14/+19)

  This is a tiny patch with a big pile of test changes. This partially
  fixes PR27885: https://llvm.org/bugs/show_bug.cgi?id=27885

  My motivating case looks like this:

    - vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,2]
    - vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
    - vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
    + vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]

  And this happens several times in the diffs. For chips with
  domain-crossing penalties, the instruction count and size reduction
  should usually overcome any potential domain-crossing penalty due to
  using an FP op in a sequence of int ops. For chips such as recent
  Intel big cores and Atom, there is no domain-crossing penalty for
  shufps, so using shufps is a pure win.

  So the test case diffs all appear to be improvements except one test
  in vector-shuffle-combining.ll where we miss an opportunity to use a
  shift to generate zero elements and one test in combine-sra.ll where
  multiple uses prevent the expected shuffle combining.

  Differential Revision: https://reviews.llvm.org/D27692
  llvm-svn: 289837
* [X86][SSE] Fix domains for scalar store instructions
  Simon Pilgrim, 2016-12-15 (1 file, -0/+4)
  As discussed on D27692.
  llvm-svn: 289834
* [X86][AVX512] Moved instruction domain lookups to the right table. NFCI.
  Simon Pilgrim, 2016-12-15 (1 file, -4/+4)
  Avoid duplicating instructions in the int32/int64 domains.
  llvm-svn: 289830
* [X86][SSE] Fix domains for VZEXT_LOAD type instructions
  Simon Pilgrim, 2016-12-15 (1 file, -0/+6)
  Add the missing domain equivalences for the movss, movsd, movd and
  movq zero-extending load instructions.
  Differential Revision: https://reviews.llvm.org/D27684
  llvm-svn: 289825
* [CostModel][X86] Updated reverse shuffle costs
  Simon Pilgrim, 2016-12-15 (1 file, -5/+95)
  llvm-svn: 289819
* Fix bug 30945 - [AVX512] Failure to flip vector comparison to remove
  not mask instruction
  Michael Zuckerman, 2016-12-14 (1 file, -3/+14)

  Adds a new optimization opportunity via a new X86ISelLowering
  pattern. The test case is shown in
  https://llvm.org/bugs/show_bug.cgi?id=30945.

  Test explanation: Select gets three arguments: mask, op and op2. In
  this case, the mask is the result of an ICMP. That ICMP compares (for
  equality) the zero-initializer vector and the result of the first
  ICMP. In general, the result of "cmp eq, op1, zeroinitializer" is
  "not(op1)", where op1 is a mask. By swapping the two arguments inside
  the Select instruction we can get the same result, without the need
  for the intermediate "cmp eq, op1, zeroinitializer" step.

  Missed optimization opportunity:
    vpcmpled %zmm0, %zmm1, %k0
    knotw %k0, %k1
  can be combined to:
    vpcmpgtd %zmm0, %zmm2, %k1

  Reviewers: delena, igorb
  Committed after check-all.
  Differential Revision: https://reviews.llvm.org/D27160
  llvm-svn: 289653
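  A minimal IR sketch of the pattern and its rewrite (illustrative,
  not taken from the commit's test case):

    ; Before: the compare result is inverted by comparing it against
    ; zeroinitializer, which costs an extra knotw.
    define <16 x i32> @before(<16 x i32> %a, <16 x i32> %b,
                              <16 x i32> %x, <16 x i32> %y) {
      %m   = icmp sle <16 x i32> %a, %b
      %not = icmp eq <16 x i1> %m, zeroinitializer
      %r   = select <16 x i1> %not, <16 x i32> %x, <16 x i32> %y
      ret <16 x i32> %r
    }

    ; After: flip the compare (not(sle) == sgt) and select directly.
    define <16 x i32> @after(<16 x i32> %a, <16 x i32> %b,
                             <16 x i32> %x, <16 x i32> %y) {
      %m = icmp sgt <16 x i32> %a, %b
      %r = select <16 x i1> %m, <16 x i32> %x, <16 x i32> %y
      ret <16 x i32> %r
    }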
* Replace APFloatBase static fltSemantics data members with getter
  functions
  Stephan Bergmann, 2016-12-14 (1 file, -14/+14)
  At least the plugin used by the LibreOffice build
  (<https://wiki.documentfoundation.org/Development/Clang_plugins>)
  indirectly uses those members (through inline functions in LLVM/Clang
  include files in turn using them), but they are not exported by
  utils/extract_symbols.py on Windows, and accessing data across
  DLL/EXE boundaries on Windows is generally problematic.
  Differential Revision: https://reviews.llvm.org/D26671
  llvm-svn: 289647
* [peephole] Enhance folding logic to work for STATEPOINTs
  Philip Reames, 2016-12-13 (1 file, -18/+7)

  The general idea here is to get enough of the existing restrictions
  out of the way that the already existing folding logic in
  foldMemoryOperand can kick in for STATEPOINTs and fold references to
  immutable stack slots. The key changes are:
    - Support for folding multiple operands at once which reference the
      same load
    - Support for folding multiple loads into a single instruction
    - Walk all the operands of the instruction for variadic
      instructions (this is a bug fix!)

  Once this lands, I'll post another patch which refactors the TII
  interface here. There's nothing actually x86 specific about the x86
  code used here.

  Differential Revision: https://reviews.llvm.org/D24103
  llvm-svn: 289510
* [x86] fix formatting; NFC
  Sanjay Patel, 2016-12-12 (1 file, -1/+1)
  llvm-svn: 289476
* Update inline argument comment. NFCI.
  Simon Pilgrim, 2016-12-12 (1 file, -3/+3)
  The combineX86ShufflesRecursively 'HasPSHUFB' flag has been the more
  generic 'HasVariableMask' flag for some time.
  llvm-svn: 289430
* [X86][SSE] Add support for combining SSE VSHLI/VSRLI uniform constant
  shifts.
  Simon Pilgrim, 2016-12-12 (1 file, -0/+33)
  Fixes some missed constant folding opportunities and allows us to
  combine shuffles that end with a logical bit shift.
  llvm-svn: 289429
* [X86][SSE] Lower suitably sign-extended mul vXi64 using PMULDQ
  Simon Pilgrim, 2016-12-12 (1 file, -4/+21)
  PMULDQ returns the 64-bit result of the signed multiplication of the
  lower 32 bits of vXi64 vector inputs; we can lower with this if the
  sign bits stretch that far.
  Differential Revision: https://reviews.llvm.org/D27657
  llvm-svn: 289426
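  A minimal IR sketch of a case this targets (illustrative, not taken
  from the commit's tests):

    ; Both operands are sign-extended from 32 bits, so their sign bits
    ; reach past bit 32 and the multiply can lower to PMULDQ.
    define <2 x i64> @mul_sext(<2 x i32> %a, <2 x i32> %b) {
      %ax = sext <2 x i32> %a to <2 x i64>
      %bx = sext <2 x i32> %b to <2 x i64>
      %r = mul <2 x i64> %ax, %bx
      ret <2 x i64> %r
    }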
* [X86] Teach selectScalarSSELoad to accept full 128-bit vector loads
  and the X86ISD::VZEXT_LOAD opcode.
  Craig Topper, 2016-12-12 (1 file, -0/+22)
  Disable peephole on some of the tests that no longer require it to
  properly fold scalar intrinsics.
  llvm-svn: 289424
* [X86] Change CMPSS/CMPSD intrinsic instructions to use
  sse_load_f32/f64 as their memory pattern instead of a full vector
  load.
  Craig Topper, 2016-12-12 (1 file, -12/+12)
  These intrinsics only load a single element. We should use
  sse_load_f32/f64 to give more options for what loads can match.
  Currently these instructions often only get their load folded thanks
  to the load folding in the peephole pass. I plan to add more types of
  loads to sse_load_f32/64 so we can match without the peephole.
  llvm-svn: 289423
* [X86] Remove some intrinsic instructions from hasPartialRegUpdate
  Craig Topper, 2016-12-12 (1 file, -8/+0)

  Summary: These intrinsic instructions are all selected from
  intrinsics that have well-defined behavior for where the upper bits
  come from. It's not the same place as the lower bits.

  As you can see, we were suppressing load folding for these
  instructions in some cases. In none of the cases was the separate
  load helping avoid a partial dependency on the destination register.
  So we should just go ahead and allow the load to be folded.

  Only foldMemoryOperand was suppressing folding for these. They all
  have patterns for folding sse_load_f32/f64 that aren't gated with
  OptForSize, but sse_load_f32/f64 doesn't allow 128-bit vector loads.
  It only allows scalar_to_vector and vzmovl of scalar loads to match.
  There's no reason we can't allow a 128-bit vector load to be
  narrowed, so I would like to fix sse_load_f32/f64 to allow that. And
  if I do that, it changes some of these same test cases to fold the
  load too.

  Reviewers: spatel, zvi, RKSimon
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D27611
  llvm-svn: 289419
* [X86][SSE] Add support for combining target shuffles to SHUFPD.
  Simon Pilgrim, 2016-12-11 (1 file, -17/+45)
  llvm-svn: 289407
* [X86][AVX512] Add missing patterns for broadcast fallback when the
  load node has multiple uses (for v4i64 and v4f64).
  Ayman Musa, 2016-12-11 (1 file, -0/+6)
  When the load node which the broadcast instruction broadcasts has
  multiple uses, it cannot be folded. A fallback pattern is added to
  catch these cases and provide another solution.
  Differential Revision: https://reviews.llvm.org/D27661
  llvm-svn: 289404
* [X86] Regcall - Adding support for mask types
  Oren Ben Simhon, 2016-12-11 (3 files, -38/+62)
  The regcall calling convention passes mask type arguments in x86 GPR
  registers. This review includes the changes required to support
  v32i1, v16i1 and v8i1.
  Differential Revision: https://reviews.llvm.org/D27148
  llvm-svn: 289383
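  A minimal IR sketch of an affected signature (illustrative; the
  function and its body are hypothetical):

    ; A v32i1 mask argument under the regcall convention; with this
    ; change it can be passed in a general-purpose register.
    define x86_regcallcc i1 @first_bit(<32 x i1> %mask) {
      %b = extractelement <32 x i1> %mask, i32 0
      ret i1 %b
    }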
* [X86] Fix a comment to say 'an FMA' instead of 'a FMA'. NFC
  Craig Topper, 2016-12-11 (1 file, -1/+1)
  llvm-svn: 289352
* [X86] Remove masking from 512-bit VPERMIL intrinsics in preparation
  for being able to constant fold them in InstCombineCalls like we do
  for 128/256-bit.
  Craig Topper, 2016-12-11 (1 file, -4/+2)
  llvm-svn: 289350
* [X86] Remove masking from 512-bit PSHUFB intrinsics in preparation
  for being able to constant fold it in InstCombineCalls like we do for
  128/256-bit.
  Craig Topper, 2016-12-10 (1 file, -2/+1)
  llvm-svn: 289344
* [X86][SSE] Ensure UNPCK inputs are a consistent value type in
  LowerHorizontalByteSum
  Simon Pilgrim, 2016-12-10 (1 file, -2/+3)
  llvm-svn: 289341
* [AVX-512] Remove 128/256 masked vpermil intrinsics and autoupgrade to
  a select around the unmasked avx1 intrinsics.
  Craig Topper, 2016-12-10 (1 file, -8/+0)
  llvm-svn: 289340
* [X86][SSE] Move ZeroVector creation into the shuffle pattern case
  where it's actually used.
  Simon Pilgrim, 2016-12-10 (1 file, -2/+2)
  Also fix the ZeroVector's type - I've no idea how this hasn't caused
  problems before.
  llvm-svn: 289336
* [AVX-512] Add support for lowering (v2i64 (fp_to_sint (v2f32))) to
  vcvttps2uqq when AVX512DQ and AVX512VL are available.
  Craig Topper, 2016-12-10 (2 files, -6/+25)
  llvm-svn: 289335
* [X86] Clarify indentation. NFC
  Craig Topper, 2016-12-10 (1 file, -1/+1)
  llvm-svn: 289334
* [X86] Combine LowerFP_TO_SINT and LowerFP_TO_UINT. They only differ
  by a single boolean flag passed to a helper function. Just check the
  opcode and create the flag.
  Craig Topper, 2016-12-10 (2 files, -24/+7)
  llvm-svn: 289333
* [X86] Use X86ISD::CVTTP2SI and X86ISD::CVTTP2UI for lowering 128-bit
  cvttps2qq and cvttps2uqq intrinsics since there is a mismatch between
  the number of input and output elements.
  Craig Topper, 2016-12-10 (2 files, -7/+7)
  Ideally ISD::FP_TO_SINT and ISD::FP_TO_UINT would only be used for
  cases with the same number of input and output elements. Similar
  things have already been done for other convert intrinsics.
  llvm-svn: 289316
* [SelectionDAG] Add knownbits support for EXTRACT_VECTOR_ELT opcodes
  (REAPPLIED)
  Simon Pilgrim, 2016-12-09 (1 file, -1/+6)
  Reapplied with fix for PR31323 - X86 SSE2 vXi16 multiplies for
  illegal types were creating CONCAT_VECTORS nodes with vector inputs
  that might not total the number of elements in the result type.
  llvm-svn: 289232
* [X86] Modify patterns for the memory forms of the RCP/RSQRT/SQRT
  intrinsics to only allow (scalar_to_vector (loadf32/loadf64)) instead
  of anything that sse_load_f32/f64 can match.
  Craig Topper, 2016-12-09 (1 file, -14/+11)
  sse_load_f32/f64 can also match loads that are zero extended to
  vectors. We shouldn't match those because we wouldn't be able to get
  the instruction to zero the upper bits like the intrinsic semantics
  would require for such a case. There is a test case that does depend
  on this behavior.
  llvm-svn: 289193
* [AVX-512] Correctly preserve the passthru semantics of the FMA scalar
  intrinsics
  Craig Topper, 2016-12-09 (5 files, -57/+107)

  Summary: Scalar intrinsics have specific semantics about which
  input's upper bits are passed through to the output. The same input
  is also supposed to be the one we use for the lower element when the
  mask bit is 0 in a masked operation. We aren't currently keeping
  these semantics with instruction selection.

  This patch corrects this by introducing new scalar FMA ISD nodes that
  indicate whether operand 1 (one of the multiply inputs) or operand 3
  (the addition/subtraction input) should pass through its upper bits.
  We use this information to select the 213/132 form for the operand 1
  version and the 231 form for the operand 3 version.

  We also use this information to suppress combining FNEG operations on
  the passthru input since semantically the passthru bits aren't
  negated. This is stronger than the earlier check added for a user
  being SELECTS, so we can remove that.

  This fixes PR30913.

  Reviewers: delena, zvi, v_klochkov
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D27144
  llvm-svn: 289190
* [X86] Add masked versions of VPERMT2* and VPERMI2* to load folding
  tables.
  Craig Topper, 2016-12-09 (1 file, -6/+84)
  llvm-svn: 289186
* [AVX-512] Add vpermilps/pd to load folding tables.
  Craig Topper, 2016-12-09 (1 file, -0/+36)
  llvm-svn: 289173
* IR, X86: Understand !absolute_symbol metadata on global variables.
  Peter Collingbourne, 2016-12-08 (6 files, -11/+58)

  Summary: Attaching !absolute_symbol to a global variable does two
  things:
    1) Marks it as an absolute symbol reference.
    2) Specifies the value range of that symbol's address.
  Teach the X86 backend to allow absolute symbols to appear in place of
  immediates by extending the relocImm and mov64imm32 matchers. Start
  using relocImm in more places where it is legal.

  As previously proposed on llvm-dev:
  http://lists.llvm.org/pipermail/llvm-dev/2016-October/105800.html

  Differential Revision: https://reviews.llvm.org/D25878
  llvm-svn: 289087
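  A minimal IR sketch of the metadata (illustrative; the symbol and its
  range are hypothetical):

    ; @foo is an absolute symbol whose address lies in [0, 2^32), so a
    ; reference to it can be encoded as a 32-bit immediate.
    @foo = external global i8, !absolute_symbol !0
    !0 = !{i64 0, i64 4294967296}

    define i8* @addr_of_foo() {
      ret i8* @foo
    }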
* LivePhysReg: Use reference instead of pointer in init(); NFC
  Matthias Braun, 2016-12-08 (1 file, -1/+1)
  llvm-svn: 289002
* [X86] Skip over DEBUG_VALUE while looking for start of call sequence
  Michael Kuperstein, 2016-12-07 (1 file, -3/+3)
  If we don't skip over DEBUG_VALUEs, we get differences between -g and
  non-g code. This fixes PR31242.
  Differential Revision: https://reviews.llvm.org/D27485
  llvm-svn: 288965
* [X86] Do not assume "ri" instructions always have an immediate
  operand
  Michael Kuperstein, 2016-12-07 (2 files, -3/+8)
  The second operand of an "ri" instruction may be an immediate, but it
  may also be a global variable, so we should not make any assumptions.
  This fixes PR31271.
  Differential Revision: https://reviews.llvm.org/D27481
  llvm-svn: 288964
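  A minimal IR sketch of how a global can land in the "immediate" slot
  (illustrative; names are hypothetical):

    ; Taking the address of a global on 32-bit x86 emits a MOV32ri
    ; whose second operand is a global address, not a plain immediate.
    @g = global i32 0

    define i32* @addr_of_g() {
      ret i32* @g
    }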
* [X86][SSE] Remove AND -> VZEXT combine
  Simon Pilgrim, 2016-12-07 (1 file, -92/+0)
  This is now performed more generally by the target shuffle combine
  code. Already covered by tests that were originally added in
  D7666/rL229480 to support combineVectorZext (or VectorZextCombine as
  it was known then...).
  Differential Revision: https://reviews.llvm.org/D27510
  llvm-svn: 288918
* [X86][SSE] Consistently set MOVD/MOVQ load/store/move instructions to
  integer domain
  Simon Pilgrim, 2016-12-07 (2 files, -67/+81)
  We are being inconsistent with these instructions (and all their
  variants...), with a random mix of them using the default float
  domain.
  Differential Revision: https://reviews.llvm.org/D27419
  llvm-svn: 288902
* [X86][XOP] Fix VPERMIL2 non-constant pool shuffle decoding (PR31296)
  Simon Pilgrim, 2016-12-07 (1 file, -6/+8)
  The non-constant pool version of DecodeVPERMIL2PMask was not
  offsetting correctly for the second input. I've updated the code to
  match the implementation in the constant-pool version.
  Annoyingly, this bug was hidden for so long because it's tricky to
  get combines to create useful variable shuffle masks that don't
  become constant-pool entries.
  llvm-svn: 288898
* [X86] Prefer reduced width multiplication over pmulld on Silvermont
  Zvi Rackover, 2016-12-06 (4 files, -3/+22)

  Summary: Prefer expansions such as pmullw, pmulhw, unpacklwd,
  unpackhwd over pmulld.

  On Silvermont [source: Optimization Reference Manual]:
    PMULLD has a throughput of 1/11 [instructions/cycle].
    PMULHUW/PMULHW/PMULLW have a throughput of 1/2 [instructions/cycle].

  Fixes pr31202. Analysis of this issue was done by Fahana Aleen.

  Reviewers: wmi, delena, mkuper
  Subscribers: RKSimon, llvm-commits
  Differential Revision: https://reviews.llvm.org/D27203
  llvm-svn: 288844
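  A minimal IR sketch of a multiply that benefits (illustrative; the
  exact conditions under which the expansion fires are in D27203):

    ; Inputs are sign-extended from 16 bits, so the full 32-bit
    ; products can be formed with pmullw/pmulhw plus unpacks instead
    ; of a slow pmulld.
    define <8 x i32> @mul_from_i16(<8 x i16> %a, <8 x i16> %b) {
      %ax = sext <8 x i16> %a to <8 x i32>
      %bx = sext <8 x i16> %b to <8 x i32>
      %r = mul <8 x i32> %ax, %bx
      ret <8 x i32> %r
    }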
* [X86][AVX512] Detect repeated constant patterns in BUILD_VECTOR
  suitable for broadcasting.
  Ayman Musa, 2016-12-06 (1 file, -3/+116)
  Check if a build_vector node includes a repeated constant pattern and
  replace it with a broadcast of that pattern. For example:
  "build_vector <0, 1, 2, 3, 0, 1, 2, 3>" would be replaced by
  "broadcast <0, 1, 2, 3>".
  Differential Revision: https://reviews.llvm.org/D26802
  llvm-svn: 288804
* [framelowering] Improve tracking of first CS pop instruction.
  Florian Hahn, 2016-12-06 (1 file, -5/+8)
  Summary: This patch makes sure FirstCSPop and MBBI never point to
  DBG_VALUE instructions, which affected the code generated.
  Reviewers: mkuper, aprantl, MatzeB
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D27343
  llvm-svn: 288794