path: root/llvm/lib/Target/X86
...
* Enable FeatureFastUAMem for btver2 (Sanjay Patel, 2014-11-28; 1 file changed, -2/+8)
  Allow unaligned 16-byte memop codegen for btver2. No functional changes for
  any other subtargets. Replace the existing supposed small memcpy test with an
  actual test of a small memcpy. The previous test wasn't using FileCheck
  either. This patch should allow us to close PR21541
  (http://llvm.org/bugs/show_bug.cgi?id=21541).
  Differential Revision: http://reviews.llvm.org/D6360
  llvm-svn: 222925
* AVX-512: Scalar ERI intrinsics (Elena Demikhovsky, 2014-11-26; 4 files changed, -60/+103)
  Adds the scalar ERI intrinsics, including SAE mode and a memory operand form.
  Added the AVX512_maskable_scalar template, which should cover all scalar
  instructions in the future. The main difference between
  AVX512_maskable_scalar<> and AVX512_maskable<> is that it uses X86select
  instead of vselect; this is needed because a vselect node cannot be created
  for an MVT::i1 mask on a scalar instruction.
  http://reviews.llvm.org/D6378
  llvm-svn: 222820
* Replace neverHasSideEffects=1 with hasSideEffects=0 in all .td files. (Craig Topper, 2014-11-26; 9 files changed, -69/+69)
  llvm-svn: 222801
* [X86][SSE] Improvements to byte shift shuffle matching (Simon Pilgrim, 2014-11-25; 1 file changed, -20/+20)
  Since the (v)pslldq / (v)psrldq instructions take a single input argument, it
  is useful to match them much earlier than we currently do; this prevents more
  complicated shuffles (notably insertion into a zero vector) from matching
  before them.
  Differential Revision: http://reviews.llvm.org/D6409
  llvm-svn: 222796
* [AVX512] Add 512b integer shift by variable intrinsics and patterns. (Cameron McInally, 2014-11-25; 3 files changed, -48/+43)
  llvm-svn: 222786
* Remove space before tab in all AVX512 mnemonic strings. (Craig Topper, 2014-11-25; 1 file changed, -137/+137)
  llvm-svn: 222778
* [X86] Improved target specific combine on VSELECT dag nodes. (Andrea Di Biagio, 2014-11-24; 1 file changed, -89/+8)
  This patch teaches function 'transformVSELECTtoBlendVECTOR_SHUFFLE' how to
  convert VSELECT dag nodes to shuffles on targets that do not have SSE4.1. On
  pre-SSE4.1 targets, we can still perform blend operations using movss/movsd.
  This also removes a target-specific combine that performed a premature
  lowering of VSELECT nodes to target-specific MOVSS/MOVSD nodes.
  llvm-svn: 222647
* [X86] Fixes bug in build_vector v4x32 lowering (Michael Kuperstein, 2014-11-23; 1 file changed, -3/+6)
  r222375 made some improvements to the lowering of v4i32 and v4f32
  build_vector nodes into an insertps, but it missed a case where:
  1. A single extracted element is used twice.
  2. The lower of the two non-zero indexes should be preserved, and the higher
     should be used for the dest mask.
  This caused a crash, since the source value for the insertps ends up
  uninitialized.
  Differential Revision: http://reviews.llvm.org/D6377
  llvm-svn: 222635
* Add missing override keywords. (Craig Topper, 2014-11-23; 1 file changed, -2/+2)
  llvm-svn: 222634
* Masked Vector Load and Store Intrinsics. (Elena Demikhovsky, 2014-11-23; 4 files changed, -4/+166)
  Introduced new target-independent intrinsics in order to support masked
  vector loads and stores. The loop vectorizer optimizes loops containing
  conditional memory accesses by generating these intrinsics for the existing
  targets AVX2 and AVX-512; the vectorizer asks the target about the
  availability of masked vector loads and stores. Added SDNodes for masked
  operations and lowering patterns for the X86 code generator.
  Examples:
    <16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
    declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
  A scalarizer for other targets (not AVX2/AVX-512) will be done in a separate
  patch.
  http://reviews.llvm.org/D6191
  llvm-svn: 222632
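  A minimal sketch of the consumer side, assuming hook and builder names in the
  spirit of the TargetTransformInfo/IRBuilder interfaces (the exact signatures
  have varied across LLVM versions, so treat this as illustrative rather than
  the committed API):

    #include "llvm/Analysis/TargetTransformInfo.h"
    #include "llvm/IR/IRBuilder.h"
    using namespace llvm;

    // Assumed fallback helper (not part of this commit): scalarizes the
    // conditional access when the target lacks masked load support.
    Value *scalarizeConditionalLoad(IRBuilder<> &B, Value *Ptr, Value *Mask,
                                    Value *PassThru);

    // Emit a masked load only where the target reports it as legal.
    Value *emitConditionalLoad(IRBuilder<> &B, const TargetTransformInfo &TTI,
                               Value *Ptr, unsigned Align, Value *Mask,
                               Value *PassThru, Type *VecTy) {
      if (TTI.isLegalMaskedLoad(VecTy)) // true on AVX2/AVX-512 for these types
        return B.CreateMaskedLoad(Ptr, Align, Mask, PassThru);
      return scalarizeConditionalLoad(B, Ptr, Mask, PassThru);
    }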
* Tidied up target triple OS detection. NFC (Simon Pilgrim, 2014-11-22; 2 files changed, -10/+5)
  Use the Triple::isOS*() helper functions where possible.
  llvm-svn: 222622
* [x86] Teach the vector shuffle yet another step of canonicalization. (Chandler Carruth, 2014-11-22; 1 file changed, -2/+13)
  No functionality changed yet, but this will prevent subsequent patches from
  having to handle permutations of various interleaved shuffle patterns.
  llvm-svn: 222614
* Add a feature flag for slow 32-byte unaligned memory accesses [x86]. (Sanjay Patel, 2014-11-21; 4 files changed, -10/+19)
  This patch adds a feature flag to avoid unaligned 32-byte load/store AVX
  codegen for Sandy Bridge and Ivy Bridge. There is no functionality change
  intended for those chips. Previously, the absence of AVX2 was being used as a
  proxy to detect this feature. But that hindered codegen for AVX-enabled AMD
  chips such as btver2 that do not have the 32-byte unaligned access slowdown.
  Performance measurements are included in PR21541
  (http://llvm.org/bugs/show_bug.cgi?id=21541).
  Differential Revision: http://reviews.llvm.org/D6355
  llvm-svn: 222544
* [x86] Restructure the checking patterns for v16 and v32 avx2 vector shuffle lowering to allow much better blend matching. (Chandler Carruth, 2014-11-21; 1 file changed, -28/+24)
  Specifically, with the new structure the code seems clearer to me, and we can
  correctly hit the cases where merging two 128-bit lanes is a clear win and
  can be shuffled cheaply afterward.
  llvm-svn: 222539
* [x86] Make the previous logic significantly less conservative and get a bunch more improvements. (Chandler Carruth, 2014-11-21; 1 file changed, -14/+10)
  Non-lane-crossing is fine; the key is that lane merging only makes sense for
  single-input shuffles. Not sure why I got so turned around here. The code all
  works, I was just using the wrong model for it. This only updates v4 and v8
  lowering. The v16 and v32 lowering requires restructuring the entire check
  sequence.
  llvm-svn: 222537
* [x86] Teach the x86 vector shuffle lowering to detect mergable 128-bit lanes. (Chandler Carruth, 2014-11-21; 1 file changed, -4/+154)
  By special casing these we can often either reduce the total number of
  shuffles significantly or reduce the number of (high latency on Haswell) AVX2
  shuffles that potentially cross 128-bit lanes. Even when these don't actually
  cross lanes, they have much higher latency to support that. Doing two of them
  and a blend is worse than doing a single insert across the 128-bit lanes to
  blend and then doing a single interleaved shuffle.
  While this seems like a narrow case, it kept cropping up on me and the
  difference is *huge* as you can see in many of the test cases. I first hit
  this trying to perfectly fix the interleaving shuffle patterns used by Halide
  for AVX2.
  llvm-svn: 222533
* [X86] For Silvermont CPU use 16-bit division instead of 64-bit for small positive numbers (Alexey Volkov, 2014-11-21; 4 files changed, -12/+23)
  Differential Revision: http://reviews.llvm.org/D5938
  llvm-svn: 222521
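  The intent of the transform, shown as plain C++ rather than the actual DAG
  combine (the guard values and function name are illustrative only): when
  both operands of a wide unsigned divide are known to fit in 16 bits, a
  narrow divide gives the same result, and Silvermont's divider latency grows
  with operand width.

    #include <cstdint>

    uint64_t divSmall(uint64_t A, uint64_t B) {
      // Both operands fit in 16 bits: a 16-bit DIV computes the same
      // quotient far faster than a 64-bit DIV on Silvermont.
      if (A <= 0xFFFF && B <= 0xFFFF)
        return uint64_t(uint16_t(A) / uint16_t(B));
      return A / B; // general case keeps the wide division
    }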
* Remove a bunch of unnecessary typecasts to 'const TargetRegisterClass *' (Craig Topper, 2014-11-21; 1 file changed, -3/+2)
  llvm-svn: 222509
* [X86] Do not custom lower UINT_TO_FP when the target type does not match the custom lowering. (Quentin Colombet, 2014-11-21; 1 file changed, -0/+5)
  <rdar://problem/19026326>
  llvm-svn: 222489
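  A hedged sketch of the shape of such a guard (the real change is in
  X86ISelLowering.cpp; the types checked here are placeholders, not the ones
  from the commit):

    #include "llvm/CodeGen/SelectionDAG.h"
    using namespace llvm;

    // Bail out of the custom path when the type is not one the custom
    // lowering was written for, so SelectionDAG uses the default expansion
    // instead of producing wrong code.
    SDValue LowerUINT_TO_FP(SDValue Op, SelectionDAG &DAG) {
      EVT DstVT = Op.getValueType();
      if (DstVT != MVT::f32 && DstVT != MVT::f64) // placeholder type check
        return SDValue(); // empty value => fall back to default lowering
      // ... custom lowering for the supported types ...
      return SDValue();
    }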
* Fix more instances of -Wsentinel on Windows with s/NULL/nullptr/ (Reid Kleckner, 2014-11-20; 1 file changed, -1/+1)
  Follow-up to r221940, where I must not have caught them all. NFC
  llvm-svn: 222481
* Add out of line virtual destructors to all LLVMTargetMachine subclasses (Reid Kleckner, 2014-11-20; 2 files changed, -3/+4)
  These recently all grew a unique_ptr<TargetLoweringObjectFile> member in
  r221878. When anyone calls a virtual method of a class, clang-cl requires all
  virtual methods to be semantically valid. This includes the implicit virtual
  destructor, which triggers instantiation of the unique_ptr destructor, which
  fails because the type being deleted is incomplete.
  This is just part of the ongoing saga of PR20337, which is affecting Blink as
  well. Because the MSVC ABI doesn't have key functions, we end up referencing
  the vtable and implicit destructor on any virtual call through a class. We
  don't actually end up emitting the dtor, so it'd be good if we could avoid
  this unneeded type completion work.
  llvm-svn: 222480
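  A self-contained illustration of the C++ pattern (all names here are invented
  for the example): defining the virtual destructor out of line moves the
  instantiation of ~unique_ptr<T> into the one translation unit where T is
  complete.

    // Machine.h
    #include <memory>
    class Impl; // forward declaration only; incomplete in this header

    class Machine {
      std::unique_ptr<Impl> Obj;
    public:
      Machine();
      virtual ~Machine(); // declared here, defined out of line
      virtual void run();
    };

    // Machine.cpp
    class Impl { /* complete definition lives here */ };
    Machine::Machine() : Obj(new Impl) {}
    Machine::~Machine() = default; // ~unique_ptr<Impl> instantiated here,
                                   // where Impl is complete
    void Machine::run() {}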
* X86: use the correct alloca symbol for Windows Itanium (Saleem Abdulrasool, 2014-11-20; 2 files changed, -2/+8)
  Windows itanium targets the MSVCRT, and the stack probe symbol is provided by
  MSVCRT. This corrects the emission of stack probes on i686-windows-itanium.
  llvm-svn: 222439
* Fix a typo in a comment. (Craig Topper, 2014-11-20; 1 file changed, -1/+1)
  llvm-svn: 222412
* [X86] Improved lowering of v4x32 build_vector dag nodes. (Andrea Di Biagio, 2014-11-19; 1 file changed, -58/+90)
  This patch improves the lowering of v4f32 and v4i32 build_vector dag nodes
  that are known to have at least two non-zero elements. With this patch, a
  build_vector that performs a blend with zero is converted into a shuffle.
  This is done to let the shuffle legalizer expand the dag node in an optimal
  way. For example, if we know that a build_vector performs a blend with zero,
  we can try to lower it as a movq/blend instead of always selecting an
  insertps. This patch also improves the logic that lowers a build_vector into
  an insertps with zero masking. See for example the extra test cases added to
  test sse41.ll.
  Differential Revision: http://reviews.llvm.org/D6311
  llvm-svn: 222375
* [X86][SSE] pslldq/psrldq byte shifts/rotation for SSE2 (Simon Pilgrim, 2014-11-19; 1 file changed, -31/+158)
  This patch builds on http://reviews.llvm.org/D5598 to perform byte rotation
  shuffles (lowerVectorShuffleAsByteRotate) on pre-SSSE3 (palignr) targets;
  pre-SSSE3 is only enabled on i8 and i16 vector targets, where it is a more
  definite performance gain. I've also added a separate byte shift shuffle
  (lowerVectorShuffleAsByteShift) that makes use of the ability of the
  PSLLDQ/PSRLDQ instructions to implicitly shift in zero bytes, avoiding the
  need to create a zero register that palignr would have required.
  Differential Revision: http://reviews.llvm.org/D5699
  llvm-svn: 222340
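  For reference, the same zero-filling byte shifts are exposed in C via the
  SSE2 intrinsics header; a small compilable example of the rotation idiom
  (the function name is ours, and the byte counts must be compile-time
  constants):

    #include <emmintrin.h>

    // PSLLDQ/PSRLDQ shift the whole 128-bit register by a byte count and
    // shift in zeros, so no zero register is needed (unlike palignr).
    __m128i rotateLeft4Bytes(__m128i V) {
      __m128i Hi = _mm_slli_si128(V, 4);  // pslldq $4:  V << 32, zero-filled
      __m128i Lo = _mm_srli_si128(V, 12); // psrldq $12: V >> 96, zero-filled
      return _mm_or_si128(Hi, Lo);        // byte rotation of V by 4 bytes
    }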
* Update SetVector to rely on the underlying set's insert to return a pair<iterator, bool> (David Blaikie, 2014-11-19; 1 file changed, -1/+1)
  This is to be consistent with StringSet and ultimately with the standard
  library's associative container insert function. This led to updating
  SmallSet::insert to return pair<iterator, bool>, and then to updating
  SmallPtrSet::insert to return pair<iterator, bool>, and then to updating all
  the existing users of those functions...
  llvm-svn: 222334
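  The new convention in a compilable nutshell (SmallSet is the real ADT
  container; the demo function is ours):

    #include "llvm/ADT/SmallSet.h"

    bool demoInsertResult() {
      llvm::SmallSet<int, 4> S;
      auto R1 = S.insert(42); // R1.second == true:  newly inserted
      auto R2 = S.insert(42); // R2.second == false: already present
      return R1.second && !R2.second; // true, matching std::set::insert
    }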
* [X86][AVX] 256-bit vector stack unaligned load/stores identification (Simon Pilgrim, 2014-11-18; 1 file changed, -0/+6)
  Under many circumstances the stack is not 32-byte aligned, resulting in the
  use of the vmovups/vmovupd/vmovdqu instructions when inserting ymm
  reloads/spills. This minor patch adds these instructions to the
  isFrameLoadOpcode/isFrameStoreOpcode helpers so that they can be correctly
  identified and not be treated as folded reloads/spills.
  This had also been noticed in http://llvm.org/bugs/show_bug.cgi?id=18846,
  where it was causing redundant spills; I've added a reduced test case at
  test/CodeGen/X86/pr18846.ll.
  Differential Revision: http://reviews.llvm.org/D6252
  llvm-svn: 222281
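  The shape of the change, sketched (the opcode names are the real X86 ones;
  the helper shown is a simplified stand-in for isFrameLoadOpcode, and this
  only builds inside the X86 target where X86InstrInfo.h is visible):

    // Recognize unaligned YMM loads alongside the aligned forms so the
    // spill/reload analysis sees them as frame accesses.
    static bool isYmmFrameLoad(unsigned Opcode) {
      switch (Opcode) {
      case X86::VMOVAPSYrm: // aligned form, already handled
      case X86::VMOVUPSYrm: // unaligned forms added by this patch
      case X86::VMOVUPDYrm:
      case X86::VMOVDQUYrm:
        return true;
      default:
        return false;
      }
    }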
* Revert "ADT: correctly report isMSVCEnvironment for windows itanium"Reid Kleckner2014-11-171-1/+1
| | | | | | This reverts commit r222180. llvm-svn: 222188
* ADT: correctly report isMSVCEnvironment for windows itanium (Saleem Abdulrasool, 2014-11-17; 1 file changed, -1/+1)
  The itanium environment on Windows uses MSVC and is an MSVC environment.
  Report this correctly.
  llvm-svn: 222180
* [X86] Use ADD/SUB instead of INC/DEC for Haswell and Broadwell CPUs (Alexey Volkov, 2014-11-17; 1 file changed, -2/+3)
  Differential Revision: http://reviews.llvm.org/D5934
  llvm-svn: 222141
* [x86] Remove two redundant isel patterns; their equivalent already exists in the instruction pattern. (Craig Topper, 2014-11-16; 1 file changed, -5/+0)
  llvm-svn: 222094
* [X86][SSE] Improve legal SHUFP and PSHUFD shuffle matching (Simon Pilgrim, 2014-11-15; 1 file changed, -8/+19)
  Updated X86TargetLowering::isShuffleMaskLegal to match SHUFP masks with
  commuted inputs and PSHUFD masks that reference the second input. As part of
  this I've refactored isPSHUFDMask to work in a more general manner and allow
  it to match against either the first or second input vector.
  Differential Revision: http://reviews.llvm.org/D6287
  llvm-svn: 222087
* Rename EH related stuff to be more precise (Reid Kleckner, 2014-11-14; 2 files changed, -8/+7)
  Summary: The current "WinEH" exception handling type is more about
  Itanium-style LSDA tables layered on top of the Windows native unwind info
  format instead of .eh_frame tables or EHABI unwind info. Use the name
  "ItaniumWinEH" to better reflect the hybrid nature of the design. Also rename
  isExceptionHandlingDWARF to usesItaniumLSDAForExceptions, since the LSDA is
  part of the Itanium C++ ABI document, and not the DWARF standard.
  Reviewers: echristo
  Subscribers: llvm-commits, compnerd
  Differential Revision: http://reviews.llvm.org/D6279
  llvm-svn: 222062
* [AVX512] Add 512b masked integer shift by immediate patterns. (Cameron McInally, 2014-11-14; 2 files changed, -29/+21)
  llvm-svn: 222002
* X86: use getConstant rather than getTargetConstant behind BUILD_VECTOR. (Tim Northover, 2014-11-14; 1 file changed, -7/+7)
  getTargetConstant should only be used when you can guarantee the instruction
  selected will be able to cope with the raw value. BUILD_VECTOR is rather too
  generic for this, so we should use getConstant instead. In that case, an
  instruction can still consume the constant, but if it doesn't, it'll be
  materialised through its own round of ISel.
  Should fix PR21352.
  llvm-svn: 221961
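  The distinction, sketched against the SelectionDAG API (signatures
  simplified; releases of this era took the constant type without a debug
  location argument):

    #include "llvm/CodeGen/SelectionDAG.h"
    using namespace llvm;

    // getConstant produces a node that later ISel rounds can materialize;
    // getTargetConstant promises the consuming instruction accepts the raw
    // immediate, which is too strong a promise for a generic BUILD_VECTOR.
    SDValue buildSplat(SelectionDAG &DAG, const SDLoc &DL, int64_t Val) {
      SDValue C = DAG.getConstant(Val, DL, MVT::i32); // safe, re-selectable
      SDValue Ops[] = {C, C, C, C};
      return DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v4i32, Ops);
    }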
* We can get the TLOF from the TargetMachine, so the constructor no longer requires a TargetLoweringObjectFile to be passed. (Aditya Nandakumar, 2014-11-13; 1 file changed, -1/+1)
  llvm-svn: 221926
* AVX-512: SINT_TO_FP cost model and some bugfixes (Elena Demikhovsky, 2014-11-13; 2 files changed, -4/+25)
  Checked some corner cases, for example translation of <8 x i1> to
  <8 x double>.
  llvm-svn: 221883
* This patch changes the ownership of TLOF from TargetLoweringBase to TargetMachine, so that different subtargets can share the TLOF effectively. (Aditya Nandakumar, 2014-11-13; 3 files changed, -19/+25)
  llvm-svn: 221878
* [x86] Teach the vector shuffle lowering to make a more nuanced decision between splitting a vector into 128-bit lanes and recombining them vs. decomposing things into single-input shuffles and a final blend. (Chandler Carruth, 2014-11-13; 1 file changed, -12/+78)
  This handles a large number of cases in AVX1 where the cross-lane shuffles
  would be much more expensive to represent even though we end up with a fast
  blend at the root. Instead, we can do a better job of shuffling in a single
  lane and then inserting it into the other lanes.
  This fixes the remaining bits of Halide's regression captured in PR21281 for
  AVX1. However, the bug persists in AVX2 because I've made this change
  reasonably conservative. The cases where it makes sense in AVX2 to split into
  128-bit lanes are much more rare because we can often do full permutations
  across all elements of the 256-bit vector. However, the particular test case
  in PR21281 is an example of one of the rare cases where it is *always* better
  to work in a single 128-bit lane. I'm going to try to teach the logic to
  detect and form the good code even in AVX2 next, but it will need to use a
  separate heuristic.
  Finally, there is one pesky regression here where we previously would
  craftily use vpermilps in AVX1 to shuffle both high and low halves at the
  same time. We no longer pull that off, and not for any really good reason.
  Ultimately, I think this is just another missing nuance to the selection
  heuristic that I'll try to add in afterward, but this change already seems
  strictly worth doing considering the magnitude of the improvements in common
  matrix math shuffle patterns.
  As always, please let me know if this causes a surprising regression for you.
  llvm-svn: 221861
* [x86] Don't form overly fragmented blends when splitting and re-combining shuffles because nothing was available in the wider vector type. (Chandler Carruth, 2014-11-13; 1 file changed, -2/+41)
  The key observation (which I've put in the comments for future maintainers)
  is that at this point, no further combining is really possible. And so even
  though these shuffles trivially could be combined, we need to actually do
  that as we produce them when producing them this late in the lowering.
  This fixes another (huge) part of the Halide vector shuffle regressions. As
  it happens, this was already well covered by the tests, but I hadn't noticed
  how bad some of these got. The specific patterns that turn directly into
  unpckl/h patterns were occurring *many* times in common vector processing
  code.
  There are still more problems here sadly, but trying to incrementally tease
  them apart and it looks like this is the core of the problem in the splitting
  logic.
  There is some chance of regression here, you can see it in the test changes.
  Specifically, where we stop forming pshufb in some cases, it is possible that
  pshufb was in fact faster. Intel "says" that pshufb is slower than the
  instruction sequences replacing it.
  llvm-svn: 221852
* Expose the number of Newton-Raphson iterations applied to the hardware's reciprocal estimate as a parameter (x86). (Sanjay Patel, 2014-11-12; 1 file changed, -3/+7)
  This is a follow-on to r221706 and r221731 and discussed in more detail in
  PR21385.
  This patch also loosens the testcase checking for btver2. We know that the
  "1.0" will be loaded, but we can't tell exactly when, so replace the
  CHECK-NEXT specifiers with plain CHECKs. The CHECK-NEXT sequence relied on a
  quirk of post-RA-scheduling that may change independently of anything in
  these tests.
  llvm-svn: 221819
* [AVX512] Add integer shift by immediate intrinsics. (Cameron McInally, 2014-11-12; 2 files changed, -1/+12)
  llvm-svn: 221811
* [x86] Start improving the matching of unpck instructions based on test cases from Halide folks. (Chandler Carruth, 2014-11-12; 1 file changed, -0/+6)
  This initial step was extracted from a prototype change by Clay Wood to try
  and address regressions found with Halide and the new vector shuffle
  lowering.
  llvm-svn: 221779
* AVX-512: Intrinsics for ERI (Elena Demikhovsky, 2014-11-12; 5 files changed, -59/+94)
  Three instructions: vrcp28, vrsqrt28, vexp2; vector forms only. The
  intrinsics include an SAE (Suppress All Exceptions) parameter.
  http://reviews.llvm.org/D6214
  llvm-svn: 221774
* Pass an ArrayRef to MCDisassembler::getInstruction. (Rafael Espindola, 2014-11-12; 2 files changed, -9/+21)
  With this patch MCDisassembler::getInstruction takes an ArrayRef<uint8_t>
  instead of a MemoryObject. Even on X86 there is a maximum size an instruction
  can have. Given that, it seems way simpler and more efficient to just pass an
  ArrayRef to the disassembler instead of a MemoryObject, which made it do a
  virtual call every time it wanted some extra bytes.
  llvm-svn: 221751
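  The hook's shape after this change, roughly (parameter list per the
  MCDisassembler header of that era; the wrapper function is ours):

    #include "llvm/MC/MCDisassembler.h"
    #include "llvm/MC/MCInst.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    // Decoding now indexes into a plain byte slice; no virtual MemoryObject
    // reads are needed to fetch more bytes.
    MCDisassembler::DecodeStatus decodeOne(const MCDisassembler &Dis,
                                           MCInst &Inst, uint64_t &Size,
                                           ArrayRef<uint8_t> Bytes,
                                           uint64_t Addr) {
      return Dis.getInstruction(Inst, Size, Bytes, Addr, nulls(), nulls());
    }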
* Initialize new subtarget feature variable for generating reciprocal estimate instructions. (Sanjay Patel, 2014-11-11; 1 file changed, -0/+1)
  This was missed in r221706.
  llvm-svn: 221731
* Add Forward Control-Flow Integrity. (Tom Roeder, 2014-11-11; 2 files changed, -0/+18)
  This commit adds a new pass that can inject checks before indirect calls to
  make sure that these calls target known locations. It supports three types of
  checks and, at compile time, it can take the name of a custom function to
  call when an indirect call check fails. The default failure function ignores
  the error and continues.
  This pass incidentally moves the function JumpInstrTables::transformType from
  private to public and makes it static (with a new argument that specifies the
  table type to use); this is so that the CFI code can transform function types
  at call sites to determine which jump-instruction table to use for the check
  at that site.
  Also, this removes support for jumptables in ARM, pending further performance
  analysis and discussion.
  Review: http://reviews.llvm.org/D4167
  llvm-svn: 221708
* Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385). (Sanjay Patel, 2014-11-11; 4 files changed, -1/+44)
  This is a first step for generating SSE rcp instructions for reciprocal calcs
  when fast-math allows it. This is very similar to the rsqrt optimization
  enabled in D5658 (http://reviews.llvm.org/rL220570).
  For now, be conservative and only enable this for AMD btver2 where
  performance improves significantly both in terms of latency and throughput.
  We may never enable this codegen for Intel Core* chips because the divider
  circuits are just too fast. On SandyBridge, divss can be as fast as 10 cycles
  versus the 21 cycle critical path for the rcp + mul + sub + mul + add
  estimate.
  Follow-on patches may allow configuration of the number of Newton-Raphson
  refinement steps, add AVX512 support, and enable the optimization for more
  chips.
  More background here: http://llvm.org/bugs/show_bug.cgi?id=21385
  Differential Revision: http://reviews.llvm.org/D6175
  llvm-svn: 221706
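  The refinement math behind that critical path, as a scalar sketch (one
  Newton-Raphson step, which roughly doubles the ~12 bits of accuracy of the
  hardware estimate; the function name is ours):

    #include <xmmintrin.h>

    float fastRecip(float A) {
      // rcpss: ~12-bit reciprocal estimate x0 ~= 1/A
      float X0 = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(A)));
      // One Newton-Raphson step: x1 = x0 + x0*(1 - A*x0), i.e. the
      // mul + sub + mul + add sequence the commit message counts against
      // divss latency.
      return X0 + X0 * (1.0f - A * X0);
    }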
* Use an 8-bit immediate when possible. (Rafael Espindola, 2014-11-11; 1 file changed, -2/+14)
  This fixes PR21529.
  llvm-svn: 221700
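  The predicate involved, in compilable form (isInt<> is a real helper from
  llvm/Support/MathExtras.h; the wrapper function is ours): many x86 ALU
  instructions have a short encoding that takes a sign-extended 8-bit
  immediate.

    #include "llvm/Support/MathExtras.h"
    #include <cstdint>

    bool fitsImm8(int64_t Imm) {
      // True for -128..127: e.g. 'add $16, %rax' can use the imm8 form,
      // while 'add $256, %rax' needs the imm32 form.
      return llvm::isInt<8>(Imm);
    }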
* [X86][ELF] Fix PR20243 - leaf frame pointer bug with TLS access (Dario Domizioli, 2014-11-11; 1 file changed, -0/+1)
  The ISel lowering for global TLS access in PIC mode was creating a pseudo
  instruction that is later expanded to a call, but the code was not setting
  the hasCalls flag in the MachineFrameInfo alongside the adjustsStack flag.
  This caused some functions to be mistakenly recognized as leaf functions, and
  this in turn affected the decision to eliminate the frame pointer.
  With the fix, hasCalls is properly set and the leaf frame pointer is
  correctly preserved.
  llvm-svn: 221695
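  A sketch of the one-line nature of the fix (accessor names per the
  MachineFrameInfo API of that era, when getFrameInfo() returned a pointer;
  the helper function is ours):

    #include "llvm/CodeGen/MachineFrameInfo.h"
    #include "llvm/CodeGen/MachineFunction.h"

    // Pseudo instructions that later expand to calls must mark the frame
    // as having calls, not just as adjusting the stack; otherwise the
    // function looks like a leaf and the frame pointer may be dropped.
    void markTLSCall(llvm::MachineFunction &MF) {
      llvm::MachineFrameInfo *MFI = MF.getFrameInfo();
      MFI->setAdjustsStack(true); // was already being set
      MFI->setHasCalls(true);     // the missing piece added by this fix
    }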