path: root/llvm/lib/Target/X86
Commit message | Author | Age | Files | Lines
...
* [x86] Explicitly lower to a blend early if it is trivial to do so for v8f32 shuffles in the new vector shuffle lowering code. This is very cheap to do and makes it much more clear that anything more expensive but overlapping with this lowering should be selected afterward (for example using AVX2's VPERMPS). However, no functionality changed here as without this code we would fall through to create no-op shuffles of each input and a blend. =] llvm-svn: 218209
  (Chandler Carruth, 2014-09-21; 1 file changed, -0/+5)
* [x86] Teach the new vector shuffle lowering of v4f64 to prefer a direct VBLENDPD over using VSHUFPD. While the 256-bit variant of VBLENDPD slows down to the same speed as VSHUFPD on Sandy Bridge CPUs, it has twice the reciprocal throughput on Ivy Bridge CPUs much like it does everywhere for 128-bits. There isn't a downside, so just eagerly use this instruction when it suffices. llvm-svn: 218208
  (Chandler Carruth, 2014-09-21; 1 file changed, -0/+5)
* [x86] Switch the blend implementation to use a MVT switch rather than awkward conditions. The readability improvement of this will be even more important as I generalize it to handle more types. No functionality changed. llvm-svn: 218205
  (Chandler Carruth, 2014-09-21; 1 file changed, -18/+25)
* [x86] Remove some essentially lying comments from the v4f64 path of the new vector shuffle lowering. llvm-svn: 218204
  (Chandler Carruth, 2014-09-21; 1 file changed, -6/+0)
* [x86] Fix a helper to reflect that what we actually care about is 128-bit lane crossings, not 'half' crossings. This came up in code review ages ago, but I hadn't really addressed it. Also added some documentation for the helper. No functionality changed. llvm-svn: 218203
  (Chandler Carruth, 2014-09-21; 1 file changed, -9/+12)
* [x86] Teach the new vector shuffle lowering the first step toward more actual support for complex AVX shuffling tricks. We can do independent blends of the low and high 128-bit lanes of an avx vector, so shuffle the inputs into place and then do the blend at 256 bits. This will in many cases remove one blend instruction. The next step is to permute the low and high halves in-place rather than extracting them and re-inserting them. llvm-svn: 218202
  (Chandler Carruth, 2014-09-21; 1 file changed, -1/+40)
* [x86] Teach the new vector shuffle lowering to use VPERMILPD for single-input shuffles with doubles. This allows them to fold memory operands into the shuffle, etc. This is just the analog to the v4f32 case in my prior commit. llvm-svn: 218193
  (Chandler Carruth, 2014-09-20; 1 file changed, -0/+8)
* [x86] Teach the new vector shuffle lowering to use the AVX VPERMILPS instruction for single-vector floating point shuffles. This in turn allows the shuffles to fold a load into the instruction which is one of the common regressions hit with the new shuffle lowering. llvm-svn: 218190
  (Chandler Carruth, 2014-09-20; 1 file changed, -3/+11)
* [x86] Teach the v4f32 path of the new shuffle lowering to handle the tricky case of single-element insertion into the zero lane of a zero vector.
  We can't just use the same pattern here as we do in every other vector type because the general insertion logic can handle insertion into the non-zero lane of the vector. However, in SSE4.1 with v4f32 vectors we have INSERTPS that is a much better choice than the generic one for such lowerings. But INSERTPS can do lots of other lowerings as well so factoring its logic into the general insertion logic doesn't work very well. We also can't just extract the core common part of the general insertion logic that is faster (forming VZEXT_MOVL synthetic nodes that lower to MOVSS when they can) because VZEXT_MOVL is often *faster* than a blend while INSERTPS is slower!
  So instead we do a restrictive condition on attempting to use the generic insertion logic to narrow it to those cases where VZEXT_MOVL won't need a shuffle afterward and thus will do better than INSERTPS. Then we try blending. Then we go back to INSERTPS.
  This still doesn't generate perfect code for some silly reasons that can be fixed by tweaking the td files for lowering VZEXT_MOVL to use XORPS+BLENDPS when available rather than XORPS+MOVSS when the input ends up in a register rather than a load from memory -- BLENDPSrr has twice the reciprocal throughput of MOVSSrr. Don't you love this ISA? llvm-svn: 218177
  (Chandler Carruth, 2014-09-20; 1 file changed, -0/+10)
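As an illustration of the insertion pattern described above (a sketch with invented names, written with Clang's vector extensions, not code from the patch):

    /* Hypothetical example: element 0 of 'a' placed into lane 0 of an otherwise  */
    /* zero vector. On SSE4.1 targets this kind of pattern can be lowered with    */
    /* INSERTPS; without SSE4.1 the usual lowering is along the lines of          */
    /* XORPS + MOVSS.                                                             */
    typedef float v4sf __attribute__((vector_size(16)));

    v4sf insert_into_zero_lane(v4sf a) {
        v4sf zero = {0.0f, 0.0f, 0.0f, 0.0f};
        /* Lane 0 comes from 'a' (index 0), lanes 1-3 from 'zero' (indices 4-6). */
        return __builtin_shufflevector(a, zero, 0, 4, 5, 6);
    }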
* [x86] Refactor the code for emitting INSERTPS to reuse the zeroable mask analysis used elsewhere. This removes the last duplicate of this logic. Also simplify the code here quite a bit. No functionality changed. llvm-svn: 218176
  (Chandler Carruth, 2014-09-20; 1 file changed, -25/+15)
* [x86] Generalize the single-element insertion lowering to work with floating point types and use it for both v2f64 and v2i64 single-element insertion lowering. This fixes the last non-AVX performance regression test case I've gotten for the new vector shuffle lowering. There is obvious analogous lowering for v4f32 that I'll add in a follow-up patch (because with INSERTPS, v4f32 requires special treatment). After that, it's AVX stuff. llvm-svn: 218175
  (Chandler Carruth, 2014-09-20; 1 file changed, -13/+45)
* [x86] Replace some duplicated logic reasoning about whether particular vector lanes can be modeled as zero with a call to the new function that computes a bit-vector representing that information. No functionality changed here, but will allow doing more clever things with the zero-test. llvm-svn: 218174
  (Chandler Carruth, 2014-09-20; 1 file changed, -13/+6)
* [X86] Erase some obsolete comments from README.txt. I just tried reproducing some of the optimization failures in README.txt in the X86 backend, and many of them could not be reproduced. In general the entire file appears quite bit-rotted; whatever interesting parts remain should be moved to bugzilla, and the rest deleted. I did not spend the time to do that, so I just deleted the few I tried reproducing which are obsolete, to save some time for whoever will find the courage to do it. llvm-svn: 218170
  (Robin Morisset, 2014-09-19; 1 file changed, -177/+0)
* [x86] Hoist a function up to the rest of the non-type-specific lowering helpers, and re-flow the logic to use early exit and be a bit more readable. No functionality changed. llvm-svn: 218155
  (Chandler Carruth, 2014-09-19; 1 file changed, -75/+74)
* [x86] Hoist the actual lowering logic into a helper function to separate it from the shuffle pattern matching logic. Also cleaned up variable names, comments, etc. No functionality changed. llvm-svn: 218152
  (Chandler Carruth, 2014-09-19; 1 file changed, -74/+89)
* [x86] Fully generalize the zext lowering in the new vector shuffle lowering to support both anyext and zext and to custom lower for many different microarchitectures. Using this allows us to get *exactly* the right code for zext and anyext shuffles in all the vector sizes. For v16i8, the improvement is *huge*. The new SSE2 test case added here is one I refused to add before this because it was sooooo many instructions. llvm-svn: 218143
  (Chandler Carruth, 2014-09-19; 1 file changed, -33/+91)
* [x86] Recognize that we can use duplication to widen v16i8 shuffles due to undef lanes as well as defined widenable lanes. This dramatically improves the lowering we use for undef-shuffles in a zext-ish pattern for SSE2. llvm-svn: 218115
  (Chandler Carruth, 2014-09-19; 1 file changed, -3/+3)
* [x86] Teach the new vector shuffle lowering to also use pmovzx for v4i32 shuffles that are zext-ing. Not a lot to see here; the undef lane variant is better handled with pshufd, but this improves the actual zext pattern. llvm-svn: 218112
  (Chandler Carruth, 2014-09-19; 1 file changed, -1/+7)
* [x86] Add a dedicated lowering path for zext-compatible vector shuffles to the new vector shuffle lowering code. This allows us to emit PMOVZX variants consistently for patterns where it is a viable lowering. This instruction is both fast and allows us to fold loads into it. This only hooks the new lowering up for i16 and i8 element widths, mostly so I could manage the change to the tests. I'll add the i32 one next, although it is significantly less interesting.
  One thing to note is that we already had some tests for these patterns but those tests had far less horrible instructions. The problem is that those tests weren't checking the strict start and end of the instruction sequence. =[ As a consequence something changed in the lowering making us generate *TERRIBLE* code for these patterns in SSE2 through SSSE3. I've consolidated all of the tests and spelled out the madness that we currently emit for these shuffles. I'm going to try to figure out what has gone wrong here. llvm-svn: 218102
  (Chandler Carruth, 2014-09-19; 1 file changed, -0/+134)
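For reference, a zext-compatible shuffle of the kind this lowering path targets can be sketched with Clang's vector extensions (an illustration only; the function name and element width are chosen for the example, and whether PMOVZXBW or a punpck-with-zero sequence is emitted depends on the subtarget):

    /* Hypothetical example: interleave the low 8 bytes of 'a' with zero bytes, */
    /* i.e. a zero extension of eight i8 lanes to i16 written as a shuffle.     */
    typedef unsigned char v16qi __attribute__((vector_size(16)));

    v16qi zext_low_bytes(v16qi a) {
        v16qi zero = {0};
        /* Even result lanes take bytes 0-7 of 'a'; odd result lanes take zeros. */
        return __builtin_shufflevector(a, zero,
                                       0, 16, 1, 17, 2, 18, 3, 19,
                                       4, 20, 5, 21, 6, 22, 7, 23);
    }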
* Reverting NFC changes from r218050. Instead, the warning was disabled for GCC in r218059, so these changes are no longer required. llvm-svn: 218062
  (Aaron Ballman, 2014-09-18; 1 file changed, -2/+0)
* [SKX] Deriving rmb multiclasses from the general ones (avx512_icmp_packed_rmb and avx512_icmp_cc_rmb). Thanks to Adam Nemet for noticing this. llvm-svn: 218051
  (Robert Khasanov, 2014-09-18; 1 file changed, -26/+12)
* Fixing a bunch of -Woverloaded-virtual warnings due to hiding getSubtargetImpl from the base class. NFC. llvm-svn: 218050
  (Aaron Ballman, 2014-09-18; 1 file changed, -0/+2)
* [x86] Use PALIGNR for v4i32 and v2i64 blends when appropriate. There is no purpose in using it for single-input shuffles as pshufd is just as fast and doesn't tie the two operands. This removes a substantial amount of wrong-domain blend operations in SSSE3 mode. It also completes the usage of PALIGNR for integer shuffles and addresses one of the test cases Quentin hit with the new vector shuffle lowering.
  There is still the question of whether and when to use this for floating point shuffles. It is faster than shufps or shufpd but in the integer domain. I don't yet really have a good heuristic here for when to use this instruction for floating point vectors. llvm-svn: 218038
  (Chandler Carruth, 2014-09-18; 1 file changed, -0/+12)
* [x86] Initial step of teaching the new vector shuffle lowering about PALIGNR. This just adds it to the v8i16 and v16i8 lowering steps where it is completely unmatched. It also introduces the logic for detecting rotation shuffle masks even in the presence of single input or blend masks and arbitrarily undef lanes. I've added fairly comprehensive tests for the matching logic in v8i16 because the tests at that size are much easier to write and manage.
  I've not checked the SSE2 code generated for these tests because the code is *horrible*. It is absolute madness. Testing it will just make the test brittle without giving any interesting improvements in the correctness confidence. llvm-svn: 218013
  (Chandler Carruth, 2014-09-18; 1 file changed, -0/+115)
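A rotation shuffle mask of the kind this detection logic looks for can be sketched as follows (an illustration with an invented function name and an arbitrary rotation amount; the exact operand order and immediate the backend chooses may differ):

    /* Hypothetical example: bytes 4-15 of 'b' followed by bytes 0-3 of 'a', i.e. */
    /* a rotation of the concatenated inputs, which PALIGNR can express with a    */
    /* single instruction and an immediate byte count.                            */
    typedef unsigned char v16qi __attribute__((vector_size(16)));

    v16qi rotate_by_four_bytes(v16qi a, v16qi b) {
        return __builtin_shufflevector(b, a,
                                       4, 5, 6, 7, 8, 9, 10, 11,
                                       12, 13, 14, 15, 16, 17, 18, 19);
    }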
* Add and update reset() and doInitialization() methods to MC* and passes. This enables reusing a PassManager instead of re-constructing it every time. llvm-svn: 217948
  (Yaron Keren, 2014-09-17; 1 file changed, -0/+6)
* [x32] Fix function indirect calls
  Summary: Zero-extend register to 64-bit for callq/jmpq.
  Test Plan: 3 tests added
  Reviewers: nadav, dschuff
  Subscribers: llvm-commits, zinovy.nis
  Differential Revision: http://reviews.llvm.org/D5355
  llvm-svn: 217942
  (Pavel Chupin, 2014-09-17; 1 file changed, -0/+3)
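To illustrate the situation being fixed (a sketch, not the test from the patch; the function name is invented): under the x32 ABI pointers are 32 bits wide, but the indirect callq/jmpq forms take a 64-bit register, so the pointer value has to be zero-extended to 64 bits before the call.

    /* Hypothetical example of an indirect call through a function pointer.     */
    /* On x32 the 32-bit pointer register must be zero-extended (for example    */
    /* with a 32-bit mov of the register to itself) before an indirect call     */
    /* such as 'callq *%rax'.                                                    */
    int call_through(int (*fn)(int), int x) {
        return fn(x);
    }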
* [X86] Use the generic AtomicExpandPass instead of X86AtomicExpandPass. This required a new hook called hasLoadLinkedStoreConditional to know whether to expand atomics to LL/SC (ARM, AArch64, in a future patch Power) or to CmpXchg (X86). Apart from that, the new code in AtomicExpandPass is mostly moved from X86AtomicExpandPass. The main result of this patch is to get rid of that pass, which had lots of code duplicated with AtomicExpandPass. llvm-svn: 217928
  (Robin Morisset, 2014-09-17; 6 files changed, -290/+70)
* [X86] Improve comment. llvm-svn: 217885
  (Adam Nemet, 2014-09-16; 1 file changed, -3/+4)
* AVX-512: added costs for some AVX-512 instructions. llvm-svn: 217863
  (Elena Demikhovsky, 2014-09-16; 1 file changed, -0/+62)
* [x86] Remove a FIXME that doesn't make any sense. Only the lanes feeding the blend that is matched by this are "used" in any sense, and so any build_vector or other nodes feeding these will already drop other lanes. llvm-svn: 217855
  (Chandler Carruth, 2014-09-16; 1 file changed, -3/+0)
* [x86] Cleanup an unused variable by actually using it in the non-asserts place where it was needed. llvm-svn: 217854
  (Chandler Carruth, 2014-09-16; 1 file changed, -1/+1)
* [x86] Remove the last vestiges of the BLENDI-based ADDSUB pattern matching. This design just fundamentally didn't work because ADDSUB is available prior to any legal lowerings of BLENDI nodes. Instead, we have a dedicated ADDSUB synthetic ISD node which is pattern matched trivially into the instructions. These nodes are then recognized by both the existing and a trivial new lowering combine in the backend. Removing these patterns required adding 2 missing shuffle masks to the DAG combine, without which tests would have failed. Added the masks and a helpful assert as well to catch if anything ever goes wrong here. llvm-svn: 217851
  (Chandler Carruth, 2014-09-16; 2 files changed, -50/+10)
* [x86] As a follow-up to r217819, don't check for VSELECT legality now that we don't use VSELECT and directly emit an addsub synthetic node. Also remove a stale comment referencing VSELECT. The test case is updated to use 'core2' which only has SSE3, not SSE4.1, and it still passes. Previously it would not because we lacked sufficient blend support to legalize the VSELECT. llvm-svn: 217849
  (Chandler Carruth, 2014-09-16; 1 file changed, -7/+1)
* [x86] Add the beginnings of a proper DAG combine to match ADDSUBPS and ADDSUBPD nodes out of blends of adds and subs. This allows us to actually form these instructions with SSE3 rather than only forming them when we had both SSE3 for the ADDSUB instructions and SSE4.1 for the blend instructions. ;] Kind-of important. I've adjusted the CPU requirements on one of the tests to demonstrate this kicking in nicely for an SSE3 cpu configuration. llvm-svn: 217848
  (Chandler Carruth, 2014-09-16; 1 file changed, -0/+55)
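The blend-of-add-and-sub pattern this combine matches can be sketched in C (illustration only; the function name is invented, and the combine itself operates on DAG nodes rather than on source code):

    /* Hypothetical example: even lanes take a - b, odd lanes take a + b.    */
    /* This is exactly what ADDSUBPD computes, so the add, the sub and the   */
    /* blend can collapse into a single instruction on SSE3 targets.         */
    typedef double v2df __attribute__((vector_size(16)));

    v2df addsub(v2df a, v2df b) {
        v2df sum  = a + b;
        v2df diff = a - b;
        /* Lane 0 from the subtraction, lane 1 from the addition. */
        return __builtin_shufflevector(diff, sum, 0, 3);
    }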
* [FastISel] Move optimizeCmpPredicate to FastISel base class. NFC. Make the optimizeCmpPredicate function available to all targets. llvm-svn: 217822
  (Juergen Ributzka, 2014-09-15; 1 file changed, -40/+0)
* [x86] Start fixing our emission of ADDSUBPS and ADDSUBPD instructions by introducing a synthetic X86 ISD node representing this generic operation. The relevant patterns for mapping these nodes into the concrete instructions are also added, and a gnarly bit of C++ code in the target-specific DAG combiner is replaced with simple code emitting this primitive. The next step is to generically combine blends of adds and subs into this node so that we can drop the reliance on an SSE4.1 ISD node (BLENDI) when matching an SSE3 feature (ADDSUB). llvm-svn: 217819
  (Chandler Carruth, 2014-09-15; 4 files changed, -26/+37)
* [X86] Fix a bug in X86's peephole optimization. Peephole optimization was folding MOVSDrm, which is a zero-extending double precision floating point load, into ADDPDrr, which is a SIMD add of two packed double precision floating point values.
  (before)
    %vreg21<def> = MOVSDrm <fi#0>, 1, %noreg, 0, %noreg; mem:LD8[%7](align=16)(tbaa=<badref>) VR128:%vreg21
    %vreg23<def,tied1> = ADDPDrr %vreg20<tied0>, %vreg21; VR128:%vreg23,%vreg20,%vreg21
  (after)
    %vreg23<def,tied1> = ADDPDrm %vreg20<tied0>, <fi#0>, 1, %noreg, 0, %noreg; mem:LD8[%7](align=16)(tbaa=<badref>) VR128:%vreg23,%vreg20
  X86InstrInfo::foldMemoryOperandImpl already had the logic that prevented this from happening. However the check wasn't being conducted for loads from stack objects. This commit factors out the logic into a new function and uses it to check that loads from stack slots are not zero-extending loads. rdar://problem/18236850 llvm-svn: 217799
  (Akira Hatanaka, 2014-09-15; 1 file changed, -14/+24)
* [x86] Begin emitting PBLENDW instructions for integer blend operations when SSE4.1 is available. This removes a ton of domain crossing from blend code paths that were ending up in the floating point code path. This is just the tip of the iceberg though. The real switch is for integer blend lowering to more actively rely on this instruction being available so we don't hit shufps at all any longer. =] That will come in a follow-up patch. Another place where we need better support is for using PBLENDVB when doing so avoids the need to have two complementary PSHUFB masks. llvm-svn: 217767
  (Chandler Carruth, 2014-09-15; 1 file changed, -2/+36)
* [x86] Teach the x86 DAG combiner to form UNPCKLPS and UNPCKHPS instructions from the relevant shuffle patterns. This is the last tweak I'm aware of to generate essentially perfect v4f32 and v2f64 shuffles with the new vector shuffle lowering up through SSE4.1. I'm sure I've missed some and it'd be nice to check since v4f32 is amenable to exhaustive exploration, but this is all of the tricks I'm aware of. With AVX there is a new trick to use the VPERMILPS instruction, that's coming up in a subsequent patch. llvm-svn: 217761
  (Chandler Carruth, 2014-09-15; 1 file changed, -0/+14)
* [x86] Teach the x86 DAG combiner to form MOVSLDUP and MOVSHDUP instructions when it finds an appropriate pattern. These are lovely instructions, and it's a shame to not use them. =] They are fast, and can have loads folded into their operands, etc. I've also plumbed the shuffle comment decoding through the various layers so that the test cases are printed nicely. llvm-svn: 217758
  (Chandler Carruth, 2014-09-15; 4 files changed, -30/+105)
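The patterns in question are duplicate-even-lanes and duplicate-odd-lanes shuffles, sketched here with invented names (illustration only):

    typedef float v4sf __attribute__((vector_size(16)));

    /* MOVSLDUP duplicates the even-indexed lanes: {a0, a0, a2, a2}. */
    v4sf dup_even(v4sf a) {
        return __builtin_shufflevector(a, a, 0, 0, 2, 2);
    }

    /* MOVSHDUP duplicates the odd-indexed lanes: {a1, a1, a3, a3}. */
    v4sf dup_odd(v4sf a) {
        return __builtin_shufflevector(a, a, 1, 1, 3, 3);
    }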
* [x86] Undo a flawed transform I added to form UNPCK instructions when AVX is available, and generally tidy up things surrounding UNPCK formation.
  Originally, I was thinking that the only advantage of PSHUFD over UNPCK instruction variants was its free copy, and otherwise we should use the shorter encoding UNPCK instructions. This isn't right though, there is a larger advantage of being able to fold a load into the operand of a PSHUFD. For UNPCK, the operand *must* be in a register so it can be the second input.
  This removes the UNPCK formation in the target-specific DAG combine for v4i32 shuffles. It also lifts the v8 and v16 cases out of the AVX-specific check as they are potentially replacing multiple instructions with a single instruction and so should always be valuable. The floating point checks are simplified accordingly.
  This also adjusts the formation of PSHUFD instructions to attempt to match the shuffle mask to one which would fit an UNPCK instruction variant. This was originally motivated to allow it to match the UNPCK instructions in the combiner, but clearly won't now. Eventually, we should add a MachineCombiner pass that can form UNPCK instructions post-RA when the operand is known to be in a register and thus there is no loss. llvm-svn: 217755
  (Chandler Carruth, 2014-09-15; 1 file changed, -79/+79)
* [x86] Teach the new vector shuffle lowering to use 'punpcklwd' and 'punpckhwd' instructions when suitable rather than falling back to the generic algorithm. While we could canonicalize to these patterns late in the process, that wouldn't help when the freedom to use them is only visible during initial lowering when undef lanes are well understood. This, it turns out, is very important for matching the shuffle patterns that are used to lower sign extension. Fixes a small but relevant regression in gcc-loops with the new lowering.
  When I changed this I noticed that several 'pshufd' lowerings became unpck variants. This is bad because it removes the ability to freely copy in the same instruction. I've adjusted the widening test to handle undef lanes correctly and now those will correctly continue to use 'pshufd' to lower. However, this caused a bunch of churn in the test cases. No functional change, just churn.
  Both of these changes are part of addressing a general weakness in the new lowering -- it doesn't sufficiently leverage undef lanes. I've at least a couple of patches that will help there at least in an academic sense. llvm-svn: 217752
  (Chandler Carruth, 2014-09-15; 1 file changed, -2/+12)
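For reference, the interleaving shuffle that punpcklwd implements looks like this (a sketch with an invented name):

    /* Hypothetical example: interleave the low four 16-bit words of 'a' and 'b', */
    /* giving {a0, b0, a1, b1, a2, b2, a3, b3}, which is what PUNPCKLWD computes. */
    typedef short v8hi __attribute__((vector_size(16)));

    v8hi interleave_low_words(v8hi a, v8hi b) {
        return __builtin_shufflevector(a, b, 0, 8, 1, 9, 2, 10, 3, 11);
    }

When both inputs are the same vector, following the interleave with an arithmetic right shift of each 32-bit lane by 16 yields a sign extension of the low four words, which is the kind of pattern the sign-extension lowering mentioned above relies on.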
* [x86] Teach the new vector shuffle lowering to use BLENDPS and BLENDPD. These are super simple. They even take precedence over crazy instructions like INSERTPS because they have very high throughput on modern x86 chips. I still have to teach the integer shuffle variants about this to avoid so many domain crossings. However, due to the particular instructions available, that's a touch more complex and so a separate patch. Also, the backend doesn't seem to realize it can commute blend instructions by negating the mask. That would help remove a number of copies here. Suggestions on how to do this welcome, it's an area I'm less familiar with. llvm-svn: 217744
  (Chandler Carruth, 2014-09-14; 1 file changed, -0/+35)
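A blend, in this sense, is a shuffle in which every output lane keeps its position and only the source vector varies per lane; a sketch with an invented name and an arbitrary mask:

    /* Hypothetical example: lanes 0 and 2 from 'a', lanes 1 and 3 from 'b'.  */
    /* On SSE4.1 targets this kind of shuffle can be a single BLENDPS with an */
    /* immediate lane mask.                                                   */
    typedef float v4sf __attribute__((vector_size(16)));

    v4sf blend_alternating(v4sf a, v4sf b) {
        return __builtin_shufflevector(a, b, 0, 5, 2, 7);
    }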
* [x86] Teach the vector combiner that picks a canonical shuffle form to support transforming the forms from the new vector shuffle lowering to use 'movddup' when appropriate.
  A bunch of the cases where we actually form 'movddup' don't actually show up in the test results because something even later than DAG legalization maps them back to 'unpcklpd'. If this shows back up as a performance problem, I'll probably chase it down, but it is at least an encoded size loss. =/
  To make this work, also always do this canonicalizing step for floating point vectors where the baseline shuffle instructions don't provide any free copies of their inputs. This also causes us to canonicalize unpck[hl]pd into mov{hl,lh}ps (resp.) which is a nice encoding space win.
  There is one test which is "regressed" by this: extractelement-load. There, in the test case where the optimization it is testing *fails*, the exact instruction pattern which results is slightly different. This should probably be fixed by having the appropriate extract formed earlier in the DAG, but that would defeat the purpose of the test.... If this test case is critically important for anyone, please let me know and I'll try to work on it. The prior behavior was actually contrary to the comment in the test case and seems likely to have been an accident. llvm-svn: 217738
  (Chandler Carruth, 2014-09-14; 1 file changed, -9/+35)
* The MCAssembler.h include isn't used. llvm-svn: 217705
  (Yaron Keren, 2014-09-12; 1 file changed, -1/+0)
* [AVX512] Fix miscompile for unpack.
  r189189 implemented AVX512 unpack by essentially performing a 256-bit unpack between the low and the high 256 bits of src1 into the low part of the destination and another unpack of the low and high 256 bits of src2 into the high part of the destination. I don't think that's how unpack works. AVX512 unpack simply has more 128-bit lanes, but other than that it works the same way as AVX. So in each 128-bit lane, we're always interleaving certain parts of both operands rather than different parts of one of the operands.
  E.g. for this:
    __v16sf a = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
    __v16sf b = { 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 };
    __v16sf c = __builtin_shufflevector(a, b, 0, 8, 1, 9, 4, 12, 5, 13, 16, 24, 17, 25, 20, 28, 21, 29);
  we generated punpcklps (notice how the elements of a and b are not interleaved in the shuffle). In turn, c was set to this:
    0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29
  Obviously this should have just returned the mask vector of the shufflevector (with a and b chosen this way, each result element should equal its mask index). I mostly reverted this change and made sure the original AVX code worked for 512-bit vectors as well. Also updated the tests because they matched the logic from the code. llvm-svn: 217602
  (Adam Nemet, 2014-09-11; 1 file changed, -56/+37)
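For clarity, with a and b initialized as in the example above, every element of the concatenation of a and b equals its own index, so the correct c is simply the mask values themselves, interleaving a and b within each 128-bit lane. A small self-contained check of that expected result (illustration only; the type and function name are invented):

    typedef float v16sf __attribute__((vector_size(64)));

    /* Returns 1 if 'c' holds the expected interleaved result for the example above. */
    int unpack_example_is_correct(v16sf c) {
        const float expected[16] = { 0, 8, 1, 9, 4, 12, 5, 13,
                                     16, 24, 17, 25, 20, 28, 21, 29 };
        for (int i = 0; i < 16; ++i)
            if (c[i] != expected[i])
                return 0;
        return 1;
    }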
* Move constant-sized bitvector to the stack. llvm-svn: 217600
  (Benjamin Kramer, 2014-09-11; 1 file changed, -2/+2)
* Rename getMaximumUnrollFactor -> getMaxInterleaveFactor; also rename option names controlling this variable. "Unroll" is not the appropriate name for this variable. Clang already uses the term "interleave" in pragmas and metadata for this. Differential Revision: http://reviews.llvm.org/D5066 llvm-svn: 217528
  (Sanjay Patel, 2014-09-10; 1 file changed, -2/+2)
* [asan-assembly-instrumentation] Added CFI directives to the generated instrumentation code.
  Summary: [asan-assembly-instrumentation] Added CFI directives to the generated instrumentation code.
  Reviewers: eugenis
  Subscribers: llvm-commits
  Differential Revision: http://reviews.llvm.org/D5189
  llvm-svn: 217482
  (Yuri Gorshenin, 2014-09-10; 3 files changed, -1/+62)
* Add a scheduling model for AMD 16H Jaguar (btver2). This is a first pass at a scheduling model for Jaguar. It's structured largely on the existing SandyBridge and SLM sched models. Using this model, in addition to turning on the PostRA scheduler, results in some perf wins on internal and 3rd party benchmarks. There's not much difference in LLVM's test-suite benchmarking subset of tests. Differential Revision: http://reviews.llvm.org/D5229 llvm-svn: 217457
  (Sanjay Patel, 2014-09-09; 3 files changed, -4/+350)