path: root/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
...
* getRegForInlineAsmConstraint wants to use TargetRegisterInfo for a lookup, pass that in rather than use a naked call to getSubtargetImpl. (Eric Christopher, 2015-02-26, 1 file, -7/+6)
  This involved passing down and around either a TargetMachine or TargetRegisterInfo. Update all callers/definitions around the targets and SelectionDAG.
  llvm-svn: 230699
* Remove an argument-less call to getSubtargetImpl from TargetLoweringBase. (Eric Christopher, 2015-02-26, 1 file, -1/+1)
  This required plumbing a TargetRegisterInfo through computeRegisterProperties and into findRepresentativeClass which uses it for register class iteration. This required passing a subtarget into a few target specific initializations of TargetLowering.
  llvm-svn: 230583
* [PowerPC] Make LDtocL and friends invariant loads (Hal Finkel, 2015-02-25, 1 file, -16/+20)
  LDtocL, and other loads that roughly correspond to the TOC_ENTRY SDAG node, represent loads from the TOC, which is invariant. As a result, these loads can be hoisted out of loops, etc.
  In order to do this, we need to generate GOT-style MMOs for TOC_ENTRY, which requires treating it as a legitimate memory intrinsic node type. Once this is done, the MMO transfer is automatically handled for TableGen-driven instruction selection, and for nodes generated directly in PPCISelDAGToDAG, we need to transfer the MMOs manually.
  Also, we were not transferring MMOs associated with pre-increment loads, so do that too.
  Lastly, this fixes an exposed bug where R30 was not added as a defined operand of UpdateGBR. This problem was highlighted by an example (used to generate the test case) posted to llvmdev by Francois Pichet.
  llvm-svn: 230553
* [PowerPC] Cleanup unused target-specific SDAG nodes (Hal Finkel, 2015-02-25, 1 file, -3/+0)
  We had somehow accumulated a few target-specific SDAG nodes dealing with PPC64 TOC access that were referenced only in TableGen patterns. The associated (pseudo-)instructions are used, but are being generated directly. NFC.
  llvm-svn: 230518
* Silencing a "result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?)" warning in MSVC; NFC. (Aaron Ballman, 2015-02-25, 1 file, -1/+1)
  llvm-svn: 230489
* Silencing a -Wsign-compare warning triggered in MSVC; NFC. (Aaron Ballman, 2015-02-25, 1 file, -1/+1)
  llvm-svn: 230488
* [PowerPC] Add support for the QPX vector instruction set (Hal Finkel, 2015-02-25, 1 file, -48/+1097)
  This adds support for the QPX vector instruction set, which is used by the enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bits wide, holding 4 double-precision floating-point values. Boolean values, modeled here as <4 x i1>, are actually also represented as floating-point values (essentially { -1, 1 } for { false, true }). QPX shares many features with Altivec and VSX, but is distinct from both of them. One major difference is that, instead of adding completely-separate vector registers, QPX vector registers are extensions of the scalar floating-point registers (lane 0 is the corresponding scalar floating-point value). The operations supported on QPX vectors mirror those supported on the scalar floating-point values (with some additional ones for permutations and logical/comparison operations).
  I've been maintaining this support out-of-tree, as part of the bgclang project, for several years. This is not the entire bgclang patch set, but is most of the subset that can be cleanly integrated into LLVM proper at this time. Adding this to the LLVM backend is part of my efforts to rebase bgclang to the current LLVM trunk, but is independently useful (especially for codes that use LLVM as a JIT in library form).
  The assembler/disassembler test coverage is complete. The CodeGen test coverage is not, but I've included some tests, and more will be added as follow-up work.
  llvm-svn: 230413
* CodeGen: convert CCState interface to using ArrayRefs (Tim Northover, 2015-02-21, 1 file, -6/+4)
  Everyone except R600 was manually passing the length of a static array at each callsite, calculated in a variety of interesting ways. Far easier to let ArrayRef handle that.
  There should be no functional change, but out of tree targets may have to tweak their calls as with these examples.
  llvm-svn: 230118
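  The examples referred to above are not reproduced in this log. As a rough sketch only (assuming the pre-change CCState helpers such as getFirstUnallocated took a pointer plus an explicit length, and with State standing in for a hypothetical CCState variable), a typical call-site tweak looks like this:

    // Register lists that used to be passed as pointer + length are now passed
    // as an ArrayRef, which carries its own size.
    static const MCPhysReg GPRArgRegs[] = {PPC::X3, PPC::X4, PPC::X5, PPC::X6};

    // Before: the caller computed the array length itself.
    //   unsigned Idx = State.getFirstUnallocated(GPRArgRegs, array_lengthof(GPRArgRegs));
    // After: an ArrayRef is formed implicitly from the static array.
    unsigned Idx = State.getFirstUnallocated(GPRArgRegs);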
* AArch64: Safely handle the incoming sret call argument. (Andrew Trick, 2015-02-16, 1 file, -6/+12)
  This adds a safe interface to the machine independent InputArg struct for accessing the index of the original (IR-level) argument. When a non-native return type is lowered, we generate the hidden machine-level sret argument on-the-fly. Before this fix, we were representing this argument as OrigArgIndex == 0, which is an outright lie. In particular this crashed in the AArch64 backend where we actually try to access the type of the original argument.
  Now we use a sentinel value for machine arguments that have no original argument index. AArch64, ARM, Mips, and PPC now check for this case before accessing the original argument.
  Fixes <rdar://19792160> Null pointer assertion in AArch64TargetLowering
  llvm-svn: 229413
* PowerPC: Canonicalize access to function attributes, NFC (Duncan P. N. Exon Smith, 2015-02-14, 1 file, -4/+2)
  Canonicalize access to function attributes to use the simpler API.
  getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
    => getFnAttribute(Kind)
  getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
    => hasFnAttribute(Kind)
  llvm-svn: 229224
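  For illustration only (the attribute kind below is an arbitrary choice, not necessarily one queried by this file), a call site before and after this kind of cleanup:

    // Given the current function F (a const llvm::Function pointer):
    // Before: index into the attribute set explicitly.
    //   bool NoFloat = F->getAttributes().hasAttribute(AttributeSet::FunctionIndex,
    //                                                  Attribute::NoImplicitFloat);
    // After: use the convenience accessor on llvm::Function.
    bool NoFloat = F->hasFnAttribute(Attribute::NoImplicitFloat);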
* Stash the TargetMachine on the subtarget so we can access it later. (Eric Christopher, 2015-02-13, 1 file, -1/+1)
  Clean up a subtarget function that has it passed in while we're at it.
  llvm-svn: 229164
* PPC LinkageSize can be computed at initialization time, do so. (Eric Christopher, 2015-02-13, 1 file, -12/+6)
  llvm-svn: 229163
* PPCFrameLowering's FramePointerOffset can be computed at initialization time. Do so. (Eric Christopher, 2015-02-13, 1 file, -10/+5)
  llvm-svn: 228998
* The TOC save offset can be computed at compile time, do so and propagate changes. (Eric Christopher, 2015-02-13, 1 file, -3/+2)
  llvm-svn: 228997
* The return save offset can be computed at initialization time - do so and save the value. (Eric Christopher, 2015-02-13, 1 file, -8/+7)
  llvm-svn: 228996
* [PowerPC] Mark jumps as expensive (when using CR bits) (Hal Finkel, 2015-02-12, 1 file, -1/+3)
  On PowerPC, which has a full set of logical operations on (its multiple sets of) condition-register bits, it is not profitable to break complex conditions feeding a jump into multiple jumps. We can turn off this feature of CGP/SDAGBuilder by marking jumps as "expensive".
  P7 test-suite speedups (no regressions):
  MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2: -0.626647% +/- 0.323583%
  MultiSource/Benchmarks/Olden/power/power: -18.2821% +/- 8.06481%
  llvm-svn: 228895
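  Marking jumps as expensive is a one-line TargetLowering setting; a minimal sketch of how a target opts in (placed in the target's TargetLowering constructor, surrounding context omitted):

    // Tell CodeGenPrepare and SelectionDAGBuilder not to split a complex branch
    // condition into multiple conditional branches; CR-bit logic ops can evaluate
    // the whole condition and feed a single jump instead.
    setJumpIsExpensive();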
* [PowerPC] Fix reverted patch r227976 to avoid register assignment issues (Bill Schmidt, 2015-02-10, 1 file, -59/+13)
  See full discussion in http://reviews.llvm.org/D7491.
  We now hide the add-immediate and call instructions together in a separate pseudo-op, which is tagged to define GPR3 and clobber the call-killed registers. The PPCTLSDynamicCall pass prior to RA now expands this op into the two separate addi and call ops, with explicit definitions of GPR3 on both instructions, and explicit clobbers on the call instruction. The pass is now marked as requiring and preserving the LiveIntervals and SlotIndexes analyses, and fixes these up after the replacement sequences are introduced.
  Self-hosting has been verified on LE P8 and BE P7 with various optimization levels, etc. It has also been verified with the --no-tls-optimize flag workaround removed.
  llvm-svn: 228725
* Revert "r227976 - [PowerPC] Yet another approach to __tls_get_addr" and related fixups (Hal Finkel, 2015-02-06, 1 file, -12/+57)
  Unfortunately, even with the workaround of disabling the linker TLS optimizations in Clang restored (which has already been done), this still breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).
  Bill is currently working on an alternate implementation to address the TLS issue in a way that also fully elides the linker bug (which, unfortunately, this approach did not fully), so I'm reverting this now.
  llvm-svn: 228460
* [PowerPC] Generate pre-increment floating-point ld/st instructions (Hal Finkel, 2015-02-05, 1 file, -0/+4)
  PowerPC supports pre-increment floating-point load/store instructions, both r+r and r+i, and we had patterns for them, but they were not marked as legal. Mark them as legal (and add a test case).
  llvm-svn: 228327
* [PowerPC] Implement the vclz instructions for PWR8 (Bill Schmidt, 2015-02-05, 1 file, -4/+7)
  Patch by Kit Barton.
  Add the vector count leading zeros instruction for byte, halfword, word, and doubleword sizes. This is a fairly straightforward addition after the changes made for vpopcnt:
  1. Add the correct definitions for the various instructions in PPCInstrAltivec.td
  2. Make the CTLZ operation legal on vector types when using P8Altivec in PPCISelLowering.cpp (sketched below)
  Test Plan:
  Created new test case in test/CodeGen/PowerPC/vec_clz.ll to check the instructions are being generated when the CTLZ operation is used in LLVM.
  Check the encoding and decoding in test/MC/PowerPC/ppc_encoding_vmx.s and test/Disassembler/PowerPC/ppc_encoding_vmx.txt respectively.
  llvm-svn: 228301
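  A sketch of what step 2 amounts to (the exact guard and the set of vector types are assumptions based on the description above, not a copy of the patch):

    // With P8 Altivec, vclzb/vclzh/vclzw/vclzd handle these natively, so mark
    // CTLZ as Legal for the vector types rather than letting it expand.
    if (Subtarget.hasP8Altivec()) {
      setOperationAction(ISD::CTLZ, MVT::v16i8, Legal);
      setOperationAction(ISD::CTLZ, MVT::v8i16, Legal);
      setOperationAction(ISD::CTLZ, MVT::v4i32, Legal);
      setOperationAction(ISD::CTLZ, MVT::v2i64, Legal);
    }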
* [PowerPC] Implement the vpopcnt instructions for POWER8 (Bill Schmidt, 2015-02-03, 1 file, -1/+9)
  Patch by Kit Barton.
  Add the vector population count instructions for byte, halfword, word, and doubleword sizes. There are two major changes here:
  PPCISelLowering.cpp: Make CTPOP legal for vector types.
  PPCRegisterInfo.td: Added v2i64 to the VRRC register definition. This is needed for the doubleword variations of the integer ops that were added in P8.
  Test Plan:
  Test the instruction vpcnt* encoding/decoding in ppc64-encoding-vmx.s
  Test the generation of the vpopcnt instructions for various vector data types. When adding the v2i64 type to the Vector Register set, I also needed to add the appropriate bit conversion patterns between v2i64 and the existing vector types. Testing for these conversions was also added in the test case by passing a different vector type as a parameter into the test functions.
  There is also a run step that will ensure the vpopcnt instructions are generated when the vsx feature is disabled.
  llvm-svn: 228046
* [PowerPC] Yet another approach to __tls_get_addr (Bill Schmidt, 2015-02-03, 1 file, -57/+12)
  This patch is a third attempt to properly handle the local-dynamic and global-dynamic TLS models.
  In my original implementation, calls to __tls_get_addr were hidden from view until the asm-printer phase, at which point the underlying branch-and-link instruction was created with proper relocations. This mostly worked well, but I used some repellent techniques to ensure that the TLS_GET_ADDR nodes at the SD and MI levels correctly received input from GPR3 and produced output into GPR3. This proved to work badly in the presence of multiple TLS variable accesses, with the copies to and from GPR3 being scheduled incorrectly and generally creating havoc.
  In r221703, I addressed that problem by representing the calls to __tls_get_addr as true calls during instruction lowering. This had the advantage of removing all of the bad hacks and relying on the existing call machinery to properly glue the copies in place. It looked like this was going to be the right way to go.
  However, as a side effect of the recent discovery of problems with linker optimizations for TLS, we discovered cases of suboptimal code generation with this strategy. The problem comes when tls_get_addr is called for the same address, and there is a resulting CSE opportunity. It turns out that in such cases MachineCSE will common the addis/addi instructions that set up the input value to tls_get_addr, but will not common the calls themselves. MachineCSE does not have any machinery to common idempotent calls. This is perfectly sensible, since presumably this would be done at the IR level, and introducing calls in the back end isn't commonplace. In any case, we end up with two calls to __tls_get_addr when one would suffice, and that isn't good.
  I presumed that the original design would have allowed commoning of the machine-specific nodes that hid the __tls_get_addr calls, so as suggested by Ulrich Weigand, I went back to that design and cleaned it up so that the copies were properly held together by glue nodes. However, it turned out that this didn't work either...the presence of copies to physical registers kept the machine-specific nodes from being commoned also.
  All of which leads to the design presented here. This is a return to the original design, except that no attempt is made to introduce copies to and from GPR3 during instruction lowering. Virtual registers are used until prior to register allocation. At that point, a special pass is run that identifies the machine-specific nodes that hide the tls_get_addr calls and introduces the copies to and from GPR3 around them. The register allocator then coalesces these copies away. With this design, MachineCSE succeeds in commoning tls_get_addr calls where possible, and we get nice optimal code generation (better than GCC at the moment, which does not common these calls).
  One additional problem must be dealt with: After introducing the mentions of the physical register GPR3, the aggressive anti-dependence breaker sees opportunities to improve scheduling by selecting a different register instead. Flags must be used on the instruction descriptions to tell the anti-dependence breaker to keep its hands in its pockets.
  One thing missing from the original design was recording a definition of the link register on the GET_TLS_ADDR nodes. Doing this was found to be insufficient to force a stack frame to be created, which led to looping behavior because two different LR values were stored at the same address. This appears to have been an oversight in PPCFrameLowering::determineFrameLayout(), which is repaired here.
  Because MustSaveLR() returns true for calls to builtin_return_address, this changed the expected behavior of test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but formerly did not. I've fixed the test case to reflect this. There are existing TLS tests to catch regressions; the checks in test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the face of instruction scheduling with these changes, so I fixed that up. I've added a new test case based on the PrettyStackTrace module that demonstrated the original problem. This checks that we get correct code generation and that CSE of the calls to __get_tls_addr has taken place.
  llvm-svn: 227976
* [PowerPC] Make r2 allocatable on PPC64/ELF for some leaf functions (Hal Finkel, 2015-02-01, 1 file, -2/+25)
  The TOC base pointer is passed in r2, and we normally reserve this register so that we can depend on it being there. However, for leaf functions, and specifically those leaf functions that don't do any TOC access of their own (which is generally due to accessing the constant pool, using TLS, etc.), we can treat r2 as an ordinary callee-saved register (it must be callee-saved because, for local direct calls, the linker will not insert any save/restore code).
  The allocation order has been changed slightly for PPC64/ELF systems to put r2 at the end of the list (while leaving it near the beginning for Darwin systems to prevent unnecessary output changes). While r2 is allocatable, using it still requires spill/restore traffic, and thus comes at the end of the list.
  llvm-svn: 227745
* Use the cached subtargets and remove calls to getSubtarget/getSubtargetImpl without a Function argument. (Eric Christopher, 2015-01-30, 1 file, -131/+120)
  llvm-svn: 227622
* Move DataLayout back to the TargetMachine from TargetSubtargetInfo derived classes. (Eric Christopher, 2015-01-26, 1 file, -7/+6)
  Since global data alignment, layout, and mangling are often based on the DataLayout, move it to the TargetMachine. This ensures that global data is going to be laid out and mangled consistently if the subtarget changes on a per function basis. Prior to this all targets(*) have had subtarget dependent code moved out and onto the TargetMachine.
  *One target hasn't been migrated as part of this change: R600. The R600 port has, as a subtarget feature, the size of pointers and this affects global data layout. I've currently hacked in a FIXME to enable progress, but the port needs to be updated to either pass the 64-bitness to the TargetMachine, or fix the DataLayout to avoid subtarget dependent features.
  llvm-svn: 227113
* [PowerPC] Add r2 as an operand for all calls under both PPC64 ELF V1 and V2 (Hal Finkel, 2015-01-19, 1 file, -3/+15)
  Our PPC64 ELF V2 call lowering logic added r2 as an operand to all direct call instructions in order to represent the dependency on the TOC base pointer value. Restricting this to ELF V2, however, does not seem to make sense: calls under ELF V1 have the same dependence, and indirect calls have an r2 dependence just as direct ones. Make sure the dependence is noted for all calls under both ELF V1 and ELF V2.
  llvm-svn: 226432
* [PowerPC] Add some FIXMEs for fastcc and FPR <-> GPR moves (Hal Finkel, 2015-01-18, 1 file, -0/+6)
  So we don't forget, once we support FPR <-> GPR moves on the P8, we'll likely want to re-visit this part of the calling convention.
  llvm-svn: 226401
* [PowerPC] Initial PPC64 calling-convention changes for fastcc (Hal Finkel, 2015-01-18, 1 file, -61/+158)
  The default calling convention specified by the PPC64 ELF (V1 and V2) ABI is designed to work with both prototyped and non-prototyped/varargs functions. As a result, GPRs and stack space are allocated for every argument, even those that are passed in floating-point or vector registers.
  GlobalOpt::OptimizeFunctions will transform local non-varargs functions (that do not have their address taken) to use the 'fast' calling convention. When functions are using the 'fast' calling convention, don't allocate GPRs for arguments passed in other types of registers, and don't allocate stack space for arguments passed in registers. Other changes for the fast calling convention may be added in the future.
  llvm-svn: 226399
* [PowerPC] Don't list R11 as a patchpoint scratch register (Hal Finkel, 2015-01-17, 1 file, -9/+1)
  R11's status is the same under both the PPC64 ELF V1 and V2 ABIs: it is reserved for use as an "environment pointer" for compilation models that require such a thing. We don't, we also don't need a second scratch register, and because we support only "local" patchpoint call targets, we might as well let R11 be used for anyregcc patchpoints.
  llvm-svn: 226369
* [PowerPC] Adjust PatchPoints for ppc64le (Hal Finkel, 2015-01-16, 1 file, -1/+9)
  Bill Schmidt pointed out that some adjustments would be needed to properly support powerpc64le (using the ELF V2 ABI). For one thing, R11 is not available as a scratch register, so we need to use R12. R12 is also available under ELF V1, so to maintain consistency, I flipped the order to make R12 the first scratch register in the array under both ABIs.
  llvm-svn: 226247
* [PowerPC] Loosen ELFv1 PPC64 func descriptor loads for indirect calls (Hal Finkel, 2015-01-15, 1 file, -52/+56)
  Function pointers under PPC64 ELFv1 (which is used on PPC64/Linux on the POWER7, A2 and earlier cores) are really pointers to a function descriptor, a structure with three pointers: the actual pointer to the code to which to jump, the pointer to the TOC needed by the callee, and an environment pointer. We used to chain these loads, and make them opaque to the rest of the optimizer, so that they'd always occur directly before the call. This is not necessary, and in fact, highly suboptimal on embedded cores. Once the function pointer is known, the loads can be performed ahead of time; in fact, they can be hoisted out of loops.
  Now these function descriptors are almost always generated by the linker, and thus the contents of the descriptors are invariant. As a result, by default, we'll mark the associated loads as invariant (allowing them to be hoisted out of loops). I've added a target feature to turn this off, however, just in case someone needs that option (constructing an on-stack descriptor, casting it to a function pointer, and then calling it cannot be well-defined C/C++ code, but I can imagine some JIT-compilation system doing so).
  Consider this simple test:
    $ cat call.c
    typedef void (*fp)();
    void bar(fp x) {
      for (int i = 0; i < 1600000000; ++i)
        x();
    }
    $ cat main.c
    typedef void (*fp)();
    void bar(fp x);
    void foo() {}
    int main() {
      bar(foo);
    }
  On the PPC A2 (the BG/Q supercomputer), marking the function-descriptor loads as invariant brings the execution time down to ~8 seconds from ~32 seconds with the loads in the loop.
  The difference on the POWER7 is smaller. Compiling with:
    gcc -std=c99 -O3 -mcpu=native call.c main.c : ~6 seconds [this is 4.8.2]
    clang -O3 -mcpu=native call.c main.c : ~5.3 seconds
    clang -O3 -mcpu=native call.c main.c -mno-invariant-function-descriptors : ~4 seconds
  (looks like we'd benefit from additional loop unrolling here, as a first guess, because this is faster with the extra loads)
  The -mno-invariant-function-descriptors will be added to Clang shortly.
  llvm-svn: 226207
* [PowerPC] Add assembler support for mcrfs and friends (Hal Finkel, 2015-01-15, 1 file, -1/+1)
  Fill out our support for the floating-point status and control register instructions (mcrfs and friends). As it turns out, these are necessary for compiling src/test/harness_fp.h in TBB for PowerPC.
  Thanks to Raf Schietekat for reporting the issue!
  llvm-svn: 226070
* Revert "r225811 - Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support"" (Hal Finkel, 2015-01-14, 1 file, -16/+47)
  This re-applies r225808, fixed to avoid problems with SDAG dependencies along with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs. These problems caused the original regression tests to assert/segfault on many (but not all) systems.
  Original commit message:
  This commit does two things:
  1. Refactors PPCFastISel to use more of the common infrastructure for call lowering (this lets us take advantage of this common code for lowering some common intrinsics, stackmap/patchpoint among them).
  2. Adds support for stackmap/patchpoint lowering. For the most part, this is very similar to the support in the AArch64 target, with the obvious differences (different registers, NOP instructions, etc.). The test cases are adapted from the AArch64 test cases.
  One difference of note is that the patchpoint call sequence takes 24 bytes, so you can't use less than that (on AArch64 you can go down to 16). Also, as noted in the docs, we take the patchpoint address to be the actual code address (assuming the call is local in the TOC-sharing sense), which should yield higher performance than generating the full cross-DSO indirect-call sequence and is likely just as useful for JITed code (if not, we'll change it).
  StackMaps and Patchpoints are still marked as experimental, and so this support is doubly experimental. So go ahead and experiment!
  llvm-svn: 225909
* Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support" (Hal Finkel, 2015-01-13, 1 file, -42/+20)
  Reverting this while I investigate buildbot failures (segfaulting in GetCostForDef at ScheduleDAGRRList.cpp:314).
  llvm-svn: 225811
* [PowerPC] Add StackMap/PatchPoint support (Hal Finkel, 2015-01-13, 1 file, -20/+42)
  This commit does two things:
  1. Refactors PPCFastISel to use more of the common infrastructure for call lowering (this lets us take advantage of this common code for lowering some common intrinsics, stackmap/patchpoint among them).
  2. Adds support for stackmap/patchpoint lowering. For the most part, this is very similar to the support in the AArch64 target, with the obvious differences (different registers, NOP instructions, etc.). The test cases are adapted from the AArch64 test cases.
  One difference of note is that the patchpoint call sequence takes 24 bytes, so you can't use less than that (on AArch64 you can go down to 16). Also, as noted in the docs, we take the patchpoint address to be the actual code address (assuming the call is local in the TOC-sharing sense), which should yield higher performance than generating the full cross-DSO indirect-call sequence and is likely just as useful for JITed code (if not, we'll change it).
  StackMaps and Patchpoints are still marked as experimental, and so this support is doubly experimental. So go ahead and experiment!
  llvm-svn: 225808
* Added TLI hook for isFPExtFree. Some of the FMA combine heuristics are now guarded with that hook. (Olivier Sallenave, 2015-01-13, 1 file, -0/+5)
  llvm-svn: 225795
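  The commit body gives no further detail. As a loosely hedged sketch only (the hook's signature has changed across LLVM versions, and the predicate shown is a placeholder rather than the committed heuristic), a target override has roughly this shape:

    // Report that extending f32 to f64 is effectively free on this target, so the
    // FMA-formation combines guarded by isFPExtFree are allowed to fire.
    bool PPCTargetLowering::isFPExtFree(EVT VT) const {
      return VT == MVT::f64;  // placeholder condition for illustration
    }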
* [PowerPC] Fix calls to non-function objects (Hal Finkel, 2015-01-12, 1 file, -5/+22)
  Looking at r225438 inspired me to see how the PowerPC backend handled the situation (calling a bitcasted TLS global), and it turns out we also produced an error (cannot select ...).
  What it means to "call" something that is not a function is implementation and platform specific, but in the name of doing something (besides crashing), this makes sure we do what GCC does (treat all such calls as calls through a function pointer -- meaning that the pointer is assumed, as is the convention on PPC, to point to a function descriptor structure holding the actual code address along with the function's TOC pointer and environment pointer). As GCC does, we now do the same for calling regular (non-TLS) non-function globals too.
  I'm not sure whether this is the most useful way to define the behavior, but at least we won't be alone.
  llvm-svn: 225617
* [PowerPC] Mark zext of a small scalar load as free (Hal Finkel, 2015-01-10, 1 file, -0/+20)
  This initial implementation of PPCTargetLowering::isZExtFree marks as free zexts of small scalar loads (that are not sign-extending). This callback is used by SelectionDAGBuilder's RegsForValue::getCopyToRegs, and thus to determine whether a zext or an anyext is used to lower illegally-typed PHIs. Because later truncates of zero-extended values are nops, this allows for the elimination of later unnecessary truncations.
  Fixes the initial complaint associated with PR22120.
  llvm-svn: 225584
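  A simplified sketch of the shape of this callback (the exact type checks in the committed version differ; this is not the patch itself):

    // A small scalar load already zeroes the high bits of its result register, so
    // a following zext is free; sign-extending loads are excluded.
    bool PPCTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
      if (auto *LD = dyn_cast<LoadSDNode>(Val)) {
        EVT MemVT = LD->getMemoryVT();
        if ((MemVT == MVT::i8 || MemVT == MVT::i16 || MemVT == MVT::i32) &&
            LD->getExtensionType() != ISD::SEXTLOAD)
          return true;
      }
      return TargetLowering::isZExtFree(Val, VT2);
    }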
* Remove some whitespace. (Justin Hibbits, 2015-01-10, 1 file, -1/+1)
  llvm-svn: 225583
* [PowerPC] Fold [sz]ext with fp_to_int lowering where possible (Hal Finkel, 2015-01-09, 1 file, -3/+59)
  On modern cores with lfiw[az]x, we can fold a sign or zero extension from i32 to i64 into the load necessary for an i64 -> fp conversion.
  llvm-svn: 225493
* [SelectionDAG] Allow targets to specify legality of extloads' result type (in addition to the memory type) (Ahmed Bougacha, 2015-01-08, 1 file, -12/+14)
  The *LoadExt* legalization handling used to only have one type, the memory type. This forced users to assume that as long as the extload for the memory type was declared legal, and the result type was legal, the whole extload was legal.
  However, this isn't always the case. For instance, on X86, with AVX, this is legal:
    v4i32 load, zext from v4i8
  but this isn't:
    v4i64 load, zext from v4i8
  Whereas v4i64 is (arguably) legal, even without AVX2.
  Note that the same thing was done a while ago for truncstores (r46140), but I assume no one needed it yet for extloads, so here we go.
  Calls to getLoadExtAction were changed to add the value type, found manually in the surrounding code. Calls to setLoadExtAction were mechanically changed, by wrapping the call in a loop, to match previous behavior. The loop iterates over the MVT subrange corresponding to the memory type (FP vectors, etc...). I also pulled neighboring setTruncStoreActions into some of the loops; those shouldn't make a difference, as the additional types are illegal. (e.g., i128->i1 truncstores on PPC.)
  No functional change intended.
  Differential Revision: http://reviews.llvm.org/D6532
  llvm-svn: 225421
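  The new two-type form reads like this at a call site (the types are just the AVX example quoted above, used purely to show the API shape):

    // The extending-load action is now keyed on both the result type and the
    // memory type, so the two cases above can be configured independently.
    setLoadExtAction(ISD::ZEXTLOAD, MVT::v4i32, MVT::v4i8, Legal);   // v4i8 -> v4i32
    setLoadExtAction(ISD::ZEXTLOAD, MVT::v4i64, MVT::v4i8, Expand);  // v4i8 -> v4i64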
* [CodeGen] Use MVT iterator_ranges in legality loops. NFC intended. (Ahmed Bougacha, 2015-01-07, 1 file, -8/+2)
  A few loops do trickier things than just iterating on an MVT subset, so I'll leave them be for now. Follow-up of r225387.
  llvm-svn: 225392
* [PowerPC] Reuse a load operand in int->fp conversions (Hal Finkel, 2015-01-06, 1 file, -20/+120)
  int->fp conversions on PPC must be done through memory loads and stores. On a modern core, this process begins by storing the int value to memory, then loading it using a (sometimes special) FP load instruction. Unfortunately, we would do this even when the value to be converted was itself a load, and we can just use that same memory location instead of copying it to another first.
  There is a slight complication when handling int_to_fp(fp_to_int(x)) pairs, because the fp_to_int operand has not been lowered when the int_to_fp is being lowered. We handle this specially by invoking fp_to_int's lowering logic (partially) and getting the necessary memory location (some trivial refactoring was done to make this possible).
  This is all somewhat ugly, and it would be nice if some later CodeGen stage could just clean this stuff up, but because doing so would involve modifying target-specific nodes (or instructions), it is not immediately clear how that would work.
  Also, remove a related entry from the README.txt for which we now generate reasonable code.
  llvm-svn: 225301
* [PowerPC] Add some missing names in getTargetNodeName (Hal Finkel, 2015-01-06, 1 file, -0/+7)
  These are used for debugging output; NFC.
  llvm-svn: 225249
* [PowerPC] Improve int_to_fp(fp_to_int(x)) combining (Hal Finkel, 2015-01-06, 1 file, -30/+73)
  The old target DAG combine that allowed for performing int_to_fp(fp_to_int(x)) without a load/store pair is updated here with support for unsigned integers, and to support single-precision values without a third rounding step, on newer cores with the appropriate instructions.
  llvm-svn: 225248
* [PowerPC/BlockPlacement] Allow target to provide a per-loop alignment preference (Hal Finkel, 2015-01-03, 1 file, -0/+35)
  The existing code provided for specifying a global loop alignment preference. However, the preferred loop alignment might depend on the loop itself. For recent POWER cores, loops between 5 and 8 instructions should have 32-byte alignment (while the others are better with 16-byte alignment) so that the entire loop will fit in one i-cache line.
  To support this, getPrefLoopAlignment has been made virtual, and can be provided with an optional MachineLoop* so the target can inspect the loop before answering the query. The default behavior, as before, is to return the value set with setPrefLoopAlignment. MachineBlockPlacement now queries the target for each loop instead of only once per function. There should be no functional change for other targets.
  llvm-svn: 225117
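  A sketch of the shape of the new hook (the 5-to-8-instruction heuristic is paraphrased from the description above; the committed PPC override is more involved):

    // Return the preferred alignment (as a log2 value, in this era of the API) for
    // a specific loop: 32 bytes when the whole loop can fit in one 32-byte i-cache slot.
    unsigned PPCTargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
      if (ML && ML->getNumBlocks() == 1) {
        unsigned NumInstrs = ML->getHeader()->size();
        if (NumInstrs >= 5 && NumInstrs <= 8)
          return 5;  // 2^5 = 32-byte alignment
      }
      // Otherwise fall back to the global value set with setPrefLoopAlignment.
      return TargetLowering::getPrefLoopAlignment(ML);
    }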
* [PowerPC] Use 16-byte alignment for modern cores for functions/loops (Hal Finkel, 2015-01-03, 1 file, -4/+20)
  Most modern PowerPC cores prefer that functions and loops start on 16-byte-aligned boundaries (*), so instruct block placement, etc. to make this happen. The branch selector has also been adjusted to account for the extra nops that might now be inserted before loop headers.
  (*) Some cores actually prefer other alignments for small loops, but that will be addressed in a follow-up commit.
  llvm-svn: 225115
* [PowerPC] Add support for the CMPB instruction (Hal Finkel, 2015-01-03, 1 file, -0/+1)
  Newer POWER cores, and the A2, support the cmpb instruction. This instruction compares its operands, treating each of the 8 bytes in the GPRs separately, returning a 'mask' result of 0 (for false) or -1 (for true) in each byte.
  Code generation support is added, in the form of a PPCISelDAGToDAG DAG-preprocessing routine, that recognizes patterns close to what the instruction computes (either exactly, or related by a constant masking operation), and generates the cmpb instruction (along with any necessary constant masking operation). This can be expanded if use cases arise.
  llvm-svn: 225106
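  The byte-wise semantics are easy to state as a self-contained reference function (plain C++ describing what cmpb computes, not code from the patch):

    #include <cstdint>
    #include <cstdio>

    // Reference semantics of cmpb: compare RA and RB one byte at a time and set
    // each byte of the result to 0xFF where they match and 0x00 where they don't.
    uint64_t cmpb(uint64_t RA, uint64_t RB) {
      uint64_t Result = 0;
      for (int Byte = 0; Byte < 8; ++Byte) {
        uint64_t Mask = 0xFFull << (8 * Byte);
        if ((RA & Mask) == (RB & Mask))
          Result |= Mask;
      }
      return Result;
    }

    int main() {
      // Bytes 0, 2, 3, 5, and 7 (numbering the least-significant byte as 0) match,
      // so those lanes come back as 0xFF: prints ff00ff00ffff00ff.
      std::printf("%016llx\n", (unsigned long long)
                  cmpb(0x1122334455667788ull, 0x11aa33bb5566cc88ull));
      return 0;
    }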
* [PowerPC] Ensure that the TOC reload directly follows bctrl on PPC64 (Hal Finkel, 2014-12-23, 1 file, -13/+12)
  On non-Darwin PPC64, the TOC reload needs to come directly after the bctrl instruction (for indirect calls) because the 'bctrl/ld 2, 40(1)' instruction sequence is interpreted by the unwinding code in libgcc. To make sure these occur as a pair, as with other pairings interpreted by the linker, fuse the two instructions into one instruction (for code generation only).
  In the future, we might wish to do this by emitting CFI directives instead, but this solution is simpler, and mirrors what GCC does. Additional discussion on this point is contained in the PR.
  Fixes PR22015.
  llvm-svn: 224788
* [PowerPC] Don't mark the return-address slot as immutable (Hal Finkel, 2014-12-23, 1 file, -1/+1)
  It is tempting to mark the fixed stack slot used to store the return address as immutable when lowering @llvm.returnaddress(i32 0). Unfortunately, within the function, it is not completely immutable: it is written during the function prologue. When using post-RA instruction scheduling, the prologue instructions are available for scheduling, and we're not free to interchange the order of a particular store in the prologue with loads from that stack location.
  Fixes PR21976.
  llvm-svn: 224761