summaryrefslogtreecommitdiffstats
path: root/llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
Commit message (Collapse)AuthorAgeFilesLines
...
* [C++] Use 'nullptr'. Target edition.Craig Topper2014-04-251-7/+7
| | | | llvm-svn: 207197
* [Modules] Fix potential ODR violations by sinking the DEBUG_TYPEChandler Carruth2014-04-221-1/+2
| | | | | | | definition below all of the header #include lines, lib/Target/... edition. llvm-svn: 206842
* [PowerPC] Fix rlwimi isel when mask is not constantHal Finkel2014-04-131-1/+8
| | | | | | | | | | | | | | | | | We had been using the known-zero values of the operand of the or to construct the mask for an rlwimi; this is not quite correct, but fine when the mask is constant. When the mask is constant, then the known zeros of the operand must be a superset of the zeros in the mask. However, when the mask is not a constant, then there might be bits in the operand that are not known to be zero that, at runtime, might be zero in the mask. Therefore, we check that any bits not known to be zero *are* known to be one in the mask. Otherwise, we can't fold the mask with the or and shift. This was revealed as a miscompile of MultiSource/Benchmarks/BitBench/drop3/drop3 when I started experimenting with constant hoisting. llvm-svn: 206136
* [PowerPC] Fix VSX permutation iselHal Finkel2014-03-281-1/+1
| | | | | | | Not only did I invert the indices when I wrote the code, but I also did the same thing when I wrote the regression test. Oops. llvm-svn: 205046
* Prevent alias from pointing to weak aliases.Rafael Espindola2014-03-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds back r204781. Original message: Aliases are just another name for a position in a file. As such, the regular symbol resolutions are not applied. For example, given define void @my_func() { ret void } @my_alias = alias weak void ()* @my_func @my_alias2 = alias void ()* @my_alias We produce without this patch: .weak my_alias my_alias = my_func .globl my_alias2 my_alias2 = my_alias That is, in the resulting ELF file my_alias, my_func and my_alias are just 3 names pointing to offset 0 of .text. That is *not* the semantics of IR linking. For example, linking in a @my_alias = alias void ()* @other_func would require the strong my_alias to override the weak one and my_alias2 would end up pointing to other_func. There is no way to represent that with aliases being just another name, so the best solution seems to be to just disallow it, converting a miscompile into an error. llvm-svn: 204934
* [PowerPC] Generate VSX permutations for v2[fi]64 vectorsHal Finkel2014-03-261-0/+37
| | | | llvm-svn: 204873
* [PowerPC] Lower VSELECT using xxsel when VSX is availableHal Finkel2014-03-261-3/+16
| | | | | | | | With VSX there is a real vector select instruction, and so we should use it. Note that VSELECT will still scalarize for v2f64 because the corresponding SetCC result type (v2i64) is not currently a legal type. llvm-svn: 204801
* Revert "Prevent alias from pointing to weak aliases."Rafael Espindola2014-03-261-2/+2
| | | | | | | | | This reverts commit r204781. I will follow up to with msan folks to see what is what they were trying to do with aliases to weak aliases. llvm-svn: 204784
* Prevent alias from pointing to weak aliases.Rafael Espindola2014-03-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Aliases are just another name for a position in a file. As such, the regular symbol resolutions are not applied. For example, given define void @my_func() { ret void } @my_alias = alias weak void ()* @my_func @my_alias2 = alias void ()* @my_alias We produce without this patch: .weak my_alias my_alias = my_func .globl my_alias2 my_alias2 = my_alias That is, in the resulting ELF file my_alias, my_func and my_alias are just 3 names pointing to offset 0 of .text. That is *not* the semantics of IR linking. For example, linking in a @my_alias = alias void ()* @other_func would require the strong my_alias to override the weak one and my_alias2 would end up pointing to other_func. There is no way to represent that with aliases being just another name, so the best solution seems to be to just disallow it, converting a miscompile into an error. llvm-svn: 204781
* [PowerPC] Initial support for the VSX instruction setHal Finkel2014-03-131-13/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VSX is an ISA extension supported on the POWER7 and later cores that enhances floating-point vector and scalar capabilities. Among other things, this adds <2 x double> support and generally helps to reduce register pressure. The interesting part of this ISA feature is the register configuration: there are 64 new 128-bit vector registers, the 32 of which are super-registers of the existing 32 scalar floating-point registers, and the second 32 of which overlap with the 32 Altivec vector registers. This makes things like vector insertion and extraction tricky: this can be free but only if we force a restriction to the right register subclass when needed. A new "minipass" PPCVSXCopy takes care of this (although it could do a more-optimal job of it; see the comment about unnecessary copies below). Please note that, currently, VSX is not enabled by default when targeting anything because it is not yet ready for that. The assembler and disassembler are fully implemented and tested. However: - CodeGen support causes miscompiles; test-suite runtime failures: MultiSource/Benchmarks/FreeBench/distray/distray MultiSource/Benchmarks/McCat/08-main/main MultiSource/Benchmarks/Olden/voronoi/voronoi MultiSource/Benchmarks/mafft/pairlocalalign MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4 SingleSource/Benchmarks/CoyoteBench/almabench SingleSource/Benchmarks/Misc/matmul_f64_4x4 - The lowering currently falls back to using Altivec instructions far more than it should. Worse, there are some things that are scalarized through the stack that shouldn't be. - A lot of unnecessary copies make it past the optimizers, and this needs to be fixed. - Many more regression tests are needed. Normally, I'd fix these things prior to committing, but there are some students and other contributors who would like to work this, and so it makes sense to move this development process upstream where it can be subject to the regular code-review procedures. llvm-svn: 203768
* The PPC global base register cannot be r0Hal Finkel2014-03-061-2/+2
| | | | | | | | | The global base register cannot be r0 because it might end up as the first argument to addi or addis. Fixes PR18316. I don't have a small stable test case. llvm-svn: 203054
* Swap PPC isel operands to allow for 0-foldingHal Finkel2014-02-281-0/+117
| | | | | | | | | The PPC isel instruction can fold 0 into the first operand (thus eliminating the need to materialize a zero-containing register when the 'true' result of the isel is 0). When the isel is fed by a bit register operation that we can invert, do so as part of the bit-register-operation peephole routine. llvm-svn: 202469
* Add CR-bit tracking to the PowerPC backend for i1 valuesHal Finkel2014-02-281-4/+458
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change enables tracking i1 values in the PowerPC backend using the condition register bits. These bits can be treated on PowerPC as separate registers; individual bit operations (and, or, xor, etc.) are supported. Tracking booleans in CR bits has several advantages: - Reduction in register pressure (because we no longer need GPRs to store boolean values). - Logical operations on booleans can be handled more efficiently; we used to have to move all results from comparisons into GPRs, perform promoted logical operations in GPRs, and then move the result back into condition register bits to be used by conditional branches. This can be very inefficient, because the throughput of these CR <-> GPR moves have high latency and low throughput (especially when other associated instructions are accounted for). - On the POWER7 and similar cores, we can increase total throughput by using the CR bits. CR bit operations have a dedicated functional unit. Most of this is more-or-less mechanical: Adjustments were needed in the calling-convention code, support was added for spilling/restoring individual condition-register bits, and conditional branch instruction definitions taking specific CR bits were added (plus patterns and code for generating bit-level operations). This is enabled by default when running at -O2 and higher. For -O0 and -O1, where the ability to debug is more important, this feature is disabled by default. Individual CR bits do not have assigned DWARF register numbers, and storing values in CR bits makes them invisible to the debugger. It is critical, however, that we don't move i1 values that have been promoted to larger values (such as those passed as function arguments) into bit registers only to quickly turn around and move the values back into GPRs (such as happens when values are returned by functions). A pair of target-specific DAG combines are added to remove the trunc/extends in: trunc(binary-ops(binary-ops(zext(x), zext(y)), ...) and: zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...) In short, we only want to use CR bits where some of the i1 values come from comparisons or are used by conditional branches or selects. To put it another way, if we can do the entire i1 computation in GPRs, then we probably should (on the POWER7, the GPR-operation throughput is higher, and for all cores, the CR <-> GPR moves are expensive). POWER7 test-suite performance results (from 10 runs in each configuration): SingleSource/Benchmarks/Misc/mandel-2: 35% speedup MultiSource/Benchmarks/Prolangs-C++/city/city: 21% speedup MultiSource/Benchmarks/MiBench/automotive-susan: 23% speedup SingleSource/Benchmarks/CoyoteBench/huffbench: 13% speedup SingleSource/Benchmarks/Misc-C++/Large/sphereflake: 13% speedup SingleSource/Benchmarks/Misc-C++/mandel-text: 10% speedup SingleSource/Benchmarks/Misc-C++-EH/spirit: 10% slowdown MultiSource/Applications/lemon/lemon: 8% slowdown llvm-svn: 202451
* [PPC] Fix comment to match function nameHal Finkel2014-01-021-1/+1
| | | | llvm-svn: 198362
* PPC: Optimize rldicl generation for masked shiftsHal Finkel2013-11-201-1/+15
| | | | | | | | | | | | | | Masking operations (where only some number of the low bits are being kept) are selected to rldicl(x, 0, mb). If x is a logical right shift (which would become rldicl(y, 64-n, n)), we might be able to fold the two instructions together: rldicl(rldicl(x, 64-n, n), 0, mb) -> rldicl(x, 64-n, mb) for n <= mb The right shift is really a left rotate followed by a mask, and if the explicit mask is a more-restrictive sub-mask of the mask implied by the shift, only one rldicl is needed. llvm-svn: 195185
* ISelDAG: spot chain cycles involving MachineNodesTim Northover2013-09-221-1/+3
| | | | | | | | | | | | | | | | | Previously, the DAGISel function WalkChainUsers was spotting that it had entered already-selected territory by whether a node was a MachineNode (amongst other things). Since it's fairly common practice to insert MachineNodes during ISelLowering, this was not the correct check. Looking around, it seems that other nodes get their NodeId set to -1 upon selection, so this makes sure the same thing happens to all MachineNodes and uses that characteristic to determine whether we should stop looking for a loop during selection. This should fix PR15840. llvm-svn: 191165
* PPCDAGToDAGISel::isRunOfOnes should return false on zeroHal Finkel2013-07-111-1/+4
| | | | | | | | | | | | This fixes a bug (found by csmith) at -O0 where we attempt to create a RLWIMI with an out-of-range operand. Most uses of the isRunOfOnes function are guarded by a condition that the value is not zero. This was not true in two places, and in both places a zero input would result in an out-of-rage MB value (= 32). To fix this, isRunOfOnes returns false on a zero input (and I've remove one now-redundant guard). llvm-svn: 186101
* [PowerPC] Always use mfocrf if availableUlrich Weigand2013-07-031-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When accessing just a single CR register, it is always preferable to use mfocrf instead of mfcr, if the former is available on the CPU. Current code makes that distinction in many, but not all places where a single CR register value is retrieved. One missing location is PPCRegisterInfo::lowerCRSpilling. To fix this and make this simpler in the future, this patch changes the bulk of the back-end to always assume mfocrf is available and simply generate it when needed. On machines that actually do not support mfocrf, the instruction is replaced by mfcr at the very end, in EmitInstruction. This has the additional benefit that we no longer need the MFCRpseud hack, since before EmitInstruction we always have a MFOCRF instruction pattern, which already models data flow as required. The patch also adds the MFOCRF8 version of the instruction, which was missing so far. Except for the PPCRegisterInfo::lowerCRSpilling case, no change in generated code intended. llvm-svn: 185556
* [PowerPC] Remove dead code from PPCDAGToDAGISel::SelectSETCCUlrich Weigand2013-07-031-23/+5
| | | | | | | | | | | | | | | | The subroutine getCRIdxForSetCC has a parameter "Other" and comment: If this returns with Other != -1, then the returned comparison is an or of two simpler comparisons. However for at least the last five years this routine has never returned a value of Other != -1; these cases are now handled differently to begin with. This patch removes the parameter and the code in SelectSETCC that attempted to handle the Other != -1 case. llvm-svn: 185541
* Index: test/CodeGen/PowerPC/reloc-align.llBill Schmidt2013-07-011-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | =================================================================== --- test/CodeGen/PowerPC/reloc-align.ll (revision 0) +++ test/CodeGen/PowerPC/reloc-align.ll (revision 0) @@ -0,0 +1,34 @@ +; RUN: llc -mcpu=pwr7 -O1 < %s | FileCheck %s + +; This test verifies that the peephole optimization of address accesses +; does not produce a load or store with a relocation that can't be +; satisfied for a given instruction encoding. Reduced from a test supplied +; by Hal Finkel. + +target datalayout = "E-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-f128:128:128-v128:128:128-n32:64" +target triple = "powerpc64-unknown-linux-gnu" + +%struct.S1 = type { [8 x i8] } + +@main.l_1554 = internal global { i8, i8, i8, i8, i8, i8, i8, i8 } { i8 -1, i8 -6, i8 57, i8 62, i8 -48, i8 0, i8 58, i8 80 }, align 1 + +; Function Attrs: nounwind readonly +define signext i32 @main() #0 { +entry: + %call = tail call fastcc signext i32 @func_90(%struct.S1* byval bitcast ({ i8, i8, i8, i8, i8, i8, i8, i8 }* @main.l_1554 to %struct.S1*)) +; CHECK-NOT: ld {{[0-9]+}}, main.l_1554@toc@l + ret i32 %call +} + +; Function Attrs: nounwind readonly +define internal fastcc signext i32 @func_90(%struct.S1* byval nocapture %p_91) #0 { +entry: + %0 = bitcast %struct.S1* %p_91 to i64* + %bf.load = load i64* %0, align 1 + %bf.shl = shl i64 %bf.load, 26 + %bf.ashr = ashr i64 %bf.shl, 54 + %bf.cast = trunc i64 %bf.ashr to i32 + ret i32 %bf.cast +} + +attributes #0 = { nounwind readonly "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"="true" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "unsafe-fp-math"="false" "use-soft-float"="false" } Index: lib/Target/PowerPC/PPCAsmPrinter.cpp =================================================================== --- lib/Target/PowerPC/PPCAsmPrinter.cpp (revision 185327) +++ lib/Target/PowerPC/PPCAsmPrinter.cpp (working copy) @@ -679,7 +679,26 @@ void PPCAsmPrinter::EmitInstruction(const MachineI OutStreamer.EmitRawText(StringRef("\tmsync")); return; } + break; + case PPC::LD: + case PPC::STD: + case PPC::LWA: { + // Verify alignment is legal, so we don't create relocations + // that can't be supported. + // FIXME: This test is currently disabled for Darwin. The test + // suite shows a handful of test cases that fail this check for + // Darwin. Those need to be investigated before this sanity test + // can be enabled for those subtargets. + if (!Subtarget.isDarwin()) { + unsigned OpNum = (MI->getOpcode() == PPC::STD) ? 2 : 1; + const MachineOperand &MO = MI->getOperand(OpNum); + if (MO.isGlobal() && MO.getGlobal()->getAlignment() < 4) + llvm_unreachable("Global must be word-aligned for LD, STD, LWA!"); + } + // Now process the instruction normally. + break; } + } LowerPPCMachineInstrToMCInst(MI, TmpInst, *this); OutStreamer.EmitInstruction(TmpInst); Index: lib/Target/PowerPC/PPCISelDAGToDAG.cpp =================================================================== --- lib/Target/PowerPC/PPCISelDAGToDAG.cpp (revision 185327) +++ lib/Target/PowerPC/PPCISelDAGToDAG.cpp (working copy) @@ -1530,6 +1530,14 @@ void PPCDAGToDAGISel::PostprocessISelDAG() { if (GlobalAddressSDNode *GA = dyn_cast<GlobalAddressSDNode>(ImmOpnd)) { SDLoc dl(GA); const GlobalValue *GV = GA->getGlobal(); + // We can't perform this optimization for data whose alignment + // is insufficient for the instruction encoding. + if (GV->getAlignment() < 4 && + (StorageOpcode == PPC::LD || StorageOpcode == PPC::STD || + StorageOpcode == PPC::LWA)) { + DEBUG(dbgs() << "Rejected this candidate for alignment.\n\n"); + continue; + } ImmOpnd = CurDAG->getTargetGlobalAddress(GV, dl, MVT::i64, 0, Flags); } else if (ConstantPoolSDNode *CP = dyn_cast<ConstantPoolSDNode>(ImmOpnd)) { llvm-svn: 185380
* Fix a PPC rlwimi instruction-selection bugHal Finkel2013-06-281-2/+2
| | | | | | | | | Under certain (evidently rare) circumstances, this code used to convert OR(a, AND(x, y)) into OR(a, x). This was incorrect. While there, I've added a comment to the code immediately above. llvm-svn: 185201
* [PowerPC] Rename some more VK_PPC_ enumsUlrich Weigand2013-06-211-3/+3
| | | | | | | | | | | | | This renames more VK_PPC_ enums, to make them more closely reflect the @modifier string they represent. This also prepares for adding a bunch of new VK_PPC_ enums in upcoming patches. For consistency, some MO_ flags related to VK_PPC_ enums are likewise renamed. No change in behaviour. llvm-svn: 184547
* Track IR ordering of SelectionDAG nodes 2/4.Andrew Trick2013-05-251-6/+6
| | | | | | | Change SelectionDAG::getXXXNode() interfaces as well as call sites of these functions to pass in SDLoc instead of DebugLoc. llvm-svn: 182703
* Replace Count{Leading,Trailing}Zeros_{32,64} with count{Leading,Trailing}Zeros.Michael J. Spencer2013-05-241-5/+5
| | | | llvm-svn: 182680
* [PowerPC] Use true offset value in "memrix" machine operandsUlrich Weigand2013-05-161-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the second part of the change to always return "true" offset values from getPreIndexedAddressParts, tackling the case of "memrix" type operands. This is about instructions like LD/STD that only have a 14-bit field to encode immediate offsets, which are implicitly extended by two zero bits by the machine, so that in effect we can access 16-bit offsets as long as they are a multiple of 4. The PowerPC back end currently handles such instructions by carrying the 14-bit value (as it will get encoded into the actual machine instructions) in the machine operand fields for such instructions. This means that those values are in fact not the true offset, but rather the offset divided by 4 (and then truncated to an unsigned 14-bit value). Like in the case fixed in r182012, this makes common code operations on such offset values not work as expected. Furthermore, there doesn't really appear to be any strong reason why we should encode machine operands this way. This patch therefore changes the encoding of "memrix" type machine operands to simply contain the "true" offset value as a signed immediate value, while enforcing the rules that it must fit in a 16-bit signed value and must also be a multiple of 4. This change must be made simultaneously in all places that access machine operands of this type. However, just about all those changes make the code simpler; in many cases we can now just share the same code for memri and memrix operands. llvm-svn: 182032
* Implement PPC counter loops as a late IR-level passHal Finkel2013-05-151-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The old PPCCTRLoops pass, like the Hexagon pass version from which it was derived, could only handle some simple loops in canonical form. We cannot directly adapt the new Hexagon hardware loops pass, however, because the Hexagon pass contains a fundamental assumption that non-constant-trip-count loops will contain a guard, and this is not always true (the result being that incorrect negative counts can be generated). With this commit, we replace the pass with a late IR-level pass which makes use of SE to calculate the backedge-taken counts and safely generate the loop-count expressions (including any necessary max() parts). This IR level pass inserts custom intrinsics that are lowered into the desired decrement-and-branch instructions. The most fragile part of this new implementation is that interfering uses of the counter register must be detected on the IR level (and, on PPC, this also includes any indirect branches in addition to function calls). Also, to make all of this work, we need a variant of the mtctr instruction that is marked as having side effects. Without this, machine-code level CSE, DCE, etc. illegally transform the resulting code. Hopefully, this can be improved in the future. This new pass is smaller than the original (and much smaller than the new Hexagon hardware loops pass), and can handle many additional cases correctly. In addition, the preheader-creation code has been copied from LoopSimplify, and after we decide on where it belongs, this code will be refactored so that it can be explicitly shared (making this implementation even smaller). The new test-case files ctrloop-{le,lt,ne}.ll have been adapted from tests for the new Hexagon pass. There are a few classes of loops that this pass does not transform (noted by FIXMEs in the files), but these deficiencies can be addressed within the SE infrastructure (thus helping many other passes as well). llvm-svn: 181927
* ArrayRefize getMachineNode(). No functionality change.Michael Liao2013-04-191-7/+7
| | | | llvm-svn: 179901
* PowerPC: Remove ADDIL patterns.Ulrich Weigand2013-03-261-2/+1
| | | | | | | | | | | | | | The ADDI/ADDI8 patterns are currently duplicated into ADDIL/ADDI8L, which describe the same instruction, except that they accept a symbolLo[64] operand instead of a s16imm[64] operand. This duplication confuses the asm parser, and it actually not really needed, since symbolLo[64] already accepts immediate operands anyway. So this commit removes the duplicate patterns. No change in generated code. llvm-svn: 178004
* Fix swapped BasePtr and Offset in pre-inc memory addresses.Ulrich Weigand2013-03-221-1/+1
| | | | | | | | | | | | | | | | | | | | PPCTargetLowering::getPreIndexedAddressParts currently provides the base part of a memory address in the offset result, and the offset part in the base result. That swap is then undone again when an MI instruction is generated (in PPCDAGToDAGISel::Select for loads, and using .md Pat patterns for stores). This patch reverts this double swap, to make common code and back-end be in sync as to which part of the address is base and which is offset. To avoid performance regressions in certain cases, target code now checks whether the choice of base register would be rejected for pre-inc accesses by common code, and attempts to swap base and offset again in such cases. (Overall, this means that now pre-ice accesses are generated *more* frequently than before.) llvm-svn: 177733
* Tighten iaddroff ComplexPattern.Ulrich Weigand2013-03-221-4/+4
| | | | | | | | | | | | | | | | | The iaddroff ComplexPattern is supposed to recognize displacement expressions that have been processed by a SelectAddressRegImm, which means it needs to accept TargetConstant and TargetGlobalAddress nodes. Currently, it erroneously also accepts some other nodes, in particular Constant and PPCISD::Lo. While this problem is currently latent, it would cause wrong-code bugs with a follow-on patch I'm about to commit, so this patch tightens the ComplexPattern. The equivalent change is made in PPCDAGToDAGISel::Select, where pre-inc load patterns are handled (as opposed to store patterns, the loads are handled in C++ code without making use of the .td ComplexPattern). llvm-svn: 177732
* Remove the xaddroff ComplexPattern.Ulrich Weigand2013-03-221-12/+0
| | | | | | | | | The xaddroff pattern is currently (mistakenly) used to recognize the *base* register in pre-inc store patterns. This patch replaces those uses by ptr_rc_nor0 (as is elsewhere done to match the base register of an address), and removes the now unused ComplexPattern. llvm-svn: 177731
* Implement builtin_{setjmp/longjmp} on PPCHal Finkel2013-03-211-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This implements SJLJ lowering on PPC, making the Clang functions __builtin_{setjmp/longjmp} functional on PPC platforms. The implementation strategy is similar to that on X86, with the exception that a branch-and-link variant is used to get the right jump address. Credit goes to Bill Schmidt for suggesting the use of the unconditional bcl form (instead of the regular bl instruction) to limit return-address-cache pollution. Benchmarking the speed at -O3 of: static jmp_buf env_sigill; void foo() { __builtin_longjmp(env_sigill,1); } main() { ... for (int i = 0; i < c; ++i) { if (__builtin_setjmp(env_sigill)) { goto done; } else { foo(); } done:; } ... } vs. the same code using the libc setjmp/longjmp functions on a P7 shows that this builtin implementation is ~4x faster with Altivec enabled and ~7.25x faster with Altivec disabled. This comparison is somewhat unfair because the libc version must also save/restore the VSX registers which we don't yet support. llvm-svn: 177666
* Trivial cleanupBill Schmidt2013-02-211-2/+2
| | | | llvm-svn: 175771
* Large code model support for PowerPC.Bill Schmidt2013-02-211-6/+7
| | | | | | | | | | | Large code model is identical to medium code model except that the addis/addi sequence for "local" accesses is never used. All accesses use the addis/ld sequence. The coding changes are straightforward; most of the patch is taken up with creating variants of the medium model tests for large model. llvm-svn: 175767
* Code review cleanup for r175697Bill Schmidt2013-02-211-11/+7
| | | | llvm-svn: 175739
* PPCDAGToDAGISel::PostprocessISelDAG()Bill Schmidt2013-02-211-0/+155
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements the PPCDAGToDAGISel::PostprocessISelDAG virtual method to perform post-selection peephole optimizations on the DAG representation. One optimization is implemented here: folds to clean up complex addressing expressions for thread-local storage and medium code model. It will also be useful for large code model sequences when those are added later. I originally thought about doing this on the MI representation prior to register assignment, but it's difficult to do effective global dead code elimination at that point. DCE is trivial on the DAG representation. A typical example of a candidate code sequence in assembly: addis 3, 2, globalvar@toc@ha addi 3, 3, globalvar@toc@l lwz 5, 0(3) When the final instruction is a load or store with an immediate offset of zero, the offset from the add-immediate can replace the zero, provided the relocation information is carried along: addis 3, 2, globalvar@toc@ha lwz 5, globalvar@toc@l(3) Since the addi can in general have multiple uses, we need to only delete the instruction when the last use is removed. llvm-svn: 175697
* Additional fixes for bug 15155.Bill Schmidt2013-02-201-9/+50
| | | | | | | | This handles the cases where the 6-bit splat element is odd, converting to a three-instruction sequence to add or subtract two splats. With this fix, the XFAIL in test/CodeGen/PowerPC/vec_constants.ll is removed. llvm-svn: 175663
* Fix PR15155: lost vadd/vsplat optimization.Bill Schmidt2013-02-201-0/+30
| | | | | | | | | | | | | | During lowering of a BUILD_VECTOR, we look for opportunities to use a vector splat. When the splatted value fits in 5 signed bits, a single splat does the job. When it doesn't fit in 5 bits but does fit in 6, and is an even value, we can splat on half the value and add the result to itself. This last optimization hasn't been working recently because of improved constant folding. To circumvent this, create a pseudo VADD_SPLAT that can be expanded during instruction selection. llvm-svn: 175632
* Add registration for PPC-specific passes to allow the IR to be dumpedKrzysztof Parzyszek2013-02-131-1/+18
| | | | | | via -print-after-all. llvm-svn: 175058
* Sort all of the includes. Several files got checked in with mis-sortedChandler Carruth2013-01-191-1/+1
| | | | | | includes. llvm-svn: 172891
* This patch addresses bug 14678 by fixing two problems in medium code modelBill Schmidt2013-01-071-3/+8
| | | | | | | | | code generation. Variables addressed through a GlobalAlias were not being handled, and variables with available_externally linkage were treated incorrectly. The patch contains two new tests to verify the correct code generation for these cases. llvm-svn: 171778
* Move all of the header files which are involved in modelling the LLVM IRChandler Carruth2013-01-021-5/+5
| | | | | | | | | | | | | | | | | | | | | into their new header subdirectory: include/llvm/IR. This matches the directory structure of lib, and begins to correct a long standing point of file layout clutter in LLVM. There are still more header files to move here, but I wanted to handle them in separate commits to make tracking what files make sense at each layer easier. The only really questionable files here are the target intrinsic tablegen files. But that's a battle I'd rather not fight today. I've updated both CMake and Makefile build systems (I think, and my tests think, but I may have missed something). I've also re-sorted the includes throughout the project. I'll be committing updates to Clang, DragonEgg, and Polly momentarily. llvm-svn: 171366
* This is another cleanup patch for 64-bit PowerPC TLS processing. I hadBill Schmidt2012-12-131-47/+0
| | | | | | | some hackery in place that hid my poor use of TblGen, which I've now sorted out and cleaned up. No change in observable behavior, so no new test cases. llvm-svn: 170149
* This patch implements local-dynamic TLS model support for the 64-bitBill Schmidt2012-12-121-0/+26
| | | | | | | | | | | | | | | | | | | | | | PowerPC target. This is the last of the four models, so we now have full TLS support. This is mostly a straightforward extension of the general dynamic model. I had to use an additional Chain operand to tie ADDIS_DTPREL_HA to the register copy following ADDI_TLSLD_L; otherwise everything above the ADDIS_DTPREL_HA appeared dead and was removed. As before, there are new test cases to test the assembly generation, and the relocations output during integrated assembly. The expected code gen sequence can be read in test/CodeGen/PowerPC/tls-ld.ll. There are a couple of things I think can be done more efficiently in the overall TLS code, so there will likely be a clean-up patch forthcoming; but for now I want to be sure the functionality is in place. Bill llvm-svn: 170003
* This patch implements the general dynamic TLS model for 64-bit PowerPC.Bill Schmidt2012-12-111-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Given a thread-local symbol x with global-dynamic access, the generated code to obtain x's address is: Instruction Relocation Symbol addis ra,r2,x@got@tlsgd@ha R_PPC64_GOT_TLSGD16_HA x addi r3,ra,x@got@tlsgd@l R_PPC64_GOT_TLSGD16_L x bl __tls_get_addr(x@tlsgd) R_PPC64_TLSGD x R_PPC64_REL24 __tls_get_addr nop <use address in r3> The implementation borrows from the medium code model work for introducing special forms of ADDIS and ADDI into the DAG representation. This is made slightly more complicated by having to introduce a call to the external function __tls_get_addr. Using the full call machinery is overkill and, more importantly, makes it difficult to add a special relocation. So I've introduced another opcode GET_TLS_ADDR to represent the function call, and surrounded it with register copies to set up the parameter and return value. Most of the code is pretty straightforward. I ran into one peculiarity when I introduced a new PPC opcode BL8_NOP_ELF_TLSGD, which is just like BL8_NOP_ELF except that it takes another parameter to represent the symbol ("x" above) that requires a relocation on the call. Something in the TblGen machinery causes BL8_NOP_ELF and BL8_NOP_ELF_TLSGD to be treated identically during the emit phase, so this second operand was never visited to generate relocations. This is the reason for the slightly messy workaround in PPCMCCodeEmitter.cpp:getDirectBrEncoding(). Two new tests are included to demonstrate correct external assembly and correct generation of relocations using the integrated assembler. Comments welcome! Thanks, Bill llvm-svn: 169910
* This patch introduces initial-exec model support for thread-local storageBill Schmidt2012-12-041-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | on 64-bit PowerPC ELF. The patch includes code to handle external assembly and MC output with the integrated assembler. It intentionally does not support the "old" JIT. For the initial-exec TLS model, the ABI requires the following to calculate the address of external thread-local variable x: Code sequence Relocation Symbol ld 9,x@got@tprel(2) R_PPC64_GOT_TPREL16_DS x add 9,9,x@tls R_PPC64_TLS x The register 9 is arbitrary here. The linker will replace x@got@tprel with the offset relative to the thread pointer to the generated GOT entry for symbol x. It will replace x@tls with the thread-pointer register (13). The two test cases verify correct assembly output and relocation output as just described. PowerPC-specific selection node variants are added for the two instructions above: LD_GOT_TPREL and ADD_TLS. These are inserted when an initial-exec global variable is encountered by PPCTargetLowering::LowerGlobalTLSAddress(), and later lowered to machine instructions LDgotTPREL and ADD8TLS. LDgotTPREL is a pseudo that uses the same LDrs support added for medium code model's LDtocL, with a different relocation type. The rest of the processing is straightforward. llvm-svn: 169281
* Use the new script to sort the includes of every file under lib.Chandler Carruth2012-12-031-4/+4
| | | | | | | | | | | | | | | | | Sooooo many of these had incorrect or strange main module includes. I have manually inspected all of these, and fixed the main module include to be the nearest plausible thing I could find. If you own or care about any of these source files, I encourage you to take some time and check that these edits were sensible. I can't have broken anything (I strictly added headers, and reordered them, never removed), but they may not be the headers you'd really like to identify as containing the API being implemented. Many forward declarations and missing includes were added to a header files to allow them to parse cleanly when included first. The main module rule does in fact have its merits. =] llvm-svn: 169131
* This patch implements medium code model support for 64-bit PowerPC.Bill Schmidt2012-11-271-0/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The default for 64-bit PowerPC is small code model, in which TOC entries must be addressable using a 16-bit offset from the TOC pointer. Additionally, only TOC entries are addressed via the TOC pointer. With medium code model, TOC entries and data sections can all be addressed via the TOC pointer using a 32-bit offset. Cooperation with the linker allows 16-bit offsets to be used when these are sufficient, reducing the number of extra instructions that need to be executed. Medium code model also does not generate explicit TOC entries in ".section toc" for variables that are wholly internal to the compilation unit. Consider a load of an external 4-byte integer. With small code model, the compiler generates: ld 3, .LC1@toc(2) lwz 4, 0(3) .section .toc,"aw",@progbits .LC1: .tc ei[TC],ei With medium model, it instead generates: addis 3, 2, .LC1@toc@ha ld 3, .LC1@toc@l(3) lwz 4, 0(3) .section .toc,"aw",@progbits .LC1: .tc ei[TC],ei Here .LC1@toc@ha is a relocation requesting the upper 16 bits of the 32-bit offset of ei's TOC entry from the TOC base pointer. Similarly, .LC1@toc@l is a relocation requesting the lower 16 bits. Note that if the linker determines that ei's TOC entry is within a 16-bit offset of the TOC base pointer, it will replace the "addis" with a "nop", and replace the "ld" with the identical "ld" instruction from the small code model example. Consider next a load of a function-scope static integer. For small code model, the compiler generates: ld 3, .LC1@toc(2) lwz 4, 0(3) .section .toc,"aw",@progbits .LC1: .tc test_fn_static.si[TC],test_fn_static.si .type test_fn_static.si,@object .local test_fn_static.si .comm test_fn_static.si,4,4 For medium code model, the compiler generates: addis 3, 2, test_fn_static.si@toc@ha addi 3, 3, test_fn_static.si@toc@l lwz 4, 0(3) .type test_fn_static.si,@object .local test_fn_static.si .comm test_fn_static.si,4,4 Again, the linker may replace the "addis" with a "nop", calculating only a 16-bit offset when this is sufficient. Note that it would be more efficient for the compiler to generate: addis 3, 2, test_fn_static.si@toc@ha lwz 4, test_fn_static.si@toc@l(3) The current patch does not perform this optimization yet. This will be addressed as a peephole optimization in a later patch. For the moment, the default code model for 64-bit PowerPC will remain the small code model. We plan to eventually change the default to medium code model, which matches current upstream GCC behavior. Note that the different code models are ABI-compatible, so code compiled with different models will be linked and execute correctly. I've tested the regression suite and the application/benchmark test suite in two ways: Once with the patch as submitted here, and once with additional logic to force medium code model as the default. The tests all compile cleanly, with one exception. The mandel-2 application test fails due to an unrelated ABI compatibility with passing complex numbers. It just so happens that small code model was incredibly lucky, in that temporary values in floating-point registers held the expected values needed by the external library routine that was called incorrectly. My current thought is to correct the ABI problems with _Complex before making medium code model the default, to avoid introducing this "regression." Here are a few comments on how the patch works, since the selection code can be difficult to follow: The existing logic for small code model defines three pseudo-instructions: LDtoc for most uses, LDtocJTI for jump table addresses, and LDtocCPT for constant pool addresses. These are expanded by SelectCodeCommon(). The pseudo-instruction approach doesn't work for medium code model, because we need to generate two instructions when we match the same pattern. Instead, new logic in PPCDAGToDAGISel::Select() intercepts the TOC_ENTRY node for medium code model, and generates an ADDIStocHA followed by either a LDtocL or an ADDItocL. These new node types correspond naturally to the sequences described above. The addis/ld sequence is generated for the following cases: * Jump table addresses * Function addresses * External global variables * Tentative definitions of global variables (common linkage) The addis/addi sequence is generated for the following cases: * Constant pool entries * File-scope static global variables * Function-scope static variables Expanding to the two-instruction sequences at select time exposes the instructions to subsequent optimization, particularly scheduling. The rest of the processing occurs at assembly time, in PPCAsmPrinter::EmitInstruction. Each of the instructions is converted to a "real" PowerPC instruction. When a TOC entry needs to be created, this is done here in the same manner as for the existing LDtoc, LDtocJTI, and LDtocCPT pseudo-instructions (I factored out a new routine to handle this). I had originally thought that if a TOC entry was needed for LDtocL or ADDItocL, it would already have been generated for the previous ADDIStocHA. However, at higher optimization levels, the ADDIStocHA may appear in a different block, which may be assembled textually following the block containing the LDtocL or ADDItocL. So it is necessary to include the possibility of creating a new TOC entry for those two instructions. Note that for LDtocL, we generate a new form of LD called LDrs. This allows specifying the @toc@l relocation for the offset field of the LD instruction (i.e., the offset is replaced by a SymbolLo relocation). When the peephole optimization described above is added, we will need to do similar things for all immediate-form load and store operations. The seven "mcm-n.ll" test cases are kept separate because otherwise the intermingling of various TOC entries and so forth makes the tests fragile and hard to understand. The above assumes use of an external assembler. For use of the integrated assembler, new relocations are added and used by PPCELFObjectWriter. Testing is done with "mcm-obj.ll", which tests for proper generation of the various relocations for the same sequences tested with the external assembler. llvm-svn: 168708
* PowerPC: More support for Altivec compare operationsAdhemerval Zanella2012-10-301-13/+133
| | | | | | | | This patch adds more support for vector type comparisons using altivec. It adds correct support for v16i8, v8i16, v4i32, and v4f32 vector types for comparison operators ==, !=, >, >=, <, and <=. llvm-svn: 167015
* The PowerPC VRSAVE register has been somewhat of an odd beast sinceBill Schmidt2012-10-101-1/+3
| | | | | | | | | | | | | | | | | the Altivec extensions were introduced. Its use is optional, and allows the compiler to communicate to the operating system which vector registers should be saved and restored during a context switch. In practice, this information is ignored by the various operating systems using the SVR4 ABI; the kernel saves and restores the entire register state. Setting the VRSAVE register is no longer performed by the AIX XL compilers, the IBM i compilers, or by GCC on Power Linux systems. It seems best to avoid this logic within LLVM as well. This patch avoids generating code to update and restore VRSAVE for the PowerPC SVR4 ABIs (32- and 64-bit). The code remains in place for the Darwin ABI. llvm-svn: 165656
OpenPOWER on IntegriCloud