path: root/llvm/test/CodeGen/PowerPC
Commit message (Author, Date; Files changed, -/+ Lines)
...
* Revert "r227976 - [PowerPC] Yet another approach to __tls_get_addr" and related fixups (Hal Finkel, 2015-02-06; 4 files, -58/+9)

    Unfortunately, even with the workaround of disabling the linker TLS
    optimizations in Clang restored (which has already been done), this
    still breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).

    Bill is currently working on an alternate implementation to address the
    TLS issue in a way that also fully elides the linker bug (which,
    unfortunately, this approach did not fully elide), so I'm reverting this
    now.

    llvm-svn: 228460
* [PowerPC] Prepare loops for pre-increment loads/stores (Hal Finkel, 2015-02-05; 2 files, -6/+21)

    PowerPC supports pre-increment load/store instructions (except for
    Altivec/VSX vector loads/stores). Using these on embedded cores can be
    very important, but most loops are not naturally set up to use them. We
    can often change that, however, by placing loops into a non-canonical
    form. Generically, this means transforming loops like this:

      for (int i = 0; i < n; ++i)
        array[i] = c;

    to look like this:

      T *p = &array[-1];
      for (int i = 0; i < n; ++i)
        *++p = c;

    The key point is that the addresses accessed are pulled into dedicated
    PHIs and "pre-decremented" in the loop preheader. This allows the use
    of pre-increment load/store instructions without loop peeling.

    A target-specific late IR-level pass (running post-LSR),
    PPCLoopPreIncPrep, is introduced to perform this transformation. I've
    used this code out-of-tree for generating code for the PPC A2 for over
    a year. Somewhat to my surprise, running the test suite + externals on
    a P7 with this transformation enabled showed no performance
    regressions, and one speedup:

      External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk
        -2.32514% +/- 1.03736%

    So I'm going to enable it on everything for now. I was surprised by
    this because, on the POWER cores, these pre-increment load/store
    instructions are cracked (and, thus, harder to schedule effectively).
    But seeing no regressions, and feeling that it is generally easier to
    split instructions apart late than it is to combine them late, this
    might be the better approach regardless.

    In the future, we might want to integrate this functionality into LSR
    (but currently LSR does not create new PHI nodes, so (for that and
    other reasons) significant work would need to be done).

    llvm-svn: 228328
* [PowerPC] Generate pre-increment floating-point ld/st instructions (Hal Finkel, 2015-02-05; 1 file, -0/+40)

    PowerPC supports pre-increment floating-point load/store instructions,
    both r+r and r+i, and we had patterns for them, but they were not
    marked as legal. Mark them as legal (and add a test case).

    llvm-svn: 228327
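    As an editorial illustration (not part of the commit; function name
    hypothetical), a minimal C sketch of a loop where the r+i
    store-with-update form (stfdu) can fold the address bump into the
    store once these patterns are legal:

      /* A double-fill loop; each iteration can use an FP store with
         update instead of a separate address increment. */
      void fill(double *p, double c, int n) {
        for (int i = 0; i < n; ++i)
          p[i] = c;
      }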
* [PowerPC] Implement the vclz instructions for PWR8 (Bill Schmidt, 2015-02-05; 1 file, -0/+40)

    Patch by Kit Barton.

    Add the vector count leading zeros instruction for byte, halfword,
    word, and doubleword sizes. This is a fairly straightforward addition
    after the changes made for vpopcnt:

    1. Add the correct definitions for the various instructions in
       PPCInstrAltivec.td
    2. Make the CTLZ operation legal on vector types when using P8Altivec
       in PPCISelLowering.cpp

    Test Plan: Created a new test case in test/CodeGen/PowerPC/vec_clz.ll
    to check the instructions are being generated when the CTLZ operation
    is used in LLVM. Check the encoding and decoding in
    test/MC/PowerPC/ppc_encoding_vmx.s and
    test/Disassembler/PowerPC/ppc_encoding_vmx.txt, respectively.

    llvm-svn: 228301
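    As an editorial illustration (not from the commit; names hypothetical),
    a C sketch using GNU vector extensions of the elementwise operation
    that vclzw implements directly on a v4i32:

      /* Elementwise count-leading-zeros over four 32-bit lanes. The
         guard exists only because __builtin_clz(0) is undefined at the
         C level; vclzw itself returns 32 for a zero lane. */
      typedef unsigned int v4u32 __attribute__((vector_size(16)));

      v4u32 clz4(v4u32 x) {
        v4u32 r;
        for (int i = 0; i < 4; ++i)
          r[i] = x[i] ? (unsigned)__builtin_clz(x[i]) : 32u;
        return r;
      }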
* Add missing test case from r228046 (Bill Schmidt, 2015-02-04; 1 file, -0/+72)

    llvm-svn: 228182
* [PowerPC] Handle 32-bit targets properly in PPCTLSDynamicCall.cpp (Bill Schmidt, 2015-02-04; 1 file, -2/+2)

    llvm-svn: 228116
* Disable 32-bit tests in tls-pic.ll until they can be repaired (Bill Schmidt, 2015-02-03; 1 file, -2/+2)

    llvm-svn: 227981
* Further revise too-restrictive test CodeGen/PowerPC/tls-pic.ll (Bill Schmidt, 2015-02-03; 1 file, -1/+1)

    llvm-svn: 227980
* Further revise too-restrictive test CodeGen/PowerPC/tls-pic.ll (Bill Schmidt, 2015-02-03; 1 file, -1/+1)

    llvm-svn: 227978
* Revise too-restrictive test CodeGen/PowerPC/tls-pic.ll (Bill Schmidt, 2015-02-03; 1 file, -6/+6)

    llvm-svn: 227977
* [PowerPC] Yet another approach to __tls_get_addr (Bill Schmidt, 2015-02-03; 3 files, -6/+55)

    This patch is a third attempt to properly handle the local-dynamic and
    global-dynamic TLS models.

    In my original implementation, calls to __tls_get_addr were hidden
    from view until the asm-printer phase, at which point the underlying
    branch-and-link instruction was created with proper relocations. This
    mostly worked well, but I used some repellent techniques to ensure
    that the TLS_GET_ADDR nodes at the SD and MI levels correctly received
    input from GPR3 and produced output into GPR3. This proved to work
    badly in the presence of multiple TLS variable accesses, with the
    copies to and from GPR3 being scheduled incorrectly and generally
    creating havoc.

    In r221703, I addressed that problem by representing the calls to
    __tls_get_addr as true calls during instruction lowering. This had the
    advantage of removing all of the bad hacks and relying on the existing
    call machinery to properly glue the copies in place. It looked like
    this was going to be the right way to go.

    However, as a side effect of the recent discovery of problems with
    linker optimizations for TLS, we discovered cases of suboptimal code
    generation with this strategy. The problem comes when tls_get_addr is
    called for the same address, and there is a resulting CSE opportunity.
    It turns out that in such cases MachineCSE will common the addis/addi
    instructions that set up the input value to tls_get_addr, but will not
    common the calls themselves. MachineCSE does not have any machinery to
    common idempotent calls. This is perfectly sensible, since presumably
    this would be done at the IR level, and introducing calls in the back
    end isn't commonplace. In any case, we end up with two calls to
    __tls_get_addr when one would suffice, and that isn't good.

    I presumed that the original design would have allowed commoning of
    the machine-specific nodes that hid the __tls_get_addr calls, so as
    suggested by Ulrich Weigand, I went back to that design and cleaned it
    up so that the copies were properly held together by glue nodes.
    However, it turned out that this didn't work either...the presence of
    copies to physical registers kept the machine-specific nodes from
    being commoned also.

    All of which leads to the design presented here. This is a return to
    the original design, except that no attempt is made to introduce
    copies to and from GPR3 during instruction lowering. Virtual registers
    are used until prior to register allocation. At that point, a special
    pass is run that identifies the machine-specific nodes that hide the
    tls_get_addr calls and introduces the copies to and from GPR3 around
    them. The register allocator then coalesces these copies away. With
    this design, MachineCSE succeeds in commoning tls_get_addr calls where
    possible, and we get nice optimal code generation (better than GCC at
    the moment, which does not common these calls).

    One additional problem must be dealt with: after introducing the
    mentions of the physical register GPR3, the aggressive anti-dependence
    breaker sees opportunities to improve scheduling by selecting a
    different register instead. Flags must be used on the instruction
    descriptions to tell the anti-dependence breaker to keep its hands in
    its pockets.

    One thing missing from the original design was recording a definition
    of the link register on the GET_TLS_ADDR nodes. Doing this was found
    to be insufficient to force a stack frame to be created, which led to
    looping behavior because two different LR values were stored at the
    same address. This appears to have been an oversight in
    PPCFrameLowering::determineFrameLayout(), which is repaired here.

    Because MustSaveLR() returns true for calls to builtin_return_address,
    this changed the expected behavior of
    test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but
    formerly did not. I've fixed the test case to reflect this.

    There are existing TLS tests to catch regressions; the checks in
    test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the
    face of instruction scheduling with these changes, so I fixed that up.

    I've added a new test case based on the PrettyStackTrace module that
    demonstrated the original problem. This checks that we get correct
    code generation and that CSE of the calls to __tls_get_addr has taken
    place.

    llvm-svn: 227976
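    As an editorial illustration (not part of the commit; names
    hypothetical), a minimal C sketch of the CSE opportunity described
    above:

      /* Under the local-dynamic TLS model, both accesses below resolve
         through __tls_get_addr for the same module base; with this
         design MachineCSE can common the duplicate calls into one. */
      static __thread int a;
      static __thread int b;

      int sum(void) {
        return a + b;
      }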
* [PowerPC] VSX stores don't also read (Hal Finkel, 2015-02-01; 2 files, -2/+65)

    The VSX store instructions were also picking up an implicit "may read"
    from the default pattern, which was an intrinsic (and we don't
    currently have a way of specifying write-only intrinsics). This was
    causing MI verification to fail for VSX spill restores.

    llvm-svn: 227759
* [PowerPC] Better scheduling for isel on P7/P8 (Hal Finkel, 2015-02-01; 1 file, -0/+33)

    isel is actually a cracked instruction on the P7/P8, and must start a
    dispatch group. The scheduling model should reflect this so that we
    don't bunch too many of them together when possible. Thanks to Bill
    Schmidt and Pat Haugen for helping to sort this out.

    llvm-svn: 227758
* [PowerPC] Make r2 allocatable on PPC64/ELF for some leaf functions (Hal Finkel, 2015-02-01; 3 files, -9/+87)

    The TOC base pointer is passed in r2, and we normally reserve this
    register so that we can depend on it being there. However, for leaf
    functions, and specifically those leaf functions that don't do any TOC
    access of their own (which is generally due to accessing the constant
    pool, using TLS, etc.), we can treat r2 as an ordinary callee-saved
    register (it must be callee-saved because, for local direct calls, the
    linker will not insert any save/restore code).

    The allocation order has been changed slightly for PPC64/ELF systems
    to put r2 at the end of the list (while leaving it near the beginning
    for Darwin systems to prevent unnecessary output changes). While r2 is
    allocatable, using it still requires spill/restore traffic, and thus
    comes at the end of the list.

    llvm-svn: 227745
* [PowerPC] Complete setting the baseline for ppc64le (Bill Schmidt, 2015-01-29; 1 file, -0/+3)

    Patch by Nemanja Ivanovic.

    As was uncovered by the failing test case (when run on non-PPC
    platforms), the feature set when compiling with -march=ppc64le was not
    being picked up. This change ensures that if the -mcpu option is not
    specified, the correct feature set is picked up regardless of whether
    we are on PPC or not.

    llvm-svn: 227455
* [PowerPC] Revert ppc64le-aggregates.ll test changes from r227053 (Bill Schmidt, 2015-01-25; 1 file, -3/+0)

    It appears we have different behavior with and without -mcpu=pwr8 even
    with ppc64le defaulting to POWER8. The failure appears as follows:

      /home/bb/cmake-llvm-x86_64-linux/llvm-project/llvm/test/CodeGen/PowerPC/ppc64le-aggregates.ll:268:14: error: expected string not found in input
      ; CHECK-DAG: lfs 1, 0([[REG]])
                   ^
      <stdin>:497:11: note: scanning from here
      ld 3, .LC1@toc@l(3)
                ^
      <stdin>:497:11: note: with variable "REG" equal to "3"
      ld 3, .LC1@toc@l(3)
                ^
      <stdin>:514:2: note: possible intended match here
      lfs 1, 0(4)
       ^

    Reverting this particular test case change. Nemanja, please have a
    look at the reason for the failure.

    llvm-svn: 227055
* [PowerPC] Reset the baseline for ppc64le to be equivalent to pwr8 (Bill Schmidt, 2015-01-25; 3 files, -0/+12)

    Test by Nemanja Ivanovic.

    Since ppc64le implies POWER8 as a minimum, it makes sense that the
    same features are included. Since the pwr8 processor model will likely
    be getting new features until the implementation is complete, I
    created a new list to add these updates to. This will include them in
    both pwr8 and ppc64le. Furthermore, it seems that it would make sense
    to compose the feature lists for other processor models (pwr3 and up).
    Per discussion in the review, I will make this change in a subsequent
    patch.

    In order to test the changes, I've added an additional run step to
    test cases that specify -march=ppc64le -mcpu=pwr8 to omit the -mcpu
    option. Since the feature lists are the same, the behaviour should be
    unchanged.

    llvm-svn: 227053
* [PowerPC] Add r2 as an operand for all calls under both PPC64 ELF V1 and V2 (Hal Finkel, 2015-01-19; 1 file, -2/+2)

    Our PPC64 ELF V2 call lowering logic added r2 as an operand to all
    direct call instructions in order to represent the dependency on the
    TOC base pointer value. Restricting this to ELF V2, however, does not
    seem to make sense: calls under ELF V1 have the same dependence, and
    indirect calls have an r2 dependence just as direct ones do. Make sure
    the dependence is noted for all calls under both ELF V1 and ELF V2.

    llvm-svn: 226432
* [PowerPC] Initial PPC64 calling-convention changes for fastcc (Hal Finkel, 2015-01-18; 2 files, -0/+596)

    The default calling convention specified by the PPC64 ELF (V1 and V2)
    ABI is designed to work with both prototyped and
    non-prototyped/varargs functions. As a result, GPRs and stack space
    are allocated for every argument, even those that are passed in
    floating-point or vector registers.

    GlobalOpt::OptimizeFunctions will transform local non-varargs
    functions (that do not have their address taken) to use the 'fast'
    calling convention.

    When functions are using the 'fast' calling convention, don't allocate
    GPRs for arguments passed in other types of registers, and don't
    allocate stack space for arguments passed in registers. Other changes
    for the fast calling convention may be added in the future.

    llvm-svn: 226399
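    As an editorial illustration (not part of the commit; names
    hypothetical), a C sketch of a function that can pick up fastcc:

      /* 'helper' is static and never has its address taken, so
         GlobalOpt may switch it to the 'fast' calling convention; its
         double argument then no longer reserves a shadow GPR or stack
         slot the way the default PPC64 ELF convention requires. */
      static double helper(double x, int n) {
        return x * n;
      }

      double entry(double x) {
        return helper(x, 3);
      }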
* [PowerPC] Don't list R11 as a patchpoint scratch register (Hal Finkel, 2015-01-17; 2 files, -3/+3)

    R11's status is the same under both the PPC64 ELF V1 and V2 ABIs: it
    is reserved for use as an "environment pointer" for compilation models
    that require such a thing. We don't, and we also don't need a second
    scratch register; because we support only "local" patchpoint call
    targets, we might as well let R11 be used for anyregcc patchpoints.

    llvm-svn: 226369
* [PowerPC] Adjust PatchPoints for ppc64le (Hal Finkel, 2015-01-16; 1 file, -15/+19)

    Bill Schmidt pointed out that some adjustments would be needed to
    properly support powerpc64le (using the ELF V2 ABI). For one thing,
    R11 is not available as a scratch register, so we need to use R12. R12
    is also available under ELF V1, so to maintain consistency, I flipped
    the order to make R12 the first scratch register in the array under
    both ABIs.

    llvm-svn: 226247
* [PowerPC] Loosen ELFv1 PPC64 func descriptor loads for indirect calls (Hal Finkel, 2015-01-15; 2 files, -1/+48)

    Function pointers under PPC64 ELFv1 (which is used on PPC64/Linux on
    the POWER7, A2 and earlier cores) are really pointers to a function
    descriptor, a structure with three pointers: the actual pointer to the
    code to which to jump, the pointer to the TOC needed by the callee,
    and an environment pointer. We used to chain these loads, and make
    them opaque to the rest of the optimizer, so that they'd always occur
    directly before the call. This is not necessary, and in fact, highly
    suboptimal on embedded cores. Once the function pointer is known, the
    loads can be performed ahead of time; in fact, they can be hoisted out
    of loops.

    Now these function descriptors are almost always generated by the
    linker, and thus the contents of the descriptors are invariant. As a
    result, by default, we'll mark the associated loads as invariant
    (allowing them to be hoisted out of loops). I've added a target
    feature to turn this off, however, just in case someone needs that
    option (constructing an on-stack descriptor, casting it to a function
    pointer, and then calling it cannot be well-defined C/C++ code, but I
    can imagine some JIT-compilation system doing so).

    Consider this simple test:

      $ cat call.c
      typedef void (*fp)();
      void bar(fp x) {
        for (int i = 0; i < 1600000000; ++i)
          x();
      }

      $ cat main.c
      typedef void (*fp)();
      void bar(fp x);
      void foo() {}
      int main() {
        bar(foo);
      }

    On the PPC A2 (the BG/Q supercomputer), marking the function-descriptor
    loads as invariant brings the execution time down to ~8 seconds from
    ~32 seconds with the loads in the loop.

    The difference on the POWER7 is smaller. Compiling with:

      gcc -std=c99 -O3 -mcpu=native call.c main.c  : ~6 seconds [this is 4.8.2]
      clang -O3 -mcpu=native call.c main.c         : ~5.3 seconds
      clang -O3 -mcpu=native call.c main.c \
            -mno-invariant-function-descriptors    : ~4 seconds

    (It looks like we'd benefit from additional loop unrolling here, as a
    first guess, because this is faster with the extra loads.)

    The -mno-invariant-function-descriptors option will be added to Clang
    shortly.

    llvm-svn: 226207
* IR: Move MDLocation into place (Duncan P. N. Exon Smith, 2015-01-14; 3 files, -13/+13)

    This commit moves `MDLocation`, finishing off PR21433. There's an
    accompanying clang commit for frontend testcases. I'll attach the
    testcase upgrade script I used to PR21433 to help out-of-tree
    frontends/backends.

    This changes the schema for `DebugLoc` and `DILocation` from:

      !{i32 3, i32 7, !7, !8}

    to:

      !MDLocation(line: 3, column: 7, scope: !7, inlinedAt: !8)

    Note that empty fields (line/column: 0 and inlinedAt: null) don't get
    printed by the assembly writer.

    llvm-svn: 226048
* [PPC64] Add support for the ICBT instruction on POWER8. (Bill Schmidt, 2015-01-14; 2 files, -0/+35)

    Patch by Kit Barton.

    Support for the ICBT instruction is currently present, but limited to
    embedded processors. This change adds a new FeatureICBT that can be
    used to identify whether the ICBT instruction is available on a
    specific processor.

    Two new tests are added:
    * A positive test to ensure the icbt instruction is present when using
      -mcpu=pwr8
    * A negative test to ensure the icbt instruction is not generated when
      using -mcpu=pwr7

    Both test cases use the Prefetch opcode in LLVM. They are based on the
    ppc64-prefetch.ll test case.

    llvm-svn: 226033
* Revert "Insert random noops to increase security against ROP attacks (llvm)" (JF Bastien, 2015-01-14; 1 file, -31/+0)

    This reverts commit: http://reviews.llvm.org/D3392

    llvm-svn: 225948
* [PowerPC] Fix the noop-insert test (Hal Finkel, 2015-01-14; 1 file, -3/+3)

    The form of nops used is CPU-specific (some CPUs, such as the POWER7,
    have special group-terminating nops). We probably want a different
    callback for this kind of nop insertion (something more like
    MCAsmBackend::writeNopData), or for PPC to use a different mechanism
    for scheduling nops, but this will stop the test from failing for now.

    llvm-svn: 225928
* Revert "r225811 - Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support"" (Hal Finkel, 2015-01-14; 5 files, -0/+792)

    This re-applies r225808, fixed to avoid problems with SDAG
    dependencies, along with the preceding fix to
    ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs. These problems caused
    the original regression tests to assert/segfault on many (but not all)
    systems.

    Original commit message:

    This commit does two things:

    1. Refactors PPCFastISel to use more of the common infrastructure for
       call lowering (this lets us take advantage of this common code for
       lowering some common intrinsics, stackmap/patchpoint among them).
    2. Adds support for stackmap/patchpoint lowering. For the most part,
       this is very similar to the support in the AArch64 target, with the
       obvious differences (different registers, NOP instructions, etc.).
       The test cases are adapted from the AArch64 test cases.

    One difference of note is that the patchpoint call sequence takes 24
    bytes, so you can't use less than that (on AArch64 you can go down to
    16). Also, as noted in the docs, we take the patchpoint address to be
    the actual code address (assuming the call is local in the TOC-sharing
    sense), which should yield higher performance than generating the full
    cross-DSO indirect-call sequence and is likely just as useful for
    JITed code (if not, we'll change it).

    StackMaps and Patchpoints are still marked as experimental, and so
    this support is doubly experimental. So go ahead and experiment!

    llvm-svn: 225909
* Insert random noops to increase security against ROP attacks (llvm) (JF Bastien, 2015-01-14; 1 file, -0/+31)

    A pass that adds random noops to X86 binaries to introduce diversity
    with the goal of increasing security against most return-oriented
    programming attacks.

    Command line options:

      -noop-insertion                // Enable noop insertion.
      -noop-insertion-percentage=X   // X% of assembly instructions will
                                     // have a noop prepended (default:
                                     // 50%; requires -noop-insertion).
      -max-noops-per-instruction=X   // Randomly generate X noops per
                                     // instruction, i.e. roll the dice X
                                     // times with the probability set
                                     // above (default: 1). This doesn't
                                     // guarantee X noop instructions.

    In addition, the following 'quick switch' in clang enables basic
    diversity using default settings (currently: noop insertion and
    schedule randomization; it is intended to be extended in the future):

      -fdiversify

    This is the llvm part of the patch; the clang part is D3393.

    http://reviews.llvm.org/D3392
    Patch by Stephen Crane (@rinon)

    llvm-svn: 225908
* Use the integrated assembler as default on PowerPC (Ulrich Weigand, 2015-01-13; 6 files, -11/+12)

    This was already done in clang; this commit now uses the integrated
    assembler as default when using LLVM tools directly. A number of test
    cases using inline asm had to be adapted, either by updating the
    expected output, or by using -no-integrated-as (for those tests that
    deliberately use an invalid instruction in inline asm).

    llvm-svn: 225819
* Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support" (Hal Finkel, 2015-01-13; 5 files, -792/+0)

    Reverting this while I investigate buildbot failures (segfaulting in
    GetCostForDef at ScheduleDAGRRList.cpp:314).

    llvm-svn: 225811
* [PowerPC] Add StackMap/PatchPoint support (Hal Finkel, 2015-01-13; 5 files, -0/+792)

    This commit does two things:

    1. Refactors PPCFastISel to use more of the common infrastructure for
       call lowering (this lets us take advantage of this common code for
       lowering some common intrinsics, stackmap/patchpoint among them).
    2. Adds support for stackmap/patchpoint lowering. For the most part,
       this is very similar to the support in the AArch64 target, with the
       obvious differences (different registers, NOP instructions, etc.).
       The test cases are adapted from the AArch64 test cases.

    One difference of note is that the patchpoint call sequence takes 24
    bytes, so you can't use less than that (on AArch64 you can go down to
    16). Also, as noted in the docs, we take the patchpoint address to be
    the actual code address (assuming the call is local in the TOC-sharing
    sense), which should yield higher performance than generating the full
    cross-DSO indirect-call sequence and is likely just as useful for
    JITed code (if not, we'll change it).

    StackMaps and Patchpoints are still marked as experimental, and so
    this support is doubly experimental. So go ahead and experiment!

    llvm-svn: 225808
* [PowerPC] Fix calls to non-function objects (Hal Finkel, 2015-01-12; 1 file, -0/+69)

    Looking at r225438 inspired me to see how the PowerPC backend handled
    the situation (calling a bitcasted TLS global), and it turns out we
    also produced an error (cannot select ...).

    What it means to "call" something that is not a function is
    implementation and platform specific, but in the name of doing
    something (besides crashing), this makes sure we do what GCC does
    (treat all such calls as calls through a function pointer -- meaning
    that the pointer is assumed, as is the convention on PPC, to point to
    a function descriptor structure holding the actual code address along
    with the function's TOC pointer and environment pointer). As GCC does,
    we now do the same for calling regular (non-TLS) non-function globals
    too.

    I'm not sure whether this is the most useful way to define the
    behavior, but at least we won't be alone.

    llvm-svn: 225617
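    As an editorial illustration (not from the commit; names
    hypothetical), a C sketch of the problematic pattern:

      /* A call through a pointer to a non-function global. Instead of
         failing in instruction selection, the backend now lowers this
         like any other indirect call (through a function descriptor
         under ELFv1), matching GCC's behavior. */
      extern int not_a_function;

      int call_it(void) {
        return ((int (*)(void))&not_a_function)();
      }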
* [PowerPC] Mark zext of a small scalar load as free (Hal Finkel, 2015-01-10; 1 file, -0/+37)

    This initial implementation of PPCTargetLowering::isZExtFree marks as
    free zexts of small scalar loads (that are not sign-extending). This
    callback is used by SelectionDAGBuilder's RegsForValue::getCopyToRegs,
    and thus to determine whether a zext or an anyext is used to lower
    illegally-typed PHIs. Because later truncates of zero-extended values
    are nops, this allows for the elimination of later unnecessary
    truncations.

    Fixes the initial complaint associated with PR22120.

    llvm-svn: 225584
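    A minimal C sketch (editorial, not from the commit; name
    hypothetical) of a zext that costs nothing:

      /* lhz already zero-extends the loaded halfword into the full
         64-bit register, so the widening below is free, and any later
         truncation back to a narrower type becomes a no-op. */
      unsigned long long widen(const unsigned short *p) {
        return *p;
      }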
* Fully fix Bug #22115. (Justin Hibbits, 2015-01-10; 2 files, -5/+15)

    Summary: In the previous commit, the register was saved, but space was
    not allocated. This resulted in the parameter save area potentially
    clobbering r30, leading to nasty results.

    Test Plan: Tests updated

    Reviewers: hfinkel

    Subscribers: llvm-commits

    Differential Revision: http://reviews.llvm.org/D6906

    llvm-svn: 225573
* [PowerPC] Fold [sz]ext with fp_to_int lowering where possible (Hal Finkel, 2015-01-09; 1 file, -0/+69)

    On modern cores with lfiw[az]x, we can fold a sign or zero extension
    from i32 to i64 into the load necessary for an i64 -> fp conversion.

    llvm-svn: 225493
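    A minimal C sketch (editorial, not from the commit; names
    hypothetical) of conversions that can use this fold:

      /* i32 -> double goes through memory on PPC; with lfiwax (signed)
         or lfiwzx (unsigned) the extension to i64 folds into the
         FP-side load instead of needing a separate extend step. */
      double from_signed(int x)        { return (double)x; }
      double from_unsigned(unsigned x) { return (double)x; }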
* [PowerPC] Mark all instructions as non-cheap for MachineLICM (Hal Finkel, 2015-01-08; 1 file, -0/+55)

    MachineLICM uses a callback named hasLowDefLatency to determine if an
    instruction def operand has a 'low' latency. If all relevant operands
    have a 'low' latency, the instruction is considered too cheap to hoist
    out of loops even in low-register-pressure situations. On PowerPC
    cores, both the embedded cores and the others, there is no reason to
    believe that this is a good choice: all instructions have a cost
    inside a loop, and hoisting them when not limited by register pressure
    is a reasonable default.

    llvm-svn: 225471
* Add saving and restoring of r30 to the prologue and epilogue, respectively (Justin Hibbits, 2015-01-08; 2 files, -2/+6)

    Summary: The PIC additions didn't update the prologue and epilogue
    code to save and restore r30 (the PIC base register). This does that.

    Test Plan: Tests updated.

    Reviewers: hfinkel

    Reviewed By: hfinkel

    Subscribers: llvm-commits

    Differential Revision: http://reviews.llvm.org/D6876

    llvm-svn: 225450
* More FMA folding opportunities. (Olivier Sallenave, 2015-01-07; 2 files, -0/+172)

    llvm-svn: 225380
* [PowerPC] Reuse a load operand in int->fp conversions (Hal Finkel, 2015-01-06; 1 file, -0/+96)

    int->fp conversions on PPC must be done through memory loads and
    stores. On a modern core, this process begins by storing the int value
    to memory, then loading it using a (sometimes special) FP load
    instruction. Unfortunately, we would do this even when the value to be
    converted was itself a load, and we can just use that same memory
    location instead of copying it to another first.

    There is a slight complication when handling int_to_fp(fp_to_int(x))
    pairs, because the fp_to_int operand has not been lowered when the
    int_to_fp is being lowered. We handle this specially by invoking
    fp_to_int's lowering logic (partially) and getting the necessary
    memory location (some trivial refactoring was done to make this
    possible).

    This is all somewhat ugly, and it would be nice if some later CodeGen
    stage could just clean this stuff up, but because doing so would
    involve modifying target-specific nodes (or instructions), it is not
    immediately clear how that would work.

    Also, remove a related entry from the README.txt for which we now
    generate reasonable code.

    llvm-svn: 225301
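    A minimal C sketch (editorial, not from the commit; name
    hypothetical) of the load-reuse case:

      /* The integer is already in memory, so the store to a stack
         temporary can be elided and the FP load can read the original
         location directly. */
      double from_memory(const int *p) {
        return (double)*p;
      }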
* [PowerPC] Add a regression test for r225251 (Hal Finkel, 2015-01-06; 1 file, -0/+23)

    In r225251, I removed an old entry from the README.txt file. While
    there are several contributing factors (including pieces in Clang's
    ABI code), upon further reflection, the backend part deserves a
    regression test.

    llvm-svn: 225268
* [PowerPC] Improve int_to_fp(fp_to_int(x)) combining (Hal Finkel, 2015-01-06; 1 file, -0/+70)

    The old target DAG combine that allowed for performing
    int_to_fp(fp_to_int(x)) without a load/store pair is updated here with
    support for unsigned integers, and to support single-precision values
    without a third rounding step, on newer cores with the appropriate
    instructions.

    llvm-svn: 225248
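    A minimal C sketch (editorial, not from the commit; names
    hypothetical) of the round trips this combine targets:

      /* fp -> int -> fp round trips, including the unsigned and
         single-precision cases this commit adds; on cores with the
         appropriate instructions these can stay in registers instead of
         bouncing through a load/store pair. */
      double trunc_signed(double x)  { return (double)(int)x; }
      float  trunc_unsigned(float x) { return (float)(unsigned)x; }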
* [PowerPC] Fix test to pass on Darwin hosts (Hal Finkel, 2015-01-05; 1 file, -1/+3)

    llvm-svn: 225220
* [PowerPC] Convert a README.txt entry into a better test (Hal Finkel, 2015-01-05; 1 file, -1/+7)

    We now produce the desired code as noted in the README.txt file (no
    spurious or). Remove the README entry and improve the regression test.

    llvm-svn: 225214
* [PowerPC] Add a test for truncating a shifted load (Hal Finkel, 2015-01-05; 1 file, -0/+18)

    We now produce the desired code as noted in the README.txt file.
    Remove the README entry and add a regression test.

    llvm-svn: 225209
* [PowerPC] Add another test for load/store with update (Hal Finkel, 2015-01-05; 1 file, -0/+19)

    We now produce the desired code as noted in the README.txt file.
    Remove the README entry and add a regression test.

    llvm-svn: 225205
* [PowerPC] Fold i1 extensions with other ops (Hal Finkel, 2015-01-05; 1 file, -0/+54)

    Consider this function from our README.txt file:

      int foo(int a, int b) { return (a < b) << 4; }

    We now explicitly track CR bits by default, so the comment in the
    README.txt about not really having a SETCC is no longer accurate, but
    we did generate this somewhat silly code:

      cmpw 0, 3, 4
      li 3, 0
      li 12, 1
      isel 3, 12, 3, 0
      sldi 3, 3, 4
      blr

    which generates the zext as a select between 0 and 1, and then shifts
    the result by a constant amount. Here we preprocess the DAG in order
    to fold the results of operations on an extension of an i1 value into
    the SELECT_I[48] pseudo instruction when the resulting constant can be
    materialized using one instruction (just like the 0 and 1). This was
    not implemented as a DAGCombine because the resulting code would have
    been anti-canonical and depends on replacing chained user nodes, which
    does not fit well into the lowering paradigm. Now we generate:

      cmpw 0, 3, 4
      li 3, 0
      li 12, 16
      isel 3, 12, 3, 0
      blr

    which is less silly.

    llvm-svn: 225203
* [PowerPC] Remove zexts after i32 ctlz (Hal Finkel, 2015-01-05; 1 file, -4/+20)

    The 64-bit semantics of cntlzw are not special: the 32-bit
    leading-zero count is stored as a 64-bit value in the range [0,32]. As
    a result, it is always zero extended, and it can be added to the
    PPCISelDAGToDAG peephole optimization as a frontier instruction for
    the removal of unnecessary zero extensions.

    llvm-svn: 225192
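    A minimal C sketch (editorial, not from the commit; name
    hypothetical) of a widened ctlz result:

      /* cntlzw's result is a 64-bit value in [0,32], so the implicit
         widening below needs no extra zero-extension instruction. The
         guard only avoids __builtin_clz(0), which is undefined at the C
         level; cntlzw itself returns 32 for zero. */
      unsigned long leading_zeros(unsigned x) {
        return x ? (unsigned)__builtin_clz(x) : 32u;
      }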
* [PowerPC] Remove zexts after byte-swapping loads (Hal Finkel, 2015-01-05; 1 file, -0/+30)

    lhbrx and lwbrx not only load their data with byte swapping, but also
    clear the upper 32 bits (at least). As a result, they can be added to
    the PPCISelDAGToDAG peephole optimization as frontier instructions for
    the removal of unnecessary zero extensions.

    llvm-svn: 225189
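    A minimal C sketch (editorial, not from the commit; name
    hypothetical) of a widened byte-swapping load:

      /* The byte-reversed load (lwbrx) clears the upper 32 bits, so
         widening its result requires no separate zero-extension. */
      unsigned long load_swapped(const unsigned *p) {
        return __builtin_bswap32(*p);
      }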
* [PowerPC] Enable speculation of cttz/ctlz (Hal Finkel, 2015-01-05; 1 file, -0/+41)

    PPC has an instruction for ctlz with defined zero behavior, and our
    lowering of cttz (provided by DAGCombine) is also efficient and
    branchless, so speculating these makes sense.

    llvm-svn: 225150
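    A minimal C sketch (editorial, not from the commit; name
    hypothetical) of a conditional that can now be speculated:

      /* Because the PPC lowering of cttz/ctlz is branchless and well
         defined at zero, the conditional below can be flattened into a
         straight-line sequence. */
      int trailing_zeros(unsigned x) {
        return x ? __builtin_ctz(x) : 32;
      }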
* [PowerPC] Materialize i64 constants using rotation with masking (Hal Finkel, 2015-01-05; 1 file, -7/+27)

    r225135 added the ability to materialize i64 constants using rotations
    in order to reduce the instruction count. Sometimes we can use a
    rotation only with some extra masking, so that we take advantage of
    the fact that generating a bunch of extra higher-order 1 bits is easy
    using li/lis.

    llvm-svn: 225147
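    As an editorial illustration (not from the commit; the exact
    instruction choice depends on the constant), a C sketch of a constant
    that is a run of higher-order ones:

      /* One plausible two-instruction sequence: li 3, -1 (all ones via
         sign extension) followed by rldicr 3, 3, 0, 31 (mask away the
         low 32 bits), instead of a longer lis/ori/shift sequence. */
      unsigned long long high_ones(void) {
        return 0xFFFFFFFF00000000ULL;
      }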