path: root/llvm/test
Commit message | Author | Age | Files | Lines
* [Hexagon] Converting XTYPE/SHIFT intrinsics. Cleaning out old intrinsic patterns and updating tests. | Colin LeMahieu | 2015-02-03 | 3 | -2/+90
    llvm-svn: 228026
* Allow PRE to insert no-cost phi nodes | Daniel Berlin | 2015-02-03 | 1 | -0/+31
    llvm-svn: 228024
* [X86][SSE] Added general integer shuffle matching for MOVQ instruction | Simon Pilgrim | 2015-02-03 | 4 | -20/+45
    This patch adds general shuffle pattern matching for the MOVQ zero-extend
    instruction (copy lower 64 bits, zero upper) for all 128-bit integer
    vectors; it is added as a fallback in lowerVectorShuffleAsZeroOrAnyExtend.
    llvm-svn: 228022
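    As an illustration, the shuffle mask that maps to MOVQ for a <4 x i32>
    vector looks roughly like this (a hypothetical sketch, not taken from the
    commit's tests):

      ; Keep the low two i32 lanes (the low 64 bits), zero the upper two:
      %r = shufflevector <4 x i32> %a, <4 x i32> zeroinitializer,
                         <4 x i32> <i32 0, i32 1, i32 4, i32 5>
      ; Lanes 4 and 5 select from the zero vector, matching MOVQ's
      ; "copy lower 64 bits, zero upper" semantics.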
* [Hexagon] Updating XTYPE/PRED intrinsics. | Colin LeMahieu | 2015-02-03 | 1 | -1/+154
    llvm-svn: 228019
* Add straight-line strength reduction to LLVM | Jingyue Wu | 2015-02-03 | 1 | -0/+119
    Summary:
    Straight-line strength reduction (SLSR) is implemented in GCC but not yet
    in LLVM. It has proven to effectively simplify statements derived from an
    unrolled loop, and can potentially benefit many other cases too. For
    example, LLVM unrolls

      #pragma unroll
      for (int i = 0; i < 3; ++i) {
        sum += foo((b + i) * s);
      }

    into

      sum += foo(b * s);
      sum += foo((b + 1) * s);
      sum += foo((b + 2) * s);

    However, no optimizations yet reduce the internal redundancy of the three
    expressions:

      b * s
      (b + 1) * s
      (b + 2) * s

    With SLSR, LLVM can optimize these three expressions into:

      t1 = b * s
      t2 = t1 + s
      t3 = t2 + s

    This commit is only an initial step towards implementing a series of such
    optimizations. I will implement more (see the TODO in the file commentary)
    in the near future.

    This optimization is enabled for the NVPTX backend for now. However, I am
    more than happy to push it to the standard optimization pipeline after
    more thorough performance tests.

    Test Plan: test/StraightLineStrengthReduce/slsr.ll
    Reviewers: eliben, HaoLiu, meheff, hfinkel, jholewinski, atrick
    Reviewed By: jholewinski, atrick
    Subscribers: karthikthecool, jholewinski, llvm-commits
    Differential Revision: http://reviews.llvm.org/D7310
    llvm-svn: 228016
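    At the IR level, the redundancy SLSR removes looks roughly like the
    following (a minimal hypothetical sketch; value names are illustrative):

      ; Before SLSR: three independent multiplies.
      %m0 = mul i32 %b, %s          ; b * s
      %b1 = add i32 %b, 1
      %m1 = mul i32 %b1, %s         ; (b + 1) * s
      %b2 = add i32 %b, 2
      %m2 = mul i32 %b2, %s         ; (b + 2) * s

      ; After SLSR: one multiply plus a chain of cheap adds.
      %t1 = mul i32 %b, %s          ; t1 = b * s
      %t2 = add i32 %t1, %s         ; t2 = t1 + s
      %t3 = add i32 %t2, %s         ; t3 = t2 + s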
* [Hexagon] Updating XTYPE/PERM intrinsics. | Colin LeMahieu | 2015-02-03 | 1 | -0/+206
    llvm-svn: 228015
* [X86][AVX2] Enabled shuffle matching for the AVX2 zero extension (128bit -> 256bit) vpmovzx* instructions. | Simon Pilgrim | 2015-02-03 | 3 | -4/+110
    Differential Revision: http://reviews.llvm.org/D7251
    llvm-svn: 228014
* Fix typo in test/CodeGen/X86/sibcall.ll (pr22331). | Rafael Espindola | 2015-02-03 | 1 | -4/+4
    llvm-svn: 228011
* [Hexagon] Adding missing vector multiply instruction encodings. Converting multiply intrinsics and updating tests. | Colin LeMahieu | 2015-02-03 | 3 | -0/+426
    llvm-svn: 228010
* Merge consecutive 16-byte loads into one 32-byte load (PR22329) | Sanjay Patel | 2015-02-03 | 1 | -29/+38
    This patch detects consecutive vector loads using the existing
    EltsFromConsecutiveLoads() logic. This fixes:
    http://llvm.org/bugs/show_bug.cgi?id=22329

    This patch effectively reverts the tablegen additions of D6492 /
    http://reviews.llvm.org/rL224344 ...which in hindsight were a horrible
    hack.

    The test cases that were added with that patch are simply modified to
    load from varying offsets of a base pointer. These loads did not match
    the existing tablegen patterns.

    A happy side effect of doing this optimization earlier is that we can now
    fold the load into a math op where possible; this is shown in some of the
    updated checks in the test file.

    Differential Revision: http://reviews.llvm.org/D7303
    llvm-svn: 228006
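    The shape of the pattern being combined looks roughly like this (a
    hypothetical sketch in current IR syntax; names are illustrative):

      ; Two adjacent 16-byte loads whose results are concatenated:
      %lo = load <4 x float>, <4 x float>* %p0, align 16
      %hi = load <4 x float>, <4 x float>* %p1, align 16  ; %p1 = %p0 + 16 bytes
      %v  = shufflevector <4 x float> %lo, <4 x float> %hi,
            <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
      ; EltsFromConsecutiveLoads() can replace this with a single 32-byte
      ; load on AVX targets.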
* [Hexagon] Converting complex number intrinsics and adding tests. | Colin LeMahieu | 2015-02-03 | 1 | -0/+349
    llvm-svn: 227995
* [Hexagon] Adding vector intrinsics for alu32/alu and xtype/alu. | Colin LeMahieu | 2015-02-03 | 2 | -0/+533
    llvm-svn: 227993
* R600/SI: Don't generate non-existent LSHL, LSHR, ASHR B32 variants on VI | Marek Olsak | 2015-02-03 | 2 | -2/+53
    This can happen when a REV instruction is commuted.

    The trick is not to define the _vi versions of instructions, which has
    these consequences:
    - code generation will always fail if a pseudo cannot be lowered
      (very useful to catch bugs where an unsupported instruction somehow
      makes it to the printer)
    - ability to query if a pseudo can be lowered, which is done in
      commuteOpcode to prevent REV from commuting to non-REV on VI

    Tested-by: Michel Dänzer <michel.daenzer@amd.com>
    llvm-svn: 227990
* R600/SI: Fix dependency between instruction writing M0 and S_SENDMSG on VI (v2) | Marek Olsak | 2015-02-03 | 1 | -0/+20
    This fixes a hang when using an empty geometry shader.

    v2: - don't add s_nop when followed by s_waitcnt
        - cosmetic changes

    Tested-by: Michel Dänzer <michel.daenzer@amd.com>
    llvm-svn: 227986
* Fix program crashes due to alignment exceptions generated for SSE memop instructions (PR22371). | Sanjay Patel | 2015-02-03 | 2 | -5/+18
    r224330 introduced a bug by misinterpreting the "FeatureVectorUAMem" bit.
    The commit log says that change did not affect anything, but that's not
    correct. That change allowed SSE instructions to have unaligned mem
    operands folded into math ops, and that's not allowed in the default
    specification for any SSE variant.

    The bug is exposed when compiling for an AVX-capable CPU that had this
    feature flag but without enabling AVX codegen. Another mistake in r224330
    was not adding the feature flag to all AVX CPUs; the AMD chips were
    excluded.

    This is part of the fix for PR22371
    ( http://llvm.org/bugs/show_bug.cgi?id=22371 ).

    This feature bit is SSE-specific, so I've renamed it to
    "FeatureSSEUnalignedMem". Changed the existing test case for the feature
    bit to reflect the new name and renamed the test file itself to better
    reflect the feature. Added runs to fold-vex.ll to check for the failing
    codegen.

    Note that the feature bit is not set by default on any CPU because it may
    require a configuration register setting to enable the enhanced unaligned
    behavior.
    llvm-svn: 227983
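    The alignment rule at issue can be seen in a small IR sketch
    (hypothetical, not from the commit):

      ; An unaligned 16-byte load feeding an add:
      %v = load <4 x float>, <4 x float>* %p, align 1
      %r = fadd <4 x float> %v, %x
      ; Plain SSE must emit movups + addps here; folding the load into a
      ; memory-operand addps requires 16-byte alignment and would fault
      ; otherwise. AVX (vaddps) and FeatureSSEUnalignedMem relax this.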
* Disable 32-bit tests in tls-pic.ll until they can be repaired | Bill Schmidt | 2015-02-03 | 1 | -2/+2
    llvm-svn: 227981
* Further revise too-restrictive test CodeGen/PowerPC/tls-pic.ll | Bill Schmidt | 2015-02-03 | 1 | -1/+1
    llvm-svn: 227980
* Further revise too-restrictive test CodeGen/PowerPC/tls-pic.ll | Bill Schmidt | 2015-02-03 | 1 | -1/+1
    llvm-svn: 227978
* Revise too-restrictive test CodeGen/PowerPC/tls-pic.ll | Bill Schmidt | 2015-02-03 | 1 | -6/+6
    llvm-svn: 227977
* [PowerPC] Yet another approach to __tls_get_addr | Bill Schmidt | 2015-02-03 | 3 | -6/+55
    This patch is a third attempt to properly handle the local-dynamic and
    global-dynamic TLS models.

    In my original implementation, calls to __tls_get_addr were hidden from
    view until the asm-printer phase, at which point the underlying
    branch-and-link instruction was created with proper relocations. This
    mostly worked well, but I used some repellent techniques to ensure that
    the TLS_GET_ADDR nodes at the SD and MI levels correctly received input
    from GPR3 and produced output into GPR3. This proved to work badly in the
    presence of multiple TLS variable accesses, with the copies to and from
    GPR3 being scheduled incorrectly and generally creating havoc.

    In r221703, I addressed that problem by representing the calls to
    __tls_get_addr as true calls during instruction lowering. This had the
    advantage of removing all of the bad hacks and relying on the existing
    call machinery to properly glue the copies in place. It looked like this
    was going to be the right way to go.

    However, as a side effect of the recent discovery of problems with linker
    optimizations for TLS, we discovered cases of suboptimal code generation
    with this strategy. The problem comes when tls_get_addr is called for the
    same address, and there is a resulting CSE opportunity. It turns out that
    in such cases MachineCSE will common the addis/addi instructions that set
    up the input value to tls_get_addr, but will not common the calls
    themselves. MachineCSE does not have any machinery to common idempotent
    calls. This is perfectly sensible, since presumably this would be done at
    the IR level, and introducing calls in the back end isn't commonplace. In
    any case, we end up with two calls to __tls_get_addr when one would
    suffice, and that isn't good.

    I presumed that the original design would have allowed commoning of the
    machine-specific nodes that hid the __tls_get_addr calls, so as suggested
    by Ulrich Weigand, I went back to that design and cleaned it up so that
    the copies were properly held together by glue nodes. However, it turned
    out that this didn't work either...the presence of copies to physical
    registers kept the machine-specific nodes from being commoned also.

    All of which leads to the design presented here. This is a return to the
    original design, except that no attempt is made to introduce copies to
    and from GPR3 during instruction lowering. Virtual registers are used
    until prior to register allocation. At that point, a special pass is run
    that identifies the machine-specific nodes that hide the tls_get_addr
    calls and introduces the copies to and from GPR3 around them. The
    register allocator then coalesces these copies away. With this design,
    MachineCSE succeeds in commoning tls_get_addr calls where possible, and
    we get nice optimal code generation (better than GCC at the moment, which
    does not common these calls).

    One additional problem must be dealt with: After introducing the mentions
    of the physical register GPR3, the aggressive anti-dependence breaker
    sees opportunities to improve scheduling by selecting a different
    register instead. Flags must be used on the instruction descriptions to
    tell the anti-dependence breaker to keep its hands in its pockets.

    One thing missing from the original design was recording a definition of
    the link register on the GET_TLS_ADDR nodes. Doing this was found to be
    insufficient to force a stack frame to be created, which led to looping
    behavior because two different LR values were stored at the same address.
    This appears to have been an oversight in
    PPCFrameLowering::determineFrameLayout(), which is repaired here.

    Because MustSaveLR() returns true for calls to builtin_return_address,
    this changed the expected behavior of test/CodeGen/PowerPC/retaddr2.ll,
    which now stacks a frame but formerly did not. I've fixed the test case
    to reflect this. There are existing TLS tests to catch regressions; the
    checks in test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive
    in the face of instruction scheduling with these changes, so I fixed that
    up.

    I've added a new test case based on the PrettyStackTrace module that
    demonstrated the original problem. This checks that we get correct code
    generation and that CSE of the calls to __tls_get_addr has taken place.
    llvm-svn: 227976
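    A minimal hypothetical reduction of the CSE opportunity described above
    (illustrative only, not the PrettyStackTrace-based test):

      @tlsvar = thread_local global i32 0

      define i32 @f() {
        %a = load i32, i32* @tlsvar   ; each access needs the address of
        %b = load i32, i32* @tlsvar   ; tlsvar, i.e. a __tls_get_addr call
        %s = add i32 %a, %b           ; sequence under general-dynamic TLS;
        ret i32 %s                    ; MachineCSE can now common the calls.
      }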
* Improve test to actually check for a folded load. | Sanjay Patel | 2015-02-03 | 1 | -11/+15
    This test was checking for lack of a "movaps" (an aligned load) rather
    than a "movups" (an unaligned load). It also included a store which
    complicated the checking. Add specific CPU runs to prevent subtarget
    feature flag overrides from inhibiting this optimization.
    llvm-svn: 227972
* [X86][MMX] Improve transfer from mmx to i32 | Bruno Cardoso Lopes | 2015-02-03 | 1 | -7/+4
    Improve EXTRACT_VECTOR_ELT DAG combine to catch conversion patterns
    between x86mmx and i32 with more layers of indirection.

    Before:
      movq2dq %mm0, %xmm0
      movd    %xmm0, %eax
    After:
      movd    %mm0, %eax
    llvm-svn: 227969
* [X86] Make fxsave64/fxrstor64/xsave64/xrstor64/xsaveopt64 parseable in AT&T syntax. Also make them the default output. | Craig Topper | 2015-02-03 | 3 | -7/+15
    llvm-svn: 227963
* Propagate a better error message to the C api. | Rafael Espindola | 2015-02-03 | 1 | -1/+1
    llvm-svn: 227934
* Use a non-fatal diag handler in the C API. Fixes PR22368. | Rafael Espindola | 2015-02-03 | 2 | -0/+3
    llvm-svn: 227903
* Revert part of r227437 as it was unnecessary. Thanks to echristo for pointing this out. | Alex Rosenberg | 2015-02-02 | 3 | -3/+3
    llvm-svn: 227897
* [X86][MMX] Add tests for MMX extract element | Bruno Cardoso Lopes | 2015-02-02 | 1 | -0/+75
    LLVM ToT produces poor MMX code compared to 3.5. However, part of the
    previous functionality can be achieved by using
    -x86-experimental-vector-widening-legalization. Add tests to be sure we
    don't regress again.
    llvm-svn: 227869
* [X86][MMX] Cleanup shuffle, bitcast and insert element tests | Bruno Cardoso Lopes | 2015-02-02 | 16 | -175/+228
    - Merge MMX arg passing test files
    - Merge MMX bitcast, insert elt and shuffle tests
    llvm-svn: 227867
* [Orc] Make OrcMCJITReplacement::addObject calls transfer buffer ownership to the ObjectLinkingLayer. | Lang Hames | 2015-02-02 | 1 | -0/+24
    There are two overloads of addObject, one of which transfers ownership of
    the underlying buffer to OrcMCJITReplacement. This commit makes the
    ownership-transferring version pass ownership down to the
    ObjectLinkingLayer in order to prevent the issue described in r227778.

    I think this commit will fix the sanitizer bot failures that necessitated
    the removal of the load-object-a.ll regression test in r227785, so I'm
    reinstating that test.
    llvm-svn: 227845
* Debug Info: Relax assertion in isUnsignedDIType() to allow floats to be described by integer constants. | Adrian Prantl | 2015-02-02 | 1 | -0/+55
    This is a bit ugly, but if the source language allows arbitrary type
    casting, the debug info must follow suit. For example:

      void foo() {
        float a;
        *(int *)&a = 0;
      }

    For the curious: SROA replaces the float alloca with an i32 alloca, which
    is then optimized away and described via dbg.value(i32 0, ...).
    llvm-svn: 227827
* R600/SI: 64-bit and larger memory access must be at least 4-byte aligned | Tom Stellard | 2015-02-02 | 2 | -5/+76
    This is true for SI only. CI+ supports unaligned memory accesses, but
    this requires driver support, so for now we disallow unaligned accesses
    for all GCN targets.
    llvm-svn: 227822
* R600/SI: Merge two test files | Tom Stellard | 2015-02-02 | 2 | -24/+15
    llvm-svn: 227821
* [AArch64] Prefer DUP/MOV ("CPY") to INS for vector_extract. | Ahmed Bougacha | 2015-02-02 | 4 | -10/+10
    This avoids a partial false dependency on the previous content of the
    upper lanes of the destination vector register.

    Differential Revision: http://reviews.llvm.org/D7307
    llvm-svn: 227820
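    The kind of node this affects, as a hypothetical sketch (names are
    illustrative):

      ; Extracting one lane of a vector into a scalar register:
      %e = extractelement <4 x float> %v, i32 1
      ; INS writes only one lane of its destination, so the result inherits
      ; a false dependency on the register's previous upper lanes; DUP/MOV
      ; fully defines the destination and avoids that stall.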
* fix typo | Sanjay Patel | 2015-02-02 | 1 | -1/+1
    llvm-svn: 227815
* Fix ARM peephole optimizeCompare to avoid optimizing unsigned cmp to 0. | Jan Wen Voung | 2015-02-02 | 1 | -0/+60
    Summary:
    Previously it only avoided optimizing signed comparisons to 0. Sometimes
    the DAGCombiner will optimize the unsigned comparisons to 0 before it
    gets to the peephole pass, but sometimes it doesn't.

    Fix for PR22373.

    Test Plan: test/CodeGen/ARM/sub-cmp-peephole.ll
    Reviewers: jfb, manmanren
    Subscribers: aemerson, llvm-commits
    Differential Revision: http://reviews.llvm.org/D7274
    llvm-svn: 227809
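    The hazard, sketched hypothetically: folding "sub; cmp #0" into a
    flag-setting "subs" is only safe for signed predicates, because the
    carry flag differs:

      %r = sub i32 %a, %b
      %c = icmp ugt i32 %r, 0       ; unsigned predicate on the sub result
      ; "cmp r, #0" always sets the carry flag (r - 0 never borrows), while
      ; "subs r, a, b" sets carry from a - b. Unsigned condition codes read
      ; the carry flag, so the peephole must not fold for unsigned compares.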
* Fix: SLPVectorizer crashes with assertion when vectorizing a cmp instruction. | Erik Eckstein | 2015-02-02 | 1 | -0/+56
    The commit r225977 uncovered this bug. The problem was that the
    vectorizer tried to read the second operand of an already deleted
    instruction. The bug didn't show up before r225977 because the freed
    memory still contained a non-null pointer. With r225977, deletion of
    instructions is delayed and the read operand pointer is always null.
    llvm-svn: 227800
* [Orc] Remove one of the OrcMCJITReplacement regression tests while I investigate a sanitizer bot failure. | Lang Hames | 2015-02-02 | 1 | -24/+0
    llvm-svn: 227785
* [Orc] Regression tests for OrcMCJITReplacement. | Lang Hames | 2015-02-02 | 89 | -0/+2171
    Duplicated from the MCJIT regression tests.
    llvm-svn: 227780
* [PowerPC] VSX stores don't also read | Hal Finkel | 2015-02-01 | 2 | -2/+65
    The VSX store instructions were also picking up an implicit "may read"
    from the default pattern, which was an intrinsic (and we don't currently
    have a way of specifying write-only intrinsics). This was causing MI
    verification to fail for VSX spill restores.
    llvm-svn: 227759
* [PowerPC] Better scheduling for isel on P7/P8 | Hal Finkel | 2015-02-01 | 1 | -0/+33
    isel is actually a cracked instruction on the P7/P8, and must start a
    dispatch group. The scheduling model should reflect this so that we don't
    bunch too many of them together when possible.

    Thanks to Bill Schmidt and Pat Haugen for helping to sort this out.
    llvm-svn: 227758
* [X86] Convert esp-relative movs of function arguments to pushes, step 2 | Michael Kuperstein | 2015-02-01 | 2 | -17/+163
    This moves the transformation introduced in r223757 into a separate MI
    pass. This allows it to cover many more cases (not only cases where there
    must be a reserved call frame), and perform rudimentary call folding. It
    still doesn't have a heuristic, so it is enabled only for
    optsize/minsize, with stack alignment <= 8, where it ought to be a fairly
    clear win. (Re-commit of r227728)

    Differential Revision: http://reviews.llvm.org/D6789
    llvm-svn: 227752
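    A hypothetical before/after sketch of the transformation on a 32-bit
    call (illustrative, not from the commit):

      ; IR: a call with two constant stack arguments:
      call void @foo(i32 1, i32 42)
      ; Before the pass (esp-relative stores):
      ;   movl $42, 4(%esp)
      ;   movl $1, (%esp)
      ;   calll foo
      ; After the pass (smaller at -Os; args pushed right to left):
      ;   pushl $42
      ;   pushl $1
      ;   calll foo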
* Revert r227728 due to bad line endings. | Michael Kuperstein | 2015-02-01 | 2 | -163/+17
    llvm-svn: 227746
* [PowerPC] Make r2 allocatable on PPC64/ELF for some leaf functions | Hal Finkel | 2015-02-01 | 3 | -9/+87
    The TOC base pointer is passed in r2, and we normally reserve this
    register so that we can depend on it being there. However, for leaf
    functions, and specifically those leaf functions that don't do any TOC
    access of their own (which is generally due to accessing the constant
    pool, using TLS, etc.), we can treat r2 as an ordinary callee-saved
    register (it must be callee-saved because, for local direct calls, the
    linker will not insert any save/restore code).

    The allocation order has been changed slightly for PPC64/ELF systems to
    put r2 at the end of the list (while leaving it near the beginning for
    Darwin systems to prevent unnecessary output changes). While r2 is
    allocatable, using it still requires spill/restore traffic, and thus
    comes at the end of the list.
    llvm-svn: 227745
* [X86] Convert esp-relative movs of function arguments to pushes, step 2 | Michael Kuperstein | 2015-02-01 | 2 | -17/+163
    This moves the transformation introduced in r223757 into a separate MI
    pass. This allows it to cover many more cases (not only cases where there
    must be a reserved call frame), and perform rudimentary call folding. It
    still doesn't have a heuristic, so it is enabled only for
    optsize/minsize, with stack alignment <= 8, where it ought to be a fairly
    clear win.

    Differential Revision: http://reviews.llvm.org/D6789
    llvm-svn: 227728
* [PM] Port SimplifyCFG to the new pass manager. | Chandler Carruth | 2015-02-01 | 1 | -0/+1
    This should be sufficient to replace the initial (minor) function pass
    pipeline in Clang with the new pass manager. I'll probably add an (off by
    default) flag to do that just to ensure we can get extra testing.
    llvm-svn: 227726
* [PM] Port EarlyCSE to the new pass manager. | Chandler Carruth | 2015-02-01 | 2 | -0/+2
    I've added RUN lines both to the basic test for EarlyCSE and the
    target-specific test, as this serves as a nice test that the TTI layer in
    the new pass manager is in fact working well.
    llvm-svn: 227725
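    Such a RUN line would look roughly like this (a hypothetical sketch of a
    new-pass-manager invocation; the exact pass name is assumed):

      ; RUN: opt < %s -S -passes=early-cse | FileCheck %s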
* [PM] Teach the module-to-function adaptor to not run function passes over declarations. | Chandler Carruth | 2015-02-01 | 1 | -0/+10
    This is both quite unproductive and causes things to crash, for example
    domtree would just assert.

    I've added a declaration and a domtree run to the basic high-level tests
    for the new pass manager.
    llvm-svn: 227724
* [PM] Port TTI to the new pass manager, introducing a TargetIRAnalysis to produce it. | Chandler Carruth | 2015-02-01 | 1 | -0/+12
    This adds a function to the TargetMachine that produces this analysis via
    a callback for each function. This in turn paves the way to produce a
    *different* TTI per-function with the correct subtarget cached.

    I've also done the necessary wiring in the opt tool to thread the target
    machine down and make it available to the pass registry so that we can
    construct this analysis from a target machine when available.
    llvm-svn: 227721
* AVX2: Added 2 more tests for gather intrinsics. | Elena Demikhovsky | 2015-02-01 | 1 | -0/+27
    llvm-svn: 227718
* [NVPTX] Emit .pragma "nounroll" for loops marked with nounroll | Jingyue Wu | 2015-02-01 | 1 | -0/+37
    Summary:
    The CUDA driver can unroll loops when jit-compiling PTX. To prevent the
    CUDA driver from unrolling a loop marked with llvm.loop.unroll.disable,
    we need to emit .pragma "nounroll" at the header of that loop.

    This patch also extracts getting unroll metadata from loop ID metadata
    into a shared helper function.

    Test Plan: test/CodeGen/NVPTX/nounroll.ll
    Reviewers: eliben, meheff, jholewinski
    Reviewed By: jholewinski
    Subscribers: jholewinski, llvm-commits
    Differential Revision: http://reviews.llvm.org/D7041
    llvm-svn: 227703
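    The IR shape that triggers the pragma looks roughly like this (a
    hypothetical sketch; labels are illustrative):

      ; A loop whose backedge carries unroll-disable metadata:
      loop:
        ; ...loop body...
        br i1 %cond, label %loop, label %exit, !llvm.loop !0

      !0 = distinct !{!0, !1}
      !1 = !{!"llvm.loop.unroll.disable"}
      ; For NVPTX the backend then emits .pragma "nounroll" at the loop
      ; header.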