As defined in LangRef, aliases do not have sections. However, LLVM's
GlobalAlias class inherits from GlobalValue, which means we can read and
set its section. We should probably ban that as a separate change,
since it doesn't make much sense for an alias to have a section that
differs from its aliasee.
Fixes PR18757, where the section was being lost on the global in code
from Clang like:
extern "C" {
__attribute__((used, section("CUSTOM"))) static int in_custom_section;
}
Reviewers: rafael.espindola
Differential Revision: http://llvm-reviews.chandlerc.com/D2758
llvm-svn: 201286
logical operations on the i1's driving them. This is a bad idea for every
target I can think of (confirmed with micro tests on all of: x86-64, ARM,
AArch64, Mips, and PowerPC) because it forces the i1 to be materialized into
a general purpose register, whereas consuming it directly into a select generally
allows it to exist only transiently in a predicate or flags register.
Chandler ran a set of performance tests with this change, and reported no
measurable change on x86-64.
llvm-svn: 201275
'OK_NonUniformConstValue' to identify operands which are constants but
not constant splats.
The cost model now allows returning 'OK_NonUniformConstValue'
for non-splat operands that are instances of ConstantVector or
ConstantDataVector.
With this change, targets are now able to compute different costs
for instructions with non-uniform constant operands.
For example, on X86 the cost of a vector shift may vary depending on whether
the second operand is a uniform or non-uniform constant.
This patch applies the following changes:
- The cost model computation now takes into account non-uniform constants;
- The cost of vector shift instructions has been improved in the
X86TargetTransformInfo analysis pass;
- BBVectorize, SLPVectorizer and LoopVectorize now know how to distinguish
between non-uniform and uniform constant operands.
Added a new test to verify that the output of opt
'-cost-model -analyze' is valid in the following configurations: SSE2,
SSE4.1, AVX, AVX2.
llvm-svn: 201272
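To make the new distinction concrete, here is a minimal standalone C++ sketch
(illustrative only, not the LLVM implementation; the operand-kind names are
borrowed from the message above and the cost numbers are hypothetical) of
classifying a constant vector operand as a uniform splat versus a non-uniform
constant and charging a vector shift accordingly:

    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Operand kinds, mirroring the names used in the cost model change above.
    enum OperandKind { OK_AnyValue, OK_UniformConstValue, OK_NonUniformConstValue };

    // Classify the lanes of a constant vector operand: a splat (all lanes equal)
    // is a uniform constant, anything else is a non-uniform constant.
    static OperandKind classifyConstVector(const std::vector<int64_t> &Lanes) {
      if (Lanes.empty())
        return OK_AnyValue;
      for (int64_t L : Lanes)
        if (L != Lanes.front())
          return OK_NonUniformConstValue;
      return OK_UniformConstValue;
    }

    // Hypothetical per-instruction cost: a shift by a uniform constant is cheap,
    // a shift by a non-uniform constant costs more (e.g. x86 before AVX2 has no
    // general per-lane variable vector shift).
    static unsigned vectorShiftCost(OperandKind ShiftAmountKind) {
      return ShiftAmountKind == OK_UniformConstValue ? 1 : 4;
    }

    int main() {
      assert(classifyConstVector({3, 3, 3, 3}) == OK_UniformConstValue);
      assert(classifyConstVector({0, 1, 2, 3}) == OK_NonUniformConstValue);
      assert(vectorShiftCost(OK_UniformConstValue) < vectorShiftCost(OK_NonUniformConstValue));
      return 0;
    }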
UGE/ULT with a power of 2.
This happens in bitfield code. While there, reorganize the existing code
a bit.
llvm-svn: 201176
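For reference, the underlying equivalence is that an unsigned comparison
against a power of two is the same as a mask test on the bits at or above that
power; a small self-checking C++ example (illustrative, not the InstCombine code):

    #include <cassert>
    #include <cstdint>

    int main() {
      const uint32_t P = 8; // a power of 2
      for (uint64_t x = 0; x <= 1024; ++x) {
        uint32_t v = static_cast<uint32_t>(x);
        // x u>= 2^k  <=>  (x & ~(2^k - 1)) != 0
        assert((v >= P) == ((v & ~(P - 1)) != 0));
        // x u<  2^k  <=>  (x & ~(2^k - 1)) == 0
        assert((v < P) == ((v & ~(P - 1)) == 0));
      }
      return 0;
    }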
Fixes PR18753 and PR18782.
This is necessary for LICM to preserve LCSSA correctly and efficiently.
There is still some active discussion about whether we should be using
LCSSA, but we can't just immediately stop using it and we *need* LICM to
preserve it while we are using it. We can restore the old SSAUpdater
driven code if and when there is a serious effort to remove the reliance
on LCSSA from all of the loop passes.
However, this also serves as a great example of why LCSSA is very nice
to have. This change significantly simplifies the process of sinking
instructions for LICM, and makes it quite a bit less expensive.
It wouldn't even be as complex as it is except that I had to start the
process of removing the big recursive LCSSA formation hammer in order to
switch even this much of the re-forming code to asserting that LCSSA was
preserved. I'll fully remove that next just to tidy things up until the
LCSSA debate settles one way or the other.
llvm-svn: 201148
The addressing mode matcher checks at some point the profitability of folding an
instruction into the addressing mode. When the instruction to be folded has
several uses, it checks that the instruction can be folded in each use.
To do so, it creates a new matcher for each use and checks whether the instruction is
in the list of the matched instructions of this new matcher.
The new matchers may promote some instructions and this has to be undone to keep
the state of the original matcher consistent.
A test case will follow.
<rdar://problem/16020230>
llvm-svn: 201121
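The undo step described above is essentially a transactional rollback: every
speculative promotion records how to reverse itself, and the log is replayed in
reverse if the match turns out not to be profitable. A minimal standalone C++
sketch of that pattern (hypothetical names, not the CodeGenPrepare classes):

    #include <cassert>
    #include <functional>
    #include <vector>

    // A tiny rollback log: every speculative mutation registers an undo action;
    // if the enclosing match fails, the actions are replayed in reverse order.
    class RollbackLog {
      std::vector<std::function<void()>> Undos;
    public:
      void record(std::function<void()> Undo) { Undos.push_back(std::move(Undo)); }
      void rollback() {
        for (auto It = Undos.rbegin(); It != Undos.rend(); ++It)
          (*It)();
        Undos.clear();
      }
      void commit() { Undos.clear(); }
    };

    int main() {
      int Value = 1;           // stands in for state touched by a promotion
      RollbackLog Log;

      Value = 2;               // speculative "promotion"
      Log.record([&Value] { Value = 1; });

      bool Profitable = false; // pretend the new matcher said "not foldable"
      if (!Profitable)
        Log.rollback();        // restore the original matcher's state
      else
        Log.commit();

      assert(Value == 1);
      return 0;
    }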
The crux of the issue is that LCSSA doesn't preserve stateful alias
analyses. Before r200067, LICM didn't cause LCSSA to run in the LTO pass
manager, where LICM runs essentially without any of the other loop
passes. As a consequence the globalmodref-aa pass run before that loop
pass manager was able to survive the loop pass manager and be used by
DSE to eliminate stores in the function called from the loop body in
Adobe-C++/loop_unroll (and similar patterns in other benchmarks).
When LICM was taught to preserve LCSSA it had to require it as well.
This caused it to be run in the loop pass manager and because it did not
preserve AA, the stateful AA was lost. Most of LLVM's AA isn't stateful
and so this didn't manifest in most cases. Also, in most cases LCSSA was
already running, and so there was no interesting change.
The real kicker is that LCSSA by its definition (injecting PHI nodes
only) trivially preserves AA! All we need to do is mark it, and then
everything goes back to working as intended. It probably was blocking
some other weird cases of stateful AA but the only one I have is
a 1000-line IR test case from loop_unroll, so I don't really have a good
test case here.
Hopefully this fixes the regressions on performance that have been seen
since that revision.
llvm-svn: 201104
llvm-svn: 201088
llvm-svn: 201067
Before conditional store vectorization/unrolling we had only one
vectorized/unrolled basic block. After adding support for conditional store
vectorization this will not only be one block but multiple basic blocks. The
last block would have the back-edge. I updated the code to use a vector of basic
blocks instead of a single basic block and fixed the users to use the last entry
in this vector. But, I forgot to add the basic blocks to this vector!
Fixes PR18724.
llvm-svn: 201028
The bitcast instruction during constant materialization was not placed correctly
in the presence of phi nodes. This commit fixes the insertion point to be in the
idom instead.
This fixes PR18768.
llvm-svn: 201009
This fix first traverses the whole use list of the constant expression and
keeps track of the instructions that need to be updated. The fixup is then
performed afterwards.
llvm-svn: 201008
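The two-phase shape described above -- walk the whole use list first, only
recording what needs updating, and mutate afterwards -- avoids changing the list
while it is being traversed. A generic standalone C++ sketch of the idea
(illustrative, not the actual pass code):

    #include <cassert>
    #include <list>
    #include <vector>

    int main() {
      // Stand-in for a use list we must not mutate while iterating over it.
      std::list<int> Uses = {1, 2, 3, 4, 5};

      // Phase 1: walk the whole list and only record what needs updating.
      std::vector<std::list<int>::iterator> Worklist;
      for (auto It = Uses.begin(); It != Uses.end(); ++It)
        if (*It % 2 == 0)
          Worklist.push_back(It);

      // Phase 2: perform the fixup on the recorded items, after iteration is done.
      for (auto It : Worklist)
        *It *= 10;

      assert((Uses == std::list<int>{1, 20, 3, 40, 5}));
      return 0;
    }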
mode.
Basically the idea is to transform code like this:
%idx = add nsw i32 %a, 1
%sextidx = sext i32 %idx to i64
%gep = gep i8* %myArray, i64 %sextidx
load i8* %gep
Into:
%sexta = sext i32 %a to i64
%idx = add nsw i64 %sexta, 1
%gep = gep i8* %myArray, i64 %idx
load i8* %gep
That way the computation can be folded into the addressing mode.
This transformation is done as part of the addressing mode matcher.
If the matching fails (not profitable, addressing mode not legal, etc.), the
matcher will revert the related promotions.
<rdar://problem/15519855>
llvm-svn: 200947
llvm-svn: 200907
225 is the default value of inline-threshold. This change will make sure
we have the same inlining behavior as prior to r200886.
As Chandler points out, even though we don't have code in our testing
suite that uses the cold attribute, there are larger applications that do
use the cold attribute.
r200886 + this commit intend to keep the same behavior as prior to r200886.
We can later on tune the inlinecold-threshold.
The main purpose of r200886 is to help performance of instrumentation based
PGO before we actually hook up inliner with analysis passes such as BPI and BFI.
For instrumentation based PGO, we try to increase inlining of hot functions and
reduce inlining of cold functions by setting inlinecold-threshold.
Another option suggested by Chandler is to use a boolean flag that controls
if we should use OptSizeThreshold for cold functions. The default value
of the boolean flag should not change the current behavior. But it gives us
less freedom in controlling inlining of cold functions.
llvm-svn: 200898
Ideally only those transform passes that run at -O0 remain enabled;
in reality we get as close as we reasonably can.
Passes are responsible for disabling themselves, it's not the job of
the pass manager to do it for them.
llvm-svn: 200892
Added command line option inlinecold-threshold to set threshold for inlining
functions with cold attribute. Listen to the cold attribute when it would
decrease the inline threshold.
llvm-svn: 200886
For the odd case of platforms with exp2 available but not ldexp.
llvm-svn: 200795
No functional change. Updated loops from:
for (I = scc_begin(), E = scc_end(); I != E; ++I)
to:
for (I = scc_begin(); !I.isAtEnd(); ++I)
for the win.
llvm-svn: 200789
rdar://problem/13729466
llvm-svn: 200771
Add the missing transformation strchr(p, 0) -> p + strlen(p) to SimplifyLibCalls
and remove the ToDo comment.
Reviewer: Duncan P. N. Exon Smith
llvm-svn: 200736
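The transformation is valid because strchr(p, 0) returns a pointer to the
terminating NUL, which is exactly p + strlen(p). A quick self-checking C++
example:

    #include <cassert>
    #include <cstring>

    int main() {
      const char *P = "hello";
      // strchr with a NUL search character points at the terminating '\0',
      // i.e. the same address as P + strlen(P).
      assert(strchr(P, 0) == P + strlen(P));
      assert(*strchr(P, 0) == '\0');
      return 0;
    }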
care how many bytes you were trying to transfer. Sink that safety test after those transforms. Noticed by inspection.
llvm-svn: 200726
It disturbs the layout of the parameters in memory and registers,
leading to problems in the backend.
The plan for optimizing internal inalloca functions going forward is to
essentially SROA the argument memory and demote any captured arguments
(things that aren't trivially written by a load or store) to an indirect
pointer to a static alloca.
llvm-svn: 200717
LowerExpectIntrinsic previously only understood the idiom of an expect
intrinsic followed by a comparison with zero. For llvm.expect.i1, the
comparison would be stripped by the early-cse pass.
Patch by Daniel Micay.
llvm-svn: 200664
unrolling heuristic per default
Benchmarking on x86_64 (thanks Chandler!) and ARM has shown those options speed
up some benchmarks while not causing any interesting regressions.
llvm-svn: 200621
LCSSA when we promote to SSA registers inside of LICM.
Currently, this is actually necessary. The promotion logic in LICM uses
SSAUpdater which doesn't understand how to place LCSSA PHI nodes.
Teaching it to do so would be a very significant undertaking. It may be
worthwhile and I've left a FIXME about this in the code as well as
starting a thread on llvmdev to try to figure out the right long-term
solution.
For now, the PR needs to be fixed. Short of using the promotion
SSAUpdater to place both the LCSSA PHI nodes and the promoted PHI nodes,
I don't see a cleaner or cheaper way of achieving this. Fortunately,
LCSSA is relatively lazy and sparse -- it should only update
instructions which need it. We can also skip the recursive variant when
we don't promote to SSA values.
llvm-svn: 200612
llvm-svn: 200611
This reverts commit r200576. It broke 32-bit self-host builds by
vectorizing two calls to @llvm.bswap.i64, which we then fail to expand.
llvm-svn: 200602
transform accordingly. Based on similar code from Loop vectorization.
Subsequent commits will include vectorization of function calls to
vector intrinsics and form function calls to vector library calls.
Patch by Raul Silvera! (Much delayed due to my not running dcommit)
llvm-svn: 200576
loop vectorizer to not do so when runtime pointer checks are needed and
share code with the new (not yet enabled) load/store saturation runtime
unrolling. Also ensure that we only consider the runtime checks when the
loop hasn't already been vectorized. If it has, the runtime check cost
has already been paid.
I've fleshed out a test case to cover the scalar unrolling as well as
the vector unrolling and comment clearly why we are or aren't following
the pattern.
llvm-svn: 200530
The entry block of a function starts with all the static allocas. The change
in r195513 splits the block before those allocas, which has the effect of
turning them into dynamic allocas. That breaks all sorts of things. Change to
split after the initial allocas, and also add a comment explaining why the
block is split.
llvm-svn: 200515
preserve loop simplify of enclosing loops.
The problem here starts with LoopRotation which ends up cloning code out
of the latch into the new preheader it is building. This can create
a new edge from the preheader into the exit block of the loop which
breaks LoopSimplify form. The code tries to fix this by splitting the
critical edge between the latch and the exit block to get a new exit
block that only the latch dominates. This sadly isn't sufficient.
The exit block may be an exit block for multiple nested loops. When we
clone an edge from the latch of the inner loop to the new preheader
being built in the outer loop, we create an exiting edge from the outer
loop to this exit block. Despite breaking the LoopSimplify form for the
inner loop, this is fine for the outer loop. However, when we split the
edge from the inner loop to the exit block, we create a new block which
is in neither the inner nor outer loop as the new exit block. This is
a predecessor to the old exit block, and so the split itself takes the
outer loop out of LoopSimplify form. We need to split every edge
entering the exit block from inside a loop nested more deeply than the
exit block in order to preserve all of the loop simplify constraints.
Once we try to do that, a problem with splitting critical edges
surfaces. Previously, we tried a very brute-force approach of updating LoopSimplify
form by re-computing it for all exit blocks. We don't need to do this,
and doing this much will sometimes but not always overlap with the
LoopRotate bug fix. Instead, the code needs to specifically handle the
cases which can start to violate LoopSimplify -- they aren't that
common. We need to see if the destination of the split edge was a loop
exit block in simplified form for the loop of the source of the edge.
For this to be true, all the predecessors need to be in the exact same
loop as the source of the edge being split. If the dest block was
originally in this form, we have to split all of the edges back into
this loop to recover it. The old mechanism of doing this was
conservatively correct because at least *one* of the exiting blocks it
rewrote was the DestBB and so the DestBB's predecessors were fixed. But
this is a much more targeted way of doing it. Making it targeted is
important, because ballooning the set of edges touched prevents
LoopRotate from being able to split edges *it* needs to split to
preserve loop simplify in a coherent way -- the critical edge splitting
would sometimes find the other edges in need of splitting but not
others.
Many, *many* thanks for help from Nick reducing these test cases
mightily. And helping lots with the analysis here as this one was quite
tricky to track down.
llvm-svn: 200393
because of the inside-out run of LoopSimplify in the LoopPassManager and
the fact that LoopSimplify couldn't be "preserved" across two
independent LoopPassManagers.
Anyways, in that case, IndVars wasn't correctly preserving an LCSSA PHI
node because it thought it was rewriting (via SCEV) the incoming value
to a loop invariant value. While it may well be invariant for the
current loop, it may be rewritten in terms of an enclosing loop's
values. This in and of itself is fine, as the LCSSA PHI node in the
enclosing loop for the inner loop value we're rewriting will have its
own LCSSA PHI node if used outside of the enclosing loop. With me so
far?
Well, the current loop and the enclosing loop may share an exiting
block and exit block, and when they do they also share LCSSA PHI nodes.
In this case, it's not valid to RAUW through the LCSSA PHI node.
Expected crazy test included.
llvm-svn: 200372
When estimating register pressure, don't count the induction variable multiple
times. It is unlikely to be unrolled. This is currently disabled and hidden
behind a flag ("enable-ind-var-reg-heur").
llvm-svn: 200371
When simplifycfg moves an instruction, it must drop metadata it doesn't know
is still valid under the changed preconditions. In particular, it must drop
the range and tbaa metadata.
The patch implements this with a utility function that drops all metadata not
on a whitelist.
llvm-svn: 200322
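A minimal standalone C++ sketch of the utility's shape (generic names and a
hypothetical whitelist; the real helper works on Instruction metadata kinds):
keep only the metadata kinds known to stay valid and drop everything else.

    #include <cassert>
    #include <map>
    #include <set>
    #include <string>

    // Stand-in: metadata attached to an instruction, keyed by kind name.
    using MetadataMap = std::map<std::string, std::string>;

    // Drop every metadata kind that is not explicitly whitelisted as still
    // valid after the instruction has been moved.
    static void dropUnknownMetadata(MetadataMap &MD,
                                    const std::set<std::string> &KnownSafe) {
      for (auto It = MD.begin(); It != MD.end();) {
        if (KnownSafe.count(It->first))
          ++It;
        else
          It = MD.erase(It);
      }
    }

    int main() {
      MetadataMap MD = {{"dbg", "line 42"}, {"range", "[0,10)"}, {"tbaa", "int"}};
      // After the instruction moves, range and tbaa may no longer hold, so only
      // debug info stays on the (hypothetical) whitelist here.
      dropUnknownMetadata(MD, {"dbg"});
      assert(MD.size() == 1 && MD.count("dbg") == 1);
      return 0;
    }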
vectorizer, placing it behind an off-by-default flag.
It turns out that block frequency isn't what we want at all, here or
elsewhere. This has been I think a nagging feeling for several of us
working with it, but Arnold has given some really nice simple examples
where the results are so comprehensively wrong that they aren't useful.
I'm planning to email the dev list with a summary of why it's not really
useful and a couple of ideas about how to better structure these types
of heuristics.
llvm-svn: 200294
Summary:
I searched Transforms/ and Analysis/ for 'ByVal' and updated those call
sites to check for inalloca if appropriate.
I added tests for any change that would allow an optimization to fire on
inalloca.
Reviewers: nlewycky
Differential Revision: http://llvm-reviews.chandlerc.com/D2449
llvm-svn: 200281
LCSSA from it caused a crasher with the LoopUnroll pass.
This crasher is really nasty. We destroy LCSSA form in a surprising way.
When unrolling a loop into an outer loop, we not only need to restore
LCSSA form for the outer loop, but for all children of the outer loop.
This is somewhat obvious in retrospect, but hey!
While this seems pretty heavy-handed, it's not that bad. Fundamentally,
we only do this when we unroll a loop, which is already a heavyweight
operation. We're unrolling all of these hypothetical inner loops as
well, so their size and complexity are already on the critical path. This
is just adding another pass over them to re-canonicalize.
I have a test case from PR18616 that is great for reproducing this, but
pretty useless to check in as it relies on many 10s of nested empty
loops that get unrolled and deleted in just the right order. =/ What's
worse is that investigating this has exposed another source of failure
that is likely to be even harder to test. I'll try to come up with test
cases for these fixes, but I want to get the fixes into the tree first
as they're causing crashes in the wild.
llvm-svn: 200273
The vectorizer takes a loop like this and widens all instructions except for the
store. The stores are scalarized/unrolled and hidden behind an "if" block.
for (i = 0; i < 128; ++i) {
  if (a[i] < 10)
    a[i] += val;
}
is transformed into:
for (i = 0; i < 128; i += 2) {
  v = a[i:i+1];
  w = v + <val, val>;          // widened add
  if ((extract v, 0) < 10)
    a[i]   = (extract w, 0);
  if ((extract v, 1) < 10)
    a[i+1] = (extract w, 1);
}
The vectorizer relies on subsequent optimizations to sink instructions into the
conditional block where they are anticipated.
The flag "vectorize-num-stores-pred" controls whether and how many stores to
handle this way. Vectorization of conditional stores is disabled by default for
now.
This patch also adds a change to the heuristic when the flag
"enable-loadstore-runtime-unroll" is enabled (off by default). It unrolls small
loops until load/store ports are saturated. This heuristic uses TTI's
getMaxUnrollFactor as a measure for load/store ports.
I also added a second flag -enable-cond-stores-vec. It will enable vectorization
of conditional stores. But there is no cost model for vectorization of
conditional stores in place yet, so this will not do much good at the moment.
rdar://15892953
Results for x86-64 -O3 -mavx +/- -mllvm -enable-loadstore-runtime-unroll
-vectorize-num-stores-pred=1 (before the BFI change):
Performance Regressions:
Benchmarks/Ptrdist/yacr2/yacr2 7.35% (maze3() is identical but 10% slower)
Applications/siod/siod 2.18%
Performance improvements:
mesa -4.42%
libquantum -4.15%
With a patch that slightly changes the register heuristics (by subtracting the
induction variable on both sides of the register pressure equation, as the
induction variable is probably not really unrolled):
Performance Regressions:
Benchmarks/Ptrdist/yacr2/yacr2 7.73%
Applications/siod/siod 1.97%
Performance Improvements:
libquantum -13.05% (we now also unroll quantum_toffoli)
mesa -4.27%
llvm-svn: 200270
uint32.
When folding branches to a common destination, the updated branch weights
can exceed uint32 by more than a factor of 2. We should keep halving the
weights until they can fit into uint32.
llvm-svn: 200262
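A standalone C++ sketch of the halving described above (illustrative; the real
code operates on branch_weights metadata): scale all weights down by powers of
two until the largest fits into uint32, keeping their relative ratios.

    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <limits>
    #include <vector>

    // Halve all weights together until the largest one fits into uint32, so the
    // relative probabilities are (approximately) preserved.
    static void fitWeightsInUInt32(std::vector<uint64_t> &Weights) {
      const uint64_t Max = std::numeric_limits<uint32_t>::max();
      while (*std::max_element(Weights.begin(), Weights.end()) > Max)
        for (uint64_t &W : Weights)
          W = std::max<uint64_t>(W / 2, 1); // keep weights non-zero
    }

    int main() {
      // A set of weights that overflows uint32 by more than a factor of 2.
      std::vector<uint64_t> W = {uint64_t(1) << 40, uint64_t(1) << 33, 7};
      fitWeightsInUInt32(W);
      for (uint64_t X : W)
        assert(X <= std::numeric_limits<uint32_t>::max());
      assert(W[0] > W[1] && W[1] > W[2]); // relative order preserved
      return 0;
    }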
cold loops as-if they were being optimized for size.
Nothing fancy here. Simple test case included. The nice thing is that we
can now incrementally build on top of this to drive other heuristics.
All of the infrastructure work is done to get the profile information
into this layer.
The remaining work necessary to make this a fully general purpose loop
unroller for very hot loops is to make it a fully general purpose loop
unroller. Things I know of but am not going to have time to benchmark
and fix in the immediate future:
1) Don't disable the entire pass when the target is lacking vector
registers. This really doesn't make any sense any more.
2) Teach the unroller at least and the vectorizer potentially to handle
non-if-converted loops. This is trivial for the unroller but hard for
the vectorizer.
3) Compute the relative hotness of the loop and thread that down to the
various places that make cost tradeoffs (very likely only the
unroller makes sense here, and then only when dealing with loops that
are small enough for unrolling to not completely blow out the LSD).
I'm still dubious how useful hotness information will be. So far, my
experiments show that if we can get the correct logic for determining
when unrolling actually helps performance, the code size impact is
completely unimportant and we can unroll in all cases. But at least
we'll no longer burn code size on cold code.
One somewhat unrelated idea that I've had forever but not had time to
implement: mark all functions which are only reachable via the global
constructors rigging in the module as optsize. This would also decrease
the impact of any more aggressive heuristics here on code size.
llvm-svn: 200219
Insert before the terminating instruction of the dominating block instead.
llvm-svn: 200218
to stabilize a test that really is trying to test generic behavior and
not a specific target's behavior.
llvm-svn: 200215
object and fewer pointless variables.
Also, add a clarifying comment and a FIXME because the code which
disables *all* vectorization if we can't use implicit floating point
instructions just makes no sense at all.
llvm-svn: 200214
powers of two. This is essentially always the correct thing given the
impact on alignment, scaling factors that can be used in addressing
modes, etc. Also, fix the management of the unroll vs. small loop cost
to more accurately model things in this world.
Enhance a test case to actually exercise more of the unroll machinery if
using synthetic constants rather than a specific target model. Before
this change, with the added flags this test will unroll 3 times instead
of either 2 or 4 (the two sensible answers).
While I don't expect this to make a huge difference, if there are lots
of loops sitting right on the edge of hitting the 'small unroll' factor,
they might change behavior. However, I've benchmarked moving the small
loop cost up and down in many various ways and by a huge factor (2x)
without seeing more than 0.2% code size growth. Small adjustments such
as the series that led up here have led to about 1% improvement on some
benchmarks, but it is very close to the noise floor so I mostly checked
that nothing regressed. Let me know if you see bad behavior on other
targets but I don't expect this to be a sufficiently dramatic change to
trigger anything.
llvm-svn: 200213
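For reference, one simple way to clamp an unroll count to a power of two, as a
small self-checking C++ sketch (not the vectorizer's code):

    #include <cassert>
    #include <cstdint>

    // Round a positive unroll count down to the nearest power of two, which
    // keeps addressing-mode scale factors and alignment friendly as described
    // above.
    static uint32_t roundDownToPowerOfTwo(uint32_t N) {
      assert(N > 0);
      uint32_t P = 1;
      while (P <= N / 2)
        P *= 2;
      return P;
    }

    int main() {
      assert(roundDownToPowerOfTwo(1) == 1);
      assert(roundDownToPowerOfTwo(3) == 2);
      assert(roundDownToPowerOfTwo(4) == 4);
      assert(roundDownToPowerOfTwo(7) == 4);
      assert(roundDownToPowerOfTwo(12) == 8);
      return 0;
    }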
with the unrolling behavior in the loop vectorizer. No functionality
changed at this point.
These are a bit hack-y, but talking with Hal, there doesn't seem to be
a cleaner way to easily experiment with different thresholds here and he
was also interested in them so I wanted to commit them. Suggestions for
improvement are very welcome here.
llvm-svn: 200212
number of vector registers rather than toggling between vector and
scalar register number based on VF. I don't have a test case as
I spotted this by inspection and on X86 it only makes a difference if
your target is lacking SSE and thus has *no* vector registers.
If someone wants to add a test case for this for ARM or somewhere else
where this is more significant, that would be awesome.
Also made the variable name a bit more sensible while I'm here.
llvm-svn: 200211
LoopVectorize pass.
The logic here doesn't make much sense. We *only* unrolled if the
unvectorized loop was a reduction loop with a single basic block *and*
small loop body. The reduction part in particular doesn't make much
sense. Instead, if we just fall through to the vectorized unroll logic
it makes more sense to unroll if there is a vectorized reduction that
could be hacked on by the SLP vectorizer *or* if the loop is small.
This is mostly a cleanup and nothing in the test suite really exercises
this, but I did run benchmarks across this change and saw no really
significant changes.
llvm-svn: 200198
a FunctionPass. With this change the loop vectorizer no longer is a loop
pass and can readily depend on function analyses. In particular, with
this change we no longer have to form a loop pass manager to run the
loop vectorizer which simplifies the entire pass management of LLVM.
The next step here is to teach the loop vectorizer to leverage profile
information through the profile information providing analysis passes.
llvm-svn: 200074
the loops in a function, and teach LICM to work in the presence of
LCSSA.
Previously, LCSSA was a loop pass. That made passes requiring it also be
loop passes and unable to depend on function analysis passes easily. It
also caused outer loops to have a different "canonical" form from inner
loops during analysis. Instead, we go into LCSSA form and preserve it
through the loop pass manager run.
Note that this has the same problem as LoopSimplify that prevents
enabling its verification -- loop passes which run at the end of the loop
pass manager and don't preserve these are valid, but the subsequent loop
pass runs of outer loops that do preserve this pass trigger too much
verification and fail because the inner loop no longer verifies.
The other problem this exposed is that LICM was completely unable to
handle LCSSA form. It didn't preserve it and it actually would give up
on moving instructions in many cases when they were used by an LCSSA phi
node. I've taught LICM to support detecting LCSSA-form PHI nodes and to
hoist and sink around them. This may actually let LICM fire
significantly more because we put everything into LCSSA form to rotate
the loop before running LICM. =/ Now LICM should handle that fine and
preserve it correctly. The down side is that LICM has to require LCSSA
in order to preserve it. This is just a fact of life for LCSSA. It's
entirely possible we should completely remove LCSSA from the optimizer.
The test updates are essentially accommodating LCSSA phi nodes in the
output of LICM, and the fact that we now completely sink every
instruction in ashr-crash below the loop bodies prior to unrolling.
With this change, LCSSA is computed only three times in the pass
pipeline. One of them could be removed (and potentially a SCEV run and
a separate LoopPassManager entirely!) if we had a LoopPass variant of
InstCombine that ran InstCombine on the loop body but refused to combine
away LCSSA PHI nodes. Currently, this also prevents loop unrolling from
being in the same loop pass manager as rotate, LICM, and unswitch.
There is one thing that I *really* don't like -- preserving LCSSA in
LICM is quite expensive. We end up having to re-run LCSSA twice for some
loops after LICM runs because LICM can undo LCSSA both in the current
loop and the parent loop. I don't really see good solutions to this
other than to completely move away from LCSSA and using tools like
SSAUpdater instead.
llvm-svn: 200067