bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message	Author	Age	Files	Lines
*	Codegen: Make chains from trellis-shaped CFGs	Kyle Butt	2017-02-15	5	-7/+6
	Lay out trellis-shaped CFGs optimally.

	A trellis of the shape below:

	  A     B
	  |\   /|
	  | \ / |
	  |  X  |
	  | / \ |
	  |/   \|
	  C     D

	would be laid out A; B->C; D by the current layout algorithm. Now we identify trellises and lay them out either A->C; B->D or A->D; B->C. This scales with an increasing number of predecessors. A trellis is a group of 2 or more predecessor blocks that all have the same successors.

	Because of this, we can tail duplicate to extend existing trellises. As an example, consider the following CFG:

	    B   D   F   H
	   / \ / \ / \ / \
	  A---C---E---G---Ret

	where A, C, E and G are all small (currently 2 instructions). The CFG-preserving layout is then A,B,C,D,E,F,G,H,Ret.

	The current code will copy C into B, E into D and G into F and yield the layout A,C,B(C),E,D(E),G,F(G),H,Ret.

	define void @straight_test(i32 %tag) {
	entry:
	  br label %test1
	test1: ; A
	  %tagbit1 = and i32 %tag, 1
	  %tagbit1eq0 = icmp eq i32 %tagbit1, 0
	  br i1 %tagbit1eq0, label %test2, label %optional1
	optional1: ; B
	  call void @a()
	  br label %test2
	test2: ; C
	  %tagbit2 = and i32 %tag, 2
	  %tagbit2eq0 = icmp eq i32 %tagbit2, 0
	  br i1 %tagbit2eq0, label %test3, label %optional2
	optional2: ; D
	  call void @b()
	  br label %test3
	test3: ; E
	  %tagbit3 = and i32 %tag, 4
	  %tagbit3eq0 = icmp eq i32 %tagbit3, 0
	  br i1 %tagbit3eq0, label %test4, label %optional3
	optional3: ; F
	  call void @c()
	  br label %test4
	test4: ; G
	  %tagbit4 = and i32 %tag, 8
	  %tagbit4eq0 = icmp eq i32 %tagbit4, 0
	  br i1 %tagbit4eq0, label %exit, label %optional4
	optional4: ; H
	  call void @d()
	  br label %exit
	exit:
	  ret void
	}

	Here is the layout after D27742:

	straight_test: # @straight_test
	; ... Prologue elided
	; BB#0: # %entry ; A (merged with test1)
	; ... More prologue elided
	  mr 30, 3
	  andi. 3, 30, 1
	  bc 12, 1, .LBB0_2
	; BB#1: # %test2 ; C
	  rlwinm. 3, 30, 0, 30, 30
	  beq 0, .LBB0_3
	  b .LBB0_4
	.LBB0_2: # %optional1 ; B (copy of C)
	  bl a
	  nop
	  rlwinm. 3, 30, 0, 30, 30
	  bne 0, .LBB0_4
	.LBB0_3: # %test3 ; E
	  rlwinm. 3, 30, 0, 29, 29
	  beq 0, .LBB0_5
	  b .LBB0_6
	.LBB0_4: # %optional2 ; D (copy of E)
	  bl b
	  nop
	  rlwinm. 3, 30, 0, 29, 29
	  bne 0, .LBB0_6
	.LBB0_5: # %test4 ; G
	  rlwinm. 3, 30, 0, 28, 28
	  beq 0, .LBB0_8
	  b .LBB0_7
	.LBB0_6: # %optional3 ; F (copy of G)
	  bl c
	  nop
	  rlwinm. 3, 30, 0, 28, 28
	  beq 0, .LBB0_8
	.LBB0_7: # %optional4 ; H
	  bl d
	  nop
	.LBB0_8: # %exit ; Ret
	  ld 30, 96(1) # 8-byte Folded Reload
	  addi 1, 1, 112
	  ld 0, 16(1)
	  mtlr 0
	  blr

	The tail-duplication has produced some benefit, but it has also produced a trellis which is not laid out optimally. With this patch, we improve the layouts of such trellises, and decrease the cost calculation for tail-duplication accordingly.

	This patch produces the layout A,C,E,G,B,D,F,H,Ret. This layout does have back edges, which is a negative, but it has a bigger compensating positive, which is that it handles the case where there are long strings of skipped blocks much better than the original layout. Both layouts handle runs of executed blocks equally well. Branch prediction also improves if there is any correlation between subsequent optional blocks.

	Here is the resulting concrete layout:

	straight_test: # @straight_test
	; BB#0: # %entry ; A (merged with test1)
	  mr 30, 3
	  andi. 3, 30, 1
	  bc 12, 1, .LBB0_4
	; BB#1: # %test2 ; C
	  rlwinm. 3, 30, 0, 30, 30
	  bne 0, .LBB0_5
	.LBB0_2: # %test3 ; E
	  rlwinm. 3, 30, 0, 29, 29
	  bne 0, .LBB0_6
	.LBB0_3: # %test4 ; G
	  rlwinm. 3, 30, 0, 28, 28
	  bne 0, .LBB0_7
	  b .LBB0_8
	.LBB0_4: # %optional1 ; B (Copy of C)
	  bl a
	  nop
	  rlwinm. 3, 30, 0, 30, 30
	  beq 0, .LBB0_2
	.LBB0_5: # %optional2 ; D (Copy of E)
	  bl b
	  nop
	  rlwinm. 3, 30, 0, 29, 29
	  beq 0, .LBB0_3
	.LBB0_6: # %optional3 ; F (Copy of G)
	  bl c
	  nop
	  rlwinm. 3, 30, 0, 28, 28
	  beq 0, .LBB0_8
	.LBB0_7: # %optional4 ; H
	  bl d
	  nop
	.LBB0_8: # %exit

	Differential Revision: https://reviews.llvm.org/D28522
	llvm-svn: 295223
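The trellis definition in the message above (2 or more predecessor blocks that all have the same successors) can be sketched as a grouping over successor sets. This is only an illustration, not the pass's implementation: the function name `find_trellises` and the dictionary encoding of the CFG are hypothetical, and the edges are read off the post-D27742 assembly, where B, D and F carry the tail-duplicated copies of C, E and G.

```python
from collections import defaultdict


def find_trellises(successors):
    """Group blocks by successor set; keep groups with 2+ predecessors.

    `successors` maps a block name to the list of its successor names.
    This encoding is a hypothetical one chosen for the sketch.
    """
    groups = defaultdict(list)
    for block, succs in successors.items():
        if succs:  # blocks without successors cannot feed a trellis
            groups[frozenset(succs)].append(block)
    return {s: sorted(p) for s, p in groups.items() if len(p) >= 2}


# CFG of straight_test after tail duplication, per the listing above.
cfg = {
    "A": ["C", "B"],
    "C": ["E", "D"],
    "B": ["E", "D"],   # B ends with a copy of C's test
    "E": ["G", "F"],
    "D": ["G", "F"],   # D ends with a copy of E's test
    "G": ["Ret", "H"],
    "F": ["Ret", "H"],  # F ends with a copy of G's test
    "H": ["Ret"],
    "Ret": [],
}

trellises = find_trellises(cfg)
for succs, preds in trellises.items():
    print(preds, "->", sorted(succs))
```

Each group found this way corresponds to one of the stacked trellises that the patch lays out as A,C,E,G followed by B,D,F,H.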
*	[AMDGPU] Revert failed scheduling	Stanislav Mekhanoshin	2017-02-15	4	-33/+626
	This patch reverts a region's scheduling to the original untouched state if we have decreased occupancy. In addition, it switches to the TargetRegisterInfo occupancy callback for pressure limits instead of gradually increasing limits that were just passed. We are going to stay with the best schedule, so we do not need to tolerate worsened scheduling anymore. Differential Revision: https://reviews.llvm.org/D29971 llvm-svn: 295206
*	[AMDGPU] Fix MaxWorkGroupsPerCU for large workgroups	Stanislav Mekhanoshin	2017-02-15	1	-2/+4
	This patch corrects the maximum workgroups per CU if we have big workgroups (more than 128). This calculation contributes to the occupancy calculation with respect to LDS size. Differential Revision: https://reviews.llvm.org/D29974 llvm-svn: 295134
*	Revert "[AMDGPU] Fix for SIMachineScheduler crash. SI Scheduler should track"	Alexander Timofeev	2017-02-14	1	-49/+0
	This reverts commit ce06d9cb99298eb844b66e117f5108a06747c907. llvm-svn: 295054
*	AMDGPU : Add trap handler support.	Wei Ding	2017-02-10	2	-6/+75
	Differential Revision: http://reviews.llvm.org/D26010 llvm-svn: 294692
*	[AMDGPU] Override PSet for M0	Stanislav Mekhanoshin	2017-02-10	1	-0/+57
	This change returns an empty PSet list for the M0 register. Otherwise its PSet, as defined by tablegen, is SReg_32. This results in incorrect register pressure calculation every time an instruction uses M0. Such uses count as SReg_32 PSet and inadequately increase pressure on SGPRs. Differential Revision: https://reviews.llvm.org/D29798 llvm-svn: 294691
*	AMDGPU: Add pass to expand memcpy/memmove/memset	Matt Arsenault	2017-02-09	1	-0/+117
	llvm-svn: 294635
*	[AMDGPU] Calculate number of min/max SGPRs/VGPRs for WavesPerEU instead of ↵	Konstantin Zhuravlyov	2017-02-09	1	-3/+3
	using switch statement Differential Revision: https://reviews.llvm.org/D29741 llvm-svn: 294627
*	[AMDGPU] Add target information that is required by tools to metadata	Konstantin Zhuravlyov	2017-02-08	4	-7/+11
	Differential Revision: https://reviews.llvm.org/D28760#fb670e28 llvm-svn: 294449
*	AMDGPU: Enable InferAddressSpaces	Matt Arsenault	2017-02-08	2	-22/+22
	llvm-svn: 294408
*	[AMDGPU] Fix for SIMachineScheduler crash. SI Scheduler should track	Alexander Timofeev	2017-02-07	1	-0/+49
	lane masks. Differential revision: https://reviews.llvm.org/D29442 llvm-svn: 294324
*	[AMDGPU] Lower null pointers in static variable initializer	Yaxun Liu	2017-02-07	1	-0/+113
	For amdgcn target Clang generates addrspacecast to represent null pointers in private and local address spaces. In LLVM codegen, the static variable initializer is lowered by virtual function AsmPrinter::lowerConstant which is target generic. Since addrspacecast is target specific, AsmPrinter::lowerConst This patch overrides AsmPrinter::lowerConstant with AMDGPUAsmPrinter::lowerConstant, which is able to lower the target-specific addrspacecast in the null pointer representation so that -1 is co Differential Revision: https://reviews.llvm.org/D29284 llvm-svn: 294265
*	[RegisterCoalescer] Do not call getInstructionIndex with DBG_VALUE	Brendon Cahoon	2017-02-04	1	-0/+76
	An assert occurs when calling SlotIndexes::getInstructionIndex with a DBG_VALUE instruction because the function expects an instruction with a slot index. However, there is no slot index for a DBG_VALUE instruction. Differential Revision: https://reviews.llvm.org/D29048 llvm-svn: 294070
*	AMDGPU: Cleanup scalar_to_vector test	Matt Arsenault	2017-02-03	1	-13/+13
	llvm-svn: 294038
*	AMDGPU: Set MCAsmInfo::PointerSize	Matt Arsenault	2017-02-03	1	-0/+26
	llvm-svn: 294031
*	AMDGPU: Fold fneg into fmin/fmax_legacy	Matt Arsenault	2017-02-03	2	-0/+79
	llvm-svn: 293972
*	AMDGPU: Fold fneg into fminnum/fmaxnum	Matt Arsenault	2017-02-03	1	-0/+264
	llvm-svn: 293968
*	llvm-readobj: fix next note entry calculation and print unknown note types	Konstantin Zhuravlyov	2017-02-02	1	-1/+7
	Differential Revision: https://reviews.llvm.org/D29131 llvm-svn: 293964
*	AMDGPU: Check if users of fneg can fold mods	Matt Arsenault	2017-02-02	5	-76/+502
	In multi-use cases this can save a few instructions. llvm-svn: 293962
*	Revert "In visitSTORE, always use FindBetterChain, rather than only when ↵	Nirav Dave	2017-02-02	5	-36/+46
	UseAA is enabled." This reverts commit r293893, which is miscompiling lua on ARM and bootstrapping for x86-windows. llvm-svn: 293915
*	In visitSTORE, always use FindBetterChain, rather than only when UseAA is ↵	Nirav Dave	2017-02-02	5	-46/+36
	enabled.

	Recommitting after fixing X86 inc/dec chain bug.

	* Simplify Consecutive Merge Store Candidate Search

	Now that address aliasing is much less conservative, push through simplified store merging search and chain alias analysis which only checks for parallel stores through the chain subgraph. This is cleaner, as it separates non-interfering loads/stores from the store-merging logic.

	When merging stores, search up the chain through a single load, and find all possible stores by looking down through a load and a TokenFactor to all stores visited.

	This improves the quality of the output SelectionDAG and the output Codegen (save perhaps for some ARM cases where we correctly construct wider loads, but then promote them to float operations which appear to require more expensive constant generation).

	Some minor peephole optimizations to deal with improved SubDAG shapes (listed below).

	Additional Minor Changes:

	1. Finishes removing unused AliasLoad code
	2. Unifies the chain aggregation in the merged stores across code paths
	3. Re-add the Store node to the worklist after calling SimplifyDemandedBits.
	4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but seems sufficient to not cause regressions in tests.
	5. Remove Chain dependencies of Memory operations on CopyfromReg nodes as these are captured by data dependence
	6. Forward loads-store values through tokenfactors containing {CopyToReg,CopyFromReg} Values.
	7. Peephole to convert buildvector of extract_vector_elt to extract_subvector if possible (see CodeGen/AArch64/store-merge.ll)
	8. Store merging for the ARM target is restricted to 32-bit, as in some contexts invalid 64-bit operations are being generated. This can be removed once appropriate checks are added.

	This finishes the change Matt Arsenault started in r246307 and jyknight's original patch.

	Many tests required some changes as memory operations are now reorderable, improving load-store forwarding. One test in particular is worth noting:

	CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store forwarding converts a load-store pair into a parallel store and a memory-realized bitcast of the same value. However, because we lose the sharing of the explicit and implicit store values we must create another local store. A similar transformation happens before SelectionDAG as well.

	Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
	llvm-svn: 293893
*	AMDGPU: Use source modifiers with f16->f32 conversions	Matt Arsenault	2017-02-02	7	-106/+375
	The operand types were defined to fit the fp16_to_fp node, which has the half as an integer type. v_cvt_f32_f16 does support source modifiers, so change this to have an FP type and modifiers. For targets without legal f16, this requires recognizing the bit operations and trying to produce them. llvm-svn: 293857
*	[AMDGPU] Account workgroup size in LDS occupancy limits	Stanislav Mekhanoshin	2017-02-01	2	-26/+34
	Functions matching LDS use to occupancy return results for a workgroup of 64 workitems. The numbers have to be adjusted for bigger workgroups. For example, a workgroup of size 256 already occupies 4 waves just by itself. Given that all numbers of LDS use in the compiler are per workgroup, occupancy shall be multiplied by 4 in this case. Each 64 workitems are still limited by the same number, but 4 subgroups of 64 workitems each can afford 4 times more LDS to get the same occupancy. In addition, the change initializes LDS size in the subtarget to a real value for SI+ targets. This is required since LDS size is a variable in these calculations. Differential Revision: https://reviews.llvm.org/D29423 llvm-svn: 293837
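The workgroup-size adjustment described above can be sketched as a simple scaling. This is a minimal illustration under stated assumptions, not the subtarget code: the helper names are hypothetical, the wave size of 64 comes from the message, and any hardware cap on the resulting occupancy is deliberately left out.

```python
WAVE_SIZE = 64  # workitems per wave, as used in the message above


def waves_per_workgroup(workgroup_size):
    """Number of 64-workitem waves a workgroup occupies (rounded up)."""
    return (workgroup_size + WAVE_SIZE - 1) // WAVE_SIZE


def scale_occupancy(occupancy_for_64, workgroup_size):
    """Scale an occupancy computed for a 64-workitem workgroup.

    Hypothetical helper: LDS use is counted per workgroup, so a larger
    workgroup spreads the same LDS budget over more waves.
    """
    return occupancy_for_64 * waves_per_workgroup(workgroup_size)


print(waves_per_workgroup(256))   # the 256-workitem example from the message
print(scale_occupancy(2, 256))
```

For the message's example, a 256-workitem workgroup spans 4 waves, so an LDS-limited occupancy computed for 64 workitems is multiplied by 4.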
*	AMDGPU: Improve nsw/nuw/exact when promoting uniform i16 ops	Matt Arsenault	2017-02-01	1	-44/+44
	These were simply preserving the flags of the original operation, which was too conservative in most cases and incorrect for mul. nsw/nuw may be needed for some combines to clean up messes when intermediate sext_inregs are introduced later. Tested valid combinations with alive. llvm-svn: 293776
*	CodeGen: Allow small copyable blocks to "break" the CFG.	Kyle Butt	2017-01-31	2	-10/+28
	When choosing the best successor for a block, ordinarily we would have preferred a block that preserves the CFG unless there is a strong probability in the other direction. For small blocks that can be duplicated we now skip that requirement as well, subject to some simple frequency calculations. Differential Revision: https://reviews.llvm.org/D28583 llvm-svn: 293716
*	AMDGPU: Use source mods with fcanonicalize	Matt Arsenault	2017-01-31	2	-0/+105
	llvm-svn: 293654
*	AMDGPU/SI: Fix inst-select-load-smrd.mir on some builds	Tom Stellard	2017-01-31	1	-12/+12
	Summary: For some reason instructions are being inserted in the wrong order with some builds. I'm not sure why this is happening. Reviewers: arsenm Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, tony-tye, tpr, llvm-commits Differential Revision: https://reviews.llvm.org/D29325 llvm-svn: 293639
*	[DAGCombine] require UnsafeFPMath for re-association of addition	Nicolai Haehnle	2017-01-31	2	-49/+64
	Summary: The affected transforms all implicitly use associativity of addition, for which we usually require unsafe math to be enabled. The "Aggressive" flag is only meant to convey information about the performance of the fused ops relative to a fmul+fadd sequence. Fixes Bug 31626. Reviewers: spatel, hfinkel, mehdi_amini, arsenm, tstellarAMD Subscribers: jholewinski, nemanjai, wdng, llvm-commits Differential Revision: https://reviews.llvm.org/D28675 llvm-svn: 293635
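Why re-association of floating-point addition is treated as unsafe can be seen with a small IEEE-754 double example. This is a generic illustration of non-associativity, not a reproduction of the DAGCombine transforms in the patch; the values are chosen for the sketch.

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # a + b is exactly 0.0, so the result is 1.0
right = a + (b + c)   # b + c rounds back to -1e16, so the result is 0.0

print(left, right)
```

The two groupings differ because adjacent doubles near 1e16 are 2 apart, so adding 1.0 to -1e16 is lost to rounding.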
*	AMDGPU: Generalize matching of v_med3_f32	Matt Arsenault	2017-01-31	2	-6/+760
	I think this is safe as long as no inputs are known to ever be nans. Also add an intrinsic for fmed3 to be able to handle all safe math cases. llvm-svn: 293598
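A small check can show why a median-of-three can stand in for a clamp when no input is NaN. The sketch assumes the pattern of interest is the clamp-style `min(max(x, k0), k1)` with `k0 <= k1`; that assumption, and the helper names `med3` and `clamp_min_max`, are not taken from the commit.

```python
import itertools


def med3(a, b, c):
    """Median of three values (NaN-free inputs assumed)."""
    return sorted((a, b, c))[1]


def clamp_min_max(x, k0, k1):
    """Clamp written as a min/max pair, the assumed source pattern."""
    return min(max(x, k0), k1)


samples = [-2.0, -0.5, 0.0, 0.25, 1.0, 3.0]
mismatches = [
    (x, k0, k1)
    for x, k0, k1 in itertools.product(samples, repeat=3)
    if k0 <= k1 and clamp_min_max(x, k0, k1) != med3(x, k0, k1)
]
print(mismatches)
```

With a NaN among the inputs the ordering comparisons stop being meaningful, which is consistent with the caveat in the message.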
*	Re-commit AMDGPU/GlobalISel: Add support for simple shaders	Tom Stellard	2017-01-30	6	-0/+368
	Fix build when global-isel is disabled and fix a warning. Summary: We can select constant/global G_LOAD, global G_STORE, and G_GEP. Reviewers: qcolombet, MatzeB, t.p.northover, ab, arsenm Subscribers: mehdi_amini, vkalintiris, kzhuravl, wdng, nhaehnle, mgorny, yaxunl, tony-tye, modocache, llvm-commits, dberris Differential Revision: https://reviews.llvm.org/D26730 llvm-svn: 293551
*	[AMDGPU] Internalize non-kernel symbols	Stanislav Mekhanoshin	2017-01-30	1	-0/+35
	Since we have no call support and late linking, we can produce code only for used symbols. This saves compilation time, size of the final executable, and size of any intermediate dumps. Run the Internalize pass early in the opt pipeline, followed by the global DCE pass. To enable it, RT can pass the -amdgpu-internalize-symbols option. Differential Revision: https://reviews.llvm.org/D29214 llvm-svn: 293549
*	AMDGPU: Undo sub x, c -> add x, -c canonicalization	Matt Arsenault	2017-01-30	3	-2/+197
	This is worse if the original constant is an inline immediate. This should also be done for 64-bit adds, but requires fixing operand folding bugs first. llvm-svn: 293540
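The cost of the `sub x, c -> add x, -c` canonicalization can be sketched with a toy predicate. The inline integer range of -16..64 used here is an assumption about the target's inline immediates, and `is_inline_imm` and `prefer_sub` are hypothetical names, not the backend's API.

```python
# Assumed range of integer inline immediates: -16 through 64 inclusive.
INLINE_INT_RANGE = range(-16, 65)


def is_inline_imm(value):
    """True if `value` fits the assumed inline integer range."""
    return value in INLINE_INT_RANGE


def prefer_sub(c):
    """True when `sub x, c` keeps an inline immediate that `add x, -c` loses."""
    return is_inline_imm(c) and not is_inline_imm(-c)


for c in (8, 32, 64, 100):
    print(c, prefer_sub(c))
```

Under that assumption, a constant such as 32 is inline while -32 is not, so undoing the canonicalization avoids materializing a literal.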
*	AMDGPU: Make i32 uaddo/usubo legal	Matt Arsenault	2017-01-30	2	-58/+175
	llvm-svn: 293514
*	DAG: Fold fneg into compare with constant into the constant	Matt Arsenault	2017-01-30	1	-0/+257
	fcmp (fneg x), c, pred -> fcmp x, -c, (swap pred) InstCombine already does this. llvm-svn: 293512
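The rewrite `fcmp (fneg x), c, pred -> fcmp x, -c, (swap pred)` can be checked numerically on ordinary values. This is a sanity sketch only: the `SWAPPED` table is a hypothetical stand-in for predicate swapping, and NaN inputs and unordered predicates are not covered.

```python
import itertools
import operator

# Swapped form of each ordered comparison; eq and ne are their own swaps.
SWAPPED = {
    operator.lt: operator.gt,
    operator.le: operator.ge,
    operator.gt: operator.lt,
    operator.ge: operator.le,
    operator.eq: operator.eq,
    operator.ne: operator.ne,
}

values = [-2.5, -1.0, 0.0, 1.0, 2.5]

# Compare `pred(-x, c)` with `swapped_pred(x, -c)` for every combination.
all_match = all(
    pred(-x, c) == SWAPPED[pred](x, -c)
    for pred in SWAPPED
    for x, c in itertools.product(values, repeat=2)
)
print(all_match)
```

Negating both sides of an inequality flips its direction, which is why the predicate has to be swapped when the negation moves onto the constant.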
*	Revert "AMDGPU/GlobalISel: Add support for simple shaders"	Tom Stellard	2017-01-30	6	-356/+0
	This reverts commit r293503. Revert while I investigate some of the buildbot failures. llvm-svn: 293509
*	AMDGPU/GlobalISel: Add support for simple shaders	Tom Stellard	2017-01-30	6	-0/+356
	Summary: We can select constant/global G_LOAD, global G_STORE, and G_GEP. Reviewers: qcolombet, MatzeB, t.p.northover, ab, arsenm Subscribers: mehdi_amini, vkalintiris, kzhuravl, wdng, nhaehnle, mgorny, yaxunl, tony-tye, modocache, llvm-commits, dberris Differential Revision: https://reviews.llvm.org/D26730 llvm-svn: 293503
*	DAG: Constant fold fp16_to_fp/fp_to_fp16	Matt Arsenault	2017-01-30	11	-158/+116
	This fixes emitting conversions of constants on targets without legal f16 that need to use these for legalization. llvm-svn: 293499
*	AMDGPU: Enable FeatureFlatForGlobal on Volcanic Islands	Matt Arsenault	2017-01-27	2	-26/+54
	Accomplishes what r292982 was supposed to, which ended up only really making the necessary test changes. This should be applied to the 4.0 branch. Patch by Vedran Miletić <vedran@miletic.net> llvm-svn: 293310
*	[AMDGPU] Turn AMDGPUUnifyMetadata back into module pass	Stanislav Mekhanoshin	2017-01-27	1	-4/+0
	With the adjustPassManager interface, it is now possible to use custom early module passes. Differential Revision: https://reviews.llvm.org/D29189 llvm-svn: 293300
*	Revert "In visitSTORE, always use FindBetterChain, rather than only when ↵	Nirav Dave	2017-01-26	5	-36/+46
	UseAA is enabled." This reverts commit r293184, which is failing in LTO builds. llvm-svn: 293188
*	In visitSTORE, always use FindBetterChain, rather than only when UseAA is ↵	Nirav Dave	2017-01-26	5	-46/+36
	enabled.

	* Simplify Consecutive Merge Store Candidate Search

	Now that address aliasing is much less conservative, push through simplified store merging search and chain alias analysis which only checks for parallel stores through the chain subgraph. This is cleaner, as it separates non-interfering loads/stores from the store-merging logic.

	When merging stores, search up the chain through a single load, and find all possible stores by looking down through a load and a TokenFactor to all stores visited.

	This improves the quality of the output SelectionDAG and the output Codegen (save perhaps for some ARM cases where we correctly construct wider loads, but then promote them to float operations which appear to require more expensive constant generation).

	Some minor peephole optimizations to deal with improved SubDAG shapes (listed below).

	Additional Minor Changes:

	1. Finishes removing unused AliasLoad code
	2. Unifies the chain aggregation in the merged stores across code paths
	3. Re-add the Store node to the worklist after calling SimplifyDemandedBits.
	4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but seems sufficient to not cause regressions in tests.
	5. Remove Chain dependencies of Memory operations on CopyfromReg nodes as these are captured by data dependence
	6. Forward loads-store values through tokenfactors containing {CopyToReg,CopyFromReg} Values.
	7. Peephole to convert buildvector of extract_vector_elt to extract_subvector if possible (see CodeGen/AArch64/store-merge.ll)
	8. Store merging for the ARM target is restricted to 32-bit, as in some contexts invalid 64-bit operations are being generated. This can be removed once appropriate checks are added.

	This finishes the change Matt Arsenault started in r246307 and jyknight's original patch.

	Many tests required some changes as memory operations are now reorderable, improving load-store forwarding. One test in particular is worth noting:

	CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store forwarding converts a load-store pair into a parallel store and a memory-realized bitcast of the same value. However, because we lose the sharing of the explicit and implicit store values we must create another local store. A similar transformation happens before SelectionDAG as well.

	Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
	llvm-svn: 293184
*	[AMDGPU] Fix typo in GCNSchedStrategy	Valery Pykhtin	2017-01-26	1	-8/+3
	Differential revision: https://reviews.llvm.org/D28980 llvm-svn: 293171
*	AMDGPU: Fold fneg into round instructions	Matt Arsenault	2017-01-26	3	-11/+99
	llvm-svn: 293127
*	AMDGPU: Set call_convention bit in kernel_code_t	Matt Arsenault	2017-01-25	1	-0/+2
	According to the documentation this is supposed to be -1 if indirect calls are not supported. llvm-svn: 293081
*	AMDGPU: Check nsz instead of unsafe math	Matt Arsenault	2017-01-25	2	-3/+3
	llvm-svn: 293028
*	DAG: Recognize no-signed-zeros-fp-math attribute	Matt Arsenault	2017-01-25	2	-0/+80
	clang already emits this with -cl-no-signed-zeros, but codegen doesn't do anything with it. Treat it like the other fast math attributes, and change one place to use it. llvm-svn: 293024
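What a no-signed-zeros assumption buys can be illustrated with the fold `x + 0.0 -> x`. This is a generic example of a sign-of-zero-sensitive rewrite; the commit does not say which fold it changed, so the choice of fold here is an assumption.

```python
import math

x = -0.0

folded = x         # result if `x + 0.0` is folded to `x`
exact = x + 0.0    # IEEE-754 round-to-nearest gives +0.0 here

# The two results compare equal but carry different signs.
print(folded == exact)
print(math.copysign(1.0, folded), math.copysign(1.0, exact))
```

The results are numerically equal but differ in sign, so the fold is only valid when the sign of zero is declared irrelevant.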
*	DAGCombiner: Allow negating ConstantFP after legalize	Matt Arsenault	2017-01-25	1	-2/+1
	llvm-svn: 293019
*	AMDGPU: Implement early ifcvt target hooks.	Matt Arsenault	2017-01-25	4	-2/+567
	Leave early ifcvt disabled for now since there are some shader-db regressions. This causes some immediate improvements, but could be better. The cost checking that the pass does is based on critical path length for out-of-order CPUs, which we do not want, so it skips out on many cases we want. llvm-svn: 293016
*	AMDGPU: Remove spurious out branches after a kill	Matt Arsenault	2017-01-24	1	-0/+40
	A sequence like this:

	  v_cmpx_le_f32_e32 vcc, 0, v0
	  s_branch BB0_30
	  s_cbranch_execnz BB0_30
	; BB#29:
	  exp null off, off, off, off done vm
	  s_endpgm
	BB0_30: ; %endif110

	is likely wrong. The s_branch instruction will unconditionally jump to BB0_30 and the skip block (exp done + endpgm) inserted for performing the kill instruction will never be executed. This results in a GPU hang with Star Ruler 2.

	The s_branch instruction is added during the "Control Flow Optimizer" pass which seems to re-organize the basic blocks, and we assume that SI_KILL_TERMINATOR is always the last instruction inside a basic block. Thus, after inserting a skip block we just go to the next BB without looking at the subsequent instructions after the kill, and the s_branch op is never removed.

	Instead, we should remove the unconditional out branches and skip the two instructions if the exec mask is non-zero.

	This patch fixes the GPU hang and doesn't introduce any regressions with "make check".

	Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99019
	Patch by Samuel Pitoiset <samuel.pitoiset@gmail.com>
	llvm-svn: 292985
*	Enable FeatureFlatForGlobal on Volcanic Islands	Matt Arsenault	2017-01-24	273	-377/+374
	This switches to the workaround that HSA defaults to for the mesa path. This should be applied to the 4.0 branch. Patch by Vedran Miletić <vedran@miletic.net> llvm-svn: 292982