path: root/llvm/test
...
* [ARM] Expand v1i64 and v2i64 ctpop. (Benjamin Kramer, 2016-03-31, 1 file, -0/+16)

  The default is legal, which results in 'Cannot select' errors. This is
  triggered during selfhost due to a recent cost model change.

  llvm-svn: 265040
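As an illustration (not part of the commit message), a reduced LLVM IR example
of the kind of vector popcount the ARM backend must now expand; the function
name is made up:

    declare <2 x i64> @llvm.ctpop.v2i64(<2 x i64>)

    define <2 x i64> @popcnt_v2i64(<2 x i64> %x) {
      ; Previously treated as Legal and hit "Cannot select"; now expanded.
      %c = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> %x)
      ret <2 x i64> %c
    }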
* [X86] Merge adjacent stack adjustments in eliminateCallFramePseudoInstr (PR27140)
  (Hans Wennborg, 2016-03-31, 8 files, -18/+46)

  For code such as:

      void f(int, int);
      void g() { f(1, 2); }

  compiled for 32-bit X86 Linux, Clang would previously generate:

      subl $12, %esp
      subl $8, %esp
      pushl $2
      pushl $1
      calll f
      addl $16, %esp
      addl $12, %esp
      retl

  This patch fixes that by merging adjacent stack adjustments in
  eliminateCallFramePseudoInstr().

  Differential Revision: http://reviews.llvm.org/D18627

  llvm-svn: 265039
* [lanai] isBrImm should accept any non-constant immediate. (Jacques Pienaar, 2016-03-31, 1 file, -0/+13)

  isBrImm should accept any non-constant immediate. Previously it was only
  accepting LanaiMCExpr ones, which was wrong.

  Differential Revision: http://reviews.llvm.org/D18571

  llvm-svn: 265032
* [PPC] Basic support for Power 9 direct move instructions (Ehsan Amiri, 2016-03-31, 2 files, -0/+22)

  http://reviews.llvm.org/D18097

  Initial support does not include any patterns to generate these instructions.

  llvm-svn: 265031
* [x86] use SSE/AVX ops for non-zero memsets (PR27100) (Sanjay Patel, 2016-03-31, 1 file, -51/+131)

  Move the memset check down to the CPU-with-slow-SSE-unaligned-memops case:
  this allows fast targets to take advantage of SSE/AVX instructions and
  prevents slow targets from stepping into a codegen sinkhole while trying to
  splat a byte into an XMM reg.

  Follow-on bugs exposed by the current codegen are:
  https://llvm.org/bugs/show_bug.cgi?id=27141
  https://llvm.org/bugs/show_bug.cgi?id=27143

  Differential Revision: http://reviews.llvm.org/D18566

  llvm-svn: 265029
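For illustration only (not from the patch), a hypothetical non-zero memset of
the sort this change affects; the 5-operand memset intrinsic signature matches
the LLVM 3.x era of this commit:

    declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1)

    define void @set42(i8* %p) {
      ; Splat the byte 42 into 64 bytes; fast targets can now lower this
      ; with SSE/AVX stores instead of a scalar byte-splat sequence.
      call void @llvm.memset.p0i8.i64(i8* %p, i8 42, i64 64, i32 16, i1 false)
      ret void
    }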
* [AMDGPU] Enable a few disassembler tests that were mistakenly marked as FIXME. (Valery Pykhtin, 2016-03-31, 1 file, -8/+8)

  llvm-svn: 265028
* More checks in win32-seh-nested-finally.ll after comment on r264966 (Hans Wennborg, 2016-03-31, 1 file, -0/+3)

  llvm-svn: 265027
* [PowerPC] Attempt to fix fast-isel-i64offset.ll failure (Ulrich Weigand, 2016-03-31, 1 file, -4/+1)

  The test case added in r265023 is failing on ninja-x64-msvc-RA-centos6.
  Update the test to make less specific assumptions on code generation.

  llvm-svn: 265026
* [PowerPC] Correctly compute 64-bit offsets in fast isel (Ulrich Weigand, 2016-03-31, 1 file, -0/+15)

  PPCSimplifyAddress contains this code:

      IntegerType *OffsetTy = ((VT == MVT::i32) ? Type::getInt32Ty(*Context)
                                                 : Type::getInt64Ty(*Context));

  to determine the type to be used for an index register, if one needs to be
  created. However, the "VT" here is the type of the data being loaded or
  stored, *not* the type of an address. This means that if a data element of
  type i32 is accessed using an index that does not fit into 32 bits, a wrong
  address is computed here.

  Note that PPCFastISel is only ever used on 64-bit currently, so the type of
  an address is actually *always* MVT::i64. Other parts of the code, even in
  this same PPCSimplifyAddress routine, already rely on that fact. Thus, this
  patch changes the code to simply unconditionally use
  Type::getInt64Ty(*Context) as OffsetTy.

  llvm-svn: 265023
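A hypothetical reduced example (not from the commit) of an i32 access whose
byte offset does not fit in 32 bits, which is the case the old OffsetTy
computation got wrong:

    define i32 @load_far(i32* %base) {
      ; Element index 1073741824 is a byte offset of 4 GiB, so the index
      ; arithmetic must be done in i64 even though the loaded data is i32.
      %p = getelementptr inbounds i32, i32* %base, i64 1073741824
      %v = load i32, i32* %p, align 4
      ret i32 %v
    }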
* [PowerPC] Basic support for P9 atomic loads and stores (Nemanja Ivanovic, 2016-03-31, 4 files, -0/+42)

  This patch corresponds to review: http://reviews.llvm.org/D18032

  This patch provides asm implementation for the following instructions:
  lwat, ldat, stwat, stdat, ldmx, mcrxrx

  llvm-svn: 265022
* [AArch64] Handle missing store pair opportunity (Jun Bum Lim, 2016-03-31, 1 file, -8/+22)

  Summary:
  This change handles a missing store pair opportunity where the first store
  instruction stores zero and is followed by a non-zero store. For example,
  this change will convert:

      str wzr, [x8]
      str w1, [x8, #4]

  into:

      stp wzr, w1, [x8]

  Reviewers: jmolloy, t.p.northover, mcrosier

  Subscribers: flyingforyou, aemerson, rengolin, mcrosier, llvm-commits

  Differential Revision: http://reviews.llvm.org/D18570

  llvm-svn: 265021
* [PowerPC] Remove incorrect use of COPY_TO_REGCLASS in fast isel (Ulrich Weigand, 2016-03-31, 1 file, -0/+33)

  The fast isel pass currently emits a COPY_TO_REGCLASS node to convert from an
  F4RC to an F8RC register class during conversion of a floating-point number
  to integer. There is actually no support in the common code instruction
  printers to emit COPY_TO_REGCLASS nodes, so the PowerPC back-end has special
  code there to simply ignore COPY_TO_REGCLASS.

  This is correct *if and only if* the source and destination registers of
  COPY_TO_REGCLASS are the same (except for the different register class). But
  nothing guarantees this to be the case, and if the register allocator does
  end up allocating source and destination to different registers after all,
  the back-end simply generates incorrect code. I've included a test case that
  shows such incorrect code generation.

  However, it seems that COPY_TO_REGCLASS is actually not intended to be used
  at the MI layer at all. It is used during SelectionDAG, but always lowered to
  a plain COPY before emitting MI. Other back-ends' fast isel passes never emit
  COPY_TO_REGCLASS at all. I suspect it is simply wrong for the PowerPC
  back-end to emit it here.

  This patch changes the PowerPC back-end to directly emit COPY instead of
  COPY_TO_REGCLASS and removes the special handling in the instruction
  printers.

  Differential Revision: http://reviews.llvm.org/D18605

  llvm-svn: 265020
* [mips] Range check simm16 (Daniel Sanders, 2016-03-31, 5 files, -14/+18)

  Summary:
  There are too many instructions to exhaustively test, so addiu and lwc2 are
  used as representative examples.

  It should be noted that many memory instructions that should have simm16
  range checking do not, because it is also necessary to support the macro of
  the same name which accepts simm32. The range checks for these occur in the
  macro expansion.

  Reviewers: vkalintiris

  Subscribers: dsanders, llvm-commits

  Differential Revision: http://reviews.llvm.org/D18437

  llvm-svn: 265019
* [mips] Range check simm11 and mem_simm11. (Daniel Sanders, 2016-03-31, 4 files, -8/+16)

  Summary:
  ldc2/sdc2 now emit slightly worse diagnostics for MIPS-I. The problem is that
  they don't trigger the custom parser because all the candidates are disabled
  by feature bits. On all other subtargets, the diagnostics are accurate but
  are subject to the usual issues of needing to report multiple ways to correct
  the code (e.g. smaller offset, enable a CPU feature) but only being able to
  report one error.

  Reviewers: vkalintiris

  Subscribers: dsanders, llvm-commits

  Differential Revision: http://reviews.llvm.org/D18436

  llvm-svn: 265018
* [AMDGPU] Disassembler: support for DPP (Sam Kolton, 2016-03-31, 1 file, -0/+89)

  Review: http://reviews.llvm.org/D18642

  llvm-svn: 265015
* [mips] Split mem_msa into range checked mem_simm10 and mem_simm10_lsl[123] (Daniel Sanders, 2016-03-31, 2 files, -50/+39)

  Summary:
  Also, made test_mi10.s formatting consistent with the majority of the MC
  tests.

  Reviewers: vkalintiris

  Subscribers: dsanders, llvm-commits

  Differential Revision: http://reviews.llvm.org/D18435

  llvm-svn: 265014
* [mips] Range check simm9 and fix a bug this revealed. (Daniel Sanders, 2016-03-31, 12 files, -93/+100)

  Summary:
  The bug was that microMIPS's [ls]w[lr]e instructions claimed to support a
  12-bit offset when it is only 9-bit.

  Reviewers: vkalintiris

  Subscribers: llvm-commits, dsanders

  Differential Revision: http://reviews.llvm.org/D18434

  llvm-svn: 265010
* [TTI] Let the cost model estimate ctpop costs based on legality (Benjamin Kramer, 2016-03-31, 2 files, -8/+19)

  PPC has a vector popcount; this lets the vectorizer use the correct cost for
  it. Tweak the X86 test to use an intrinsic that's actually scalarized (we
  have a somewhat efficient lowering for vector popcount using SSE, and the
  cost model finds that now).

  llvm-svn: 265005
* [mips][microMIPS] Implement MFC*, MFHC* and DMFC* instructions (Zlatko Buljan, 2016-03-31, 6 files, -14/+44)

  Differential Revision: http://reviews.llvm.org/D17334

  llvm-svn: 265002
* Silence warnings in OCaml bindings (Jeroen Ketema, 2016-03-31, 1 file, -4/+1)

  * LLVMDisposeMessage lives in llvm-c/Core.h, include this file where necessary
  * LLVMAddTargetData has been removed, follow suit in the bindings

  Differential Revision: http://reviews.llvm.org/D18633

  llvm-svn: 265001
* [InstCombine] Fix incorrect rule from rL236202 (Sanjoy Das, 2016-03-31, 1 file, -0/+18)

  The rule for SMIN introduced in rL236202 doesn't work as advertised: the
  check for Pred == ICmpInst::ICMP_SGT was missing.

  llvm-svn: 264996
* [DebugInfo] Subprograms should belong to a CU. (Davide Italiano, 2016-03-31, 10 files, -9/+10)

  Start fixing tests accordingly. There are still about 35 failures before we
  can enable this check in the IR verifier.

  llvm-svn: 264990
* [PowerPC] Load two floats directly instead of using one 64-bit integer load (Hal Finkel, 2016-03-31, 1 file, -0/+60)

  When dealing with complex<float>, and similar structures with two
  single-precision floating-point numbers, especially when such things are
  being passed around by value, we'll sometimes end up loading both float
  values by extracting them from one 64-bit integer load. It looks like this:

      t13: i64,ch = load<LD8[%ref.tmp]> t0, t6, undef:i64
      t16: i64 = srl t13, Constant:i32<32>
      t17: i32 = truncate t16
      t18: f32 = bitcast t17
      t19: i32 = truncate t13
      t20: f32 = bitcast t19

  The problem, especially before the P8 where those bitcasts aren't legal (and
  get expanded via the stack), is that it would have been better to use two
  floating-point loads directly. Here we add a target-specific DAGCombine to
  do just that. In short, we turn:

      ld 3, 0(5)
      stw 3, -8(1)
      rldicl 3, 3, 32, 32
      stw 3, -4(1)
      lfs 3, -4(1)
      lfs 0, -8(1)

  into:

      lfs 3, 4(5)
      lfs 0, 0(5)

  llvm-svn: 264988
* Fix case confusion. (Sean Silva, 2016-03-31, 1 file, -5/+5)

  The test case was defining and using a function 'notExported()', but the
  FileCheck checks were checking for the name 'not_exported'. This changes the
  test to use 'notExported' across the board. Also, the test defined a function
  'not_defined()', but doesn't have any checks related to it. For consistency,
  this name is changed to 'notDefined'. A later commit will add checks for
  'notDefined'.

  Patch by Warren Ristow!

  llvm-svn: 264984
* Introduce a @llvm.experimental.guard intrinsic (Sanjoy Das, 2016-03-31, 3 files, -0/+127)

  Summary:
  As discussed on llvm-dev[1]. This change adds the basic boilerplate code
  around having this intrinsic in LLVM:

  - Changes in Intrinsics.td, and the IR Verifier
  - A lowering pass to lower @llvm.experimental.guard to normal control flow
  - Inliner support

  [1]: http://lists.llvm.org/pipermail/llvm-dev/2016-February/095523.html

  Reviewers: reames, atrick, chandlerc, rnk, JosephTremoulet, echristo

  Subscribers: mcrosier, llvm-commits

  Differential Revision: http://reviews.llvm.org/D18527

  llvm-svn: 264976
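A minimal usage sketch, following the pattern described in the llvm-dev thread
referenced above; the guarded condition and function name are made up:

    declare void @llvm.experimental.guard(i1, ...)

    define void @checked_access(i32 %idx, i32 %len) {
      %in.bounds = icmp ult i32 %idx, %len
      ; If %in.bounds is false, execution leaves the compiled code through the
      ; "deopt" operand bundle instead of continuing past the guard.
      call void (i1, ...) @llvm.experimental.guard(i1 %in.bounds) [ "deopt"() ]
      ret void
    }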
* Add some more triples after r264966 (Hans Wennborg, 2016-03-30, 3 files, -3/+3)

  llvm-svn: 264972
* [X86] Enable call frame optimization ("mov to push") not only for optsize (PR26325)
  (Hans Wennborg, 2016-03-30, 34 files, -195/+191)

  The size savings are significant, and from what I can tell, both ICC and GCC
  do this.

  Differential Revision: http://reviews.llvm.org/D18573

  llvm-svn: 264966
* AMDGPU: Add frexp_exp intrinsic (Matt Arsenault, 2016-03-30, 2 files, -0/+226)

  llvm-svn: 264944
* AMDGPU: Constant folding for frexp_mant (Matt Arsenault, 2016-03-30, 1 file, -0/+155)

  llvm-svn: 264943
* Use existing PrintEscapedString in AssemblyWriter (Teresa Johnson, 2016-03-30, 1 file, -2/+2)

  r264884 introduced a helper to escape the backslashes in the source file
  path, but I since discovered an existing mechanism to escape strings.

  llvm-svn: 264936
* [IndVarSimplify] Don't insert after a catchswitch (David Majnemer, 2016-03-30, 1 file, -0/+38)

  Widening a PHI requires us to insert a trunc. The logical place for this
  trunc is in the same BB as the PHI. This is not possible if the BB is
  terminated by a catchswitch.

  This fixes PR27133.

  llvm-svn: 264926
* [X86][AVX] Ensure EltsFromConsecutiveLoads tests the entire vector for consecutive loads/zeros
  (Simon Pilgrim, 2016-03-30, 1 file, -0/+39)

  Fix for an issue introduced in D17297, where we were breaking early from the
  loop detecting consecutive loads, which could leave us thinking a consecutive
  load with zeros was possible.

  llvm-svn: 264922
* test: Remove a test for a transform that hasn't existed in 5 years. (Justin Bogner, 2016-03-30, 3 files, -33/+0)

  The TailDup transform was removed in r138841 in 2011, along with most of the
  tests for it. This test, however, was missed. Probably because it had already
  been XFAIL'd for 3 years at that point (since r52243!) and continued to fail
  when the opt flag for -tailduplicate stopped being valid.

  llvm-svn: 264916
* [LoopVectorize] Don't vectorize loops when everything will be scalarized (Hal Finkel, 2016-03-30, 2 files, -0/+101)

  This change prevents the loop vectorizer from vectorizing when all of the
  vector types it generates will be scalarized. I've run into this problem on
  the PPC's QPX vector ISA, which only holds floating-point vector types. The
  loop vectorizer will, however, happily vectorize loops with purely integer
  computation. Here's an example:

      LV: The Smallest and Widest types: 32 / 32 bits.
      LV: The Widest register is: 256 bits.
      LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
      LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
      LV: Found an estimated cost of 0 for VF 1 For instruction: %2 = trunc i64 %indvars.iv25 to i32
      LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 %2, i32* %arrayidx, align 4
      LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
      LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
      LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
      LV: Scalar loop costs: 3.
      LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
      LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
      LV: Found an estimated cost of 0 for VF 2 For instruction: %2 = trunc i64 %indvars.iv25 to i32
      LV: Found an estimated cost of 2 for VF 2 For instruction: store i32 %2, i32* %arrayidx, align 4
      LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
      LV: Found an estimated cost of 1 for VF 2 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
      LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
      LV: Vector loop of width 2 costs: 2.
      LV: Found an estimated cost of 0 for VF 4 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
      LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
      LV: Found an estimated cost of 0 for VF 4 For instruction: %2 = trunc i64 %indvars.iv25 to i32
      LV: Found an estimated cost of 4 for VF 4 For instruction: store i32 %2, i32* %arrayidx, align 4
      LV: Found an estimated cost of 1 for VF 4 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
      LV: Found an estimated cost of 1 for VF 4 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
      LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
      LV: Vector loop of width 4 costs: 1.
      ...
      LV: Selecting VF: 8.
      LV: The target has 32 registers
      LV(REG): Calculating max register usage:
      LV(REG): At #0 Interval # 0
      LV(REG): At #1 Interval # 1
      LV(REG): At #2 Interval # 2
      LV(REG): At #4 Interval # 1
      LV(REG): At #5 Interval # 1
      LV(REG): VF = 8

  The problem is that the cost model here is not wrong, exactly. Since all of
  these operations are scalarized, their cost (aside from the uniform ones) is
  indeed VF*(scalar cost), just as the model suggests. In fact, the larger the
  VF picked, the lower the relative overhead from the loop itself (and the
  induction-variable update and check), and so in a sense, picking the largest
  VF here is the right thing to do.

  The problem is that vectorizing like this, where all of the vectors will be
  scalarized in the backend, isn't really vectorizing, but rather interleaving.
  By itself, this would be okay, but then the vectorizer itself also
  interleaves, and that's where the problem manifests itself. There aren't
  actually enough scalar registers to support the normal interleave factor
  multiplied by a factor of VF (8 in this example). In other words, the problem
  with this is that our register-pressure heuristic does not account for
  scalarization.

  While we might want to improve our register-pressure heuristic, I don't think
  this is the right motivating case for that work. Here we have a more-basic
  problem: The job of the vectorizer is to vectorize things (interleaving
  aside), and if the IR it generates won't generate any actual vector code,
  then something is wrong. Thus, if every type looks like it will be scalarized
  (i.e. will be split into VF or more parts), then don't consider that VF.

  This is not a problem specific to PPC/QPX, however. The problem comes up
  under SSE on x86 too, and as such, this change fixes PR26837 too. I've added
  Sanjay's reduced test case from PR26837 to this commit.

  Differential Revision: http://reviews.llvm.org/D18537

  llvm-svn: 264904
* Restore "[ThinLTO] Serialize the Module SourceFileName to/from LLVM assembly" (Teresa Johnson, 2016-03-30, 2 files, -0/+16)

  This restores commit r264869, with a fix for windows bots to properly escape
  '\' in the path when serializing out. Added test.

  llvm-svn: 264884
* AMDGPU/SI: Improve MachineSchedModel definition (Tom Stellard, 2016-03-30, 9 files, -162/+168)

  This patch contains a few improvements to the model, including:

  - Using a single resource with a defined buffer size for each memory unit.
  - Setting the IssueWidth correctly.
  - Fixing latency values for memory instructions.

  shader-db stats:

      16429 shaders in 3231 tests
      Totals:
      SGPRS: 318232 -> 312328 (-1.86 %)
      VGPRS: 208996 -> 209346 (0.17 %)
      Code Size: 7147044 -> 7166440 (0.27 %) bytes
      LDS: 83 -> 83 (0.00 %) blocks
      Scratch: 1862656 -> 1459200 (-21.66 %) bytes per wave
      Max Waves: 49182 -> 49243 (0.12 %)
      Wait states: 0 -> 0 (0.00 %)

  Differential Revision: http://reviews.llvm.org/D18453

  llvm-svn: 264877
* AMDGPU/SI: Enable lanemask tracking in misched (Tom Stellard, 2016-03-30, 30 files, -139/+124)

  Summary:
  This results in higher register usage, but should make it easier for the
  compiler to hide latency.

  This pass is a prerequisite for some more scheduler improvements, and I think
  the increased register usage with this patch is acceptable, because when
  combined with the scheduler improvements, the total register usage will
  decrease.

  shader-db stats:

      2382 shaders in 478 tests
      Totals:
      SGPRS: 48672 -> 49088 (0.85 %)
      VGPRS: 34148 -> 34847 (2.05 %)
      Code Size: 1285816 -> 1289128 (0.26 %) bytes
      LDS: 28 -> 28 (0.00 %) blocks
      Scratch: 492544 -> 573440 (16.42 %) bytes per wave
      Max Waves: 6856 -> 6846 (-0.15 %)
      Wait states: 0 -> 0 (0.00 %)

  Depends on D18451

  Reviewers: nhaehnle, arsenm

  Subscribers: arsenm, llvm-commits

  Differential Revision: http://reviews.llvm.org/D18452

  llvm-svn: 264876
* [SystemZ] Add nop and nopr InstAliases. (Jonas Paulsson, 2016-03-30, 1 file, -0/+6)

  For compatibility with GAS, nop and nopr are recognized as aliases for bc and
  bcr, respectively. A mask of 0 effectively turns these instructions into
  no-operations.

  Reviewed by Ulrich Weigand.

  llvm-svn: 264875
* Revert "[ThinLTO] Serialize the Module SourceFileName to/from LLVM assembly" (Teresa Johnson, 2016-03-30, 1 file, -8/+0)

  This reverts commit r264869. I am seeing Windows bot failures due to the "\"
  in the path being mishandled at some point (seems to be interpreted wrongly
  at some point and llvm-as | llvm-dis is yielding some junk characters). Need
  to investigate.

  llvm-svn: 264871
* [X86][XOP] BITREVERSE lowering using VPPERM (Simon Pilgrim, 2016-03-30, 1 file, -0/+186)

  XOP's VPPERM has some great 'permute operations' that it can do as well as
  part of shuffling the bytes of a 128-bit vector - in this case we use it to
  perform BITREVERSE in a single instruction.

  llvm-svn: 264870
* [ThinLTO] Serialize the Module SourceFileName to/from LLVM assembly (Teresa Johnson, 2016-03-30, 1 file, -0/+8)

  Summary:
  This change serializes the SourceFileName out to and in from LLVM assembly,
  so that it is preserved through "llvm-dis | llvm-as". This is necessary to
  ensure that the global identifiers created for local values in the module
  summary index are the same even if the bitcode is streamed out and read back
  from LLVM assembly.

  Serializing the summary itself to LLVM assembly is in progress.

  Reviewers: joker.eph

  Subscribers: llvm-commits, joker.eph

  Differential Revision: http://reviews.llvm.org/D18588

  llvm-svn: 264869
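A sketch of what the preserved assembly entry looks like (the file name here
is invented for illustration):

    source_filename = "path/to/original.c"

With this in place, "llvm-as | llvm-dis" round-trips the module's source file
name, so the global identifiers derived from it for local values stay stable.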
* [X86][SSE] Test the legalization of vector comparison results (Simon Pilgrim, 2016-03-30, 1 file, -0/+1971)

  We are currently doing a REALLY bad job of packing results of vector
  comparisons into the legalized <X x i1> result equivalents - a mixture of
  PACKSS/PMOVMSKB would be much better here.

  llvm-svn: 264867
* [X86][SSE] Added tests for clearing upper bits of vector elements (Simon Pilgrim, 2016-03-30, 1 file, -0/+688)

  Patterns based on PR6455

  llvm-svn: 264857
* [VectorUtils] Don't try and truncate PHIs to a smaller bitwidth (James Molloy, 2016-03-30, 1 file, -0/+58)

  We already try not to truncate PHIs in computeMinimalBitwidths. LoopVectorize
  can't handle it and we really don't need to, because both induction and
  reduction PHIs are truncated by other means.

  However, we weren't bailing out in all the places we should have, and we
  ended up returning a PHI to be truncated, which caused PR27018.

  This fixes PR27018.

  llvm-svn: 264852
* [x86] Fix a horrible bug in our lowering of x86 floating point atomic operations.
  (Chandler Carruth, 2016-03-30, 1 file, -0/+17)

  Specifically, we had code that tried to badly approximate reconstructing all
  of the possible variations on addressing modes in two x86 instructions based
  on those in one pseudo instruction. This is not the first bug uncovered with
  doing this, so stop doing it altogether. Instead, generically and pedantically
  copy every operand from the address over to both new instructions, and strip
  kill flags from any register operands.

  This fixes a subtle bug seen in the wild where we would mysteriously drop
  parts of the addressing mode, causing, for example, the index argument in the
  added test case to just be completely ignored.

  Hypothetically, this was an extremely bad miscompile because it actually
  caused a predictable and leverageable write of a 64-bit quantity to an
  unintended offset (the first element of the array instead of whatever other
  element was intended). As a consequence, in theory this could even have
  introduced security vulnerabilities.

  However, this was only something that could happen with an atomic floating
  point add. No other operation could trigger this bug, so it seems extremely
  unlikely to have occurred widely in the wild.

  But it did in fact occur, and frequently in scientific applications which
  were using relaxed atomic updates of a floating point value after adding a
  delta. Those would end up being quite badly miscompiled by LLVM, which is how
  we found this. Of course, this often looks like a race condition in the code,
  but it was actually a miscompile.

  I suspect that this whole RELEASE_FADD thing was a complete mistake. There is
  no such operation, and I worry that anything other than add will get
  remarkably worse code generation. But that's not for this change....

  llvm-svn: 264845
* [MemorySSA] Make the visitor more careful with calls. (George Burgess IV, 2016-03-30, 1 file, -0/+27)

  Prior to this patch, the MemorySSA caching visitor would cache all calls that
  it visited. When paired with phi optimization, this can be problematic.
  Consider:

      define void @foo() {
        ; 1 = MemoryDef(liveOnEntry)
        call void @clobberFunction()
        br i1 undef, label %if.end, label %if.then

      if.then:
        ; MemoryUse(??)
        call void @readOnlyFunction()
        ; 2 = MemoryDef(1)
        call void @clobberFunction()
        br label %if.end

      if.end:
        ; 3 = MemoryPhi(...)
        ; MemoryUse(?)
        call void @readOnlyFunction()
        ret void
      }

  When optimizing MemoryUse(?), we visit defs 1 and 2, so we note to cache them
  later. We ultimately end up not being able to optimize past the Phi, so we
  set MemoryUse(?) to point to the Phi. We then cache the clobbering call for
  def 1 to be the Phi.

  This commit changes this behavior so that we wipe out any calls added to
  VisitedCalls while visiting the defs of a phi we couldn't optimize.

  Aside: With this patch, we can now bootstrap clang/LLVM without a single
  MemorySSA verifier failure. Woohoo. :)

  llvm-svn: 264820
* [PGO] Handle invoke inst in IR based icall instrumentation (Xinliang David Li, 2016-03-30, 1 file, -0/+42)

  Differential Revision: http://reviews.llvm.org/D18580

  llvm-svn: 264818
* [MemorySSA] Change how the walker views/walks visited phis. (George Burgess IV, 2016-03-30, 2 files, -5/+39)

  This patch teaches the caching MemorySSA walker a few things:

  1. Not to walk Phis we've walked before. It seems that we tried to do this
     before, but it didn't work so well in cases like:

         define void @foo() {
           %1 = alloca i8
           %2 = alloca i8
           br label %begin

         begin:
           ; 3 = MemoryPhi({%0,liveOnEntry},{%end,2})
           ; 1 = MemoryDef(3)
           store i8 0, i8* %2
           br label %end

         end:
           ; MemoryUse(?)
           load i8, i8* %1
           ; 2 = MemoryDef(1)
           store i8 0, i8* %2
           br label %begin
         }

     because we wouldn't put Phis in Q.Visited until we tried to visit them.
     So, when trying to optimize MemoryUse(?):

     - We would visit 3 above
     - ...Which would make us put {%0,liveOnEntry} in Q.Visited
     - ...Which would make us visit {%0,liveOnEntry}
     - ...Which would make us put {%end,2} in Q.Visited
     - ...Which would make us visit {%end,2}
     - ...Which would make us visit 3
     - ...Which would realize we've already visited everything in 3
     - ...Which would make us conservatively return 3.

     In the added test case (@looped_visitedonlyonce), this behavior would
     cause us to give incorrect results. Specifically, we'd visit 4 twice in
     the same query, but on the second visit, we'd skip while.cond because it
     had been visited, visit if.then/if.then2, and cache "1" as the clobbering
     def on the way back.

  2. If we try to walk the defs of a {Phi,MemLoc} and see it has been visited
     before, just hand back the Phi we're trying to optimize.

     I promise this isn't as terrible as it seems. :)

     We now insert {Phi,MemLoc} pairs just before walking the Phi's upward
     defs. So, we check the cache for the {Phi,MemLoc} pair before checking if
     we've already walked the Phi. The {Phi,MemLoc} pair is (almost?) always
     guaranteed to have a cache entry if we've already fully walked it, because
     we cache as we go.

     So, if the {Phi,MemLoc} pair isn't in cache, either:
     (a) we must be in the process of visiting it (in which case, we can't give
         a better answer in a cache-as-we-go DFS walker), or
     (b) we visited it, but didn't cache it on the way back (...which seems to
         require `ModifyingAccess` to not dominate `StartingAccess`, so I'm 99%
         sure that would be an error. If it's not an error, I haven't been able
         to get it to happen locally, so I suspect it's rare.)

  As a consequence of this change, we no longer skip upward defs of phis, so we
  can kill the `VisitedOnlyOne` check. This gives us better accuracy than we
  had before, at the cost of potentially doing a bit more work when we have a
  loop.

  llvm-svn: 264814
* [LoopDataPrefetch] Centralize the tuning cl::opts under the pass (Adam Nemet, 2016-03-29, 1 file, -1/+1)

  This is effectively NFC, minus the renaming of the options
  (-cyclone-prefetch-distance -> -prefetch-distance).

  The change was requested by Tim in D17943.

  llvm-svn: 264806
* [tsan] Do not instrument reads/writes to instruction profile counters. (Anna Zaks, 2016-03-29, 1 file, -0/+33)

  We have known races on profile counters, which can be reproduced by enabling
  -fsanitize=thread and -fprofile-instr-generate simultaneously on a
  multi-threaded program. This patch avoids reporting those races by not
  instrumenting the reads and writes coming from the instruction profiler.

  llvm-svn: 264805