| Commit message | Author | Age | Files | Lines |
| |
Summary:
We were using symbols to represent labels and csects interchangeably before, and that could be a problem.
There are cases where we need to add a storage mapping class to a symbol when that symbol is actually the name of a csect, but it is hard to figure out whether the symbol is a label or a csect.
This patch intends to do the following:
1. Construct a QualName (a name that includes the storage mapping class)
MCSymbolXCOFF for every MCSectionXCOFF.
2. Keep a pointer to that QualName inside the MCSectionXCOFF.
3. Use that QualName whenever we need a symbol that refers to that
MCSectionXCOFF.
4. Adapt XCOFFObjectWriter.cpp to the knock-on effects of the above
changes.
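A minimal sketch of how the QualName is intended to be used (the accessor and helper names here are illustrative, not necessarily the exact ones this patch adds):
```cpp
#include "llvm/MC/MCContext.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCSectionXCOFF.h"
using namespace llvm;

// Refer to a csect through its qualified-name symbol, e.g. "foo[RW]" (name
// plus storage mapping class), rather than through a plain label symbol that
// happens to share the name.
static const MCExpr *getCsectRef(const MCSectionXCOFF &Csect, MCContext &Ctx) {
  MCSymbolXCOFF *QualName = Csect.getQualNameSymbol(); // pointer kept by the section
  return MCSymbolRefExpr::create(QualName, Ctx);
}
```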
Reviewers: xingxue, DiggerLin, sfertile, daltenty, hubert.reinterpretcast
Reviewed By: DiggerLin, daltenty
Subscribers: wuzish, nemanjai, mgorny, hiraditya, kbarton, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69633
|
| |
See https://bugs.llvm.org/show_bug.cgi?id=40903
Reviewers: arsenm, rampitec
Differential Revision: https://reviews.llvm.org/D69888
|
| |
Refactor the isCopyInstrImpl, isCopyInstr and isAddImmediate methods
to return an optional machine operand pair of destination and source
registers.
Patch by Nikola Prica
Differential Revision: https://reviews.llvm.org/D69622
|
| |
Summary:
The greedy register allocator occasionally decides to insert a large number of
unnecessary copies; see below for an example. The -consider-local-interval-cost
option (which X86 already enables by default) fixes this. We enable this option
for AArch64 only, after receiving feedback that the change is not beneficial for
PowerPC.
We evaluated the impact of this change on compile time, code size and
performance benchmarks.
This option has a small impact on compile time, measured on CTMark: a 0.1%
geomean regression at -O1 and -O2 and a 0.2% geomean regression at -O3, with at
most 0.5% on individual benchmarks.
The effect on both code size and performance on AArch64 for the LLVM test suite
is nil on the geomean with individual outliers (ignoring short exec_times)
between:
best worst
size..text -3.3% +0.0%
exec_time -5.8% +2.3%
On SPEC CPU® 2017 (compiled for AArch64) there is a minor reduction (-0.2% at
most) in code size on some benchmarks, with a tiny movement (-0.01%) on the
geomean. Neither intrate nor fprate show any change in performance.
This patch makes the following changes.
- For the AArch64 target, enableAdvancedRASplitCost() now returns true.
- Ensures that -consider-local-interval-cost=false can disable the new
behaviour if necessary.
This matrix multiply example:
$ cat test.c
long A[8][8];
long B[8][8];
long C[8][8];
void run_test() {
for (int k = 0; k < 8; k++) {
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
results in the following generated code on AArch64:
$ clang --target=aarch64-arm-none-eabi -O3 -S test.c -o -
[...]
// %for.cond1.preheader
// =>This Inner Loop Header: Depth=1
add x14, x11, x9
str q0, [sp, #16] // 16-byte Folded Spill
ldr q0, [x14]
mov v2.16b, v15.16b
mov v15.16b, v14.16b
mov v14.16b, v13.16b
mov v13.16b, v12.16b
mov v12.16b, v11.16b
mov v11.16b, v10.16b
mov v10.16b, v9.16b
mov v9.16b, v8.16b
mov v8.16b, v31.16b
mov v31.16b, v30.16b
mov v30.16b, v29.16b
mov v29.16b, v28.16b
mov v28.16b, v27.16b
mov v27.16b, v26.16b
mov v26.16b, v25.16b
mov v25.16b, v24.16b
mov v24.16b, v23.16b
mov v23.16b, v22.16b
mov v22.16b, v21.16b
mov v21.16b, v20.16b
mov v20.16b, v19.16b
mov v19.16b, v18.16b
mov v18.16b, v17.16b
mov v17.16b, v16.16b
mov v16.16b, v7.16b
mov v7.16b, v6.16b
mov v6.16b, v5.16b
mov v5.16b, v4.16b
mov v4.16b, v3.16b
mov v3.16b, v1.16b
mov x12, v0.d[1]
fmov x15, d0
ldp q1, q0, [x14, #16]
ldur x1, [x10, #-256]
ldur x2, [x10, #-192]
add x9, x9, #64 // =64
mov x13, v1.d[1]
fmov x16, d1
ldr q1, [x14, #48]
mul x3, x15, x1
mov x14, v0.d[1]
fmov x17, d0
mov x18, v1.d[1]
fmov x0, d1
mov v1.16b, v3.16b
mov v3.16b, v4.16b
mov v4.16b, v5.16b
mov v5.16b, v6.16b
mov v6.16b, v7.16b
mov v7.16b, v16.16b
mov v16.16b, v17.16b
mov v17.16b, v18.16b
mov v18.16b, v19.16b
mov v19.16b, v20.16b
mov v20.16b, v21.16b
mov v21.16b, v22.16b
mov v22.16b, v23.16b
mov v23.16b, v24.16b
mov v24.16b, v25.16b
mov v25.16b, v26.16b
mov v26.16b, v27.16b
mov v27.16b, v28.16b
mov v28.16b, v29.16b
mov v29.16b, v30.16b
mov v30.16b, v31.16b
mov v31.16b, v8.16b
mov v8.16b, v9.16b
mov v9.16b, v10.16b
mov v10.16b, v11.16b
mov v11.16b, v12.16b
mov v12.16b, v13.16b
mov v13.16b, v14.16b
mov v14.16b, v15.16b
mov v15.16b, v2.16b
ldr q2, [sp] // 16-byte Folded Reload
fmov d0, x3
mul x3, x12, x1
[...]
With -consider-local-interval-cost the same section of code results in the
following:
$ clang --target=aarch64-arm-none-eabi -mllvm -consider-local-interval-cost -O3 -S test.c -o -
[...]
.LBB0_1: // %for.cond1.preheader
// =>This Inner Loop Header: Depth=1
add x14, x11, x9
ldp q0, q1, [x14]
ldur x1, [x10, #-256]
ldur x2, [x10, #-192]
add x9, x9, #64 // =64
mov x12, v0.d[1]
fmov x15, d0
mov x13, v1.d[1]
fmov x16, d1
ldp q0, q1, [x14, #32]
mul x3, x15, x1
cmp x9, #512 // =512
mov x14, v0.d[1]
fmov x17, d0
fmov d0, x3
mul x3, x12, x1
[...]
Reviewers: SjoerdMeijer, samparker, dmgreen, qcolombet
Reviewed By: dmgreen
Subscribers: ZhangKang, jsji, wuzish, ppc-slack, lkail, steven.zhang, MatzeB, qcolombet, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69437
|
| |
The following testcase
function:
.Lpcrel_label1:
auipc a0, %pcrel_hi(other_function)
addi a1, a0, %pcrel_lo(.Lpcrel_label1)
.p2align 2 # Causes a new fragment to be emitted
.type other_function,@function
other_function:
ret
exposes an odd behaviour in which only the %pcrel_hi relocation is
evaluated but not the %pcrel_lo.
$ llvm-mc -triple riscv64 -filetype obj t.s | llvm-objdump -d -r -
<stdin>: file format ELF64-riscv
Disassembly of section .text:
0000000000000000 function:
0: 17 05 00 00 auipc a0, 0
4: 93 05 05 00 mv a1, a0
0000000000000004: R_RISCV_PCREL_LO12_I other_function+4
0000000000000008 other_function:
8: 67 80 00 00 ret
The reason seems to be that in RISCVAsmBackend::shouldForceRelocation we
only consider the fragment but in RISCVMCExpr::evaluatePCRelLo we
consider the section. This usually works but there are cases where the
section may still be the same while the fragment is a different one. In
that case we end up forcing a %pcrel_lo relocation without any %pcrel_hi.
This patch makes RISCVAsmBackend::shouldForceRelocation use the section,
if any, to determine if the relocation must be forced or not.
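A rough sketch of the section-based check (simplified and with illustrative names; the real logic lives in RISCVAsmBackend::shouldForceRelocation):
```cpp
#include "llvm/MC/MCFragment.h"
#include "llvm/MC/MCSection.h"
using namespace llvm;

// Two different fragments (e.g. split by .p2align) can still belong to the
// same section; only force the %pcrel_lo relocation when the sections of the
// fixup and its %pcrel_hi actually differ.
static bool inDifferentSections(const MCFragment *HiFrag, const MCFragment *LoFrag) {
  const MCSection *HiSec = HiFrag ? HiFrag->getParent() : nullptr;
  const MCSection *LoSec = LoFrag ? LoFrag->getParent() : nullptr;
  return HiSec != LoSec;
}
```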
Differential Revision: https://reviews.llvm.org/D60657
|
| |
-mattr=+alu32 has shown good performance compared to builds without this attribute.
Based on the discussion at
https://lore.kernel.org/bpf/1ec37838-966f-ec0b-5223-ca9b6eb0860d@fb.com/T/#t
cpu version v3 should support -mattr=+alu32.
This patch enables alu32 if the cpu version is v3, whether specified by the user
or probed by LLVM.
Differential Revision: https://reviews.llvm.org/D69957
|
| |
This option allows the user to specify the use of absolute jumptables instead
of relative ones, which are the default on most PPC subtargets.
Patch by Kamauu Bridgeman
Differential revision: https://reviews.llvm.org/D69108
|
| |
Differential Revision: https://reviews.llvm.org/D69851
|
| |
Differential Revision: https://reviews.llvm.org/D69850
|
| |
`saa` and `saad` are 32-bit and 64-bit store atomic add instructions.
memory[base] = memory[base] + rt
These instructions are available on the "Octeon+" CPU. The patch adds support
for both instructions to the MIPS assembler and disassembler and introduces a
new CPU type, "octeon+".
Next patches will implement `.set arch=octeon+` directive and `AFL_EXT_OCTEONP`
ISA extension flag support.
Differential Revision: https://reviews.llvm.org/D69849
|
| |
Summary: [AMDGPU] Fix bug introduced in 47a5c36b37f0
Reviewers: foad, arsenm
Reviewed By: arsenm
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69915
|
| |
Leftovers from before we switched to widening legalization.
Fixes PR43919.
|
| |
Add pattern matching and intrinsics for the following instructions:
predicated orr, eor, and, bic
predicated mul, smulh, umulh, sdiv, udiv, sdivr, udivr
predicated smax, umax, smin, umin, sabd, uabd
mad, msb, mla, mls
https://reviews.llvm.org/D69588
|
| |
This only works if there is no use of the return value.
|
| |
This was omitted. Also, SReg_96Reg was missing the IsSGPR assignment.
Differential Revision: https://reviews.llvm.org/D69919
|
| |
return value location depends on the calling convention of the callee.
`F.getCallingConv()`, however, is the caller CC. Correct it to the
callee CC from `CallLoweringInfo`.
Fixes PR43449
Patch by Shu-Chun Weng!
|
| |
The MMX intrinsics for shift by immediate take a 32-bit shift
amount, but the hardware for shifting by immediate only encodes
8 bits. For the intrinsic we don't require the shift amount to
fit in 8 bits in the frontend because we don't check that it's an
immediate there. If it is not an immediate we move it
to an MMX register and use the shift by register.
But if it is an immediate we'll use the shift by immediate
instruction, and we need to change the shift amount to 8 bits.
We were previously doing this accidentally by masking it in the
encoder. But that can turn a large shift amount into a small,
in-bounds one. Instead we should clamp larger shift
amounts to 255 so that they don't become in bounds.
Fixes PR43922
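A minimal sketch of the clamping idea (a standalone helper for illustration, not the exact lowering code):
```cpp
#include <cstdint>

// The shift-by-immediate encoding only has 8 bits, so saturate rather than
// truncate: an amount such as 256 must stay out of bounds (all-zero result)
// instead of silently becoming an in-bounds shift of 0.
static unsigned clampMMXShiftImm(uint64_t ShiftAmt) {
  return ShiftAmt > 255 ? 255u : static_cast<unsigned>(ShiftAmt);
}
```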
|
| |
These patterns were added in D46009, but removed in D54276 due to
missing test coverage.
Differential Revision: https://reviews.llvm.org/D69831
|
| |
always true. NFCI.
|
| |
Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2i64 ops.
Numbers taken from Intel AOM (+ checked against Agner)
|
| |
Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2f64/v2i64 ops.
|
| |
As noted on D59710 we weren't handling the high costs of these operations on SLM.
|
| |
The store splitting transform was assuming a simple type (MVT),
but that's not necessarily the case as shown in the test.
|
| |
PVS Studio noticed that we were asserting "VT.getVectorNumElements() == VT.getVectorNumElements()" instead of "VT.getVectorNumElements() == InVT.getVectorNumElements()".
|
| |
Summary:
Inserting BTI instructions can push branch destinations out of range.
The branch relaxation pass itself cannot insert indirect branches since `TargetInstrInfo::insertIndirectBranch` is not implemented for AArch64 (presumably because the +/-128 MB direct branch range is more than enough in practice).
Testing this is a bit tricky.
The original test case we have is 155kloc/6.1M. I've generated a test case using this program:
```
#include <iostream>

int main() {
  std::cout << R"src(int test();
void g0(), g1(), g2(), g3(), g4(), e();
void f(int v) {
if ((test() & 2) == 0) {
switch (v) {
case 0:
g0();
case 1:
g1();
case 2:
g2();
case 3:
g3();
}
)src";
  const int N = 8176;
  for (int i = 0; i < N; ++i)
    std::cout << " void h" << i << "();\n";
  for (int i = 0; i < N; ++i)
    std::cout << " h" << i << "();\n";
  std::cout << R"src(
} else {
e();
}
}
)src";
}
```
which is still a bit too much to commit as a regression test, IMHO.
Reviewers: t.p.northover, ostannard
Reviewed By: ostannard
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69118
Change-Id: Ide5c922bcde08ff4cf635da5e52365525a997a0a
|
| |
Summary: Added estimations for ShuffleVector, some cast and arithmetic instructions
Reviewers: rampitec
Reviewed By: rampitec
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69629
|
| |
We have two ways to steer the vectoriser towards creating a predicated vector
body instead of a scalar epilogue. To force this, we have 1) a command line
option and 2) a pragma available. This adds a third: a target hook in
TargetTransformInfo that can be queried as to whether predication is preferred
or not, which allows the vectoriser to make the decision without it being forced.
While this change behaves as a non-functional change for now, it shows the
required TTI plumbing, usage of this new hook in the vectoriser, and the
beginning of an ARM MVE implementation. I will follow up on this with:
- a complete MVE implementation, see D69845.
- a patch to disable this, i.e. we should respect "vector_predicate(disable)"
and its corresponding loop hint.
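A self-contained sketch of the intended decision order (names are illustrative; the real hook sits on TargetTransformInfo and the option/pragma handling in the vectoriser is more involved):
```cpp
enum class TailFoldChoice { Default, ForceOn, ForceOff };

// The command line option and the pragma still force the behaviour; only when
// neither is set does the new target hook get to express a preference.
static bool usePredicatedVectorBody(TailFoldChoice CmdLine, TailFoldChoice Pragma,
                                    bool TargetPrefersPredication) {
  if (CmdLine != TailFoldChoice::Default)
    return CmdLine == TailFoldChoice::ForceOn;
  if (Pragma != TailFoldChoice::Default)
    return Pragma == TailFoldChoice::ForceOn;
  return TargetPrefersPredication; // queried from the target via TTI
}
```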
Differential Revision: https://reviews.llvm.org/D69040
|
| |
This patch adds two new families of intrinsics, both of which are
memory accesses taking a vector of locations to load from / store to.
The vldrq_gather_base / vstrq_scatter_base intrinsics take a vector of
base addresses, and an immediate offset to be added consistently to
each one. vldrq_gather_offset / vstrq_scatter_offset take a scalar
base address, and a vector of offsets to add to it. The
'shifted_offset' variants also multiply each offset by the element
size type, so that the vector is effectively of array indices.
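For illustration, a use of the offset form from C; the exact ACLE spellings are quoted from memory and should be treated as an assumption rather than part of this patch:
```cpp
#include <arm_mve.h>

// Gather four 32-bit elements: base is a scalar pointer, offsets is a vector
// of per-lane byte offsets added to it.
uint32x4_t gather4(const uint32_t *base, uint32x4_t offsets) {
  return vldrwq_gather_offset_u32(base, offsets);
}
```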
At the IR level, these operations are represented by a single set of
four IR intrinsics: {gather,scatter} × {base,offset}. The other
details (signed/unsigned, shift, and memory element size as opposed to
vector element size) are all specified by IR intrinsic polymorphism
and immediate operands, because that made the selection job easier
than making a huge family of similarly named intrinsics.
I considered using the standard IR representations such as
llvm.masked.gather, but they're not a good fit. In order to use
llvm.masked.gather to represent a gather_offset load with element size
smaller than a pointer, you'd have to expand the <8 x i16> vector of
offsets into an <8 x i16*> vector of pointers, which would be split up
during legalization, so you'd spend most of your time undoing the mess
it had made. Also, ISel support for llvm.masked.gather would be easy
enough in a trivial way (you can expand it into a gather-base load
with a zero immediate offset), but instruction-selecting lots of
fiddly idioms back into all the _other_ MVE load instructions would be
much more work. So I think dedicated IR intrinsics are the more
sensible approach, at least for the moment.
On the clang tablegen side, I've added two new features to the
Tablegen source accepted by MveEmitter: a 'CopyKind' type node for
defining a type that varies with the parameter type (it lets you ask
for an unsigned integer type of the same width as the parameter), and
an 'unsignedflag' value node for passing an immediate IR operand which
is 0 for a signed integer type or 1 for an unsigned one. That lets me
write each kind of intrinsic just once and get all its subtypes and
immediate arguments generated automatically.
Also I've tweaked the handling of pointer-typed values in the code
generation part of MveEmitter: they're generated as Address rather
than Value (i.e. including an alignment) so that they can be given to
the ordinary IR load and store operations, but I'd omitted the code to
convert them back to Value when they're going to be used as an
argument to an IR intrinsic.
On the MC side, I've enhanced MVEVectorVTInfo so that it can tell you
not only the full assembly-language suffix for a given vector type
(like 's32' or 'u16') but also the numeric-only one used by store
instructions (just '32' or '16').
Reviewers: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D69791
|
| |
The 'RM' flag models the "Rounding Mode" and has nothing to do with the load/store instructions.
Differential Revision: https://reviews.llvm.org/D69551
|
| |
Differential Revision: https://reviews.llvm.org/D69867
|
| |
to unsigned warning. NFCI.
Consistently return HexagonII::HCG_None.
|
| |
This was added to inhibit a warning from gcc 7.3, according to the comment.
However, it triggers a warning from PVS. In addition, I cannot reproduce the
warning with gcc 7.4, and I also cannot reproduce it with gcc 7.3 using
Compiler Explorer.
Differential Revision: https://reviews.llvm.org/D69863
|
| |
example
When writing an email for a follow-up proposal, I realized one of the diffs in the committed change was incorrect. Digging into it revealed that the fix is complicated enough to require some thought, so I am reverting in the meantime.
The problem is visible in this diff (from the revert):
; X64-SSE-LABEL: store_fp128:
; X64-SSE: # %bb.0:
-; X64-SSE-NEXT: movaps %xmm0, (%rdi)
+; X64-SSE-NEXT: subq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 32
+; X64-SSE-NEXT: movaps %xmm0, (%rsp)
+; X64-SSE-NEXT: movq (%rsp), %rsi
+; X64-SSE-NEXT: movq {{[0-9]+}}(%rsp), %rdx
+; X64-SSE-NEXT: callq __sync_lock_test_and_set_16
+; X64-SSE-NEXT: addq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 8
; X64-SSE-NEXT: retq
store atomic fp128 %v, fp128* %fptr unordered, align 16
ret void
The problem here is three fold:
1) x86-64 doesn't guarantee atomicity of anything larger than 8 bytes. Some platforms observably break this guarantee, others don't, but the codegen isn't considering this, so it's wrong on at least some platforms.
2) When I started to track down the problem, I discovered that DAGCombiner had stripped the atomicity off the store entirely. This comes down to idiomatic usage of DAG.getStore passing all MMO components separately as opposed to just passing the MMO.
3) On x86 (not -64), there are cases where 8-byte atomicity is supported, but only for floating point operations. This would seem to imply that operation typing matters for correctness, and DAGCombine happily folds away bitcasts. I'm not 100% sure there's a problem here, but I'm not entirely sure there isn't either.
I plan on returning to each issue in turn; sorry for the churn here.
|
| |
Static analyzer complains about always false condition.
See https://bugs.llvm.org/show_bug.cgi?id=43886
Differential Revision: https://reviews.llvm.org/D69860
|
| |
The backend UnsafeFPMath flag is not a superset of all the others, so
limit it to the exact bits needed.
|
| |
Summary:
G_GEP is rather poorly named. It's a simple pointer+scalar addition and
doesn't support any of the complexities of getelementptr. I therefore
propose that we rename it. There's already a G_PTR_MASK, so let's follow that
convention and go with G_PTR_ADD.
Reviewers: volkan, aditya_nandakumar, bogner, rovka, arsenm
Subscribers: sdardis, jvesely, wdng, nhaehnle, hiraditya, jrtc27, atanasyan, arphaman, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69734
|
| |
addOperand() method of AMDGPU disassembler returns SoftFail
on error. All instances that can lead to that place are impossible
encodings, not something that is possible to encode but semantically
incorrect, which is what SoftFail is meant to describe.
Then tablegen generates a check of the following form:
if (Decode...(..) == MCDisassembler::Fail) { return MCDisassembler::Fail; }
Since we can only return Success and SoftFail, that check is dead
code, as detected by the static code analyzer.
Solution: return Fail as it should be.
See https://bugs.llvm.org/show_bug.cgi?id=43886
Differential Revision: https://reviews.llvm.org/D69819
|
| |
This feature controls whether AA is used in the backend, and was
previously turned on for certain subtargets to help create less
constrained scheduling graphs. This patch turns it on for all
subtargets, so that they can all make use of the extra information to
produce better code.
Differential Revision: https://reviews.llvm.org/D69796
|
| |
In the ARM backend, for historical reasons only some targets use
Machine Scheduling. The rest use the old list scheduler, as they
use itineraries and the list scheduler seems to produce better code
(and not crash by running out of registers on v6-M code). So whether to use
the MIScheduler or not is checked at runtime from the subtarget
features.
This is fine, except for post-ra scheduling. Whether to use the old
post-ra list scheduler or the post-ra machine scheduler is decided as the
pass manager is set up, in ARM's case from a newly constructed subtarget.
Under some situations, like LTO, this won't include the correct CPU, so
it can pick the wrong option. This can have a surprising effect on
performance.
To fix that, this patch overrides targetSchedulesPostRAScheduling and
addPreSched2 in the ARM backend, adding _both_ post-ra schedulers and
picking at runtime which to execute. To pick between the two I've had to
add an enablePostRAMachineScheduler() method that normally returns
enableMachineScheduler() && enablePostRAScheduler(), and which can be
overridden to enable just one of PostRAMachineScheduler vs
PostRAScheduler.
Thanks to David Penry for identifying this problem.
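A sketch of the new hook's default behaviour as described above (simplified; the real method lives on the subtarget alongside the existing hooks):
```cpp
struct SchedulingHooksSketch {
  virtual ~SchedulingHooksSketch() = default;
  virtual bool enableMachineScheduler() const { return false; }
  virtual bool enablePostRAScheduler() const { return false; }
  // New: lets a target run exactly one of the two post-ra schedulers; by
  // default it follows the two existing hooks.
  virtual bool enablePostRAMachineScheduler() const {
    return enableMachineScheduler() && enablePostRAScheduler();
  }
};
```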
Differential Revision: https://reviews.llvm.org/D69775
|
| |
Summary: Introduces the `InstrInfo::areMemAccessesTriviallyDisjoint`
hook. The test could check for instruction reorderings, but to avoid
being brittle it just checks instruction dependencies.
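A hypothetical illustration of the kind of fact such a hook can prove (not the actual RISC-V implementation): two fixed-width accesses at constant offsets from the same base cannot alias when their ranges do not overlap.
```cpp
#include <cstdint>

// Returns true when [OffA, OffA+SizeA) and [OffB, OffB+SizeB) are disjoint,
// which lets the scheduler reorder the two memory accesses.
static bool accessesTriviallyDisjoint(int64_t OffA, uint64_t SizeA,
                                      int64_t OffB, uint64_t SizeB) {
  return OffA + static_cast<int64_t>(SizeA) <= OffB ||
         OffB + static_cast<int64_t>(SizeB) <= OffA;
}
```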
Reviewers: asb, lenary
Reviewed By: lenary
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67046
|
| |
2*log2(bitwidth)+1 for legal types.
This better represents the kshift+binop we'd get for each stage
before the final extract. It's likely we'll do even better by
doing a kmov and a cmp with a GPR, but this is a good start.
The default handling was costing a worst case single source
permute shuffle of the vector before the binop. This worst
case assumes the shuffle might have to be emulated with
extracts and inserts. But since we know we're doing a reduction
we can assume we'll get kshift lowering.
There's still some room for improvement here, but this is
much better than it was.
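A worked sketch of the formula (Log2_32 is used purely for illustration): for a legal mask type of bit width 16 this gives 2*4 + 1 = 9.
```cpp
#include "llvm/Support/MathExtras.h"

// Each of the log2(BitWidth) reduction stages is one kshift plus one binop,
// and the final extract of the scalar result adds one more.
static unsigned reductionStageCost(unsigned BitWidth) {
  return 2 * llvm::Log2_32(BitWidth) + 1; // BitWidth = 16 -> 9
}
```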
|