bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message	Author	Age	Files	Lines
*	[AArch64] Ensure no tagged memory is left in the unallocated portion of the stack	Momchil Velikov	2019-10-09	1	-15/+68
	This patch makes sure that if we tag some memory, we untag that memory before the function returns/throws via any exit reachable from the tag operation. For that we place the untag operation either at: a) the lifetime end call for the alloca, if that call post-dominates the lifetime start call (where the tag operation is placed), or it (the lifetime end call) dominates all reachable exits, otherwise b) at the reachable exits. Differential Revision: https://reviews.llvm.org/D68469 llvm-svn: 374182
*	[mips] Rename local variable. NFC	Simon Atanasyan	2019-10-09	1	-19/+19
	llvm-svn: 374165
*	[mips] Split expandLoadImmReal into multiple methods. NFC	Simon Atanasyan	2019-10-09	1	-154/+205
	The `expandLoadImmReal` handles four different and almost non-overlapping cases: loading a "single" float immediate into a GPR, loading a "single" float immediate into a FPR, and the same couple for a "double" float immediate. It's better to move each `else if` branch into a separate method. llvm-svn: 374164
*	[BPF] do compile-once run-everywhere relocation for bitfields	Yonghong Song	2019-10-08	7	-96/+371
	A bpf specific clang intrinsic is introduced:
	    u32 __builtin_preserve_field_info(member_access, info_kind)
	Depending on info_kind, different information will be returned to the program. A relocation is also recorded for this builtin so that the bpf loader can patch the instruction on the target host. This clang intrinsic is used to get certain information to facilitate struct/union member relocations.
	The offset relocation is extended by 4 bytes to include the relocation kind. Currently supported relocation kinds are
	    enum {
	        FIELD_BYTE_OFFSET = 0,
	        FIELD_BYTE_SIZE,
	        FIELD_EXISTENCE,
	        FIELD_SIGNEDNESS,
	        FIELD_LSHIFT_U64,
	        FIELD_RSHIFT_U64,
	    };
	for __builtin_preserve_field_info. The old access offset relocation is covered by FIELD_BYTE_OFFSET = 0.
	An example:
	    struct s {
	        int a;
	        int b1:9;
	        int b2:4;
	    };
	    enum {
	        FIELD_BYTE_OFFSET = 0,
	        FIELD_BYTE_SIZE,
	        FIELD_EXISTENCE,
	        FIELD_SIGNEDNESS,
	        FIELD_LSHIFT_U64,
	        FIELD_RSHIFT_U64,
	    };
	    void bpf_probe_read(void *, unsigned, const void *);
	    int field_read(struct s *arg) {
	      unsigned long long ull = 0;
	      unsigned offset = __builtin_preserve_field_info(arg->b2, FIELD_BYTE_OFFSET);
	      unsigned size = __builtin_preserve_field_info(arg->b2, FIELD_BYTE_SIZE);
	    #ifdef USE_PROBE_READ
	      bpf_probe_read(&ull, size, (const void *)arg + offset);
	      unsigned lshift = __builtin_preserve_field_info(arg->b2, FIELD_LSHIFT_U64);
	    #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	      lshift = lshift + (size << 3) - 64;
	    #endif
	    #else
	      switch(size) {
	      case 1:
	        ull = *(unsigned char *)((void *)arg + offset); break;
	      case 2:
	        ull = *(unsigned short *)((void *)arg + offset); break;
	      case 4:
	        ull = *(unsigned int *)((void *)arg + offset); break;
	      case 8:
	        ull = *(unsigned long long *)((void *)arg + offset); break;
	      }
	      unsigned lshift = __builtin_preserve_field_info(arg->b2, FIELD_LSHIFT_U64);
	    #endif
	      ull <<= lshift;
	      if (__builtin_preserve_field_info(arg->b2, FIELD_SIGNEDNESS))
	        return (long long)ull >> __builtin_preserve_field_info(arg->b2, FIELD_RSHIFT_U64);
	      return ull >> __builtin_preserve_field_info(arg->b2, FIELD_RSHIFT_U64);
	    }
	There is a minor overhead for bpf_probe_read() on big endian. The code and relocation generated for field_read where bpf_probe_read() is used to access argument data in little endian mode:
	      r3 = r1
	      r1 = 0
	      r1 = 4  <=== relocation (FIELD_BYTE_OFFSET)
	      r3 += r1
	      r1 = r10
	      r1 += -8
	      r2 = 4  <=== relocation (FIELD_BYTE_SIZE)
	      call bpf_probe_read
	      r2 = 51 <=== relocation (FIELD_LSHIFT_U64)
	      r1 = *(u64 *)(r10 - 8)
	      r1 <<= r2
	      r2 = 60 <=== relocation (FIELD_RSHIFT_U64)
	      r0 = r1
	      r0 >>= r2
	      r3 = 1  <=== relocation (FIELD_SIGNEDNESS)
	      if r3 == 0 goto LBB0_2
	      r1 s>>= r2
	      r0 = r1
	    LBB0_2:
	      exit
	Compared to the above code between relocations FIELD_LSHIFT_U64 and FIELD_RSHIFT_U64, the code in big endian mode has four more instructions:
	      r1 = 41 <=== relocation (FIELD_LSHIFT_U64)
	      r6 += r1
	      r6 += -64
	      r6 <<= 32
	      r6 >>= 32
	      r1 = *(u64 *)(r10 - 8)
	      r1 <<= r6
	      r2 = 60 <=== relocation (FIELD_RSHIFT_U64)
	The code and relocation generated when using direct load:
	      r2 = 0
	      r3 = 4
	      r4 = 4
	      if r4 s> 3 goto LBB0_3
	      if r4 == 1 goto LBB0_5
	      if r4 == 2 goto LBB0_6
	      goto LBB0_9
	    LBB0_6:                                 # %sw.bb1
	      r1 += r3
	      r2 = *(u16 *)(r1 + 0)
	      goto LBB0_9
	    LBB0_3:                                 # %entry
	      if r4 == 4 goto LBB0_7
	      if r4 == 8 goto LBB0_8
	      goto LBB0_9
	    LBB0_8:                                 # %sw.bb9
	      r1 += r3
	      r2 = *(u64 *)(r1 + 0)
	      goto LBB0_9
	    LBB0_5:                                 # %sw.bb
	      r1 += r3
	      r2 = *(u8 *)(r1 + 0)
	      goto LBB0_9
	    LBB0_7:                                 # %sw.bb5
	      r1 += r3
	      r2 = *(u32 *)(r1 + 0)
	    LBB0_9:                                 # %sw.epilog
	      r1 = 51
	      r2 <<= r1
	      r1 = 60
	      r0 = r2
	      r0 >>= r1
	      r3 = 1
	      if r3 == 0 goto LBB0_11
	      r2 s>>= r1
	      r0 = r2
	    LBB0_11:                                # %sw.epilog
	      exit
	Considering that the verifier is able to do limited constant propagation following branches, the following is the code actually traversed:
	      r2 = 0
	      r3 = 4  <=== relocation
	      r4 = 4  <=== relocation
	      if r4 s> 3 goto LBB0_3
	    LBB0_3:                                 # %entry
	      if r4 == 4 goto LBB0_7
	    LBB0_7:                                 # %sw.bb5
	      r1 += r3
	      r2 = *(u32 *)(r1 + 0)
	    LBB0_9:                                 # %sw.epilog
	      r1 = 51 <=== relocation
	      r2 <<= r1
	      r1 = 60 <=== relocation
	      r0 = r2
	      r0 >>= r1
	      r3 = 1
	      if r3 == 0 goto LBB0_11
	      r2 s>>= r1
	      r0 = r2
	    LBB0_11:                                # %sw.epilog
	      exit
	For the native load case, the load size is calculated to be the same as the load width LLVM would otherwise use to load the value from which the bitfield is then extracted. Differential Revision: https://reviews.llvm.org/D67980 llvm-svn: 374099
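The shift amounts in the listings above (51 and 60 on little endian) can be checked with a small worked example. The following is a minimal sketch, not code from the patch: the helper names `extract_bitfield`, `le_lshift`, and `le_rshift` are hypothetical, and the formulas assume the field was loaded into the low bits of a 64-bit register on a little-endian host, with `b2` starting at bit 9 of its 4-byte storage unit and being 4 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Extract a bitfield from a value already loaded into a 64-bit register,
// mirroring the "ull <<= lshift; ... >> rshift" tail of field_read().
// lshift, rshift and is_signed stand in for the values a loader would patch
// in for FIELD_LSHIFT_U64, FIELD_RSHIFT_U64 and FIELD_SIGNEDNESS.
inline std::int64_t extract_bitfield(std::uint64_t loaded, unsigned lshift,
                                     unsigned rshift, bool is_signed) {
  assert(lshift < 64 && rshift < 64);
  // After the left shift, the field's top bit sits at bit 63.
  std::uint64_t shifted = loaded << lshift;
  if (is_signed) {
    // Arithmetic right shift sign-extends the field (guaranteed since C++20,
    // and the behavior of mainstream compilers before that).
    return static_cast<std::int64_t>(shifted) >> rshift;
  }
  // Logical right shift zero-extends the field.
  return static_cast<std::int64_t>(shifted >> rshift);
}

// Assumed little-endian shift amounts for a field of `width` bits that starts
// at bit `bit_offset` of the loaded value.
inline unsigned le_lshift(unsigned bit_offset, unsigned width) {
  assert(bit_offset + width <= 64);
  return 64 - (bit_offset + width);
}

inline unsigned le_rshift(unsigned width) {
  assert(width >= 1 && width <= 64);
  return 64 - width;
}
```

With `b1 = 0x1ff` and `b2 = -3`, the 32-bit storage unit holds `0x1BFF`; shifting left by 51 and arithmetically right by 60 recovers -3. On big endian, the adjustment in the example, `lshift + (size << 3) - 64`, turns 41 into 9 for a 4-byte read.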
*	AMDGPU: Fix i16 arithmetic pattern redundancy	Matt Arsenault	2019-10-08	1	-78/+23
	There were 2 problems here. First, these patterns were duplicated to handle the inverted shift operands instead of using the commuted PatFrags. Second, the zext folding patterns don't apply to the subtargets that do not zero the high bits. They should be skipped instead of inserting the extension. The zeroing high code would be emitted when necessary anyway. This was also emitting unnecessary zexts in cases where the high bits were undefined. llvm-svn: 374092
*	Revert "[LoopVectorize][PowerPC] Estimate int and float register pressure separately in loop-vectorize"	Jinsong Ji	2019-10-08	11	-54/+15
	Also Revert "[LoopVectorize] Fix non-debug builds after rL374017" This reverts commit 9f41deccc0e648a006c9f38e11919f181b6c7e0a. This reverts commit 18b6fe07bcf44294f200bd2b526cb737ed275c04. The patch is breaking the PowerPC internal build. Checked with the author; reverting on his behalf for now because of the timezone difference. llvm-svn: 374091
*	AMDGPU: Add offsets to MMO when lowering buffer intrinsics	Tom Stellard	2019-10-08	2	-11/+73
	Summary: Without offsets on the MachineMemOperands (MMOs), MachineInstr::mayAlias() will return true for all reads and writes to the same resource descriptor. This leads to O(N^2) complexity in the MachineScheduler when analyzing dependencies of buffer loads and stores. It also limits the SILoadStoreOptimizer from merging more instructions. This patch reduces the compile time of one pathological compute shader from 12 seconds to 1 second. Reviewers: arsenm, nhaehnle Reviewed By: arsenm Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, jfb, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D65097 llvm-svn: 374087
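To illustrate why offsets matter for the alias query, here is a minimal sketch under stated assumptions. `MemAccess` and `may_alias` are hypothetical names, not LLVM's MachineMemOperand API, and both accesses are assumed to go through the same resource descriptor. Without a known offset the only safe answer is "may alias"; with offsets, disjoint byte ranges can be proven independent.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a memory operand on one resource
// descriptor: an optional byte offset and an access size in bytes.
struct MemAccess {
  bool has_offset;       // false models an operand with no offset recorded
  std::int64_t offset;   // byte offset into the resource (valid if has_offset)
  std::int64_t size;     // number of bytes accessed
};

// Conservative overlap test: returns false only when the two byte ranges are
// provably disjoint.
inline bool may_alias(const MemAccess &a, const MemAccess &b) {
  assert(a.size > 0 && b.size > 0);
  if (!a.has_offset || !b.has_offset)
    return true;  // nothing known, so every pair looks dependent
  const std::int64_t a_end = a.offset + a.size;
  const std::int64_t b_end = b.offset + b.size;
  return a.offset < b_end && b.offset < a_end;
}
```

In this model, N accesses without offsets yield a dependency between every pair, which is the quadratic behavior the commit message describes.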
*	[AMDGPU] Disable unused gfx10 dpp instructions	Stanislav Mekhanoshin	2019-10-08	2	-0/+8
	Inhibit generation of unused real dpp instructions on gfx10 just like it is done on other subtargets. This does not change anything because these are illegal anyway and not accepted, but it does reduce the number of instruction definitions generated. Differential Revision: https://reviews.llvm.org/D68607 llvm-svn: 374083
*	[WebAssembly] Fix a bug in 'try' placement	Heejin Ahn	2019-10-08	1	-13/+23
	Summary: When searching for the local expression tree created by stackified registers, for 'block' placement, we start the search from the previous instruction of a BB's terminator. But in 'try''s case, we should start from the previous instruction of a call that can throw, or an EH_LABEL that precedes the call, because the return values of the call's previous instructions can be stackified and consumed by the throwing call. For example,
	```
	i32.call @foo
	call @bar         ; may throw
	br $label0
	```
	In this case, if we start the search from the previous instruction of the terminator (`br` here), we end up stopping at `call @bar` and place a 'try' between `i32.call @foo` and `call @bar`, because `call @bar` does not have a return value so it is not a local expression tree of `br`. But in this case, unlike when placing 'block's, we should start the search from `call @bar`, because the return value of `i32.call @foo` is stackified and used by `call @bar`. Reviewers: dschuff Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68619 llvm-svn: 374073
*	[DebugInfo][If-Converter] Update call site info during the optimization	Nikola Prica	2019-10-08	2	-3/+3
	During the If-Converter optimization, pay attention when copying or deleting call instructions in order to keep call site information in a valid state. Reviewers: aprantl, vsk, efriedma Reviewed By: vsk, efriedma Differential Revision: https://reviews.llvm.org/D66955 llvm-svn: 374068
*	[Mips] Emit proper ABI for _mcount calls	Mirko Brkusanin	2019-10-08	2	-0/+49
	When the -pg option is present, a call to _mcount is inserted into every function. However, since the proper ABI was not followed, the generated gmon.out did not give proper results. By inserting the needed instructions before every _mcount call we can fix this. Differential Revision: https://reviews.llvm.org/D68390 llvm-svn: 374055
*	fix fmls fp16	Sebastian Pop	2019-10-08	1	-12/+41
	Tim Northover remarked that the added patterns for fmls fp16 produce wrong code in case the fsub instruction has a multiplication as its first operand, i.e., all the patterns FMLSv*_OP1:
	> define <8 x half> @test_FMLSv8f16_OP1(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
	> ; CHECK-LABEL: test_FMLSv8f16_OP1:
	> ; CHECK: fmls {{v[0-9]+}}.8h, {{v[0-9]+}}.8h, {{v[0-9]+}}.8h
	> entry:
	>
	>   %mul = fmul fast <8 x half> %c, %b
	>   %sub = fsub fast <8 x half> %mul, %a
	>   ret <8 x half> %sub
	> }
	>
	> This doesn't look right to me. The exact instruction produced is "fmls
	> v0.8h, v2.8h, v1.8h", which I think calculates "v0 - v2*v1", but the
	> IR is calculating "v2*v1-v0". The equivalent <4 x float> code also
	> doesn't emit an fmls.
	This patch generates an fmla and negates the value of the operand2 of the fsub. Inspecting the pattern match, I found that there was another mistake in the opcode to be selected: matching FMULv4*16 should generate FMLSv4*16 and not FMLSv2*32. Tested on aarch64-linux with make check-all. Differential Revision: https://reviews.llvm.org/D67990 llvm-svn: 374044
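The sign error is easy to see with scalar arithmetic. The following is a minimal sketch in which `fmls` and `fmla` are plain scalar models of the instruction semantics quoted above (accumulator minus or plus a product), not intrinsics; the function names are illustrative only.

```cpp
#include <cassert>

// Scalar models of the instruction semantics: acc - n * m and acc + n * m.
inline float fmls(float acc, float n, float m) { return acc - n * m; }
inline float fmla(float acc, float n, float m) { return acc + n * m; }

// What the IR in the test computes: %mul = c * b, %sub = %mul - a.
inline float ir_op1(float a, float b, float c) { return c * b - a; }

// The lowering reported as wrong: fmls with a as the accumulator gives
// a - c * b, the negation of what the IR asks for.
inline float buggy_lowering(float a, float b, float c) { return fmls(a, c, b); }

// The lowering described by the patch: negate the second fsub operand (a)
// and accumulate the product with fmla, giving (-a) + c * b.
inline float fixed_lowering(float a, float b, float c) {
  return fmla(-a, c, b);
}
```

For a = 1, b = 2, c = 3 the IR result is 5, the buggy lowering yields -5, and the fixed lowering yields 5.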
*	[SVE][IR] Scalable Vector size queries and IR instruction support	Graham Hunter	2019-10-08	1	-1/+1
	* Adds a TypeSize struct to represent the known minimum size of a type along with a flag to indicate that the runtime size is an integer multiple of that size
	* Converts existing size query functions from Type.h and DataLayout.h to return a TypeSize result
	* Adds convenience methods (including a transparent conversion operator to uint64_t) so that most existing code 'just works' as if the return values were still scalars.
	* Uses the new size queries along with ElementCount to ensure that all supported instructions used with scalable vectors can be constructed in IR.
	Reviewers: hfinkel, lattner, rkruppe, greened, rovka, rengolin, sdesmalen Reviewed By: rovka, sdesmalen Differential Revision: https://reviews.llvm.org/D53137 llvm-svn: 374042
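The idea of a size with a "scalable" flag can be sketched as follows. This is an illustrative model only: `TypeSizeSketch` and `runtime_size` are hypothetical names, and the assertion inside the conversion operator is an assumption about how such a conversion could be kept safe, not a statement about LLVM's actual TypeSize class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a type size: a known minimum size in bits plus a
// flag saying the real size is an integer multiple of that minimum that is
// only known at run time.
struct TypeSizeSketch {
  std::uint64_t min_size;
  bool scalable;

  // Transparent conversion so code written for scalar sizes keeps compiling.
  // Assumption of this sketch: converting a scalable size is a logic error.
  operator std::uint64_t() const {
    assert(!scalable && "scalable sizes have no single scalar value");
    return min_size;
  }
};

// Actual size once the runtime multiple (here called vscale) is known.
inline std::uint64_t runtime_size(const TypeSizeSketch &ts,
                                  std::uint64_t vscale) {
  assert(vscale >= 1);
  return ts.scalable ? ts.min_size * vscale : ts.min_size;
}
```

A fixed 128-bit vector converts to 128 as before, while a scalable vector with a 128-bit minimum reports 512 bits when the runtime multiple is 4.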
*	AMDGPU: Propagate undef flag during pre-RA exec mask optimizations	Nicolai Haehnle	2019-10-08	1	-6/+7
	Summary: Issue: https://github.com/GPUOpen-Drivers/llpc/issues/204 Reviewers: arsenm, rampitec Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68184 llvm-svn: 374041
*	[ISEL][ARM][AARCH64] Tracking simple parameter forwarding registers	Nikola Prica	2019-10-08	3	-3/+29
	Support for tracking registers that forward function parameters into the following function frame. For now we only support cases when a parameter is forwarded through a single register. Reviewers: aprantl, vsk, t.p.northover Reviewed By: vsk Differential Revision: https://reviews.llvm.org/D66953 llvm-svn: 374033
*	[ARM] Generate vcmp instead of vcmpe	Kristof Beyls	2019-10-08	5	-69/+36
	Based on the discussion in http://lists.llvm.org/pipermail/llvm-dev/2019-October/135574.html, the conclusion was reached that the ARM backend should produce vcmp instead of vcmpe instructions by default, i.e. not produce an Invalid Operation exception when either argument in a floating point compare is a quiet NaN. In the future, after constrained floating point intrinsics for floating point compare have been introduced, vcmpe instructions probably should be produced for those intrinsics, depending on the exact semantics they'll be defined to have. This patch logically consists of the following parts:
	- Revert http://llvm.org/viewvc/llvm-project?rev=294945&view=rev and http://llvm.org/viewvc/llvm-project?rev=294968&view=rev, which implemented fine-tuning for when to produce vcmpe (i.e. not do it for equality comparisons). The complexity introduced by those patches isn't needed anymore if we just always produce vcmp instead. Maybe these patches need to be reintroduced once support is needed to map potential LLVM-IR constrained floating point compare intrinsics to the ARM instruction set.
	- Simply select vcmp, instead of vcmpe, see simple changes in lib/Target/ARM/ARMInstrVFP.td
	- Adapt lots of tests that tested for vcmpe (instead of vcmp). For all of these tests, the intent of what is tested for isn't related to whether the vcmp should produce an Invalid Operation exception or not.
	Fixes PR43374. Differential Revision: https://reviews.llvm.org/D68463 llvm-svn: 374025
*	[LoopVectorize][PowerPC] Estimate int and float register pressure separately in loop-vectorize	Zi Xuan Wu	2019-10-08	11	-15/+54
	In loop-vectorize, the interleave count and vector factor depend on the target register number. Currently, it does not estimate register pressure separately for different register classes (especially for scalar types, where float should not be counted together with int), so it's not accurate. Specifically, it causes too much interleaving/unrolling, resulting in too many register spills in the loop body and hurting performance. So we need to classify the register classes at the IR level; importantly, these are abstract register classes, not the target register classes the backend provides in its td file. They are used to establish the mapping between the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types. For example, on the POWER target, the register count is special when VSX is enabled. When VSX is enabled, the number of int scalar registers is 32 (GPR) and float is 64 (VSR), while int and float vector registers are both 64 (VSR). So there should be 2 kinds of register class when VSX is enabled, and 3 kinds of register class when VSX is NOT enabled. Run on the POWER target, it gives a big (+~30%) performance improvement in one specific benchmark (503.bwaves_r) of spec2017 and no other obvious regressions. Differential revision: https://reviews.llvm.org/D67148 llvm-svn: 374017
*	AMDGPU/GlobalISel: Clamp G_SITOFP/G_UITOFP sources	Matt Arsenault	2019-10-07	1	-3/+6
	llvm-svn: 373989
*	[X86] Shrink zero extends of gather indices from type less than i32 to types larger than i32.	Craig Topper	2019-10-07	1	-44/+26
	Gather instructions can use i32 or i64 elements for indices. If the index is zero extended from a type smaller than i32 to i64, we can shrink the extend to just extend to i32. llvm-svn: 373982
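A short sketch of why the narrower extend is safe, under one assumption not stated in the commit message: that a gather treats its i32 indices as signed and sign-extends them when forming addresses. A value zero-extended from fewer than 32 bits is always non-negative as an i32, so sign-extending it gives the same 64-bit index as zero-extending directly to i64. The function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Index as originally written: zero-extend the narrow value straight to i64.
inline std::uint64_t index_via_i64(std::uint16_t narrow) {
  return static_cast<std::uint64_t>(narrow);
}

// Index after shrinking: zero-extend only to i32, then let the (assumed)
// signed-index gather sign-extend the i32 to 64 bits.
inline std::uint64_t index_via_i32(std::uint16_t narrow) {
  const std::int32_t idx32 =
      static_cast<std::int32_t>(static_cast<std::uint32_t>(narrow));
  assert(idx32 >= 0);  // the top bit of a value zero-extended from i16 is 0
  return static_cast<std::uint64_t>(static_cast<std::int64_t>(idx32));
}
```

The same argument would not hold for an index that is already 32 bits wide, since its top bit may be set.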
*	[X86] Add new calling convention that guarantees tail call optimization	Reid Kleckner	2019-10-07	5	-12/+23
	When the target option GuaranteedTailCallOpt is specified, calls with the fastcc calling convention will be transformed into tail calls if they are in tail position. This diff adds a new calling convention, tailcc, currently supported only on X86, which behaves the same way as fastcc, except that the GuaranteedTailCallOpt flag does not need to be enabled in order to enable tail call optimization. Patch by Dwight Guth <dwight.guth@runtimeverification.com>! Reviewed By: lebedev.ri, paquette, rnk Differential Revision: https://reviews.llvm.org/D67855 llvm-svn: 373976
*	[WebAssembly] Fix unwind mismatch stat computation	Heejin Ahn	2019-10-07	1	-3/+5
	Summary: There was a bug when computing the number of unwind destination mismatches in CFGStackify. When there are many mismatched calls that share the same (original) destination BB, they have to be counted separately. This also fixes a typo and runs `fixUnwindMismatches` only when the wasm exception handling is enabled. This is to prevent unnecessary computations and does not change behavior. Reviewers: dschuff Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68552 llvm-svn: 373975
*	[WebAssembly] Add memory intrinsics handling to mayThrow()	Heejin Ahn	2019-10-07	1	-1/+15
	Summary: Previously, `WebAssembly::mayThrow()` assumed all inputs are global addresses. But intrinsics such as `memcpy`, `memmove`, or `memset` are lowered to external symbols in instruction selection and later emitted as library calls, and these functions don't throw. This patch adds handling of those memory intrinsics to the `mayThrow` function. But while most libcalls don't throw, we can't guarantee that none of them throw, so currently we conservatively return true for all other external symbols. I think a better way to solve this problem is to embed 'nounwind' info in `TargetLowering::CallLoweringInfo`, so that we can access the info from the backend. This will also enable transferring 'nounwind' properties of LLVM IR instructions. Currently we don't transfer that info and we can only access properties of callee functions, if the callees are within the module. Other targets don't need this info in the backend because they do all the processing before isel, but it will help us because that info will reduce the code size increase in fixing unwind destination mismatches in CFGStackify. But for now we return false for these memory intrinsics and true for all other libcalls conservatively. Reviewers: dschuff Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68553 llvm-svn: 373967
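The conservative rule for external symbols described above can be sketched as a small predicate. This is a simplified illustration, not the backend's code: `may_throw_external_symbol` is a hypothetical name, and it works on a plain C string rather than on a machine operand.

```cpp
#include <cassert>
#include <cstring>

// Decide whether a call to an external symbol (a lowered library call) may
// throw. The three memory routines named in the commit message are known not
// to throw; every other external symbol is conservatively assumed to throw.
inline bool may_throw_external_symbol(const char *name) {
  assert(name != nullptr);
  if (std::strcmp(name, "memcpy") == 0 ||
      std::strcmp(name, "memmove") == 0 ||
      std::strcmp(name, "memset") == 0)
    return false;
  return true;
}
```

Returning true for unknown symbols costs code size rather than correctness, which matches the trade-off the message describes.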
*	AMDGPU/GlobalISel: Handle more G_INSERT cases	Matt Arsenault	2019-10-07	3	-57/+55
	Start manually writing a table to get the subreg index. TableGen should probably generate this, but I'm not sure what it looks like in the arbitrary case where subregisters are allowed to not fully cover the super-registers. llvm-svn: 373947
*	GlobalISel: Partially implement lower for G_INSERT	Matt Arsenault	2019-10-07	1	-7/+3
	llvm-svn: 373946
*	AMDGPU/GlobalISel: Fix selection of 16-bit shifts	Matt Arsenault	2019-10-07	1	-3/+6
	llvm-svn: 373945
*	AMDGPU/GlobalISel: Select VALU G_AMDGPU_FFBH_U32	Matt Arsenault	2019-10-07	1	-1/+1
	llvm-svn: 373944
*	AMDGPU/GlobalISel: Use S_MOV_B64 for inline constants	Matt Arsenault	2019-10-07	1	-20/+27
	This hides some defects in SIFoldOperands when the immediates are split. llvm-svn: 373943
*	AMDGPU/GlobalISel: Widen 16-bit G_MERGE_VALUEs sources	Matt Arsenault	2019-10-07	1	-18/+29
	Continue making a mess of merge/unmerge legality. llvm-svn: 373942
*	AMDGPU/GlobalISel: Select more G_INSERT cases	Matt Arsenault	2019-10-07	1	-20/+78
	At minimum handle the s64 insert type, which is emitted in real cases during legalization. We really need TableGen to emit something like the inverse of composeSubRegIndices to determine the subreg index to use. llvm-svn: 373938
*	GlobalISel: Add target pre-isel instructions	Matt Arsenault	2019-10-07	5	-2/+15
	Allows targets to introduce regbankselectable pseudo-instructions. Currently the closest feature to this is an intrinsic. However this requires creating a public intrinsic declaration. This litters the public intrinsic namespace with operations we don't necessarily want to expose to IR producers, and would rather leave as private to the backend. Use a new instruction bit. A previous attempt tried to keep using enum value ranges, but it turned into a mess. llvm-svn: 373937
*	Second attempt to add iterator_range::empty()	Jordan Rose	2019-10-07	4	-14/+14
	Doing this makes MSVC complain that `empty(someRange)` could refer to either C++17's std::empty or LLVM's llvm::empty, which previously we avoided via SFINAE because std::empty is defined in terms of an empty member rather than begin and end. So, switch callers over to the new method as it is added. https://reviews.llvm.org/D68439 llvm-svn: 373935
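A member `empty()` sidesteps the free-function ambiguity because callers write `range.empty()` instead of `empty(range)`. Below is a minimal sketch of a range wrapper with such a member; `iterator_range_sketch`, `make_range_sketch`, and `sum_or_zero` are hypothetical names used for illustration, not LLVM's classes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal range wrapper: a pair of iterators with begin(),
// end(), and a member empty() defined in terms of them.
template <typename IteratorT>
class iterator_range_sketch {
  IteratorT begin_it, end_it;

public:
  iterator_range_sketch(IteratorT b, IteratorT e) : begin_it(b), end_it(e) {}
  IteratorT begin() const { return begin_it; }
  IteratorT end() const { return end_it; }
  bool empty() const { return begin_it == end_it; }
};

template <typename IteratorT>
iterator_range_sketch<IteratorT> make_range_sketch(IteratorT b, IteratorT e) {
  return iterator_range_sketch<IteratorT>(b, e);
}

// Example caller using the member form rather than a free empty(range).
inline int sum_or_zero(const std::vector<int> &values) {
  auto range = make_range_sketch(values.begin(), values.end());
  if (range.empty())
    return 0;
  int sum = 0;
  for (int v : range)
    sum += v;
  return sum;
}
```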
*	[X86][SSE] getTargetShuffleInputs - move VT.isSimple/isVector checks inside. NFCI.	Simon Pilgrim	2019-10-07	1	-4/+11
	Stop all the callers from having to check the value type before calling getTargetShuffleInputs. llvm-svn: 373915
*	[Mips] Always save RA when disabling frame pointer elimination	Simon Atanasyan	2019-10-07	1	-2/+5
	This ensures that frame-based unwinding will continue to work when calling a noreturn function; there is not much use having the caller's frame pointer saved if you don't also have the caller's program counter. Patch by James Clarke. Differential Revision: https://reviews.llvm.org/D68542 llvm-svn: 373907
*	[Mips] Fix evaluating J-format branch targets	Simon Atanasyan	2019-10-07	1	-4/+7
	J/JAL/JALX/JALS are absolute branches, but stay within the current 256 MB-aligned region, so we must include the high bits of the instruction address when calculating the branch target. Patch by James Clarke. Differential Revision: https://reviews.llvm.org/D68548 llvm-svn: 373906
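The target computation can be written out for the classic 32-bit J/JAL encoding. This is a minimal sketch, not the patch itself: `j_format_target` is a hypothetical name, the 26-bit field is assumed to be shifted left by 2, and the region is taken from the address of the delay-slot instruction (instruction address + 4), as the MIPS32 architecture describes for J and JAL. Other encodings such as JALS use a different shift.

```cpp
#include <cassert>
#include <cstdint>

// Compute the target of a MIPS32 J/JAL instruction located at insn_addr,
// given its 26-bit instr_index field.
inline std::uint32_t j_format_target(std::uint32_t insn_addr,
                                     std::uint32_t instr_index) {
  assert(instr_index < (1u << 26));
  // The upper 4 bits come from the delay-slot address, which keeps the jump
  // inside the current 256 MB-aligned region.
  const std::uint32_t delay_slot_addr = insn_addr + 4;
  return (delay_slot_addr & 0xF0000000u) | (instr_index << 2);
}
```

For an instruction at 0x80001000 with an index field of 0x400, the target is 0x80001000; dropping the high bits would give 0x00001000 instead.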
*	Test commit	Mirko Brkusanin	2019-10-07	1	-1/+1
	Fix comment. llvm-svn: 373901
*	[X86] Support LEA64_32r in processInstrForSlow3OpLEA and use INC/DEC when possible.	Craig Topper	2019-10-07	1	-80/+110
	Move the erasing and iterator updating inside to match the other slow LEA function. I've adapted code from optTwoAddrLEA and basically rebuilt the implementation here. We do lose the kill flags now just like optTwoAddrLEA. This runs late enough in the pipeline that this shouldn't really be a problem. llvm-svn: 373877
*	[X86][AVX] Access a scalar float/double as a free extract from a broadcast load (PR43217)	Simon Pilgrim	2019-10-06	1	-11/+24
	If an fp scalar is loaded and then used as both a scalar and a vector broadcast, perform the load as a broadcast and then extract the scalar for 'free' from the 0th element. This involved switching the order of the X86ISD::BROADCAST combines so we only convert to X86ISD::BROADCAST_LOAD once all other canonicalizations have been attempted. Adds a DAGCombinerInfo::recursivelyDeleteUnusedNodes wrapper. Fixes PR43217 Differential Revision: https://reviews.llvm.org/D68544 llvm-svn: 373871
*	Fix signed/unsigned warning. NFCI	Simon Pilgrim	2019-10-06	1	-1/+1
	llvm-svn: 373870
*	[NFC][PowerPC] Reorganize CRNotPat multiclass patterns in PPCInstrInfo.td	Amy Kwan	2019-10-06	1	-84/+91
	This patch aims to group together the `CRNotPat` multi class instantiations within the `PPCInstrInfo.td` file. Integer instantiations of the multi class are grouped together into a section, and the floating point patterns are separated into their own section. Differential Revision: https://reviews.llvm.org/D67975 llvm-svn: 373869
*	[X86][SSE] Remove resolveTargetShuffleInputs and use getTargetShuffleInputs directly.	Simon Pilgrim	2019-10-06	1	-42/+22
	Move the resolveTargetShuffleInputsAndMask call to after the shuffle mask combine before the undef/zero constant fold instead. llvm-svn: 373868
*	[X86][SSE] Don't merge known undef/zero elements into target shuffle masks.	Simon Pilgrim	2019-10-06	1	-30/+50
	Replaces setTargetShuffleZeroElements with getTargetShuffleAndZeroables which reports the Zeroable elements but doesn't merge them into the decoded target shuffle mask (the merging has been moved up into getTargetShuffleInputs until we can get rid of it entirely). This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs but instead we cache them in a parallel Zeroable mask. llvm-svn: 373867
*	[X86] Add custom type legalization for v16i64->v16i8 truncate and v8i64->v8i8 truncate when v8i64 isn't legal	Craig Topper	2019-10-06	1	-3/+23
	Summary: The default legalization for v16i64->v16i8 tries to create a multiple stage truncate, concatenating after each stage and truncating again. But avx512 implements truncates with multiple uops. So it should be better to truncate all the way to the desired element size and then concatenate the pieces using unpckl instructions. This minimizes the number of 2 uop truncates. The unpcks are all single uop instructions. I tried to handle this by just custom splitting the v16i64->v16i8 shuffle, and hoped that the DAG combiner would leave the two halves in the state needed to make D68374 do the job for each half. This worked for the first half, but the second half got messed up. So I've implemented custom handling for v8i64->v8i8 when v8i64 needs to be split to produce the VTRUNCs directly. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68428 llvm-svn: 373864
*	[X86][SSE] resolveTargetShuffleInputs - call getTargetShuffleInputs instead of using setTargetShuffleZeroElements directly. NFCI.	Simon Pilgrim	2019-10-06	1	-5/+4
	llvm-svn: 373855
*	[NFC] Replace 'isDarwin' with 'IsDarwin'	Xiangling Liao	2019-10-06	7	-38/+38
	Summary: Replace 'isDarwin' with 'IsDarwin' based on LLVM naming convention. Differential Revision: https://reviews.llvm.org/D68336 llvm-svn: 373852
*	[X86][AVX] combineExtractSubvector - merge duplicate variables. NFCI.	Simon Pilgrim	2019-10-06	1	-18/+17
	llvm-svn: 373849
*	[X86][SSE] matchVectorShuffleAsBlend - use Zeroable element mask directly.	Simon Pilgrim	2019-10-06	1	-34/+13
	We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly. This allows us to remove createTargetShuffleMask. This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle mask isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask. llvm-svn: 373846
*	[X86] Enable AVX512BW for memcmp()	David Zarzycki	2019-10-06	1	-2/+7
	llvm-svn: 373845
*	AMDGPU/GlobalISel: Fall back on weird G_EXTRACT offsets	Matt Arsenault	2019-10-06	1	-2/+5
	llvm-svn: 373842
*	AMDGPU/GlobalISel: RegBankSelect mul24 intrinsics	Matt Arsenault	2019-10-06	1	-0/+2
	llvm-svn: 373841
*	AMDGPU/GlobalISel: RegBankSelect DS GWS intrinsics	Matt Arsenault	2019-10-06	1	-0/+35
	llvm-svn: 373840