path: root/llvm/lib/Target
Commit message (Author, Date, Files changed, Lines -/+)
...
* Extract LaneBitmask into a separate type (Krzysztof Parzyszek, 2016-12-15, 6 files, -25/+26)
  Specifically avoid implicit conversions from/to integral types to avoid potential
  errors when changing the underlying type. For example, a typical initialization of
  a "full" mask was "LaneMask = ~0u", which would result in a value of
  0x00000000FFFFFFFF if the type was extended to uint64_t.
  Differential Revision: https://reviews.llvm.org/D27454
  llvm-svn: 289820
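  A minimal sketch of the idea (illustrative names, not the actual LLVM class): a
  wrapper whose conversions are explicit, so a bare "~0u" can no longer initialize a
  mask implicitly and the "all lanes" value follows the underlying type:

      #include <cstdint>

      struct MaskSketch {
        using Type = uint32_t;   // widening this later does not silently change getAll()
        constexpr explicit MaskSketch(Type V) : Mask(V) {}
        static constexpr MaskSketch getAll()  { return MaskSketch(~Type(0)); }
        static constexpr MaskSketch getNone() { return MaskSketch(Type(0)); }
        constexpr Type getAsInteger() const   { return Mask; }
      private:
        Type Mask;
      };

      // MaskSketch M = ~0u;              // would no longer compile: conversion is explicit
      constexpr MaskSketch M = MaskSketch::getAll();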
* [CostModel][X86] Updated reverse shuffle costs (Simon Pilgrim, 2016-12-15, 1 file, -5/+95)
  llvm-svn: 289819
* [Power9] Allow AnyExt immediates for XXSPLTIB (Nemanja Ivanovic, 2016-12-15, 2 files, -7/+7)
  In some situations, the BUILD_VECTOR node that builds a v16i8 vector by a splat of
  an i8 constant will end up with signed 8-bit values; in other situations, it will
  end up with unsigned ones. Handle both situations.
  Fixes PR31340.
  llvm-svn: 289804
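  A rough illustration of why both forms must be handled (hypothetical constant, not
  taken from the patch): the same splat byte can appear as a signed or an unsigned
  8-bit value, and both describe identical bits:

      #include <cassert>
      #include <cstdint>

      int main() {
        int8_t  AsSigned   = -16;    // one way the splatted constant may be presented
        uint8_t AsUnsigned = 0xF0;   // the other way it may be presented
        assert(uint8_t(AsSigned) == AsUnsigned);  // same bit pattern either way
        return 0;
      }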
* [AVR] Support floats in the instrumentation pass (Dylan McKay, 2016-12-15, 1 file, -14/+14)
  This also refactors some common code into the 'GetTypeName' method.
  llvm-svn: 289803
* [Thumb] Teach ISel how to lower compares of AND bitmasks efficiently (Sjoerd Meijer, 2016-12-15, 2 files, -4/+141)
  This is essentially a recommit of r285893, but with a correctness fix. The problem
  with the original commit was that this:

    bic r5, r7, #31
    cbz r5, .LBB2_10

  got rewritten into:

    lsrs r5, r7, #5
    beq  .LBB2_10

  The result in destination register r5 is not the same, and this is incorrect when
  r5 is not dead. So this fix includes checking the uses of the AND destination
  register. Also, compared to the original commit, some regression tests no longer
  needed changing because of this extra check.

  For completeness, this was the original commit message:

  For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more efficient
  instruction selection if the bitmask is one consecutive sequence of set bits
  (32 - clz(bm) - ctz(bm) == popcount(bm)).

  1) If the bitmask touches the LSB, then we can remove all the upper bits and set
     the flags by doing one LSLS.
  2) If the bitmask touches the MSB, then we can remove all the lower bits and set
     the flags with one LSRS.
  3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit into
     the sign bit with one LSLS and change the condition query from NE/EQ to MI/PL
     (we could also implement this by shifting into the carry bit and branching on
     BCC/BCS).
  4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower
     zero bits of the mask.

  1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two
  16-bit instructions but can elide the CMP and doesn't require materializing a
  complex immediate, so is also a win. A check of the contiguous-mask condition is
  sketched below.
  Differential Revision: https://reviews.llvm.org/D27761
  llvm-svn: 289794
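  A small, self-contained check of the "one consecutive sequence of set bits"
  condition quoted above (illustrative only, using GCC/Clang builtins; this is not
  the code added by the commit):

      #include <cassert>
      #include <cstdint>

      // True when the mask is a single contiguous run of set bits, i.e.
      // 32 - clz(bm) - ctz(bm) == popcount(bm).
      static bool isContiguousMask(uint32_t BM) {
        if (BM == 0)
          return false;
        return 32 - __builtin_clz(BM) - __builtin_ctz(BM) == __builtin_popcount(BM);
      }

      int main() {
        assert(isContiguousMask(0xFFFFFFE0u));  // ~31: touches the MSB, case 2 (LSRS)
        assert(isContiguousMask(0x000003E0u));  // run in the middle: case 4 (LSLS+LSRS)
        assert(!isContiguousMask(0x00000005u)); // two separate bits: not handled
        return 0;
      }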
* [AVR] Add argument indices to the instrumentation hook functions (Dylan McKay, 2016-12-15, 1 file, -2/+4)
  This allows the instrumentation hook functions to do better pretty-printing.
  llvm-svn: 289793
* Fix for build warning in execute-only support (Prakhar Bahuguna, 2016-12-15, 1 file, -2/+2)
  llvm-svn: 289788
* [ARM] Implement execute-only support in CodeGen (Prakhar Bahuguna, 2016-12-15, 13 files, -17/+113)
  This implements execute-only support for ARM code generation, which prevents the
  compiler from generating data accesses to code sections. The following changes are
  involved:

  * Add the CodeGen option "-arm-execute-only" to the ARM code generator.
  * Add the clang flag "-mexecute-only" as well as the GCC-compatible alias
    "-mpure-code" to enable this option.
  * When enabled, literal pools are replaced with MOVW/MOVT instructions, with VMOV
    used in addition for floating-point literals. As the MOVT instruction is
    required, execute-only support is only available in Thumb mode for targets
    supporting ARMv8-M baseline or Thumb2.
  * Jump tables are placed in data sections when in execute-only mode.
  * The execute-only text section is assigned section ID 0, and is marked as
    unreadable with the SHF_ARM_PURECODE flag with symbol 'y'. This also overrides
    selection of ELF sections for globals.
  llvm-svn: 289784
* [NVPTX] Remove dead #defines from NVPTXUtilities.h. (Justin Lebar, 2016-12-15, 1 file, -3/+0)
  llvm-svn: 289747
* Use PIC relocation model as default for PowerPC64 ELF. (Joerg Sonnenberger, 2016-12-15, 1 file, -0/+4)
  Most of the PowerPC64 code generation for the ELF ABI is already PIC. There are
  four main exceptions:

  (1) Constant pointer arrays etc. should be in writeable sections.
  (2) The TOC restoration NOP after a call is needed for all global symbols. While
      GNU ld has a workaround for questionable GCC self-calls, we trigger the checks
      for calls from COMDAT sections as they cross input sections and are therefore
      not considered self-calls. The current decision is questionable and
      suboptimal, but outside the scope of the change.
  (3) TLS access can not use the initial-exec model.
  (4) Jump tables should use relative addresses. Note that the current encoding
      doesn't work for the large code model, but it is more compact than the default
      for any non-trivial jump table. Improving this is again beyond the scope of
      this change.

  At least (1) and (3) are assumptions made in target-independent code, and
  introducing additional hooks is a bit messy. Testing with clang shows that a -fPIC
  binary is 600KB smaller than the corresponding -fno-pic build. Separate testing
  from improved jump table encodings would explain only about 100KB or so. The rest
  is expected to be a result of more aggressive immediate forming for -fno-pic,
  where the -fPIC binary just uses TOC entries.

  This change brings the LLVM output in line with the GCC output; other PPC64
  compilers like XLC on AIX are known to produce PIC by default as well. The
  relocation model can still be provided explicitly, i.e. when using MCJIT.

  One test case for case (1) is included; other test cases with relocation mode
  sensitive behavior are wired to static for now. They will be reviewed and adjusted
  separately.
  Differential Revision: https://reviews.llvm.org/D26566
  llvm-svn: 289743
* [NVPTX] Remove dead code. (Justin Lebar, 2016-12-14, 5 files, -130/+0)
  I've chosen to remove NVPTXInstrInfo::CanTailMerge but not
  NVPTXInstrInfo::isLoadInstr and isStoreInstr (which are also dead) because while
  the latter two are reasonably useful utilities, the former cannot be used safely:
  It relies on successful address space inference to identify writes to shared
  memory, but addrspace inference is a best-effort thing.
  llvm-svn: 289740
* [Hexagon] Fix some Clang-tidy modernize and Include What You Use warnings; other minor fixes (NFC). (Eugene Zelenko, 2016-12-14, 5 files, -284/+249)
  llvm-svn: 289736
* [NVPTX] Support .maxnreg annotation. (Justin Lebar, 2016-12-14, 3 files, -0/+9)
  Reviewers: tra
  Subscribers: llvm-commits, jholewinski
  Differential Revision: https://reviews.llvm.org/D27638
  llvm-svn: 289729
* [NVPTX] Remove string constants from NVPTXBaseInfo.h. (Justin Lebar, 2016-12-14, 3 files, -165/+88)
  Summary: Previously they were defined as a 2D char array in a header file. This is
  kind of overkill -- we can let the linker lay out these strings however it
  pleases. While we're at it, we might as well just inline these constants where
  they're used, as each of them is used only once.
  Also move NVPTXUtilities.{h,cpp} into namespace llvm.
  Reviewers: tra
  Subscribers: jholewinski, mgorny, llvm-commits
  Differential Revision: https://reviews.llvm.org/D27636
  llvm-svn: 289728
* [ARM] Split 128-bit vectors in BUILD_VECTOR lowering (Eli Friedman, 2016-12-14, 1 file, -0/+21)
  Given that INSERT_VECTOR_ELT operates on D registers anyway, combining 64-bit
  vectors into a 128-bit vector is basically free. Therefore, try to split
  BUILD_VECTOR nodes before giving up and lowering them to a series of
  INSERT_VECTOR_ELT instructions. Sometimes this allows dramatically better
  lowerings; see testcases for examples. Inspired by similar code in the x86 backend
  for AVX.
  Differential Revision: https://reviews.llvm.org/D27624
  llvm-svn: 289706
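  A sketch of the observation in source form (NEON intrinsics for illustration; not
  the lowering code itself): the two 64-bit halves can be built independently and
  then paired, and the pairing maps onto the D-register halves that already make up
  a Q register:

      #include <arm_neon.h>

      // Build a float32x4_t by filling two 64-bit halves first and then combining
      // them; the vcombine step itself should need no extra instruction.
      float32x4_t buildVec(float A, float B, float C, float D) {
        float32x2_t Lo = vset_lane_f32(B, vdup_n_f32(A), 1);
        float32x2_t Hi = vset_lane_f32(D, vdup_n_f32(C), 1);
        return vcombine_f32(Lo, Hi);
      }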
* fix gcc warning about a superfluous ; (Nico Weber, 2016-12-14, 1 file, -1/+1)
  llvm-svn: 289705
* [ARM] Add ARMISD::VLD1DUP to match vld1_dup more consistently. (Eli Friedman, 2016-12-14, 3 files, -19/+92)
  Currently, there are substantial problems forming vld1_dup even if the VDUP
  survives legalization. The lack of an actual node leads to terrible results: not
  only can we not form post-increment vld1_dup instructions, but we form scalar
  pre-increment and post-increment loads which force the loaded value into a GPR.
  This patch fixes that by combining the vdup+load into an ARMISD node before
  DAGCombine messes it up.
  Also includes a crash fix for vld2_dup (see testcase @vld2dupi8_postinc_variable).
  Differential Revision: https://reviews.llvm.org/D27694
  llvm-svn: 289703
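  For reference, a sketch of the kind of source pattern this combine targets (NEON
  intrinsic usage, requires an ARM target; this is not the backend change itself):

      #include <arm_neon.h>

      // Load one float and duplicate it into both lanes.
      float32x2_t splatLoad(const float *P) {
        return vld1_dup_f32(P);
      }

      // The post-increment flavour mentioned above: load-and-splat, then advance
      // the pointer, which previously forced the loaded value through a GPR.
      float32x2_t splatLoadPostInc(const float *&P) {
        float32x2_t V = vld1_dup_f32(P);
        P += 1;
        return V;
      }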
* Fix build failure due to r289674 on certain systems (Yaxun Liu, 2016-12-14, 1 file, -1/+0)
  Removed a useless include which caused a conflict.
  llvm-svn: 289700
* AMDGPU: Emit runtime metadata version 2 as YAML (Yaxun Liu, 2016-12-14, 7 files, -403/+550)
  Differential Revision: https://reviews.llvm.org/D25046
  llvm-svn: 289674
* AMDGPU: Make AllocationPriority of SGPRs higher than VGPRs (Matt Arsenault, 2016-12-14, 1 file, -11/+13)
  Since SGPRs should spill to VGPRs, they should be allocated first. I don't think
  this is sufficient for SGPRs to always spill to VGPRs though.
  llvm-svn: 289671
* Revert "In visitSTORE, always use FindBetterChain, rather than only when ↵Nirav Dave2016-12-141-0/+10
| | | | | | | | | | UseAA is enabled." Reverting due to ARM MCJIT and MIPS LLD error. This reverts commit r289659. llvm-svn: 289667
* AMDGPU: Change vintrp printing (Matt Arsenault, 2016-12-14, 4 files, -6/+37)
  llvm-svn: 289664
* In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. (Nirav Dave, 2016-12-14, 1 file, -10/+0)
  Retrying after fixing, after removing load-store factoring through token factors
  in favor of improved token factor operand pruning.

  Simplify Consecutive Merge Store Candidate Search

  Now that address aliasing is much less conservative, push through simplified store
  merging search which only checks for parallel stores through the chain subgraph.
  This is cleaner, as it separates the non-interfering loads/stores from the
  store-merging logic. When merging stores, search up the chain through a single
  load, and find all possible stores by looking down through a load and a
  TokenFactor to all stores visited. This improves the quality of the output
  SelectionDAG and generally the output CodeGen (with some exceptions).

  Additional Minor Changes:
  1. Finishes removing unused AliasLoad code
  2. Unifies the chain aggregation in the merged stores across code paths
  3. Re-add the Store node to the worklist after calling SimplifyDemandedBits.
  4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but
     seemed sufficient to not cause regressions in tests.

  This finishes the change Matt Arsenault started in r246307 and jyknight's original
  patch. Many tests required some changes as memory operations are now reorderable.
  Some tests relying on the order were changed to use volatile memory operations.

  Noteworthy tests:
  CodeGen/AArch64/argument-blocks.ll - It's not entirely clear what the
    test_varargs_stackalign test is supposed to be asserting, but the new code looks
    right.
  CodeGen/AArch64/arm64-memset-inline.lli, CodeGen/AArch64/arm64-stur.ll,
    CodeGen/ARM/memset-inline.ll - The backend now generates *worse* code due to
    store merging succeeding, as we do not do a 16-byte constant-zero store
    efficiently.
  CodeGen/AArch64/merge-store.ll - Improved, but there still seems to be an
    extraneous vector insert from an element to itself?
  CodeGen/PowerPC/ppc64-align-long-double.ll - Worse code emitted in this case, due
    to the improved store->load forwarding.
  CodeGen/X86/dag-merge-fast-accesses.ll, CodeGen/X86/MergeConsecutiveStores.ll,
    CodeGen/X86/stores-merging.ll, CodeGen/Mips/load-store-left-right.ll - Restored
    correct merging of non-aligned stores.
  CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll - Improved. Correctly merges
    buffer_store_dword calls.
  CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll - Improved. Sidesteps loading a
    stored value and merges two stores.
  CodeGen/X86/pr18023.ll - This test has been removed, as it was asserting incorrect
    behavior. Non-volatile stores *CAN* be moved past volatile loads, and now are.
  CodeGen/X86/vector-idiv.ll, CodeGen/X86/vector-lzcnt-128.ll - It's basically
    impossible to tell what these tests are actually testing. But, looks like the
    code got better due to the memory operations being recognized as non-aliasing.
  CodeGen/X86/win32-eh.ll - Both loads of the securitycookie are now merged.

  Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
  Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon,
    aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
  Differential Revision: https://reviews.llvm.org/D14834
  llvm-svn: 289659
* Fix bug 30945 - [AVX512] Failure to flip vector comparison to remove not mask instruction (Michael Zuckerman, 2016-12-14, 1 file, -3/+14)
  Adding a new optimization opportunity by adding a new X86ISelLowering pattern. The
  test case was shown in https://llvm.org/bugs/show_bug.cgi?id=30945.

  Test explanation: Select gets three arguments: mask, op and op2. In this case, the
  mask is a result of ICMP. The ICMP instruction compares (with the EQ predicate)
  the zero initializer vector and the result of the first ICMP. In general, the
  result of "cmp eq, op1, zero initializers" is "not(op1)" where op1 is a mask. By
  rearranging the two arguments inside the Select instruction, we can get the same
  result, without the need for the middle phase ("cmp eq, op1, zero initializers").

  Missed optimization opportunity:

    vpcmpled %zmm0, %zmm1, %k0
    knotw %k0, %k1

  can be combined into:

    vpcmpgtd %zmm0, %zmm2, %k1

  Reviewers:
  1. delena
  2. igorb
  Committed after check-all.
  Differential Revision: https://reviews.llvm.org/D27160
  llvm-svn: 289653
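  A scalar model of the rewrite (illustrative only, not the added lowering pattern):
  selecting on the negation of a mask is the same as selecting with the operands
  swapped, which is what lets the "not" step drop out:

      #include <cassert>

      int selectOnNotMask(bool M, int A, int B) { return !M ? A : B; } // mask negated first
      int selectSwapped(bool M, int A, int B)   { return M ? B : A; }  // same result, no negation

      int main() {
        for (bool M : {false, true})
          assert(selectOnNotMask(M, 1, 2) == selectSwapped(M, 1, 2));
        return 0;
      }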
* Replace APFloatBase static fltSemantics data members with getter functions (Stephan Bergmann, 2016-12-14, 9 files, -34/+34)
  At least the plugin used by the LibreOffice build
  (<https://wiki.documentfoundation.org/Development/Clang_plugins>) indirectly uses
  those members (through inline functions in LLVM/Clang include files in turn using
  them), but they are not exported by utils/extract_symbols.py on Windows, and
  accessing data across DLL/EXE boundaries on Windows is generally problematic.
  Differential Revision: https://reviews.llvm.org/D26671
  llvm-svn: 289647
* [AVR] Add a function instrumentation pass (Dylan McKay, 2016-12-14, 4 files, -0/+224)
  This will be used for an on-chip test suite.
  llvm-svn: 289641
* [PowerPC] Fix logic dealing with nop after calls (and tail-call eligibility) (Hal Finkel, 2016-12-14, 1 file, -40/+25)
  This change aims to unify and correct our logic for when we need to allow for the
  possibility of the linker adding a TOC restoration instruction after a call. This
  comes up in two contexts:

  1. When determining tail-call eligibility. If we make a tail call (i.e. directly
     branch to a function) then there is no place for the linker to add a TOC
     restoration.
  2. When determining when we need to add a nop instruction after a call. Likewise,
     if there is a possibility that the linker might need to add a TOC restoration
     after a call, then we need to put a nop after the call (the bl instruction).

  First problem: We were using similar, but different, logic to decide (1) and (2).
  This is just wrong. Both the resideInSameModule function (used when determining
  tail-call eligibility) and the isLocalCall function (used when deciding if the
  post-call nop is needed) were supposed to be determining the same underlying fact
  (i.e. might a TOC restoration be needed after the call). The same logic should be
  used in both places.

  Second problem: The logic in both places was wrong. We only know that two
  functions will share the same TOC when both functions come from the same section
  of the same object. Otherwise the linker might cause the functions to use
  different TOC base addresses (unless the multi-TOC linker option is disabled, in
  which case only shared-library boundaries are relevant). There are a number of
  factors that can cause functions to be placed in different sections or come from
  different objects (-ffunction-sections, explicitly-specified section names,
  COMDAT, weak linkage, etc.). All of these need to be checked. The existing logic
  only checked properties of the callee, but the properties of the caller must also
  be checked (for example, calling from a function in a COMDAT section means calling
  between sections).

  There was a conceptual error in the resideInSameModule function in that it allowed
  tail calls to functions with weak linkage and protected/hidden visibility. While
  protected/hidden visibility does prevent the function implementation from being
  replaced at runtime (via interposition), it does not prevent the linker from using
  an alternate implementation at link time (i.e. using some strong definition to
  replace the provided weak one during linking). If this happens, then we're still
  potentially looking at a required TOC restoration upon return.

  Otherwise, in general, the post-call nop is needed wherever ELF interposition
  needs to be supported. We don't currently support ELF interposition at the IR
  level (see http://lists.llvm.org/pipermail/llvm-dev/2016-November/107625.html for
  more information), and I don't think we should try to make it appear to work in
  the backend in spite of that fact. This will yield subtle bugs if interposition is
  attempted. As a result, regardless of whether we're in PIC mode, we don't assume
  that we need to add the nop to support the possibility of ELF interposition.
  However, the necessary check is in place (i.e. calling GV->isInterposable and
  TM.shouldAssumeDSOLocal) so when we have functions for which interposition is
  allowed at the IR level, we'll add the nop as necessary. In the mean time, we'll
  generate more tail calls and fewer nops when compiling position-independent code.
  Differential Revision: https://reviews.llvm.org/D27231
  llvm-svn: 289638
* Add support for Samsung Exynos M3 (NFC) (Evandro Menezes, 2016-12-13, 2 files, -1/+8)
  llvm-svn: 289613
* [Hexagon] Fix some Clang-tidy modernize and Include What You Use warnings; other minor fixes (NFC). (Eugene Zelenko, 2016-12-13, 7 files, -402/+359)
  llvm-svn: 289604
* Generalize strided store pattern in interleave access pass (Alina Sbirlea, 2016-12-13, 2 files, -13/+73)
  Summary: This patch aims to generalize matching of the strided store accesses to
  more general masks. The more general rule is to have consecutive accesses based on
  the stride:
    [x, y, ... z, x+1, y+1, ...z+1, x+2, y+2, ...z+2, ...]
  All elements in the masks need not form a contiguous space, there may be gaps. As
  before, undefs are allowed and filled in with adjacent element loads.
  Reviewers: HaoLiu, mssimpso
  Subscribers: mkuper, delena, llvm-commits
  Differential Revision: https://reviews.llvm.org/D23646
  llvm-svn: 289573
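  A small sketch of the mask shape being matched (illustrative helper, not the pass
  code): the indices repeat a base group, with the whole group advancing by one on
  each repetition, and gaps inside the group are allowed:

      #include <cassert>
      #include <vector>

      // Build a mask of the form [x, y, ..., z, x+1, y+1, ..., z+1, x+2, ...].
      std::vector<int> makeStridedMask(const std::vector<int> &Base, int Reps) {
        std::vector<int> Mask;
        for (int R = 0; R < Reps; ++R)
          for (int X : Base)
            Mask.push_back(X + R);
        return Mask;
      }

      int main() {
        // Base group {0, 2} (a gap at index 1), repeated three times.
        std::vector<int> Expected = {0, 2, 1, 3, 2, 4};
        assert(makeStridedMask({0, 2}, 3) == Expected);
        return 0;
      }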
* Revert "AArch64CollectLOH: Rewrite as block-local analysis."Matthias Braun2016-12-131-279/+841
| | | | | | | | | | | | | This is not always behaving as expected as it turns out block live-in lists are only correct most of the time. Still waiting for reviews on https://reviews.llvm.org/D27559 to have them correct all of the time. See also http://llvm.org/PR31361, rdar://25117107 This reverts commit r288567. This reverts commit r288561. llvm-svn: 289570
* GlobalISel: fix GOT accesses on AArch64. (Tim Northover, 2016-12-13, 1 file, -2/+3)
  We were using the correct pseudo-instruction, but because the operand's flags
  weren't set correctly we still ended up emitting incorrect relocations during MC
  lowering.
  llvm-svn: 289566
* [mips] Fix comment to respect 80 chars per line; NFC (Simon Dardis, 2016-12-13, 1 file, -6/+6)
  llvm-svn: 289530
* [mips] Fix compact branch hazard detection (Simon Dardis, 2016-12-13, 1 file, -22/+42)
  In certain cases it is possible for transient instructions, such as
  %reg = IMPLICIT_DEF, to reach the MipsHazardSchedule pass as the single
  instruction in a basic block. This patch teaches MipsHazardSchedule to properly
  look through such cases.
  Reviewers: vkalintiris, zoran.jovanovic
  Differential Revision: https://reviews.llvm.org/D27209
  llvm-svn: 289529
* [GlobalISel] Move extendRegister where it belongs. NFCI (Diana Picus, 2016-12-13, 1 file, -28/+0)
  Apparently I missed this one when I moved ValueHandler back in r288658. Sorry!
  llvm-svn: 289528
* [AVR] Add a 'relax memory operation' pass (Dylan McKay, 2016-12-13, 5 files, -2/+157)
  Summary: This pass will be used to relax instructions which use out-of-bounds
  memory accesses to equivalent operations that can work with the addresses.

  The pass currently implements relaxation for the STDWPtrQRr instruction. Without
  this pass, an assertion error would be hit in the pseudo expansion pass.

  In the future, we will need to add more instructions to this pass. We can do that
  on a case-by-case basis.
  Reviewers: arsenm, kparzysz
  Subscribers: wdng, llvm-commits, mgorny
  Differential Revision: https://reviews.llvm.org/D27650
  llvm-svn: 289517
* [peephole] Enhance folding logic to work for STATEPOINTs (Philip Reames, 2016-12-13, 1 file, -18/+7)
  The general idea here is to get enough of the existing restrictions out of the way
  that the already existing folding logic in foldMemoryOperand can kick in for
  STATEPOINTs and fold references to immutable stack slots. The key changes are:

  * Support for folding multiple operands at once which reference the same load
  * Support for folding multiple loads into a single instruction
  * Walk all the operands of the instruction for variadic instructions (this is a
    bug fix!)

  Once this lands, I'll post another patch which refactors the TII interface here.
  There's nothing actually x86 specific about the x86 code used here.
  Differential Revision: https://reviews.llvm.org/D24103
  llvm-svn: 289510
* [x86] fix formatting; NFC (Sanjay Patel, 2016-12-12, 1 file, -1/+1)
  llvm-svn: 289476
* [AMDGPU, PowerPC, TableGen] Fix some Clang-tidy modernize and Include What You Use warnings; other minor fixes (NFC). (Eugene Zelenko, 2016-12-12, 10 files, -135/+192)
  llvm-svn: 289475
* [PPC] Prefer direct move on power8 if load 1 or 2 bytes to VSR (Guozhi Wei, 2016-12-12, 2 files, -1/+9)
  Power8 has MTVSRWZ but no LXSIBZX/LXSIHZX, so moving 1 or 2 bytes to a VSR through
  MTVSRWZ is much faster than storing the extended value to the stack and loading it
  with LXSIWZX.
  This patch fixes pr31144.
  Differential Revision: https://reviews.llvm.org/D27287
  llvm-svn: 289473
* [mips] For PIC code convert unconditional jump to unconditional branch (Simon Atanasyan, 2016-12-12, 1 file, -0/+11)
  An unconditional branch uses relative addressing, which is the right choice for
  position independent code.
  This is a fix for the bug: https://dmz-portal.mips.com/bugz/show_bug.cgi?id=2445
  Differential revision: https://reviews.llvm.org/D27483
  llvm-svn: 289448
* AMDGPU: llvm.amdgcn.interp.mov is a source of divergence (Nicolai Haehnle, 2016-12-12, 1 file, -0/+1)
  Summary: While the result is constant across a single primitive, each pixel shader
  wave can have pixels from multiple primitives.
  Reviewers: tstellarAMD, arsenm
  Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
  Differential Revision: https://reviews.llvm.org/D27572
  llvm-svn: 289447
* Update inline argument comment. NFCI. (Simon Pilgrim, 2016-12-12, 1 file, -3/+3)
  combineX86ShufflesRecursively's 'HasPSHUFB' flag has been the more generic
  'HasVariableMask' flag for some time.
  llvm-svn: 289430
* [X86][SSE] Add support for combining SSE VSHLI/VSRLI uniform constant shifts. (Simon Pilgrim, 2016-12-12, 1 file, -0/+33)
  Fixes some missed constant folding opportunities and allows us to combine shuffles
  that end with a logical bit shift.
  llvm-svn: 289429
* [X86][SSE] Lower suitably sign-extended mul vXi64 using PMULDQ (Simon Pilgrim, 2016-12-12, 1 file, -4/+21)
  PMULDQ returns the 64-bit result of the signed multiplication of the lower 32-bits
  of vXi64 vector inputs; we can lower with this if the sign bits stretch that far.
  Differential Revision: https://reviews.llvm.org/D27657
  llvm-svn: 289426
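  A scalar model of the condition (illustrative, not the lowering code): when both
  64-bit lanes are sign-extended from 32 bits, multiplying only the signed low 32
  bits, which is what PMULDQ does per lane, yields the same 64-bit product:

      #include <cassert>
      #include <cstdint>

      int64_t fullMul(int64_t A, int64_t B) { return A * B; }

      int64_t pmuldqLikeMul(int64_t A, int64_t B) {
        // Multiply only the signed low 32 bits of each input.
        return int64_t(int32_t(A)) * int64_t(int32_t(B));
      }

      int main() {
        int64_t A = int64_t(int32_t(-12345)); // sign bits stretch down to bit 31
        int64_t B = int64_t(int32_t(67890));
        assert(fullMul(A, B) == pmuldqLikeMul(A, B));
        return 0;
      }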
* [X86] Teach selectScalarSSELoad to accept full 128-bit vector loads and the X86ISD::VZEXT_LOAD opcode. (Craig Topper, 2016-12-12, 1 file, -0/+22)
  Disable peephole on some of the tests that no longer require it to properly fold
  scalar intrinsics.
  llvm-svn: 289424
* [X86] Change CMPSS/CMPSD intrinsic instructions to use sse_load_f32/f64 as its memory pattern instead of full vector load. (Craig Topper, 2016-12-12, 1 file, -12/+12)
  These intrinsics only load a single element. We should use sse_load_f32/f64 to
  give more options of what loads it can match. Currently these instructions are
  often only getting their load folded thanks to the load folding in the peephole
  pass. I plan to add more types of loads to sse_load_f32/64 so we can match without
  the peephole.
  llvm-svn: 289423
* [X86] Remove some intrinsic instructions from hasPartialRegUpdate (Craig Topper, 2016-12-12, 1 file, -8/+0)
  Summary: These intrinsic instructions are all selected from intrinsics that have
  well defined behavior for where the upper bits come from. It's not the same place
  as the lower bits.

  As you can see we were suppressing load folding for these instructions in some
  cases. In none of the cases was the separate load helping avoid a partial
  dependency on the destination register. So we should just go ahead and allow the
  load to be folded.

  Only foldMemoryOperand was suppressing folding for these. They all have patterns
  for folding sse_load_f32/f64 that aren't gated with OptForSize, but
  sse_load_f32/f64 doesn't allow 128-bit vector loads. It only allows
  scalar_to_vector and vzmovl of scalar loads to match. There's no reason we can't
  allow a 128-bit vector load to be narrowed so I would like to fix sse_load_f32/f64
  to allow that. And if I do that it changes some of these same test cases to fold
  the load too.
  Reviewers: spatel, zvi, RKSimon
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D27611
  llvm-svn: 289419
* [X86][SSE] Add support for combining target shuffles to SHUFPD. (Simon Pilgrim, 2016-12-11, 1 file, -17/+45)
  llvm-svn: 289407
* [X86][AVX512] Add missing patterns for broadcast fallback in case load node has multiple uses (for v4i64 and v4f64). (Ayman Musa, 2016-12-11, 1 file, -0/+6)
  When the load node which the broadcast instruction broadcasts has multiple uses,
  it cannot be folded. A fallback pattern is added to catch these cases and provide
  another solution.
  Differential Revision: https://reviews.llvm.org/D27661
  llvm-svn: 289404