summaryrefslogtreecommitdiffstats
path: root/llvm/lib/Target
Commit message (Collapse)AuthorAgeFilesLines
...
* [AMDGPU] Overload llvm.amdgcn.fmad.ftz to support f16Stanislav Mekhanoshin2018-06-281-5/+9
| | | | | | Differential Revision: https://reviews.llvm.org/D48677 llvm-svn: 335866
* [ARM] Parallel DSP PassSjoerd Meijer2018-06-285-1/+624
| | | | | | | | | | | | | | | | | Armv6 introduced instructions to perform 32-bit SIMD operations. The purpose of this pass is to do some straightforward IR pattern matching to create ACLE DSP intrinsics, which map on these 32-bit SIMD operations. Currently, only the SMLAD instruction gets recognised. This instruction performs two multiplications with 16-bit operands, and stores the result in an accumulator. We will follow this up with patches to recognise SMLAD in more cases, and also to generate other DSP instructions (like e.g. SADD16). Patch by: Sam Parker and Sjoerd Meijer Differential Revision: https://reviews.llvm.org/D48128 llvm-svn: 335850
* s/TablesChecked/TableChecked/ after r335823Hans Wennborg2018-06-282-2/+2
| | | | llvm-svn: 335831
* AMDGPU: Remove MFI::ABIArgOffsetMatt Arsenault2018-06-287-33/+22
| | | | | | | | | | | | | | We have too many mechanisms for tracking the various offsets used for kernel arguments, so remove one. There's still a lot of confusion with these because there are two different "implicit" argument areas located at the beginning and end of the kernarg segment. Additionally, the offset was determined based on the memory size of the split element types. This would break in a future commit where v3i32 is decomposed into separate i32 pieces. llvm-svn: 335830
* AMDGPU: Error on calls from graphics shadersMatt Arsenault2018-06-281-0/+7
| | | | | | | | In principle nothing should stop these from working, but work is necessary to create an ABI for dealing with the stack related registers. llvm-svn: 335829
* AMDGPU: Fix AMDGPUCodeGenPrepare using uninitialized AMDGPUAS structMatt Arsenault2018-06-281-1/+2
| | | | | | Not sure how this wasn't noticed before. llvm-svn: 335828
* AMDGPU: Fix assert on aggregate type kernel argumentsMatt Arsenault2018-06-281-2/+4
| | | | | | | | | | Just fix the crash for now by not doing the optimization since figuring out how to properly convert the bits for an arbitrary struct is a pain. Also fix a crash when there is only an empty struct argument. llvm-svn: 335827
* Unify sorted asserts to use the existing atomic patternBenjamin Kramer2018-06-283-9/+10
| | | | | | | These are all benign races and only visible in !NDEBUG. tsan complains about it, but a simple atomic bool is sufficient to make it happy. llvm-svn: 335823
* [X86] Use PatFrag with hardcoded numbers for FROUND_NO_EXC/FROUND_CURRENT ↵Craig Topper2018-06-281-4/+2
| | | | | | | | | | instead of ImmLeafs with predicates where one of the two numbers was hardcoded. This more efficient for the isel table generator since we can use CheckChildInteger instead of MoveChild, CheckPredicate, MoveParent. This reduced the table size by 1-2K. I wish there was a way to share the values with X86BaseInfo.h and still use a PatFrag like this. These numbers are fixed by the X86 intrinsic spec going back many years and we should never need to change them. So we shouldn't waste table bytes to support sharing. llvm-svn: 335806
* [X86] Change how we prefer shift by immediate over folding a load into a shift.Craig Topper2018-06-283-57/+68
| | | | | | | | | | | | BMI2 added new shift by register instructions that have the ability to fold a load. Normally without doing anything special isel would prefer folding a load over folding an immediate because the load folding pattern has higher "complexity". This would require an instruction to move the immediate into a register. We would rather fold the immediate instead and have a separate instruction for the load. We used to enforce this priority by artificially lowering the complexity of the load pattern. This patch changes this to instead reject the load fold in isProfitableToFoldLoad if there is an immediate. This is more consistent with other binops and feels less hacky. llvm-svn: 335804
* [X86] Make folding table checking threadsafeBenjamin Kramer2018-06-271-4/+3
| | | | | | | This is a benign race, but tsan likes to complain about it. Just make it happy. llvm-svn: 335788
* [X86] In X86DAGToDAGISel::PreprocessISelDAG, make sure we don't access N ↵Craig Topper2018-06-271-0/+1
| | | | | | | | | | after we delete it. If we turn X86ISD::AND into ISD::AND, we delete N. But we were continuing onto the next block of code even though N no longer existed. Just happened to notice it. I assume asan didn't notice it because we explicitly unpoison deleted nodes and give them a DELETE_NODE opcode. llvm-svn: 335787
* [RISCV] Add machine function pass to merge base + offsetSameer AbuAsal2018-06-275-208/+296
| | | | | | | | | | | | | | | | | | | | | | | | Summary: In r333455 we added a peephole to fix the corner cases that result from separating base + offset lowering of global address.The peephole didn't handle some of the cases because it only has a basic block view instead of a function level view. This patch replaces that logic with a machine function pass. In addition to handling the original cases it handles uses of the global address across blocks in function and folding an offset from LW\SW instruction. This pass won't run for OptNone compilation, so there will be a negative impact overall vs the old approach at O0. Reviewers: asb, apazos, mgrang Reviewed By: asb Subscribers: MartinMosbeck, brucehoult, the_o, rogfer01, mgorny, rbar, johnrusso, simoncook, niosHD, kito-cheng, shiva0217, zzheng, llvm-commits, edward-jones Differential Revision: https://reviews.llvm.org/D47857 llvm-svn: 335786
* [X86] Fix unmatched parenthesis in r335768Fangrui Song2018-06-271-1/+1
| | | | llvm-svn: 335769
* [X86] Teach the disassembler to use %eiz/%riz instead of NoRegister when the ↵Craig Topper2018-06-271-5/+20
| | | | | | | | | | SIB byte is present, but doesn't encode an index register and there was another shorter encoding that would achieve the same result. The %eiz/%riz are dummy registers that force the encoder to emit a SIB byte when it normally wouldn't. By emitting them in the disassembly output we ensure that assembling the disassembler output would also produce a SIB byte. This should match the behavior of objdump from binutils. llvm-svn: 335768
* [globalisel][legalizer] Add AtomicOrdering to LegalityQuery and use it in ↵Daniel Sanders2018-06-271-2/+6
| | | | | | | | | | | | | AArch64 Now that we have the ability to legalize based on MMO's. Add support for legalizing based on AtomicOrdering and use it to correct the legalization of the atomic instructions. Also extend all() to be a variadic template as this ruleset now requires 3 and 4 argument versions. llvm-svn: 335767
* [MachineOutliner] Don't outline sequences where x16/x17/nzcv are live acrossJessica Paquette2018-06-272-27/+41
| | | | | | | | | | | | | | It isn't safe to outline sequences of instructions where x16/x17/nzcv live across the sequence. This teaches the outliner to check whether or not a specific canidate has x16/x17/nzcv live across it and discard the candidate in the case that that is true. https://bugs.llvm.org/show_bug.cgi?id=37573 https://reviews.llvm.org/D47655 llvm-svn: 335758
* [X86] Use bts/btr/btc for single bit set/clear/complement of a variable bit ↵Craig Topper2018-06-271-0/+31
| | | | | | | | | | position If we are just modifying a single bit at a variable bit position we can use the BT* instructions to make the change instead of shifting a 1(or rotating a -1) and doing a binop. These instruction also ignore the upper bits of their index input so we can also remove an and if one is present on the index. Fixes PR37938. llvm-svn: 335754
* [X86] Rename the autoupgraded of packed fp compare and fpclass intrinsics ↵Craig Topper2018-06-271-14/+12
| | | | | | | | that don't take a mask as input to exclude '.mask.' from their name. I think the intrinsics named 'avx512.mask.' should refer to the previous behavior of taking a mask argument in the intrinsic instead of using a 'select' or 'and' instruction in IR to accomplish the masking. This is more consistent with the goal that eventually we will have no intrinsics that have masking builtin. When we reach that goal, we should have no intrinsics named "avx512.mask". llvm-svn: 335744
* [AMDGPU] Convert rcp to rcp_iflagStanislav Mekhanoshin2018-06-276-12/+45
| | | | | | | | | | | If a source of rcp instruction is a result of any conversion from an integer convert it into rcp_iflag instruction. No FP exception can ever happen except division by zero if a single precision rcp argument is a representation of an integral number. Differential Revision: https://reviews.llvm.org/D48569 llvm-svn: 335742
* [AArch64] Reverting FP16 vcvth_n_s64_f16 to fixLuke Geeson2018-06-271-2/+0
| | | | llvm-svn: 335737
* [AArch64] Add custom lowering for v4i8 trunc storeAdhemerval Zanella2018-06-273-8/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a custom trunc store lowering for v4i8 vector types. Since there is not v.4b register, the v4i8 is promoted to v4i16 (v.4h) and default action for v4i8 is to extract each element and issue 4 byte stores. A better strategy would be to extended the promoted v4i16 to v8i16 (with undef elements) and extract and store the word lane which represents the v4i8 subvectores. The construction: define void @foo(<4 x i16> %x, i8* nocapture %p) { %0 = trunc <4 x i16> %x to <4 x i8> %1 = bitcast i8* %p to <4 x i8>* store <4 x i8> %0, <4 x i8>* %1, align 4, !tbaa !2 ret void } Can be optimized from: umov w8, v0.h[3] umov w9, v0.h[2] umov w10, v0.h[1] umov w11, v0.h[0] strb w8, [x0, #3] strb w9, [x0, #2] strb w10, [x0, #1] strb w11, [x0] ret To: xtn v0.8b, v0.8h str s0, [x0] ret The patch also adjust the memory cost for autovectorization, so the C code: void foo (const int *src, int width, unsigned char *dst) { for (int i = 0; i < width; i++) *dst++ = *src++; } can be vectorized to: .LBB0_4: // %vector.body // =>This Inner Loop Header: Depth=1 ldr q0, [x0], #16 subs x12, x12, #4 // =4 xtn v0.4h, v0.4s xtn v0.8b, v0.8h st1 { v0.s }[0], [x2], #4 b.ne .LBB0_4 Instead of byte operations. llvm-svn: 335735
* [NEON] Support vldNq intrinsics in AArch32 (LLVM part)Ivan A. Kosarev2018-06-275-70/+262
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for the q versions of the dup (load-to-all-lanes) NEON intrinsics, such as vld2q_dup_f16() for example. Currently, non-q versions of the dup intrinsics are implemented in clang by generating IR that first loads the elements of the structure into the first lane with the lane (to-single-lane) intrinsics, and then propagating it other lanes. There are at least two problems with this approach. First, there are no double-spaced to-single-lane byte-element instructions. For example, there is no such instruction as 'vld2.8 { d0[0], d2[0] }, [r0]'. That means we cannot rely on the to-single-lane intrinsics and instructions to implement the q versions of the dup intrinsics. Note that to-all-lanes instructions do support all sizes of data items, including bytes. The second problem with the current approach is that we need a separate vdup instruction to propagate the structure to each lane. So for vld4q_dup_f16() we would need four vdup instructions in addition to the initial vld instruction. This patch introduces dup LLVM intrinsics and reworks handling of the currently supported (non-q) NEON dup intrinsics to expand them into those LLVM intrinsics, thus eliminating the need for using to-single-lane intrinsics and instructions. Additionally, this patch adds support for u64 and s64 dup NEON intrinsics. These are marked as Arch64-only in the ARM NEON Reference, but it seems there are no reasons to not support them in AArch32 mode. Please correct, if that is wrong. That's what we generate with this patch applied: vld2q_dup_f16: vld2.16 {d0[], d2[]}, [r0] vld2.16 {d1[], d3[]}, [r0] vld3q_dup_f16: vld3.16 {d0[], d2[], d4[]}, [r0] vld3.16 {d1[], d3[], d5[]}, [r0] vld4q_dup_f16: vld4.16 {d0[], d2[], d4[], d6[]}, [r0] vld4.16 {d1[], d3[], d5[], d7[]}, [r0] Differential Revision: https://reviews.llvm.org/D48439 llvm-svn: 335733
* [AArch64] Remove Duplicate FP16 Patterns with same encoding, match on ↵Luke Geeson2018-06-274-35/+54
| | | | | | existing patterns llvm-svn: 335715
* AMDGPU/NFC: Fix typo in commentKonstantin Zhuravlyov2018-06-271-1/+1
| | | | llvm-svn: 335707
* [X86] Don't store register and memory FMA3 opcodes in the same ↵Craig Topper2018-06-273-99/+30
| | | | | | | | | | X86InstrFMA3Group. Nothing was using this relationship. By splitting them we no longer need to worry about register or memory entries being empty in a group. The memory folding tables in X86InstrInfo.cpp can be used to access this relationship if needed. llvm-svn: 335694
* AMDGPU: Silence unused warnings in waitcnt insertion pass in release buildKonstantin Zhuravlyov2018-06-261-1/+5
| | | | | | Differential Revision: https://reviews.llvm.org/D48607 llvm-svn: 335669
* [X86][AsmParser] Recommit r335658Jessica Paquette2018-06-261-0/+8
| | | | | | | Recommit of r335658 so that it does not change the behaviour of any existing error output. llvm-svn: 335668
* Revert "[X86][AsmParser] Emit an error when RIP-relative instructions are ↵Jessica Paquette2018-06-261-7/+0
| | | | | | | | | | used in 32-bit mode" This reverts commit 4850a9aae8b38c7deadc103d634ec7397e6c323b. It caused MC/X86/x86_errors.s to fail. Will fix and recommit shortly. llvm-svn: 335660
* [X86][AsmParser] Emit an error when RIP-relative instructions are used in ↵Jessica Paquette2018-06-261-0/+7
| | | | | | | | | | | | | 32-bit mode Right now, when we use RIP-relative instructions in 32-bit mode, we'll just assert and crash. This adds an error message which tells the user that they can't do that in 32-bit mode, so that we don't crash (and also can see the issue outside of assert builds). llvm-svn: 335658
* [AMDGPU] Add llvm.amdgcn.fmad.ftz intrinsicStanislav Mekhanoshin2018-06-261-0/+3
| | | | | | | | This intrinsic selects v_mad_f32 regardless of fp32 denorm support. Differential Revision: https://reviews.llvm.org/D48573 llvm-svn: 335654
* AMDGPU: Add pass to lower kernel arguments to loadsMatt Arsenault2018-06-264-0/+283
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces most argument uses with loads, but for now not all. The code in SelectionDAG for calling convention lowering is actively harmful for amdgpu_kernel. It attempts to split the argument types into register legal types, which results in low quality code for arbitary types. Since all kernel arguments are passed in memory, we just want the raw types. I've tried a couple of methods of mitigating this in SelectionDAG, but it's easier to just bypass this problem alltogether. It's possible to hack around the problem in the initial lowering, but the real problem is the DAG then expects to be able to use CopyToReg/CopyFromReg for uses of the arguments outside the block. Exposing the argument loads in the IR also has the advantage that the LoadStoreVectorizer can merge them. I'm not sure the best approach to dealing with the IR argument list is. The patch as-is just leaves the IR arguments in place, so all the existing code will still compute the same kernarg size and pointlessly lowers the arguments. Arguably the frontend should emit kernels with an empty argument list in the first place. Alternatively a dummy array could be inserted as a single argument just to reserve space. This does have some disadvantages. Local pointer kernel arguments can no longer have AssertZext placed on them as the equivalent !range metadata is not valid on pointer typed loads. This is mostly bad for SI which needs to know about the known bits in order to use the DS instruction offset, so in this case this is not done. More importantly, this skips noalias arguments since this pass does not yet convert this to the equivalent !alias.scope and !noalias metadata. Producing this metadata correctly seems to be tricky, although this logically is the same as inlining into a function which doesn't exist. Additionally, exposing these loads to the vectorizer may result in degraded aliasing information if a pointer load is merged with another argument load. I'm also not entirely sure this is preserving the current clover ABI, although I would greatly prefer if it would stop widening arguments and match the HSA ABI. As-is I think it is extending < 4-byte arguments to 4-bytes but doesn't align them to 4-bytes. llvm-svn: 335650
* [Hexagon] Add a "generic" cpuBrendon Cahoon2018-06-264-1/+7
| | | | | | | | | | Add the generic processor for Hexagon so that it can be used with 3rd party programs that create a back-end with the "generic" CPU. This patch also enables the JIT for Hexagon. Differential Revision: https://reviews.llvm.org/D48571 llvm-svn: 335641
* [TargetLowering] isVectorClearMaskLegal - use ArrayRef<int> instead of const ↵Simon Pilgrim2018-06-262-8/+6
| | | | | | | | | | SmallVectorImpl<int>& This is more generic and matches isShuffleMaskLegal. Differential Revision: https://reviews.llvm.org/D48591 llvm-svn: 335605
* [X86,ARM] Retain split-stack prolog check for sibling callsThan McIntosh2018-06-262-4/+8
| | | | | | | | | | | | | | | | | | | Summary: If a routine with no stack frame makes a sibling call, we need to preserve the stack space check even if the local stack frame is empty, since the call target could be a "no-split" function (in which case the linker needs to be able to fix up the prolog sequence in order to switch to a larger stack). This fixes PR37807. Reviewers: cherry, javed.absar Subscribers: srhines, llvm-commits Differential Revision: https://reviews.llvm.org/D48444 llvm-svn: 335604
* ARM: correctly decode VFP instructions following unpredictable t2ITTim Northover2018-06-261-0/+2
| | | | | | | | | When the condition code for an IT instruction is "AL" we get strange "15" predicates on subsequent instructions. These are dealt with for most instructions by treating them as "ARMCC::AL", but VFP takes a different path which didn't have this code. llvm-svn: 335594
* ARM: diagnose unpredictable IT instructionsTim Northover2018-06-262-1/+19
| | | | | | | | | | | IT instructions are allowed to have the 'AL' predicate, but it must never result in an 'NV' predicated instruction. Essentially this means that all branches must be 't' rather than 'e' if the predicate is 'AL'. This patch adds a diagnostic for this during assembly (error because parsing hits an assertion if allowed to continue) and an annotation during disassembly. llvm-svn: 335593
* [X86] Just use ArrayRef instead of SmallVectorImpl in a few static method ↵Simon Pilgrim2018-06-261-4/+4
| | | | | | arguments. NFCI. llvm-svn: 335590
* [X86] Don't use getScalarShiftAmountTy to get the immediate type for target ↵Craig Topper2018-06-261-5/+2
| | | | | | | | specific VSHLDQ/VSRLDQ nodes. These opcodes have a fixed type of i8 for their immediate and shouldn't have anything to do with the scalar shift amount used by target independent shift nodes. llvm-svn: 335578
* [WebAssembly] Fix lowering of varargs functions with non-legal fixed arguments.Dan Gohman2018-06-261-2/+3
| | | | | | | | | | | CallLoweringInfo's NumFixedArgs field gives the number of fixed arguments before legalization. The ISD::OutputArg "Outs" array holds legalized arguments, so when indexing into it to find the non-fixed arguemn, we need to use the number of arguments after legalization. Fixes PR37934. llvm-svn: 335576
* [X86] Use XOR for SUB (C, X) during isel if will help fold an immediateCraig Topper2018-06-261-0/+38
| | | | | | | | | | | | | | | | | Summary: Same idea as D48529, but restricted to X86 and done very late to avoid any surprises where subtract might be better for DAG combining. This seems like the safest way to do this trick. And we consider doing it as a DAG combine later. Reviewers: spatel, RKSimon Reviewed By: spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D48557 llvm-svn: 335575
* [WebAssembly] Fix a typo in a comment.Dan Gohman2018-06-261-1/+1
| | | | llvm-svn: 335574
* [X86] Redefine avx512 packed fpclass intrinsics to return a vXi1 mask and ↵Craig Topper2018-06-261-19/+6
| | | | | | | | | | | | implement the mask input argument using an 'and' IR instruction. This recommits r335562 and 335563 as a single commit. The frontend will surround the intrinsic with the appropriate marshalling to/from a scalar type to match the sigature of the builtin that software expects. By exposing the vXi1 type directly in the llvm intrinsic we make it available to optimizers much earlier. This can enable the scalar marshalling code to be optimized away. llvm-svn: 335568
* Revert r335562 and 335563 "[X86] Redefine avx512 packed fpclass intrinsics ↵Craig Topper2018-06-261-6/+19
| | | | | | | | to return a vXi1 mask and implement the mask input argument using an 'and' IR instruction." These were supposed to have been squashed to a single commit. llvm-svn: 335566
* fooCraig Topper2018-06-261-19/+6
| | | | llvm-svn: 335562
* [X86] Simplify intrinsic table binary search to not require a temporary struct.Craig Topper2018-06-251-10/+9
| | | | | | std::lower_bound doesn't require the thing to search for to be the same type as the table entries. We just need to define an appropriate comparison function that can take an table entry and an intrinsic number. llvm-svn: 335518
* [X86] Add comment about the sorting of the memory folding tables added in ↵Craig Topper2018-06-251-0/+15
| | | | | | r335501. llvm-svn: 335517
* [PowerPC] Fix incorrectly encoded wait instructionLei Huang2018-06-251-1/+1
| | | | | | | | Encoding for the wait instruction was wrong. Fix according to ISA 3.0. Differential Revision: https://reviews.llvm.org/D48550 llvm-svn: 335514
* Re-land r335297 "[X86] Implement more of x86-64 large and medium PIC code ↵Reid Kleckner2018-06-256-30/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | models" The large code model allows code and data segments to exceed 2GB, which means that some symbol references may require a displacement that cannot be encoded as a displacement from RIP. The large PIC model even relaxes the assumption that the GOT itself is within 2GB of all code. Therefore, we need a special code sequence to materialize it: .LtmpN: leaq .LtmpN(%rip), %rbx movabsq $_GLOBAL_OFFSET_TABLE_-.LtmpN, %rax # Scratch addq %rax, %rbx # GOT base reg From that, non-local references go through the GOT base register instead of being PC-relative loads. Local references typically use GOTOFF symbols, like this: movq extern_gv@GOT(%rbx), %rax movq local_gv@GOTOFF(%rbx), %rax All calls end up being indirect: movabsq $local_fn@GOTOFF, %rax addq %rbx, %rax callq *%rax The medium code model retains the assumption that the code segment is less than 2GB, so calls are once again direct, and the RIP-relative loads can be used to access the GOT. Materializing the GOT is easy: leaq _GLOBAL_OFFSET_TABLE_(%rip), %rbx # GOT base reg DSO local data accesses will use it: movq local_gv@GOTOFF(%rbx), %rax Non-local data accesses will use RIP-relative addressing, which means we may not always need to materialize the GOT base: movq extern_gv@GOTPCREL(%rip), %rax Direct calls are basically the same as they are in the small code model: They use direct, PC-relative addressing, and the PLT is used for calls to non-local functions. This patch adds reasonably comprehensive testing of LEA, but there are lots of interesting folding opportunities that are unimplemented. I restricted the MCJIT/eh-lg-pic.ll test to Linux, since the large PIC code model is not implemented for MachO yet. Differential Revision: https://reviews.llvm.org/D47211 llvm-svn: 335508
* [X86] Sort the static memory folding tables by reg opcode. Remove the ↵Craig Topper2018-06-252-5419/+5328
| | | | | | | | | | | | reg->mem DenseMaps in favor of binary search. With the static tables sorted we can binary search them directly for reg->mem lookups. This removes 6 DenseMaps that had to be created when X86InstrInfo is constructed. We still have one Mem->Reg DenseMap for the reverse direction. This is created just as before by walking the reg->mem arrays to populate it. Differential Revision: https://reviews.llvm.org/D48527 llvm-svn: 335501
OpenPOWER on IntegriCloud