path: root/llvm/test/CodeGen/AMDGPU
Commit message | Author | Age | Files | Lines
...
* [AMDGPU] prevent hitting Assertion `isReg() && "Wrong MachineOperand accessor"' | Mark Searles | 2018-06-12 | 1 | -0/+23
  The use iterator, used within findMaskOperands(), can return anything which is not a def. isUse() requires a register, so check isReg() before calling isUse().
  Differential Revision: https://reviews.llvm.org/D48047
  llvm-svn: 334459
* [AMDGPU] Do not consider indirect access through phi for wave limiter | Stanislav Mekhanoshin | 2018-06-11 | 1 | -0/+26
  Rationale: an indirect access is usually an issue because the load is not ready by the time of the use. However, if the use is inside a loop and the load is outside it, that is potentially an issue for the first iteration only.
  Differential Revision: https://reviews.llvm.org/D47740
  llvm-svn: 334420
* [NFC][AMDGPU] Add tests for all the various IR patterns equivalent to extracting low bits. | Roman Lebedev | 2018-06-11 | 1 | -0/+263
  Summary: The idiom recognition seems rather poor. Only the `@bzhi32_d0` produces `v_bfe_u32`. But they all should. This needs to be fixed before D47980 can be re-landed.
  Reviewers: mareko, bogner, rampitec, arsenm, tstellar, nhaehnle
  Reviewed By: nhaehnle
  Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
  Tags: #amdgpu
  Differential Revision: https://reviews.llvm.org/D48005
  llvm-svn: 334398
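For reference, a few commonly written forms of the "extract the low n bits" idiom can be sketched as follows. This is an illustrative Python model of 32-bit arithmetic, not the contents of the added test file; the function names are hypothetical, and no mapping to the individual `@bzhi32_*` tests is claimed.

```python
MASK32 = 0xFFFFFFFF  # model i32 arithmetic by masking to 32 bits


def low_bits_mask_minus_one(x, n):
    # x & ((1 << n) - 1)
    return x & (((1 << n) - 1) & MASK32)


def low_bits_not_shifted_ones(x, n):
    # x & ~(-1 << n)
    return x & (~((MASK32 << n) & MASK32) & MASK32)


def low_bits_shifted_ones(x, n):
    # x & (-1 >> (32 - n)), using a logical shift right
    return x & (MASK32 >> (32 - n))


def low_bits_shift_pair(x, n):
    # (x << (32 - n)) >> (32 - n), using a logical shift right
    return ((x << (32 - n)) & MASK32) >> (32 - n)
```

All four agree for 0 < n < 32; the sketch deliberately avoids n = 0 and n = 32, where some of the shift amounts would be out of range for a 32-bit value.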
* [AMDGPU] Inline asm - added i16, half and i128 types support | Daniil Fukalov | 2018-06-08 | 3 | -54/+33
  The AMDGPU inline assembler supports i16, half and i128 typed variables in constraints, but they were reported as errors. Needed to fix https://github.com/RadeonOpenCompute/ROCm/issues/341, e.g. to be able to load with global_load_dwordx4 into a 128-bit integer variable.
  Differential Revision: https://reviews.llvm.org/D44920
  llvm-svn: 334301
* AMDGPU: Error on LDS global address in functions | Matt Arsenault | 2018-06-08 | 1 | -0/+9
  These won't work as expected now, so error on them to avoid wasting time debugging this in the future.
  llvm-svn: 334269
* [AMDGPU] Simplify memory legalizer | Tony Tye | 2018-06-07 | 6 | -139/+2216
  - Make code easier to maintain.
  - Avoid generating waitcnts for VMEM if the address space does not involve VMEM.
  - Add support to generate waitcnts for LDS and GDS memory.
  Differential Revision: https://reviews.llvm.org/D47504
  llvm-svn: 334241
* AMDGPU: Fix not including v2f64 in SReg_128 | Matt Arsenault | 2018-06-07 | 2 | -2/+45
  Fixes assertion with calls returning v2f64.
  llvm-svn: 334189
* AMDGPU: Use scalar operations for f16 fabs/fneg patterns | Matt Arsenault | 2018-06-07 | 4 | -64/+29
  Fixes unnecessary differences between subtargets.
  llvm-svn: 334184
* AMDGPU: Try a lot harder to emit scalar loads | Matt Arsenault | 2018-06-07 | 26 | -462/+714
  This has two main components. First, widen short constant loads in DAG when they have the correct alignment. This is already done a bit in AMDGPUCodeGenPrepare, since that has access to DivergenceAnalysis. This can't help kernarg loads created in the DAG. Start to use DAG divergence analysis to help this case.
  The second part is to avoid kernel argument lowering breaking the alignment of short vector elements because calling convention lowering wants to split everything into legal register types. When loading a split type, load the nearest 4-byte aligned segment and shift to get the desired bits. This extra load of the earlier argument piece ends up merging, and the bit extract hopefully folds out.
  There are a number of improvements and regressions with this, but I think as-is this is a better compromise between several of the worst parts of SelectionDAG. Particularly when i16 is legal, this produces worse code for i8 and i16 element vector kernel arguments. This is partially due to the very weak load merging the DAG does. It only looks for fairly specific combines between pairs of loads which no longer appear. In particular this causes v4i16 loads to be split into 2 components when previously the two halves were merged. Worse, because of the newly introduced shifts, there is a lot more unnecessary vector packing and unpacking code emitted. At least some of this is due to reporting false for isTypeDesirableForOp for i16 as a workaround for the lack of divergence information in the DAG. In the cases where this happens it doesn't actually matter, but the relevant code in SimplifyDemandedBits doesn't have the context to know to ignore this. The use of the scalar cache is probably more important than the mess of mostly scalar instructions doing this packing and unpacking.
  Future work can fix this, possibly by making better use of the new DAG divergence information for controlling promotion decisions, or adding another version of shift + trunc + shift combines that doesn't only know about the used types.
  llvm-svn: 334180
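The "load the nearest 4-byte aligned segment and shift" step described above can be illustrated with a small model. This is only a sketch over a Python byte buffer, assuming little-endian layout and an element that does not cross a 4-byte boundary; the helper names are hypothetical and it is not the DAG lowering code.

```python
def load_aligned_dword(buf, offset):
    # Load 4 little-endian bytes; the offset must be 4-byte aligned.
    assert offset % 4 == 0
    return int.from_bytes(buf[offset:offset + 4], "little")


def load_short_element(buf, offset, size_bytes):
    # Load the enclosing aligned dword, then shift and mask out the element.
    assert 0 < size_bytes <= 4 and (offset % 4) + size_bytes <= 4
    aligned = offset - (offset % 4)
    dword = load_aligned_dword(buf, aligned)
    shift = 8 * (offset % 4)
    mask = (1 << (8 * size_bytes)) - 1
    return (dword >> shift) & mask
```

For example, a 2-byte element at byte offset 6 is read by loading the dword at offset 4 and shifting right by 16 bits.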
* [AMDGPU] Improve reciprocal handling | Stanislav Mekhanoshin | 2018-06-06 | 1 | -0/+459
  When denormals are supported we are producing a full division for 1.0f / x. That can still be replaced by the faster version:
      bool c = fabs(x) > 0x1.0p+96f;
      float s = c ? 0x1.0p-32f : 1.0f;
      x *= s;
      return s * v_rcp_f32(x)
  if the requested accuracy is 2.5ulp or less. The same version is used if denormals are not supported for non-1.0 numerators, where just v_rcp_f32 is then used for a 1.0 numerator.
  The optimization of 1/x is extended to the case -1/x, which is the same except for the resulting sign bit.
  OpenCL conformance passed with both enabled and disabled denorms.
  Differential Revision: https://reviews.llvm.org/D47805
  llvm-svn: 334142
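The scaling trick in this message can be modeled numerically. The sketch below emulates single precision in Python and assumes, purely for illustration, that the hardware reciprocal flushes denormal results to zero; `rcp_flush` is a hypothetical stand-in for `v_rcp_f32`, and zero, infinities and NaN inputs are not handled.

```python
import struct


def f32(v):
    # Round a Python float to the nearest IEEE-754 single-precision value.
    return struct.unpack("f", struct.pack("f", v))[0]


MIN_NORMAL_F32 = 2.0 ** -126


def rcp_flush(x):
    # Hypothetical stand-in for a hardware reciprocal that flushes
    # denormal results to zero (an assumption made for this sketch).
    r = f32(1.0 / x)
    return 0.0 if abs(r) < MIN_NORMAL_F32 else r


def fast_rcp(x):
    # Scale large inputs down so the reciprocal stays in the normal range,
    # then scale the result back; the final multiply may produce a denormal.
    c = abs(x) > 2.0 ** 96
    s = 2.0 ** -32 if c else 1.0
    x = f32(x * s)
    return f32(s * rcp_flush(x))
```

Under these assumptions, `rcp_flush(2.0 ** 127)` loses the denormal result, while `fast_rcp(2.0 ** 127)` recovers it through the final multiply by `s`.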
* AMDGPU: Custom lower v2f16 fneg/fabs with illegal f16 | Matt Arsenault | 2018-06-06 | 3 | -38/+67
  Fixes terrible code on targets without f16 support. The legalization creates a mess that is difficult to recover from. Also should avoid randomly breaking these tests multiple times in sequence in future commits.
  Some regressions in cases where it happens to be better to pull the source modifier after the conversion.
  llvm-svn: 334132
* AMDGPU: Preserve metadata when widening loads | Matt Arsenault | 2018-06-05 | 1 | -0/+76
  Preserves the low bound of the !range. I don't think it's legal to do anything with the top half since it's theoretically reading garbage.
  llvm-svn: 334045
* AMDGPU: Use more custom insert/extract_vector_elt lowering | Matt Arsenault | 2018-06-05 | 4 | -85/+133
  Apply to i8 vectors.
  llvm-svn: 334044
* DAG: Stop dropping invariant/dereferenceable | Matt Arsenault | 2018-06-05 | 1 | -0/+13
  When legalizing illegal FP load results, this was for some reason dropping the invariant and dereferenceable memory flags. There doesn't seem to be any reason for this, and the equivalent isn't done for integer loads.
  Fixes an issue in a future AMDGPU commit where some identical loads fail to merge because one of the loads ends up dropping the flags.
  llvm-svn: 334020
* [CodeGen] Always update divergence in SelectionDAG::UpdateNodeOperands | Scott Linder | 2018-06-04 | 1 | -0/+30
  Some overloads failed to update divergence.
  Differential Revision: https://reviews.llvm.org/D47148
  llvm-svn: 333947
* [AMDGPU][Waitcnt] Fix handling of flat instrs | Mark Searles | 2018-06-04 | 1 | -11/+4
  On GFX9 and earlier, flat memory ops may decrement VMCNT out-of-order as well as LGKMCNT out-of-order.
  Differential Revision: https://reviews.llvm.org/D46616
  llvm-svn: 333926
* AMDGPU: Switch some half-using tests to use amdhsa | Matt Arsenault | 2018-06-01 | 5 | -162/+157
  The default clover ABI weirdly promotes half to float, which should probably be fixed.
  llvm-svn: 333730
* [AMDGPU] Construct memory clauses before RA | Stanislav Mekhanoshin | 2018-05-31 | 2 | -0/+554
  Memory clauses are formed into bundles in the presence of xnack. Their source operands are marked as early-clobber.
  This allows allocating distinct source and destination registers within a clause and prevents breaking the clause with s_nop in the hazard recognizer.
  Clauses are undone before the post-RA scheduler to allow some rescheduling, which will not break the clause since artificial edges are created in the dag to keep memory operations together. Yet this allows a better ILP in some cases.
  Differential Revision: https://reviews.llvm.org/D47511
  llvm-svn: 333691
* [AMDGPU] Fixed incorrect -mcpu=gfx800 in xnor.ll test. NFC. | Stanislav Mekhanoshin | 2018-05-31 | 1 | -1/+1
  llvm-svn: 333687
* AMDGPU/R600: Make sure functions are cacheline aligned | Jan Vesely | 2018-05-31 | 1 | -0/+13
  v2: use "ensureAlignment"; make functions cache line aligned
  Fixes GPU hangs since r333219: "AMDGPU: Split R600 AsmPrinter code into its own class"
  Differential Revision: https://reviews.llvm.org/D47516
  llvm-svn: 333622
* AMDGPU: Use better alignment for kernarg lowering | Matt Arsenault | 2018-05-30 | 4 | -142/+94
  This was just emitting loads with the ABI alignment for the raw type. The true alignment is often better, especially when an illegal vector type was scalarized. The better alignment allows using a scalar load more often.
  llvm-svn: 333558
* [AMDGPU][Waitcnt] Fix handling of loops with many bottom blocks | Mark Searles | 2018-05-30 | 1 | -0/+34
  For waitcnt insertion, the waitcnt pass forces convergence for a loop if necessary. Previously, that kicked in after more than 2 passes over a loop, which doesn't account for loops with many bottom blocks. So, increase the threshold to (n+1), where n is the number of bottom blocks. This gives the pass an opportunity to consider the contribution of each bottom block to the overall loop before the forced convergence potentially kicks in.
  Differential Revision: https://reviews.llvm.org/D47488
  llvm-svn: 333556
* AMDGPU: Fix broken check lines | Matt Arsenault | 2018-05-29 | 1 | -6/+6
  llvm-svn: 333458
* AMDGPU: Round up kernel argument allocation size | Matt Arsenault | 2018-05-29 | 2 | -3/+98
  AFAIK the driver's allocation will actually have to round this up anyway. It is useful to track the rounded-up size, so that the end of the kernel segment is known to be dereferenceable and a wider s_load_dword can be used for a short argument at the end of the segment.
  llvm-svn: 333456
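The rounding itself is ordinary align-up arithmetic. A minimal sketch follows, assuming a 4-byte (dword) granule purely for illustration, since the message does not state the alignment used; `align_to` is a hypothetical helper name.

```python
def align_to(size, alignment):
    # Round size up to the next multiple of alignment (a power of two).
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return (size + alignment - 1) & ~(alignment - 1)
```

With a 10-byte argument block, `align_to(10, 4)` gives 12, so a 2-byte argument at offset 8 could be read with a 4-byte load that stays inside the allocation.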
* AMDGPU: Always set COMPUTE_PGM_RSRC2.ENABLE_TRAP_HANDLER to zero for AMDHSA as | Konstantin Zhuravlyov | 2018-05-29 | 1 | -2/+2
  it is set by CP
  Differential Revision: https://reviews.llvm.org/D47392
  llvm-svn: 333451
* [AMDGPU] Fixed WWM bug in block otherwise entirely in WQM | Tim Renouf | 2018-05-27 | 1 | -0/+31
  Summary: For a block with WQM on entry and exit and containing no exact mode code, but containing some WWM code, the WQM pass forgot to process the block at all and so did not insert code to enter and leave WWM. This commit fixes that.
  Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
  Differential Revision: https://reviews.llvm.org/D47027
  Change-Id: I044792eead1293bed4203fb26ce75f47878afeb6
  llvm-svn: 333362
* [AMDGPU][Waitcnt] Remove obsolete waitcnt option | Mark Searles | 2018-05-25 | 1 | -2/+2
  With the removal of the old waitcnt pass, the '-enable-si-insert-waitcnts' option is obsolete. Remove it.
  Differential Revision: https://reviews.llvm.org/D47378
  llvm-svn: 333303
* [AMDGPU] Add perf hints to functions | Stanislav Mekhanoshin | 2018-05-25 | 2 | -6/+89
  This is an adoption of the HSAIL perfhint pass. Two types of hints are produced:
  1. Function is memory bound.
  2. Kernel can use wave limiter.
  Currently these hints are used in the scheduler. If a function is suspected to be memory bound we allow occupancy to decrease to 4 waves in the course of scheduling.
  Differential Revision: https://reviews.llvm.org/D46992
  llvm-svn: 333289
* [AMDGPU] Fixed incorrect break from loop | Tim Renouf | 2018-05-25 | 3 | -3/+101
  Summary: Lower control flow did not correctly handle the case that a loop break in if/else was on a condition that was not guaranteed to be masked by exec. The first test kernel shows an example of this going wrong; after exiting the loop, exec is all ones, even if it was not before the loop.
  The fix is for lowering of if-break and else-break to insert an S_AND_B64 to mask the break condition with exec. This commit also includes the optimization of not inserting that S_AND_B64 if it is obviously not needed because the break condition is the result of a V_CMP in the same basic block.
  V2: Addressed some review comments.
  V3: Test fixes.
  Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
  Differential Revision: https://reviews.llvm.org/D44046
  Change-Id: I0fc56a01209a9e99d1d5c9b0ffd16f111caf200c
  llvm-svn: 333258
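The effect of masking the break condition with exec can be shown with plain 64-bit lane masks. This is a simplified model with hypothetical function names, not the control-flow lowering code; it only illustrates why an unmasked condition can mark inactive lanes as having broken out of the loop.

```python
MASK64 = (1 << 64) - 1


def lanes_breaking(cond_mask, exec_mask):
    # Model of masking the break condition with exec (the S_AND_B64 step):
    # only currently active lanes may request a break.
    return cond_mask & exec_mask & MASK64


def accumulate_break(break_mask, cond_mask, exec_mask, mask_with_exec=True):
    # Hypothetical model of accumulating the lanes that have left the loop.
    cond = lanes_breaking(cond_mask, exec_mask) if mask_with_exec else cond_mask
    return (break_mask | cond) & MASK64
```

With `exec_mask = 0b0011` and `cond_mask = 0b1110`, the masked form records only lane 1, whereas the unmasked form also records lanes 2 and 3, which were never active.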
* StructurizeCFG: Adjust the loop depth for a subregion to order the nodes correctly | Changpeng Fang | 2018-05-23 | 1 | -36/+38
  Summary: StructurizeCFG::orderNodes basically uses a reverse post-order (RPO) traversal of the region list to get the order. The only problem with it is that sometimes backedges for outer loops will be visited before backedges for inner loops. To solve this problem, a loop depth based approach has been used to make sure all blocks in this loop have been visited before moving on to the outer loop.
  However, we found a problem for a SubRegion which is a loop itself:
      --> BB1 --> BB2 --> BB3 -->
  In this case, BB2 is a SubRegion (loop), and thus its loop depth is different from that of BB1 and BB3. This fact will lead BB2 to be placed in the wrong order.
  In this work, we treat the SubRegion as a special case and use its exit block to determine the loop and its depth to guard the sorting.
  Reviewers: arsenm, jlebar
  Differential Revision: https://reviews.llvm.org/D46912
  llvm-svn: 333111
* AMDGPU: Fix missing test coverage for some 16-bit and packed ops | Matt Arsenault | 2018-05-22 | 5 | -29/+370
  llvm-svn: 333024
* AMDGPU: Fix v2f16 fneg/fabs pattern | Matt Arsenault | 2018-05-22 | 1 | -2/+20
  The integer operation conversion for some reason only happens if the source is a bitcast from an integer, which happens to always be the situation when the result is loaded. Add an additional pattern for when the source operation is really an FP operation.
  llvm-svn: 333019
* AMDGPU: Make v2i16/v2f16 legal on VI | Matt Arsenault | 2018-05-22 | 29 | -485/+640
  This usually results in better code. Fixes using inline asm with short2, and also fixes having a different ABI for function parameters between VI and gfx9.
  Partially cleans up the mess used for lowering of the d16 operations. Making v4f16 legal will help clean this up more, but this requires additional work.
  llvm-svn: 332953
* [DAG] fold FP binops with undef operands to NaN | Sanjay Patel | 2018-05-21 | 1 | -2/+5
  This is the FP sibling of D43141 with the corresponding IR change in rL327212.
  We can't propagate undef here because if a variable operand is a NaN, these binops must propagate NaN. Neither global nor node-level fast-math makes a difference. If we have 'nnan', I think later folds can turn the NaN into undef.
  The tests in X86/fp-undef.ll are meant to be the definitive verification for these folds - everything reduces identically now. The other test changes are collateral damage. They may need to be altered to preserve their intent.
  Differential Revision: https://reviews.llvm.org/D47026
  llvm-svn: 332920
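The reasoning about NaN propagation can be checked with ordinary floating-point arithmetic. The sketch below is only an illustration in Python (the helper name is hypothetical): whatever value an undef operand is imagined to take, adding it to NaN still yields NaN, so the result is not free to be an arbitrary value, while NaN is always an achievable result because the undef operand could itself be NaN.

```python
import math


def results_for_undef_choices(x, candidates):
    # Every value an undef operand could take yields some result for x + undef.
    return [x + u for u in candidates]
```

For a NaN `x`, every entry of `results_for_undef_choices(x, [0.0, 1.0, -math.inf])` is NaN; for an ordinary `x`, choosing NaN for the undef operand makes the sum NaN as well.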
* AMDGPU: Add pass to optimize reqd_work_group_size | Matt Arsenault | 2018-05-18 | 1 | -0/+501
  Eliminate loads from the dispatch packet when they will have a known value. Also pattern match the code used by the library to handle partial workgroup dispatches, which isn't necessary if reqd_work_group_size is used.
  llvm-svn: 332771
* AMDGPU/SI: Don't promote alloca to vector for atomic load/store | Changpeng Fang | 2018-05-17 | 1 | -0/+65
  Summary: Don't promote alloca to vector for atomic load/store
  Reviewer: arsenm
  Differential Revision: https://reviews.llvm.org/D46085
  llvm-svn: 332673
* AMDGPU/SI: Handle infinite loop for the structurizer to work with CFG with infinite loops. | Changpeng Fang | 2018-05-17 | 5 | -15/+170
  Summary: The current StructurizeCFG pass only works for a CFG with one exit. AMDGPUUnifyDivergentExitNodes combines multiple "return" blocks and/or "unreachable" blocks into one exit block for the Structurizer to work. However, an infinite loop is another kind of special "exit", and if we don't handle it, the case of multiple exits will prevent the structurizer from working.
  In this work, for each infinite loop, we add a dummy edge to the "return" block, and thus the AMDGPUUnifyDivergentExitNodes pass will work with infinite loops. This will make a CFG with infinite loops be structurized.
  Reviewer: nhaehnle
  Differential Revision: https://reviews.llvm.org/D46340
  llvm-svn: 332625
* [AMDGPU] Move lsr test. NFC. | Stanislav Mekhanoshin | 2018-05-17 | 1 | -37/+0
  llvm-svn: 332562
* AMDGPU: Recalculate SGPRs when trap handler is supported | Konstantin Zhuravlyov | 2018-05-16 | 1 | -0/+70
  Differential Revision: https://reviews.llvm.org/D29911
  llvm-svn: 332523
* [AMDGPU] Change llvm.debugtrap to be a debug breakpoint that can resume execution. | Tony Tye | 2018-05-16 | 1 | -13/+23
  No longer require the queue pointer to be passed in in fixed SGPRs.
  Differential Revision: https://reviews.llvm.org/D46769
  llvm-svn: 332485
* AMDGPU: Custom lower v4i16/v4f16 vector operations | Matt Arsenault | 2018-05-16 | 6 | -111/+257
  Avoids stack access. Also handle extract hi elt pattern from truncate + shift to avoid a couple test regressions.
  llvm-svn: 332453
* [AMDGPU] Fix handling of void types in isLegalAddressingMode | Stanislav Mekhanoshin | 2018-05-15 | 1 | -0/+37
  It is legal for the type passed to isLegalAddressingMode to be unsized or, more specifically, VoidTy. In this case, we must check the legality of load / stores for all legal types. Directly trying to call getTypeStoreSize is incorrect, and leads to breakage in e.g. Loop Strength Reduction. This change guards against that behaviour.
  Differential Revision: https://reviews.llvm.org/D40405
  llvm-svn: 332409
* AMDGPU: Add a missing test for the 128-bit local addr space option | Marek Olsak | 2018-05-15 | 1 | -0/+20
  This should have been pushed with: "AMDGPU: enable 128-bit for local addr space under an option"
  llvm-svn: 332404
* AMDGPU/GlobalISel: Implement select() for G_FCONSTANT | Tom Stellard | 2018-05-15 | 2 | -6/+67
  Summary: Also clean up G_CONSTANT selection.
  Reviewers: arsenm, nhaehnle
  Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits
  Differential Revision: https://reviews.llvm.org/D46170
  llvm-svn: 332379
* AMDGPU: Make undef legal for v2i16/v2f16 | Matt Arsenault | 2018-05-13 | 1 | -4/+1
  This is apparently necessary to stop undef from being turned into a build_vector of 0s.
  llvm-svn: 332195
* [AMDGPU] Fix amdgpu-waves-per-eu accounting in scheduler | Stanislav Mekhanoshin | 2018-05-12 | 1 | -0/+591
  We cannot query this attribute from a subtarget given a machine function. At this point the attribute itself is already unavailable and can only be obtained through MFI.
  Differential Revision: https://reviews.llvm.org/D46781
  llvm-svn: 332166
* AMDGPU/GlobalISel: Implement select() for >32-bit G_STORE | Tom Stellard | 2018-05-11 | 1 | -4/+19
  Reviewers: arsenm, nhaehnle
  Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, llvm-commits, t-tye
  Differential Revision: https://reviews.llvm.org/D46153
  llvm-svn: 332154
* AMDGPU/SI: Don't promote alloca to vector for AddrSpaceCast instruction. | Changpeng Fang | 2018-05-11 | 1 | -0/+27
  Summary: We have no logic to promote alloca to vector for an AddrSpaceCast instruction.
  Reviewer: arsenm
  Differential Revision: https://reviews.llvm.org/D45993
  llvm-svn: 332147
* [AMDGPU] Fix compilation failure when IR contains comdat | Yaxun Liu | 2018-05-11 | 1 | -0/+19
  Remove a useless SwitchSection which also causes compilation failure when IR contains comdat.
  The SwitchSection is useless because the current section is already the correct text section for the function, so there is no need to switch.
  It causes compilation failure for comdat because functions with comdat have a specific text section, not the default .text section.
  Since HIP uses comdat, this bug caused failures for HIP.
  Differential Revision: https://reviews.llvm.org/D46770
  llvm-svn: 332137
* AMDGPU/GlobalISel: Implement select() for 32-bit G_FPTOUI | Tom Stellard | 2018-05-11 | 1 | -0/+36
  Reviewers: arsenm, nhaehnle
  Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits
  Differential Revision: https://reviews.llvm.org/D45883
  llvm-svn: 332082