summaryrefslogtreecommitdiffstats
path: root/llvm/test/CodeGen/AMDGPU
Commit message (Collapse)AuthorAgeFilesLines
...
* AMDGPU/GlobalISel: Select G_UADDE/G_USUBEMatt Arsenault2020-01-064-0/+318
|
* AMDGPU/GlobalISel: Replace handling of boolean valuesMatt Arsenault2020-01-0651-2036/+2185
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This solves selection failures with generated selection patterns, which would fail due to inferring the SGPR reg bank for virtual registers with a set register class instead of VCC bank. Use instruction selection would constrain the virtual register to a specific class, so when the def was selected later the bank no longer was set to VCC. Remove the SCC reg bank. SCC isn't directly addressable, so it requires copying from SCC to an allocatable 32-bit register during selection, so these might as well be treated as 32-bit SGPR values. Now any scalar boolean value that will produce an outupt in SCC should be widened during RegBankSelect to s32. Any s1 value should be a vector boolean during selection. This makes the vcc register bank unambiguous with a normal SGPR during selection. Summary of how this should now work: - G_TRUNC is always a no-op, and never should use a vcc bank result. - SALU boolean operations should be promoted to s32 in RegBankSelect apply mapping - An s1 value means vcc bank at selection. The exception is for legalization artifacts that use s1, which are never VCC. All other contexts should infer the VCC register classes for s1 typed registers. The LLT for the register is now needed to infer the correct register class. Extensions with vcc sources should be legalized to a select of constants during RegBankSelect. - Copy from non-vcc to vcc ensures high bits of the input value are cleared during selection. - SALU boolean inputs should ensure the inputs are 0/1. This includes select, conditional branches, and carry-ins. There are a few somewhat dirty details. One is that G_TRUNC/G_*EXT selection ignores the usual register-bank from register class functions, and can't handle truncates with VCC result banks. I think this is OK, since the artifacts are specially treated anyway. This does require some care to avoid producing cases with vcc. There will also be no 100% reliable way to verify this rule is followed in selection in case of register classes, and violations manifests themselves as invalid copy instructions much later. Standard phi handling also only considers the bank of the result register, and doesn't insert copies to make the source banks match. This doesn't work for vcc, so we have to manually correct phi inputs in this case. We should add a verifier check to make sure there are no phis with mixed vcc and non-vcc register bank inputs. There's also some duplication with the LegalizerHelper, and some code which should live in the helper. I don't see a good way to share special knowledge about what types to use for intermediate operations depending on the bank for example. Using the helper to replace extensions with selects also seems somewhat awkward to me. Another issue is there are some contexts calling getRegBankFromRegClass that apparently don't have the LLT type for the register, but I haven't yet run into a real issue from this. This also introduces new unnecessary instructions in most cases, since we don't yet try to optimize out the zext when the source is known to come from a compare.
* GlobalISel: Implement lower for G_INTRINSIC_ROUNDMatt Arsenault2020-01-062-54/+779
| | | | | Mostly copied from AMDGPU lowering implementation, except used G_SITOFP instead of directly creating a select on -1.0, 0.0.
* GlobalISel: Fix unsupported legalize actionMatt Arsenault2020-01-061-0/+78
| | | | | | | | This would complain about invalid legalizer rules otherwise. Mark some operations as unsupported for AMDGPU. This currently seems to produce the same legalize error as when no rules are defined, but eventually this should produce a proper user facing error.
* AMDGPU: Fix legalizing f16 fpowMatt Arsenault2020-01-061-0/+562
| | | | | | The existing test only covered one case for r600. The use of mul_legacy also looks suspicious to me, but leave it for now. The patterns are also not making use of source modifiers.
* AMDGPU/GlobalISel: Select scalar v2s16 G_BUILD_VECTORMatt Arsenault2020-01-061-0/+239
|
* AMDGPU/GlobalISel: Select more G_EXTRACTs correctlyMatt Arsenault2020-01-061-0/+20
| | | | | | | | This assumed a 32-bit extract size, which would produce invalid copies with 64-bit extracts. Handle the easy case. Ideally we would have a way to get the proper subreg index for any 32-bit offset, but there should probably be a tablegenerated way of getting the subreg index for any size and offset.
* GlobalISel: Scalarize all division operationsMatt Arsenault2020-01-044-0/+1732
| | | | | | This only handled G_SDIV, but they all are trivially scalarizable. Also define placeholder AMDGPU division legalizer rules.
* AMDGPU/GlobalISel: Refine SMRD selection rulesMatt Arsenault2020-01-042-32/+150
| | | | | Fix selecting these for volatile global loads, and ensure the loads are constant enough.
* AMDGPU/GlobalISel: Legalize more odd sized loadsMatt Arsenault2020-01-044-302/+34
| | | | | The attempts to widen sufficently aligned, odd sized loads wasn't consistently applied.
* AMDGPU/GlobalISel: Assume vcc phis for any vcc inputMatt Arsenault2020-01-042-84/+66
| | | | | | | This produces more intelligible looking results, more comparabble to the DAG output in the simplest cases. This is probably wrong in complex control flow, but RegBankSelect doesn't attempt analyzing if this is on a masked path for selecting the bank yet.
* [AMDGPU] need to insert wait between the scalar load and vector store to the ↵alex-t2020-01-041-0/+29
| | | | | | | | | | same address to avoid WAR conflict. Reviewers: rampitec, vpykhtin, nhaehnle Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D71934
* AMDGPU: Add gfx9 run lines to a testcaseMatt Arsenault2020-01-031-5/+13
|
* AMDGPU/GlobalISel: Fix off by one in operand indexMatt Arsenault2020-01-033-116/+78
| | | | This should be looking at the RHS of the add for a constant.
* AMDGPU/GlobalISel: Correct MMO sizes in some testsMatt Arsenault2020-01-024-932/+98
| | | | | There intended to test non-extloads, but the memory size did not match the result size.
* AMDGPU/GlobalISel: Regenerate check linesMatt Arsenault2020-01-027-13076/+13076
| | | | | This avoids diff noise in a future commit from the check name change from the G_GEP->G_PTR_ADD rename.
* DAG: Stop trying to fold FP -(x-y) -> y-x in getNode with nszMatt Arsenault2019-12-311-7/+4
| | | | | | | | | | | | | | This was increasing the number of instructions when fsub was legalized on AMDGPU with no signed zeros enabled. This fold should be guarded by hasOneUse, and I don't think getNode should be doing that. The same fold is already done as a regular combine through isNegatibleForFree. This does require duplicating, even though isNegatibleForFree does this combine already (and properly checks hasOneUse) to avoid one PPC regression. In the regression, the outer fneg has nsz but the fsub operand does not. isNegatibleForFree only sees the operand, and doesn't see it's used from a nsz context. A nsz parameter needs to be added and threaded through isNegatibleForFree to avoid this.
* AMDGPU: Precommit test showing extra instructions are introducedMatt Arsenault2019-12-311-0/+27
|
* [amdgpu] Fix scoreboard updating on `s_waitcnt_vscnt`.Michael Liao2019-12-311-0/+17
| | | | | | | | | | Summary: - Other counters are accidentally cleared. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71866
* AMDGPU/GlobalISel: Select mul24 intrinsicsMatt Arsenault2019-12-301-0/+65
|
* [MIPS GlobalISel] Select bitreverse. RecommitPetar Avramovic2019-12-301-6/+15
| | | | | | | | | | | | | | | G_BITREVERSE is generated from llvm.bitreverse.<type> intrinsics, clang genrates these intrinsics from __builtin_bitreverse32 and __builtin_bitreverse64. Add lower and narrowscalar for G_BITREVERSE. Lower G_BITREVERSE on MIPS32. Recommit notes: Introduce temporary variables in order to make sure instructions get inserted into MachineFunction in same order regardless of compiler used to build llvm. Differential Revision: https://reviews.llvm.org/D71363
* AMDGPU/GlobalISel: Select llvm.amdgcn.fmad.ftzMatt Arsenault2019-12-301-0/+233
|
* AMDGPU/GlobalISel: Add select test for fexp2Matt Arsenault2019-12-301-0/+42
|
* GlobalISel: moreElementsVector for FP min/maxMatt Arsenault2019-12-302-56/+28
|
* AMDGPU: Improve llvm.round.f64 lowering for CI+Matt Arsenault2019-12-301-496/+153
| | | | | The path already used for f16/f32 works a lot better when v_trunc_f64 is available.
* AMDGPU: Generate check linesMatt Arsenault2019-12-301-27/+1078
|
* AMDGPU/GlobalISel: Account for G_PHI result bankMatt Arsenault2019-12-302-12/+98
| | | | | | | | | Sometimes the result bank of the phi is already assigned to something, and should not be ignored. This is in preparation for additional boolean phi handling changes. Also refine the logic to fix some cases that were incorrectly deciding to use SGPRs.
* Revert "[MIPS GlobalISel] Select bitreverse"Dmitri Gribenko2019-12-301-15/+6
| | | | | | This reverts commit dbc136e0fe7e14c64dcb78e72321bb41af60afa4. It broke buildbots: http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/21066
* [MIPS GlobalISel] Select bitreversePetar Avramovic2019-12-301-6/+15
| | | | | | | | | | G_BITREVERSE is generated from llvm.bitreverse.<type> intrinsics, clang genrates these intrinsics from __builtin_bitreverse32 and __builtin_bitreverse64. Add lower and narrowscalar for G_BITREVERSE. Lower G_BITREVERSE on MIPS32. Differential Revision: https://reviews.llvm.org/D71363
* AMDGPU: Adjust test so it will work with GlobalISelMatt Arsenault2019-12-271-8/+8
| | | | This is mostly a workaround for not handling the mubuf store path yet.
* AMDGPU/GlobalISel: Use SReg_32 for readfirstlane constrainingMatt Arsenault2019-12-278-15/+15
| | | | | This matches the DAG behavior where we don't use SReg_32_XM0 everywhere anymore, and fixes not coalescing the copies into m0.
* TailDuplication: Clear NoPHIs propertyMatt Arsenault2019-12-271-0/+41
| | | | | | The early tail duplicator pass introduces new ones, so a MIR test that infers no phis since there were none on the input would fail the verifier after running.
* AMDGPU/GlobalISel: Fix extra result register in fdiv64 loweringMatt Arsenault2019-12-271-36/+36
| | | | | | There ended up being two result registers, which would fail on select. It was really defing a new temp register in the correct def position, instead of the correct result register.
* AMDGPU/GlobalISel: Select some 128-bit load/storesMatt Arsenault2019-12-274-48/+52
|
* Migrate function attribute "no-frame-pointer-elim" to "frame-pointer"="all" ↵Fangrui Song2019-12-241-1/+1
| | | | as cleanups after D56351
* AMDGPU/GlobalISel: Legalize some 16-bit round instructionsMatt Arsenault2019-12-246-16/+345
|
* GlobalISel: Define equivalent node for G_INTRINSIC_TRUNCMatt Arsenault2019-12-241-0/+82
|
* AMDGPU/GlobalISel: Lower llvm.amdgcn.elseMatt Arsenault2019-12-241-0/+31
|
* [AMDGPU] Don't create MachinePointerInfos with an UndefValue pointerJay Foad2019-12-232-80/+80
| | | | | | | | | | | | | | | Summary: The only useful information the UndefValue conveys is the address space, which MachinePointerInfo can represent directly without referring to an IR value. Reviewers: arsenm, rampitec Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, Petar.Avramovic, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71838
* [DAGCombiner] Check term use before applying aggressive FSUB optimisationsCarl Ritson2019-12-231-2/+116
| | | | | | | | | | | | | | | | Summary: Without this check unnecessary FMA instructions are generated when the FSUB terms are reused. This also has the side-effect that the same value is computed to different levels of precision, which can create undesirable effects if the results are used together in subsequent computation. Reviewers: arsenm, nhaehnle, foad, tpr, dstuttard, spatel Reviewed By: arsenm Subscribers: jvesely, wdng, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71656
* AMDGPU/GlobalISel: Fix misuse of div_scale intrinsicsMatt Arsenault2019-12-211-152/+152
| | | | | | | | | | | Confusingly, the intrinsic operands do not match the instruction/custom node. The order is shuffled, and the 3rd operand is an immediate to select operands. I'm not 100% sure I did this right, but fdiv still doesn't select end to end and it will be easier to tell when it does. This at least avoids an assertion in RegBankSelect and allows hitting the fallback on selection.
* AMDGPU/GlobalISel: Fix missing scc imp-def on scalar and/or/xorMatt Arsenault2019-12-213-56/+56
|
* [AMDGPU] Fix typo in SIInstrInfo::memOpsHaveSameBasePtrJay Foad2019-12-1722-3596/+3615
| | | | | | | | | | | | | | | Summary: The typo has been present since memOpsHaveSameBasePtr was introduced in r313208. It caused SIInstrInfo::shouldClusterMemOps to cluster more mem ops than it was supposed to. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71616
* [AMDGPU] Update autogenerated checksJay Foad2019-12-171-3/+5
|
* PostRA Machine Sink should take care of COPY defining register that is a ↵alex-t2019-12-171-0/+31
| | | | | | sub-register by another COPY source operand Differential Revision: https://reviews.llvm.org/D71132
* Fix for AMDGPU MUL_I24 known bits calculationJay Foad2019-12-161-0/+34
| | | | | | | | | | | | | | | | | | | Summary: At present, the code calculating known bits of AMDGPU MUL_I24 confuses the concepts of "non-negative number" and "positive number". In some situations, it results in incorrect code. I have a case where the optimizer replaces the result of calculating MUL_I24(-5, 0) with -8. Reviewers: foad, arsenm Reviewed By: arsenm Subscribers: foad, arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Patch by Eugene Kuznetsov. Differential Revision: https://reviews.llvm.org/D70367
* [DAG] Add SimplifyDemandedBits support for BSWAPSanjay Patel2019-12-151-1/+0
| | | | | | | | | This exposes a shortcoming for AArch64, and that is tracked by PR40881: https://bugs.llvm.org/show_bug.cgi?id=40881 Patch by: @RKSimon (Simon Pilgrim) Differential Revision: https://reviews.llvm.org/D58017
* [Legalizer] Making artifact combining order-independentRoman Tereshin2019-12-132-65/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Legalization algorithm is complicated by two facts: 1) While regular instructions should be possible to legalize in an isolated, per-instruction, context-free manner, legalization artifacts can only be eliminated in pairs, which could be deeply, and ultimately arbitrary nested: { [ () ] }, where which paranthesis kind depicts an artifact kind, like extend, unmerge, etc. Such structure can only be fully eliminated by simple local combines if they are attempted in a particular order (inside out), or alternatively by repeated scans each eliminating only one innermost pair, resulting in O(n^2) complexity. 2) Some artifacts might in fact be regular instructions that could (and sometimes should) be legalized by the target-specific rules. Which means failure to eliminate all artifacts on the first iteration is not a failure, they need to be tried as instructions, which may produce more artifacts, including the ones that are in fact regular instructions, resulting in a non-constant number of iterations required to finish the process. I trust the recently introduced termination condition (no new artifacts were created during as-a-regular-instruction-retrial of artifacts not eliminated on the previous iteration) to be efficient in providing termination, but only performing the legalization in full if and only if at each step such chains of artifacts are successfully eliminated in full as well. Which is currently not guaranteed, as the artifact combines are applied only once and in an arbitrary order that has to do with the order of creation or insertion of artifacts into their worklist, which is a no particular order. In this patch I make a small change to the artifact combiner, making it to re-insert into the worklist immediate (modulo a look-through copies) artifact users of each vreg that changes its definition due to an artifact combine. Here the first scan through the artifacts worklist, while not being done in any guaranteed order, only needs to find the innermost pair(s) of artifacts that could be immediately combined out. After that the process follows def-use chains, making them shorter at each step, thus combining everything that can be combined in O(n) time. Reviewers: volkan, aditya_nandakumar, qcolombet, paquette, aemerson, dsanders Reviewed By: aditya_nandakumar, paquette Tags: #llvm Differential Revision: https://reviews.llvm.org/D71448
* Revert "AMDGPU: Try to commute sub of boolean ext"Tim Renouf2019-12-132-10/+49
| | | | | | | | | | | | | | | | | | | | This reverts commit 69fcfb7d3597e0cdb5554b4e672e9032b411b167. As shown in the test I attached to this commit, the change I reverted causes a problem with "zext(cc1) - zext(cc2)". It commuted the operands to the sub and used different logic to select the addc/subc instruction: sub zext (setcc), x => addcarry 0, x, setcc sub sext (setcc), x => subcarry 0, x, setcc ... but that is bogus. I believe it is not possible to fold those commuted patterns into any form of addcarry or subcarry. It may have worked as intended before "AMDGPU: Change boolean content type to 0 or 1" because the setcc was considered to be -1 rather than 1. Differential Revision: https://reviews.llvm.org/D70978 Change-Id: If2139421aa6c935cbd1d925af58fe4a4aa9e8f43
* [MBP] Avoid tail duplication if it can't bring benefitGuozhi Wei2019-12-062-5/+6
| | | | | | | | | | | | | Current tail duplication integrated in bb layout is designed to increase the fallthrough from a BB's predecessor to its successor, but we have observed cases that duplication doesn't increase fallthrough, or it brings too much size overhead. To overcome these two issues in function canTailDuplicateUnplacedPreds I add two checks: make sure there is at least one duplication in current work set. the number of duplication should not exceed the number of successors. The modification in hasBetterLayoutPredecessor fixes a bug that potential predecessor must be at the bottom of a chain. Differential Revision: https://reviews.llvm.org/D64376
OpenPOWER on IntegriCloud