| Commit message | Author | Age | Files | Lines |
| |
This solves selection failures with generated selection patterns, which
would fail due to inferring the SGPR reg bank for virtual registers
with a set register class instead of the VCC bank. Selecting the use
instruction would constrain the virtual register to a specific class,
so when the def was selected later the bank was no longer set to VCC.
Remove the SCC reg bank. SCC isn't directly addressable, so it
requires copying from SCC to an allocatable 32-bit register during
selection, so these might as well be treated as 32-bit SGPR values.
Now any scalar boolean value that will produce an output in SCC should
be widened during RegBankSelect to s32. Any s1 value should be a
vector boolean during selection. This makes the vcc register bank
unambiguous with a normal SGPR during selection.
Summary of how this should now work:
- G_TRUNC is always a no-op, and should never use a vcc bank result.
- SALU boolean operations should be promoted to s32 in the RegBankSelect
apply mapping.
- An s1 value means vcc bank at selection. The exception is for
legalization artifacts that use s1, which are never VCC. All other
contexts should infer the VCC register classes for s1 typed
registers. The LLT for the register is now needed to infer the
correct register class. Extensions with vcc sources should be
legalized to a select of constants during RegBankSelect.
- Copy from non-vcc to vcc ensures high bits of the input value are
cleared during selection.
- SALU boolean inputs should ensure the inputs are 0/1. This includes
select, conditional branches, and carry-ins.
There are a few somewhat dirty details. One is that G_TRUNC/G_*EXT
selection ignores the usual register-bank from register class
functions, and can't handle truncates with VCC result banks. I think
this is OK, since the artifacts are specially treated anyway. This
does require some care to avoid producing cases with vcc. There will
also be no 100% reliable way to verify this rule is followed in
selection in case of register classes, and violations manifests
themselves as invalid copy instructions much later.
Standard phi handling also only considers the bank of the result
register, and doesn't insert copies to make the source banks
match. This doesn't work for vcc, so we have to manually correct phi
inputs in this case. We should add a verifier check to make sure there
are no phis with mixed vcc and non-vcc register bank inputs.
There's also some duplication with the LegalizerHelper, and some code
which should live in the helper. I don't see a good way to share
special knowledge about what types to use for intermediate operations
depending on the bank for example. Using the helper to replace
extensions with selects also seems somewhat awkward to me.
Another issue is there are some contexts calling
getRegBankFromRegClass that apparently don't have the LLT type for the
register, but I haven't yet run into a real issue from this.
This also introduces new unnecessary instructions in most cases, since
we don't yet try to optimize out the zext when the source is known to
come from a compare.
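
As an illustration of the intended split (a minimal sketch in
hypothetical MIR; the value names and operands are invented, not taken
from an actual test), a uniform compare now produces a widened s32
boolean on the sgpr bank, while a divergent compare keeps an s1 result
on the vcc bank:

  %uniform_cond:sgpr(s32) = G_ICMP intpred(eq), %a(s32), %b(s32)
  %divergent_cond:vcc(s1) = G_ICMP intpred(eq), %x(s32), %y(s32)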
|
| |
Mostly copied from the AMDGPU lowering implementation, except it uses
G_SITOFP instead of directly creating a select on -1.0, 0.0.
|
| |
This would complain about invalid legalizer rules otherwise.
Mark some operations as unsupported for AMDGPU. This currently seems
to produce the same legalize error as when no rules are defined, but
eventually this should produce a proper user-facing error.
|
| |
The existing test only covered one case for r600. The use of
mul_legacy also looks suspicious to me, but I'm leaving it for now. The
patterns are also not making use of source modifiers.
|
| |
This assumed a 32-bit extract size, which would produce invalid copies
with 64-bit extracts. Handle the easy case. Ideally we would have a
way to get the proper subreg index for any 32-bit offset, but there
should probably be a TableGen-generated way of getting the subreg index
for any size and offset.
|
| |
This only handled G_SDIV, but they all are trivially scalarizable.
Also define placeholder AMDGPU division legalizer rules.
|
| |
Fix selecting these for volatile global loads, and ensure the loads
are constant enough.
|
| |
The attempt to widen sufficiently aligned, odd-sized loads wasn't
consistently applied.
|
| |
This produces more intelligible-looking results, more comparable to
the DAG output in the simplest cases. This is probably wrong in
complex control flow, but RegBankSelect doesn't yet attempt to analyze
whether this is on a masked path when selecting the bank.
|
| |
same address to avoid WAR conflict.
Reviewers: rampitec, vpykhtin, nhaehnle
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D71934
|
| |
This should be looking at the RHS of the add for a constant.
|
| |
These were intended to test non-extloads, but the memory size did not match
the result size.
|
| |
This avoids diff noise in a future commit from the check-name change
caused by the G_GEP->G_PTR_ADD rename.
|
| |
This was increasing the number of instructions when fsub was legalized
on AMDGPU with no signed zeros enabled. This fold should be guarded by
hasOneUse, and I don't think getNode should be doing that. The same
fold is already done as a regular combine through isNegatibleForFree.
This does require duplicating the combine to avoid one PPC regression,
even though isNegatibleForFree already does it (and properly checks
hasOneUse). In the regression, the outer fneg has nsz but the fsub
operand does not. isNegatibleForFree only sees the operand, and
doesn't see that it's used from an nsz context. An nsz parameter needs
to be added and threaded through isNegatibleForFree to avoid this.
|
| |
Summary: Other counters are accidentally cleared.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71866
|
| |
G_BITREVERSE is generated from llvm.bitreverse.<type> intrinsics;
clang generates these intrinsics from __builtin_bitreverse32 and
__builtin_bitreverse64.
Add lower and narrowScalar for G_BITREVERSE.
Lower G_BITREVERSE on MIPS32.
Recommit notes:
Introduce temporary variables in order to make sure
instructions get inserted into the MachineFunction in the same order
regardless of the compiler used to build LLVM.
Differential Revision: https://reviews.llvm.org/D71363
|
| |
The path already used for f16/f32 works a lot better when v_trunc_f64
is available.
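
(The f16/f32 path presumably lowers ffloor along the lines of
floor(x) = trunc(x) + (x < trunc(x) ? -1.0 : 0.0), so the only target
operation it needs is a native trunc, hence the benefit once
v_trunc_f64 exists.)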
|
| |
Sometimes the result bank of the phi is already assigned to something,
and should not be ignored. This is in preparation for additional
boolean phi handling changes.
Also refine the logic to fix some cases that were incorrectly deciding
to use SGPRs.
|
| |
This reverts commit dbc136e0fe7e14c64dcb78e72321bb41af60afa4.
It broke buildbots:
http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/21066
|
| |
G_BITREVERSE is generated from llvm.bitreverse.<type> intrinsics;
clang generates these intrinsics from __builtin_bitreverse32 and
__builtin_bitreverse64.
Add lower and narrowScalar for G_BITREVERSE.
Lower G_BITREVERSE on MIPS32.
Differential Revision: https://reviews.llvm.org/D71363
|
| |
This is mostly a workaround for not handling the mubuf store path yet.
|
| |
This matches the DAG behavior where we don't use SReg_32_XM0
everywhere anymore, and fixes the copies into m0 not being coalesced.
|
| |
The early tail duplicator pass introduces new phis, so a MIR test that
infers the no-phis property because there were none in the input would
fail the verifier after running.
|
| |
There ended up being two result registers, which would fail at
selection. It was really defining a new temp register in the correct
def position, instead of the correct result register.
|
| |
as cleanups after D56351
|
| |
Summary:
The only useful information the UndefValue conveys is the address space,
which MachinePointerInfo can represent directly without referring to an
IR value.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71838
|
| |
Summary:
Without this check unnecessary FMA instructions are generated when the FSUB terms are reused.
This also has the side-effect that the same value is computed to different levels of precision, which can create undesirable effects if the results are used together in subsequent computation.
Reviewers: arsenm, nhaehnle, foad, tpr, dstuttard, spatel
Reviewed By: arsenm
Subscribers: jvesely, wdng, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71656
|
| |
Confusingly, the intrinsic operands do not match the
instruction/custom node. The order is shuffled, and the 3rd operand is
an immediate to select operands.
I'm not 100% sure I did this right, but fdiv still doesn't select end
to end and it will be easier to tell when it does. This at least
avoids an assertion in RegBankSelect and allows hitting the fallback
on selection.
|
| |
Summary:
The typo has been present since memOpsHaveSameBasePtr was introduced in
r313208.
It caused SIInstrInfo::shouldClusterMemOps to cluster more mem ops than
it was supposed to.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71616
|
| |
sub-register by another COPY source operand
Differential Revision: https://reviews.llvm.org/D71132
|
| |
Summary:
At present, the code calculating known bits of AMDGPU MUL_I24 confuses the concepts of "non-negative number" and "positive number".
In some situations, it results in incorrect code. I have a case where the optimizer replaces the result of calculating MUL_I24(-5, 0) with -8.
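
(Presumably the issue is that a value proven merely non-negative, i.e.
possibly zero, is treated as strictly positive, so the product of a
negative and a non-negative operand gets reported as strictly negative
in its known bits; MUL_I24(-5, 0) = 0 contradicts that, which is how a
fold can end up substituting a wrong constant such as the -8 above.)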
Reviewers: foad, arsenm
Reviewed By: arsenm
Subscribers: foad, arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Patch by Eugene Kuznetsov.
Differential Revision: https://reviews.llvm.org/D70367
|
| |
This exposes a shortcoming for AArch64, and that is tracked by PR40881:
https://bugs.llvm.org/show_bug.cgi?id=40881
Patch by: @RKSimon (Simon Pilgrim)
Differential Revision: https://reviews.llvm.org/D58017
|
| |
The legalization algorithm is complicated by two facts:
1) While regular instructions should be possible to legalize in
an isolated, per-instruction, context-free manner, legalization
artifacts can only be eliminated in pairs, which could be deeply, and
ultimately arbitrarily, nested: { [ () ] }, where each parenthesis kind
depicts an artifact kind, like extend, unmerge, etc. Such a structure
can only be fully eliminated by simple local combines if they are
attempted in a particular order (inside out), or alternatively by
repeated scans, each eliminating only one innermost pair, resulting in
O(n^2) complexity.
2) Some artifacts might in fact be regular instructions that could (and
sometimes should) be legalized by the target-specific rules. Which
means failure to eliminate all artifacts on the first iteration is
not a failure, they need to be tried as instructions, which may
produce more artifacts, including the ones that are in fact regular
instructions, resulting in a non-constant number of iterations
required to finish the process.
I trust the recently introduced termination condition (no new artifacts
were created during the as-a-regular-instruction retrial of artifacts
not eliminated on the previous iteration) to be effective in providing
termination, but it only performs the legalization in full if at each
step such chains of artifacts are successfully eliminated in full as
well.
That is currently not guaranteed, as the artifact combines are applied
only once, and in an arbitrary order that has to do with the order of
creation or insertion of artifacts into their worklist, which is no
particular order.
In this patch I make a small change to the artifact combiner, making it
re-insert into the worklist the immediate (modulo looking through
copies) artifact users of each vreg that changes its definition due to
an artifact combine.
Here the first scan through the artifacts worklist, while not
being done in any guaranteed order, only needs to find the innermost
pair(s) of artifacts that could be immediately combined out. After that
the process follows def-use chains, making them shorter at each step, thus
combining everything that can be combined in O(n) time.
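
As a rough sketch of the def-use following (hypothetical generic MIR,
names invented; not taken from the patch or its tests):

  %1:_(s16) = G_TRUNC %0(s32)
  %2:_(s32) = G_ANYEXT %1(s16)
  %3:_(s16), %4:_(s16) = G_UNMERGE_VALUES %2(s32)

Once the G_ANYEXT/G_TRUNC pair is combined away and uses of %2 are
rewritten to %0, the G_UNMERGE_VALUES user is re-inserted into the
worklist and retried against the new definition immediately, instead of
waiting for another full scan.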
Reviewers: volkan, aditya_nandakumar, qcolombet, paquette, aemerson, dsanders
Reviewed By: aditya_nandakumar, paquette
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71448
|
| |
This reverts commit 69fcfb7d3597e0cdb5554b4e672e9032b411b167.
As shown in the test I attached to this commit, the change I reverted
causes a problem with "zext(cc1) - zext(cc2)". It commuted
the operands to the sub and used different logic to select the addc/subc
instruction:
sub zext (setcc), x => addcarry 0, x, setcc
sub sext (setcc), x => subcarry 0, x, setcc
... but that is bogus. I believe it is not possible to fold those commuted
patterns into any form of addcarry or subcarry. It may have worked as
intended before "AMDGPU: Change boolean content type to 0 or 1" because
the setcc was considered to be -1 rather than 1.
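
As a quick check with hypothetical values: for setcc = 1 and x = 5,
sub (zext setcc), x is 1 - 5 = -4, while addcarry 0, x, setcc computes
0 + x + setcc = 6, so the commuted rewrite cannot be equivalent.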
Differential Revision: https://reviews.llvm.org/D70978
Change-Id: If2139421aa6c935cbd1d925af58fe4a4aa9e8f43
|
| |
Current tail duplication integrated in BB layout is designed to increase the fallthrough from a BB's predecessor to its successor, but we have observed cases where duplication doesn't increase fallthrough, or it brings too much size overhead.
To overcome these two issues, in the function canTailDuplicateUnplacedPreds I add two checks:
- make sure there is at least one duplication in the current work set.
- the number of duplications should not exceed the number of successors.
The modification in hasBetterLayoutPredecessor fixes a bug that a potential predecessor must be at the bottom of a chain.
Differential Revision: https://reviews.llvm.org/D64376
|