| Commit message | Author | Age | Files | Lines |
| |
Summary:
We were using symbols to represent labels and csects interchangeably before, and that could be a problem.
There are cases where we need to add a storage mapping class to a symbol when that symbol is actually the name of a csect, but it is hard to figure out whether the symbol is a label or a csect.
This patch intends to do the following:
1. Construct a QualName (a name that includes the storage mapping class)
MCSymbolXCOFF for every MCSectionXCOFF.
2. Keep a pointer to that QualName inside the MCSectionXCOFF.
3. Use that QualName whenever we need a symbol that refers to that
MCSectionXCOFF.
4. Adapt XCOFFObjectWriter.cpp to the knock-on effects of the above
changes.
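A minimal sketch of how the QualName is intended to be used (the accessor and helper names here are illustrative, not necessarily the exact ones this patch adds):
```cpp
#include "llvm/MC/MCContext.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCSectionXCOFF.h"
using namespace llvm;

// Refer to a csect through its qualified-name symbol, e.g. "foo[RW]" (name
// plus storage mapping class), rather than through a plain label symbol that
// happens to share the name.
static const MCExpr *getCsectRef(const MCSectionXCOFF &Csect, MCContext &Ctx) {
  MCSymbolXCOFF *QualName = Csect.getQualNameSymbol(); // pointer kept by the section
  return MCSymbolRefExpr::create(QualName, Ctx);
}
```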
Reviewers: xingxue, DiggerLin, sfertile, daltenty, hubert.reinterpretcast
Reviewed By: DiggerLin, daltenty
Subscribers: wuzish, nemanjai, mgorny, hiraditya, kbarton, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69633
|
| |
See https://bugs.llvm.org/show_bug.cgi?id=40903
Reviewers: arsenm, rampitec
Differential Revision: https://reviews.llvm.org/D69888
|
| |
Refactor the isCopyInstrImpl, isCopyInstr and isAddImmediate methods
to return an optional machine operand pair of destination and source
registers.
Patch by Nikola Prica
Differential Revision: https://reviews.llvm.org/D69622
|
| |
Summary:
The greedy register allocator occasionally decides to insert a large number of
unnecessary copies; see below for an example. The -consider-local-interval-cost
option (which X86 already enables by default) fixes this. We enable this option
for AArch64 only, after receiving feedback that the change is not beneficial for
PowerPC.
We evaluated the impact of this change on compile time, code size and
performance benchmarks.
This option has a small impact on compile time, measured on CTMark: a 0.1%
geomean regression at -O1 and -O2 and a 0.2% geomean regression at -O3, with at
most 0.5% on individual benchmarks.
The effect on both code size and performance on AArch64 for the LLVM test suite
is nil on the geomean with individual outliers (ignoring short exec_times)
between:
best worst
size..text -3.3% +0.0%
exec_time -5.8% +2.3%
On SPEC CPU® 2017 (compiled for AArch64) there is a minor reduction (-0.2% at
most) in code size on some benchmarks, with a tiny movement (-0.01%) on the
geomean. Neither intrate nor fprate show any change in performance.
This patch makes the following changes.
- For the AArch64 target, enableAdvancedRASplitCost() now returns true.
- Ensures that -consider-local-interval-cost=false can disable the new
behaviour if necessary.
This matrix multiply example:
$ cat test.c
long A[8][8];
long B[8][8];
long C[8][8];
void run_test() {
for (int k = 0; k < 8; k++) {
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
results in the following generated code on AArch64:
$ clang --target=aarch64-arm-none-eabi -O3 -S test.c -o -
[...]
// %for.cond1.preheader
// =>This Inner Loop Header: Depth=1
add x14, x11, x9
str q0, [sp, #16] // 16-byte Folded Spill
ldr q0, [x14]
mov v2.16b, v15.16b
mov v15.16b, v14.16b
mov v14.16b, v13.16b
mov v13.16b, v12.16b
mov v12.16b, v11.16b
mov v11.16b, v10.16b
mov v10.16b, v9.16b
mov v9.16b, v8.16b
mov v8.16b, v31.16b
mov v31.16b, v30.16b
mov v30.16b, v29.16b
mov v29.16b, v28.16b
mov v28.16b, v27.16b
mov v27.16b, v26.16b
mov v26.16b, v25.16b
mov v25.16b, v24.16b
mov v24.16b, v23.16b
mov v23.16b, v22.16b
mov v22.16b, v21.16b
mov v21.16b, v20.16b
mov v20.16b, v19.16b
mov v19.16b, v18.16b
mov v18.16b, v17.16b
mov v17.16b, v16.16b
mov v16.16b, v7.16b
mov v7.16b, v6.16b
mov v6.16b, v5.16b
mov v5.16b, v4.16b
mov v4.16b, v3.16b
mov v3.16b, v1.16b
mov x12, v0.d[1]
fmov x15, d0
ldp q1, q0, [x14, #16]
ldur x1, [x10, #-256]
ldur x2, [x10, #-192]
add x9, x9, #64 // =64
mov x13, v1.d[1]
fmov x16, d1
ldr q1, [x14, #48]
mul x3, x15, x1
mov x14, v0.d[1]
fmov x17, d0
mov x18, v1.d[1]
fmov x0, d1
mov v1.16b, v3.16b
mov v3.16b, v4.16b
mov v4.16b, v5.16b
mov v5.16b, v6.16b
mov v6.16b, v7.16b
mov v7.16b, v16.16b
mov v16.16b, v17.16b
mov v17.16b, v18.16b
mov v18.16b, v19.16b
mov v19.16b, v20.16b
mov v20.16b, v21.16b
mov v21.16b, v22.16b
mov v22.16b, v23.16b
mov v23.16b, v24.16b
mov v24.16b, v25.16b
mov v25.16b, v26.16b
mov v26.16b, v27.16b
mov v27.16b, v28.16b
mov v28.16b, v29.16b
mov v29.16b, v30.16b
mov v30.16b, v31.16b
mov v31.16b, v8.16b
mov v8.16b, v9.16b
mov v9.16b, v10.16b
mov v10.16b, v11.16b
mov v11.16b, v12.16b
mov v12.16b, v13.16b
mov v13.16b, v14.16b
mov v14.16b, v15.16b
mov v15.16b, v2.16b
ldr q2, [sp] // 16-byte Folded Reload
fmov d0, x3
mul x3, x12, x1
[...]
With -consider-local-interval-cost the same section of code results in the
following:
$ clang --target=aarch64-arm-none-eabi -mllvm -consider-local-interval-cost -O3 -S test.c -o -
[...]
.LBB0_1: // %for.cond1.preheader
// =>This Inner Loop Header: Depth=1
add x14, x11, x9
ldp q0, q1, [x14]
ldur x1, [x10, #-256]
ldur x2, [x10, #-192]
add x9, x9, #64 // =64
mov x12, v0.d[1]
fmov x15, d0
mov x13, v1.d[1]
fmov x16, d1
ldp q0, q1, [x14, #32]
mul x3, x15, x1
cmp x9, #512 // =512
mov x14, v0.d[1]
fmov x17, d0
fmov d0, x3
mul x3, x12, x1
[...]
Reviewers: SjoerdMeijer, samparker, dmgreen, qcolombet
Reviewed By: dmgreen
Subscribers: ZhangKang, jsji, wuzish, ppc-slack, lkail, steven.zhang, MatzeB, qcolombet, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69437
|
| |
The following testcase
function:
.Lpcrel_label1:
auipc a0, %pcrel_hi(other_function)
addi a1, a0, %pcrel_lo(.Lpcrel_label1)
.p2align 2 # Causes a new fragment to be emitted
.type other_function,@function
other_function:
ret
exposes an odd behaviour in which only the %pcrel_hi relocation is
evaluated but not the %pcrel_lo.
$ llvm-mc -triple riscv64 -filetype obj t.s | llvm-objdump -d -r -
<stdin>: file format ELF64-riscv
Disassembly of section .text:
0000000000000000 function:
0: 17 05 00 00 auipc a0, 0
4: 93 05 05 00 mv a1, a0
0000000000000004: R_RISCV_PCREL_LO12_I other_function+4
0000000000000008 other_function:
8: 67 80 00 00 ret
The reason seems to be that in RISCVAsmBackend::shouldForceRelocation we
only consider the fragment but in RISCVMCExpr::evaluatePCRelLo we
consider the section. This usually works but there are cases where the
section may still be the same while the fragment is a different one. In
that case we end up forcing a %pcrel_lo relocation without any %pcrel_hi.
This patch makes RISCVAsmBackend::shouldForceRelocation use the section,
if any, to determine if the relocation must be forced or not.
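A rough sketch of the section-based check (simplified and with illustrative names; the real logic lives in RISCVAsmBackend::shouldForceRelocation):
```cpp
#include "llvm/MC/MCFragment.h"
#include "llvm/MC/MCSection.h"
using namespace llvm;

// Two different fragments (e.g. split by .p2align) can still belong to the
// same section; only force the %pcrel_lo relocation when the sections of the
// fixup and its %pcrel_hi actually differ.
static bool inDifferentSections(const MCFragment *HiFrag, const MCFragment *LoFrag) {
  const MCSection *HiSec = HiFrag ? HiFrag->getParent() : nullptr;
  const MCSection *LoSec = LoFrag ? LoFrag->getParent() : nullptr;
  return HiSec != LoSec;
}
```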
Differential Revision: https://reviews.llvm.org/D60657
|
| |
-mattr=+alu32 has shown good performance compared to builds without this attribute.
Based on the discussion at
https://lore.kernel.org/bpf/1ec37838-966f-ec0b-5223-ca9b6eb0860d@fb.com/T/#t
cpu version v3 should support -mattr=+alu32.
This patch enables alu32 if the cpu version is v3, whether specified by the user
or probed by LLVM.
Differential Revision: https://reviews.llvm.org/D69957
|
| |
This option allows the user to specify the use of absolute jumptables instead
of relative ones, which are the default on most PPC subtargets.
Patch by Kamauu Bridgeman
Differential revision: https://reviews.llvm.org/D69108
|
| |
Differential Revision: https://reviews.llvm.org/D69851
|
| |
Differential Revision: https://reviews.llvm.org/D69850
|
| |
`saa` and `saad` are 32-bit and 64-bit store atomic add instructions.
memory[base] = memory[base] + rt
These instructions are available on the "Octeon+" CPU. The patch adds support
for both instructions to the MIPS assembler and disassembler and introduces a
new CPU type, "octeon+".
Next patches will implement `.set arch=octeon+` directive and `AFL_EXT_OCTEONP`
ISA extension flag support.
Differential Revision: https://reviews.llvm.org/D69849
|
| |
Summary: [AMDGPU] Fix bug introduced in 47a5c36b37f0
Reviewers: foad, arsenm
Reviewed By: arsenm
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69915
|
| |
Leftovers from before we switched to widening legalization.
Fixes PR43919.
|
| |
Add pattern matching and intrinsics for the following instructions:
predicated orr, eor, and, bic
predicated mul, smulh, umulh, sdiv, udiv, sdivr, udivr
predicated smax, umax, smin, umin, sabd, uabd
mad, msb, mla, mls
https://reviews.llvm.org/D69588
|
| |
This only works if there is no use of the return value.
|
| |
This was omitted. Also, SReg_96Reg was missing the IsSGPR assignment.
Differential Revision: https://reviews.llvm.org/D69919
|
| |
return value location depends on the calling convention of the callee.
`F.getCallingConv()`, however, is the caller CC. Correct it to the
callee CC from `CallLoweringInfo`.
Fixes PR43449
Patch by Shu-Chun Weng!
|
| |
The MMX intrinsics for shift by immediate take a 32-bit shift
amount, but the hardware for shifting by immediate only encodes
8 bits. For the intrinsic we don't require the shift amount to
fit in 8 bits in the frontend because we don't check that it's an
immediate there. If it is not an immediate we move it
to an MMX register and use the shift by register.
But if it is an immediate we'll use the shift by immediate
instruction, and we need to change the shift amount to 8 bits.
We were previously doing this accidentally by masking it in the
encoder. But that can turn a large shift amount into a small,
in-bounds one. Instead we should clamp larger shift
amounts to 255 so that they don't become in bounds.
Fixes PR43922
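A minimal sketch of the clamping idea (a standalone helper for illustration, not the exact lowering code):
```cpp
#include <cstdint>

// The shift-by-immediate encoding only has 8 bits, so saturate rather than
// truncate: an amount such as 256 must stay out of bounds (all-zero result)
// instead of silently becoming an in-bounds shift of 0.
static unsigned clampMMXShiftImm(uint64_t ShiftAmt) {
  return ShiftAmt > 255 ? 255u : static_cast<unsigned>(ShiftAmt);
}
```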
|
| |
These patterns were added in D46009, but removed in D54276 due to
missing test coverage.
Differential Revision: https://reviews.llvm.org/D69831
|
| |
always true. NFCI.
|
| |
Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2i64 ops.
Numbers taken from Intel AOM (+ checked against Agner)
|
| |
Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2f64/v2i64 ops.
|
| |
As noted on D59710 we weren't handling the high costs of these operations on SLM.
|
| |
The store splitting transform was assuming a simple type (MVT),
but that's not necessarily the case as shown in the test.
|
| |
PVS Studio noticed that we were asserting "VT.getVectorNumElements() == VT.getVectorNumElements()" instead of "VT.getVectorNumElements() == InVT.getVectorNumElements()".
|
| |
Summary:
Inserting BTI instructions can push branch destinations out of range.
The branch relaxation pass itself cannot insert indirect branches since `TargetInstrInfo::insertIndirectBranch` is not implemented for AArch64 (presumably because the +/-128 MB direct branch range is more than enough in practice).
Testing this is a bit tricky.
The original test case we have is 155kloc/6.1M. I've generated a test case using this program:
```
#include <iostream>

int main() {
  std::cout << R"src(int test();
void g0(), g1(), g2(), g3(), g4(), e();
void f(int v) {
if ((test() & 2) == 0) {
switch (v) {
case 0:
g0();
case 1:
g1();
case 2:
g2();
case 3:
g3();
}
)src";
  const int N = 8176;
  for (int i = 0; i < N; ++i)
    std::cout << " void h" << i << "();\n";
  for (int i = 0; i < N; ++i)
    std::cout << " h" << i << "();\n";
  std::cout << R"src(
} else {
e();
}
}
)src";
}
```
which is still a bit too much to commit as a regression test, IMHO.
Reviewers: t.p.northover, ostannard
Reviewed By: ostannard
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69118
Change-Id: Ide5c922bcde08ff4cf635da5e52365525a997a0a
|
| |
Summary: Added estimations for ShuffleVector, some cast and arithmetic instructions
Reviewers: rampitec
Reviewed By: rampitec
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69629
|
| |
We have two ways to steer the vectoriser towards creating a predicated vector
body instead of a scalar epilogue. To force this, we have 1) a command line
option and 2) a pragma available. This adds a third: a target hook in
TargetTransformInfo that can be queried as to whether predication is preferred
or not, which allows the vectoriser to make the decision without it being forced.
While this change behaves as a non-functional change for now, it shows the
required TTI plumbing, usage of this new hook in the vectoriser, and the
beginning of an ARM MVE implementation. I will follow up on this with:
- a complete MVE implementation, see D69845.
- a patch to disable this, i.e. we should respect "vector_predicate(disable)"
and its corresponding loop hint.
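A self-contained sketch of the intended decision order (names are illustrative; the real hook sits on TargetTransformInfo and the option/pragma handling in the vectoriser is more involved):
```cpp
enum class TailFoldChoice { Default, ForceOn, ForceOff };

// The command line option and the pragma still force the behaviour; only when
// neither is set does the new target hook get to express a preference.
static bool usePredicatedVectorBody(TailFoldChoice CmdLine, TailFoldChoice Pragma,
                                    bool TargetPrefersPredication) {
  if (CmdLine != TailFoldChoice::Default)
    return CmdLine == TailFoldChoice::ForceOn;
  if (Pragma != TailFoldChoice::Default)
    return Pragma == TailFoldChoice::ForceOn;
  return TargetPrefersPredication; // queried from the target via TTI
}
```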
Differential Revision: https://reviews.llvm.org/D69040
|
| |
This patch adds two new families of intrinsics, both of which are
memory accesses taking a vector of locations to load from / store to.
The vldrq_gather_base / vstrq_scatter_base intrinsics take a vector of
base addresses, and an immediate offset to be added consistently to
each one. vldrq_gather_offset / vstrq_scatter_offset take a scalar
base address, and a vector of offsets to add to it. The
'shifted_offset' variants also multiply each offset by the element
size type, so that the vector is effectively of array indices.
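For illustration, a use of the offset form from C; the exact ACLE spellings are quoted from memory and should be treated as an assumption rather than part of this patch:
```cpp
#include <arm_mve.h>

// Gather four 32-bit elements: base is a scalar pointer, offsets is a vector
// of per-lane byte offsets added to it.
uint32x4_t gather4(const uint32_t *base, uint32x4_t offsets) {
  return vldrwq_gather_offset_u32(base, offsets);
}
```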
At the IR level, these operations are represented by a single set of
four IR intrinsics: {gather,scatter} × {base,offset}. The other
details (signed/unsigned, shift, and memory element size as opposed to
vector element size) are all specified by IR intrinsic polymorphism
and immediate operands, because that made the selection job easier
than making a huge family of similarly named intrinsics.
I considered using the standard IR representations such as
llvm.masked.gather, but they're not a good fit. In order to use
llvm.masked.gather to represent a gather_offset load with element size
smaller than a pointer, you'd have to expand the <8 x i16> vector of
offsets into an <8 x i16*> vector of pointers, which would be split up
during legalization, so you'd spend most of your time undoing the mess
it had made. Also, ISel support for llvm.masked.gather would be easy
enough in a trivial way (you can expand it into a gather-base load
with a zero immediate offset), but instruction-selecting lots of
fiddly idioms back into all the _other_ MVE load instructions would be
much more work. So I think dedicated IR intrinsics are the more
sensible approach, at least for the moment.
On the clang tablegen side, I've added two new features to the
Tablegen source accepted by MveEmitter: a 'CopyKind' type node for
defining a type that varies with the parameter type (it lets you ask
for an unsigned integer type of the same width as the parameter), and
an 'unsignedflag' value node for passing an immediate IR operand which
is 0 for a signed integer type or 1 for an unsigned one. That lets me
write each kind of intrinsic just once and get all its subtypes and
immediate arguments generated automatically.
Also I've tweaked the handling of pointer-typed values in the code
generation part of MveEmitter: they're generated as Address rather
than Value (i.e. including an alignment) so that they can be given to
the ordinary IR load and store operations, but I'd omitted the code to
convert them back to Value when they're going to be used as an
argument to an IR intrinsic.
On the MC side, I've enhanced MVEVectorVTInfo so that it can tell you
not only the full assembly-language suffix for a given vector type
(like 's32' or 'u16') but also the numeric-only one used by store
instructions (just '32' or '16').
Reviewers: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D69791
|
| |
The 'RM' flag models the "Rounding Mode" and has nothing to do with the load/store instructions.
Differential Revision: https://reviews.llvm.org/D69551
|
| |
Differential Revision: https://reviews.llvm.org/D69867
|
| |
to unsigned warning. NFCI.
Consistently return HexagonII::HCG_None.
|
| |
This was added to inhibit a warning from gcc 7.3, according to the comment.
However, it triggers a warning from PVS. In addition, I cannot reproduce the
warning with gcc 7.4, and I also cannot reproduce it with gcc 7.3 using
Compiler Explorer.
Differential Revision: https://reviews.llvm.org/D69863
|
| |
example
When writing an email for a follow-up proposal, I realized one of the diffs in the committed change was incorrect. Digging into it revealed that the fix is complicated enough to require some thought, so I am reverting in the meantime.
The problem is visible in this diff (from the revert):
; X64-SSE-LABEL: store_fp128:
; X64-SSE: # %bb.0:
-; X64-SSE-NEXT: movaps %xmm0, (%rdi)
+; X64-SSE-NEXT: subq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 32
+; X64-SSE-NEXT: movaps %xmm0, (%rsp)
+; X64-SSE-NEXT: movq (%rsp), %rsi
+; X64-SSE-NEXT: movq {{[0-9]+}}(%rsp), %rdx
+; X64-SSE-NEXT: callq __sync_lock_test_and_set_16
+; X64-SSE-NEXT: addq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 8
; X64-SSE-NEXT: retq
store atomic fp128 %v, fp128* %fptr unordered, align 16
ret void
The problem here is three fold:
1) x86-64 doesn't guarantee atomicity of anything larger than 8 bytes. Some platforms observably break this guarantee, others don't, but the codegen isn't considering this, so it's wrong on at least some platforms.
2) When I started to track down the problem, I discovered that DAGCombiner had stripped the atomicity off the store entirely. This comes down to idiomatic usage of DAG.getStore passing all MMO components separately as opposed to just passing the MMO.
3) On x86 (not -64), there are cases where 8-byte atomicity is supported, but only for floating point operations. This would seem to imply that operation typing matters for correctness, and DAGCombine happily folds away bitcasts. I'm not 100% sure there's a problem here, but I'm not entirely sure there isn't either.
I plan on returning to each issue in turn; sorry for the churn here.
|
| |
Static analyzer complains about always false condition.
See https://bugs.llvm.org/show_bug.cgi?id=43886
Differential Revision: https://reviews.llvm.org/D69860
|
| |
The backend UnsafeFPMath flag is not a superset of all the others, so
limit it to the exact bits needed.
|
| |
Summary:
G_GEP is rather poorly named. It's a simple pointer+scalar addition and
doesn't support any of the complexities of getelementptr. I therefore
propose that we rename it. There's already a G_PTR_MASK, so let's follow that
convention and go with G_PTR_ADD.
Reviewers: volkan, aditya_nandakumar, bogner, rovka, arsenm
Subscribers: sdardis, jvesely, wdng, nhaehnle, hiraditya, jrtc27, atanasyan, arphaman, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69734
|
| |
addOperand() method of AMDGPU disassembler returns SoftFail
on error. All instances that can lead to that place are impossible
encodings, not something that is possible to encode but semantically
incorrect, which is what SoftFail is meant to describe.
Then tablegen generates a check of the following form:
if (Decode...(..) == MCDisassembler::Fail) { return MCDisassembler::Fail; }
Since we can only return Success and SoftFail, that check is dead
code, as detected by the static code analyzer.
Solution: return Fail as it should be.
See https://bugs.llvm.org/show_bug.cgi?id=43886
Differential Revision: https://reviews.llvm.org/D69819
|
| |
This feature controls whether AA is used in the backend, and was
previously turned on for certain subtargets to help create less
constrained scheduling graphs. This patch turns it on for all
subtargets, so that they can all make use of the extra information to
produce better code.
Differential Revision: https://reviews.llvm.org/D69796
|
| |
In the ARM backend, for historical reasons only some targets use
Machine Scheduling. The rest use the old list scheduler, as they
use itineraries and the list scheduler seems to produce better code
(and not crash by running out of registers on v6-M code). So whether to use
the MIScheduler or not is checked at runtime from the subtarget
features.
This is fine, except for post-ra scheduling. Whether to use the old
post-ra list scheduler or the post-ra machine scheduler is decided as the
pass manager is set up, in ARM's case from a newly constructed subtarget.
Under some situations, like LTO, this won't include the correct CPU, so
it can pick the wrong option. This can have a surprising effect on
performance.
To fix that, this patch overrides targetSchedulesPostRAScheduling and
addPreSched2 in the ARM backend, adding _both_ post-ra schedulers and
picking at runtime which to execute. To pick between the two I've had to
add an enablePostRAMachineScheduler() method that normally returns
enableMachineScheduler() && enablePostRAScheduler(), and which can be
overridden to enable just one of PostRAMachineScheduler vs
PostRAScheduler.
Thanks to David Penry for identifying this problem.
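A sketch of the new hook's default behaviour as described above (simplified; the real method lives on the subtarget alongside the existing hooks):
```cpp
struct SchedulingHooksSketch {
  virtual ~SchedulingHooksSketch() = default;
  virtual bool enableMachineScheduler() const { return false; }
  virtual bool enablePostRAScheduler() const { return false; }
  // New: lets a target run exactly one of the two post-ra schedulers; by
  // default it follows the two existing hooks.
  virtual bool enablePostRAMachineScheduler() const {
    return enableMachineScheduler() && enablePostRAScheduler();
  }
};
```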
Differential Revision: https://reviews.llvm.org/D69775
|
| |
Summary: Introduces the `InstrInfo::areMemAccessesTriviallyDisjoint`
hook. The test could check for instruction reorderings, but to avoid
being brittle it just checks instruction dependencies.
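A hypothetical illustration of the kind of fact such a hook can prove (not the actual RISC-V implementation): two fixed-width accesses at constant offsets from the same base cannot alias when their ranges do not overlap.
```cpp
#include <cstdint>

// Returns true when [OffA, OffA+SizeA) and [OffB, OffB+SizeB) are disjoint,
// which lets the scheduler reorder the two memory accesses.
static bool accessesTriviallyDisjoint(int64_t OffA, uint64_t SizeA,
                                      int64_t OffB, uint64_t SizeB) {
  return OffA + static_cast<int64_t>(SizeA) <= OffB ||
         OffB + static_cast<int64_t>(SizeB) <= OffA;
}
```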
Reviewers: asb, lenary
Reviewed By: lenary
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67046
|
| |
2*log2(bitwidth)+1 for legal types.
This better represents the kshift+binop we'd get for each stage
before the final extract. It's likely we'll do even better by
doing a kmov and a cmp with a GPR, but this is a good start.
The default handling was costing a worst case single source
permute shuffle of the vector before the binop. This worst
case assumes the shuffle might have to be emulated with
extracts and inserts. But since we know we're doing a reduction
we can assume we'll get kshift lowering.
There's still some room for improvement here, but this is
much better than it was.
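A worked sketch of the formula (Log2_32 is used purely for illustration): for a legal mask type of bit width 16 this gives 2*4 + 1 = 9.
```cpp
#include "llvm/Support/MathExtras.h"

// Each of the log2(BitWidth) reduction stages is one kshift plus one binop,
// and the final extract of the scalar result adds one more.
static unsigned reductionStageCost(unsigned BitWidth) {
  return 2 * llvm::Log2_32(BitWidth) + 1; // BitWidth = 16 -> 9
}
```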
|