| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
| |
Enable full support for the debug info. Recommit to fix the emission of
the not required closing brace.
Differential revision: https://reviews.llvm.org/D46189
llvm-svn: 351972
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit r351846.
This patch may generate illegal assembly code, see
```
$ ./bin/clang -cc1 -triple nvptx64-nvidia-cuda -aux-triple x86_64-grtev4-linux-gnu -S -disable-free -disable-llvm-verifier -discard-value-names -main-file-name new.cc -mrelocation-model pic -pic-level 2 -mthread-model posix -fmerge-all-constants -mdisable-fp-elim -relaxed-aliasing -no-integrated-as -mpie-copy-relocations -munwind-tables -fcuda-is-device -target-feature +ptx60 -target-cpu sm_35 -dwarf-column-info -debug-info-kind=line-directives-only -dwarf-version=2 -debugger-tuning=gdb -o empty.s -x cuda empty.cc
$ cat empty.s
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_35
.address_size 64
}
```
llvm-svn: 351966
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
delays from the integer to the floating point unit.
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means, this patch is only a functional change
for the Jaguar cpu model only.
Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.
When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is speified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
Differential Revision: https://reviews.llvm.org/D57056
llvm-svn: 351965
|
|
|
|
| |
llvm-svn: 351958
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch replaces the existing LLVMVectorSameWidth matcher with LLVMScalarOrSameVectorWidth.
The matching args must be either scalars or vectors with the same number of elements, but in either case the scalar/element type can differ, specified by LLVMScalarOrSameVectorWidth.
I've updated the _overflow intrinsics to demonstrate this - allowing it to return a i1 or <N x i1> overflow result, matching the scalar/vectorwidth of the other (add/sub/mul) result type.
The masked load/store/gather/scatter intrinsics have also been updated to use this, although as we specify the reference type to be llvm_anyvector_ty we guarantee the mask will be <N x i1> so no change in behaviour
Differential Revision: https://reviews.llvm.org/D57090
llvm-svn: 351957
|
|
|
|
| |
llvm-svn: 351956
|
|
|
|
|
|
|
|
|
| |
CFIInst is not zero-terminated. This is one of more annoying functional
differences between StringRef and ArrayRef.
Found by asan.
llvm-svn: 351955
|
|
|
|
|
|
| |
They were in the floating point group.
llvm-svn: 351953
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
With XNACK, an smem load whose result is coalesced with an operand (thus
it overwrites its own operand) cannot appear in a clause, because some
other instruction might XNACK and restart the whole clause.
The clause breaker already realized that an smem that overwrites an
operand cannot appear in a clause, and broke the clause. The problem
that this commit fixes is that the SIFormMemoryClauses optimization
formed a bundle with early clobber, which caused the earlier code that
set up the coalesced operand to be removed as dead.
Differential Revision: https://reviews.llvm.org/D57008
Change-Id: I703c4d5b0bf7d6060222bec491f45c18bb3c0016
llvm-svn: 351950
|
|
|
|
| |
llvm-svn: 351945
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently in Arm code, we allocate LR first, under the assumption that
it needs to be saved anyway. Unfortunately this has the disadvantage
that it will require any instructions using it to be the longer thumb2
instructions, not the shorter thumb1 ones.
This switches the order when we are optimising for minsize, returning to
the default order so that more lower registers can be used. It can end
up requiring more pushed registers, but on average produces smaller code.
Differential Revision: https://reviews.llvm.org/D56008
llvm-svn: 351938
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the last stage of type promotion, we replace any zext that uses a
new trunc with the operand of the trunc. This is okay when we only
allowed one type to be optimised, but now its the case that the trunc
maybe needed to produce a more narrow type than the one we were
optimising for. So we need to check this before doing the replacement.
Differential Revision: https://reviews.llvm.org/D57041
llvm-svn: 351935
|
|
|
|
|
|
|
|
|
|
|
| |
The current check in CombineToPreIndexedLoadStore is too
conversative, preventing a pre-indexed store when the base pointer
is a predecessor of the value being stored. Instead, we should check
the pointer operand of the store.
Differential Revision: https://reviews.llvm.org/D56719
llvm-svn: 351933
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As part of speculation hardening, the stack pointer gets masked with the
taint register (X16) before a function call or before a function return.
Since there are no instructions that can directly mask writing to the
stack pointer, the stack pointer must first be transferred to another
register, where it can be masked, before that value is transferred back
to the stack pointer.
Before, that temporary register was always picked to be x17, since the
ABI allows clobbering x17 on any function call, resulting in the
following instruction pattern being inserted before function calls and
returns/tail calls:
mov x17, sp
and x17, x17, x16
mov sp, x17
However, x17 can be live in those locations, for example when the call
is an indirect call, using x17 as the target address (blr x17).
To fix this, this patch looks for an available register just before the
call or terminator instruction and uses that.
In the rare case when no register turns out to be available (this
situation is only encountered twice across the whole test-suite), just
insert a full speculation barrier at the start of the basic block where
this occurs.
Differential Revision: https://reviews.llvm.org/D56717
llvm-svn: 351930
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Two backend optimizations failed to handle cases when compiled with -g, due
to failing to consider DBG_VALUE instructions. This was in
SystemZTargetLowering::emitSelect() and
SystemZElimCompare::getRegReferences().
This patch makes sure that DBG_VALUEs are recognized so that they do not
affect these optimizations.
Tests for branch-on-count, load-and-trap and consecutive selects.
Review: Ulrich Weigand
https://reviews.llvm.org/D57048
llvm-svn: 351928
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch relaxes restrictions on types of latch condition and range check.
In current implementation, they should match. This patch allows to handle
wide range checks against narrow condition. The motivating example is the
following:
int N = ...
for (long i = 0; (int) i < N; i++) {
if (i >= length) deopt;
}
In this patch, the option that enables this support is turned off by
default. We'll wait until it is switched to true.
Differential Revision: https://reviews.llvm.org/D56837
Reviewed By: reames
llvm-svn: 351926
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
#pragma clang loop pipeline(disable)
Disable SWP optimization for the next loop.
“disable” is the only possible value.
#pragma clang loop pipeline_initiation_interval(number)
Set value of initiation interval for SWP
optimization to specified number value for
the next loop. Number is the positive value
greater than 0.
These pragmas could be used for debugging or reducing
compile time purposes. It is possible to disable SWP for
concrete loops to save compilation time or to find bugs
by not doing SWP to certain loops. It is possible to set
value of initiation interval to concrete number to save
compilation time by not doing extra pipeliner passes or
to check created schedule for specific initiation interval.
That is llvm part of the fix
Clang part of fix: https://reviews.llvm.org/D55710
Patch by Alexey Lapshin!
Differential Revision: https://reviews.llvm.org/D56403
llvm-svn: 351923
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Each hwasan check requires emitting a small piece of code like this:
https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html#memory-accesses
The problem with this is that these code blocks typically bloat code
size significantly.
An obvious solution is to outline these blocks of code. In fact, this
has already been implemented under the -hwasan-instrument-with-calls
flag. However, as currently implemented this has a number of problems:
- The functions use the same calling convention as regular C functions.
This means that the backend must spill all temporary registers as
required by the platform's C calling convention, even though the
check only needs two registers on the hot path.
- The functions take the address to be checked in a fixed register,
which increases register pressure.
Both of these factors can diminish the code size effect and increase
the performance hit of -hwasan-instrument-with-calls.
The solution that this patch implements is to involve the aarch64
backend in outlining the checks. An intrinsic and pseudo-instruction
are created to represent a hwasan check. The pseudo-instruction
is register allocated like any other instruction, and we allow the
register allocator to select almost any register for the address to
check. A particular combination of (register selection, type of check)
triggers the creation in the backend of a function to handle the check
for specifically that pair. The resulting functions are deduplicated by
the linker. The pseudo-instruction (really the function) is specified
to preserve all registers except for the registers that the AAPCS
specifies may be clobbered by a call.
To measure the code size and performance effect of this change, I
took a number of measurements using Chromium for Android on aarch64,
comparing a browser with inlined checks (the baseline) against a
browser with outlined checks.
Code size: Size of .text decreases from 243897420 to 171619972 bytes,
or a 30% decrease.
Performance: Using Chromium's blink_perf.layout microbenchmarks I
measured a median performance regression of 6.24%.
The fact that a perf/size tradeoff is evident here suggests that
we might want to make the new behaviour conditional on -Os/-Oz.
But for now I've enabled it unconditionally, my reasoning being that
hwasan users typically expect a relatively large perf hit, and ~6%
isn't really adding much. We may want to revisit this decision in
the future, though.
I also tried experimenting with varying the number of registers
selectable by the hwasan check pseudo-instruction (which would result
in fewer variants being created), on the hypothesis that creating
fewer variants of the function would expose another perf/size tradeoff
by reducing icache pressure from the check functions at the cost of
register pressure. Although I did observe a code size increase with
fewer registers, I did not observe a strong correlation between the
number of registers and the performance of the resulting browser on the
microbenchmarks, so I conclude that we might as well use ~all registers
to get the maximum code size improvement. My results are below:
Regs | .text size | Perf hit
-----+------------+---------
~all | 171619972 | 6.24%
16 | 171765192 | 7.03%
8 | 172917788 | 5.82%
4 | 177054016 | 6.89%
Differential Revision: https://reviews.llvm.org/D56954
llvm-svn: 351920
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
size.
Previously, MemoryBlock automatically extends a requested buffer size to a
multiple of page size because (I believe) doing it was thought to be harmless
and with that you could get more memory (on average 2KiB on 4KiB-page systems)
"for free".
That programming interface turned out to be error-prone. If you request N
bytes, you usually expect that a resulting object returns N for `size()`.
That's not the case for MemoryBlock.
Looks like there is only one place where we take the advantage of
allocating more memory than the requested size. So, with this patch, I
simply removed the automatic size expansion feature from MemoryBlock
and do it on the caller side when needed. MemoryBlock now always
returns a buffer whose size is equal to the requested size.
Differential Revision: https://reviews.llvm.org/D56941
llvm-svn: 351916
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
`CodeViewDebug::lowerTypeMemberFunction` used to default to a `Void`
return type if the function's type array was empty. After D54667, it
started blindly indexing the 0th item for the return type, which fails
in `getOperand` for empty arrays if assertions are enabled.
This patch restores the `Void` return type for empty type arrays, and
adds a test generated by Rust in line-only debuginfo mode.
Reviewers: zturner, rnk
Reviewed By: rnk
Subscribers: hiraditya, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D57070
llvm-svn: 351910
|
|
|
|
| |
llvm-svn: 351895
|
|
|
|
|
|
|
|
|
|
| |
The splitting pass does not need BFI unless the Module actually has a profile
summary. Do not calcualte BFI unless the summary is present.
For the sqlite3 amalgamation, this reduces time spent in the splitting pass
from 0.4% of the total to under 0.1%.
llvm-svn: 351894
|
|
|
|
|
|
|
|
|
|
| |
The splitting pass does not need (post)domtrees until after it's found a
cold block. Defer domtree calculation until a cold block is found.
For the sqlite3 amalgamation, this reduces time spent in the splitting
pass from 0.8% of the total to 0.4%.
llvm-svn: 351892
|
|
|
|
|
|
|
|
|
|
| |
PromoteFloatResult.
Also add debug prints in the default case of the switches in these routines.
Most if not all of the type legalization handlers already do this so this makes promoting floats consistent
llvm-svn: 351890
|
|
|
|
|
|
|
|
| |
It might be a bit nicer to use the fancy .legalIf and co. predicates,
but this was requiring more boilerplate and disables the coverage
assertions.
llvm-svn: 351886
|
|
|
|
| |
llvm-svn: 351884
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the underlying filesystem does not support mmap system call,
FileOutputBuffer may fail when it attempts to mmap an output temporary
file. This patch handles such situation.
Unfortunately, it looks like it is very hard to test this functionality
without a filesystem that doesn't support mmap using llvm-lit. I tested
this locally by passing an invalid parameter to mmap so that it fails and
falls back to the in-memory buffer. Maybe that's all what we can do.
I believe it is reasonable to submit this without a test.
Differential Revision: https://reviews.llvm.org/D56949
llvm-svn: 351883
|
|
|
|
|
|
|
|
|
| |
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.
X86 uses i8, but seemed to be hacking around this before.
llvm-svn: 351882
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The old diagnostic form of the trace produced by -v and -vv looks
like:
```
check1:1:8: remark: CHECK: expected string found in input
CHECK: abc
^
<stdin>:1:3: note: found here
; abc def
^~~
```
When dumping annotated input is requested (via -dump-input), I find
that this old trace is not useful and is sometimes harmful:
1. The old trace is mostly redundant because the same basic
information also appears in the input dump's annotations.
2. The old trace buries any error diagnostic between it and the input
dump, but I find it useful to see any error diagnostic up front.
3. FILECHECK_OPTS=-dump-input=fail requests annotated input dumps only
for failed FileCheck calls. However, I have to also add -v or -vv
to get a full set of annotations, and that can produce massive
output from all FileCheck calls in all tests. That's a real
problem when I run this in the IDE I use, which grinds to a halt as
it tries to capture all that output.
When -dump-input=fail|always, this patch suppresses the old trace from
-v or -vv. Error diagnostics still print as usual. If you want the
old trace, perhaps to see variable expansions, you can set
-dump-input=none (the default).
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D55825
llvm-svn: 351881
|
|
|
|
|
|
|
| |
Produce a splat build_vector similar to how
SelectionDAG::getConstant does.
llvm-svn: 351880
|
|
|
|
| |
llvm-svn: 351871
|
|
|
|
| |
llvm-svn: 351866
|
|
|
|
|
|
| |
Differential Revision: https://reviews.llvm.org/D57035
llvm-svn: 351861
|
|
|
|
| |
llvm-svn: 351859
|
|
|
|
| |
llvm-svn: 351856
|
|
|
|
|
|
| |
those of C_RegisterClass. NFCI.
llvm-svn: 351854
|
|
|
|
| |
llvm-svn: 351853
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I was honestly a bit surprised that we didn't do this before. This
patch is to handle "-" as the stdout so that if you pass `-o -` to
lld, for example, it writes an output to stdout instead of file `-`.
I thought that we might want to handle this at a higher level than
FileOutputBuffer, because if we land this patch, we can no longer
create a file whose name is `-` (there's a workaround though; you can
pass `./-` instead of `-`). However, because raw_fd_ostream already
handles `-` as a special file name, I think it's okay and actually
consistent to handle `-` as a special name in FileOutputBuffer.
Differential Revision: https://reviews.llvm.org/D56940
llvm-svn: 351852
|
|
|
|
| |
llvm-svn: 351851
|
|
|
|
|
|
|
| |
This reapplies commits r351778 and r351782 with
RISCV test fixes.
llvm-svn: 351850
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: Enable full support for the debug info.
Reviewers: echristo
Subscribers: jholewinski, aprantl, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D46189
llvm-svn: 351846
|
|
|
|
|
|
| |
This is still causing compilation crashes in some targets. Will follow up shortly with a repro.
llvm-svn: 351845
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: Initial function labels must follow the debug location for the correct relocation info generation.
Reviewers: tra, jlebar, echristo
Subscribers: jholewinski, llvm-commits
Differential Revision: https://reviews.llvm.org/D45784
llvm-svn: 351843
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
vecbo (insertsubv undef, X, Z), (insertsubv undef, Y, Z) --> insertsubv VecC, (vecbo X, Y), Z
This is another step in generic vector narrowing. It's also a step towards more horizontal op
formation specifically for x86 (although we still failed to match those in the affected tests).
The scalarization cases are also not optimal (we should be scalarizing those), but it's still
an improvement to use a narrower vector op when we know part of the result must be constant
because both inputs are undef in some vector lanes.
I think a similar match but checking for a constant operand might help some of the cases in
D51553.
Differential Revision: https://reviews.llvm.org/D56875
llvm-svn: 351825
|
|
|
|
|
|
|
|
|
|
| |
Previously we had names like 'Call' or 'Tail'. This potentially clashes with
the naming scheme used elsewhere in RISCVInstrInfo.td. Many other backends
would use names like AArch64call or PPCtail. I prefer the SystemZ approach,
which uses prefixed all-lowercase names. This matches the naming scheme used
for target-independent SelectionDAG nodes.
llvm-svn: 351823
|
|
|
|
|
|
|
|
|
|
| |
For constant bit select patterns, replace one AND with a ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations.
This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.
Differential Revision: https://reviews.llvm.org/D55935
llvm-svn: 351819
|
|
|
|
|
|
|
|
| |
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57026
llvm-svn: 351817
|
|
|
|
| |
llvm-svn: 351816
|
|
|
|
|
|
|
|
| |
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57022
llvm-svn: 351815
|
|
|
|
|
|
|
|
| |
First step towards PR40376, this patch adds support for getCmpSelInstrCost to use the (optional) Instruction CmpInst predicate to indicate the type of integer comparison we're performing and alter the costs accordingly.
Differential Revision: https://reviews.llvm.org/D57013
llvm-svn: 351810
|