| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
| |
integer.
We're trying to have the vXi1 types in IR as much as possible. This prevents the need for bitcasts when the producer of the mask was already a vXi1 value like an icmp. The bitcasts can be subject to code motion and interfere with basic block at a time isel in bad ways.
llvm-svn: 351275
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
We have seen performance regression when v_add3 is generated. The major reason is that the v_mad pattern
is broken when v_add3 is generated. We also see the register pressure increased. While we could not properly
estimate register pressure during instruction selection, we can give mad a higher priority.
In this work, we raise the priority for mad24 in selection and resolve the performance regression.
Reviewers:
rampitec
Differential Revision:
https://reviews.llvm.org/D56745
llvm-svn: 351273
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Previously in D54095 i have added support for extraction of `lshr` from `X` if we are to produce `BEXTR`.
That was good, but the fix was partial, there was still [[ https://bugs.llvm.org/show_bug.cgi?id=36419 | PR36419 ]].
That pattern can also appear, roughly, when you have a large (64-bit) storage, and the consume bits from it.
It will not be unexpected if you will be doing further computations in 32-bit width.
And then the current code breaks, as the tests show.
The basic idea/pattern here is following:
1. We have `i64` input
2. We perform `i64` right-shift on it.
3. We `trunc`ate that shifted value
4. We do all further work (masking) in `i32`
Since we see `trunc`ation and not `lshr`, we give up, and stop trying to extract that right-shift.
BUT. The mask is `i32`, therefore we can extend both of the operands of the masking (`and`) to `i64`
and truncate the result after masking: https://rise4fun.com/Alive/K4B
```
Name: @bextr64_32_b1 -> @bextr64_32_b0
%shiftedval = lshr i64 %val, %numskipbits
%truncshiftedval = trunc i64 %shiftedval to i32
%widenumlowbits1 = zext i8 %numlowbits to i32
%notmask1 = shl nsw i32 -1, %widenumlowbits1
%mask1 = xor i32 %notmask1, -1
%res = and i32 %truncshiftedval, %mask1
=>
%shiftedval = lshr i64 %val, %numskipbits
%widenumlowbits = zext i8 %numlowbits to i64
%notmask = shl nsw i64 -1, %widenumlowbits
%mask = xor i64 %notmask, -1
%wideres = and i64 %shiftedval, %mask
%res = trunc i64 %wideres to i32
```
Thus, we are again able to extract that `lshr` into `BEXTR`'s control.
Now, the perf (via `llvm-exegesis`) of the snippet suggests that it is not a good idea:
```
$ cat /tmp/old.s
# bextr64_32_b1
# LLVM-EXEGESIS-LIVEIN RSI
# LLVM-EXEGESIS-LIVEIN EDX
# LLVM-EXEGESIS-LIVEIN RDI
movq %rsi, %rcx
shrq %cl, %rdi
shll $8, %edx
bextrl %edx, %edi, %eax
$ cat /tmp/old.s | ./bin/llvm-exegesis -mode=latency -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-1e0082.o
---
mode: latency
key:
instructions:
- 'MOV64rr RCX RSI'
- 'SHR64rCL RDI RDI'
- 'SHL32ri EDX EDX i_0x8'
- 'BEXTR32rr EAX EDI EDX'
config: ''
register_initial_values: []
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: latency, value: 0.6638, per_snippet_value: 2.6552 }
error: ''
info: ''
assembled_snippet: 4889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C7C3
...
$ cat /tmp/old.s | ./bin/llvm-exegesis -mode=uops -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-43e346.o
---
mode: uops
key:
instructions:
- 'MOV64rr RCX RSI'
- 'SHR64rCL RDI RDI'
- 'SHL32ri EDX EDX i_0x8'
- 'BEXTR32rr EAX EDI EDX'
config: ''
register_initial_values: []
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: PdFPU0, value: 0, per_snippet_value: 0 }
- { key: PdFPU1, value: 0, per_snippet_value: 0 }
- { key: PdFPU2, value: 0, per_snippet_value: 0 }
- { key: PdFPU3, value: 0, per_snippet_value: 0 }
- { key: NumMicroOps, value: 1.2571, per_snippet_value: 5.0284 }
error: ''
info: ''
assembled_snippet: 4889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C74889F148D3EFC1E208C4E268F7C7C3
...
```
vs
```
$ cat /tmp/new.s
# bextr64_32_b1
# LLVM-EXEGESIS-LIVEIN RDX
# LLVM-EXEGESIS-LIVEIN SIL
# LLVM-EXEGESIS-LIVEIN RDI
shlq $8, %rdx
movzbl %sil, %eax
orq %rdx, %rax
bextrq %rax, %rdi, %rax
$ cat /tmp/new.s | ./bin/llvm-exegesis -mode=latency -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-8944f1.o
---
mode: latency
key:
instructions:
- 'SHL64ri RDX RDX i_0x8'
- 'MOVZX32rr8 EAX SIL'
- 'OR64rr RAX RAX RDX'
- 'BEXTR64rr RAX RDI RAX'
config: ''
register_initial_values: []
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: latency, value: 0.7454, per_snippet_value: 2.9816 }
error: ''
info: ''
assembled_snippet: 48C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C7C3
...
$ cat /tmp/new.s | ./bin/llvm-exegesis -mode=uops -snippets-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-da403c.o
---
mode: uops
key:
instructions:
- 'SHL64ri RDX RDX i_0x8'
- 'MOVZX32rr8 EAX SIL'
- 'OR64rr RAX RAX RDX'
- 'BEXTR64rr RAX RDI RAX'
config: ''
register_initial_values: []
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: PdFPU0, value: 0, per_snippet_value: 0 }
- { key: PdFPU1, value: 0, per_snippet_value: 0 }
- { key: PdFPU2, value: 0, per_snippet_value: 0 }
- { key: PdFPU3, value: 0, per_snippet_value: 0 }
- { key: NumMicroOps, value: 1.2571, per_snippet_value: 5.0284 }
error: ''
info: ''
assembled_snippet: 48C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C748C1E208400FB6C64809D0C4E2F8F7C7C3
...
```
^ latency increased (worse).
Except //maybe// not really.
Like with all synthetic benchmarks, they //may// be misleading.
Let's take a look on some actual real-world hotpath.
In this case it's 'my' [[ https://github.com/darktable-org/rawspeed | RawSpeed ]]'s `BitStream<>::peekBitsNoFill()`, in [[ https://github.com/darktable-org/rawspeed/blob/e3316dc85127c2c29baa40f998f198a7b278bf36/src/librawspeed/decompressors/VC5Decompressor.cpp#L814 | GoPro VC5 decompressor ]]:
```
raw.pixls.us-unique/GoPro/HERO6 Black$ /usr/src/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-clangs1-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR
RUNNING: /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmplwbKEM
2018-12-22 21:23:03
Running /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench
Run on (8 X 4012.81 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 3.41, 2.41, 2.03
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations CPUTime,s CPUTime/WallTime Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_mean 40 ms 40 ms 128 0.322244 7.96974 12M 37.4457M 298.534M 3.12047 24.8778 0.040465
GOPR9172.GPR/threads:8/real_time_median 39 ms 39 ms 128 0.312606 7.99155 12M 38.387M 306.788M 3.19891 25.5656 0.039115
GOPR9172.GPR/threads:8/real_time_stddev 4 ms 3 ms 128 0.0271557 0.130575 0 2.4941M 21.3909M 0.207842 1.78257 3.81081m
RUNNING: /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 GOPR9172.GPR --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpWAkan9
2018-12-22 21:23:08
Running /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench
Run on (8 X 4013.1 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 3.78, 2.50, 2.06
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations CPUTime,s CPUTime/WallTime Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_mean 39 ms 39 ms 128 0.311533 7.97323 12M 38.6828M 308.471M 3.22356 25.706 0.0390928
GOPR9172.GPR/threads:8/real_time_median 38 ms 38 ms 128 0.304231 7.99005 12M 39.4437M 315.527M 3.28698 26.294 0.0380316
GOPR9172.GPR/threads:8/real_time_stddev 3 ms 3 ms 128 0.0229149 0.133814 0 2.26225M 19.1421M 0.188521 1.59517 3.13671m
Comparing /home/lebedevri/rawspeed/build-clangs1-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-clangs1-new/src/utilities/rsbench/rsbench
Benchmark Time CPU Time Old Time New CPU Old CPU New
--------------------------------------------------------------------------------------------------------------------------------------
GOPR9172.GPR/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 128 vs 128
GOPR9172.GPR/threads:8/real_time_mean -0.0339 -0.0316 40 39 40 39
GOPR9172.GPR/threads:8/real_time_median -0.0277 -0.0274 39 38 39 38
GOPR9172.GPR/threads:8/real_time_stddev -0.1769 -0.1267 4 3 3 3
```
I.e. this results in //roughly// -3% improvements in perf.
While this will help [[ https://bugs.llvm.org/show_bug.cgi?id=36419 | PR36419 ]], it won't address it fully.
Reviewers: RKSimon, craig.topper, andreadb, spatel
Reviewed By: craig.topper
Subscribers: courbet, llvm-commits
Differential Revision: https://reviews.llvm.org/D56052
llvm-svn: 351253
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
vXi1 vector instead of a scalar
In keeping with our general direction of having the vXi1 type present in IR, this patch converts the mask argument for avx512 gather to vXi1. This can avoid k-register to GPR to k-register transitions late in codegen.
I left the existing intrinsics behind because they have many out of tree users such as ISPC. They generate their own code and don't go through the autoupgrade path which only works for bitcode and ll parsing. Ideally we will get them to migrate to target independent intrinsics, but it might be easier for them to migrate to these new intrinsics.
I'll work on scatter and gatherpf/scatterpf next.
Differential Revision: https://reviews.llvm.org/D56527
llvm-svn: 351234
|
| |
|
|
|
|
|
| |
msp430-as supports multiple assembly statements on the same line
separated by a '{' character.
llvm-svn: 351233
|
| |
|
|
|
|
|
|
| |
As mentioned here http://lists.llvm.org/pipermail/llvm-dev/2019-January/129121.html This backend is incomplete and has not been maintained in several months.
Differential Revision: https://reviews.llvm.org/D56691
llvm-svn: 351231
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Related to https://bugs.llvm.org/show_bug.cgi?id=40123.
Rather than scalarizing, expand a vector USUBSAT into UMAX+SUB,
which produces much better code for X86.
Reapplying with updated SLPVectorizer tests.
Differential Revision: https://reviews.llvm.org/D56636
llvm-svn: 351219
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
As described in PR40209, there can be issues in DBG_VALUEs handling when multiple defs present in a BB. This patch
adds logic for detection of related to def DBG_VALUEs and localizes register update and movement to found DBG_VALUEs.
Reviewers: aheejin
Subscribers: mgorny, dschuff, sbc100, jgravelle-google, sunfish, llvm-commits
Differential Revision: https://reviews.llvm.org/D56401
llvm-svn: 351216
|
| |
|
|
|
|
|
|
| |
Modify getRegForInlineAsmConstraint to return special singleton
register class when a constraint references ST(7) not RFP80 for which
ST(7) is not a member.
llvm-svn: 351206
|
| |
|
|
|
|
|
|
|
|
| |
(PR40306)
If we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.
We were already doing this for SSSE3 targets as we have PSHUFB, but its worth doing for all targets.
llvm-svn: 351203
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
V8 currently implements SIMD shifts as taking an immediate operation,
which disagrees with the spec proposal and the toolchain
implementation. As a stopgap measure to get things working, unroll all
vector shifts. Since this is a temporary measure, there are no tests.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, sunfish, dmgreen, llvm-commits
Differential Revision: https://reviews.llvm.org/D56520
llvm-svn: 351151
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This allows moving the condition from the intrinsic to the standard ICmp
opcode, so that LLVM can do simplifications on it. The icmp.i1 intrinsic
is an identity for retrieving the SGPR mask.
And we can also get the mask from and i1, or i1, xor i1.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D52060
llvm-svn: 351150
|
| |
|
|
|
|
| |
Enable the fusion of arithmetic and logic instructions for Exynos M4.
llvm-svn: 351149
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
In r345197 ESP and RSP were added to GR32_TC/GR64_TC, allowing them to
be used for tail calls, but this also caused `findDeadCallerSavedReg` to
think they were acceptable targets for clobbering. Filter them out.
Fixes PR40289.
Patch by Geoffry Song!
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D56617
llvm-svn: 351146
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise, with D56544, the intrinsic will be expanded to an integer
csel, which is probably not what the user expected. This matches the
general convention of using "v1" types to represent scalar integer
operations in vector registers.
While I'm here, also add some error checking so we don't generate
illegal ABS nodes.
Differential Revision: https://reviews.llvm.org/D56616
llvm-svn: 351141
|
| |
|
|
|
|
|
|
|
| |
This feature enables the fusion of some arithmetic and logic instructions
together.
Differential revision: https://reviews.llvm.org/D56572
llvm-svn: 351139
|
| |
|
|
| |
llvm-svn: 351136
|
| |
|
|
|
|
|
|
|
| |
This reverts commit r351125.
I missed test changes in an SLPVectorizer test, due to the cost model
changes. Reverting for now.
llvm-svn: 351129
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Fixes https://bugs.llvm.org/show_bug.cgi?id=40172. See
test/CodeGen/WebAssembly/PR40172.ll for an explanation.
Reviewers: dschuff, aheejin
Subscribers: nikic, llvm-commits, sunfish, jgravelle-google, sbc100
Differential Revision: https://reviews.llvm.org/D56457
llvm-svn: 351127
|
| |
|
|
|
|
|
|
|
|
|
| |
Related to https://bugs.llvm.org/show_bug.cgi?id=40123.
Rather than scalarizing, expand a vector USUBSAT into UMAX+SUB,
which produces much better code for X86.
Differential Revision: https://reviews.llvm.org/D56636
llvm-svn: 351125
|
| |
|
|
| |
llvm-svn: 351111
|
| |
|
|
|
|
|
|
| |
shuffle-with-zero (PR40306)
If we have PSHUFB and we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.
llvm-svn: 351103
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
add (extractelt (X, 0), extractelt (X, 1)) --> extractelt (hadd X, X), 0
This is the integer sibling to D56011.
There's an additional restriction to only to do this transform in the
case where we don't have extra extracts from the source vector. Without
that, we can fail to match larger horizontal patterns that are more
beneficial than this minimal case. An improvement to the more general
h-op lowering may allow us to remove the restriction here in a follow-up.
llvm-svn: 351093
|
| |
|
|
|
|
|
|
|
| |
This removes the old grow_memory and mem.grow-style intrinsics, leaving just
the memory.grow-style intrinsics.
Differential Revision: https://reviews.llvm.org/D56645
llvm-svn: 351084
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
With this patch, shifts are lowered to optimal number of instructions
necessary to shift types larger than the general purpose register size.
This resolves PR/32293.
Thanks to Kyle Butt for reporting the issue!
Differential Revision: https://reviews.llvm.org/D56320
llvm-svn: 351059
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make it possible for TableGen to produce code for selecting MOVi32imm.
This allows reasonably recent ARM targets to select a lot more constants
than before.
We achieve this by adding GISelPredicateCode to arm_i32imm. It's
impossible to use the exact same code for both DAGISel and GlobalISel,
since one uses "Subtarget->" and the other "STI." to refer to the
subtarget. Moreover, in GlobalISel we don't have ready access to the
MachineFunction, so we need to add a bit of code for obtaining it from
the instruction that we're selecting. This is also the reason why it
needs to remain a PatLeaf instead of the more specific IntImmLeaf.
llvm-svn: 351056
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Part of the effort to refactoring frame pointer code generation. We used
to use two function attributes "no-frame-pointer-elim" and
"no-frame-pointer-elim-non-leaf" to represent three kinds of frame
pointer usage: (all) frames use frame pointer, (non-leaf) frames use
frame pointer, (none) frame use frame pointer. This CL makes the idea
explicit by using only one enum function attribute "frame-pointer"
Option "-frame-pointer=" replaces "-disable-fp-elim" for tools such as
llc.
"no-frame-pointer-elim" and "no-frame-pointer-elim-non-leaf" are still
supported for easy migration to "frame-pointer".
tests are mostly updated with
// replace command line args ‘-disable-fp-elim=false’ with ‘-frame-pointer=none’
grep -iIrnl '\-disable-fp-elim=false' * | xargs sed -i '' -e "s/-disable-fp-elim=false/-frame-pointer=none/g"
// replace command line args ‘-disable-fp-elim’ with ‘-frame-pointer=all’
grep -iIrnl '\-disable-fp-elim' * | xargs sed -i '' -e "s/-disable-fp-elim/-frame-pointer=all/g"
Patch by Yuanfang Chen (tabloid.adroit)!
Differential Revision: https://reviews.llvm.org/D56351
llvm-svn: 351049
|
| |
|
|
|
|
|
|
|
|
| |
Introduce GlobalISel pre legalizer pass for MIPS.
It will be used to cope with instructions that require
combining before legalization.
Differential Revision: https://reviews.llvm.org/D56269
llvm-svn: 351046
|
| |
|
|
|
|
|
|
| |
in IR instead.
Fixes PR40259
llvm-svn: 351035
|
| |
|
|
|
|
|
|
| |
not just any int.
Removes some type checks from X86GenDAGISel.inc
llvm-svn: 351033
|
| |
|
|
| |
llvm-svn: 351032
|
| |
|
|
| |
llvm-svn: 351031
|
| |
|
|
|
|
|
|
|
|
| |
vXi1 vector.
The input mask can be represented with an AND in IR.
Fixes PR40258
llvm-svn: 351028
|
| |
|
|
|
|
|
|
|
|
| |
VCVT(T)PD2DQZ128/VCVT(T)PD2UDQZ128 which only produce 2 result elements and zeroes the upper elements.
We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877
llvm-svn: 351018
|
| |
|
|
|
|
|
|
|
|
| |
only produces 2 result elements and zeroes the upper elements.
We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877.
llvm-svn: 351017
|
| |
|
|
|
|
| |
Use demanded extract index to set most of the shuffle mask to undef, making it easier to widen and peek through.
llvm-svn: 351013
|
| |
|
|
|
|
|
|
| |
Make use of vblendvpd to select on the signbit
Differential Revision: https://reviews.llvm.org/D56544
llvm-svn: 350999
|
| |
|
|
|
|
|
|
| |
This patch takes some of the code from D49837 to allow us to enable ISD::ABS support for all SSE vector types.
Differential Revision: https://reviews.llvm.org/D56544
llvm-svn: 350998
|
| |
|
|
|
|
| |
lowering.
llvm-svn: 350995
|
| |
|
|
|
|
|
|
|
|
| |
The 128-bit input produces 64-bits of output and fills the upper 64-bits with 0. The mask only applies to the lower elements. But we can't represent this with a vselect like we normally do.
This also avoids the need to have a special X86ISD::SELECT when avx512bw isn't enabled since vselect v8i16 isn't legal there.
Fixes another instruction for PR34877.
llvm-svn: 350994
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As discussed on llvm-dev
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128497.html>, we have
to be careful when trying to select the *w RV64M instructions. i32 is not a
legal type for RV64 in the RISC-V backend, so operations have been promoted by
the time they reach instruction selection. Information about whether the
operation was originally a 32-bit operations has been lost, and it's easy to
write incorrect patterns.
Similarly to the variable 32-bit shifts, a DAG combine on ANY_EXTEND will
produce a SIGN_EXTEND if this is likely to result in sdiv/udiv/urem being
selected (and so save instructions to sext/zext the input operands).
Differential Revision: https://reviews.llvm.org/D53230
llvm-svn: 350993
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This restores support for selecting the SLLW/SRLW/SRAW instructions, which was
removed in rL348067 as the previous patterns made some unsafe assumptions.
Also see the related llvm-dev discussion
<http://lists.llvm.org/pipermail/llvm-dev/2018-December/128497.html>
Ultimately I didn't introduce a custom SelectionDAG node, but instead added a
DAG combine that inserts an AssertZext i5 on the shift amount for an i32
variable-length shift and also added an ANY_EXTEND DAG-combine which will
instead produce a SIGN_EXTEND for an i32 variable-length shift, increasing the
opportunity to safely select SLLW/SRLW/SRAW.
There are obviously different ways of addressing this (a number discussed in
the llvm-dev thread), so I'd welcome further feedback and comments.
Note that there are now some cases in
test/CodeGen/RISCV/rv64i-exhaustive-w-insts.ll where sraw/srlw/sllw is
selected even though sra/srl/sll could be used without any extra instructions.
Given both are semantically equivalent, there doesn't seem a good reason to
prefer one vs the other. Given that would require more logic to still select
sra/srl/sll in those cases, I've left it preferring the *w variants.
Differential Revision: https://reviews.llvm.org/D56264
llvm-svn: 350992
|
| |
|
|
|
|
| |
We no longer need to extend mask scalars before bitcasting them to vXi1. This was only needed for the truncate intrinsics. And was really a bug in our lowering of them.
llvm-svn: 350991
|
| |
|
|
|
|
|
|
|
|
| |
avx512dq, use v16i1 as the intermediate mask type instead of v8i1.
We still use i8 for the load/store type. So we need to convert to/from i16 to around the mask type.
By doing this we get an i8->i16 extload which we can then pattern match to a KMOVW if the access is aligned.
llvm-svn: 350989
|
| |
|
|
|
|
|
|
| |
MOVZX32rm8 and extract the subregister.
This should be a shorter encoding and is consistent with what we do for zext i8->i16
llvm-svn: 350988
|
| |
|
|
|
|
| |
Fix typo in r350952.
llvm-svn: 350986
|
| |
|
|
|
|
|
|
|
|
|
|
| |
the output has more elements than the input due to needing to be 128 bits.
We can't properly represent this with a vselect since the upper elements of the result are supposed to be zeroed regardless of the mask.
This also reuses the new nodes even when the result type fits in 128 bits if the input is q/d and the result is w/b since vselect w/b using k-register condition isn't legal without avx512bw. Currently we're doing this even when avx512bw is enabled, but I might change that.
This fixes some of PR34877
llvm-svn: 350985
|
| |
|
|
|
|
|
| |
Expand the predicate using shifted arithmetic and logic instructions to also
consider the respective not shifted instructions.
llvm-svn: 350976
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Teach x86 assembly operand parsing to distinguish between assembler
variable assigned to named registers and those assigned to immediate
values.
Reviewers: rnk, nickdesaulniers, void
Subscribers: hiraditya, jyknight, llvm-commits
Differential Revision: https://reviews.llvm.org/D56287
llvm-svn: 350966
|