Commit message log
llvm-svn: 307288
llvm-svn: 307268
mask types
llvm-svn: 307257
llvm-svn: 307255
llvm-svn: 307254
First step toward supporting shuffle combining to EXTRQ/INSERTQ.
llvm-svn: 307250
With the improved shuffle decoding we can now combine EXTRQI/INSERTQI shuffles from non-v16i8 vector types
llvm-svn: 307099
The existing decodes only worked for v16i8 vectors; this adds support for any 128-bit vector type
llvm-svn: 307095
llvm-svn: 307048
Zero extend to v16i32/v8i64, use VPOPCNTDQ instructions and truncate back.
llvm-svn: 306990
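As a hedged illustration (not taken from the commit's own tests, and the exact element types covered by the commit aren't shown in the message, so v16i16 here is an assumption), this is the kind of narrow-element vector popcount the lowering strategy above applies to:

declare <16 x i16> @llvm.ctpop.v16i16(<16 x i16>)

define <16 x i16> @popcnt_v16i16(<16 x i16> %a) {
  ; expected to zero extend to v16i32, use vpopcntd, then truncate back to v16i16
  %c = call <16 x i16> @llvm.ctpop.v16i16(<16 x i16> %a)
  ret <16 x i16> %c
}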
before bit shifts
We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads, and the bit-shift instructions are destructive to the destination register
llvm-svn: 306978
before bit shifts
We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads, and the bit-shift instructions are destructive to the destination register
The 32-bit shuffles are a bit tricky and will be dealt with in a later patch
llvm-svn: 306977
NFCI.
llvm-svn: 306847
[X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions).
AVX512 compare instructions return v*i1 types.
In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type.
Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes.
The fact that AVX512 compare instructions put the result in a k register and zeroes all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one AVX512 cmp instruction.
Differential Revision: https://reviews.llvm.org/D33188
llvm-svn: 306402
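A minimal IR sketch (an assumed reproducer, not from the patch's tests) of the pattern described above: a narrow compare result padded with zeroes up to v8i1 before being bitcast to an integer mask. Because the compare already zeroes the upper bits of the k register, the padding can be removed by copying to the wider register class.

define i8 @cmp_v4i32_mask(<4 x i32> %a, <4 x i32> %b) {
  ; compare produces a v4i1 mask (a k register under AVX512VL)
  %cmp = icmp sgt <4 x i32> %a, %b
  ; zero-pad to v8i1, as clang does for results with fewer than 8 elements
  %pad = shufflevector <4 x i1> %cmp, <4 x i1> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %mask = bitcast <8 x i1> %pad to i8
  ret i8 %mask
}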
llvm-svn: 306369
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
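For reference, a minimal example (assumed, not one of the commit's tests) of a vector increment affected by this change; per the description above, the add-of-1 becomes a subtract of an all-ones constant materialized with pcmpeq:

define <4 x i32> @increment(<4 x i32> %x) {
  ; expected lowering: pcmpeqd to build <-1,-1,-1,-1>, then psubd (x - (-1) == x + 1)
  %r = add <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
  ret <4 x i32> %r
}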
I'm not sure yet why this wouldn't fail in the simple case,
but clearly I used the wrong value type with:
https://reviews.llvm.org/rL306040
...and the bug manifests with:
https://bugs.llvm.org/show_bug.cgi?id=33560
llvm-svn: 306139
llvm-svn: 306121
This is very similar to the transform in:
https://reviews.llvm.org/rL306040
...but in this case, we use cmp X, 1 to set the carry bit as needed.
Again, we can show that all of these are logically equivalent (although
InstCombine currently canonicalizes to a form not seen here), and if
we believe IACA, then this is the smallest/fastest code. Eg, with SNB:
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | 1.0 | | | | | | | cmp edi, 0x1
| 2 | | 1.0 | | | | 1.0 | CP | sbb eax, eax
The larger motivation is to clean up all select-of-constants combining/lowering
because we're missing some common cases.
llvm-svn: 306072
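An illustrative IR form (assumed; the commit's actual tests may differ) that maps to the cmp/sbb sequence shown above - the carry flag is set exactly when %x is zero:

define i32 @select_eq0_m1_0(i32 %x) {
  ; expected: cmp edi, 1 ; sbb eax, eax  (eax = -1 when %x == 0, else 0)
  %cmp = icmp eq i32 %x, 0
  %sel = select i1 %cmp, i32 -1, i32 0
  ret i32 %sel
}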
Our handling of select-of-constants is lumpy in IR (https://reviews.llvm.org/D24480),
lumpy in DAGCombiner, and lumpy in X86ISelLowering. That's why we only had the 'sbb'
codegen in 1 out of the 4 tests. This is a step towards smoothing that out.
First, show that all of these IR forms are equivalent:
http://rise4fun.com/Alive/mx
Second, show that the 'sbb' version is faster/smaller. IACA output for SandyBridge
(later Intel and AMD chips are similar based on Agner's tables):
This is the "obvious" x86 codegen (what gcc appears to produce currently):
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1* | | | | | | | | xor eax, eax
| 1 | 1.0 | | | | | | CP | test edi, edi
| 1 | | | | | | 1.0 | CP | setnz al
| 1 | | 1.0 | | | | | CP | neg eax
This is the adc version:
| 1* | | | | | | | | xor eax, eax
| 1 | 1.0 | | | | | | CP | cmp edi, 0x1
| 2 | | 1.0 | | | | 1.0 | CP | adc eax, 0xffffffff
And this is sbb:
| 1 | 1.0 | | | | | | | neg edi
| 2 | | 1.0 | | | | 1.0 | CP | sbb eax, eax
If IACA is trustworthy, then sbb became a single uop in Broadwell, so this will be
clearly better than the alternatives going forward.
llvm-svn: 306040
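A sketch (assumed reproducer) of one of the equivalent IR forms the 'sbb' lowering above targets - the result is -1 exactly when the input is non-zero:

define i32 @select_ne0_m1_0(i32 %x) {
  ; expected: neg edi ; sbb eax, eax  (eax = -1 when %x != 0, else 0)
  %cmp = icmp ne i32 %x, 0
  %res = sext i1 %cmp to i32
  ret i32 %res
}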
This commit adds prologue code emission for stack probe function
calls.
Reviewed By: majnemer
Differential Revision: https://reviews.llvm.org/D34387
llvm-svn: 306010
Masked gather for vector length 2 is lowered incorrectly for element type i32.
The type <2 x i32> was automatically extended to <2 x i64> and we generated VPGATHERQQ instead of VPGATHERQD.
The type <2 x float> is extended to <4 x float>, so there is no bug for this type, but the sequence may be more optimal.
In this patch I'm fixing the <2 x i32> bug and optimizing the <2 x float> sequence for GATHERs only. The same fix should be done for Scatters as well.
Differential revision: https://reviews.llvm.org/D34343
llvm-svn: 305987
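A minimal reproducer sketch for the <2 x i32> gather described above (the intrinsic mangling and typed-pointer syntax are assumptions tied to the LLVM version); this is the case that previously selected VPGATHERQQ instead of VPGATHERQD:

declare <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*>, i32, <2 x i1>, <2 x i32>)

define <2 x i32> @gather_v2i32(<2 x i32*> %ptrs, <2 x i1> %mask, <2 x i32> %passthru) {
  %g = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %ptrs, i32 4, <2 x i1> %mask, <2 x i32> %passthru)
  ret <2 x i32> %g
}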
llvm-svn: 305914
There are a couple of potential improvements as seen in the IR and asm:
1. We're unnecessarily extending to a larger type to compare values.
2. The codegen for (select cond, 1, -1) could avoid a cmov.
(or we could change the order of the compares so that we have a select with a 0 operand)
llvm-svn: 305802
>=16bit elements
Shuffle lowering/combining now does a good job for 256/512-bit vectors - we don't need to prevent this
llvm-svn: 305801
Target shuffle combining now supports the matching of INSERT_VECTOR_ELT/PINSRW/PINSRB for merging multiple insertions into shuffles/bitmasks.
llvm-svn: 305788
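For illustration (an assumed example, not from the commit), a pair of element insertions of the kind that target shuffle combining can now match and merge into a single shuffle or bitmask operation:

define <8 x i16> @merge_insertions(<8 x i16> %v, <8 x i16> %w) {
  ; two pinsrw-style insertions that are now candidates for merging into one shuffle/blend
  %e0 = extractelement <8 x i16> %w, i32 0
  %e4 = extractelement <8 x i16> %w, i32 4
  %i0 = insertelement <8 x i16> %v, i16 %e0, i32 0
  %i4 = insertelement <8 x i16> %i0, i16 %e4, i32 4
  ret <8 x i16> %i4
}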
llvm-svn: 305476
assuming v8i16 vectors
We can use this with v16i16/v32i16 as well.
Found during fuzz testing.
llvm-svn: 305472
(remove redundant shift left+right instructions).
This is causing Windows buildbot failures
llvm-svn: 305470
[X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions).
AVX512 compare instructions return v*i1 types.
In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type.
Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes.
The fact that AVX512 compare instructions put the result in a k register and zeroes all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one AVX512 cmp instruction.
Differential Revision: https://reviews.llvm.org/D33188
llvm-svn: 305465
This is a follow-up to https://reviews.llvm.org/D34174 / https://reviews.llvm.org/rL305398.
We mentioned replacing the multiplies with shifts, but the real win seems to be in
bypassing the extra ops in the common case when the RootRatio and OpRatio are one.
This gives us another 1-2% overall win for the test in PR32037:
https://bugs.llvm.org/show_bug.cgi?id=32037
llvm-svn: 305414
We know that shuffle masks are power-of-2 sizes, but there's no way (?) for LLVM to know that,
so hack combineX86ShufflesRecursively() to be much faster by replacing div/rem with shift/mask.
This makes the motivating compile-time bug in PR32037 ( https://bugs.llvm.org/show_bug.cgi?id=32037 )
about 9% faster overall.
Differential Revision: https://reviews.llvm.org/D34174
llvm-svn: 305398
Seems my recent move to VS2017 has resulted in a few text editor issues.....
llvm-svn: 305285
(PR32037)
Much of PR32037's compile time regression is due to getTargetConstantBitsFromNode always creating large (>64bit) APInts during the bitcasting from the source data to the destination bitwidth.
This commit avoids this bitcast stage if the data is already the correct bitwidth.
llvm-svn: 305284
This step is just intended to reduce code duplication rather than change any functionality.
A follow-up would be to replace PPCTargetLowering::spliceIntoChain() usage with this new helper.
Differential Revision: https://reviews.llvm.org/D33649
llvm-svn: 305192
32-byte load
I was looking closer at the x86 test diffs in D33866, and the first change seems like it
shouldn't happen in the first place. So this patch will resolve that.
Using Agner's tables and AMD docs, vperm2f128 and vinsertf128 have identical timing for
any given CPU model, so we should be able to interchange those without affecting perf.
But as we can see in some of the diffs here, using vperm2f128 allows load folding, so
we should take that opportunity to reduce code size and register pressure.
A secondary advantage is making AVX1 and AVX2 codegen more similar. Given that vperm2f128
was introduced with AVX1, we should be selecting it in all of the same situations that we
would with AVX2. If there's some reason that an AVX1 CPU would not want to use this
instruction, that should be fixed up in a later pass.
Differential Revision: https://reviews.llvm.org/D33938
llvm-svn: 305171
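A sketch (assumed; the commit's tests may use different shuffles) of a 256-bit shuffle whose second operand is a 32-byte load; selecting vperm2f128 rather than vinsertf128 allows the load to be folded into the shuffle:

define <8 x float> @insert_low_of_load(<8 x float> %a, <8 x float>* %p) {
  %b = load <8 x float>, <8 x float>* %p, align 32
  ; keep the low half of %a, use the low half of the loaded value as the high half
  %r = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  ret <8 x float> %r
}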
If the inputs won't saturate during packing then we can treat the PACKSS as a truncation shuffle
llvm-svn: 305091
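An illustrative case (assumed, written against the SSE2 packsswb intrinsic) where the inputs provably cannot saturate, so the PACKSS can be treated as a truncating shuffle:

declare <16 x i8> @llvm.x86.sse2.packsswb.128(<8 x i16>, <8 x i16>)

define <16 x i8> @packss_no_saturate(<8 x i16> %a, <8 x i16> %b) {
  ; masking to 7 bits guarantees every element fits in a signed i8, so no saturation occurs
  %a7 = and <8 x i16> %a, <i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127>
  %b7 = and <8 x i16> %b, <i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127>
  %p = call <16 x i8> @llvm.x86.sse2.packsswb.128(<8 x i16> %a7, <8 x i16> %b7)
  ret <16 x i8> %p
}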
constants.
The initial patch was rejected: I fixed the issue and re-applied it.
llvm-svn: 304972
If we know that both operands of an unsigned integer vector comparison are non-negative,
then it's safe to directly use a signed-compare-greater-than instruction (the only non-equality
integer vector compare predicate provided by SSE/AVX).
We're intentionally not changing the condition code to signed in order to preserve the
existing transforms that use min/max/psubus below here.
This should solve PR33276:
https://bugs.llvm.org/show_bug.cgi?id=33276
Differential Revision: https://reviews.llvm.org/D33862
llvm-svn: 304909
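A sketch (an assumed reproducer in the spirit of PR33276) where both operands are provably non-negative, so the unsigned compare can be implemented directly with the signed pcmpgt:

define <4 x i32> @ugt_known_nonneg(<4 x i32> %a, <4 x i32> %b) {
  ; lshr clears the sign bit, so both compare operands are known non-negative
  %x = lshr <4 x i32> %a, <i32 1, i32 1, i32 1, i32 1>
  %y = lshr <4 x i32> %b, <i32 1, i32 1, i32 1, i32 1>
  %cmp = icmp ugt <4 x i32> %x, %y
  %res = sext <4 x i1> %cmp to <4 x i32>
  ret <4 x i32> %res
}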
We were checking that the index was in range of the destination vector type, not the (larger) source vector type
llvm-svn: 304894
(PR32744)
Differential Revision: https://reviews.llvm.org/D33728
llvm-svn: 304718
coalescing/combine opportunities
We currently generate BUILD_VECTOR as a tree of UNPCKL shuffles of the same type:
e.g. for v4f32:
Step 1: unpcklps 0, 2 ==> X: <?, ?, 2, 0>
: unpcklps 1, 3 ==> Y: <?, ?, 3, 1>
Step 2: unpcklps X, Y ==> <3, 2, 1, 0>
The issue is that, because we are not placing sequential vector elements together early enough, we fail to recognise many combinable patterns - consecutive scalar loads, extractions, etc.
Instead, this patch unpacks progressively larger sequential vector elements together:
e.g. for v4f32:
Step 1: unpcklps 0, 2 ==> X: <?, ?, 1, 0>
: unpcklps 1, 3 ==> Y: <?, ?, 3, 2>
Step 2: unpcklpd X, Y ==> <3, 2, 1, 0>
This does mean that we are creating UNPCKL shuffles of different value types, but the relevant combines that benefit from this are quite capable of handling the additional BITCASTs that are now included in the shuffle tree.
Differential Revision: https://reviews.llvm.org/D33864
llvm-svn: 304688
Generalized existing SCALAR_TO_VECTOR(EXTRACT_VECTOR_ELT) code to support AssertZext + PEXTRW/PEXTRB cases as well.
llvm-svn: 304659
Organizing by transform is smaller and easier to read than a squashed switch with fall-throughs.
llvm-svn: 304611
Since r288804, we try to lower build_vectors on AVX using broadcasts of
float/double. However, when we broadcast integer values that happen to
have a NaN float bitpattern, we lose the NaN payload, thereby changing
the integer value being broadcast.
This is caused by ConstantFP::get, to which we pass the splat i32 as
a float (by bitcasting it using bitsToFloat). ConstantFP::get takes
a double parameter, so we end up lossily converting a single-precision
NaN to double-precision.
Instead, avoid any kinds of conversions by directly building an APFloat
from the splatted APInt.
Note that this also fixes another piece of code (broadcast of
subvectors), that currently isn't susceptible to the same problem.
Also note that we could really just use APInt and ConstantInt
throughout: the constant pool type doesn't matter much. Still, for
consistency, use the appropriate type.
llvm-svn: 304590
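A hypothetical reproducer for the miscompile described above; the constant is an arbitrary illustration, and any i32 splat whose bit pattern is a float NaN that changes under a float->double->float round trip would do:

define <8 x i32> @splat_nan_bits() {
  ; 0x7FA00000 is a signalling-NaN bit pattern; bouncing it through double quiets it,
  ; changing the integer value that ends up being broadcast
  ret <8 x i32> <i32 2141192192, i32 2141192192, i32 2141192192, i32 2141192192, i32 2141192192, i32 2141192192, i32 2141192192, i32 2141192192>
}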
llvm-svn: 304576
This might give a few better opportunities to optimize these to memcpy
rather than loops - also a few minor cleanups (StringRef-izing,
templating (to avoid std::function indirection), etc).
The SmallVector::assign(iter, iter) could be improved with the use of
SFINAE, but the (iter, iter) ctor and append(iter, iter) need it too and
don't have it - so, work around it for now rather than bothering with the
added complexity.
(also, as noted in the added FIXME, these assign ops could potentially
be optimized better at least for non-trivially-copyable types)
llvm-svn: 304566
the CARRY ops instead.
Summary:
As per title. This cleans up some technical debt.
Depends on D33374
Reviewers: jyknight, nemanjai, mkuper, spatel, RKSimon, zvi, bkramer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D33390
llvm-svn: 304435
Summary:
Add an early combine to match patterns such as:
(i16 bitcast (v16i1 x))
->
(i16 movmsk (v16i8 sext (v16i1 x)))
This combine needs to happen early enough before
type-legalization scalarizes the result of the setcc.
Reviewers: igorb, craig.topper, RKSimon
Subscribers: delena, llvm-commits
Differential Revision: https://reviews.llvm.org/D33311
llvm-svn: 304406
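A small example (assumed; the patch's own tests may differ) of the pattern the new combine matches - a vector setcc whose i1 result is bitcast straight to an integer mask:

define i16 @setcc_to_bitmask(<16 x i8> %a, <16 x i8> %b) {
  ; (i16 bitcast (v16i1 setcc)) -> sext to v16i8 + pmovmskb, done before type legalization
  %cmp = icmp eq <16 x i8> %a, %b
  %mask = bitcast <16 x i1> %cmp to i16
  ret i16 %mask
}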
usubo/setcccarry.
Summary:
This is a continuation of the work started in D29872. Passing the carry down as a value rather than as glue allows for further optimizations. Introducing setcccarry makes the use of addc/subc unnecessary and we can start the removal process.
This patch only introduces the optimizations strictly required to get the same level of optimization as was available before, nothing more.
Reviewers: jyknight, nemanjai, mkuper, spatel, RKSimon, zvi, bkramer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D33374
llvm-svn: 304404