| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
| |
llvm-svn: 347316
|
| |
|
|
|
|
|
|
| |
visitINSERT_SUBVECTOR (PR37989)
This uncovered an off-by-one typo in SimplifyDemandedVectorElts's INSERT_SUBVECTOR handling as its bounds check was bailing on safe indices.
llvm-svn: 347313
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When doing some instruction scheduling work, we noticed some missing itineraries.
Before we switch to machine scheduler, those missing itineraries might not have impact to actually scheduling,
because we can still get same latency due to default values.
With machine scheduler, however, itineraries will have impact to scheduling.
eg: NumMicroOps will default to be 0 if there is NO itineraries for specific instruction class.
And most of the instruction class with itineraries will have NumMicroOps default to 1.
This will has impact on the count of RetiredMOps, affects the Pending/Available Queue,
then causing different scheduling or suboptimal scheduling further.
This patch is for STWU/STWUX (IIC_LdStStoreUpd ) for P8.
Since there are already multiple IIC for store update, this patch also merge
IIC_LdStSTDU/IIC_LdStStoreUpd to IIC_LdStSTU
IIC_LdStSTDUX to IIC_LdStSTUX
and we add a new testcase in https://reviews.llvm.org/D54699 to show the difference.
Differential Revision: https://reviews.llvm.org/D54700
llvm-svn: 347311
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch add a STWU testcase for scheduling check.
Currently P7/P8 which use itineraries are missing IIC_LdStStoreUpd,
We use CHECK-ITIN prefix to check P7/P8, then use default for P9 (and future).
We will fix the missing itineraries of IIC_LdStStoreUpd in following patch,
and update this testcase to show the scheduling difference only there.
Differential Revision: https://reviews.llvm.org/D54699
llvm-svn: 347310
|
| |
|
|
|
|
|
|
| |
instructions.
Pull out getPackDemandedElts demanded elts remapping helper from computeKnownBitsForTargetNode and use in computeKnownBits/ComputeNumSignBits.
llvm-svn: 347303
|
| |
|
|
|
|
|
|
|
|
| |
For bitcast nodes from larger element types, add the ability for SimplifyDemandedVectorElts to call SimplifyDemandedBits by merging the elts mask to a bits mask.
I've raised https://bugs.llvm.org/show_bug.cgi?id=39689 to deal with the few places where SimplifyDemandedBits's lack of vector handling is a problem.
Differential Revision: https://reviews.llvm.org/D54679
llvm-svn: 347301
|
| |
|
|
|
|
| |
As discussed on rL347240, this avoids some regressions on D54679 and also helps some combines to kick in a bit earlier.
llvm-svn: 347300
|
| |
|
|
|
|
|
|
| |
instructions.
As discussed on rL347240.
llvm-svn: 347299
|
| |
|
|
|
|
|
|
|
|
|
|
| |
where all the even or odd elements are undef.
Previously if V2 was unused we ended up using V1 for both inputs as part of the code that follows the new code. By using lowerVectorShuffleWithUNPCK we keep the undef nature of V2 in the output.
As near as I can tell this makes v16i8 behavior consistent with every other VT now.
This does mean that we give the register allocator freedom to fill in random registers now and create false dependencies. But like I said we're already doing that for other types.
llvm-svn: 347296
|
| |
|
|
|
|
|
|
| |
getZeroVector produces a specifically canonicalized zero vector, but we can just let DAG legalization take care of it.
The test changes are because MULH lowering happens later than it should and this change gave us the opportunity to constant fold away a multiply during a DAG combine before the build_vector got legalized with a bitcast.
llvm-svn: 347290
|
| |
|
|
|
|
|
|
|
|
| |
Turns out that there was no check for a store that truncates down
to a single byte when combining a (store (bswap...)) into a byte-swapping
store. This patch just adds that check.
Fixes https://bugs.llvm.org/show_bug.cgi?id=39478.
llvm-svn: 347288
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
registers. Use it to add AssertZExt/AssertSExt in the live in basic blocks
Summary:
We already support this for scalars, but it was explicitly disabled for vectors. In the updated test cases this allows us to see the upper bits are zero to use less multiply instructions to emulate a 64 bit multiply.
This should help with this ispc issue that a coworker pointed me to https://github.com/ispc/ispc/issues/1362
Reviewers: spatel, efriedma, RKSimon, arsenm
Reviewed By: spatel
Subscribers: wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D54725
llvm-svn: 347287
|
| |
|
|
|
|
| |
Differential Revision: https://reviews.llvm.org/D54728
llvm-svn: 347274
|
| |
|
|
|
|
|
|
|
|
| |
implement a multiply when a zext lives in another basic block.
This can occur when one of the inputs to the multiply is loop invariant. Though my test cases just use two basic blocks with an unconditional jump which we won't merge until after isel in the codegen pipeline.
For scalars, I believe SelectionDAGBuilder can add an AssertZExt to pass knowledge across basic blocks but its explicitly disabled for vectors.
llvm-svn: 347266
|
| |
|
|
|
|
|
|
| |
GFX9 should select opsel version.
Differential Revision: https://reviews.llvm.org/D54545
llvm-svn: 347265
|
| |
|
|
|
|
|
|
|
| |
This works if DAG combiner is enabled, but without combining
we cannot select scalar_to_vector of <2 x half> and <2 x i16>.
Differential Revision: https://reviews.llvm.org/D54718
llvm-svn: 347259
|
| |
|
|
|
|
|
|
|
|
| |
As discussed on D53794, for float types with ranges smaller than the destination integer type, then we should be able to just use a regular FP_TO_SINT opcode.
I thought we'd need to provide MSA test cases for very small integer types as well (fp16 -> i8 etc.), but it turns out that promotion will kick in so they're unnecessary.
Differential Revision: https://reviews.llvm.org/D54703
llvm-svn: 347251
|
| |
|
|
|
|
|
|
| |
SSE PSHUFB vector ctlz lowering works at the i4 nibble level. As detailed in PR39703, we were masking the lower nibble off but we only actually use it in the case where the upper nibble is known to be zero, making it safe to remove the mask and save an instruction.
Differential Revision: https://reviews.llvm.org/D54707
llvm-svn: 347242
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
non-avx2 algorithm to each 128-bit lane.
Previously we split the vectors in half to allow the two halves to be any extended then concatenated the results back together.
This patch instead instead extends the v16i8 sse algorithm to extend half of each 128-bit lane using punpcklbw/punpckhbw. Multiplies all the low half lanes and high half lanes together in separate operations. Then merges the half lane results back together using packuswb.
Unfortunately, some of the cases in vector-reduce-mul.ll regress because we aren't narrowing the vector width of the multiplies as we reduce. The splitting was somewhat making up for that before by causing halves to be discarded after the split.
Differential Revision: https://reviews.llvm.org/D54668
llvm-svn: 347240
|
| |
|
|
| |
llvm-svn: 347238
|
| |
|
|
|
|
|
|
|
| |
This allows to avoid scratch use or indirect VGPR addressing for
small vectors.
Differential Revision: https://reviews.llvm.org/D54606
llvm-svn: 347231
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This makes it easier/cleaner to generate a single signature from
this directive. Also:
- Adds the symbol name, such that we don't depend on the location
of this directive anymore.
- Actually constructs the signature in the assembler, and make the
assembler own it.
- Refactor the use of MVT vs ValType in the streamer and assembler
to require less conversions overall.
- Changed 700 or so tests to use it.
Reviewers: sbc100, dschuff
Subscribers: jgravelle-google, eraman, aheejin, sunfish, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D54652
llvm-svn: 347228
|
| |
|
|
| |
llvm-svn: 347227
|
| |
|
|
| |
llvm-svn: 347218
|
| |
|
|
| |
llvm-svn: 347217
|
| |
|
|
| |
llvm-svn: 347212
|
| |
|
|
| |
llvm-svn: 347211
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch defines an interleaved-load-combine pass. The pass searches
for ShuffleVector instructions that represent interleaved loads. Matches are
converted such that they will be captured by the InterleavedAccessPass.
The pass extends LLVMs capabilities to use target specific instruction
selection of interleaved load patterns (e.g.: ld4 on Aarch64
architectures).
Differential Revision: https://reviews.llvm.org/D52653
llvm-svn: 347208
|
| |
|
|
| |
llvm-svn: 347195
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Truncs are treated as sources if their produce a value of the same
type as the one we currently trying to promote. Truncs used to be
considered as a sink if their operand was the same value type.
We now allow smaller types in the search, so we should search through
truncs that produce a smaller value. These truncs can then be
converted to an AND mask.
This leaves sinks as being:
- points where the value in the register is being observed, such as
an icmp, switch or store.
- points where value types have to match, such as calls and returns.
- zext are included to ease the transformation and are generally
removed later on.
During this change, it also became apart from truncating sinks was
broken: if a sink used a source, its type information had already
been lost by the time the truncation happens. So I've changed the
method of caching the type information.
Differential Revision: https://reviews.llvm.org/D54515
llvm-svn: 347191
|
| |
|
|
|
|
|
|
|
|
|
| |
There is no variable-length shifts on MSP430. Therefore
"eat" 8 bits of shift via bswap & ext.
Path by Kristina Bessonova!
Differential Revision: https://reviews.llvm.org/D54623
llvm-svn: 347187
|
| |
|
|
|
|
|
|
| |
sign bit in v4i32 MULH lowering.
The shift requires a copy to avoid clobbering a register. Comparing with 0 uses an xor to produce 0 that will be overwritten with the compare results. So still requires 2 instructions, but should be one byte shorter since it doesn't need to encode an immediate.
llvm-svn: 347185
|
| |
|
|
|
|
|
|
| |
extending to v2i64 pre-sse4.1
Previously we used an arithmetic shift right by 31, but that requires a copy to preserve the input. So we might as well materialize a zero and compare to it since the comparison will overwrite the register that contains the zeros. This should be one byte shorter.
llvm-svn: 347181
|
| |
|
|
|
|
|
|
| |
-x86-experimental-vector-widening-legalization.
Leave just the v4i8->v4i64 and v8i8->v8i64, but only enable them on pre-sse4.1 targets when 64-bit mode is enabled. In those cases we end up creating sext loads that get scalarized to code that looks better than what we get from loading into a vector register and doing a multiple step sign extend using unpacks and shifts.
llvm-svn: 347180
|
| |
|
|
|
|
| |
conversions.
llvm-svn: 347177
|
| |
|
|
|
|
|
|
| |
Pre-SSE4.1 sext_invec for v2i64 is complicated because we don't have a v2i64 sra instruction. So instead we sign extend to i32 using unpack and sra, then copy the elements and do a v4i32 sra to fill with sign bits, then interleave the i32 sign extend and the sign bits. So really we're doing to two sign extends but only using half of the v4i32 intermediate result.
When the result is more than 128 bits, default type legalization would prefer to split the destination type all the way down to v2i64 with shuffles followed by v16i8/v8i16->v2i64 sext_inreg operations. This results in more instructions than necessary because we are only utilizing the lower 2 elements of the v4i32 intermediate result. Instead we can custom split a v4i8/v4i16->v4i64 sign_extend. Then we can sign extend v4i8/v4i16->v4i32 invec producing a full v4i32 result. Create the sign bit vector as a v4i32 then split and interleave with the sign bits using an punpackldq and punpackhdq.
llvm-svn: 347176
|
| |
|
|
|
|
|
|
| |
vector-sext.ll to show some of the scalarized load sequences without 64-bit scalar support.
Some of these sequeces look pretty bad since we have to copy the sign bit from a 32 bit register to a 64 bit register to finish a sign extend.
llvm-svn: 347175
|
| |
|
|
|
|
| |
SSE vector shifts only use the bottom 64-bits of the shift amount vector.
llvm-svn: 347173
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
-x86-experimental-vector-widening-legalization. Add custom type legalization for extends.
If we widen illegal types instead of promoting, we should be able to rely on the type legalizer to create the vector_inreg operations for us with some caveats.
This patch disables combineToExtendVectorInReg when we are using widening.
I've enabled custom legalization for v8i8->v8i64 extends under avx512f since the type legalizer would want to create a vector_inreg with a v64i8 input type which isn't legal without avx512bw. So we go to v16i8 with custom code using the relaxation of rules we get from D54346.
I've also enable custom legalization of v8i64 and v16i32 operations with with AVX. When the input type is 128 bits, the default splitting legalization would extend first 128->256, then do the a split to two 128 pieces. Extend each half to 256 and then concat the result. The custom legalization I've added instead uses a 128->256 bit vector_inreg extend that only reads the lower 64-bits for the low half of the split. Then shuffles the high 64-bits to the low 64-bits and does another vector_inreg extend.
llvm-svn: 347172
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
extract_subvector, and a packuswb instruction.
Summary: This is an improvement over the two pshufbs and punpcklqdq we'd get otherwise.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54671
llvm-svn: 347171
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sadly, this duplicates (twice) the logic from InstSimplify. There
might be some way to at least share the DAG versions of the code,
but copying the folds seems to be the standard method to ensure
that we don't miss these folds.
Unlike in IR, we don't run DAGCombiner to fixpoint, so there's no
way to ensure that we do these kinds of simplifications unless the
code is repeated at node creation time and during combines.
There were other tests that would become worthless with this
improvement that I changed as pre-commits:
rL347161
rL347164
rL347165
rL347166
rL347167
I'm not sure how to salvage the remaining tests (diffs in this patch).
So the x86 tests verify that the new code is working as intended.
The AMDGPU test is actually similar to my motivating case: we have
some undef value that has survived to machine IR in an x86 test, and
then it gets folded in some weird way, or we crash if we don't transfer
the undef flag. But we would have been better off never getting to that
point by doing these simplifications.
This will lead back to PR32023 someday...
https://bugs.llvm.org/show_bug.cgi?id=32023
llvm-svn: 347170
|
| |
|
|
| |
llvm-svn: 347167
|
| |
|
|
| |
llvm-svn: 347166
|
| |
|
|
| |
llvm-svn: 347165
|
| |
|
|
| |
llvm-svn: 347164
|
| |
|
|
|
|
| |
Means we don't use the per-lane-shifts as much when we can cheaply use the older splat-variable-shifts.
llvm-svn: 347162
|
| |
|
|
| |
llvm-svn: 347161
|
| |
|
|
| |
llvm-svn: 347159
|
| |
|
|
|
|
|
|
| |
SimplifyDemandedVectorEltsForTargetNode (PR39549)
We were using the 'normalized' shuffle mask from resolveTargetShuffleInputs, which replaces zero/undef inputs with sentinel values. For SimplifyDemandedVectorElts we need the raw mask so we can correctly demand those 'zero' inputs that got normalized away, this requires an extra bit of logic to locally normalize undef inputs.
llvm-svn: 347158
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Summary: Now `llc -filetype=null` works.
Reviewers: eush
Subscribers: dschuff, jgravelle-google, sbc100, sunfish, llvm-commits
Differential Revision: https://reviews.llvm.org/D54660
llvm-svn: 347155
|