| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
| |
llvm-svn: 343392
|
| |
|
|
| |
llvm-svn: 343391
|
| |
|
|
|
|
|
|
| |
shuffles before simplifying inputs
By removing demanded target shuffles that simplify to zero/undef/identity before simplifying its inputs we improve chances of further simplification, as only the immediate parent user of the combined is added back to the work list - this still doesn't help us if its passed through other ops though (bitcasts....).
llvm-svn: 343390
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
added to clang
This adds tests for:
_mm_loadu_si16
_mm_loadu_si32
_mm_loadu_si16
_mm_storeu_si64
_mm_storeu_si32
_mm_storeu_si16
llvm-svn: 343389
|
| |
|
|
|
|
|
|
| |
bits via shuffles
Exposed an issue that recursive calls to getTargetConstantBitsFromNode don't handle changes to EltSizeInBits yet.
llvm-svn: 343384
|
| |
|
|
| |
llvm-svn: 343376
|
| |
|
|
|
|
| |
ISD::EXTRACT_SUBVECTOR
llvm-svn: 343375
|
| |
|
|
|
|
| |
The shift amount might have peeked through a extract_subvector, altering the number of vector elements in the 'Amt' variable - so we were incorrectly calculating the ratio when peeking through bitcasts, resulting in incorrectly detecting splats.
llvm-svn: 343373
|
| |
|
|
|
|
|
|
| |
instructions when a truncate is beteen the CMP and the AND and the sign flag is used.
The code in X86ISelDAGToDAG only looks through truncates if the sign flag isn't used, but that is overly restrictive. A future patch will improve this.
llvm-svn: 343355
|
| |
|
|
|
|
|
|
|
|
|
| |
https://reviews.llvm.org/D51147
Asserting if any extend of vectors should be up to the target's
legalizer/target specific code not in CallLowering.
reviewed by : dsanders.
llvm-svn: 343325
|
| |
|
|
|
|
|
|
|
|
|
|
| |
On GreenDragon `CodeGen/X86/cpus.ll` is timing out on the bot with Asan
and UBSan enabled. With the same configuration on my machine, the test
passes but takes more than 3 minutes to do so. I could increase the
timeout, but I believe it makes more sense to split up the test because
it allows for more parallelism.
Differential revision: https://reviews.llvm.org/D52603
llvm-svn: 343313
|
| |
|
|
|
|
|
|
| |
Double throughput to account for 2 pipes + fix BSF's latency/uop counts
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343311
|
| |
|
|
|
|
| |
PHMINPOS can run on either JFPU pipe
llvm-svn: 343299
|
| |
|
|
|
|
| |
The assembly for this test should be optimal now after changes to the ScalarizeMaskedMemIntrin patch.
llvm-svn: 343281
|
| |
|
|
|
|
|
|
| |
passthru vector and insert the new load results into it.
Previously we started with undef and did a final merge with the passthru at the end.
llvm-svn: 343273
|
| |
|
|
|
|
|
|
| |
passthru value and insert each conditional load result over their element.
Previously we started with undef and did one final merge at the end with a select.
llvm-svn: 343271
|
| |
|
|
|
|
|
|
| |
values. That's just %x so use that directly.
Had we emitted this IR earlier, InstCombine would have removed icmp so I'm going to assume using the i1 directly would be considered canonical.
llvm-svn: 343244
|
| |
|
|
|
|
| |
We now generate cttz with the zero_undef flag set to false. This allows -O0 to avoid the zero check.
llvm-svn: 343127
|
| |
|
|
|
|
| |
The previously reduced version used urem <9 x i32> zeroinitializer, %tmp which D52504 will simplify.
llvm-svn: 343097
|
| |
|
|
|
|
|
|
|
|
| |
Similar to the existing ISD::SRL constant vector shifts from D49562, this patch adds ISD::SRA support with ISD::MULHS.
As we're dealing with signed values, we have to handle shift by zero and shift by one special cases, so XOP+AVX2/AVX512 splitting/extension is still a better solution - really we should still use ISD::MULHS if one of the special cases are used but for now I've just left a TODO and filtered by isKnownNeverZero.
Differential Revision: https://reviews.llvm.org/D52171
llvm-svn: 343093
|
| |
|
|
|
|
|
|
| |
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D51495
llvm-svn: 343090
|
| |
|
|
|
|
|
|
|
|
| |
input types.
This removes an int->fp bitcast between the surrounding code and the movmsk. I had already added a hack to combineMOVMSK to try to look through this bitcast to improve the SimplifyDemandedBits there.
But I found an additional issue where the bitcast was preventing combineMOVMSK from being called again after earlier nodes in the DAG are optimized. The bitcast gets revisted, but not the user of the bitcast. By using integer types throughout, the bitcast doesn't get in the way.
llvm-svn: 343046
|
| |
|
|
|
|
|
|
| |
These IR patterns represent the exact behavior of a movmsk instruction using (zext (bitcast (icmp slt X, 0))).
For the v4i32/v8i32/v2i64/v4i64 we currently emit a PCMPGT for the icmp slt which is unnecessary since we only care about the sign bit of the result. This is because of the int->fp bitcast we put on the input to the movmsk nodes for these cases. I'll be fixing this in a future patch.
llvm-svn: 343045
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the final (I hope!) problem pattern mentioned in PR37749:
https://bugs.llvm.org/show_bug.cgi?id=37749
We are trying to avoid an AVX1 sinkhole caused by having 256-bit bitwise logic ops but no other 256-bit integer ops.
We've already solved the simple logic ops, but 'andn' is an x86 special. I looked at alternative solutions like
extending the generic DAG combine or trying to wait until the ANDNP node is created, but those are bigger patches
that can over-reach. Ie, splitting to 128-bit does not look like a win in most cases with >1 256-bit op.
The pattern matching is cluttered with bitcasts because of our i64 element canonicalization. For the affected test,
we have this vector-type-legalized sequence:
t29: v8i32 = concat_vectors t27, t28
t30: v4i64 = bitcast t29
t18: v8i32 = BUILD_VECTOR Constant:i32<-1>, Constant:i32<-1>, ...
t31: v4i64 = bitcast t18
t32: v4i64 = xor t30, t31
t9: v8i32 = BUILD_VECTOR Constant:i32<255>, Constant:i32<255>, ...
t34: v4i64 = bitcast t9
t35: v4i64 = and t32, t34
t36: v8i32 = bitcast t35
t37: v4i32 = extract_subvector t36, Constant:i64<0>
t38: v4i32 = extract_subvector t36, Constant:i64<4>
Differential Revision: https://reviews.llvm.org/D52318
llvm-svn: 343008
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Reviewers: spatel, RKSimon
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52424
llvm-svn: 342989
|
| |
|
|
|
|
|
|
| |
As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead.
The uops are slightly different to the register variant, so requires a +1uop tweak
llvm-svn: 342969
|
| |
|
|
|
|
|
|
| |
The included test case previously asserted because the type legalizer tried to soften the FILD ISD node.
Fixes PR38819.
llvm-svn: 342934
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Reviewers: javed.absar, trentxintong, courbet
Reviewed By: trentxintong
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52433
llvm-svn: 342919
|
| |
|
|
|
|
| |
The uops are slightly different to the register variant, so requires a +1uop tweak
llvm-svn: 342916
|
| |
|
|
| |
llvm-svn: 342908
|
| |
|
|
|
|
|
|
| |
Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases.
This removes all the scheduler overrides for gpr multiplies and stops WriteMULH being ignored for BMI2 MULX instructions.
llvm-svn: 342892
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a preliminary step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613
If we have an 'add' instruction that sets flags, we can use that to eliminate an
explicit compare instruction or some other instruction (cmn) that sets flags for
use in the later select.
As shown in the unchanged tests that use 'icmp ugt %x, %a', we're effectively
reversing an IR icmp canonicalization that replaces a variable operand with a
constant:
https://rise4fun.com/Alive/V1Q
But we're not using 'uaddo' in those cases via DAG transforms. This happens in
CGP after D8889 without checking target lowering to see if the op is supported.
So AArch already shows 'uaddo' codegen for the i8/i16/i32/i64 test variants with
"using_cmp_sum" in the title. That's the pattern that CGP matches as an unsigned
saturated add and converts to uaddo without checking target capabilities.
This patch is gated by isOperationLegalOrCustom(ISD::UADDO, VT), so we see only
see AArch diffs for i32/i64 in the tests with "using_cmp_notval" in the title
(unlike x86 which sees improvements for all sizes because all sizes are 'custom').
But the AArch code (like x86) looks better when translated to 'uaddo' in all cases.
So someone that is involved with AArch may want to set i8/i16 to 'custom' for UADDO,
so this patch will fire on those tests.
Another possibility given the existing behavior: we could remove the legal-or-custom
check altogether because we're assuming that a UADDO sequence is canonical/optimal
before we ever reach here. But that seems like a bug to me. If the target doesn't
have an add-with-flags op, then it's not likely that we'll get optimal DAG combining
using a UADDO node. This is similar justification for why we don't canonicalize IR to
the overflow math intrinsic sibling (llvm.uadd.with.overflow) for UADDO in the first
place.
Differential Revision: https://reviews.llvm.org/D51929
llvm-svn: 342886
|
| |
|
|
|
|
|
|
|
|
| |
It would be best to introduce ISD::BitFieldExtract,
because clearly more than one backend faces the same problem.
But for now let's solve this in the x86-specific DAG combine.
https://bugs.llvm.org/show_bug.cgi?id=38938
llvm-svn: 342880
|
| |
|
|
| |
llvm-svn: 342860
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
negation
This is an alternative to https://reviews.llvm.org/D37896. We can't decompose
multiplies generically without a target hook to tell us when it's profitable.
ARM and AArch64 may be able to remove some existing code that overlaps with
this transform.
This extends D52195 and may resolve PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474
(still an open question about transforming legal vector multiplies, but we
could open another bug report for those)
llvm-svn: 342844
|
| |
|
|
|
|
|
|
|
|
|
|
| |
The SandyBridge model was missing schedule values for the RCL/RCR values - instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults.
I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well.
This is necessary to allow WriteRotate to be updated to remove other rotate overrides.
It'd probably be a good idea to investigate a WriteRotateCarry class at some point but its not high priority given the unusualness of these instructions.
llvm-svn: 342842
|
| |
|
|
| |
llvm-svn: 342838
|
| |
|
|
|
|
|
|
| |
Our lowering that tries to avoid this sign extend can be defeated by the DAG combine folding it with a truncate.
The pattern needs to extend to an v8i32 then truncate back down to v8i16.
llvm-svn: 342830
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: Similar to D51893 which was for memcpy
Reviewers: efriedma
Reviewed By: efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52063
llvm-svn: 342796
|
| |
|
|
|
|
|
|
| |
vXi8 vectors.
We don't have a vXi8 shift left so we need to bitcast to a vXi16 vector to perform the shift. If we let lowering legalize the vXi8 shift we get an extra and that we don't need and fail to remove.
llvm-svn: 342795
|
| |
|
|
|
|
|
|
| |
into a 64-bit register.
Previously we used SUBREG_TO_REG+MOV32ri. But regular isel was changed recently to use the MOV32ri64 pseudo. Fast isel now does the same.
llvm-svn: 342788
|
| |
|
|
| |
llvm-svn: 342775
|
| |
|
|
|
|
| |
To investigate broadcast instruction codegen for D51553
llvm-svn: 342773
|
| |
|
|
| |
llvm-svn: 342756
|
| |
|
|
| |
llvm-svn: 342755
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
On SNB, renamer-based zeroing does not work for:
- 16 and 8-bit GPRs[1].
- MMX [2].
- ANDN variants [3]
[1] echo 'sub %ax, %ax' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[2] echo 'pxor %mm0, %mm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[3] echo 'andnps %xmm0, %xmm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
Reviewers: RKSimon, andreadb
Subscribers: gbedwell, craig.topper, llvm-commits
Differential Revision: https://reviews.llvm.org/D52358
llvm-svn: 342736
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr
and VANDNP(S|D)Yrr.
This is a follow-up of r342555.
On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at
register-renaming stage. The other opcode has to be executed to set the upper
half of the destination YMM.
Same for VANDNP(S|D)Yrr.
Differential Revision: https://reviews.llvm.org/D52347
llvm-svn: 342728
|
| |
|
|
| |
llvm-svn: 342726
|
| |
|
|
|
|
|
|
|
|
| |
tryLocalSplit only handles a single use block, but an interval may
have multiple use blocks. So don't crash in that case. This fixes
PR38795.
Differential revision: https://reviews.llvm.org/D52277
llvm-svn: 342682
|
| |
|
|
| |
llvm-svn: 342639
|