| |
instructions. Add asserts to verify operand count
It appears the FIXME here was handled at some point; r159728 from 2012 seems to have addressed at least a portion of it.
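As a minimal, hypothetical sketch of the kind of operand-count assert this describes (the function name and the exact comparison are assumptions for illustration, not the code from this commit), checked where a MachineInstr is lowered to an MCInst:
```
#include <cassert>
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/MC/MCInst.h"

// Illustrative only: verify the operand count before lowering the instruction.
static void lowerToMCInst(const llvm::MachineInstr &MI, llvm::MCInst &OutMI) {
  assert(MI.getNumExplicitOperands() == MI.getDesc().getNumOperands() &&
         "Unexpected number of operands!");
  OutMI.setOpcode(MI.getOpcode());
  // ... lower each explicit operand ...
}
```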
Differential Revision: https://reviews.llvm.org/D66570
llvm-svn: 369665
| |
Single operand MUL instructions that implicitly set EAX have the following
latency/throughput profile (see below):
imul %cl # latency: 3cy - uOPs: 1 - 1 JMul
imul %cx # latency: 3cy - uOPs: 3 - 3 JMul
imul %ecx # latency: 3cy - uOPs: 2 - 2 JMul
imul %rcx # latency: 6cy - uOPs: 2 - 4 JMul
mul %cl # latency: 3cy - uOPs: 1 - 1 JMul
mul %cx # latency: 3cy - uOPs: 3 - 3 JMul
mul %ecx # latency: 3cy - uOPs: 2 - 2 JMul
mul %rcx # latency: 6cy - uOPs: 2 - 4 JMul
Excluding the 64-bit variant, which has a latency of 6cy, every other instruction
has a latency of 3cy. However, the number of decoded macro-opcodes (as well as
the resource cycles) depends on the MUL size.
The two-operand MULs have a more predictable profile (see below):
imul %dx, %dx # latency: 3cy - uOPs: 1 - 1 JMul
imul %edx, %edx # latency: 3cy - uOPs: 1 - 1 JMul
imul %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul
imul $3, %dx, %dx # latency: 4cy - uOPs: 2 - 2 JMul
imul $3, %ecx, %ecx # latency: 3cy - uOPs: 1 - 1 JMul
imul $3, %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul
This patch updates the values in the Jaguar scheduling model and regenerates
llvm-mca tests.
Differential Revision: https://reviews.llvm.org/D66547
llvm-svn: 369661
| |
On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum
throughput for XCHG is 1 IPC. The byte exchange has worse latency and decodes to
1 extra uOP; maximum observed throughput is 0.5 IPC.
```
xchgb %cl, %dl # Latency: 2cy - uOPs: 3 - 2 ALU
xchgw %cx, %dx # Latency: 1cy - uOPs: 2 - 2 ALU
xchgl %ecx, %edx # Latency: 1cy - uOPs: 2 - 2 ALU
xchgq %rcx, %rdx # Latency: 1cy - uOPs: 2 - 2 ALU
```
The reg-mem forms of XCHG are atomic operations with an observed latency of
16cy. The resource usage is similar to the XCHGrr variants. The biggest
difference is obviously the bus-locking, which prevents the LS from issuing other
memory uOPs in parallel until the unlocking store uOP is executed.
```
xchgb %cl, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
xchgw %cx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
xchgl %ecx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
xchgq %rcx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
```
The exchanged in/out register operand becomes available 11cy after the start of
execution. Added test xchg.s to verify that we correctly see that register write
committed in 11cy (and not 16cy).
Reg-reg XADD instructions have the same latency/throughput as the byte
exchange (register-register variant).
```
xaddb %cl, %dl # latency: 2cy - uOPs: 3 - 3 ALU
xaddw %cx, %dx # latency: 2cy - uOPs: 3 - 3 ALU
xaddl %ecx, %edx # latency: 2cy - uOPs: 3 - 3 ALU
xaddq %rcx, %rdx # latency: 2cy - uOPs: 3 - 3 ALU
```
The non-atomic RM variants have a latency of 11cy, and decode to 4
macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register
operand becomes available in 3cy (it matches the 'load-to-use latency').
```
xaddb %cl, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
xaddw %cx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
xaddl %ecx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
xaddq %rcx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
```
The atomic XADD variants execute in 16cy. The in/out register operand is
available after 11cy from the start of execution.
```
lock xaddb %cl, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddw %cx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddl %ecx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddq %rcx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
```
Added test xadd.s to verify those latencies as well as read-advance values.
Differential Revision: https://reviews.llvm.org/D66535
llvm-svn: 369642
| |
Allows for some cleanup in a lot of SSE/AVX vector splitting code
llvm-svn: 369640
| |
legalization.
I don't really understand the costs we're using for fp_to_sint,
but prior to widening legalization we used 20 as the cost for this
via the v2i64->v2f64 entry. That number seems better than the 40
we got with widening legalization. So now we need either a
v2i32->v2f64 entry or a v4i32->v2f64 entry, depending on whether
AVX is enabled, since we skip the first SSE2 table lookup
under AVX.
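For illustration only, a conversion cost-table row in X86TargetTransformInfo has the shape {ISD opcode, destination type, source type, cost}; the table name and the reuse of cost 20 below are assumptions for the sketch, not the committed values:
```
#include "llvm/CodeGen/CostTable.h"
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/Support/MachineValueType.h"
using namespace llvm;

// Hypothetical entries of the kind described above.
static const TypeConversionCostTblEntry ExampleConversionTbl[] = {
    {ISD::FP_TO_SINT, MVT::v2i32, MVT::v2f64, 20}, // without AVX
    {ISD::FP_TO_SINT, MVT::v4i32, MVT::v2f64, 20}, // with AVX (SSE2 lookup skipped)
};
```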
llvm-svn: 369628
| |
Reviewers: wxiao3, LuoYuanke, andrew.w.kaylor, craig.topper, annita.zhang, liutianle, pengfei, xiangzhangllvm, RKSimon, spatel, andreadb
Reviewed By: RKSimon
Subscribers: andreadb, hiraditya, llvm-commits
Tags: #llvm
Patch by Gen Pei (gpei)
Differential Revision: https://reviews.llvm.org/D65933
llvm-svn: 369612
| |
instructions.
We had an odd combination of WriteJump applied to some memory
instructions and WriteJumpLd applied to register and immediate
instructions.
This should hopefully assign them all correctly.
llvm-svn: 369599
| |
readability. NFC
llvm-svn: 369598
| |
We do this by merging the source with the high bits set to 0.
Differential Revision: https://reviews.llvm.org/D66181
llvm-svn: 369480
| |
(concat_vectors X, Y), undef)) -> (concat_vectors X, Y, undef, undef)
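A rough sketch of that fold in generic SelectionDAG terms, assuming the subvector is inserted at index 0 of an undef vector and the wider type is a whole multiple of the concat's operand type (illustrative only; the in-tree combine has additional checks):
```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// (insert_subvector undef, (concat_vectors X, Y), 0)
//   -> (concat_vectors X, Y, undef, ..., undef)
static SDValue foldInsertOfConcat(SDNode *N, SelectionDAG &DAG) {
  SDValue Dst = N->getOperand(0), Sub = N->getOperand(1), Idx = N->getOperand(2);
  if (!Dst.isUndef() || !isNullConstant(Idx) ||
      Sub.getOpcode() != ISD::CONCAT_VECTORS)
    return SDValue();
  EVT VT = N->getValueType(0);
  EVT SubOpVT = Sub.getOperand(0).getValueType();
  unsigned NumOps = VT.getVectorNumElements() / SubOpVT.getVectorNumElements();
  // Reuse the concat's operands and pad the remaining slots with undef.
  SmallVector<SDValue, 8> Ops(Sub->op_begin(), Sub->op_end());
  Ops.append(NumOps - Ops.size(), DAG.getUNDEF(SubOpVT));
  return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), VT, Ops);
}
```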
I also had to add a new combine to X86's combineExtractSubvector to prevent a regression.
This helps our vXi1 code see the full concat operation and allows it to optimize an undef to a zero if there is already a zero in the concat. This helped us use a movzx instead of an AND in some of the tests. In those tests, one concat comes from SelectionDAGBuilder and the second comes from type legalization of v4i1->i4 bitcasts, which uses an additional concat. These changes weren't my original motivation, though.
I'm looking at making X86ISelLowering's narrowShuffle emit a concat_vectors instead of an insert_subvector since concat_vectors is more canonical during early DAG combine. This patch helps prevent a regression from my experiments with that.
Differential Revision: https://reviews.llvm.org/D66456
llvm-svn: 369459
| |
This reverts r367088 (git commit 9ad565f70ec5fd3531056d7c939302d4ea970c83)
And the follow up fix r368631 / e9865b9b31bb2e6bc742dc6fca8f9f9517c3c43e
llvm-svn: 369457
| |
(v16i1 X), 0)))) -> (i8 (trunc (i16 (bitcast (v16i1 X))))) on KNL target
Without AVX512DQ we don't have KMOVB, so we can't really copy 8 bits of a k-register to a GPR. We have to copy 16 bits instead. We do this even if the DAG copy is from v8i1->v16i1.
If we detect the (i8 (bitcast (v8i1 (extract_subvector (v16i1 X), 0)))) pattern, we should rewrite the types to match the copy we do support. By doing this, we can help known bits to propagate without losing the upper 8 bits of the input to the extract_subvector. This allows some zero extends to be removed, since we have an isel pattern to use kmovw for (zero_extend (i16 (bitcast (v16i1 X)))).
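A rough DAG-level sketch of that type rewrite, assuming the full v16i1 value has already been pulled out of the matched pattern (names and placement are illustrative, not the code from this commit):
```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// (i8 (bitcast (v8i1 (extract_subvector V16, 0))))
//   -> (i8 (trunc (i16 (bitcast V16))))
static SDValue rewriteMaskCopy(SDValue V16, const SDLoc &DL, SelectionDAG &DAG) {
  SDValue Cast = DAG.getBitcast(MVT::i16, V16);          // copy all 16 mask bits
  return DAG.getNode(ISD::TRUNCATE, DL, MVT::i8, Cast);  // keep only the low 8
}
```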
Differential Revision: https://reviews.llvm.org/D66489
llvm-svn: 369434
| |
KMOVW and a SUBREG_TO_REG. Similar for i8 and anyextend.
We already had patterns for extending to i32 to take advantage of
the implicit zeroing of the upper bits of a 32-bit GPR that is
done by KMOVW/KMOVB. But the extend might be all the way to i64,
in which case the existing patterns would fail and we'd get a
KMOVW/B followed by a MOVZX. By adding patterns for i64 we can
use the fact that KMOVW/B zero the upper bits of the 32-bit GPR
and the normal property that 32-bit GPR writes implicitly zero the
upper 32-bits of the full 64-bit GPR.
The anyextend patterns are slightly different since we don't care
about the upper zeros. For the i8->i64 I think this avoids selecting
the anyextend as a MOVZX to prevent a partial register issue that
doesn't exist. For i16->i64 I think we would have just emitted an
insert_subreg on top of the extract_subreg that the vXi16->i16
bitcast pattern emits. The register coalescer or peephole pass
should combine those, but this saves that work and makes i8/16
consistent.
llvm-svn: 369431
| |
This avoids spurious relocation types for windows/elf targets.
Differential Revision: https://reviews.llvm.org/D66401
llvm-svn: 369426
| |
This is a follow-up of r369365.
llvm-svn: 369412
| |
llvm-svn: 369410
| |
Latency and throughput of LOCK INC/DEC/NEG/NOT are always 19cy.
Number of uOPs is still 1.
Differential Revision: https://reviews.llvm.org/D66469
llvm-svn: 369388
| |
On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC.
Throughput is capped at 0.33 because of the implicit in/out
dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the
same memory location, store-to-load forwarding occurs and values for subsequent
loads are quickly forwarded from the store buffer.
Interestingly, the functionality in LLVM that computes the reciprocal throughput
doesn't seem to know about RMW instructions. That functionality only looks at
the "consumed resource cycles" for the throughput computation. It should be
fixed/improved by a future patch. In particular, for RMW instructions, that
logic should also take into account the write latency of in/out register
operands.
An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to
~17cy/inst due to cache locking, which prevents other memory uOPs from executing
before the "lock releasing" store uOP.
CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less
macro opcode. Their latency tends to be the same as the other RR/RM variants. RR
variants are relatively fast (3cy), but still microcoded (5 macro opcodes).
CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load
forwarding. That means throughput is clearly limited by the in/out dependency
on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs
for the Integer pipes). I have reused the same mix of consumed resources from the
other CMPXCHG instructions for CMPXCHG8B too.
LOCK CMPXCHG8B is instead 18 cycles.
CMPXCHG16B is 32 cycles, and up to 38 cycles when the LOCK prefix is specified.
Due to the in/out dependencies, throughput is limited to 1 instruction every 32
(or 38) cycles, depending on whether the LOCK prefix is specified or not.
I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x the
microcode of CMPXCHG8B. So, I have speculatively set the JALU01 consumption to
2x the resource cycles used for CMPXCHG8B.
The two new hasLockPrefix() functions are used by the btver2 scheduling model to
check if an MCInst/MachineInstr has a LOCK prefix. Calls to hasLockPrefix() have
been encoded in predicates of variant scheduling classes that describe the
latency/throughput of CMPXCHG.
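A minimal sketch of what the MachineInstr-side helper could look like, assuming the LOCK prefix is recorded in the instruction's TSFlags via X86II::LOCK (the exact file and signature in the tree may differ):
```
#include "MCTargetDesc/X86BaseInfo.h"     // in-tree X86 target header
#include "llvm/CodeGen/MachineInstr.h"

// Sketch only: report whether the instruction carries a LOCK prefix.
static bool hasLockPrefix(const llvm::MachineInstr &MI) {
  return (MI.getDesc().TSFlags & llvm::X86II::LOCK) != 0;
}
```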
Differential Revision: https://reviews.llvm.org/D66424
llvm-svn: 369365
| |
line flag and all associated code, but leave it enabled by default
Google is reporting performance issues with the new default behavior
and has asked for a way to switch back to the old behavior while we
investigate and make fixes.
I've restored all of the code that had since been removed and added
additional checks of the command-line flag on code paths that are
not otherwise guarded by a check of getTypeAction.
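As a hedged sketch of such an escape hatch (the flag name and description below are illustrative assumptions, not necessarily the restored flag), this is typically a cl::opt that defaults to the new behavior:
```
#include "llvm/Support/CommandLine.h"

// Hypothetical flag: on by default, so the old behavior is strictly opt-in.
static llvm::cl::opt<bool> ExperimentalVectorWideningLegalization(
    "x86-experimental-vector-widening-legalization", llvm::cl::init(true),
    llvm::cl::desc("Use widening instead of promotion when legalizing vector "
                   "types (illustrative description)"),
    llvm::cl::Hidden);
```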
I've also modified the cost model tables to hopefully get us back
to the previous costs.
Hopefully we won't need to support this for very long, since we
have no test coverage of the old behavior and could very easily
break it.
llvm-svn: 369332
| |
than one undef element. Prioritize shifts over broadcast in lowerV8I16Shuffle.
The motivating case is the set of changes in vector-reduce-add.ll, where
we were doing extra work in the scalar domain instead of shuffling.
There may be some one-use check that needs to be looked into there,
but this patch sidesteps the issue by avoiding broadcasts that
aren't really broadcasting.
Differential Revision: https://reviews.llvm.org/D66071
llvm-svn: 369287
| |
on types that don't natively support KSHIFT.
We can support these by widening to a supported type,
then shifting all the way to the left and then
back to the right to ensure that we shift in zeroes.
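A hedged sketch of that trick for shifting a v8i1 mask right by Amt when only 16-bit KSHIFT exists; the widening helper is hypothetical and the exact node construction is an assumption for illustration, not this commit's code:
```
#include "X86ISelLowering.h"              // in-tree X86 target header (X86ISD)
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical helper, assumed to place V8 in the low lanes of VT with the
// upper lanes left undefined.
SDValue widenMaskVector(SDValue V8, MVT VT, const SDLoc &DL, SelectionDAG &DAG);

static SDValue lowerNarrowKShiftR(SDValue V8, unsigned Amt, const SDLoc &DL,
                                  SelectionDAG &DAG) {
  SDValue Wide = widenMaskVector(V8, MVT::v16i1, DL, DAG);
  // Shift all the way left so zeroes fill the bottom lanes...
  SDValue L = DAG.getNode(X86ISD::KSHIFTL, DL, MVT::v16i1, Wide,
                          DAG.getTargetConstant(8, DL, MVT::i8));
  // ...then shift back right by the padding plus Amt so zeroes are shifted in.
  SDValue R = DAG.getNode(X86ISD::KSHIFTR, DL, MVT::v16i1, L,
                          DAG.getTargetConstant(8 + Amt, DL, MVT::i8));
  // Extract the low 8 lanes back out as the v8i1 result.
  return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v8i1, R,
                     DAG.getIntPtrConstant(0, DL));
}
```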
llvm-svn: 369232
| |
widened vector to the KSHIFT node.
Not sure how to test this, as we have tests that exercise this code
but nothing failed for the types not matching. Since all the k-registers
use equivalent register classes, everything just ends up working.
llvm-svn: 369228
| |
only relies on undef.
This allows us to widen the type when the KSHIFTR instruction
doesn't exist for the type. If we need to shift in zeroes into
the upper elements we would need more work to guarantee zeroes
when widening.
llvm-svn: 369227
| |
with V2 as the source and V1 as the zero vector.
Shuffle canonicalization can swap the sources so the zero vector
might be V1 and the subvector that's being padded can be V2.
llvm-svn: 369226
| |
zero vectors followed by one non-zero vector followed by undef vectors.
For such a case we should only need a KSHIFTL, but we were
previously generating a KSHIFTL followed by a KSHIFTR because
we mistakenly believed we needed to zero the undef elements.
llvm-svn: 369224
| |
vXi1 vectors don't need special handling.
llvm-svn: 369222
| |
We can insert the value into a larger legal type and shift that
by the desired amount.
llvm-svn: 369215
| |
llvm-svn: 369213
| |
Add similar functionality to isShuffleEquivalent - if the mask elements don't match, try matching the BUILD_VECTOR scalars instead.
As target shuffles need to handle SM_Sentinel values, this can get a bit tricky, so this commit just adds actual mask element index handling - full SM_SentinelZero support will be added when the need arises.
Also, this enables support in matchVectorShuffleWithPACK.
llvm-svn: 369212
| |
Simplifies shuffle mask comparisons by just bailing out if the shuffle mask has any out-of-range values - this will make an upcoming patch much simpler.
llvm-svn: 369211
| |
v16i16->v16i8 truncate+store by extending to v16i32 and then emitting a v16i32->v16i8 truncstore.
This prevents us from emitting a separate truncate and a truncating
store instruction.
llvm-svn: 369200
| |
shuffle using DemandedElts mask (reapplied)
This reverts r368662 (git commit 1a8d790cf5f89c1df718844f13e934e39bef6ef5)
The compile-time regression repro is in https://bugs.llvm.org/show_bug.cgi?id=43024
llvm-svn: 369167
| |
This was a quick pass through some obvious places. I haven't tried the clang-tidy check.
I also replaced the zeroes in getX86SubSuperRegister with X86::NoRegister, which is the real sentinel name.
Differential Revision: https://reviews.llvm.org/D66363
llvm-svn: 369151
| |
Nothing calls this yet; everything still goes through the wrapper that doesn't take a DemandedElts mask (i.e. all elements demanded).
llvm-svn: 369136
| |
Eventually we need to generalize combineExtractWithShuffle to handle all faux shuffles and handle truncate (and X86ISD::VTRUNC etc.) there, but we're not ready yet (still creates nodes on the fly, incomplete DemandedElts support, bad use of recursive Depth limit).
llvm-svn: 369134
| |
llvm-svn: 369126
| |
We don't use anything from TargetOptions.h directly, and it's included via TargetLowering.h anyhow.
llvm-svn: 369110
| |
X86DAGToDAGISel::matchBitExtract so we can call insertDAGNode on the target constant.
This is needed to maintain the topological sort order.
Fixes PR42992.
llvm-svn: 369084
| |
Summary:
This clang-tidy check is looking for unsigned integer variables whose initializer
starts with an implicit cast from llvm::Register and changes the type of the
variable to llvm::Register (dropping the llvm:: where possible).
Partial reverts in:
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
X86FixupLEAs.cpp - Some functions return unsigned and arguably should be MCRegister
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
HexagonBitSimplify.cpp - Function takes BitTracker::RegisterRef which appears to be unsigned&
MachineVerifier.cpp - Ambiguous operator==() given MCRegister and const Register
PPCFastISel.cpp - No Register::operator-=()
PeepholeOptimizer.cpp - TargetInstrInfo::optimizeLoadInstr() takes an unsigned&
MachineTraceMetrics.cpp - MachineTraceMetrics lacks a suitable constructor
Manual fixups in:
ARMFastISel.cpp - ARMEmitLoad() now takes a Register& instead of unsigned&
HexagonSplitDouble.cpp - Ternary operator was ambiguous between unsigned/Register
HexagonConstExtenders.cpp - Has a local class named Register, used llvm::Register instead of Register.
PPCFastISel.cpp - PPCEmitLoad() now takes a Register& instead of unsigned&
Depends on D65919
Reviewers: arsenm, bogner, craig.topper, RKSimon
Reviewed By: arsenm
Subscribers: RKSimon, craig.topper, lenary, aemerson, wuzish, jholewinski, MatzeB, qcolombet, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, wdng, nhaehnle, sbc100, jgravelle-google, kristof.beyls, hiraditya, aheejin, kbarton, fedor.sergeev, javed.absar, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, tpr, PkmX, jocewei, jsji, Petar.Avramovic, asbirlea, Jim, s.egerton, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65962
llvm-svn: 369041
| |
use movq2dq instead of going through memory.
llvm-svn: 369031
| |
Now that we're using widening legalization, we need to improve our extract_subvector cost model for these types. This patch begins by modeling these as a subvector extract followed by a permute. I've left FIXMEs in the code for future improvements.
Differential Revision: https://reviews.llvm.org/D65892
llvm-svn: 369022
| |
Now that we've moved to C++14, we no longer need the llvm::make_unique
implementation from STLExtras.h. This patch is a mechanical replacement
of (hopefully) all the llvm::make_unique instances across the monorepo.
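For illustration (the type and arguments below are placeholders), the mechanical change at each call site is just:
```
#include <memory>

struct Foo { Foo(int, int) {} };        // placeholder type for the example

void example() {
  // Before: auto P = llvm::make_unique<Foo>(1, 2);   // helper from STLExtras.h
  auto P = std::make_unique<Foo>(1, 2);  // after: the C++14 standard library
  (void)P;
}
```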
llvm-svn: 369013
| |
If the last step in an FP add reduction allows reassociation and doesn't care
about -0.0, then we are free to recognize that computation as a reduction
that may reorder the intermediate steps.
This is requested directly by PR42705:
https://bugs.llvm.org/show_bug.cgi?id=42705
and solves PR42947 (if horizontal math instructions are actually faster than
the alternative):
https://bugs.llvm.org/show_bug.cgi?id=42947
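As a hedged, source-level illustration only (not the exact pattern the patch matches): the reassociation and no-signed-zeros flags typically come from fast-math settings on the accumulating adds, so a plain reduction loop compiled with reassociation enabled is free to be reordered into a horizontal reduction.
```
// Illustrative C++: with reassociation allowed (e.g. -ffast-math, which also
// implies no-signed-zeros), the intermediate sums may be reordered and the
// loop vectorized into shuffles plus a final horizontal add.
float reduce(const float *A, int N) {
  float Sum = 0.0f;
  for (int I = 0; I < N; ++I)
    Sum += A[I];
  return Sum;
}
```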
Differential Revision: https://reviews.llvm.org/D66236
llvm-svn: 368995
| |
bitcasted from x86mmx to MOVQ2DQ.
We already had the pattern for just the scalar to vector and bitcast,
but not the case where we wanted zeroes in the high half of the xmm.
llvm-svn: 368972
| |
pattern.
This pattern will narrow the load, so we should make sure it's
not volatile.
llvm-svn: 368971
| |
conversion to MMX.
fp_to_sint is turned into X86cvttp2si during isel preprocessing.
The other redundant isel patterns were removed previously, but I
missed this one because it's in the MMX td file.
llvm-svn: 368968
| |
The default legalization can take care of this.
llvm-svn: 368967
| |
The generic legalization handles this in the same way so just use
that.
llvm-svn: 368966
| |
llvm-svn: 368965
| |
If the width is 256 bits, then we must have AVX, so the else here
was unnecessary. Once that's removed, the >= 256-bit code is
identical to the 128-bit code with a different VT, so combine them.
llvm-svn: 368956