| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we hardcode instructions with ReadAfterLd if the register operands don't need to be available until the folded load has completed. This doesn't take into account the different load latencies of different memory operands (PR36957).
This patch adds a ReadAfterFold def into X86FoldableSchedWrite to replace ReadAfterLd, allowing us to specify the load latency at a scheduler class level.
I've added ReadAfterVec*Ld classes that match the XMM/Scl, XMM and YMM/ZMM WriteVecLoad classes that we currently use, we can tweak these values in future patches once this infrastructure is in place.
Differential Revision: https://reviews.llvm.org/D52886
llvm-svn: 343868
|
|
|
|
| |
llvm-svn: 343795
|
|
|
|
|
|
| |
alphabetical order
llvm-svn: 343783
|
|
|
|
| |
llvm-svn: 343773
|
|
|
|
|
|
| |
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343701
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch teaches class RegisterFile how to analyze register writes from
instructions that are move elimination candidates.
In particular, it teaches it how to check if a move can be effectively eliminated
by the underlying PRF, and (if necessary) how to perform move elimination.
The long term goal is to allow processor models to describe instructions that
are valid move elimination candidates.
The idea is to let register file definitions in tablegen declare if/when moves
can be eliminated.
This patch is a non functional change.
The logic that performs move elimination is currently disabled. A future patch
will add support for move elimination in the processor models, and enable this
new code path.
llvm-svn: 343691
|
|
|
|
|
|
|
|
| |
Remove uop on WriteRMW and move it into the few instructions that need it.
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343671
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I was expecting this to be a nfc but Silvermont seems to be setup a little differently:
// A folded store needs a cycle on MEC_RSV for the store data, but it does not need an extra port cycle to recompute the address.
def : WriteRes<WriteRMW, [SLM_MEC_RSV]>;
So moving from WriteStore to WriteRMW reduces predicted port pressure, confirmed by @craig.topper that this is correct.
Differential Revision: https://reviews.llvm.org/D52740
llvm-svn: 343670
|
|
|
|
|
|
| |
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343597
|
|
|
|
|
|
| |
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343494
|
|
|
|
|
|
| |
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343484
|
|
|
|
|
|
|
|
| |
JFPU01 resource usage should match JFPX
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343468
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds another variant class to identify zero-idiom VPERM2F128rr
instructions.
On Jaguar, a VPERM wih bit 3 and 7 of the mask set, is a zero-idiom.
Differential Revision: https://reviews.llvm.org/D52663
llvm-svn: 343452
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
While looking at PR35606, I found out that the scheduling info is incorrect.
One can check that it's really a P5+P6 and not a 2*P56 with:
echo -e 'vzeroall\nvandps %xmm1, %xmm2, %xmm3' | ./bin/llvm-exegesis -mode=uops -snippets-file=-
(vandps executes on P5 only)
Reviewers: craig.topper, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52541
llvm-svn: 343447
|
|
|
|
|
|
| |
This wasn't treated as a folded load instruction
llvm-svn: 343424
|
|
|
|
| |
llvm-svn: 343421
|
|
|
|
|
|
| |
These are going to be necessary to check I don't mess up when I start cleaning up all the remaining vector integer overrides
llvm-svn: 343414
|
|
|
|
|
|
|
|
| |
Missing JFPU0 pipe and double JFPU1 pipe (to match JVALU1) resources
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343413
|
|
|
|
|
|
|
|
|
|
|
| |
We don't correctly model the latency and resource usage information for
zero-idiom VPERM2F128rr on Jaguar.
This is demonstrated by the incorrect numbers in the resource pressure view, and
the timeline view.
A follow up patch will fix this problem.
llvm-svn: 343346
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If any prefixes have been specified on the RUN lines that do not end up
ever actually getting printed, raise an Error. This is either an
indication that the run lines just need cleaning up, or that something
is more fundamentally wrong with the test.
Also raise an Error if there are any blocks which cannot be checked
because they are not uniquely covered by a prefix.
Fixed up a couple of tests where the extra checking flagged up issues.
Differential Revision: https://reviews.llvm.org/D48276
llvm-svn: 343332
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
update_mca_test_checks.py
Insert empty blocks to cause the positions of matching blocks to match
across lists where possible so that later stages of the algorithm can
actually identify them as being identical.
Regenerated all tests with this change.
Differential Revision: https://reviews.llvm.org/D52560
llvm-svn: 343331
|
|
|
|
|
|
| |
Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB
llvm-svn: 343321
|
|
|
|
|
|
| |
Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB
llvm-svn: 343319
|
|
|
|
|
|
|
|
| |
We issue JFPU1->JSTC then JFPU0->JFPA then -> JALU0 (integer pipe)
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343314
|
|
|
|
|
|
|
|
| |
Double throughput to account for 2 pipes + fix BSF's latency/uop counts
Match AMD Fam16h SOG + llvm-exegesis tests
llvm-svn: 343311
|
|
|
|
|
|
| |
PHMINPOS can run on either JFPU pipe
llvm-svn: 343299
|
|
|
|
| |
llvm-svn: 343238
|
|
|
|
| |
llvm-svn: 343234
|
|
|
|
| |
llvm-svn: 343227
|
|
|
|
| |
llvm-svn: 343200
|
|
|
|
|
|
|
|
| |
As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead.
The uops are slightly different to the register variant, so requires a +1uop tweak
llvm-svn: 342969
|
|
|
|
|
|
| |
The uops are slightly different to the register variant, so requires a +1uop tweak
llvm-svn: 342916
|
|
|
|
|
|
|
|
| |
Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases.
This removes all the scheduler overrides for gpr multiplies and stops WriteMULH being ignored for BMI2 MULX instructions.
llvm-svn: 342892
|
|
|
|
|
|
|
|
| |
Confirmed with Craig Topper - fix a typo that was missing a Port4 uop for ROR*mCL instructions on some Intel models.
Yet another step on the scheduler model cleanup marathon......
llvm-svn: 342846
|
|
|
|
|
|
|
|
|
|
|
|
| |
The SandyBridge model was missing schedule values for the RCL/RCR values - instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults.
I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well.
This is necessary to allow WriteRotate to be updated to remove other rotate overrides.
It'd probably be a good idea to investigate a WriteRotateCarry class at some point but its not high priority given the unusualness of these instructions.
llvm-svn: 342842
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
On SNB, renamer-based zeroing does not work for:
- 16 and 8-bit GPRs[1].
- MMX [2].
- ANDN variants [3]
[1] echo 'sub %ax, %ax' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[2] echo 'pxor %mm0, %mm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[3] echo 'andnps %xmm0, %xmm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
Reviewers: RKSimon, andreadb
Subscribers: gbedwell, craig.topper, llvm-commits
Differential Revision: https://reviews.llvm.org/D52358
llvm-svn: 342736
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr
and VANDNP(S|D)Yrr.
This is a follow-up of r342555.
On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at
register-renaming stage. The other opcode has to be executed to set the upper
half of the destination YMM.
Same for VANDNP(S|D)Yrr.
Differential Revision: https://reviews.llvm.org/D52347
llvm-svn: 342728
|
|
|
|
|
|
|
| |
Two test cases should have tested 256-bit variants of VANDN zero-idioms instead
of the 128-bit variants.
llvm-svn: 342655
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
describe dependency breaking instructions.
This patch adds the ability for processor models to describe dependency breaking
instructions.
Different processors may specify a different set of dependency-breaking
instructions.
That means, we cannot assume that all processors of the same target would use
the same rules to classify dependency breaking instructions.
The main goal of this patch is to provide the means to describe dependency
breaking instructions directly via tablegen, and have the following
TargetSubtargetInfo hooks redefined in overrides by tabegen'd
XXXGenSubtargetInfo classes (here, XXX is a Target name).
```
virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const {
return false;
}
virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const {
return isZeroIdiom(MI);
}
```
An instruction MI is a dependency-breaking instruction if a call to method
isDependencyBreaking(MI) on the STI (TargetSubtargetInfo object) evaluates to
true. Similarly, an instruction MI is a special case of zero-idiom dependency
breaking instruction if a call to STI.isZeroIdiom(MI) returns true.
The extra APInt is used for those targets that may want to select which machine
operands have their dependency broken (see comments in code).
Note that by default, subtargets don't know about the existence of
dependency-breaking. In the absence of external information, those method calls
would always return false.
A new tablegen class named STIPredicate has been added by this patch to let
processor models classify instructions that have properties in common. The idea
is that, a MCInstrPredicate definition can be used to "generate" an instruction
equivalence class, with the idea that instructions of a same class all have a
property in common.
STIPredicate definitions are essentially a collection of instruction equivalence
classes.
Also, different processor models can specify a different variant of the same
STIPredicate with different rules (i.e. predicates) to classify instructions.
Tablegen backends (in this particular case, the SubtargetEmitter) will be able
to process STIPredicate definitions, and automatically generate functions in
XXXGenSubtargetInfo.
This patch introduces two special kind of STIPredicate classes named
IsZeroIdiomFunction and IsDepBreakingFunction in tablegen. It also adds a
definition for those in the BtVer2 scheduling model only.
This patch supersedes the one committed at r338372 (phabricator review: D49310).
The main advantages are:
- We can describe subtarget predicates via tablegen using STIPredicates.
- We can describe zero-idioms / dep-breaking instructions directly via
tablegen in the scheduling models.
In future, the STIPredicates framework can be used for solving other problems.
Examples of future developments are:
- Teach how to identify optimizable register-register moves
- Teach how to identify slow LEA instructions (each subtarget defining its own
concept of "slow" LEA).
- Teach how to identify instructions that have undocumented false dependencies
on the output registers on some processors only.
It is also (in my opinion) an elegant way to expose knowledge to both external
tools like llvm-mca, and codegen passes.
For example, machine schedulers in LLVM could reuse that information when
internally constructing the data dependency graph for a code region.
This new design feature is also an "opt-in" feature. Processor models don't have
to use the new STIPredicates. It has all been designed to be as unintrusive as
possible.
Differential Revision: https://reviews.llvm.org/D52174
llvm-svn: 342555
|
|
|
|
|
|
| |
These have the same behaviour as tzcnt on btver2 - confirmed with AMD 16h SOG, Agner and instlatx64.
llvm-svn: 342235
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A ReadAdvance was incorrectly added to the SchedReadWrite list associated with
the following SSE instructions:
sqrtss
sqrtsd
rsqrtss
rcpss
As a consequence, a wrong operand latency was computed for the register operand
used as the base address of the folded load operand.
This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases.
There is still a problem with correctly modeling partial register writes on XMM
registers This other problem is currently tracked here:
https://bugs.llvm.org/show_bug.cgi?id=38813
Differential Revision: https://reviews.llvm.org/D51542
llvm-svn: 341326
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
instructions.
The presence of a ReadAdvance for input operand #0 is problematic
because it changes the input latency of the register used as the base address
for the folded load.
A broadcast cannot start executing if the load address hasn't been computed yet.
In the llvm-mca example, the VBROADCASTSS is dependent on the address generated
by the LEAQ. That means, it cannot start until LEAQ reaches the write-back
stage. If we apply ReadAdvance, then we wrongly assume that the load can start 3
cycles in advance.
Differential Revision: https://reviews.llvm.org/D51534
llvm-svn: 341222
|
|
|
|
|
|
|
|
|
|
|
| |
for SSE sqrtss/sd and rcpss.
According to the timeline view, sqrtss/sd/rcpss start executing before the load
address for the memory operand is available.
This problem is caused by the presence of a ReadAfterLd (a ReadAdvance). Those
unary operations should not specify a ReadAdvance at all.
llvm-svn: 341213
|
|
|
|
|
|
| |
broadcastss on ymm registers is incorrectly set.
llvm-svn: 341197
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch fixes the number of micro opcodes, and processor resource cycles for
the following AVX instructions:
vinsertf128rr/rm
vperm2f128rr/rm
vbroadcastf128
Tests have been regenerated using the usual scripts in the llvm/utils directory.
Differential Revision: https://reviews.llvm.org/D51492
llvm-svn: 341185
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DispatchStatistics view.
This patch introduces the following changes to the DispatchStatistics view:
* DispatchStatistics now reports the number of dispatched opcodes instead of
the number of dispatched instructions.
* The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of
stall cycles against the total simulated cycles.
This change allows users to easily compare dispatch group sizes with the
processor DispatchWidth.
Before this change, it was difficult to correlate the two numbers, since
DispatchStatistics view reported numbers of instructions (instead of opcodes).
DispatchWidth defines the maximum size of a dispatch group in terms of number of
micro opcodes.
The other change introduced by this patch is related to how DispatchStage
generates "instruction dispatch" events.
In particular:
* There can be multiple dispatch events associated with a same instruction
* Each dispatch event now encapsulates the number of dispatched micro opcodes.
The number of micro opcodes declared by an instruction may exceed the processor
DispatchWidth. Therefore, we cannot assume that instructions are always fully
dispatched in a single cycle.
DispatchStage knows already how to handle instructions declaring a number of
opcodes bigger that DispatchWidth. However, DispatchStage always emitted a
single instruction dispatch event (during the first simulated dispatch cycle)
for instructions dispatched.
With this patch, DispatchStage now correctly notifies multiple dispatch events
for instructions that cannot be dispatched in a single cycle.
A few views had to be modified. Views can no longer assume that there can only
be one dispatch event per instruction.
Tests (and docs) have been updated.
Differential Revision: https://reviews.llvm.org/D51430
llvm-svn: 341055
|
|
|
|
|
|
| |
Differential Revision: https://reviews.llvm.org/D50070
llvm-svn: 341024
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
generated by the SummaryView.
This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.
Example:
```
Iterations: 100
Instructions: 300
Total Cycles: 414
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 1.69
IPC: 0.72
Block RThroughput: 4.0
```
This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected, and had to be updated using script `update_mca_test_checks.py`.
llvm-svn: 340946
|
|
|
|
| |
llvm-svn: 340945
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in the 'wait-times' table.
This patch also uses colors to highlight problematic wait-time entries.
A problematic entry is an entry with an high wait time that tends to match (or
exceed) the size of the scheduler's buffer.
Color RED is used if an instruction had to wait an average number of cycles
which is bigger than (or equal to) the size of the underlying scheduler's
buffer.
Color YELLOW is used if the time (in cycles) spend waiting for the
operands or pipeline resources is bigger than half the size of the underlying
scheduler's buffer.
Color MAGENTA is used if an instruction does not consume buffer resources
according to the scheduling model.
llvm-svn: 340825
|