llvm-svn: 343421
These are going to be necessary to check that I don't mess anything up when I start cleaning up all the remaining vector integer overrides.
llvm-svn: 343414
Add the missing JFPU0 pipe and double JFPU1 pipe (to match JVALU1) resources.
Matches the AMD Fam16h SOG and llvm-exegesis tests.
llvm-svn: 343413
We don't correctly model the latency and resource usage information for the
zero-idiom VPERM2F128rr on Jaguar.
This is demonstrated by the incorrect numbers in the resource pressure view and
the timeline view.
A follow-up patch will fix this problem.
llvm-svn: 343346
If any prefixes specified on the RUN lines never actually end up getting
printed, raise an Error. This is an indication either that the RUN lines just
need cleaning up, or that something is more fundamentally wrong with the test.
Also raise an Error if there are any blocks which cannot be checked
because they are not uniquely covered by a prefix.
Fixed up a couple of tests where the extra checking flagged up issues.
Differential Revision: https://reviews.llvm.org/D48276
llvm-svn: 343332
update_mca_test_checks.py
Insert empty blocks so that the positions of matching blocks line up across
lists where possible, allowing later stages of the algorithm to actually
identify them as identical.
Regenerated all tests with this change.
Differential Revision: https://reviews.llvm.org/D52560
llvm-svn: 343331
Noticed during llvm-exegesis tests: the PSUBS/PSUBUS instructions have the same zero-idiom behaviour as PSUB.
llvm-svn: 343321
Noticed during llvm-exegesis tests: the PSUBS/PSUBUS instructions have the same zero-idiom behaviour as PSUB.
llvm-svn: 343319
We issue JFPU1->JSTC, then JFPU0->JFPA, then JALU0 (integer pipe).
Matches the AMD Fam16h SOG and llvm-exegesis tests.
llvm-svn: 343314
Double the throughput to account for the 2 pipes and fix BSF's latency/uop counts.
Matches the AMD Fam16h SOG and llvm-exegesis tests.
llvm-svn: 343311
PHMINPOS can run on either JFPU pipe
llvm-svn: 343299
llvm-svn: 343238
llvm-svn: 343234
llvm-svn: 343227
llvm-svn: 343200
As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead.
The uops are slightly different from the register variant, so this requires a +1 uop tweak.
llvm-svn: 342969
The uops are slightly different from the register variant, so this requires a +1 uop tweak.
llvm-svn: 342916
Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases.
This removes all the scheduler overrides for GPR multiplies and stops WriteMULH from being ignored for BMI2 MULX instructions.
llvm-svn: 342892
Confirmed with Craig Topper - fix a typo that omitted a Port4 uop for ROR*mCL instructions on some Intel models.
Yet another step in the scheduler model cleanup marathon...
llvm-svn: 342846
The SandyBridge model was missing schedule values for the RCL/RCR instructions, instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults.
I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well.
This is necessary to allow WriteRotate to be updated to remove other rotate overrides.
It'd probably be a good idea to investigate a WriteRotateCarry class at some point, but it's not a high priority given the unusualness of these instructions.
llvm-svn: 342842
Summary:
On SNB, renamer-based zeroing does not work for:
- 16 and 8-bit GPRs[1].
- MMX [2].
- ANDN variants [3]
[1] echo 'sub %ax, %ax' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[2] echo 'pxor %mm0, %mm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[3] echo 'andnps %xmm0, %xmm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
Reviewers: RKSimon, andreadb
Subscribers: gbedwell, craig.topper, llvm-commits
Differential Revision: https://reviews.llvm.org/D52358
llvm-svn: 342736
This patch introduces a SchedWriteVariant to describe the zero-idiom VXORP(S|D)Yrr
and VANDNP(S|D)Yrr.
This is a follow-up to r342555.
On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at the
register-renaming stage; the other has to be executed to set the upper half of
the destination YMM register. The same applies to VANDNP(S|D)Yrr.
Differential Revision: https://reviews.llvm.org/D52347
llvm-svn: 342728
Two test cases should have tested 256-bit variants of VANDN zero-idioms instead
of the 128-bit variants.
llvm-svn: 342655
describe dependency breaking instructions.
This patch adds the ability for processor models to describe dependency breaking
instructions.
Different processors may specify a different set of dependency-breaking
instructions.
That means we cannot assume that all processors of the same target would use
the same rules to classify dependency breaking instructions.
The main goal of this patch is to provide the means to describe dependency
breaking instructions directly via tablegen, and have the following
TargetSubtargetInfo hooks redefined in overrides by tablegen'd
XXXGenSubtargetInfo classes (here, XXX is a Target name).
```
virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const {
  return false;
}

virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const {
  return isZeroIdiom(MI, Mask);
}
```
An instruction MI is a dependency-breaking instruction if a call to
isDependencyBreaking(MI) on the STI (the TargetSubtargetInfo object) evaluates
to true. Similarly, an instruction MI is a zero idiom (a special case of
dependency-breaking instruction) if a call to STI.isZeroIdiom(MI) returns true.
The extra APInt is used by targets that may want to select which machine
operands have their dependency broken (see the comments in the code).
Note that, by default, subtargets don't know about the existence of
dependency-breaking instructions. In the absence of external information, those
method calls would always return false.
This patch adds a new tablegen class named STIPredicate to let processor models
classify instructions that have properties in common. The idea is that an
MCInstrPredicate definition can be used to "generate" an instruction
equivalence class, where instructions of the same class all share a property.
STIPredicate definitions are essentially collections of instruction equivalence
classes.
Also, different processor models can specify a different variant of the same
STIPredicate with different rules (i.e. predicates) to classify instructions.
Tablegen backends (in this particular case, the SubtargetEmitter) will be able
to process STIPredicate definitions, and automatically generate functions in
XXXGenSubtargetInfo.
This patch introduces two special kinds of STIPredicate classes, named
IsZeroIdiomFunction and IsDepBreakingFunction, in tablegen. It also adds
definitions for those in the BtVer2 scheduling model only.
This patch supersedes the one committed at r338372 (phabricator review: D49310).
The main advantages are:
- We can describe subtarget predicates via tablegen using STIPredicates.
- We can describe zero-idioms / dep-breaking instructions directly via
tablegen in the scheduling models.
In the future, the STIPredicates framework can be used to solve other problems.
Examples of future developments are:
- Teach how to identify optimizable register-register moves
- Teach how to identify slow LEA instructions (each subtarget defining its own
concept of "slow" LEA).
- Teach how to identify instructions that have undocumented false dependencies
on the output registers on some processors only.
It is also (in my opinion) an elegant way to expose knowledge to both external
tools, like llvm-mca, and codegen passes.
For example, machine schedulers in LLVM could reuse that information when
internally constructing the data dependency graph for a code region.
This new design is also an "opt-in" feature: processor models don't have to use
the new STIPredicates. It has all been designed to be as unintrusive as
possible.
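As a rough illustration of what a tablegen'd classifier could boil down to, here is a standalone sketch; the types, opcode names and mask encoding are illustrative stand-ins, not the actual SubtargetEmitter output.
```
// Standalone sketch of the kind of per-CPU predicate a tablegen'd
// XXXGenSubtargetInfo could provide. Opcodes and the mask encoding are
// illustrative only.
#include <cstdint>
#include <iostream>

enum Opcode { VXORPSrr, VANDNPSrr, VADDPSrr };

struct Inst {
  Opcode Op;
  unsigned Dst, Src1, Src2; // register operand indices
};

// Returns true if MI breaks dependencies on this hypothetical subtarget.
// Bits set in OperandMask identify which operands have their dependency broken.
bool isDependencyBreaking(const Inst &MI, uint64_t &OperandMask) {
  if (MI.Src1 != MI.Src2)
    return false;
  switch (MI.Op) {
  case VXORPSrr:
  case VANDNPSrr:
    OperandMask = 0b110; // the two source operands (operands 1 and 2)
    return true;
  default:
    return false;
  }
}

int main() {
  uint64_t Mask = 0;
  Inst Xor{VXORPSrr, 0, 1, 1}; // vxorps %ymm1, %ymm1, %ymm0
  std::cout << isDependencyBreaking(Xor, Mask) << " mask=" << Mask << '\n'; // 1 mask=6
}
```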
Differential Revision: https://reviews.llvm.org/D52174
llvm-svn: 342555
These have the same behaviour as tzcnt on btver2 - confirmed with AMD 16h SOG, Agner and instlatx64.
llvm-svn: 342235
A ReadAdvance was incorrectly added to the SchedReadWrite list associated with
the following SSE instructions:
sqrtss
sqrtsd
rsqrtss
rcpss
As a consequence, a wrong operand latency was computed for the register operand
used as the base address of the folded load operand.
This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases.
There is still a problem with correctly modeling partial register writes on XMM
registers. This other problem is currently tracked here:
https://bugs.llvm.org/show_bug.cgi?id=38813
Differential Revision: https://reviews.llvm.org/D51542
llvm-svn: 341326
instructions.
The presence of a ReadAdvance for input operand #0 is problematic
because it changes the input latency of the register used as the base address
for the folded load.
A broadcast cannot start executing if the load address hasn't been computed yet.
In the llvm-mca example, the VBROADCASTSS is dependent on the address generated
by the LEAQ. That means it cannot start until LEAQ reaches the write-back
stage. If we apply ReadAdvance, then we wrongly assume that the load can start 3
cycles in advance.
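A minimal standalone sketch of the latency arithmetic involved (cycle numbers are illustrative, and this is not llvm-mca code):
```
// Why a ReadAdvance on the folded-load base address is wrong: the load cannot
// begin before the address-producing instruction has written it back.
#include <algorithm>
#include <iostream>

int main() {
  int LeaWriteBack = 5; // cycle at which LEAQ produces the address (example)
  int ReadAdvance = 3;  // cycles the consumer is allowed to read "early"

  // With the (wrong) ReadAdvance on the base-address operand, the model lets
  // the broadcast's load start before the address exists:
  int WrongLoadStart = std::max(0, LeaWriteBack - ReadAdvance); // cycle 2

  // Without the ReadAdvance, the load waits for the address:
  int CorrectLoadStart = LeaWriteBack; // cycle 5

  std::cout << WrongLoadStart << " vs " << CorrectLoadStart << '\n';
}
```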
Differential Revision: https://reviews.llvm.org/D51534
llvm-svn: 341222
for SSE sqrtss/sd and rcpss.
According to the timeline view, sqrtss/sd/rcpss start executing before the load
address for the memory operand is available.
This problem is caused by the presence of a ReadAfterLd (a ReadAdvance). Those
unary operations should not specify a ReadAdvance at all.
llvm-svn: 341213
broadcastss on ymm registers is incorrectly set.
llvm-svn: 341197
This patch fixes the number of micro opcodes and processor resource cycles for
the following AVX instructions:
vinsertf128rr/rm
vperm2f128rr/rm
vbroadcastf128
Tests have been regenerated using the usual scripts in the llvm/utils directory.
Differential Revision: https://reviews.llvm.org/D51492
llvm-svn: 341185
DispatchStatistics view.
This patch introduces the following changes to the DispatchStatistics view:
* DispatchStatistics now reports the number of dispatched opcodes instead of
the number of dispatched instructions.
* The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of
stall cycles against the total simulated cycles.
This change allows users to easily compare dispatch group sizes with the
processor DispatchWidth.
Before this change, it was difficult to correlate the two numbers, since
DispatchStatistics view reported numbers of instructions (instead of opcodes).
DispatchWidth defines the maximum size of a dispatch group in terms of number of
micro opcodes.
The other change introduced by this patch is related to how DispatchStage
generates "instruction dispatch" events.
In particular:
* There can be multiple dispatch events associated with the same instruction.
* Each dispatch event now encapsulates the number of dispatched micro opcodes.
The number of micro opcodes declared by an instruction may exceed the processor
DispatchWidth. Therefore, we cannot assume that instructions are always fully
dispatched in a single cycle.
DispatchStage already knows how to handle instructions declaring a number of
opcodes bigger than DispatchWidth. However, DispatchStage always emitted a
single instruction dispatch event (during the first simulated dispatch cycle)
for each dispatched instruction.
With this patch, DispatchStage now correctly notifies multiple dispatch events
for instructions that cannot be dispatched in a single cycle.
A few views had to be modified. Views can no longer assume that there can only
be one dispatch event per instruction.
Tests (and docs) have been updated.
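A minimal standalone sketch of the dispatch-splitting idea (not the actual DispatchStage code):
```
// Split an instruction's micro opcodes across cycles when the uop count
// exceeds DispatchWidth; each cycle produces one dispatch event.
#include <algorithm>
#include <iostream>
#include <vector>

std::vector<unsigned> dispatchEvents(unsigned NumMicroOps, unsigned DispatchWidth) {
  std::vector<unsigned> EventsPerCycle; // uops dispatched on each cycle
  while (NumMicroOps > 0) {
    unsigned Dispatched = std::min(NumMicroOps, DispatchWidth);
    EventsPerCycle.push_back(Dispatched);
    NumMicroOps -= Dispatched;
  }
  return EventsPerCycle;
}

int main() {
  // A 10-uop instruction on a DispatchWidth=4 processor needs 3 dispatch
  // events: 4 + 4 + 2 uops.
  for (unsigned Uops : dispatchEvents(10, 4))
    std::cout << Uops << ' ';
  std::cout << '\n';
}
```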
Differential Revision: https://reviews.llvm.org/D51430
llvm-svn: 341055
Differential Revision: https://reviews.llvm.org/D50070
llvm-svn: 341024
generated by the SummaryView.
This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.
Example:
```
Iterations: 100
Instructions: 300
Total Cycles: 414
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 1.69
IPC: 0.72
Block RThroughput: 4.0
```
This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected and had to be updated using the `update_mca_test_checks.py` script.
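A minimal sketch showing how the two new throughput indicators relate to the totals in the example report above (assuming the straightforward ratios; not llvm-mca code):
```
// uOps Per Cycle = Total uOps / Total Cycles; IPC = Instructions / Total Cycles.
#include <cstdio>

int main() {
  const double Instructions = 300.0;
  const double TotalCycles = 414.0;
  const double TotalUOps = 700.0;

  std::printf("uOps Per Cycle: %.2f\n", TotalUOps / TotalCycles);    // 1.69
  std::printf("IPC:            %.2f\n", Instructions / TotalCycles); // 0.72
}
```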
llvm-svn: 340946
llvm-svn: 340945
in the 'wait-times' table.
This patch also uses colors to highlight problematic wait-time entries.
A problematic entry is an entry with a high wait time that tends to match (or
exceed) the size of the scheduler's buffer.
Color RED is used if an instruction had to wait an average number of cycles
which is bigger than (or equal to) the size of the underlying scheduler's
buffer.
Color YELLOW is used if the time (in cycles) spent waiting for operands or
pipeline resources is bigger than half the size of the underlying scheduler's
buffer.
Color MAGENTA is used if an instruction does not consume buffer resources
according to the scheduling model.
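A minimal standalone sketch of the coloring rule described above (not the actual view code):
```
#include <iostream>
#include <string>

std::string waitTimeColor(double AvgWaitCycles, unsigned BufferSize, bool UsesBuffer) {
  if (!UsesBuffer)
    return "MAGENTA"; // doesn't consume buffer resources per the sched model
  if (AvgWaitCycles >= BufferSize)
    return "RED";     // waits as long as (or longer than) the buffer is deep
  if (AvgWaitCycles > BufferSize / 2.0)
    return "YELLOW";  // waits more than half the buffer size
  return "DEFAULT";
}

int main() {
  std::cout << waitTimeColor(20.0, 16, true) << '\n'; // RED
  std::cout << waitTimeColor(10.0, 16, true) << '\n'; // YELLOW
}
```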
llvm-svn: 340825
Before this patch, the SchedulerStatistics only printed the maximum number of
buffer entries consumed in each scheduler's queue at a given point of the
simulation.
This patch restructures the reported table, and adds an extra field named
"Average number of used buffer entries" to it.
This patch also uses different colors to help identify bottlenecks caused by
high scheduler buffer pressure.
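A minimal standalone sketch of the new statistic (hypothetical per-cycle numbers; not the actual SchedulerStatistics code):
```
// Alongside the per-queue maximum, track the average number of buffer
// entries used per simulated cycle.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical per-cycle usage of one scheduler queue.
  std::vector<unsigned> UsedEntriesPerCycle = {2, 5, 7, 7, 6, 3};

  unsigned Max = 0, Sum = 0;
  for (unsigned Used : UsedEntriesPerCycle) {
    Max = std::max(Max, Used);
    Sum += Used;
  }
  double Average = double(Sum) / UsedEntriesPerCycle.size();

  std::printf("Max used buffer entries:     %u\n", Max);       // 7
  std::printf("Average used buffer entries: %.1f\n", Average); // 5.0
}
```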
llvm-svn: 340746
resource mask (an uint64_t value) to unsigned.
This patch fixes a regression introduced at revision 338702.
A processor resource mask was incorrectly implicitly truncated to an unsigned
quantity. Later on, the truncated mask was used to initialize an element of a
vector of processor resource descriptors.
On targets with more than 32 processor resources, some elements of the vector
are left uninitialized. As a consequence, this bug might have eventually caused
a crash due to a null dereference in the Scheduler.
This patch fixes PR38575, and adds a test for it.
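A minimal standalone sketch of this class of bug (not the actual llvm-mca code):
```
// A 64-bit processor-resource mask implicitly truncated to a 32-bit unsigned
// silently drops any resource bit at index 32 or above.
#include <cstdint>
#include <cstdio>

int main() {
  // A mask with only bit 40 set, i.e. a resource index above 31.
  uint64_t ResourceMask = 1ULL << 40;

  unsigned Truncated = ResourceMask; // implicit truncation: becomes 0!
  uint64_t Preserved = ResourceMask; // keeping the full width is the fix

  std::printf("truncated: %u\n", Truncated);                       // 0
  std::printf("preserved: %llu\n", (unsigned long long)Preserved); // 1099511627776
}
```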
llvm-svn: 339768
Differential Revision: https://reviews.llvm.org/D49912
llvm-svn: 339145
I've put CMPXCHG8B/CMPXCHG16B in the same file; even though technically they are under separate CPUID bits, all targets seem to support both (or neither).
llvm-svn: 338595
These aren't just available via 3DNow!, so test for them separately as well.
llvm-svn: 338584
Renamed the btver2 file that already contained them - the other targets were only testing the AVX versions
llvm-svn: 338583
Found by inspection.
llvm-svn: 338579
llvm-svn: 338576
We already added these to btver2; now add them to the other targets, even though none of their models treat them specially (yet).
llvm-svn: 338565
CPUID, IN/OUT, INS/OUTS, INT, PAUSE, SCAS, UD2, XLAT
llvm-svn: 338563
llvm-svn: 338550
llvm-svn: 338532
llvm-svn: 338514
These aren't exhaustive, but cover some instructions that are only available in 32-bit mode (where would we be without good BCD math performance?).
llvm-svn: 338404
This patch teaches llvm-mca how to identify dependency breaking instructions on
btver2.
An example of a dependency breaking instruction is the zero-idiom XOR (example:
`XOR %eax, %eax`), which always generates zero regardless of the actual value of
the input register operands.
Dependency breaking instructions don't have to wait on their input register
operands before executing. This is because the computation is not dependent on
the inputs.
Not all dependency breaking idioms are also zero-latency instructions. For
example, `CMPEQ %xmm1, %xmm1` is independent of the value of XMM1, and it
generates a vector of all-ones.
That instruction is not eliminated at the register renaming stage, and its
opcode is issued to a pipeline for execution. So, the latency is not zero.
This patch adds a new method named isDependencyBreaking() to the MCInstrAnalysis
interface. That method takes as input an instruction (i.e. MCInst) and a
MCSubtargetInfo.
The default implementation of isDependencyBreaking() conservatively returns
false for all instructions. Targets may override the default behavior for
specific CPUs, and return a value which better matches the subtarget behavior.
In the future, we should teach Tablegen how to automatically generate the body of
isDependencyBreaking from scheduling predicate definitions. This would allow us
to expose the knowledge about dependency breaking instructions to the machine
schedulers (and, potentially, other codegen passes).
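A standalone sketch of the zero-idiom vs. dependency-breaking distinction (illustrative types, not the real MCInstrAnalysis API):
```
// Every zero idiom is dependency breaking, but a dependency-breaking
// instruction (e.g. a same-register CMPEQ) is not necessarily a zero idiom
// and may still have non-zero latency.
#include <iostream>

enum Opcode { XORrr, PCMPEQDrr, ADDrr };

struct Inst {
  Opcode Op;
  unsigned Src1, Src2;
};

bool isZeroIdiom(const Inst &MI) {
  return MI.Op == XORrr && MI.Src1 == MI.Src2; // result is always zero
}

bool isDependencyBreaking(const Inst &MI) {
  if (isZeroIdiom(MI))
    return true;
  // A same-register compare produces all-ones regardless of the input value,
  // so it does not need to wait on its source register either.
  return MI.Op == PCMPEQDrr && MI.Src1 == MI.Src2;
}

int main() {
  Inst Xor{XORrr, 1, 1}, Cmp{PCMPEQDrr, 2, 2}, Add{ADDrr, 1, 2};
  std::cout << isZeroIdiom(Cmp) << ' ' << isDependencyBreaking(Cmp) << '\n'; // 0 1
  std::cout << isZeroIdiom(Xor) << ' ' << isDependencyBreaking(Add) << '\n'; // 1 0
}
```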
Differential Revision: https://reviews.llvm.org/D49310
llvm-svn: 338372