| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
registers.
The goal of this patch is to improve the throughput analysis in llvm-mca for the
case where instructions perform partial register writes.
On x86, partial register writes are quite difficult to model, mainly because
different processors tend to implement different register merging schemes in
hardware.
When the code contains partial register writes, the IPC (instructions per
cycles) estimated by llvm-mca tends to diverge quite significantly from the
observed IPC (using perf).
Modern AMD processors (at least, from Bulldozer onwards) don't rename partial
registers. Quoting Agner Fog's microarchitecture.pdf:
" The processor always keeps the different parts of an integer register together.
For example, AL and AH are not treated as independent by the out-of-order
execution mechanism. An instruction that writes to part of a register will
therefore have a false dependence on any previous write to the same register or
any part of it."
This patch is a first important step towards improving the analysis of partial
register updates. It changes the semantic of RegisterFile descriptors in
tablegen, and teaches llvm-mca how to identify false dependences in the presence
of partial register writes (for more details: see the new code comments in
include/Target/TargetSchedule.h - class RegisterFile).
This patch doesn't address the case where a write to a part of a register is
followed by a read from the whole register. On Intel chips, high8 registers
(AH/BH/CH/DH)) can be stored in separate physical registers. However, a later
(dirty) read of the full register (example: AX/EAX) triggers a merge uOp, which
adds extra latency (and potentially affects the pipe usage).
This is a very interesting article on the subject with a very informative answer
from Peter Cordes:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
In future, the definition of RegisterFile can be extended with extra information
that may be used to identify delays caused by merge opcodes triggered by a dirty
read of a partial write.
Differential Revision: https://reviews.llvm.org/D49196
llvm-svn: 337123
|
|
|
|
| |
llvm-svn: 337024
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before revision 336728, the "mayLoad" flag for instruction (V)MOVLPSrm was
inferred directly from the "default" pattern associated with the instruction
definition.
r336728 removed special node X86Movlps, and all the patterns associated to it.
Now instruction (V)MOVLPSrm doesn't have a pattern associated to it, and the
'mayLoad/hasSideEffects' flags are left unset.
When the instruction info is emitted by tablegen, method
CodeGenDAGPatterns::InferInstructionFlags() sees that (V)MOVLPSrm doesn't have a
pattern, and flags are undefined. So, it conservatively sets the
"hasSideEffects" flag for it.
As a consequence, we were losing the 'mayLoad' flag, and we were gaining a
'hasSideEffect' flag in its place.
This patch fixes the issue (originally reported by Michael Holmen).
The mca tests show the differences in the instruction info flags. Instructions
that were affected by this problem were: MOVLPSrm/VMOVLPSrm/VMOVLPSZ128rm.
Differential Revision: https://reviews.llvm.org/D49182
llvm-svn: 336818
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in the Instruction Info View. NFC
This makes easier to identify changes in the instruction info flags. It also
helps spotting potential regressions similar to the one recently introduced at
r336728.
Using the same character to mark MayLoad/MayStore/HasSideEffects is problematic
for llvm-lit. When pattern matching substrings, llvm-lit consumes tabs and
spaces. A change in position of the flag marker may not trigger a test failure.
This patch only changes the character used for flag `hasSideEffects`. The reason
why I didn't touch other flags is because I want to avoid spamming the mailing
because of the massive diff due to the numerous tests affected by this change.
In future, each instruction flag should be associated with a different character
in the Instruction Info View.
llvm-svn: 336797
|
|
|
|
|
|
|
|
|
|
| |
llvm-mca doesn't know that on modern AMD processors, portions of a general
purpose register are not treated independently. So, a partial register write has
a false dependency on the super-register.
The issue with partial register writes will be addressed by a follow-up patch.
llvm-svn: 336778
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
instruction.
This is a short-term fix for PR38093.
For now, we llvm::report_fatal_error if the instruction builder finds an
unsupported instruction in the instruction stream.
We need to revisit this fix once we start addressing PR38101.
Essentially, we need a better framework for error handling.
llvm-svn: 336543
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reviewers: RKSimon, andreadb, courbet
Reviewed By: RKSimon
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D48997
llvm-svn: 336510
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch modifies the Scheduler heuristic used to select the next instruction
to issue to the pipelines.
The motivating example is test X86/BtVer2/add-sequence.s, for which llvm-mca
wrongly reported an estimated IPC of 1.50. According to perf, the actual IPC for
that test should have been ~2.00.
It turns out that an IPC of 2.00 for test add-sequence.s cannot possibly be
predicted by a Scheduler that only prioritizes instructions based on their
"age". A similar issue also affected test X86/BtVer2/dependent-pmuld-paddd.s,
for which llvm-mca wrongly estimated an IPC of 0.84 instead of an IPC of 1.00.
Instructions in the ReadyQueue are now ranked based on two factors:
- The "age" of an instruction.
- The number of unique users of writes associated with an instruction.
The new logic still prioritizes older instructions over younger instructions to
minimize the pressure on the reorder buffer. However, the number of users of an
instruction now also affects the overall rank. This potentially increases the
ability of the Scheduler to extract instruction level parallelism. This patch
fixes the problem with the wrong IPC reported for test add-sequence.s and test
dependent-pmuld-paddd.s.
llvm-svn: 336420
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: As per `Agner's Microarchitecture doc
(21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions)`,
these, like zero-idioms, are dependency-breaking,
although they produce ones and still consume resources.
FIXME: as discussed in D48877, llvm-mca handling is broken for these.
Reviewers: andreadb
Reviewed By: andreadb
Subscribers: gbedwell, RKSimon, llvm-commits
Differential Revision: https://reviews.llvm.org/D48876
llvm-svn: 336292
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the upper portion of a super-register.
This patch teaches llvm-mca how to identify register writes that implicitly zero
the upper portion of a super-register.
On X86-64, a general purpose register is implemented in hardware as a 64-bit
register. Quoting the Intel 64 Software Developer's Manual: "an update to the
lower 32 bits of a 64 bit integer register is architecturally defined to zero
extend the upper 32 bits". Also, a write to an XMM register performed by an AVX
instruction implicitly zeroes the upper 128 bits of the aliasing YMM register.
This patch adds a new method named clearsSuperRegisters to the MCInstrAnalysis
interface to help identify instructions that implicitly clear the upper portion
of a super-register. The rest of the patch teaches llvm-mca how to use that new
method to obtain the information, and update the register dependencies
accordingly.
I compared the kernels from tests clear-super-register-1.s and
clear-super-register-2.s against the output from perf on btver2. Previously
there was a large discrepancy between the estimated IPC and the measured IPC.
Now the differences are mostly in the noise.
Differential Revision: https://reviews.llvm.org/D48225
llvm-svn: 335113
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
I ran llvm-exegesis on SKX, SKL, BDW, HSW, SNB.
Atom is from Agner and SLM is a guess.
I've left AMD processors alone.
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D48079
llvm-svn: 335097
|
|
|
|
|
|
|
| |
Histogram entries are now ordered by key. This should improves their
readability when statistics are printed.
llvm-svn: 334961
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
portion of a super-register.
On x86-64, a write to register EAX implicitly clears the upper half or RAX.
128-bit AVX instructions clear the upper 128-bit of the YMM register that
aliases the XMM definition register.
llvm-mca doesn't know about register writes that implicitly clear the upper
portion of an aliasing super-register. This issue will be fixed in a future patch.
llvm-svn: 334742
|
|
|
|
|
|
|
|
|
| |
This test checks that a physical register is correctly allocated for the partial
write to register BX.
The ADD instruction has to wait for the write to RBX (and BX) before being
executed.
llvm-svn: 334730
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ready to execute.
Fixes PR37790.
In some (very rare) cases, the LSUnit (Load/Store unit) was wrongly marking a
load (or store) as "ready to execute" effectively bypassing older memory barrier
instructions.
To reproduce this bug, the memory barrier must be the first instruction in the
input assembly sequence, and it doesn't have to perform any register writes.
llvm-svn: 334633
|
|
|
|
|
|
|
|
| |
should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), these instructions are dependency breaking and fast-path zero the destination register (and appropriate EFLAGS bits).
llvm-svn: 334303
|
|
|
|
|
|
| |
These aren't true zero-idiom instructions (just dependency breaking).
llvm-svn: 334297
|
|
|
|
|
|
|
|
| |
the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions).
llvm-svn: 334296
|
|
|
|
|
|
| |
Reduces output size and we're only wanting to check that the instructions are fast-path'd (just Dispatch+Retire) anyhow
llvm-svn: 334292
|
|
|
|
|
|
|
|
| |
dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), all these instructions are dependency breaking and zero the destination register.
llvm-svn: 334119
|
|
|
|
|
|
|
|
| |
in reg-reg cases
I noticed while working on zero-idiom + dependency-breaking support (PR36671) that most of our binary instruction tests were reusing the same src registers, which would cause the tests to fail once we enable scalar zero-idiom support on btver2. Fixed in all targets to keep them in sync.
llvm-svn: 334110
|
|
|
|
|
|
|
|
|
| |
dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), all these instructions are dependency breaking and zero the destination register.
TODO: Scalar instructions still need to be tested (need to check EFLAGS handling).
llvm-svn: 334104
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a fix for the problem arising in D47374 (PR37678):
https://bugs.llvm.org/show_bug.cgi?id=37678
We may not have throughput info because it's not specified in the model
or it's not available with variant scheduling, so assume that those
instructions can execute/complete at max-issue-width.
Differential Revision: https://reviews.llvm.org/D47723
llvm-svn: 334055
|
|
|
|
|
|
|
|
|
|
|
| |
presence of partial register updates.
This patch fixe the logic in ReadState::cycleEvent(). That method was not
correctly updating field `TotalCycles`.
Added extra code comments in class ReadState to better describe each field.
llvm-svn: 334028
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch is the last of a sequence of three patches related to LLVM-dev RFC
"MC support for variant scheduling classes".
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123181.html
This fixes PR36672.
The main goal of this patch is to teach llvm-mca how to solve variant scheduling
classes. This patch does that, plus it adds new variant scheduling classes to
the BtVer2 scheduling model to identify so-called zero-idioms (i.e. so-called
dependency breaking instructions that are known to generate zero, and that are
optimized out in hardware at register renaming stage).
Without the BtVer2 change, this patch would not have had any meaningful tests.
This patch is effectively the union of two changes:
1) a change that teaches llvm-mca how to resolve variant scheduling classes.
2) a change to the BtVer2 scheduling model that allows us to special-case
packed XOR zero-idioms (this partially fixes PR36671).
Differential Revision: https://reviews.llvm.org/D47374
llvm-svn: 333909
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
It's super irritating.
[properly configured] git client then complains about that double-newline,
and you have to use `--force` to ignore the warning, since even if you
fix it manually, it will be reintroduced the very next runtime :/
Reviewers: RKSimon, andreadb, courbet, craig.topper, javed.absar, gbedwell
Reviewed By: gbedwell
Subscribers: javed.absar, tschuett, gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D47697
llvm-svn: 333887
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
{FLDL2E, FLDL2T, FLDLG2, FLDLN2, FLDPI} were using WriteMicrocoded.
- I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
- For ZnVer1 and Atom, values were transferred form InstRWs.
- For SLM and BtVer2, I've guessed some values :(
Reviewers: RKSimon, craig.topper, andreadb
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D47585
llvm-svn: 333656
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
- I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
- For ZnVer1 and Atom, values were transferred form `InstRW`s.
- For SLM and BtVer2, values are from Agner.
This is split off from https://reviews.llvm.org/D47377
Reviewers: RKSimon, andreadb
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D47523
llvm-svn: 333642
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
After SNB, Intel CPUs can rename CF independently of other EFLAGS,
so the renamer can zero it for free. Note that STC still consumes resources.
To reproduce: `$ llvm-exegesis -mode=uops -opcode-name=CLC`
On SNB:
```
---
key:
opcode_name: CLC
mode: uops
config: ''
cpu_name: sandybridge
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: '3', value: 0.0014, debug_string: SBPort0 }
- { key: '4', value: 0.0013, debug_string: SBPort1 }
- { key: '5', value: 0.0003, debug_string: SBPort4 }
- { key: '6', value: 0.0029, debug_string: SBPort5 }
- { key: '10', value: 0.0003, debug_string: SBPort23 }
error: ''
info: 'instruction is serial, repeating a random one.
Snippet:
CLC
'
...
```
On HSW:
```
---
key:
opcode_name: CLC
mode: uops
config: ''
cpu_name: haswell
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: '3', value: 0.001, debug_string: HWPort0 }
- { key: '4', value: 0.0009, debug_string: HWPort1 }
- { key: '5', value: 0.0004, debug_string: HWPort2 }
- { key: '6', value: 0.0006, debug_string: HWPort3 }
- { key: '7', value: 0.0002, debug_string: HWPort4 }
- { key: '8', value: 0.0012, debug_string: HWPort5 }
- { key: '9', value: 0.0022, debug_string: HWPort6 }
- { key: '10', value: 0.0001, debug_string: HWPort7 }
error: ''
info: 'instruction is serial, repeating a random one.
Snippet:
CLC
'
...
```
Reviewers: craig.topper, RKSimon
Subscribers: gchatelet, llvm-commits
Differential Revision: https://reviews.llvm.org/D47362
llvm-svn: 333392
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously update_mca_test_checks worked entirely at "block" level where
a block is some sequence of lines delimited by at least one empty line.
This generally worked well, but could sometimes lead to excessive
repetition of check lines for various prefixes if some block was almost
identical between prefixes, but not quite (for example, due to a
different dispatch width in the otherwise identical summary views).
This new analyis attempts to split blocks further in the case where the
following conditions are met:
a) There is some prefix common to every RUN line (typically 'ALL').
b) The first line of the block is common to the output with every prefix.
c) The block has the same number of lines for the output with every prefix.
Also, regenerated all llvm-mca test files with the following command:
update_mca_test_checks.py "../test/tools/llvm-mca/*/*.s" "../test/tools/llvm-mca/*/*/*.s"
The new analysis showed a "multiple lines not disambiguated by prefixes" warning
for test "AArch64/Exynos/scheduler-queue-usage.s" so I've also added some
explicit prefixes to each of the RUN lines in that test.
Differential Revision: https://reviews.llvm.org/D47321
llvm-svn: 333204
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch implements the "block reciprocal throughput" computation in the
SummaryView.
The block reciprocal throughput is computed as the MAX of:
- NumMicroOps / DispatchWidth
- Resource Cycles / #Units (for every resource consumed).
The block throughput is bounded from above by the hardware dispatch throughput.
That is because the DispatchWidth is an upper bound on how many opcodes can be part
of a single dispatch group.
The block throughput is also limited by the amount of hardware parallelism. The
number of available resource units affects how the resource pressure is
distributed, and also how many blocks can be delivered every cycle.
llvm-svn: 333095
|
|
|
|
|
|
| |
Also, regenerate all tests.
llvm-svn: 332853
|
|
|
|
|
|
|
| |
This is to keep the Jaguar model's naming convention. Processor resources all
have a 'J' prefix in the BtVer2 scheduling model.
llvm-svn: 332851
|
|
|
|
|
|
|
|
|
|
| |
BtVer2 - fix NumMicroOp and account for the Lat+6cy GPR->XMM and Lat+1cy XMm->GPR delays (see rL332737)
The high number of MOVD/MOVQ equivalent instructions meant that there were a number of missed patterns in SNB/Znver1:
SNB - add missing GPR<->MMX costs (taken from Agner / Intel AOM)
Znver1 - add missing GPR<->XMM MOVQ costs (taken from Agner)
llvm-svn: 332745
|
|
|
|
|
|
| |
Include the 6cy delay transferring from the GPR to FPU.
llvm-svn: 332737
|
|
|
|
| |
llvm-svn: 332722
|
|
|
|
|
|
|
|
|
|
| |
WriteVecLoad/WriteVecStore scheduler classes
Retag some instructions that were missed when we split off vector load/store/moves - MOVQ/MOVD etc.
Fixes BtVer2/SLM which have different behaviours for GPR stores.
llvm-svn: 332718
|
|
|
|
|
|
|
|
|
|
| |
classes
Retag some instructions that were missed when we split off vector load/store/moves - MOVSS/MOVSD/MOVHPD/MOVHPD/MOVLPD/MOVLPS etc.
Fixes BtVer2/SLM which have different behaviours for GPR stores.
llvm-svn: 332714
|
|
|
|
| |
llvm-svn: 332622
|
|
|
|
| |
llvm-svn: 332616
|
|
|
|
| |
llvm-svn: 332447
|
|
|
|
|
|
|
|
| |
BtVer2 - Fixes schedules for (V)CVTPS2PD instructions
A lot of the Intel models still have too many InstRW overrides for these new classes - this needs cleaning up but I wanted to get the classes in first
llvm-svn: 332376
|
|
|
|
|
|
|
|
|
| |
Btver2 - VCVTPH2PSYrm needs to double pump the AGU
Broadwell - missing VCVTPS2PH*mr stores extra latency
Allows us to remove the WriteCvtF2FSt conversion store class
llvm-svn: 332357
|
|
|
|
|
|
| |
MMX was missing and YMM was tagged as a fp nt store
llvm-svn: 332269
|
|
|
|
| |
llvm-svn: 332262
|
|
|
|
| |
llvm-svn: 332257
|
|
|
|
|
|
| |
We still need to handle mmx/xmm moves as 'decode-only' no-pipe instructions
llvm-svn: 332109
|
|
|
|
|
|
| |
Fixes an issue on SLM/Btver2 where we had instructions were being treated as scalar loads/stores
llvm-svn: 332104
|
|
|
|
|
|
|
| |
This fixes a couple of BtVer2 missing instructions that weren't been handled in the override.
NOTE: There are still a lot of overrides that still need cleaning up!
llvm-svn: 331770
|
|
|
|
| |
llvm-svn: 331765
|