summaryrefslogtreecommitdiffstats
path: root/llvm/test/tools/llvm-mca/X86
Commit message (Collapse)AuthorAgeFilesLines
...
* [X86] Move ReadAfterLd functionality into X86FoldableSchedWrite (PR36957)Simon Pilgrim2018-10-052-181/+181
| | | | | | | | | | | | Currently we hardcode instructions with ReadAfterLd if the register operands don't need to be available until the folded load has completed. This doesn't take into account the different load latencies of different memory operands (PR36957). This patch adds a ReadAfterFold def into X86FoldableSchedWrite to replace ReadAfterLd, allowing us to specify the load latency at a scheduler class level. I've added ReadAfterVec*Ld classes that match the XMM/Scl, XMM and YMM/ZMM WriteVecLoad classes that we currently use, we can tweak these values in future patches once this infrastructure is in place. Differential Revision: https://reviews.llvm.org/D52886 llvm-svn: 343868
* [llvm-mca][x86] Add PR36951 ReadAfterLd test caseSimon Pilgrim2018-10-041-0/+50
| | | | llvm-svn: 343795
* [utils] Ensure that update_mca_test_checks.py writes prefixes in ↵Greg Bedwell2018-10-0411-253/+253
| | | | | | alphabetical order llvm-svn: 343783
* [llvm-mca][x86] Add tests demonstrating ReadAfterLd delaySimon Pilgrim2018-10-042-0/+368
| | | | llvm-svn: 343773
* [X86][Btver2] Fix MMX PSHUFB scheduleSimon Pilgrim2018-10-031-5/+5
| | | | | | Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343701
* [llvm-mca] Add support for move elimination in class RegisterFile.Andrea Di Biagio2018-10-031-0/+96
| | | | | | | | | | | | | | | | | | | This patch teaches class RegisterFile how to analyze register writes from instructions that are move elimination candidates. In particular, it teaches it how to check if a move can be effectively eliminated by the underlying PRF, and (if necessary) how to perform move elimination. The long term goal is to allow processor models to describe instructions that are valid move elimination candidates. The idea is to let register file definitions in tablegen declare if/when moves can be eliminated. This patch is a non functional change. The logic that performs move elimination is currently disabled. A future patch will add support for move elimination in the processor models, and enable this new code path. llvm-svn: 343691
* [X86][Btver2] Most RMW instructions don't require an additional uopSimon Pilgrim2018-10-031-185/+185
| | | | | | | | Remove uop on WriteRMW and move it into the few instructions that need it. Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343671
* [X86] ALU/ADC RMW instructions should use the WriteRMW sequence classSimon Pilgrim2018-10-031-94/+94
| | | | | | | | | | | | | I was expecting this to be a nfc but Silvermont seems to be setup a little differently: // A folded store needs a cycle on MEC_RSV for the store data, but it does not need an extra port cycle to recompute the address. def : WriteRes<WriteRMW, [SLM_MEC_RSV]>; So moving from WriteStore to WriteRMW reduces predicted port pressure, confirmed by @craig.topper that this is correct. Differential Revision: https://reviews.llvm.org/D52740 llvm-svn: 343670
* [X86][Btver2] Fix BLENDV and AESDEC schedulesSimon Pilgrim2018-10-023-35/+35
| | | | | | Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343597
* [X86][Btver2] Fix BT(C|R|S)mr & BT(C|R|S)mi schedule latency + uop countsSimon Pilgrim2018-10-011-18/+18
| | | | | | Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343494
* [X86][Btver2] Fix BTmr schedule uop countsSimon Pilgrim2018-10-011-3/+3
| | | | | | Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343484
* [X86][Btver2] Fix masked load scheduleSimon Pilgrim2018-10-011-5/+5
| | | | | | | | JFPU01 resource usage should match JFPX Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343468
* [X86][BtVer2] Teach how to identify zero-idiom VPERM2F128rr instructions.Andrea Di Biagio2018-10-011-15/+15
| | | | | | | | | | | This patch adds another variant class to identify zero-idiom VPERM2F128rr instructions. On Jaguar, a VPERM wih bit 3 and 7 of the mask set, is a zero-idiom. Differential Revision: https://reviews.llvm.org/D52663 llvm-svn: 343452
* [X86][Sched] Update scheduling information for VZEROALL on HWS, BDW, SKX, SNB.Clement Courbet2018-10-015-15/+15
| | | | | | | | | | | | | | | | | Summary: While looking at PR35606, I found out that the scheduling info is incorrect. One can check that it's really a P5+P6 and not a 2*P56 with: echo -e 'vzeroall\nvandps %xmm1, %xmm2, %xmm3' | ./bin/llvm-exegesis -mode=uops -snippets-file=- (vandps executes on P5 only) Reviewers: craig.topper, RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D52541 llvm-svn: 343447
* [X86] Fix scheduler class for BTmi instructionsSimon Pilgrim2018-09-303-21/+21
| | | | | | This wasn't treated as a folded load instruction llvm-svn: 343424
* [LLVM-MCA][X86] Add missing VCMPESTR/VCMPESTR testsSimon Pilgrim2018-09-308-7/+231
| | | | llvm-svn: 343421
* [LLVM-MCA][X86] Add some AVX512 testsSimon Pilgrim2018-09-303-1/+899
| | | | | | These are going to be necessary to check I don't mess up when I start cleaning up all the remaining vector integer overrides llvm-svn: 343414
* [X86][Btver2] Fix PCmpIStrI/PCmpIStrM schedulesSimon Pilgrim2018-09-301-5/+5
| | | | | | | | Missing JFPU0 pipe and double JFPU1 pipe (to match JVALU1) resources Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343413
* [llvm-mca] Add a test for zero-idiom VPERM2F128rr. NFCAndrea Di Biagio2018-09-281-0/+75
| | | | | | | | | | | We don't correctly model the latency and resource usage information for zero-idiom VPERM2F128rr on Jaguar. This is demonstrated by the incorrect numbers in the resource pressure view, and the timeline view. A follow up patch will fix this problem. llvm-svn: 343346
* [utils] Stricter checking from update_mca_test_checks.pyGreg Bedwell2018-09-282-28/+28
| | | | | | | | | | | | | | | | If any prefixes have been specified on the RUN lines that do not end up ever actually getting printed, raise an Error. This is either an indication that the run lines just need cleaning up, or that something is more fundamentally wrong with the test. Also raise an Error if there are any blocks which cannot be checked because they are not uniquely covered by a prefix. Fixed up a couple of tests where the extra checking flagged up issues. Differential Revision: https://reviews.llvm.org/D48276 llvm-svn: 343332
* [utils] Allow better identification of matching blocks in ↵Greg Bedwell2018-09-284-151/+127
| | | | | | | | | | | | | | update_mca_test_checks.py Insert empty blocks to cause the positions of matching blocks to match across lists where possible so that later stages of the algorithm can actually identify them as being identical. Regenerated all tests with this change. Differential Revision: https://reviews.llvm.org/D52560 llvm-svn: 343331
* [X86][Btver2] PSUBS/PSUBUS instructions are zero-idiomsSimon Pilgrim2018-09-281-148/+148
| | | | | | Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB llvm-svn: 343321
* [X86][Btver2] Add zero-idiom tests for PSUBS/PSUBUS instructionsSimon Pilgrim2018-09-281-88/+172
| | | | | | Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB llvm-svn: 343319
* [X86][Btver2] CVTSS2I/CVTSD2I - add missing JFPU0 pipeSimon Pilgrim2018-09-283-35/+35
| | | | | | | | We issue JFPU1->JSTC then JFPU0->JFPA then -> JALU0 (integer pipe) Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343314
* [X86][Btver2] Fix BSF/BSR scheduleSimon Pilgrim2018-09-282-43/+43
| | | | | | | | Double throughput to account for 2 pipes + fix BSF's latency/uop counts Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343311
* [X86][BtVer2] Fix PHMINPOS schedule resources typoSimon Pilgrim2018-09-282-8/+8
| | | | | | PHMINPOS can run on either JFPU pipe llvm-svn: 343299
* [X86][Btver2] (V)MPSADBW instructions take 3uops not 1Simon Pilgrim2018-09-272-4/+4
| | | | llvm-svn: 343238
* [X86][Btver2] BTC/BTR/BTS instructions take 2uops not 1Simon Pilgrim2018-09-271-18/+18
| | | | llvm-svn: 343234
* [X86][Btver2] BLSI/BLSMSK/BLSR instructions take 2uops not 1 (same as TZCNT)Simon Pilgrim2018-09-271-12/+12
| | | | llvm-svn: 343227
* [X86][Btver2] TZCNT instructions take 2uops not 1Simon Pilgrim2018-09-271-4/+4
| | | | llvm-svn: 343200
* Revert rL342916: [X86] Remove shift/rotate by CL memory (RMW) overridesSimon Pilgrim2018-09-251-8/+8
| | | | | | | | As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead. The uops are slightly different to the register variant, so requires a +1uop tweak llvm-svn: 342969
* [X86] Remove shift/rotate by CL memory (RMW) overridesSimon Pilgrim2018-09-241-8/+8
| | | | | | The uops are slightly different to the register variant, so requires a +1uop tweak llvm-svn: 342916
* [X86] Split WriteIMul into 8/16/32/64 implementations (PR36931)Simon Pilgrim2018-09-245-25/+25
| | | | | | | | Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases. This removes all the scheduler overrides for gpr multiplies and stops WriteMULH being ignored for BMI2 MULX instructions. llvm-svn: 342892
* [X86] ROR*mCL instruction models should match ROL*mCL etc.Simon Pilgrim2018-09-234-36/+36
| | | | | | | | Confirmed with Craig Topper - fix a typo that was missing a Port4 uop for ROR*mCL instructions on some Intel models. Yet another step on the scheduler model cleanup marathon...... llvm-svn: 342846
* [X86] Added missing RCL/RCR schedule overrides to the generic SNB modelSimon Pilgrim2018-09-232-194/+194
| | | | | | | | | | | | The SandyBridge model was missing schedule values for the RCL/RCR values - instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults. I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well. This is necessary to allow WriteRotate to be updated to remove other rotate overrides. It'd probably be a good idea to investigate a WriteRotateCarry class at some point but its not high priority given the unusualness of these instructions. llvm-svn: 342842
* [X86][Sched] Add zero idiom sched data to the SNB model.Clement Courbet2018-09-211-0/+348
| | | | | | | | | | | | | | | | | | | | Summary: On SNB, renamer-based zeroing does not work for: - 16 and 8-bit GPRs[1]. - MMX [2]. - ANDN variants [3] [1] echo 'sub %ax, %ax' | /tmp/llvm-exegesis -mode=uops -snippets-file=- [2] echo 'pxor %mm0, %mm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=- [3] echo 'andnps %xmm0, %xmm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=- Reviewers: RKSimon, andreadb Subscribers: gbedwell, craig.topper, llvm-commits Differential Revision: https://reviews.llvm.org/D52358 llvm-svn: 342736
* [X86][BtVer2] Fix latency and resource cycles of AVX 256-bit zero-idioms.Andrea Di Biagio2018-09-211-42/+42
| | | | | | | | | | | | | | | | This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr and VANDNP(S|D)Yrr. This is a follow-up of r342555. On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at register-renaming stage. The other opcode has to be executed to set the upper half of the destination YMM. Same for VANDNP(S|D)Yrr. Differential Revision: https://reviews.llvm.org/D52347 llvm-svn: 342728
* [llvm-mca][BtVer2] Modify ANDN tests in zero-idioms-avx-256.s. NFCAndrea Di Biagio2018-09-201-42/+42
| | | | | | | Two test cases should have tested 256-bit variants of VANDN zero-idioms instead of the 128-bit variants. llvm-svn: 342655
* [TableGen][SubtargetEmitter] Add the ability for processor models to ↵Andrea Di Biagio2018-09-191-0/+322
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | describe dependency breaking instructions. This patch adds the ability for processor models to describe dependency breaking instructions. Different processors may specify a different set of dependency-breaking instructions. That means, we cannot assume that all processors of the same target would use the same rules to classify dependency breaking instructions. The main goal of this patch is to provide the means to describe dependency breaking instructions directly via tablegen, and have the following TargetSubtargetInfo hooks redefined in overrides by tabegen'd XXXGenSubtargetInfo classes (here, XXX is a Target name). ``` virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const { return false; } virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const { return isZeroIdiom(MI); } ``` An instruction MI is a dependency-breaking instruction if a call to method isDependencyBreaking(MI) on the STI (TargetSubtargetInfo object) evaluates to true. Similarly, an instruction MI is a special case of zero-idiom dependency breaking instruction if a call to STI.isZeroIdiom(MI) returns true. The extra APInt is used for those targets that may want to select which machine operands have their dependency broken (see comments in code). Note that by default, subtargets don't know about the existence of dependency-breaking. In the absence of external information, those method calls would always return false. A new tablegen class named STIPredicate has been added by this patch to let processor models classify instructions that have properties in common. The idea is that, a MCInstrPredicate definition can be used to "generate" an instruction equivalence class, with the idea that instructions of a same class all have a property in common. STIPredicate definitions are essentially a collection of instruction equivalence classes. Also, different processor models can specify a different variant of the same STIPredicate with different rules (i.e. predicates) to classify instructions. Tablegen backends (in this particular case, the SubtargetEmitter) will be able to process STIPredicate definitions, and automatically generate functions in XXXGenSubtargetInfo. This patch introduces two special kind of STIPredicate classes named IsZeroIdiomFunction and IsDepBreakingFunction in tablegen. It also adds a definition for those in the BtVer2 scheduling model only. This patch supersedes the one committed at r338372 (phabricator review: D49310). The main advantages are: - We can describe subtarget predicates via tablegen using STIPredicates. - We can describe zero-idioms / dep-breaking instructions directly via tablegen in the scheduling models. In future, the STIPredicates framework can be used for solving other problems. Examples of future developments are: - Teach how to identify optimizable register-register moves - Teach how to identify slow LEA instructions (each subtarget defining its own concept of "slow" LEA). - Teach how to identify instructions that have undocumented false dependencies on the output registers on some processors only. It is also (in my opinion) an elegant way to expose knowledge to both external tools like llvm-mca, and codegen passes. For example, machine schedulers in LLVM could reuse that information when internally constructing the data dependency graph for a code region. This new design feature is also an "opt-in" feature. Processor models don't have to use the new STIPredicates. It has all been designed to be as unintrusive as possible. Differential Revision: https://reviews.llvm.org/D52174 llvm-svn: 342555
* [X86][BMI1] Fix BLSI/BLSMSK/BLSR BMI1 scheduling on btver2Simon Pilgrim2018-09-141-25/+25
| | | | | | These have the same behaviour as tzcnt on btver2 - confirmed with AMD 16h SOG, Agner and instlatx64. llvm-svn: 342235
* [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s.Andrea Di Biagio2018-09-031-87/+89
| | | | | | | | | | | | | | | | | | | | | | A ReadAdvance was incorrectly added to the SchedReadWrite list associated with the following SSE instructions: sqrtss sqrtsd rsqrtss rcpss As a consequence, a wrong operand latency was computed for the register operand used as the base address of the folded load operand. This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases. There is still a problem with correctly modeling partial register writes on XMM registers This other problem is currently tracked here: https://bugs.llvm.org/show_bug.cgi?id=38813 Differential Revision: https://reviews.llvm.org/D51542 llvm-svn: 341326
* [X86][BtVer2] Remove wrong ReadAdvance from AVX vbroadcast(ss|sd|f128) ↵Andrea Di Biagio2018-08-311-10/+10
| | | | | | | | | | | | | | | | | | | instructions. The presence of a ReadAdvance for input operand #0 is problematic because it changes the input latency of the register used as the base address for the folded load. A broadcast cannot start executing if the load address hasn't been computed yet. In the llvm-mca example, the VBROADCASTSS is dependent on the address generated by the LEAQ. That means, it cannot start until LEAQ reaches the write-back stage. If we apply ReadAdvance, then we wrongly assume that the load can start 3 cycles in advance. Differential Revision: https://reviews.llvm.org/D51534 llvm-svn: 341222
* [X86] Add llvm-mca tests that show how operand latency is wrongly computed ↵Andrea Di Biagio2018-08-311-0/+206
| | | | | | | | | | | for SSE sqrtss/sd and rcpss. According to the timeline view, sqrtss/sd/rcpss start executing before the load address for the memory operand is available. This problem is caused by the presence of a ReadAfterLd (a ReadAdvance). Those unary operations should not specify a ReadAdvance at all. llvm-svn: 341213
* [X86][BtVer2] Add an llvm-mca test that shows how the read latency of AVX ↵Andrea Di Biagio2018-08-311-0/+73
| | | | | | broadcastss on ymm registers is incorrectly set. llvm-svn: 341197
* [X86][BtVer2] Fix WriteFShuffle256 schedule write info.Andrea Di Biagio2018-08-311-11/+11
| | | | | | | | | | | | | | | This patch fixes the number of micro opcodes, and processor resource cycles for the following AVX instructions: vinsertf128rr/rm vperm2f128rr/rm vbroadcastf128 Tests have been regenerated using the usual scripts in the llvm/utils directory. Differential Revision: https://reviews.llvm.org/D51492 llvm-svn: 341185
* [llvm-mca] Report the number of dispatched micro opcodes in the ↵Andrea Di Biagio2018-08-3010-20/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DispatchStatistics view. This patch introduces the following changes to the DispatchStatistics view: * DispatchStatistics now reports the number of dispatched opcodes instead of the number of dispatched instructions. * The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of stall cycles against the total simulated cycles. This change allows users to easily compare dispatch group sizes with the processor DispatchWidth. Before this change, it was difficult to correlate the two numbers, since DispatchStatistics view reported numbers of instructions (instead of opcodes). DispatchWidth defines the maximum size of a dispatch group in terms of number of micro opcodes. The other change introduced by this patch is related to how DispatchStage generates "instruction dispatch" events. In particular: * There can be multiple dispatch events associated with a same instruction * Each dispatch event now encapsulates the number of dispatched micro opcodes. The number of micro opcodes declared by an instruction may exceed the processor DispatchWidth. Therefore, we cannot assume that instructions are always fully dispatched in a single cycle. DispatchStage knows already how to handle instructions declaring a number of opcodes bigger that DispatchWidth. However, DispatchStage always emitted a single instruction dispatch event (during the first simulated dispatch cycle) for instructions dispatched. With this patch, DispatchStage now correctly notifies multiple dispatch events for instructions that cannot be dispatched in a single cycle. A few views had to be modified. Views can no longer assume that there can only be one dispatch event per instruction. Tests (and docs) have been updated. Differential Revision: https://reviews.llvm.org/D51430 llvm-svn: 341055
* [X86] Improved sched model for X86 CMPXCHG* instructions.Andrew V. Tischenko2018-08-302-18/+20
| | | | | | Differential Revision: https://reviews.llvm.org/D50070 llvm-svn: 341024
* [llvm-mca] Add fields "Total uOps" and "uOps Per Cycle" to the report ↵Andrea Di Biagio2018-08-2970-164/+482
| | | | | | | | | | | | | | | | | | | | | | | | | | | generated by the SummaryView. This patch adds two new fields to the perf report generated by the SummaryView. Fields are now logically organized into two small groups; only the second group contains throughput indicators. Example: ``` Iterations: 100 Instructions: 300 Total Cycles: 414 Total uOps: 700 Dispatch Width: 4 uOps Per Cycle: 1.69 IPC: 0.72 Block RThroughput: 4.0 ``` This patch also updates the docs for llvm-mca. Due to the nature of this change, several tests in the tools/llvm-mca directory were affected, and had to be updated using script `update_mca_test_checks.py`. llvm-svn: 340946
* [llvm-mca] Don't disable the SummaryView if flag `-all-stats` is false.Andrea Di Biagio2018-08-292-6/+78
| | | | llvm-svn: 340945
* [llvm-mca][TimelineView] Force the same number of executions for every entry ↵Andrea Di Biagio2018-08-283-19/+19
| | | | | | | | | | | | | | | | | | | in the 'wait-times' table. This patch also uses colors to highlight problematic wait-time entries. A problematic entry is an entry with an high wait time that tends to match (or exceed) the size of the scheduler's buffer. Color RED is used if an instruction had to wait an average number of cycles which is bigger than (or equal to) the size of the underlying scheduler's buffer. Color YELLOW is used if the time (in cycles) spend waiting for the operands or pipeline resources is bigger than half the size of the underlying scheduler's buffer. Color MAGENTA is used if an instruction does not consume buffer resources according to the scheduling model. llvm-svn: 340825
OpenPOWER on IntegriCloud