path: root/llvm/test/tools/llvm-mca/X86/BtVer2
Commit message (Author, Age, Files, Lines)
...
* [X86][Btver2] Fix MMX PSHUFB schedule (Simon Pilgrim, 2018-10-03, 1 file, -5/+5)
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343701
* [llvm-mca] Add support for move elimination in class RegisterFile. (Andrea Di Biagio, 2018-10-03, 1 file, -0/+96)
| This patch teaches class RegisterFile how to analyze register writes from instructions that are move elimination candidates. In particular, it teaches it how to check if a move can be effectively eliminated by the underlying PRF, and (if necessary) how to perform move elimination.
|
| The long term goal is to allow processor models to describe instructions that are valid move elimination candidates. The idea is to let register file definitions in tablegen declare if/when moves can be eliminated.
|
| This patch is a non functional change. The logic that performs move elimination is currently disabled. A future patch will add support for move elimination in the processor models, and enable this new code path.
| llvm-svn: 343691
* [X86][Btver2] Most RMW instructions don't require an additional uop (Simon Pilgrim, 2018-10-03, 1 file, -185/+185)
| Remove uop on WriteRMW and move it into the few instructions that need it.
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343671
* [X86][Btver2] Fix BLENDV and AESDEC schedules (Simon Pilgrim, 2018-10-02, 3 files, -35/+35)
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343597
* [X86][Btver2] Fix BT(C|R|S)mr & BT(C|R|S)mi schedule latency + uop counts (Simon Pilgrim, 2018-10-01, 1 file, -18/+18)
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343494
* [X86][Btver2] Fix BTmr schedule uop counts (Simon Pilgrim, 2018-10-01, 1 file, -3/+3)
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343484
* [X86][Btver2] Fix masked load schedule (Simon Pilgrim, 2018-10-01, 1 file, -5/+5)
| JFPU01 resource usage should match JFPX
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343468
* [X86][BtVer2] Teach how to identify zero-idiom VPERM2F128rr instructions. (Andrea Di Biagio, 2018-10-01, 1 file, -15/+15)
| This patch adds another variant class to identify zero-idiom VPERM2F128rr instructions. On Jaguar, a VPERM with bits 3 and 7 of the mask set is a zero-idiom.
| Differential Revision: https://reviews.llvm.org/D52663
| llvm-svn: 343452
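The mask condition in the entry above amounts to a bit test on the instruction's 8-bit immediate. A minimal C++ sketch, assuming the immediate is available as a plain byte; the helper name `isZeroIdiomVPerm2F128Mask` is hypothetical and is not the tablegen predicate added by the patch:

```cpp
#include <cstdint>

// Hypothetical helper (not part of LLVM): returns true when both bit 3 and
// bit 7 of the VPERM2F128 immediate are set. Bit 3 zeroes the low 128-bit
// half of the destination and bit 7 zeroes the high half, so with both set
// the result is all zeros regardless of the source registers.
constexpr bool isZeroIdiomVPerm2F128Mask(std::uint8_t Imm) {
  return (Imm & 0x88) == 0x88;
}
```

For example, an immediate of 0x88 qualifies, while 0x08 or 0x80 alone does not, because one half of the destination still depends on a source register.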
* [X86] Fix scheduler class for BTmi instructions (Simon Pilgrim, 2018-09-30, 1 file, -7/+7)
| This wasn't treated as a folded load instruction
| llvm-svn: 343424
* [LLVM-MCA][X86] Add missing VCMPESTR/VCMPESTR tests (Simon Pilgrim, 2018-09-30, 1 file, -1/+29)
| llvm-svn: 343421
* [X86][Btver2] Fix PCmpIStrI/PCmpIStrM schedules (Simon Pilgrim, 2018-09-30, 1 file, -5/+5)
| Missing JFPU0 pipe and double JFPU1 pipe (to match JVALU1) resources
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343413
* [llvm-mca] Add a test for zero-idiom VPERM2F128rr. NFC (Andrea Di Biagio, 2018-09-28, 1 file, -0/+75)
| We don't correctly model the latency and resource usage information for zero-idiom VPERM2F128rr on Jaguar. This is demonstrated by the incorrect numbers in the resource pressure view, and the timeline view.
| A follow up patch will fix this problem.
| llvm-svn: 343346
* [X86][Btver2] PSUBS/PSUBUS instructions are zero-idioms (Simon Pilgrim, 2018-09-28, 1 file, -148/+148)
| Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour as PSUB
| llvm-svn: 343321
* [X86][Btver2] Add zero-idiom tests for PSUBS/PSUBUS instructions (Simon Pilgrim, 2018-09-28, 1 file, -88/+172)
| Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour as PSUB
| llvm-svn: 343319
* [X86][Btver2] CVTSS2I/CVTSD2I - add missing JFPU0 pipe (Simon Pilgrim, 2018-09-28, 3 files, -35/+35)
| We issue JFPU1->JSTC then JFPU0->JFPA then -> JALU0 (integer pipe)
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343314
* [X86][Btver2] Fix BSF/BSR schedule (Simon Pilgrim, 2018-09-28, 2 files, -43/+43)
| Double throughput to account for 2 pipes + fix BSF's latency/uop counts
| Match AMD Fam16h SOG + llvm-exegesis tests
| llvm-svn: 343311
* [X86][BtVer2] Fix PHMINPOS schedule resources typo (Simon Pilgrim, 2018-09-28, 2 files, -8/+8)
| PHMINPOS can run on either JFPU pipe
| llvm-svn: 343299
* [X86][Btver2] (V)MPSADBW instructions take 3uops not 1 (Simon Pilgrim, 2018-09-27, 2 files, -4/+4)
| llvm-svn: 343238
* [X86][Btver2] BTC/BTR/BTS instructions take 2uops not 1 (Simon Pilgrim, 2018-09-27, 1 file, -18/+18)
| llvm-svn: 343234
* [X86][Btver2] BLSI/BLSMSK/BLSR instructions take 2uops not 1 (same as TZCNT) (Simon Pilgrim, 2018-09-27, 1 file, -12/+12)
| llvm-svn: 343227
* [X86][Btver2] TZCNT instructions take 2uops not 1 (Simon Pilgrim, 2018-09-27, 1 file, -4/+4)
| llvm-svn: 343200
* [X86][BtVer2] Fix latency and resource cycles of AVX 256-bit zero-idioms. (Andrea Di Biagio, 2018-09-21, 1 file, -42/+42)
| This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr and VANDNP(S|D)Yrr.
| This is a follow-up of r342555.
|
| On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at register-renaming stage. The other opcode has to be executed to set the upper half of the destination YMM. Same for VANDNP(S|D)Yrr.
|
| Differential Revision: https://reviews.llvm.org/D52347
| llvm-svn: 342728
* [llvm-mca][BtVer2] Modify ANDN tests in zero-idioms-avx-256.s. NFC (Andrea Di Biagio, 2018-09-20, 1 file, -42/+42)
| Two test cases should have tested 256-bit variants of VANDN zero-idioms instead of the 128-bit variants.
| llvm-svn: 342655
* [TableGen][SubtargetEmitter] Add the ability for processor models to describe dependency breaking instructions. (Andrea Di Biagio, 2018-09-19, 1 file, -0/+322)
| This patch adds the ability for processor models to describe dependency breaking instructions.
|
| Different processors may specify a different set of dependency-breaking instructions. That means we cannot assume that all processors of the same target would use the same rules to classify dependency breaking instructions.
|
| The main goal of this patch is to provide the means to describe dependency breaking instructions directly via tablegen, and have the following TargetSubtargetInfo hooks redefined in overrides by tablegen'd XXXGenSubtargetInfo classes (here, XXX is a Target name).
|
| ```
| virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const {
|   return false;
| }
|
| virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const {
|   return isZeroIdiom(MI, Mask);
| }
| ```
|
| An instruction MI is a dependency-breaking instruction if a call to method isDependencyBreaking(MI) on the STI (TargetSubtargetInfo object) evaluates to true. Similarly, an instruction MI is a special case of zero-idiom dependency breaking instruction if a call to STI.isZeroIdiom(MI) returns true. The extra APInt is used for those targets that may want to select which machine operands have their dependency broken (see comments in code).
|
| Note that by default, subtargets don't know about the existence of dependency-breaking. In the absence of external information, those method calls would always return false.
|
| A new tablegen class named STIPredicate has been added by this patch to let processor models classify instructions that have properties in common. The idea is that a MCInstrPredicate definition can be used to "generate" an instruction equivalence class, with the idea that instructions of a same class all have a property in common.
|
| STIPredicate definitions are essentially a collection of instruction equivalence classes. Also, different processor models can specify a different variant of the same STIPredicate with different rules (i.e. predicates) to classify instructions. Tablegen backends (in this particular case, the SubtargetEmitter) will be able to process STIPredicate definitions, and automatically generate functions in XXXGenSubtargetInfo.
|
| This patch introduces two special kinds of STIPredicate classes named IsZeroIdiomFunction and IsDepBreakingFunction in tablegen. It also adds a definition for those in the BtVer2 scheduling model only.
|
| This patch supersedes the one committed at r338372 (phabricator review: D49310).
|
| The main advantages are:
| - We can describe subtarget predicates via tablegen using STIPredicates.
| - We can describe zero-idioms / dep-breaking instructions directly via tablegen in the scheduling models.
|
| In future, the STIPredicates framework can be used for solving other problems. Examples of future developments are:
| - Teach how to identify optimizable register-register moves
| - Teach how to identify slow LEA instructions (each subtarget defining its own concept of "slow" LEA).
| - Teach how to identify instructions that have undocumented false dependencies on the output registers on some processors only.
|
| It is also (in my opinion) an elegant way to expose knowledge to both external tools like llvm-mca, and codegen passes. For example, machine schedulers in LLVM could reuse that information when internally constructing the data dependency graph for a code region.
|
| This new design feature is also an "opt-in" feature. Processor models don't have to use the new STIPredicates. It has all been designed to be as unintrusive as possible.
|
| Differential Revision: https://reviews.llvm.org/D52174
| llvm-svn: 342555
* [X86][BMI1] Fix BLSI/BLSMSK/BLSR BMI1 scheduling on btver2 (Simon Pilgrim, 2018-09-14, 1 file, -25/+25)
| These have the same behaviour as tzcnt on btver2 - confirmed with AMD 16h SOG, Agner and instlatx64.
| llvm-svn: 342235
* [X86][BtVer2] Remove wrong ReadAdvance from AVX vbroadcast(ss|sd|f128) instructions. (Andrea Di Biagio, 2018-08-31, 1 file, -10/+10)
| The presence of a ReadAdvance for input operand #0 is problematic because it changes the input latency of the register used as the base address for the folded load.
|
| A broadcast cannot start executing if the load address hasn't been computed yet.
|
| In the llvm-mca example, the VBROADCASTSS is dependent on the address generated by the LEAQ. That means it cannot start until LEAQ reaches the write-back stage. If we apply ReadAdvance, then we wrongly assume that the load can start 3 cycles in advance.
|
| Differential Revision: https://reviews.llvm.org/D51534
| llvm-svn: 341222
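The effect of a ReadAdvance on a dependent read can be illustrated with a little arithmetic. This is a sketch under the assumption that a ReadAdvance of N cycles shortens the latency observed by that operand by N cycles, clamped at zero; the function name and the latency values used with it are illustrative and are not taken from the BtVer2 model:

```cpp
// Illustrative helper (not an LLVM API): number of cycles a consumer waits on
// a producer whose write latency is WriteLatency, when the consuming operand
// carries a ReadAdvance of ReadAdvanceCycles. The result never drops below 0.
constexpr unsigned effectiveOperandLatency(unsigned WriteLatency,
                                           unsigned ReadAdvanceCycles) {
  return ReadAdvanceCycles >= WriteLatency ? 0u
                                           : WriteLatency - ReadAdvanceCycles;
}
```

With a 3-cycle ReadAdvance on the address register, a producer with a 2-cycle latency appears to cost 0 cycles, which is the over-optimistic case the patch removes. With no ReadAdvance, the consumer waits the full 2 cycles.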
* [X86][BtVer2] Add an llvm-mca test that shows how the read latency of AVX broadcastss on ymm registers is incorrectly set. (Andrea Di Biagio, 2018-08-31, 1 file, -0/+73)
| llvm-svn: 341197
* [X86][BtVer2] Fix WriteFShuffle256 schedule write info. (Andrea Di Biagio, 2018-08-31, 1 file, -11/+11)
| This patch fixes the number of micro opcodes, and processor resource cycles for the following AVX instructions:
|   vinsertf128rr/rm
|   vperm2f128rr/rm
|   vbroadcastf128
|
| Tests have been regenerated using the usual scripts in the llvm/utils directory.
|
| Differential Revision: https://reviews.llvm.org/D51492
| llvm-svn: 341185
* [llvm-mca] Report the number of dispatched micro opcodes in the DispatchStatistics view. (Andrea Di Biagio, 2018-08-30, 5 files, -12/+12)
| This patch introduces the following changes to the DispatchStatistics view:
| * DispatchStatistics now reports the number of dispatched opcodes instead of the number of dispatched instructions.
| * The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of stall cycles against the total simulated cycles.
|
| This change allows users to easily compare dispatch group sizes with the processor DispatchWidth. Before this change, it was difficult to correlate the two numbers, since the DispatchStatistics view reported numbers of instructions (instead of opcodes). DispatchWidth defines the maximum size of a dispatch group in terms of number of micro opcodes.
|
| The other change introduced by this patch is related to how DispatchStage generates "instruction dispatch" events. In particular:
| * There can be multiple dispatch events associated with the same instruction.
| * Each dispatch event now encapsulates the number of dispatched micro opcodes.
|
| The number of micro opcodes declared by an instruction may exceed the processor DispatchWidth. Therefore, we cannot assume that instructions are always fully dispatched in a single cycle. DispatchStage already knows how to handle instructions declaring a number of opcodes bigger than DispatchWidth. However, DispatchStage always emitted a single instruction dispatch event (during the first simulated dispatch cycle) for instructions dispatched. With this patch, DispatchStage now correctly notifies multiple dispatch events for instructions that cannot be dispatched in a single cycle.
|
| A few views had to be modified. Views can no longer assume that there can only be one dispatch event per instruction.
|
| Tests (and docs) have been updated.
|
| Differential Revision: https://reviews.llvm.org/D51430
| llvm-svn: 341055
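The relationship between an instruction's micro opcode count and the number of dispatch events can be sketched as follows. This is a simplified model, not the DispatchStage code: it assumes the instruction starts at the beginning of a dispatch group and has the full DispatchWidth to itself in every cycle, and both function names are hypothetical. DispatchWidth must be non-zero.

```cpp
// Hypothetical helper: number of dispatch events (one per cycle) needed to
// dispatch NumMicroOps when at most DispatchWidth micro opcodes can be
// dispatched per cycle. This is a ceiling division.
constexpr unsigned numDispatchEvents(unsigned NumMicroOps,
                                     unsigned DispatchWidth) {
  return (NumMicroOps + DispatchWidth - 1) / DispatchWidth;
}

// Hypothetical helper: number of micro opcodes carried by the dispatch event
// with the given zero-based index. Events past the last one carry 0.
constexpr unsigned uopsInDispatchEvent(unsigned NumMicroOps,
                                       unsigned DispatchWidth,
                                       unsigned EventIndex) {
  return EventIndex * DispatchWidth >= NumMicroOps
             ? 0u
             : (NumMicroOps - EventIndex * DispatchWidth < DispatchWidth
                    ? NumMicroOps - EventIndex * DispatchWidth
                    : DispatchWidth);
}
```

Under these assumptions, an instruction declaring 5 micro opcodes on a processor with a DispatchWidth of 2 produces three dispatch events carrying 2, 2 and 1 micro opcodes, where a view written before this patch would have seen only one event.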
* [llvm-mca] Add fields "Total uOps" and "uOps Per Cycle" to the report generated by the SummaryView. (Andrea Di Biagio, 2018-08-29, 39 files, -39/+157)
| This patch adds two new fields to the perf report generated by the SummaryView. Fields are now logically organized into two small groups; only the second group contains throughput indicators.
|
| Example:
| ```
| Iterations:        100
| Instructions:      300
| Total Cycles:      414
| Total uOps:        700
|
| Dispatch Width:    4
| uOps Per Cycle:    1.69
| IPC:               0.72
| Block RThroughput: 4.0
| ```
|
| This patch also updates the docs for llvm-mca. Due to the nature of this change, several tests in the tools/llvm-mca directory were affected, and had to be updated using script `update_mca_test_checks.py`.
| llvm-svn: 340946
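The two ratios in the example report can be reproduced by hand. The sketch below assumes that "uOps Per Cycle" is Total uOps divided by Total Cycles and that IPC is Instructions divided by Total Cycles, which is consistent with the numbers shown; the helper name is hypothetical and TotalCycles must be non-zero.

```cpp
// Hypothetical helper: a per-cycle rate computed from an event count and the
// total number of simulated cycles.
constexpr double perCycle(unsigned Count, unsigned TotalCycles) {
  return static_cast<double>(Count) / TotalCycles;
}
```

With the values from the example, `perCycle(700, 414)` is roughly 1.691 and `perCycle(300, 414)` is roughly 0.725, matching the reported 1.69 and 0.72 once limited to two decimal places.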
* [llvm-mca][TimelineView] Force the same number of executions for every entry in the 'wait-times' table. (Andrea Di Biagio, 2018-08-28, 3 files, -19/+19)
| This patch also uses colors to highlight problematic wait-time entries. A problematic entry is an entry with a high wait time that tends to match (or exceed) the size of the scheduler's buffer.
|
| Color RED is used if an instruction had to wait an average number of cycles which is bigger than (or equal to) the size of the underlying scheduler's buffer.
|
| Color YELLOW is used if the time (in cycles) spent waiting for the operands or pipeline resources is bigger than half the size of the underlying scheduler's buffer.
|
| Color MAGENTA is used if an instruction does not consume buffer resources according to the scheduling model.
| llvm-svn: 340825
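The color rules above can be written down as a small classification function. This is a simplified sketch, not the TimelineView code: it folds the different wait-time quantities into a single average, and it assumes that a buffer size of 0 stands for "does not consume buffer resources". The enum and function names are hypothetical.

```cpp
// Hypothetical color categories for a wait-time entry.
enum class WaitColor { None, Yellow, Red, Magenta };

// Hypothetical helper: picks a highlight color from the average wait time
// (in cycles) and the size of the underlying scheduler's buffer.
// Assumption: BufferSize == 0 means the instruction uses no buffer entries.
constexpr WaitColor classifyWaitTime(double AvgWaitCycles,
                                     unsigned BufferSize) {
  return BufferSize == 0                    ? WaitColor::Magenta
         : AvgWaitCycles >= BufferSize      ? WaitColor::Red
         : AvgWaitCycles > BufferSize / 2.0 ? WaitColor::Yellow
                                            : WaitColor::None;
}
```

For a scheduler buffer of 20 entries, an average wait of 20 cycles or more is flagged red, anything above 10 and below 20 is flagged yellow, and 10 or fewer is left uncolored.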
* [llvm-mca] Improved report generated by the SchedulerStatistics view. (Andrea Di Biagio, 2018-08-27, 1 file, -3/+9)
| Before this patch, the SchedulerStatistics only printed the maximum number of buffer entries consumed in each scheduler's queue at a given point of the simulation.
|
| This patch restructures the reported table, and adds an extra field named "Average number of used buffer entries" to it. This patch also uses different colors to help identify bottlenecks caused by high scheduler buffer pressure.
| llvm-svn: 340746
* [X86] MCA tests for XCHG*, XADD* and CMPXCHG* instructions (Andrew V. Tischenko, 2018-08-07, 1 file, -1/+94)
| Differential Revision: https://reviews.llvm.org/D49912
| llvm-svn: 339145
* [llvm-mca][x86] Add CMPXCHG instruction resource tests (Simon Pilgrim, 2018-08-01, 1 file, -0/+42)
| I've put CMPXCHG8B/CMPXCHG16B in the same file; even though they are technically under separate CPUID bits, all targets seem to support both (or neither).
| llvm-svn: 338595
* [llvm-mca][x86] Add PREFETCHW instruction resource tests (Simon Pilgrim, 2018-08-01, 1 file, -0/+42)
| These aren't just available via 3DNow! so test for them separately as well.
| llvm-svn: 338584
* [llvm-mca][x86] Add PCLMUL instruction resource tests (Simon Pilgrim, 2018-08-01, 1 file, -0/+0)
| Renamed the btver2 file that already contained them - the other targets were only testing the AVX versions
| llvm-svn: 338583
* [llvm-mca] Correctly update the rank in `Scheduler::select()`. (Andrea Di Biagio, 2018-08-01, 1 file, -0/+112)
| Found by inspection.
| llvm-svn: 338579
* [llvm-mca][x86] Add SET/TEST instruction resource tests (Simon Pilgrim, 2018-08-01, 1 file, -1/+180)
| llvm-svn: 338576
* [llvm-mca][x86] Add more x86-64 system instruction resource tests (Simon Pilgrim, 2018-08-01, 1 file, -1/+92)
| CPUID, IN/OUT, INS/OUTS, INT, PAUSE, SCAS, UD2, XLAT
| llvm-svn: 338563
* [llvm-mca][x86] Add CMPS/LODS/MOVS/STOS string instruction resource tests (Simon Pilgrim, 2018-08-01, 1 file, -1/+53)
| llvm-svn: 338532
* [llvm-mca][x86] Add STC + STD instruction resource tests (Simon Pilgrim, 2018-08-01, 1 file, -1/+8)
| llvm-svn: 338514
* [llvm-mca][x86] Add 32-bit instruction resource tests (Simon Pilgrim, 2018-07-31, 1 file, -0/+84)
| These aren't exhaustive, but cover some instructions that are only available in 32-bit mode (where would we be without good BCD math performance?).
| llvm-svn: 338404
* [llvm-mca][BtVer2] Teach how to identify dependency-breaking idioms. (Andrea Di Biagio, 2018-07-31, 4 files, -104/+105)
| This patch teaches llvm-mca how to identify dependency breaking instructions on btver2.
|
| An example of a dependency breaking instruction is the zero-idiom XOR (example: `XOR %eax, %eax`), which always generates zero regardless of the actual value of the input register operands. Dependency breaking instructions don't have to wait on their input register operands before executing. This is because the computation is not dependent on the inputs.
|
| Not all dependency breaking idioms are also zero-latency instructions. For example, `CMPEQ %xmm1, %xmm1` is independent of the value of XMM1, and it generates a vector of all-ones. That instruction is not eliminated at register renaming stage, and its opcode is issued to a pipeline for execution. So, the latency is not zero.
|
| This patch adds a new method named isDependencyBreaking() to the MCInstrAnalysis interface. That method takes as input an instruction (i.e. MCInst) and a MCSubtargetInfo. The default implementation of isDependencyBreaking() conservatively returns false for all instructions. Targets may override the default behavior for specific CPUs, and return a value which better matches the subtarget behavior.
|
| In future, we should teach Tablegen how to automatically generate the body of isDependencyBreaking from scheduling predicate definitions. This would allow us to expose the knowledge about dependency breaking instructions to the machine schedulers (and, potentially, other codegen passes).
|
| Differential Revision: https://reviews.llvm.org/D49310
| llvm-svn: 338372
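The distinction between a dependency-breaking idiom and a zero-idiom can be sketched with a toy instruction type. Nothing below is the MCInstrAnalysis interface: the `ToyOpcode` and `ToyInst` types and both free functions are hypothetical, and only the two examples from the commit message (XOR and CMPEQ with identical operands) are modelled.

```cpp
// Toy model of a two-operand instruction (hypothetical, not an LLVM type).
enum class ToyOpcode { Xor, CmpEq, Add };

struct ToyInst {
  ToyOpcode Op;
  unsigned SrcReg;
  unsigned DstReg;
};

// True when the result does not depend on the input value: XOR of a register
// with itself is always zero, CMPEQ of a register with itself is all-ones.
constexpr bool isToyDependencyBreaking(const ToyInst &I) {
  return (I.Op == ToyOpcode::Xor || I.Op == ToyOpcode::CmpEq) &&
         I.SrcReg == I.DstReg;
}

// True only for the dependency-breaking case that produces zero.
constexpr bool isToyZeroIdiom(const ToyInst &I) {
  return I.Op == ToyOpcode::Xor && I.SrcReg == I.DstReg;
}
```

In this model `XOR r1, r1` is both dependency-breaking and a zero-idiom, `CMPEQ r1, r1` is dependency-breaking only, and `XOR r1, r2` is neither.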
* [llvm-mca][x86] Add movsx/movzx instructions to general x86_64 resource tests (Simon Pilgrim, 2018-07-20, 1 file, -1/+70)
| llvm-svn: 337586
* [X86][BtVer2] correctly model the latency/throughput of LEA instructions. (Andrea Di Biagio, 2018-07-19, 1 file, -171/+171)
| This patch fixes the latency/throughput of LEA instructions in the BtVer2 scheduling model.
|
| On Jaguar, a 3-operand LEA has a latency of 2cy, and a reciprocal throughput of 1. That is because it uses one cycle of SAGU followed by 1cy of ALU1. An LEA with a "Scale" operand is also slow, and it has the same latency profile as the 3-operand LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e. RThroughput of 2.0).
|
| This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td. The tablegen backend (for instruction-info) expands that definition into this (file X86GenInstrInfo.inc):
| ```
| static bool isThreeOperandsLEA(const MachineInstr &MI) {
|   return (
|     (
|       MI.getOpcode() == X86::LEA32r
|       || MI.getOpcode() == X86::LEA64r
|       || MI.getOpcode() == X86::LEA64_32r
|       || MI.getOpcode() == X86::LEA16r
|     )
|     && MI.getOperand(1).isReg()
|     && MI.getOperand(1).getReg() != 0
|     && MI.getOperand(3).isReg()
|     && MI.getOperand(3).getReg() != 0
|     && (
|       (
|         MI.getOperand(4).isImm()
|         && MI.getOperand(4).getImm() != 0
|       )
|       || (MI.getOperand(4).isGlobal())
|     )
|   );
| }
| ```
|
| A similar method is generated in the X86_MC namespace, and included into X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h).
|
| Back to the BtVer2 scheduling model: a new scheduling predicate named JSlowLEAPredicate now checks if either the instruction is a three-operands LEA, or it is an LEA with a Scale value different than 1. A variant scheduling class uses that new predicate to correctly select the appropriate latency profile.
|
| Differential Revision: https://reviews.llvm.org/D49436
| llvm-svn: 337469
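The "slow LEA" rule described in the entry above can be restated on a plain struct, which makes the two conditions easier to see than in the generated code. This is a sketch, not JSlowLEAPredicate itself: the `LeaOperands` type and the function names are hypothetical, register number 0 is taken to mean "no register" as in the generated predicate, and global-address displacements are not modelled.

```cpp
// Hypothetical flattened view of an LEA memory operand.
struct LeaOperands {
  unsigned BaseReg;   // 0 means no base register
  unsigned Scale;     // 1, 2, 4 or 8
  unsigned IndexReg;  // 0 means no index register
  long Disp;          // immediate displacement
};

// Three-operand form: base, index and a non-zero displacement all present.
constexpr bool isThreeOperandLea(const LeaOperands &A) {
  return A.BaseReg != 0 && A.IndexReg != 0 && A.Disp != 0;
}

// Slow on this model if it is a three-operand LEA or uses a scale other
// than 1, following the wording of the commit message.
constexpr bool isSlowLea(const LeaOperands &A) {
  return isThreeOperandLea(A) || A.Scale != 1;
}
```

For instance, `lea 8(%rax,%rbx), %rcx` and `lea (%rax,%rbx,4), %rcx` would both be classified as slow here, while `lea (%rax,%rbx), %rcx` and `lea 16(%rax), %rcx` would not.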
* [llvm-mca][x86] Add extend, carry-flag and CMP instructions to general x86_64 resource tests (Simon Pilgrim, 2018-07-17, 1 file, -1/+120)
| llvm-svn: 337306
* [llvm-mca][x86] Add MOVBE resource tests to all supporting targets (Simon Pilgrim, 2018-07-17, 1 file, -0/+56)
| SNB doesn't support MOVBE but the numbers in Generic (which use the SNB model) look sane.
| llvm-svn: 337305
* [llvm-mca][x86] Add BSWAP resource tests (Simon Pilgrim, 2018-07-17, 1 file, -1/+8)
| llvm-svn: 337302
* [llvm-mca][x86] Add displacement-only and additional scale=1 LEA tests (Simon Pilgrim, 2018-07-17, 1 file, -1/+82)
| llvm-svn: 337298
* [llvm-mca][x86] Add LEA resource tests (PR32326) (Simon Pilgrim, 2018-07-17, 1 file, -0/+362)
| Add llvm-mca tests demonstrating how LEA instructions are currently modelled. Once this is working on btver2 I'll copy the test file to the other target directories.
| llvm-svn: 337297