path: root/llvm/test/tools/llvm-mca/X86
Each entry below shows: commit subject (author, date, files changed, lines removed/added).
...
* [llvm-mca][X86] Add ADC/SBB with zero test cases (Simon Pilgrim, 2019-03-06, 11 files, -11/+803)
  Some targets have fast-path handling for these patterns that we should model.
  llvm-svn: 355498
* [MCA] Highlight kernel bottlenecks in the summary view. (Andrea Di Biagio, 2019-03-04, 3 files, -0/+263)
  This patch adds a new flag named -bottleneck-analysis to print out information about throughput bottlenecks.

  MCA knows how to identify and classify dynamic dispatch stalls. However, it doesn't know how to analyze and highlight kernel bottlenecks. The goal of this patch is to teach MCA how to correlate increases in backend pressure to backend stalls (and therefore the loss of throughput).

  From a Scheduler point of view, backend pressure is a function of the scheduler buffer usage (i.e. how the number of uOps in the scheduler buffers changes over time). Backend pressure increases (or decreases) when there is a mismatch between the number of opcodes dispatched and the number of opcodes issued in the same cycle. Since buffer resources are limited, continuous increases in backend pressure would eventually lead to dispatch stalls. So there is a strong correlation between dispatch stalls and how backpressure changed over time.

  This patch teaches MCA how to identify situations where backend pressure increases due to:
  - unavailable pipeline resources;
  - data dependencies.

  Data dependencies may delay execution of instructions and therefore increase the time that uOps have to spend in the scheduler buffers. That often translates to an increase in backend pressure which may eventually lead to a bottleneck. Contention on pipeline resources may also delay execution of instructions and lead to a temporary increase in backend pressure.

  Internally, the Scheduler classifies instructions based on whether register/memory operands are available or not. An instruction is marked as "ready to execute" only if data dependencies are fully resolved. Every cycle, the Scheduler attempts to execute all instructions that are ready to execute. If an instruction cannot execute because of unavailable pipeline resources, the Scheduler internally updates a BusyResourceUnits mask with the ID of each unavailable resource.

  ExecuteStage is responsible for tracking changes in backend pressure. If backend pressure increases during a cycle because of contention on pipeline resources, ExecuteStage sends a "backend pressure" event to the listeners. That event contains information about instructions delayed by resource pressure, as well as the BusyResourceUnits mask. ExecuteStage also knows how to identify situations where backpressure increased because of delays introduced by data dependencies.

  The SummaryView observes "backend pressure" events and prints out a "bottleneck report". Example:
  ```
  Cycles with backend pressure increase [ 99.89% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 0.00% ]
    Data Dependencies:      [ 99.89% ]
    - Register Dependencies [ 0.00% ]
    - Memory Dependencies   [ 99.89% ]
  ```
  A bottleneck report is printed out only if increases in backend pressure eventually caused backend stalls.

  About the time complexity: it is linear in the number of instructions in the Scheduler::PendingSet. The average slowdown tends to be in the range of ~5-6%. For memory-intensive kernels, the slowdown can be significant if flag -noalias=false is specified; in the worst case scenario I have observed a slowdown of ~30% with -noalias=false. We can definitely recover part of that slowdown if we optimize class LSUnit (by doing extra bookkeeping to speed up queries).

  For now, this new analysis is disabled by default, and it can be enabled via flag -bottleneck-analysis. Users of MCA as a library can enable the generation of pressure events through the constructor of ExecuteStage.

  This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494
  Differential Revision: https://reviews.llvm.org/D58728
  llvm-svn: 355308
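  For illustration, a minimal test along these lines (the kernel, cpu choice and iteration count are hypothetical, not taken from the commit; -bottleneck-analysis, -noalias and -iterations are the flags named above):
  ```
  # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -noalias=false -bottleneck-analysis %s
  # A loop-carried memory dependency: with -noalias=false the store below may
  # alias the next iteration's load, so the bottleneck report should attribute
  # the lost throughput to memory dependencies.
  vmovaps (%rsi), %xmm0
  vaddps  %xmm0, %xmm0, %xmm0
  vmovaps %xmm0, (%rsi)
  ```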
* [MCA] Always check if scheduler resources are unavailable when reporting dispatch stalls. (Andrea Di Biagio, 2019-02-26, 1 file, -1/+1)
  Dispatch stall cycles may be associated with multiple dispatch stall events. Before this patch, each stall cycle was associated with a single stall event. This patch also improves a couple of code comments, and adds a helper method to query the Scheduler for dispatch stalls.
  llvm-svn: 354877
* [X86] Correct some ADC/SBB with immediate scheduler data for Broadwell and Skylake. (Craig Topper, 2019-02-24, 3 files, -51/+51)
  Summary: The AX/EAX/RAX with immediate forms are 2 uops, just like the AL with immediate form. The modrm form with r8 and immediate is a single uop, just like the r16/r32/r64 with immediate forms.
  Reviewers: RKSimon, andreadb
  Reviewed By: RKSimon
  Subscribers: gbedwell, llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D58581
  llvm-svn: 354754
* [MCA] Correctly update register definitions in the PRF after move elimination. (Andrea Di Biagio, 2019-02-18, 1 file, -0/+119)
  This patch fixes a bug where register writes performed by optimizable register moves were sometimes wrongly treated like partial register updates. Before this patch, llvm-mca wrongly predicted an IPC of 1.50 for test reg-move-elimination-6.s (added by this patch). With this patch, llvm-mca correctly updates the register definitions in the PRF, and the IPC for that test is now correctly reported as 2.
  llvm-svn: 354271
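  A sketch of the kind of test involved (hypothetical contents; the committed reg-move-elimination-6.s is not reproduced here):
  ```
  # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -register-file-stats %s
  # GPR-to-GPR moves are optimizable on Jaguar; the fix ensures the writes to
  # %ecx and %edx below are tracked as full (eliminated) register definitions
  # in the PRF rather than being treated like partial register updates.
  movl %eax, %ecx
  movl %ecx, %edx
  addl %edx, %eax
  ```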
* [X86] Print all register forms of x87 fadd/fsub/fdiv/fmul as having two arguments, where one is %st. (Craig Topper, 2019-02-04, 11 files, -484/+484)
  All of these instructions consume one encoded register, and the other register is %st. They write the result either to %st or to the encoded register. Previously we printed both arguments when the encoded register was written, and only one argument when the result was written to %st. For the stack-popping forms the encoded register is always the destination, and we didn't print both operands. This was inconsistent with gcc and objdump, and just makes the output assembly harder to read. This patch changes things to always print both operands, making us consistent with gcc and objdump. The parser can still handle the single-register forms just as it did before. This also matches the GNU assembler behavior.
  llvm-svn: 353061
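  For example (my paraphrase of the resulting convention, combining this commit with the %st/%st(0) rule from the commits below, in objdump-style AT&T output):
  ```
  # Encoding D8 C1: the stack top is an implicit operand, printed as plain %st.
  fadd  %st(1), %st
  # Stack-popping form DE C1: both operands are now printed as well.
  faddp %st, %st(1)
  ```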
* [X86] Print %st(0) as %st when it is implicit to the instruction; continue printing it as %st(0) when it is encoded in the instruction. (Craig Topper, 2019-02-04, 11 files, -462/+462)
  This is a step back from the change I made in r352985. This appears to be more consistent with gcc and objdump behavior.
  llvm-svn: 353015
* Revert r352985 "[X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name to make MS inline asm work correctly" (Craig Topper, 2019-02-04, 11 files, -594/+594)
  Looking into gcc and objdump behavior more, this was overly aggressive. If the register is encoded in the instruction we should print %st(0); if it is implicit we should print %st. I'll be making a more directed change in a future patch.
  llvm-svn: 353013
* [X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name, to make MS inline asm work correctly (Craig Topper, 2019-02-03, 11 files, -594/+594)
  Summary: When calculating clobbers for MS-style inline assembly, we fail if the asm clobbers the stack top, because we print st(0) and try to pass it through the gcc register name check. This was found when I attempted to make emms/femms clobber all ST registers: if you use emms/femms in MS inline asm, we would try to use st(0) as the clobber name, but clang would think that wasn't a valid clobber name. This also matches what objdump disassembly prints, and what gcc -S prints.
  Reviewers: RKSimon, rnk, efriedma, spatel, andreadb, lebedev.ri
  Reviewed By: rnk
  Subscribers: eraman, gbedwell, lebedev.ri, llvm-commits
  Differential Revision: https://reviews.llvm.org/D57621
  llvm-svn: 352985
* [X86][BdVer2] Transfer delays from the integer to the floating point unit. (Roman Lebedev, 2019-02-01, 7 files, -46/+46)
  Summary: I'm unable to find this number in the "AMD SOG for family 15h". llvm-exegesis measures the latencies of these instructions as `2`, which matches the latencies specified in the "AMD SOG for family 15h". However, if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver, Steamroller and Excavator pipeline", "Data delay between different execution domains", the int->ivec transfer is listed as `8`..`10`cy of additional latency. Agner's "Instruction tables" for Piledriver also lists their latencies as `12`, which is consistent with `2`cy from exegesis / AMD SOG plus a `10`cy transfer delay. An additional data point comes from the fact that Agner's "Instruction tables" for Jaguar lists their latencies as `8`, and the "AMD SOG for family 16h" does state the `+6`cy int->ivec delay, which is consistent with an instruction latency of `1` or `2`.
  Reviewers: andreadb, RKSimon, craig.topper
  Reviewed By: andreadb
  Subscribers: gbedwell, courbet, llvm-commits
  Tags: #llvm
  Differential Revision: https://reviews.llvm.org/D57300
  llvm-svn: 352861
* [X86][Btver2] Improved latency/throughput model for scalar int-to-float conversions. (Andrea Di Biagio, 2019-01-29, 4 files, -33/+34)
  Account for bypass delays when computing the latency of scalar int-to-float conversions. On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG). This patch also fixes the number of micro opcodes for the register-memory variants of scalar int-to-float conversions.
  Differential Revision: https://reviews.llvm.org/D57148
  llvm-svn: 352518
* [NFC][MCA][X86][BdVer2] Cherry-pick int-to-ivec forwarding tests from BtVer2 (Roman Lebedev, 2019-01-27, 3 files, -0/+705)
  llvm-svn: 352317
* [llvm-mca][X86] Add some missing DQI tests (Simon Pilgrim, 2019-01-26, 8 files, -4/+3954)
  Match more of the coverage of test\CodeGen\X86\avx512-schedule.ll, as discussed on D57244.
  llvm-svn: 352273
* [llvm-mca][X86] Add missing shuffle tests (Simon Pilgrim, 2019-01-25, 8 files, -8/+3924)
  Match the coverage of test\CodeGen\X86\avx512-shuffle-schedule.ll so we can get rid of -print-schedule (and fix PR37160) without losing schedule tests.
  llvm-svn: 352179
* [MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit. (Andrea Di Biagio, 2019-01-23, 2 files, -29/+29)
  This patch adds a new ReadAdvance definition named ReadInt2Fpu. ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by data transfers from the integer unit to the floating point unit. ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all x86 models excluding BtVer2. That means this patch is a functional change for the Jaguar cpu model only.

  Tablegen definitions for instructions (V)PINSR* have been updated to account for the new ReadInt2Fpu. That read is mapped to the GPR input operand. On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch, that extra delay was added to the opcode latency. In practice, the insert opcode only executes for 1cy; most of the actual latency is contributed by the so-called operand latency. According to the AMD SOG for family 16h, (V)PINSR* latency is defined by expression f+1, where f is a forwarding delay from the integer unit to the fpu.

  When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC (only when flag -print-schedule is specified), we now need to account for any extra forwarding delays. We do this by checking if scheduling classes declare any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td: "A negative advance effectively increases latency, which may be used for cross-domain stalls". When computing the instruction latency for the purpose of our scheduling tests, we now add any extra delay to the formula. This avoids regressing existing codegen and mca schedule tests. It comes with the cost of an extra (but very simple) hook in MCSchedModel.

  Differential Revision: https://reviews.llvm.org/D57056
  llvm-svn: 351965
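  A sketch of a test that exercises the new read (hypothetical file; -instruction-info and -timeline are existing llvm-mca views):
  ```
  # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -instruction-info -timeline %s
  # The GPR source of vpinsrd crosses from the integer unit to the FPU; with
  # ReadInt2Fpu the +6cy on Jaguar is modeled as operand latency on %eax
  # instead of being folded into the opcode latency.
  vpinsrd $0, %eax, %xmm0, %xmm0
  ```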
* [llvm-mca][X86] Tidy up avx512 placeholder tests (Simon Pilgrim, 2019-01-22, 8 files, -156/+1348)
  Ensure we keep the avx512f/bw/dq + vl versions separate, and add example broadcast tests; this should let us improve the test coverage of test\CodeGen\X86\avx512-schedule.ll.
  llvm-svn: 351848
* [llvm-mca][X86] Add VPOPCNTDQ tests (Simon Pilgrim, 2019-01-22, 2 files, -0/+238)
  Matches the test coverage of test\CodeGen\X86\avx512vpopcntdq-schedule.ll.
  llvm-svn: 351842
* [llvm-mca][X86] Add missing CLWB/CLZERO/FSGSBASE/LWP/MWAITX/RDPID/SHA tests (Simon Pilgrim, 2019-01-22, 17 files, -0/+918)
  We're getting pretty close to matching/exceeding the test coverage of the test\CodeGen\X86\*-schedule.ll files, which should allow us to get rid of -print-schedule and fix PR37160.
  llvm-svn: 351836
* [llvm-mca][X86] Add missing enter/leave, invlpg/invlpga, rdmsr/wrmsr, rdpmc and rdtsc/rdtscp tests (Simon Pilgrim, 2019-01-22, 11 files, -13/+363)
  llvm-svn: 351835
* [llvm-mca][X86] Add missing mfence/pinsrw tests (Simon Pilgrim, 2019-01-22, 11 files, -11/+132)
  llvm-svn: 351831
* [llvm-mca][X86] Add missing monitor/mwait tests (Simon Pilgrim, 2019-01-22, 11 files, -10/+98)
  These technically should be under a MONITOR cpuid bit, but we tag them as SSE3, so I've done that here as well.
  llvm-svn: 351829
* [llvm-mca][X86] Add missing vperm2i128 tests (Simon Pilgrim, 2019-01-22, 6 files, -6/+48)
  llvm-svn: 351828
* [llvm-mca][X86] Add missing tzcntw tests (Simon Pilgrim, 2019-01-22, 8 files, -8/+64)
  llvm-svn: 351827
* [MCA] Add tests for int-to-fpu transfer delays. NFC (Andrea Di Biagio, 2019-01-22, 3 files, -0/+612)
  llvm-svn: 351822
* [X86][BtVer2] SSE2 vector shifts have local forwarding disabled (Simon Pilgrim, 2019-01-22, 2 files, -48/+48)
  Similar to the horizontal ops in D56777, the SSE2 (but not MMX) bit shift ops have local forwarding disabled, adding +1cy to the use latency for the result.
  Differential Revision: https://reviews.llvm.org/D57026
  llvm-svn: 351817
* [X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled (Simon Pilgrim, 2019-01-22, 1 file, -8/+8)
  Similar to the horizontal ops in D56777, the vpermilpd/vpermilps variable mask ops have local forwarding disabled, adding +1cy to the use latency for the result.
  Differential Revision: https://reviews.llvm.org/D57022
  llvm-svn: 351815
* [X86][BtVer2] Update latency of mmx horizontal operations (Simon Pilgrim, 2019-01-21, 1 file, -12/+12)
  D56777 added a +1cy local forwarding penalty for horizontal operations, but this penalty only affects the sse2/xmm variants; the mmx variants don't suffer the penalty. Confirmed with @andreadb.
  llvm-svn: 351755
* [X86][BtVer2] Update the WriteLoad latency. (Andrea Di Biagio, 2019-01-21, 6 files, -17/+17)
  r327630 introduced new write definitions for float/vector loads. Before that revision, WriteLoad was used by both integer and float (scalar/vector) loads, so it had to conservatively declare a latency of 5cy, the load-to-use latency for float/vector loads. Now that we have dedicated writes for float/vector loads, there is no reason to keep the WriteLoad latency at 5cy. At the moment, WriteLoad is used by scalar integer loads only; we can assume an optimistic 3cy latency for them. This patch changes that latency from 5cy to 3cy, and regenerates the affected scheduling/mca tests.
  Differential Revision: https://reviews.llvm.org/D56922
  llvm-svn: 351742
* [X86][BtVer2] Update latency of horizontal operations. (Andrea Di Biagio, 2019-01-16, 7 files, -95/+95)
  On Jaguar, horizontal adds/subs have local forwarding disabled. That means we pay a compulsory extra cycle of the write-back stage, and the value is not available until the end of that stage. This patch changes the latency of horizontal operations by adding an extra cycle. With this patch, latency numbers now match what is reported by perf. I plan to send another patch to also 'fix' the latency of shuffle operations (on Jaguar, local forwarding is disabled for vector shuffles too).
  Differential Revision: https://reviews.llvm.org/D56777
  llvm-svn: 351366
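  A minimal way to observe the extra cycle (hypothetical test, not from the commit):
  ```
  # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -timeline %s
  # With local forwarding disabled, the haddps result only becomes available
  # at the end of the write-back stage, so the dependent addps starts one
  # cycle later than the nominal latency would suggest.
  haddps %xmm0, %xmm1
  addps  %xmm1, %xmm2
  ```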
* [llvm-mca][x86] Add RDSEED instruction resource tests for GLM (Simon Pilgrim, 2018-12-07, 1 file, -0/+39)
  llvm-svn: 348624
* [llvm-mca][x86] Add missing AES instruction resource tests (Simon Pilgrim, 2018-12-07, 4 files, -0/+292)
  Add the missing non-VEX instructions.
  llvm-svn: 348623
* [llvm-mca][x86] Add RDRAND/RDSEED instruction resource tests (Simon Pilgrim, 2018-12-07, 13 files, -0/+529)
  llvm-svn: 348622
* [llvm-mca][MC] Add the ability to declare which processor resources model load/store queues (PR36666). (Andrea Di Biagio, 2018-11-29, 2 files, -89/+89)
  This patch adds the ability to specify via tablegen which processor resources are load/store queue resources. A new tablegen class named MemoryQueue can optionally be used to mark resources that model load/store queues. Information about the load/store queue is collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to initialize two new fields in struct MCExtraProcessorInfo named LoadQueueID and StoreQueueID. Those two fields are identifiers for buffered resources used to describe the load queue and the store queue. Field BufferSize is interpreted as the number of entries in the queue, while the number of units is a throughput indicator (i.e. the number of available pickers for loads/stores).

  At construction time, LSUnit in llvm-mca checks for the presence of extra processor information (i.e. MCExtraProcessorInfo) in the scheduling model. If that information is available, and fields LoadQueueID and StoreQueueID are set to a value different from zero (i.e. the invalid processor resource index), then LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value declared by the two processor resources.

  With this patch, we more accurately track dynamic dispatch stalls caused by the lack of LS tokens (i.e. load/store queue full). This is also shown by the differences in two BdVer2 tests: stalls that were previously classified as generic SCHEDULER FULL stalls are now correctly classified as either "load queue full" or "store queue full".

  About the differences in the -scheduler-stats view: those differences are expected, because entries in the load/store queue are not released at the instruction issue stage; instead, they are released at the instruction executed stage. This is the main reason why, for the modified tests, the load/store queues get full before PdEx is full.

  Differential Revision: https://reviews.llvm.org/D54957
  llvm-svn: 347857
* [llvm-mca] Pass the -dispatch-stats flag to a couple of tests. NFC (Andrea Di Biagio, 2018-11-27, 2 files, -2/+202)
  This change is in preparation for a patch that fixes PR36666. llvm-mca currently doesn't know if a buffered processor resource describes a load or store queue, so any dynamic dispatch stall caused by the lack of load/store queue entries is reported as a generic SCHEDULER stall (see, for example, the -dispatch-stats output from the two tests modified by this patch). In future, processor models will be able to tag processor resources that are used to describe load/store queues; that information will then be used by llvm-mca to correctly classify dynamic dispatch stalls caused by the lack of tokens in the LS.
  llvm-svn: 347662
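  A sketch of such a run (hypothetical kernel; both statistics flags exist in llvm-mca):
  ```
  # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=bdver2 -dispatch-stats -scheduler-stats %s
  # A stream of independent loads: once the load queue runs out of entries,
  # the resulting stall currently shows up under the generic SCHEDULER bucket
  # in the -dispatch-stats output.
  vmovaps   (%rdi), %xmm0
  vmovaps 16(%rdi), %xmm1
  vmovaps 32(%rdi), %xmm2
  ```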
* [llvm-mca][View] Improved Retire Control Unit Statistics. (Andrea Di Biagio, 2018-11-23, 6 files, -0/+24)
  RetireControlUnitStatistics now reports extra information about the ROB and the average/maximum number of entries consumed over the entire simulation. Example:
  ```
  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )
  ```
  Documentation in llvm/docs/CommandGuide/llvm-mca.rst has been updated to reflect this change.
  llvm-svn: 347493
* [llvm-mca] Fix an invalid memory read introduced by r346487. (Andrea Di Biagio, 2018-11-22, 1 file, -0/+104)
  Before this patch, a partial register write had to query the latency of the dependent full register write by calling a method on the full write descriptor. However, if the full write is from an already retired instruction, chances are that the EntryStage already reclaimed its memory. In some partial register write tests, valgrind was reporting an invalid memory read.

  This change fixes the invalid memory access problem. Writes are now responsible for tracking dependent partial register writes, and notify them when the owning instruction is issued. That means partial register writes no longer need to query their associated full write to check when they are ready to execute.

  Added test X86/BtVer2/partial-reg-update-7.s
  llvm-svn: 347459
* [llvm-mca] Correctly update the resource strategy for processor resources with multiple units. (Andrea Di Biagio, 2018-11-12, 21 files, -125/+125)
  When looking at the tests committed by Roman in r346587, I noticed that the numbers reported by the resource pressure view for PdAGU01 were wrong. In particular, according to the auto-generated CHECK lines in tests memcpy-like-test.s and store-throughput.s, resource pressure for PdAGU01 was not uniformly distributed among the two AGEN pipes.

  It turns out that pressure was not correctly distributed because the "resource selection strategy" object associated with PdAGU01 was not being updated when an AGEN pipe was used. As a result, llvm-mca was not simulating round-robin pipe allocation for PdAGU01; instead, PdAGU1 was always prioritized over PdAGU0. This patch fixes the issue: processor resource strategy objects for resources declaring multiple units are now correctly notified in the event of "resource used".
  llvm-svn: 346650
* [X86][BdVer2] Fix loads/stores throughput for Piledriver (PR39465) (Roman Lebedev, 2018-11-10, 65 files, -3806/+4046)
  There are two AGU units; per 1cy, there can be either two loads, or a load and a store, but not two stores, or two loads and a store. Additionally, loads shouldn't affect the store scheduler and vice versa (but they *should* affect the PdEX scheduler). Required rL346545.
  Fixes https://bugs.llvm.org/show_bug.cgi?id=39465
  llvm-svn: 346587
* [NFC][MCA][BdVer2] Add a bdver2 run line to the register-file-statistics.s test (Roman Lebedev, 2018-11-10, 1 file, -0/+11)
  Missed this one by accident when adding the initial version in rL345463 / rL345462.
  llvm-svn: 346585
* [X86] Fix VZEROUPPER scheduling info on SNB, HSW, BDW, SKL, SKX. (Clement Courbet, 2018-11-09, 6 files, -18/+18)
  Summary: Starting from SNB, VZEROUPPER is handled by the renamer and uses no proc resources. After HSW, it also has zero latency. This fixes PR35606.

  To reproduce:
  ```
  # Uops:
  llvm-exegesis -mode=uops -opcode-name=VZEROUPPER
  # Latency:
  echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper' | /tmp/llvm-exegesis -mode=latency -snippets-file=-
  echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper\naddps %xmm0, %xmm1' | /tmp/llvm-exegesis -mode=latency -snippets-file=-
  ```
  Reviewers: RKSimon, craig.topper, andreadb
  Subscribers: gbedwell, llvm-commits
  Differential Revision: https://reviews.llvm.org/D54107
  llvm-svn: 346482
* [NFC][BdVer2] Load and store throughput tests: also check sched stats (PR39465) (Roman Lebedev, 2018-11-08, 2 files, -2/+240)
  As noted by Andrea Di Biagio in https://bugs.llvm.org/show_bug.cgi?id=39465, both the loads and the stores occupy both the store and load queues. This is clearly wrong.
  llvm-svn: 346425
* [NFC][BdVer2] Tests for load and store throughput (PR39465) (Roman Lebedev, 2018-11-08, 2 files, -0/+1209)
  During review it was noted that while it appears that Piledriver can do two [consecutive] loads per cycle, it can only do one store per cycle. It was suggested that the sched model incorrectly models that, but it was opted to fix this afterwards. These tests show that two consecutive loads are modelled correctly, while consecutive stores are not modelled correctly. Unless I'm missing the point.
  https://bugs.llvm.org/show_bug.cgi?id=39465
  llvm-svn: 346404
* [llvm-mca] Add extra counters for move elimination in view RegisterFileStatistics. (Andrea Di Biagio, 2018-11-01, 5 files, -0/+20)
  This patch teaches view RegisterFileStatistics how to report events for optimizable register moves. For each processor register file, view RegisterFileStatistics reports the following extra information:
  - Number of optimizable register moves
  - Number of register moves eliminated
  - Number of zero moves (i.e. register moves that propagate a zero)
  - Max number of moves eliminated per cycle
  Differential Revision: https://reviews.llvm.org/D53976
  llvm-svn: 345865
* AMD BdVer2 (Piledriver) Initial Scheduler model (Roman Lebedev, 2018-10-27, 85 files, -8457/+9409)
  Summary:

  # Overview
  This is somewhat partial.
  * Latencies are good; all of the remaining inconsistencies //appear// to be noise/flaky.
  * NumMicroOps are somewhat good; most of the remaining inconsistencies are from the `Ld` / `Ld_ReadAfterLd` classes.
  * Actual unit occupation (pipes, `ResourceCycles`) is undiscovered land; I did not really look there. It is basically a verbatim copy from `btver2`.
  * Many `InstRW`. And there are still inconsistencies left...

  To be noted: I think this is the first new schedule profile produced with the new next-gen tools like llvm-exegesis!

  # Benchmark
  I realize that this isn't what was suggested, but I'll start with an "internal" public real-world benchmark I understand: the RawSpeed raw image decoding library (https://github.com/darktable-org/rawspeed). Diff (the exact clang from trunk without/with this patch; 25 repetitions per file; first rows of the per-file table, which also reports pvalue/median/stddev rows for each file):
  ```
  Benchmark                                                          Time     CPU    Time Old  Time New  CPU Old  CPU New
  ------------------------------------------------------------------------------------------------------------------------
  Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_mean  -0.0607  -0.0604    234      219       233      219
  Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_mean  -0.0770  -0.0767    144      133       144      133
  Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_mean               -0.0271  -0.0270    463      450       463      450
  ...
  ```
  If we look at the `_mean`s of the time column, the biggest win is `-7.7%` (Canon/EOS 5D Mark II/10.canon.sraw2.cr2), and the biggest loss is `+3.3%` (Panasonic/DC-GH5S/P1022085.RW2). Overall: mean `-0.7436%`, median `-0.23%`, `cbrt(sum(time^3))` = `-8.73%`. Looks good so far, I'd say. The llvm-exegesis details are attached to the Phabricator review.

  Reviewers: craig.topper, RKSimon, andreadb, courbet, avt77, spatel, GGanesh
  Reviewed By: andreadb
  Subscribers: javed.absar, gbedwell, jfb, llvm-commits
  Differential Revision: https://reviews.llvm.org/D52779
  llvm-svn: 345463
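  With the model in place, any kernel can be simulated against Piledriver; a minimal run line (illustrative, not one of the committed tests):
  ```
  # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=bdver2 -all-stats %s
  # Resource pressure, dispatch and scheduler statistics now come from the
  # new bdver2 profile (the PdEX / PdAGU pipes mentioned elsewhere in this log).
  vaddps %xmm0, %xmm1, %xmm1
  vmulps %xmm1, %xmm2, %xmm2
  ```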
* [NFC][X86] Baseline tests for AMD BdVer2 (Piledriver) Scheduler model (Roman Lebedev, 2018-10-27, 86 files, -0/+15775)
  Adding the baseline tests in a preparatory NFC commit, so that the actual commit shows the *diff*. Yes, I'm aware that a few of these codegen-based sched tests are testing the wrong instructions; I will fix that afterwards.
  For https://reviews.llvm.org/D52779
  llvm-svn: 345462
* [MC] Separate masm integer literal lexer support from inline asm (Reid Kleckner, 2018-10-24, 1 file, -1/+1)
  Summary: This renames the IsParsingMSInlineAsm member variable of AsmLexer to LexMasmIntegers and moves it up to MCAsmLexer. This is the only behavior controlled by that variable. I added a public setter, so that it can be set from outside or from the llvm-mc command line. We may need to arrange things so that users can get this behavior from clang, but that's future work.

  I also put additional hex literal lexing functionality under this flag to fix PR32973. It appears that this hex literal parsing wasn't intended to be enabled in non-masm-style blocks. Now, masm integers (0b1101 and 0ABCh) work in __asm blocks from clang, but 0b label references work when using .intel_syntax in standalone .s files. However, 0b label references will *not* work from __asm blocks in clang. They will work from GCC inline asm blocks, which it sounds like is important for Crypto++ as mentioned in PR36144.

  Essentially, we only lex masm literals for inline asm blobs that use intel syntax. If the .intel_syntax directive is used inside a gnu-style inline asm statement, masm literals will not be lexed, which is compatible with gas and llvm-mc standalone .s assembly. This fixes PR36144 and PR32973.

  Reviewers: Gerolf, avt77
  Subscribers: eraman, hiraditya, llvm-commits
  Differential Revision: https://reviews.llvm.org/D53535
  llvm-svn: 345189
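  A small illustration of the two literal styles (the -masm-integers llvm-mc flag name is my assumption for how the new public setter is exposed on the command line):
  ```
  # RUN: llvm-mc -triple=x86_64 -masm-integers %s
  .intel_syntax noprefix
  mov eax, 0ABCh   # masm-style hex literal (0xABC), only lexed under this flag
  mov ecx, 0b1101  # masm-style binary literal (13)
  # Without the flag, "0b" keeps working as a backward reference to a local
  # label "0:", as in standalone gas-style .s files.
  ```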
* [X86] Fix Skylake ReadAfterLd for PADDrm etc. (Simon Pilgrim, 2018-10-16, 1 file, -51/+53)
  Missed in rL343868 due to their custom InstRW.
  llvm-svn: 344600
* [tblgen][llvm-mca] Add the ability to describe move elimination candidates via tablegen. (Andrea Di Biagio, 2018-10-12, 5 files, -175/+173)
  This patch adds the ability to identify instructions that are "move elimination candidates". It also allows scheduling models to describe processor register files that allow move elimination.

  A move elimination candidate is an instruction that can be eliminated at register renaming stage. Each subtarget can specify which instructions are move elimination candidates with the help of tablegen class "IsOptimizableRegisterMove" (see llvm/Target/TargetInstrPredicate.td). For example, on X86, BtVer2 allows both GPR and MMX/SSE moves to be eliminated. The definition of IsOptimizableRegisterMove for BtVer2 looks like this:
  ```
  def : IsOptimizableRegisterMove<[
    InstructionEquivalenceClass<[
      // GPR variants.
      MOV32rr, MOV64rr,
      // MMX variants.
      MMX_MOVQ64rr,
      // SSE variants.
      MOVAPSrr, MOVUPSrr, MOVAPDrr, MOVUPDrr, MOVDQArr, MOVDQUrr,
      // AVX variants.
      VMOVAPSrr, VMOVUPSrr, VMOVAPDrr, VMOVUPDrr, VMOVDQArr, VMOVDQUrr
    ], CheckNot<CheckSameRegOperand<0, 1>>>
  ]>;
  ```
  Definitions of IsOptimizableRegisterMove from processor models of a same Target are processed by the SubtargetEmitter to auto-generate a target-specific override for each of the following predicate methods:
  ```
  bool TargetSubtargetInfo::isOptimizableRegisterMove(const MachineInstr *MI) const;
  bool MCInstrAnalysis::isOptimizableRegisterMove(const MCInst &MI, unsigned CPUID) const;
  ```
  By default, those methods return false (i.e. they conservatively assume that there are no move elimination candidates).

  Tablegen class RegisterFile has been extended with the following information:
  - The set of register classes that allow move elimination.
  - Maximum number of moves that can be eliminated every cycle.
  - Whether move elimination is restricted to moves from registers that are known to be zero.

  This patch is structured in three parts. A first part (mostly boilerplate) adds the new 'isOptimizableRegisterMove' target hooks, and extends existing register file descriptors in MC by introducing new fields to describe properties related to move elimination. A second part uses the new tablegen constructs to describe move elimination in the BtVer2 scheduling model. A third part teaches llvm-mca how to query the new 'isOptimizableRegisterMove' hook to mark instructions that are candidates for move elimination; it also teaches class RegisterFile how to describe constraints on move elimination at PRF granularity.

  llvm-mca tests for btver2 show differences before/after this patch.
  Differential Revision: https://reviews.llvm.org/D53134
  llvm-svn: 344334
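  A sketch of an mca test that should report both moves as elimination candidates (hypothetical contents; -register-file-stats is an existing llvm-mca view):
  ```
  # RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -register-file-stats %s
  # MOV64rr and VMOVAPSrr both match BtVer2's IsOptimizableRegisterMove class
  # above, and neither violates the CheckNot<CheckSameRegOperand<0, 1>>
  # constraint (source and destination differ).
  movq    %rax, %rcx
  vmovaps %xmm0, %xmm1
  ```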
* [llvm-mca][BtVer2] Add tests for optimizable GPR register moves. NFC (Andrea Di Biagio, 2018-10-11, 2 files, -0/+216)
  llvm-svn: 344253
* [llvm-mca][BtVer2] Add two more move-elimination tests. NFC (Andrea Di Biagio, 2018-10-10, 2 files, -0/+259)
  These should test all the optimizable moves on Jaguar. A follow-up patch will teach how to recognize these optimizable register moves.
  llvm-svn: 344144