author | Andrea Di Biagio <Andrea_DiBiagio@sn.scee.net> | 2019-08-05 13:18:37 +0000 |
---|---|---|
committer | Andrea Di Biagio <Andrea_DiBiagio@sn.scee.net> | 2019-08-05 13:18:37 +0000 |
commit | 225655f82c3f48a25d97738f64da701991c51f5f (patch) | |
tree | 22f80641958c43dbc45081a2ef973dcd769dcb41 /llvm/docs/CommandGuide | |
parent | 94484d2b118cd4045d18c0132770755641ff78cd (diff) | |
[MCA][doc] Add a section for the 'Bottleneck Analysis'.
Also clarify the meaning of 'Block RThroughput' and 'RThroughput'.
llvm-svn: 367853
Diffstat (limited to 'llvm/docs/CommandGuide')
-rw-r--r-- | llvm/docs/CommandGuide/llvm-mca.rst | 85 |
1 files changed, 79 insertions, 6 deletions
diff --git a/llvm/docs/CommandGuide/llvm-mca.rst b/llvm/docs/CommandGuide/llvm-mca.rst
index c8b11fc6ed2..f2ebbec43c0 100644
--- a/llvm/docs/CommandGuide/llvm-mca.rst
+++ b/llvm/docs/CommandGuide/llvm-mca.rst
@@ -373,17 +373,28 @@ overview of the performance throughput. Important performance indicators are
 **IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
 Throughput).
 
+Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
+to the out-of-order backend every simulated cycle.
+
 IPC is computed dividing the total number of simulated instructions by the total
-number of cycles. In the absence of loop-carried data dependencies, the
-observed IPC tends to a theoretical maximum which can be computed by dividing
-the number of instructions of a single iteration by the *Block RThroughput*.
+number of cycles.
+
+Field *Block RThroughput* is the reciprocal of the block throughput. Block
+throuhgput is a theoretical quantity computed as the maximum number of blocks
+(i.e. iterations) that can be executed per simulated clock cycle in the absence
+of loop carried dependencies. Block throughput is is superiorly
+limited by the dispatch rate, and the availability of hardware resources.
+
+In the absence of loop-carried data dependencies, the observed IPC tends to a
+theoretical maximum which can be computed by dividing the number of instructions
+of a single iteration by the `Block RThroughput`.
 
 Field 'uOps Per Cycle' is computed dividing the total number of simulated micro
 opcodes by the total number of cycles. A delta between Dispatch Width and this
 field is an indicator of a performance issue. In the absence of loop-carried
 data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
 maximum throughput which can be computed by dividing the number of uOps of a
-single iteration by the *Block RThroughput*.
+single iteration by the `Block RThroughput`.
 
 Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
 because the dispatch width limits the maximum size of a dispatch group. Both IPC
@@ -392,12 +403,12 @@ availability of hardware resources affects the resource pressure distribution,
 and it limits the number of instructions that can be executed in parallel every
 cycle. A delta between Dispatch Width and the theoretical maximum uOps per
 Cycle (computed by dividing the number of uOps of a single iteration by the
-*Block RTrhoughput*) is an indicator of a performance bottleneck caused by the
+`Block RThroughput`) is an indicator of a performance bottleneck caused by the
 lack of hardware resources.
 In general, the lower the Block RThroughput, the better.
 
 In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
-are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
 approach 1.50 when the number of iterations tends to infinity. The delta between
 the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
 an indicator of a performance bottleneck caused by the lack of hardware
@@ -409,6 +420,13 @@ throughput of every instruction in the sequence. That section also reports extra
 information related to the number of micro opcodes, and opcode properties (i.e.,
 'MayLoad', 'MayStore', and 'HasSideEffects').
 
+Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
+is computed as the maximum number of instructions of a same type that can be
+executed per clock cycle in the absence of operand dependencies. In this
+example, the reciprocal throughput of a vector float multiply is 1
+cycles/instruction. That is because the FP multiplier JFPM is only available
+from pipeline JFPU1.
+
 The third section is the *Resource pressure view*. This view reports
 the average number of resource cycles consumed every iteration by instructions
 for every processor resource unit available on the target. Information is
@@ -540,6 +558,61 @@ resources, the delta between the two counters is small. However, the number of
 cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
 especially when compared to other low latency instructions.
 
+Bottleneck Analysis
+^^^^^^^^^^^^^^^^^^^
+The ``-bottleneck-analysis`` command line option enables the analysis of
+performance bottlenecks.
+
+This analysis is potentially expensive. It attempts to correlate increases in
+backend pressure (caused by pipeline resource pressure and data dependencies) to
+dynamic dispatch stalls.
+
+Below is an example of ``-bottleneck-analysis`` output generated by
+:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
+
+.. code-block:: none
+
+
+  Cycles with backend pressure increase [ 48.07% ]
+  Throughput Bottlenecks:
+    Resource Pressure       [ 47.77% ]
+    - JFPA  [ 47.77% ]
+    - JFPU0  [ 47.77% ]
+    Data Dependencies:      [ 0.30% ]
+    - Register Dependencies [ 0.30% ]
+    - Memory Dependencies   [ 0.00% ]
+
+  Critical sequence based on the simulation:
+
+                Instruction                         Dependency Information
+   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
+   |
+   |    < loop carried >
+   |
+   |      0.    vmulps  %xmm0, %xmm1, %xmm2
+   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3            ## RESOURCE interference:  JFPA [ probability: 74% ]
+   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4            ## REGISTER dependency:  %xmm3
+   |
+   |    < loop carried >
+   |
+   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3            ## RESOURCE interference:  JFPA [ probability: 74% ]
+
+
+According to the analysis, throughput is limited by resource pressure and not by
+data dependencies. The analysis observed increases in backend pressure during
+48.07% of the simulated run. Almost all those pressure increase events were
+caused by contention on processor resources JFPA/JFPU0.
+
+The `critical sequence` is the most expensive sequence of instructions according
+to the simulation. It is annotated to provide extra information about critical
+register dependencies and resource interferences between instructions.
+
+Instructions from the critical sequence are expected to significantly impact
+performance. By construction, the accuracy of this analysis is strongly
+dependent on the simulation and (as always) by the quality of the processor
+model in llvm.
+
+
 Extra Statistics to Further Diagnose Performance Issues
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 The ``-all-stats`` command line option enables extra statistics and performance
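
The arithmetic described in the patched text (theoretical maximum = per-iteration count divided by Block RThroughput, compared against the dispatch width) can be sketched as follows. This is only an illustration, not part of the commit: the function names are hypothetical, and the inputs of 3 uOps per iteration and a Block RThroughput of 2.0 are assumed values chosen to reproduce the 1.50 figure quoted in the patch.

```python
def theoretical_max_per_cycle(per_iteration: float, block_rthroughput: float) -> float:
    """Upper bound on IPC (or uOps per cycle) when there are no
    loop-carried dependencies: per-iteration count / Block RThroughput."""
    if block_rthroughput <= 0:
        raise ValueError("Block RThroughput must be positive")
    return per_iteration / block_rthroughput


def dispatch_delta(dispatch_width: float, max_uops_per_cycle: float) -> float:
    """Gap between the dispatch width and the theoretical maximum uOps per
    cycle; per the patched text, a positive gap hints at a bottleneck caused
    by a lack of hardware resources."""
    return dispatch_width - max_uops_per_cycle


# Assumed inputs (hypothetical, picked to match the 1.50 quoted in the patch).
UOPS_PER_ITERATION = 3
BLOCK_RTHROUGHPUT = 2.0
DISPATCH_WIDTH = 2.0  # the patch text quotes a Dispatch Width of 2.00

max_uops = theoretical_max_per_cycle(UOPS_PER_ITERATION, BLOCK_RTHROUGHPUT)
delta = dispatch_delta(DISPATCH_WIDTH, max_uops)
print(f"theoretical max uOps per cycle: {max_uops:.2f}, delta: {delta:.2f}")
```

The same helper applies to IPC by passing the number of instructions per iteration instead of the number of uOps.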