path: root/llvm/lib/MCA
Commit message | Author | Age | Files | Lines
* [MCA] Add an experimental MicroOpQueue stage. | Andrea Di Biagio | 2019-03-29 | 3 | -0/+75
  This patch adds an experimental stage named MicroOpQueueStage. MicroOpQueueStage can be used to simulate a hardware micro-op queue (basically, a decoupling queue between 'decode' and 'dispatch'). Users can specify a queue size, as well as an optional MaxIPC (which, in the absence of a "Decoders" stage, can be used to simulate a different throughput from the decoders).

  This stage is added to the default pipeline between the EntryStage and the DispatchStage only if PipelineOption::MicroOpQueue is different from zero. By default, llvm-mca sets PipelineOption::MicroOpQueue to the value of the hidden flag -micro-op-queue-size.

  Throughput from the decoder can be simulated via another hidden flag named -decoder-throughput. That flag allows us to quickly experiment with different frontend throughputs. For targets that declare a loop buffer, flag -decoder-throughput allows users to do multiple runs, each time simulating a different throughput from the decoders.

  This stage can/will be extended in the future. For example, we could add a "buffer full" event to notify bottlenecks caused by backpressure. Flag -decoder-throughput would probably go away if, in the future, we delegate to another stage (DecoderStage?) the simulation of a (potentially variable) throughput from the decoders. For now, flag -decoder-throughput is "good enough" to run some simple experiments.

  Differential Revision: https://reviews.llvm.org/D59928
  llvm-svn: 357248
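  The decoupling queue described in this commit can be pictured with a small standalone model. The sketch below is not the MicroOpQueueStage code from the patch: the type `ToyMicroOpQueue` and all of its members are hypothetical names, and the rule that clamps a wide instruction to the queue size is an assumption made so that such an instruction can still enter an empty queue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical toy model of a decoupling queue between 'decode' and
// 'dispatch'. It is NOT the MicroOpQueueStage class added by the patch.
struct ToyMicroOpQueue {
  std::size_t Capacity;           // Queue size, in micro-ops.
  std::size_t MaxIPC;             // Instructions accepted per cycle; 0 = no limit.
  std::deque<std::size_t> Buffer; // One entry per instruction: its uOp count.
  std::size_t UsedSlots = 0;
  std::size_t AcceptedThisCycle = 0;

  ToyMicroOpQueue(std::size_t Size, std::size_t IPC)
      : Capacity(Size), MaxIPC(IPC) {
    assert(Size > 0 && "a zero-sized queue means the stage is not used");
  }

  // Assumption: an instruction wider than the whole queue is clamped to the
  // queue size, so that it can still enter an empty queue.
  std::size_t normalized(std::size_t NumMicroOps) const {
    return std::min(NumMicroOps, Capacity);
  }

  // True if one more instruction can enter the queue during this cycle.
  bool isAvailable(std::size_t NumMicroOps) const {
    if (MaxIPC != 0 && AcceptedThisCycle == MaxIPC)
      return false; // Simulated decoder throughput exhausted.
    return UsedSlots + normalized(NumMicroOps) <= Capacity;
  }

  // Try to enqueue one instruction; returns false on backpressure.
  bool push(std::size_t NumMicroOps) {
    if (!isAvailable(NumMicroOps))
      return false;
    const std::size_t N = normalized(NumMicroOps);
    Buffer.push_back(N);
    UsedSlots += N;
    ++AcceptedThisCycle;
    return true;
  }

  // Forward whole instructions, oldest first, while they fit in the
  // remaining dispatch width. Returns the number of instructions forwarded.
  // Simplification: an instruction wider than DispatchWidth never drains.
  std::size_t drain(std::size_t DispatchWidth) {
    std::size_t Forwarded = 0;
    while (!Buffer.empty() && Buffer.front() <= DispatchWidth) {
      DispatchWidth -= Buffer.front();
      UsedSlots -= Buffer.front();
      Buffer.pop_front();
      ++Forwarded;
    }
    return Forwarded;
  }

  // Reset the per-cycle acceptance counter.
  void cycleStart() { AcceptedThisCycle = 0; }
};
```

  In this model, a `push` that returns false is the point where a "buffer full" event of the kind mentioned in the commit message could be raised.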
* [MCA] Fix -Wparentheses warning breaking the -Werror build. | Andrea Di Biagio | 2019-03-27 | 1 | -1/+2
  Warning was introduced at r357074.
  llvm-svn: 357085
* [MCA][Pipeline] Don't visit stages in reverse order when calling method cycleEnd(). NFCI | Andrea Di Biagio | 2019-03-27 | 1 | -3/+3
  There is no reason why stages should be visited in reverse order. This patch allows the definition of stages that push instructions forward from their cycleEnd() routine.
  llvm-svn: 357074
* [MCA] Correctly update the UsedResourceGroups mask in the InstrBuilder. | Andrea Di Biagio | 2019-03-26 | 2 | -0/+11
  Found by inspection when looking at the debug output of MCA. This problem was latent, and none of the upstream models were affected by it. No functional change intended.
  llvm-svn: 357000
* [MCA] Highlight kernel bottlenecks in the summary view. | Andrea Di Biagio | 2019-03-04 | 3 | -4/+70
  This patch adds a new flag named -bottleneck-analysis to print out information about throughput bottlenecks.

  MCA knows how to identify and classify dynamic dispatch stalls. However, it doesn't know how to analyze and highlight kernel bottlenecks. The goal of this patch is to teach MCA how to correlate increases in backend pressure to backend stalls (and therefore, the loss of throughput).

  From a Scheduler point of view, backend pressure is a function of the scheduler buffer usage (i.e. how the number of uOps in the scheduler buffers changes over time). Backend pressure increases (or decreases) when there is a mismatch between the number of opcodes dispatched, and the number of opcodes issued in the same cycle. Since buffer resources are limited, continuous increases in backend pressure would eventually lead to dispatch stalls. So, there is a strong correlation between dispatch stalls, and how backpressure changed over time.

  This patch teaches MCA how to identify situations where backend pressure increases due to:
   - unavailable pipeline resources.
   - data dependencies.

  Data dependencies may delay execution of instructions and therefore increase the time that uOps have to spend in the scheduler buffers. That often translates to an increase in backend pressure which may eventually lead to a bottleneck. Contention on pipeline resources may also delay execution of instructions, and lead to a temporary increase in backend pressure.

  Internally, the Scheduler classifies instructions based on whether register / memory operands are available or not. An instruction is marked as "ready to execute" only if data dependencies are fully resolved. Every cycle, the Scheduler attempts to execute all instructions that are ready to execute. If an instruction cannot execute because of unavailable pipeline resources, then the Scheduler internally updates a BusyResourceUnits mask with the ID of each unavailable resource.

  ExecuteStage is responsible for tracking changes in backend pressure. If backend pressure increases during a cycle because of contention on pipeline resources, then ExecuteStage sends a "backend pressure" event to the listeners. That event would contain information about instructions delayed by resource pressure, as well as the BusyResourceUnits mask. Note that ExecuteStage also knows how to identify situations where backpressure increased because of delays introduced by data dependencies.

  The SummaryView observes "backend pressure" events and prints out a "bottleneck report". Example of bottleneck report:

```
Cycles with backend pressure increase [ 99.89% ]
Throughput Bottlenecks:
  Resource Pressure       [ 0.00% ]
  Data Dependencies:      [ 99.89% ]
   - Register Dependencies [ 0.00% ]
   - Memory Dependencies   [ 99.89% ]
```

  A bottleneck report is printed out only if increases in backend pressure eventually caused backend stalls.

  About the time complexity: Time complexity is linear in the number of instructions in the Scheduler::PendingSet. The average slowdown tends to be in the range of ~5-6%. For memory intensive kernels, the slowdown can be significant if flag -noalias=false is specified. In the worst case scenario I have observed a slowdown of ~30% when flag -noalias=false was specified. We can definitely recover part of that slowdown if we optimize class LSUnit (by doing extra bookkeeping to speed up queries).

  For now, this new analysis is disabled by default, and it can be enabled via flag -bottleneck-analysis. Users of MCA as a library can enable the generation of pressure events through the constructor of ExecuteStage.

  This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494

  Differential Revision: https://reviews.llvm.org/D58728
  llvm-svn: 355308
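  The relationship between dispatched and issued opcodes described in this commit can be illustrated with a small counter. This is a sketch under stated assumptions, not the ExecuteStage or SummaryView code: `ToyPressureTracker` and its members are hypothetical, and it only reproduces the first line of the report (the percentage of cycles with a backend pressure increase) plus an accumulated busy-resource mask.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker; names and layout are assumptions of this sketch.
struct ToyPressureTracker {
  unsigned TotalCycles = 0;
  unsigned PressureIncreaseCycles = 0;
  std::uint64_t BusyResourceUnits = 0; // Union of busy masks seen so far.

  // Called once per simulated cycle. Pressure is considered to increase when
  // more opcodes were dispatched than issued during the cycle.
  void onCycleEnd(unsigned Dispatched, unsigned Issued,
                  std::uint64_t BusyMask) {
    ++TotalCycles;
    if (Dispatched > Issued) {
      ++PressureIncreaseCycles;
      BusyResourceUnits |= BusyMask;
    }
  }

  // Percentage of cycles in which backend pressure increased.
  double pressureIncreasePercent() const {
    assert(TotalCycles > 0 && "no cycles simulated");
    return 100.0 * PressureIncreaseCycles / TotalCycles;
  }
};
```

  A real analysis would also split the increase between resource pressure and register or memory dependencies, as the report in the commit message does; this sketch leaves that classification out.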
* [MCA] Always check if scheduler resources are unavailable when reporting dispatch stalls. | Andrea Di Biagio | 2019-02-26 | 2 | -4/+13
  Dispatch stall cycles may be associated with multiple dispatch stall events. Before this patch, each stall cycle was associated with a single stall event. This patch also improves a couple of code comments, and adds a helper method to query the Scheduler for dispatch stalls.
  llvm-svn: 354877
* [MCA][Scheduler] Collect resource pressure and memory dependency bottlenecks. | Andrea Di Biagio | 2019-02-20 | 3 | -26/+31
  Every cycle, the Scheduler checks if instructions in the ReadySet can be issued to the underlying pipelines. If an instruction cannot be issued because one or more pipeline resources are unavailable, then field Instruction::CriticalResourceMask is updated with the resource identifier of the unavailable resources.

  If an instruction cannot be promoted from the PendingSet to the ReadySet because of a memory dependency, then field Instruction::CriticalMemDep is updated with the identifier of the dependent memory instruction.

  Bottleneck information is collected after every cycle for instructions that are waiting to execute. The idea is to help identify causes of bottlenecks; this information can be used in the future to implement a bottleneck analysis.
  llvm-svn: 354490
* [MCA][ResourceManager] Add a table that maps processor resource indices to processor resource identifiers. | Andrea Di Biagio | 2019-02-20 | 1 | -21/+24
  This patch adds a lookup table to speed up resource queries in the ResourceManager. This patch also moves helper function 'getResourceStateIndex()' from ResourceManager.cpp to Support.h, so that we can reuse that logic in the SummaryView (and potentially other views in llvm-mca). No functional change intended.
  llvm-svn: 354470
* [MCA] Correctly update register definitions in the PRF after move elimination. | Andrea Di Biagio | 2019-02-18 | 1 | -14/+9
  This patch fixes a bug where register writes performed by optimizable register moves were sometimes wrongly treated like partial register updates. Before this patch, llvm-mca wrongly predicted a 1.50 IPC for test reg-move-elimination-6.s (added by this patch). With this patch, llvm-mca correctly updates the register definitions in the PRF, and the IPC for that test is now correctly reported as 2.
  llvm-svn: 354271
* [MCA] Slightly refactor method writeStartEvent in WriteState and ReadState. NFCI | Andrea Di Biagio | 2019-02-18 | 3 | -13/+13
  This is another change in preparation for PR37494. No functional change intended.
  llvm-svn: 354261
* [MCA] Improved code comment. NFC | Andrea Di Biagio | 2019-02-15 | 1 | -1/+2
  llvm-svn: 354154
* [MCA][LSUnit] Return the ID of the dependent memory operation from method isReady(). NFCI | Andrea Di Biagio | 2019-02-15 | 2 | -12/+16
  This is yet another change in preparation for a fix for PR37494.
  llvm-svn: 354150
* [MCA] Store a bitmask of used groups in the instruction descriptor. | Andrea Di Biagio | 2019-02-13 | 2 | -8/+22
  This is to speed up 'checkAvailability' queries in class ResourceManager. No functional change intended.
  llvm-svn: 353949
* [MCA][Scheduler] Use latency information to further classify busy instructions. | Andrea Di Biagio | 2019-02-13 | 2 | -18/+82
  This patch introduces a new instruction stage named 'IS_PENDING'. An instruction transitions from the IS_DISPATCHED to the IS_PENDING stage if input registers are not available, but their latency is known.

  This patch also adds a new set of instructions named 'PendingSet' to class Scheduler. The idea is that the PendingSet will only contain instructions that have reached the IS_PENDING stage. By construction, an instruction in the PendingSet is only dependent on instructions that have already reached the execution stage. The plan is to use this knowledge to identify bottlenecks caused by data dependencies (see PR37494).

  Differential Revision: https://reviews.llvm.org/D58066
  llvm-svn: 353937
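  The stage transition described in this commit can be sketched as a small classification function. The code below is illustrative only: `ToyStage`, `ToyUnknownCycles` and `classify` are hypothetical names, and the encoding of "latency not yet known" as -1 is an assumption of the sketch, not the value used by MCA.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the IS_DISPATCHED / IS_PENDING stages plus a
// "ready to execute" stage.
enum class ToyStage { Dispatched, Pending, Ready };

// Assumed marker for an input whose producer latency is not known yet.
constexpr int ToyUnknownCycles = -1;

// CyclesLeft holds, for each input register, the number of cycles until the
// value becomes available (0 = available now).
inline ToyStage classify(const std::vector<int> &CyclesLeft) {
  bool AnyWaiting = false;
  for (int Cycles : CyclesLeft) {
    assert(Cycles >= ToyUnknownCycles && "invalid cycle count");
    if (Cycles == ToyUnknownCycles)
      return ToyStage::Dispatched; // At least one latency is still unknown.
    if (Cycles > 0)
      AnyWaiting = true; // Not available yet, but latency is known.
  }
  return AnyWaiting ? ToyStage::Pending : ToyStage::Ready;
}
```

  Under this model, every instruction classified as `Pending` depends only on producers whose remaining latency is known, which matches the property the commit message states for the PendingSet.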
* [MCA] Improved debug prints. NFC | Andrea Di Biagio | 2019-02-12 | 2 | -3/+21
  llvm-svn: 353852
* [MCA][Scheduler] Track resources that were found busy when issuing an instruction. | Andrea Di Biagio | 2019-02-11 | 1 | -0/+3
  This is a follow-up to r353706. When the scheduler fails to issue a ready instruction to the underlying pipelines, it now updates a mask of 'busy resource units'. That information will be used in the future to obtain the set of "problematic" resources in the case of bottlenecks caused by resource pressure. No functional change intended.
  llvm-svn: 353728
* [MCA] Return a mask of busy resources from method ResourceManager::checkAvailability(). NFCI | Andrea Di Biagio | 2019-02-11 | 2 | -10/+23
  In case of bottlenecks caused by pipeline pressure, we want to be able to correctly report the set of problematic pipelines. This is a first step towards adding support for bottleneck hints in llvm-mca (see PR37494). No functional change intended.
  llvm-svn: 353706
* [MCA] Speedup ResourceManager queries. NFCI | Andrea Di Biagio | 2019-02-06 | 1 | -8/+9
  When a resource unit R is released, the ResourceManager notifies groups that contain R. Before this patch, the logic in method ResourceManager::release() implemented a potentially slow iterative search of dependent groups on the entire set of processor resources. This patch replaces that logic with a simpler (and often faster) lookup on array `Resource2Groups`. This patch gives an average speedup of ~3-4% (observed on a release build when testing for target btver2). No functional change intended.
  llvm-svn: 353301
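  The idea of replacing a search over all processor resources with a table lookup can be sketched as follows. Only the name `Resource2Groups` comes from the commit message; the element type (one 64-bit group mask per unit), the struct `ToyResourceTable` and the method `groupsContaining` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical table: entry I is a bitmask with bit G set when group G
// contains resource unit I. The real layout in ResourceManager may differ.
struct ToyResourceTable {
  std::vector<std::uint64_t> Resource2Groups;

  // Return the indices of all groups that contain the given unit. The cost
  // depends on the width of one mask, not on the number of processor
  // resources in the model.
  std::vector<unsigned> groupsContaining(unsigned UnitIndex) const {
    assert(UnitIndex < Resource2Groups.size() && "unknown resource unit");
    std::vector<unsigned> Groups;
    std::uint64_t Mask = Resource2Groups[UnitIndex];
    for (unsigned Bit = 0; Mask != 0; ++Bit, Mask >>= 1)
      if (Mask & 1)
        Groups.push_back(Bit);
    return Groups;
  }
};
```

  When a unit is released, a manager built this way would notify exactly the groups returned by `groupsContaining` instead of scanning every resource descriptor.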
* [MCA] Moved the logic that updates register dependencies from DispatchStage to RegisterFile. NFC | Andrea Di Biagio | 2019-02-05 | 3 | -26/+20
  DispatchStage should always delegate to an object of class RegisterFile the task of updating data dependencies. ReadState and WriteState objects should not be modified directly by DispatchStage. This patch also renames stage IS_AVAILABLE to IS_DISPATCHED.
  llvm-svn: 353170
* [MCA] Simplify the logic in method WriteState::addUser. NFCI | Andrea Di Biagio | 2019-02-05 | 1 | -5/+1
  In some cases, it is faster to just grow the set of 'Users' rather than performing an llvm::find_if every time a new user is added to the set. No functional change intended.
  llvm-svn: 353162
* [MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit. | Andrea Di Biagio | 2019-01-23 | 1 | -0/+1
  This patch adds a new ReadAdvance definition named ReadInt2Fpu. ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by data transfers from the integer unit to the floating point unit. ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all x86 models excluding BtVer2. That means this patch is a functional change for the Jaguar CPU model only.

  Tablegen definitions for instructions (V)PINSR* have been updated to account for the new ReadInt2Fpu. That read is mapped to the GPR input operand. On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch, that extra delay was added to the opcode latency. In practice, the insert opcode only executes for 1cy. Most of the latency is contributed by the so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR* latency is defined by expression f+1, where f is defined as a forwarding delay from the integer unit to the fpu.

  When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC (only when flag -print-schedule is specified), we now need to account for any extra forwarding delays. We do this by checking if scheduling classes declare any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td: "A negative advance effectively increases latency, which may be used for cross-domain stalls". When computing the instruction latency for the purpose of our scheduling tests, we now add any extra delay to the formula. This avoids regressing existing codegen and mca schedule tests. It comes with the cost of an extra (but very simple) hook in MCSchedModel.

  Differential Revision: https://reviews.llvm.org/D57056
  llvm-svn: 351965
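  The latency formula described in this commit (opcode latency plus any forwarding delay declared through a negative ReadAdvance) can be worked through with a short function. This is a sketch, not the MCSchedModel hook added by the patch: the function name is hypothetical, and combining several negative entries by taking the largest delay is an assumption of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper. ReadAdvanceCycles holds one entry per ReadAdvance
// declared by the scheduling class; a negative entry models a forwarding
// delay (for example, an int-to-fpu transfer).
inline int latencyWithForwardingDelay(int OpcodeLatency,
                                      const std::vector<int> &ReadAdvanceCycles) {
  assert(OpcodeLatency >= 0 && "latency cannot be negative");
  int ExtraDelay = 0;
  for (int Cycles : ReadAdvanceCycles)
    if (Cycles < 0)
      ExtraDelay = std::max(ExtraDelay, -Cycles); // Assumed: largest delay wins.
  return OpcodeLatency + ExtraDelay;
}
```

  With the Jaguar numbers quoted above, a 1cy insert opcode and a ReadAdvance of -6 give 1 + 6 = 7 cycles, which is the f+1 expression with f = 6.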
* Update the file headers across all of the LLVM projects in the monorepo to reflect the new license. | Chandler Carruth | 2019-01-19 | 19 | -76/+57
  We understand that people may be surprised that we're moving the header entirely to discuss the new license. We checked this carefully with the Foundation's lawyer and we believe this is the correct approach.

  Essentially, all code in the project is now made available by the LLVM project under our new license, so you will see that the license headers include that license only. Some of our contributors have contributed code under our old license, and accordingly, we have retained a copy of our old license notice in the top-level files in each project and repository.
  llvm-svn: 351636
* [MCA] Fix wrong definition of ResourceUnitMask in DefaultResourceStrategy. | Andrea Di Biagio | 2019-01-10 | 5 | -15/+36
  Field ResourceUnitMask was incorrectly defined as a 'const unsigned' mask. It should have been a 64-bit quantity instead. That means ResourceUnitMask was always implicitly truncated to a 32-bit quantity. This issue has been found by inspection. Surprisingly, that bug was latent, and it never negatively affected any existing upstream targets.

  This patch fixes the wrong definition of ResourceUnitMask, and adds a bunch of extra debug prints to help debugging potential issues related to invalid processor resource masks.
  llvm-svn: 350820
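  The truncation described in this commit is easy to reproduce in isolation. The helpers below are hypothetical and unrelated to the DefaultResourceStrategy code; they only show what happens when a 64-bit mask is stored in an `unsigned` field, and the sketch assumes a platform where `unsigned` is 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(unsigned) == 4,
              "this sketch assumes a 32-bit 'unsigned'");

// Store a 64-bit mask in an 'unsigned' field and read it back, as a field
// declared 'const unsigned' would do implicitly.
inline std::uint64_t roundTripThroughUnsigned(std::uint64_t Mask) {
  const unsigned Narrow = static_cast<unsigned>(Mask);
  return Narrow;
}

// True if no bit is lost by the narrowing conversion.
inline bool fitsInUnsigned(std::uint64_t Mask) {
  return roundTripThroughUnsigned(Mask) == Mask;
}

// Debug-style guard: narrow only when nothing would be truncated.
inline unsigned checkedNarrow(std::uint64_t Mask) {
  assert(fitsInUnsigned(Mask) && "mask would be truncated");
  return static_cast<unsigned>(Mask);
}
```

  A mask that only uses the low 32 bits survives the round trip unchanged, which is consistent with the commit's remark that the bug stayed latent for existing upstream targets.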
* [llvm-mca] Display masks in hex | Evandro Menezes | 2019-01-09 | 2 | -5/+6
  Display the resource masks as hexadecimal. Otherwise, NFC.
  llvm-svn: 350777
* [llvm-mca] Improve debugging (NFC) | Evandro Menezes | 2019-01-08 | 2 | -0/+4
  llvm-svn: 350661
* [MCA] Improved handling of in-order issue/dispatch resources. | Andrea Di Biagio | 2019-01-04 | 3 | -21/+15
  Added field 'MustIssueImmediately' to the instruction descriptor of instructions that only consume in-order issue/dispatch processor resources. This speeds up queries from the hardware Scheduler, and gives an average ~5% speedup on a release build. No functional change intended.
  llvm-svn: 350397
* [MCA] Store extra information about processor resources in the ResourceManager. | Andrea Di Biagio | 2019-01-04 | 1 | -14/+32
  Method ResourceManager::use() is responsible for updating the internal state of used processor resources, as well as notifying resource groups that contain used resources. Before this patch, method 'use()' didn't know how to quickly obtain the set of groups that contain a particular resource unit. It had to discover groups by performing a potentially slow search (done by iterating over the set of processor resource descriptors).

  With this patch, the relationship between resource units and groups is stored in the ResourceManager. That means method 'use()' no longer has to search for groups. This gives an average speedup of ~4-5% on a release build.

  This patch also adds extra code comments in ResourceManager.h to better describe the resource mask layout, and how resource indices are computed from resource masks.
  llvm-svn: 350387
* [MCA] Improve code comment and reuse a helper function in ResourceManager. NFCI | Andrea Di Biagio | 2019-01-03 | 1 | -9/+10
  llvm-svn: 350322
* [MCA] Minor refactoring of method DefaultResourceStrategy::select. NFCI | Andrea Di Biagio | 2019-01-02 | 1 | -18/+21
  Common code used by the default resource strategy to select pipeline resources has been moved to a helper function. The new selection logic has been slightly rewritten to get rid of a redundant zero check on the `ReadyMask` value.

  Before this patch, method select internally called function `PowerOf2Floor` to compute the next ready pipeline resource. However, `PowerOf2Floor` forces an implicit (redundant) zero check on the input value. By construction, `ReadyMask` can never be zero. This patch replaces the call to `PowerOf2Floor` with an equivalent block of code which avoids the redundant zero check. This gives a minor 3-3.5% speedup on a release build. No functional change intended.
  llvm-svn: 350218
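  The operation being optimized here is "largest power of two not greater than the mask", computed under the precondition that the mask is non-zero. The function below is a portable illustration of that contract and is not the code from the patch: `highestSetBit` is a hypothetical name, and the shift loop stands in for whatever bit-counting primitive the real implementation uses.

```cpp
#include <cassert>
#include <cstdint>

// Return the most significant set bit of Mask as a one-bit mask.
// Precondition: Mask != 0. The caller guarantees it, so the function does
// not need a dedicated branch to return a defined value for zero.
inline std::uint64_t highestSetBit(std::uint64_t Mask) {
  assert(Mask != 0 && "ReadyMask-style inputs are never zero");
  unsigned Index = 0;
  while (Mask >>= 1) // Count how many times the mask can be halved.
    ++Index;
  return std::uint64_t{1} << Index;
}
```

  The `assert` documents the precondition and compiles away in release builds, so the non-zero guarantee costs nothing at run time.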
* [llvm-mca] Dump mask in hex | Evandro Menezes | 2018-12-18 | 1 | -2/+4
  Dump the resource masks as hexadecimal.
  llvm-svn: 349536
* [MCA] Add support for BeginGroup/EndGroup. | Andrea Di Biagio | 2018-12-17 | 2 | -0/+10
  llvm-svn: 349354
* [MCA] Don't assume that createMCInstrAnalysis() always returns a valid pointer. | Andrea Di Biagio | 2018-12-17 | 1 | -8/+13
  Class InstrBuilder wrongly assumed that llvm targets were always able to return a non-null pointer when createMCInstrAnalysis() was called on them. This was causing crashes when simulating executions for targets that don't provide an MCInstrAnalysis object. This patch fixes the issue by making MCInstrAnalysis optional.
  llvm-svn: 349352
* [llvm-mca] Move llvm-mca library to llvm/lib/MCA. | Clement Courbet | 2018-12-17 | 20 | -0/+3194
  Summary: See PR38731.
  Reviewers: andreadb
  Subscribers: mgorny, javed.absar, tschuett, gbedwell, andreadb, RKSimon, llvm-commits
  Differential Revision: https://reviews.llvm.org/D55557
  llvm-svn: 349332