bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message (Collapse)	Author	Age	Files	Lines
*	[ARM][Thumb2] Fix ADD/SUB invalid writes to SP	Diogo Sampaio	2020-01-14	1	-0/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: This patch fixes pr23772 [ARM] r226200 can emit illegal thumb2 instruction: "sub sp, r12, #80". The violation was that SUB and ADD (reg, immediate) instructions can only write to SP if the source register is also SP. So the above instructions was unpredictable. To enforce that the instruction t2(ADD\|SUB)ri does not write to SP we now enforce the destination register to be rGPR (That exclude PC and SP). Different than the ARM specification, that defines one instruction that can read from SP, and one that can't, here we inserted one that can't write to SP, and other that can only write to SP as to reuse most of the hard-coded size optimizations. When performing this change, it uncovered that emitting Thumb2 Reg plus Immediate could not emit all variants of ADD SP, SP #imm instructions before so it was refactored to be able to. (see test/CodeGen/Thumb2/mve-stacksplot.mir where we use a subw sp, sp, Imm12 variant ) It also uncovered a disassembly issue of adr.w instructions, that were only written as SUBW instructions (see llvm/test/MC/Disassembler/ARM/thumb2.txt). Reviewers: eli.friedman, dmgreen, carwil, olista01, efriedma, andreadb Reviewed By: efriedma Subscribers: gbedwell, john.brawn, efriedma, ostannard, kristof.beyls, hiraditya, dmgreen, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D70680
*	[X86] AMD Znver2 (Rome) Scheduler enablement	Ganesh Gopalasubramanian	2020-01-10	55	-6/+12598
\| \| \| \| \| \| \| \| \| \| \| \|	The patch gives out the details of the znver2 scheduler model. There are few improvements with respect to execution units, latencies and throughput when compared with znver1. The tests that were present for znver1 for llvm-mca tool were replicated. The latencies, execution units, timeline and throughput information are updated for znver2. Reviewers: craig.topper, Simon Pilgrim Differential Revision: https://reviews.llvm.org/D66088
*	[MCA] Fix test cases (NFC)	Evandro Menezes	2019-11-22	10	-30/+30
\| \| \| \|	Fix the test cases for Exynos M5 that break under Darwin.
*	[AArch64] Add the pipeline model for Exynos M5	Evandro Menezes	2019-11-22	28	-34/+2335
\| \| \| \|	Add the scheduling and cost models for Exynos M5.
*	Revert "[AArch64] Add the pipeline model for Exynos M5"	Eric Christopher	2019-11-20	28	-2335/+34
\| \| \| \| \| \|	as it's causing test failures in llvm-mca. This reverts commit 9bdfee2a3bd13d405ce1592930182f23849d2897.
*	[AArch64] Add the pipeline model for Exynos M5	Evandro Menezes	2019-11-20	28	-34/+2335
\| \| \| \|	Add the scheduling and cost models for Exynos M5.
*	[X86] Fix SLM v2i64 ADD/Sub/CMPEQ instruction schedules	Simon Pilgrim	2019-11-06	2	-14/+14
\| \| \| \| \| \|	Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2i64 ops. Numbers taken from Intel AOM (+ checked against Agner)
*	[X86] Fix SLM v2f64 ADD/MUL + FP BLEND/HADD instruction schedules	Simon Pilgrim	2019-11-06	3	-43/+43
\| \| \| \|	Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2f64/v2i64 ops.
*	[mca] Fix test case (NFC)	Evandro Menezes	2019-10-31	1	-6/+6
\| \| \| \|	Fix test case for Darwin builds.
*	[AArch64] Update for Exynos	Evandro Menezes	2019-10-31	1	-0/+71
\| \| \| \|	Fix the costs of `add` and `orr` with an immediate operand.
*	[clang][llvm] Obsolete Exynos M1 and M2	Evandro Menezes	2019-10-30	4	-46/+0
\|
*	[X86][BtVer2] Improved latency and throughput of float/vector loads and stores.	Andrea Di Biagio	2019-10-14	7	-59/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch introduces the following changes to the btver2 scheduling model: - The number of micro opcodes for YMM loads and stores is now 2 (it was incorrectly set to 1 for both aligned and misaligned loads/stores). - Increased the number of AGU resource cycles for YMM loads and stores to 2cy (instead of 1cy). - Removed JFPU01 and JFPX from the list of resources consumed by pure float/vector loads (no MMX). I verified with llvm-exegesis that pure XMM/YMM loads are no-pipe. Those are dispatched to the FPU but not really issues on JFPU01. Differential Revision: https://reviews.llvm.org/D68871 llvm-svn: 374765
*	[MCA] Show aggregate over Average Wait times for the whole snippet (PR43219)	Roman Lebedev	2019-10-10	150	-0/+296
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: As disscused in https://bugs.llvm.org/show_bug.cgi?id=43219, i believe it may be somewhat useful to show //some// aggregates over all the sea of statistics provided. Example: ``` Average Wait times (based on the timeline view): [0]: Executions [1]: Average time spent waiting in a scheduler's queue [2]: Average time spent waiting in a scheduler's queue while ready [3]: Average time elapsed from WB until retire stage [0] [1] [2] [3] 0. 3 1.0 1.0 4.7 vmulps %xmm0, %xmm1, %xmm2 1. 3 2.7 0.0 2.3 vhaddps %xmm2, %xmm2, %xmm3 2. 3 6.0 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 3 3.2 0.3 2.3 <total> ``` I.e. we average the averages. Reviewers: andreadb, mattd, RKSimon Reviewed By: andreadb Subscribers: gbedwell, arphaman, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68714 llvm-svn: 374361
*	[MCA][LSUnit] Track loads and stores until retirement.	Andrea Di Biagio	2019-10-08	3	-75/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before this patch, loads and stores were only tracked by their corresponding queues in the LSUnit from dispatch until execute stage. In practice we should be more conservative and assume that memory opcodes leave their queues at retirement stage. Basically, loads should leave the load queue only when they have completed and delivered their data. We conservatively assume that a load is completed when it is retired. Stores should be tracked by the store queue from dispatch until retirement. In practice, stores can only leave the store queue if their data can be written to the data cache. This is mostly a mechanical change. With this patch, the retire stage notifies the LSUnit when a memory instruction is retired. That would triggers the release of LDQ/STQ entries. The only visible change is in memory tests for the bdver2 model. That is because bdver2 is the only model that defines the load/store queue size. This patch partially addresses PR39830. Differential Revision: https://reviews.llvm.org/D68266 llvm-svn: 374034
*	[llvm-mca] Add a -mattr flag	David Green	2019-10-01	1	-0/+29
\| \| \| \| \| \| \| \| \|	This adds a -mattr flag to llvm-mca, for cases where the -mcpu option does not contain all optional features. Differential Revision: https://reviews.llvm.org/D68190 llvm-svn: 373358
*	[MCA] Improved cost computation for loop carried dependencies in the ↵	Andrea Di Biagio	2019-09-19	1	-8/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bottleneck analysis. This patch introduces a cut-off threshold for dependency edge frequences with the goal of simplifying the critical sequence computation. This patch also removes the cost normalization for loop carried dependencies. We didn't really need to artificially amplify the cost of loop-carried dependencies since it is already computed as the integral over time of the delay (in cycle). In the absence of backend stalls there is no need for computing a critical sequence. With this patch we early exit from the critical sequence computation if no bottleneck was reported during the simulation. llvm-svn: 372337
*	[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.	Andrea Di Biagio	2019-09-02	2	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On BtVer2 conditional SIMD stores are heavily microcoded. The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit. Only a minority of the uOPs is executed by the FPU. The observed behaviour on the FPU looks similar to this: - The input MASK value is moved to the Integer Unit -- [ a VMOVMSK-like uOP-executed on JFPU0]. - In parallel, each element of the input XMM/YMM is extracted and then sent to the IntegerUnit through JFPU1. As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as-if a "LOAD - BIT_EXTRACT- CMOV" sequence of uOPs is repeated by the integer unit for every contionally stored element. VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means, extra shifts and masking is performed on (one of) the loaded quadwords before each conditional store (that also explains the big number of non-FP uOPs retired). This patch replaces the existing writes for conditional SIMD stores (i.e. WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes: WriteFMaskedStore32 [ XMM Packed Single ] WriteFMaskedStore32Y [ YMM Packed Single ] WriteFMaskedStore64 [ XMM Packed Double ] WriteFMaskedStore64Y [ YMM Packed Double ] Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe both RM and MR variants for conditional SIMD moves in a single tablegen definition. Instances of that class are then passed in input to multiclass avx_movmask_rm when constructing MASKMOVPS/PD definitions. Since this patch introduces new writes, I had to update all the X86 scheduling models. Differential Revision: https://reviews.llvm.org/D66801 llvm-svn: 370649
*	[X86][BtVer2] Add a read-advance to every implicit register use of ↵	Andrea Di Biagio	2019-08-23	1	-0/+304
\| \| \| \| \| \| \| \| \| \| \| \|	CMPXCHG8B/16B. This is a follow up of r369642. This patch assigns a ReadAfterLd to every implicit register use of instruction CMPXCHG8B and instruction CMPXCHG16B. Perf micro-benchmarks show that implicit registers are read after 3cy from the start of execution. llvm-svn: 369750
*	[X86][BtVer2] Fix latency of ALU RMW instructions.	Andrea Di Biagio	2019-08-23	1	-222/+222
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Excluding ADC/SBB and the bit-test instructions (BTR/BTS/BTC), the observed latency of all other RMW integer arithmetic/logic instructions is 6cy and not 5cy. Example (ADD): ``` addb $0, (%rsp) # Latency: 6cy addb $7, (%rsp) # Latency: 6cy addb %sil, (%rsp) # Latency: 6cy addw $0, (%rsp) # Latency: 6cy addw $511, (%rsp) # Latency: 6cy addw %si, (%rsp) # Latency: 6cy addl $0, (%rsp) # Latency: 6cy addl $511, (%rsp) # Latency: 6cy addl %esi, (%rsp) # Latency: 6cy addq $0, (%rsp) # Latency: 6cy addq $511, (%rsp) # Latency: 6cy addq %rsi, (%rsp) # Latency: 6cy ``` The same latency profile applies to SUB/AND/OR/XOR/INC/DEC. The observed latency of ADC/SBB is 7-8cy. So we need a different write to model those. Latency of BTS/BTR/BTC is not fixed by this patch (they are much slower than what the model for btver2 currently reports). Differential Revision: https://reviews.llvm.org/D66636 llvm-svn: 369748
*	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	Andrea Di Biagio	2019-08-22	15	-246/+247
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Single operand MUL instructions that implicitly set EAX have the following latency/throughput profile (see below): imul %cl # latency: 3cy - uOPs: 1 - 1 JMul imul %cx # latency: 3cy - uOPs: 3 - 3 JMul imul %ecx # latency: 3cy - uOPs: 2 - 2 JMul imul %rcx # latency: 6cy - uOPs: 2 - 4 JMul mul %cl # latency: 3cy - uOPs: 1 - 1 JMul mul %cx # latency: 3cy - uOPs: 3 - 3 JMul mul %ecx # latency: 3cy - uOPs: 2 - 2 JMul mul %rcx # latency: 6cy - uOPs: 2 - 4 JMul Excluding the 64bit variant, which has a latency of 6cy, every other instruction has a latency of 3cy. However, the number of decoded macro-opcodes (as well as the resource cyles) depend on the MUL size. The two operand MULs have a more predictable profile (see below): imul %dx, %dx # latency: 3cy - uOPs: 1 - 1 JMul imul %edx, %edx # latency: 3cy - uOPs: 1 - 1 JMul imul %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul imul $3, %dx, %dx # latency: 4cy - uOPs: 2 - 2 JMul imul $3, %ecx, %ecx # latency: 3cy - uOPs: 1 - 1 JMul imul $3, %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul This patch updates the values in the Jaguar scheduling model and regenerates llvm-mca tests. Differential Revision: https://reviews.llvm.org/D66547 llvm-svn: 369661
*	[X86][BtVer2] Fix latency and throughput of XCHG and XADD.	Andrea Di Biagio	2019-08-22	3	-55/+328
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum throughput for XCHG is 1 IPC. The byte exchange has worse latency and decodes to 1 extra uOP; maximum observed throughput is 0.5 IPC. ``` xchgb %cl, %dl # Latency: 2cy - uOPs: 3 - 2 ALU xchgw %cx, %dx # Latency: 1cy - uOPs: 2 - 2 ALU xchgl %ecx, %edx # Latency: 1cy - uOPs: 2 - 2 ALU xchgq %rcx, %rdx # Latency: 1cy - uOPs: 2 - 2 ALU ``` The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy. The resource usage is similar to the XCHGrr variants. The biggest difference is obviously the bus-locking, which prevents the LS to issue other memory uOPs in parallel until the unlocking store uOP is executed. ``` xchgb %cl, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgw %cx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgl %ecx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgq %rcx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy ``` The exchanged in/out register operand becomes available after 11cy from the start of execution. Added test xchg.s to verify that we correctly see that register write committed in 11cy (and not 16cy). Reg-reg XADD instructions have the same latency/throughput than the byte exchange (register-register variant). ``` xaddb %cl, %dl # latency: 2cy - uOPs: 3 - 3 ALU xaddw %cx, %dx # latency: 2cy - uOPs: 3 - 3 ALU xaddl %ecx, %edx # latency: 2cy - uOPs: 3 - 3 ALU xaddq %rcx, %rdx # latency: 2cy - uOPs: 3 - 3 ALU ``` The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register operand becomes available in 3cy (it matches the 'load-to-use latency'). ``` xaddb %cl, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddw %cx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddl %ecx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddq %rcx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU ``` The atomic XADD variants execute in 16cy. The in/out register operand is available after 11cy from the start of execution. ``` lock xaddb %cl, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddw %cx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddl %ecx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddq %rcx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy ``` Added test xadd.s to verify those latencies as well as read-advance values. Differential Revision: https://reviews.llvm.org/D66535 llvm-svn: 369642
*	[X86][BtVer2] Use ReadAfterLd entries for the register operands of CMPXCHG.	Andrea Di Biagio	2019-08-20	1	-0/+286
\| \| \| \| \| \|	This is a follow-up of r369365. llvm-svn: 369412
*	[X86][BtVer2] Fix latency and throughput of atomic INC/DEC/NEG/NOT.	Andrea Di Biagio	2019-08-20	1	-33/+33
\| \| \| \| \| \| \| \| \|	Latency and throughput of LOCK INC/DEC/NEG/NOT is always 19cy. Number of uOPs is still 1. Differential Revision: https://reviews.llvm.org/D66469 llvm-svn: 369388
*	[MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops	Simon Pilgrim	2019-08-20	12	-24/+4596
\| \| \| \| \| \|	D66424 adds the base support for LOCK so we should be able to add special case support for all these cases in future patches llvm-svn: 369367
*	[X86][Btver2] Fix latency and throughput of CMPXCHG instructions.	Andrea Di Biagio	2019-08-20	2	-34/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC. Throughput is superiorly limited to 0.33 because of the implicit in/out dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the same memory location, store-to-load forwarding occurs and values for sequent loads are quickly forwarded from the store buffer. Interestingly, the functionality in LLVM that computes the reciprocal throughput doesn't seem to know about RMW instructions. That functionality only looks at the "consumed resource cycles" for the throughput computation. It should be fixed/improved by a future patch. In particular, for RMW instructions, that logic should also take into account for the write latency of in/out register operands. An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to ~17cy/inst due to cache locking, which prevents other memory uOPs to start executing before the "lock releasing" store uOP. CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less macro opcode. Their latency tend to be the same as the other RR/RM variants. RR variants are relatively fast 3cy (but still microcoded - 5 macro opcodes). CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load forwarding. That means, throughput is clearly limited by the in/out dependency on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs for the Integer pipes). I have reused the same mix of consumed resource from the other CMPXCHG instructions for CMPXCHG8B too. LOCK CMPXCHG8B is instead 18cycles. CMPXCHG16B is 32cycles. Up to 38cycles when the LOCK prefix is specified. Due to the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38) cycles dependeing on whether the LOCK prefix is specified or not. I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x microcode from CMPXCHG8B. So, I have speculatively set the JALU01 consumption to 2x the resource cycles used for CMPXCHG8B. The two new hasLockPrefix() functions are used by the btver2 scheduling model check if a MCInst/MachineInst has a LOCK prefix. Calls to hasLockPrefix() have been encoded in predicates of variant scheduling classes that describe lat/thr of CMPXCHG. Differential Revision: https://reviews.llvm.org/D66424 llvm-svn: 369365
*	[X86] Move scheduling tests for CMPXCHG to the corresponding ↵	Andrea Di Biagio	2019-08-19	24	-492/+168
\| \| \| \| \| \| \| \| \| \|	resources-x86_64.s files. NFC In D66424 it has been requested to move all the new tests added by r369278 into resources-x86_64.s. That is because only the 8b/16 ops should be tested by resources-cmpxchg.s. This partially reverts r369278. llvm-svn: 369288
*	[X86] Added extensive scheduling model tests for all the CMPXCHG variants. NFC	Andrea Di Biagio	2019-08-19	12	-12/+552
\| \| \| \| \| \|	Addresses a review comment in D66424 llvm-svn: 369279
*	[MCA] Add flag -show-encoding to llvm-mca.	Andrea Di Biagio	2019-08-09	1	-0/+77
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flag -show-encoding enables the printing of instruction encodings as part of the the instruction info view. Example (with flags -mtriple=x86_64-- -mcpu=btver2): Instruction Info: [1]: #uOps [2]: Latency [3]: RThroughput [4]: MayLoad [5]: MayStore [6]: HasSideEffects (U) [7]: Encoding Size [1] [2] [3] [4] [5] [6] [7] Encodings: Instructions: 1 2 1.00 4 c5 f0 59 d0 vmulps %xmm0, %xmm1, %xmm2 1 4 1.00 4 c5 eb 7c da vhaddps %xmm2, %xmm2, %xmm3 1 4 1.00 4 c5 e3 7c e3 vhaddps %xmm3, %xmm3, %xmm4 In this example, column Encoding Size is the size in bytes of the instruction encoding. Column Encodings reports the actual instruction encodings as byte sequences in hex (objdump style). The computation of encodings is done by a utility class named mca::CodeEmitter. In future, I plan to expose the CodeEmitter to the instruction builder, so that information about instruction encoding sizes can be used by the simulator. That would be a first step towards simulating the throughput from the decoders in the hardware frontend. Differential Revision: https://reviews.llvm.org/D65948 llvm-svn: 368432
*	[X86] Limit vpermil2pd/vpermil2ps immediates to 4 bits in the assembly parser.	Craig Topper	2019-08-07	2	-12/+12
\| \| \| \| \| \| \| \| \| \|	The upper 4 bits of the immediate byte are used to encode a register. We need to limit the explicit immediate to fit in the remaining 4 bits. Fixes PR42899. llvm-svn: 368123
*	[MCA] Add support for printing immedate values as hex. Also enable lexing of ↵	Andrea Di Biagio	2019-08-02	2	-0/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	masm binary and hex literals. This patch adds a new llvm-mca flag named -print-imm-hex. By default, the instruction printer prints immediate operands as decimals. Flag -print-imm-hex enables the instruction printer to print those operands in hex. This patch also adds support for MASM binary and hex literal numbers (example 0FFh, 101b). Added tests to verify the behavior of the new flag. Tests also verify that masm numeric literal operands are now recognized. Differential Revision: https://reviews.llvm.org/D65588 llvm-svn: 367671
*	Set an explicit x86 triple for test bottleneck-analysis.s added by my ↵	Andrea Di Biagio	2019-06-21	1	-1/+1
\| \| \| \| \| \| \| \|	r364045. NFC This should unbreak the ppc64 buildbots. llvm-svn: 364048
*	[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of ↵	Andrea Di Biagio	2019-06-21	8	-0/+321
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	instructions based on the simulation. This patch teaches the bottleneck analysis how to identify and print the most expensive sequence of instructions according to the simulation. Fixes PR37494. The goal is to help users identify the sequence of instruction which is most critical for performance. A dependency graph is internally used by the bottleneck analysis to describe data dependencies and processor resource interferences between instructions. There is one node in the graph for every instruction in the input assembly sequence. The number of nodes in the graph is independent from the number of iterations simulated by the tool. It means that a single node of the graph represents all the possible instances of a same instruction contributed by the simulated iterations. Edges are dynamically "discovered" by the bottleneck analysis by observing instruction state transitions and "backend pressure increase" events generated by the Execute stage. Information from the events is used to identify critical dependencies, and materialize edges in the graph. A dependency edge is uniquely identified by a pair of node identifiers plus an instance of struct DependencyEdge::Dependency (which provides more details about the actual dependency kind). The bottleneck analysis internally ranks dependency edges based on their impact on the runtime (see field DependencyEdge::Dependency::Cost). To this end, each edge of the graph has an associated cost. By default, the cost of an edge is a function of its latency (in cycles). In practice, the cost of an edge is also a function of the number of cycles where the dependency has been seen as 'contributing to backend pressure increases'. The idea is that the higher the cost of an edge, the higher is the impact of the dependency on performance. To put it in another way, the cost of an edge is a measure of criticality for performance. Note how a same edge may be found in multiple iteration of the simulated loop. The logic that adds new edges to the graph checks if an equivalent dependency already exists (duplicate edges are not allowed). If an equivalent dependency edge is found, field DependencyEdge::Frequency of that edge is incremented by one, and the new cost is cumulatively added to the existing edge cost. At the end of simulation, costs are propagated to nodes through the edges of the graph. The goal is to identify a critical sequence from a node of the root-set (composed by node of the graph with no predecessors) to a 'sink node' with no successors. Note that the graph is intentionally kept acyclic to minimize the complexity of the critical sequence computation algorithm (complexity is currently linear in the number of nodes in the graph). The critical path is finally computed as a sequence of dependency edges. For edges describing processor resource interferences, the view also prints a so-called "interference probability" value (by dividing field DependencyEdge::Frequency by the total number of iterations). Examples of critical sequence computations can be found in tests added/modified by this patch. On output streams that support colored output, instructions from the critical sequence are rendered with a different color. Strictly speaking the analysis conducted by the bottleneck analysis view is not a critical path analysis. The cost of an edge doesn't only depend on the dependency latency. More importantly, the cost of a same edge may be computed differently by different iterations. The number of dependencies is discovered dynamically based on the events generated by the simulator. However, their number is not fixed. This is especially true for edges that model processor resource interferences; an interference may not occur in every iteration. For that reason, it makes sense to also print out a "probability of interference". By construction, the accuracy of this analysis (as always) is strongly dependent on the simulation (and therefore the quality of the information available in the scheduling model). That being said, the critical sequence effectively identifies a performance criticality. Instructions from that sequence are expected to have a very big impact on performance. So, users can take advantage of this information to focus their attention on specific interactions between instructions. In my experience, it works quite well in practice, and produces useful output (in a reasonable amount time). Differential Revision: https://reviews.llvm.org/D63543 llvm-svn: 364045
*	Fix r363773: Update Barcelona MCA tests.	Clement Courbet	2019-06-19	1	-2/+2
\| \| \| \|	llvm-svn: 363781
*	[NFC][X86][MCA] Barcelona: add load/store/load-store-throughput tests	Roman Lebedev	2019-06-19	3	-0/+1855
\| \| \| \|	llvm-svn: 363775
*	[NFC][X86][MCA] BdVer2: add load-store-throughput test	Roman Lebedev	2019-06-19	1	-0/+736
\| \| \| \|	llvm-svn: 363774
*	[X86] Add missing properties on llvm.x86.sse.{st,ld}mxcsr	Clement Courbet	2019-06-19	22	-46/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: llvm.x86.sse.stmxcsr only writes to memory. llvm.x86.sse.ldmxcsr only reads from memory, and might generate an FPE. Reviewers: craig.topper, RKSimon Subscribers: llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62896 llvm-svn: 363773
*	[lit] Delete empty lines at the end of lit.local.cfg NFC	Fangrui Song	2019-06-17	3	-3/+0
\| \| \| \|	llvm-svn: 363538
*	[NFC][MCA][X86] Add one more 'clear super register' pattern - movss/movsd ↵	Roman Lebedev	2019-06-15	2	-0/+230
\| \| \| \| \| \|	load clears high XMM bits llvm-svn: 363498
*	[NFC][MCA][X86] Add baseline test coverage for AMD Barcelona (aka K10, fam10h)	Roman Lebedev	2019-06-15	48	-120/+8928
\| \| \| \| \| \|	Looking into sched model for that CPU ... llvm-svn: 363497
*	[llvm-mca] Enable bottleneck analysis when flag -all-views is specified.	Andrea Di Biagio	2019-06-10	4	-1/+22
\| \| \| \| \| \| \|	Bottleneck Analysis is one of the many views available in llvm-mca. Therefore, it should be enabled when flag -all-views is passed in input to the tool. llvm-svn: 362964
*	[MCA][Scheduler] Improved critical memory dependency computation.	Andrea Di Biagio	2019-05-26	1	-1/+1
\| \| \| \| \| \| \| \|	This fixes a problem where back-pressure increases caused by register dependencies were not correctly notified if execution was also delayed by memory dependencies. llvm-svn: 361740
*	[X86] Add zero idioms to the haswell, broadwell, and skylake schedule ↵	Craig Topper	2019-05-25	5	-1405/+1405
\| \| \| \| \| \| \| \| \| \|	models. Add 256-bit fp xor to sandybridge zero idioms This copies the Sandy Bridge zero idiom support to later CPUs. Adding the AVX2 and AVX512F/VL instructions as appropriate. Differential Revision: https://reviews.llvm.org/D62360 llvm-svn: 361690
*	[X86][llvm-mca] Add zero idiom tests for Intel CPUs. NFC	Craig Topper	2019-05-25	5	-44/+2296
\| \| \| \| \| \|	This pre-commits tests for D62360 llvm-svn: 361689
*	[MCA] Add support for nested and overlapping region markers	Andrea Di Biagio	2019-05-09	7	-4/+265
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes PR41523 https://bugs.llvm.org/show_bug.cgi?id=41523 Regions can now nest/overlap provided that they have different names. Anonymous regions cannot overlap. Region end markers must specify the region name. The only exception is for when there is only one user-defined region; in that particular case, the region end marker doesn't need to specify a name. Incorrect region end markers are no longer ignored. Instead, the tool reports an error and we exit with an error code. Added test cases to verify the new diagnostic error messages. Updated the llvm-mca docs to reflect this feature change. Differential Revision: https://reviews.llvm.org/D61676 llvm-svn: 360351
*	[X86] AMD Piledriver (BdVer2): major cleanup (mainly inverse throughput)	Roman Lebedev	2019-05-09	81	-6232/+6237
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I've started this cleanup more several times now, but got sidetracked elsewhere, e.g. by llvm-exegesis problems. Not this time, finally! This is mainly cleaning up the inverse throughput values, and a few latencies/uops, based on the llvm-exegesis measured values. Though this is not complete by any means, there's certainly more cleanup to be done. The performance numbers (i've only checked by RawSpeed benchmark) aren't really surprising - overall this slightly (< -1%) improves perf. llvm-svn: 360341
*	[MCA] Don't add a name to the default code region.	Andrea Di Biagio	2019-05-08	1	-1/+1
\| \| \| \| \| \|	This is done in preparation for a patch that fixes PR41523. llvm-svn: 360243
*	[X86] Remove the suffix on vcvt[u]si2ss/sd register variants in assembly ↵	Craig Topper	2019-05-06	33	-176/+176
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	printing. We require d/q suffixes on the memory form of these instructions to disambiguate the memory size. We don't require it on the register forms, but need to support parsing both with and without it. Previously we always printed the d/q suffix on the register forms, but it's redundant and inconsistent with gcc and objdump. After this patch we should support the d/q for parsing, but not print it when its unneeded. llvm-svn: 360085
*	[llvm-mca][x86] Fix MMX PMOVMSKB test	Simon Pilgrim	2019-04-29	11	-33/+33
\| \| \| \| \| \|	This is defined as part of SSE1, XMM PMOVMSKB doesn't appear until SSE2 llvm-svn: 359477
*	[MCA] Fix typo in AVX2 gather tests. NFC	Andrea Di Biagio	2019-04-28	6	-18/+18
\| \| \| \|	llvm-svn: 359397
*	[MCA] Remove wrong comments from a test. NFC	Andrea Di Biagio	2019-04-11	1	-3/+0
\| \| \| \|	llvm-svn: 358160