bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message	Author	Age	Files	Lines
*	[X86][BtVer2] Improved latency and throughput of float/vector loads and stores.	Andrea Di Biagio	2019-10-14	1	-34/+34
	This patch introduces the following changes to the btver2 scheduling model:
	 - The number of micro opcodes for YMM loads and stores is now 2 (it was incorrectly set to 1 for both aligned and misaligned loads/stores).
	 - Increased the number of AGU resource cycles for YMM loads and stores to 2cy (instead of 1cy).
	 - Removed JFPU01 and JFPX from the list of resources consumed by pure float/vector loads (no MMX).
	I verified with llvm-exegesis that pure XMM/YMM loads are no-pipe. Those are dispatched to the FPU but not really issued on JFPU01. Differential Revision: https://reviews.llvm.org/D68871 llvm-svn: 374765
*	[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.	Andrea Di Biagio	2019-09-02	1	-11/+11
	On BtVer2 conditional SIMD stores are heavily microcoded. The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit. Only a minority of the uOPs is executed by the FPU. The observed behaviour on the FPU looks similar to this:
	 - The input MASK value is moved to the Integer Unit -- [a VMOVMSK-like uOP executed on JFPU0].
	 - In parallel, each element of the input XMM/YMM is extracted and then sent to the Integer Unit through JFPU1.
	As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element.
	VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means extra shifts and masking are performed on (one of) the loaded quadwords before each conditional store (that also explains the big number of non-FP uOPs retired).
	This patch replaces the existing writes for conditional SIMD stores (i.e. WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes:
	  WriteFMaskedStore32  [ XMM Packed Single ]
	  WriteFMaskedStore32Y [ YMM Packed Single ]
	  WriteFMaskedStore64  [ XMM Packed Double ]
	  WriteFMaskedStore64Y [ YMM Packed Double ]
	Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe both RM and MR variants for conditional SIMD moves in a single tablegen definition. Instances of that class are then passed as input to multiclass avx_movmask_rm when constructing MASKMOVPS/PD definitions.
	Since this patch introduces new writes, I had to update all the X86 scheduling models.
	Differential Revision: https://reviews.llvm.org/D66801 llvm-svn: 370649
*	[X86] Add missing properties on llvm.x86.sse.{st,ld}mxcsr	Clement Courbet	2019-06-19	1	-2/+2
	Summary: llvm.x86.sse.stmxcsr only writes to memory. llvm.x86.sse.ldmxcsr only reads from memory, and might generate an FPE. Reviewers: craig.topper, RKSimon Subscribers: llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62896 llvm-svn: 363773
*	[X86] Remove the suffix on vcvt[u]si2ss/sd register variants in assembly ↵	Craig Topper	2019-05-06	1	-8/+8
	printing.
	We require d/q suffixes on the memory form of these instructions to disambiguate the memory size. We don't require it on the register forms, but need to support parsing both with and without it. Previously we always printed the d/q suffix on the register forms, but it's redundant and inconsistent with gcc and objdump. After this patch we should support the d/q for parsing, but not print it when it's unneeded. llvm-svn: 360085
*	[X86] Remove the _alt forms of (V)CMP instructions. Use a combination of ↵	Craig Topper	2019-03-18	1	-24/+24
	custom printing and custom parsing to achieve the same result and more
	Similar to the previous change done for VPCOM and VPCMP. Differential Revision: https://reviews.llvm.org/D59468 llvm-svn: 356384
*	[X86][Btver2] Improved latency/throughput model for scalar int-to-float ↵	Andrea Di Biagio	2019-01-29	1	-8/+8
	conversions.
	Account for bypass delays when computing the latency of scalar int-to-float conversions. On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG). This patch also fixes the number of micro opcodes for the register-memory variants of scalar int-to-float conversions. Differential Revision: https://reviews.llvm.org/D57148 llvm-svn: 352518
*	[X86][BtVer2] SSE2 vector shifts has local forwarding disabled	Simon Pilgrim	2019-01-22	1	-24/+24
	Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops have local forwarding disabled, adding +1cy to the use latency for the result. Differential Revision: https://reviews.llvm.org/D57026 llvm-svn: 351817
*	[X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled	Simon Pilgrim	2019-01-22	1	-8/+8
	Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops have local forwarding disabled, adding +1cy to the use latency for the result. Differential Revision: https://reviews.llvm.org/D57022 llvm-svn: 351815
*	[X86][BtVer2] Update the WriteLoad latency.	Andrea Di Biagio	2019-01-21	1	-1/+1
	r327630 introduced new write definitions for float/vector loads. Before that revision, WriteLoad was used by both integer/float (scalar/vector) loads. So, WriteLoad had to conservatively declare a latency of 5cy. That is because the load-to-use latency for float/vector loads is 5cy. Now that we have dedicated writes for float/vector loads, there is no reason why we should keep the latency of WriteLoad at 5cy. At the moment, WriteLoad is only used by scalar integer loads; we can assume an optimistic 3cy latency for them. This patch changes that latency from 5cy to 3cy, and regenerates the affected scheduling/mca tests. Differential Revision: https://reviews.llvm.org/D56922 llvm-svn: 351742
*	[X86][BtVer2] Update latency of horizontal operations.	Andrea Di Biagio	2019-01-16	1	-28/+28
	On Jaguar, horizontal adds/subs have local forwarding disabled. That means we pay a compulsory extra cycle of write-back stage, and the value is not available until the end of that stage. This patch changes the latency of horizontal operations by adding an extra cycle. With this patch, latency numbers now match what is reported by perf. I plan to send another patch to also 'fix' the latency of shuffle operations (on Jaguar, local forwarding is disabled for vector shuffles too). Differential Revision: https://reviews.llvm.org/D56777 llvm-svn: 351366
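A minimal sketch of the latency adjustment described in the commit above, assuming a hypothetical base latency of 3 cycles (the real per-instruction values live in the btver2 scheduling model and are not reproduced here). The function name and parameters are illustrative only.

```python
def use_latency(base_latency: int, local_forwarding: bool) -> int:
    """Cycles before a dependent instruction can consume the result.

    When local forwarding is disabled, the value only becomes available
    at the end of the write-back stage, which costs one extra cycle.
    """
    writeback_penalty = 0 if local_forwarding else 1
    return base_latency + writeback_penalty


# Hypothetical base latency of 3cy, with and without local forwarding.
print(use_latency(3, local_forwarding=True))
print(use_latency(3, local_forwarding=False))
```

The same +1cy rule is what the later VPERMILPV and SSE2 shift commits apply to their own instruction groups.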
*	[X86][Btver2] Fix BLENDV and AESDEC schedules	Simon Pilgrim	2018-10-02	1	-19/+19
	Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343597
*	[X86][Btver2] Fix masked load schedule	Simon Pilgrim	2018-10-01	1	-5/+5
	JFPU01 resource usage should match JFPX. Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343468
*	[LLVM-MCA][X86] Add missing VCMPESTR/VCMPESTR tests	Simon Pilgrim	2018-09-30	1	-1/+29
	llvm-svn: 343421
*	[X86][Btver2] CVTSS2I/CVTSD2I - add missing JFPU0 pipe	Simon Pilgrim	2018-09-28	1	-17/+17
	We issue JFPU1->JSTC then JFPU0->JFPA then -> JALU0 (integer pipe). Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343314
*	[X86][BtVer2] Fix PHMINPOS schedule resources typo	Simon Pilgrim	2018-09-28	1	-4/+4
	PHMINPOS can run on either JFPU pipe llvm-svn: 343299
*	[X86][Btver2] (V)MPSADBW instructions take 3uops not 1	Simon Pilgrim	2018-09-27	1	-2/+2
	llvm-svn: 343238
*	[X86][BtVer2] Fix WriteFShuffle256 schedule write info.	Andrea Di Biagio	2018-08-31	1	-11/+11
	This patch fixes the number of micro opcodes, and processor resource cycles for the following AVX instructions:
	  vinsertf128rr/rm
	  vperm2f128rr/rm
	  vbroadcastf128
	Tests have been regenerated using the usual scripts in the llvm/utils directory. Differential Revision: https://reviews.llvm.org/D51492 llvm-svn: 341185
*	[X86] Fix MayLoad/HasSideEffect flag for (V)MOVLPSrm instructions.	Andrea Di Biagio	2018-07-11	1	-1/+1
	Before revision 336728, the "mayLoad" flag for instruction (V)MOVLPSrm was inferred directly from the "default" pattern associated with the instruction definition. r336728 removed special node X86Movlps, and all the patterns associated with it. Now instruction (V)MOVLPSrm doesn't have a pattern associated with it, and the 'mayLoad/hasSideEffects' flags are left unset. When the instruction info is emitted by tablegen, method CodeGenDAGPatterns::InferInstructionFlags() sees that (V)MOVLPSrm doesn't have a pattern, and flags are undefined. So, it conservatively sets the "hasSideEffects" flag for it. As a consequence, we were losing the 'mayLoad' flag, and we were gaining a 'hasSideEffect' flag in its place. This patch fixes the issue (originally reported by Michael Holmen). The mca tests show the differences in the instruction info flags. Instructions that were affected by this problem were: MOVLPSrm/VMOVLPSrm/VMOVLPSZ128rm. Differential Revision: https://reviews.llvm.org/D49182 llvm-svn: 336818
*	[llvm-mca] Use a different character to flag instructions with side-effects ↵	Andrea Di Biagio	2018-07-11	1	-7/+7
	in the Instruction Info View. NFC
	This makes it easier to identify changes in the instruction info flags. It also helps spot potential regressions similar to the one recently introduced at r336728. Using the same character to mark MayLoad/MayStore/HasSideEffects is problematic for llvm-lit. When pattern matching substrings, llvm-lit consumes tabs and spaces. A change in position of the flag marker may not trigger a test failure. This patch only changes the character used for flag `hasSideEffects`. I didn't touch the other flags because I want to avoid spamming the mailing list with the massive diff due to the numerous tests affected by this change. In future, each instruction flag should be associated with a different character in the Instruction Info View. llvm-svn: 336797
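The whitespace problem described in the commit above can be illustrated with a small sketch. It assumes a matcher that collapses runs of tabs and spaces before comparing, which is how the commit message describes the test tooling; the two report rows, the column layout, and the replacement marker `U` are hypothetical stand-ins, not copied from real llvm-mca output.

```python
import re


def canonicalize(line: str) -> str:
    """Collapse runs of spaces/tabs into one space, mimicking a
    whitespace-insensitive substring matcher."""
    return re.sub(r"[ \t]+", " ", line).strip()


# Hypothetical rows: numeric columns, then three flag columns
# (MayLoad, MayStore, HasSideEffects), then the instruction text.
may_load_row = " 1      5     1.00    *                   movlps (%rax), %xmm0"
side_effects_row = " 1      5     1.00                  *     movlps (%rax), %xmm0"

# Same marker in a different column: indistinguishable once whitespace collapses.
print(canonicalize(may_load_row) == canonicalize(side_effects_row))

# A distinct marker (assumed here to be 'U') keeps the rows different.
distinct_row = side_effects_row.replace("*", "U")
print(canonicalize(may_load_row) == canonicalize(distinct_row))
```

With one marker per flag, a flag moving from one column to another changes the matched text itself, so the test fails as intended.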
*	[CodeGen] assume max/default throughput for unspecified instructions	Sanjay Patel	2018-06-05	1	-2/+2
	This is a fix for the problem arising in D47374 (PR37678): https://bugs.llvm.org/show_bug.cgi?id=37678 We may not have throughput info because it's not specified in the model or it's not available with variant scheduling, so assume that those instructions can execute/complete at max-issue-width. Differential Revision: https://reviews.llvm.org/D47723 llvm-svn: 334055
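A rough sketch of the fallback described in the commit above, not the actual LLVM implementation: when the model supplies no throughput for an instruction, assume it can complete at the maximum issue width, i.e. a reciprocal throughput of 1/IssueWidth. The function and parameter names are hypothetical.

```python
from typing import Optional


def reciprocal_throughput(model_value: Optional[float], issue_width: int) -> float:
    """Return cycles per instruction, falling back to the optimistic
    max-issue-width assumption when the model has no information."""
    if model_value is not None:
        return model_value
    # No info in the model: assume issue_width instructions complete per cycle.
    return 1.0 / issue_width


print(reciprocal_throughput(None, issue_width=2))  # unspecified instruction
print(reciprocal_throughput(1.0, issue_width=2))   # value taken from the model
```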
*	[llvm-mca] Make sure not to end the test files with an empty line.	Roman Lebedev	2018-06-04	1	-1/+0
	Summary: It's super irritating. [properly configured] git client then complains about that double-newline, and you have to use `--force` to ignore the warning, since even if you fix it manually, it will be reintroduced the very next runtime :/ Reviewers: RKSimon, andreadb, courbet, craig.topper, javed.absar, gbedwell Reviewed By: gbedwell Subscribers: javed.absar, tschuett, gbedwell, llvm-commits Differential Revision: https://reviews.llvm.org/D47697 llvm-svn: 333887
*	[X86] Add GPR<->XMM Schedule Tags	Simon Pilgrim	2018-05-18	1	-9/+9
	BtVer2 - fix NumMicroOp and account for the Lat+6cy GPR->XMM and Lat+1cy XMM->GPR delays (see rL332737)
	The high number of MOVD/MOVQ equivalent instructions meant that there were a number of missed patterns in SNB/Znver1:
	SNB - add missing GPR<->MMX costs (taken from Agner / Intel AOM)
	Znver1 - add missing GPR<->XMM MOVQ costs (taken from Agner)
	llvm-svn: 332745
*	[X86][BtVer2] Improve simulation of (V)PINSR values	Simon Pilgrim	2018-05-18	1	-8/+8
	Include the 6cy delay transferring from the GPR to FPU. llvm-svn: 332737
*	[X86][BtVer2] Partial vector stores (inc MMX) have a 2cy latency	Simon Pilgrim	2018-05-18	1	-8/+8
	llvm-svn: 332722
*	[X86][SSE] Ensure vector partial load/stores use the ↵	Simon Pilgrim	2018-05-18	1	-5/+5
	WriteVecLoad/WriteVecStore scheduler classes
	Retag some instructions that were missed when we split off vector load/store/moves - MOVQ/MOVD etc. Fixes BtVer2/SLM which have different behaviours for GPR stores. llvm-svn: 332718
*	[X86][SSE] Ensure float load/stores use the WriteFLoad/WriteFStore scheduler ↵	Simon Pilgrim	2018-05-18	1	-9/+9
	classes
	Retag some instructions that were missed when we split off vector load/store/moves - MOVSS/MOVSD/MOVHPD/MOVHPD/MOVLPD/MOVLPS etc. Fixes BtVer2/SLM which have different behaviours for GPR stores. llvm-svn: 332714
*	[llvm-mca] Regenerate tests after r332381 and r332361. NFC	Andrea Di Biagio	2018-05-16	1	-1382/+1382
	llvm-svn: 332447
*	[X86] Split WriteCvtF2F into F32->F64 and F64->F32 scheduler classes	Simon Pilgrim	2018-05-15	1	-7/+7
	BtVer2 - Fixes schedules for (V)CVTPS2PD instructions.
	A lot of the Intel models still have too many InstRW overrides for these new classes - this needs cleaning up but I wanted to get the classes in first. llvm-svn: 332376
*	[X86][BtVer2] Fix MMX/YMM integer vector nt store schedules	Simon Pilgrim	2018-05-14	1	-1/+1
	MMX was missing and YMM was tagged as a fp nt store llvm-svn: 332269
*	[X86][BtVer2] Model ymm move as double pumped instructions	Simon Pilgrim	2018-05-11	1	-13/+13
	We still need to handle mmx/xmm moves as 'decode-only' no-pipe instructions llvm-svn: 332109
*	[X86] Add SchedWriteFRnd fp rounding scheduler classes	Simon Pilgrim	2018-05-04	1	-9/+9
	Split off from SchedWriteFAdd for fp rounding/bit-manipulation instructions. Fixes an issue on btver2 which only had the ymm version using the JSTC pipe instead of JFPA. llvm-svn: 331515
*	[X86] Split off PHMINPOSUW to their own schedule class	Simon Pilgrim	2018-04-24	1	-3/+3
	This also fixes Jaguar's schedule which was treating it as the WriteVecIMul default. llvm-svn: 330756
*	[UpdateTestChecks] Add update_mca_test_checks.py script	Greg Bedwell	2018-04-18	1	-2/+6
	This script can be used to regenerate tests in the test/tools/llvm-mca directory (PR36904). Regenerated a number of tests using the pattern: test/tools/llvm-mca///*.s Differential Revision: https://reviews.llvm.org/D45369 llvm-svn: 330246
*	[X86] Add separate scheduling class for PSADBW instruction.	Craig Topper	2018-04-17	1	-2/+2
	llvm-svn: 330204
*	[X86][Btver2] Add vector extract costs	Simon Pilgrim	2018-04-08	1	-20/+20
	llvm-svn: 329524
*	[X86][Btver2] Strip unnecessary check prefixes from resources tests	Simon Pilgrim	2018-04-04	1	-1/+1
	llvm-svn: 329192
*	[X86] Add SchedRW for PMULLD	Craig Topper	2018-03-31	1	-4/+4
	Summary: It seems many CPUs don't implement this instruction as well as the other vector multiplies, often using a multi uop flow. Silvermont in particular has a 7 uop flow with 11 cycle throughput. Sandy Bridge implements it as a single uop with 5 cycle latency and 1 cycle throughput. But Haswell and later use 2 uops with 10 cycle latency and 2 cycle throughput. This patch adds a new X86SchedWritePair we can use to tag this instruction separately. I've provided correct information for Silvermont, Btver2, and Sandy Bridge. I've removed the InstRWs for SandyBridge. I've left Haswell/Broadwell/Skylake InstRWs in place because I wasn't sure how to account for the different load latency between 128 and 256 bits. I also left Znver1 InstRWs in place because the existing values don't match Agner's spreadsheet. I also left a FIXME in the SandyBridge model because it being used for the "generic" model is too optimistic for the 256/512-bit versions since those are multiple uops on all known CPUs. Reviewers: RKSimon, GGanesh, courbet Reviewed By: RKSimon Subscribers: gchatelet, gbedwell, andreadb, llvm-commits Differential Revision: https://reviews.llvm.org/D44972 llvm-svn: 328914
*	[X86][BtVer2] Fixed the number of micro opcodes for AVX vector converts and	Andrea Di Biagio	2018-03-30	1	-8/+8
	VSQRT instructions.
	There were still a few AVX instructions with an incorrect number of micro opcodes. These should be fixed now. llvm-svn: 328892
*	[X86][BtVer2] Fix the number of uOps for horizontal operations.	Andrea Di Biagio	2018-03-30	1	-8/+8
	llvm-svn: 328886
*	[X86][BtVer2] Fix the number of micro opcodes for AES[ENC\|DEC] and other YMM ↵	Andrea Di Biagio	2018-03-28	1	-22/+22
	instructions.
	Similar to r328694. The number of micro opcodes should be 2 for those instructions. This was found when testing AVX code for BtVer2 using llvm-mca. llvm-svn: 328698
*	[X86][BtVer2] Fix the number of micro opcodes for a bunch of YMM instructions.	Andrea Di Biagio	2018-03-28	1	-0/+695
	The Jaguar backend natively supports 128-bit data types. Operations on YMM registers are split into two COPs (complex operations). Each COP consumes a slot in the dispatch group, and in the reorder buffer. The scheduling model for Jaguar should mark those instructions as `let NumMicroOps = 2`. This was found when testing AVX code for BtVer2 using llvm-mca. llvm-svn: 328694
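A small sketch of why the micro-opcode count matters, based on the description in the commit above: a 256-bit operation on a 128-bit datapath becomes two COPs, and each COP takes a dispatch slot. The dispatch width of 2 used below is an assumption for illustration, and the helper names are hypothetical; the estimate ignores every other pipeline constraint.

```python
from typing import List


def num_micro_ops(operand_bits: int, native_bits: int = 128) -> int:
    """Number of COPs needed when the datapath handles native_bits at a time."""
    return max(1, -(-operand_bits // native_bits))  # ceiling division


def dispatch_cycles(operand_widths: List[int], dispatch_width: int = 2) -> int:
    """Lower bound on dispatch cycles: total COPs divided by slots per cycle."""
    total_uops = sum(num_micro_ops(bits) for bits in operand_widths)
    return -(-total_uops // dispatch_width)


print(num_micro_ops(128))                # XMM operation
print(num_micro_ops(256))                # YMM operation
print(dispatch_cycles([256, 256, 128]))  # two YMM ops and one XMM op
```

If YMM instructions were modelled with a single micro opcode, the same sequence would appear to need fewer dispatch slots than it really does.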
*	[X86][Btver2] Add (U)COMISD/(U)COMISD scheduler costs	Simon Pilgrim	2018-03-26	1	-8/+8
	Account for the "+i" integer pipe transfer cost (1cy use of JALU0 for GPR PRF write) llvm-svn: 328573
*	[X86][Btver2] Add CVTSD2SS/CVTSS2SD scheduler costs	Simon Pilgrim	2018-03-26	1	-4/+4
	llvm-svn: 328541
*	[X86][Btver2] Account for the "+i" integer pipe transfer costs (1cy use of ↵	Simon Pilgrim	2018-03-26	1	-17/+17
	JALU0 for GPR PRF write)
	llvm-svn: 328536
*	[X86][Btver2] Add CVTSD2SI/CVTSS2SI scheduler costs	Simon Pilgrim	2018-03-26	1	-12/+21
	Account for the "+i" integer pipe transfer cost (1cy use of JALU0 for GPR PRF write). This also adds missing vcvttss2si tests llvm-svn: 328505
*	[X86][Btver2] Fix YMM BLENDPD/BLENDPS + UNPCKPD/UNPCKP instructions costs	Simon Pilgrim	2018-03-26	1	-12/+12
	These should match the YMM MOVDUP/PERMILPD/PERMILPS + SHUFPD/SHUFPS shuffles instead of using the WriteFShuffle defaults. llvm-svn: 328501
*	[X86][Btver2] Add (V)SQRTPD/(V)SQRTSD costs	Simon Pilgrim	2018-03-26	1	-4/+4
	The xmm sd/pd versions were using the WriteFSQRT default which is modelled on sqrtss/sqrtps llvm-svn: 328497
*	[X86][Btver2] Double the AGU and schedule pipe resources for YMM	Simon Pilgrim	2018-03-26	1	-103/+103
	Both the AGUs and schedule pipes are double pumped for 256-bit instructions as well as the functional units which we already model. llvm-svn: 328491
*	[llvm-mca] Add flag -instruction-tables to print the theoretical resource ↵	Andrea Di Biagio	2018-03-26	1	-389/+389
	pressure distribution for instructions (PR36874)
	The goal of this patch is to address most of PR36874. To fully fix PR36874 we need to split the "InstructionInfo" view from the "SummaryView". That would make it easy to check the latency and rthroughput as well. The patch reuses all the logic from ResourcePressureView to print out the "instruction tables". We have an entry for every instruction in the input sequence. Each entry reports the theoretical resource pressure distribution. Resource pressure is uniformly distributed across all the processor resource units of a group. At the moment, the backend pipeline is not configurable, so the only way to fix this is by creating a different driver that simply sends instruction events to the resource pressure view. That means we don't use the Backend interface. Instead, it is simpler to just have a different code-path for when flag -instruction-tables is specified. Once Clement addresses bug 36663, then we can port the "instruction tables" logic into a stage of our configurable pipeline. Updated the BtVer2 test cases (thanks Simon for the help). Now we pass flag -instruction-tables to each modified test. Differential Revision: https://reviews.llvm.org/D44839 llvm-svn: 328487
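The "uniformly distributed" rule in the commit above can be sketched as follows. This is an illustration of the arithmetic only, not the llvm-mca code; the function name is hypothetical, and the example assumes a group made of the two units JFPU0 and JFPU1.

```python
from typing import Dict, List


def distribute_pressure(resource_cycles: float, units: List[str]) -> Dict[str, float]:
    """Split the cycles consumed on a resource group evenly across its units."""
    share = resource_cycles / len(units)
    return {unit: share for unit in units}


# One cycle on a two-unit group shows up as 0.5 on each unit in the table.
print(distribute_pressure(1.0, ["JFPU0", "JFPU1"]))
```

Because the split is theoretical, the table reflects the model's resource description rather than where a simulated run would actually issue each instruction.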
*	[X86][Btver2] Cleanup TEST instructions to use JFPA (+JFPX on ymms) function ↵	Simon Pilgrim	2018-03-23	1	-33/+33
	unit
	llvm-svn: 328343