bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message	Author	Age	Files	Lines
*	[X86] Make _Int instructions the preferred instruction for the assembly ↵	Craig Topper	2019-04-10	2	-12/+12
	parser and disassembly parser to remove inconsistencies between VEX and EVEX. Many of our instructions have both a _Int form used by intrinsics and a form used by other IR constructs. In the EVEX space the _Int versions usually cover all the capabilities, including broadcasting and rounding, while the other version only covers simple register/register or register/load forms. For this reason in EVEX, the non-intrinsic form is usually marked isCodeGenOnly=1. In the VEX encoding space we were less consistent, but usually the _Int version was the isCodeGenOnly version. This commit makes the VEX instructions match the EVEX instructions. This was done by manually studying the AsmMatcher table, so it's possible I missed some cases, but we should be closer now. I'm thinking about using the isCodeGenOnly bit to simplify the EVEX2VEX tablegen code that disambiguates the _Int and non _Int versions. Currently it checks register class sizes and the Record the memory operands come from. I have some other changes I was looking into for D59266 that may break the memory check. I had to make a few scheduler hacks to keep the _Int versions from being treated differently than the non _Int version. Differential Revision: https://reviews.llvm.org/D60441 llvm-svn: 358138
*	[llvm-mca][scheduler-stats] Print issued micro opcodes per cycle. NFCI	Andrea Di Biagio	2019-04-08	10	-24/+24
	It makes more sense to print out the number of micro opcodes that are issued every cycle rather than the number of instructions issued per cycle. This behavior is also consistent with the dispatch-stats: numbers from the two views can now be easily compared. llvm-svn: 357919
*	[MCA] Add an experimental MicroOpQueue stage.	Andrea Di Biagio	2019-03-29	1	-0/+105
	This patch adds an experimental stage named MicroOpQueueStage. MicroOpQueueStage can be used to simulate a hardware micro-op queue (basically, a decoupling queue between 'decode' and 'dispatch'). Users can specify a queue size, as well as an optional MaxIPC (which, in the absence of a "Decoders" stage, can be used to simulate a different throughput from the decoders). This stage is added to the default pipeline between the EntryStage and the DispatchStage only if PipelineOption::MicroOpQueue is different from zero. By default, llvm-mca sets PipelineOption::MicroOpQueue to the value of hidden flag -micro-op-queue-size. Throughput from the decoder can be simulated via another hidden flag named -decoder-throughput. That flag allows us to quickly experiment with different frontend throughputs. For targets that declare a loop buffer, flag -decoder-throughput allows users to do multiple runs, each time simulating a different throughput from the decoders. This stage can/will be extended in future. For example, we could add a "buffer full" event to notify bottlenecks caused by backpressure. Flag -decoder-throughput would probably go away if in future we delegate to another stage (DecoderStage?) the simulation of a (potentially variable) throughput from the decoders. For now, flag -decoder-throughput is "good enough" to run some simple experiments. Differential Revision: https://reviews.llvm.org/D59928 llvm-svn: 357248
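The decoupling-queue idea in this message can be illustrated with a small toy model. This is only a sketch and not the MicroOpQueueStage code: the function name and its parameters (`queue_size`, `decoder_throughput`, `dispatch_width`) are hypothetical, and it assumes a micro-op decoded in one cycle can be dispatched no earlier than the next cycle.

```python
from collections import deque


def simulate_uop_queue(num_uops, queue_size, decoder_throughput, dispatch_width):
    """Return the cycles needed to dispatch num_uops through a bounded queue.

    Toy model (assumption, not llvm-mca behavior): each cycle the dispatch
    side drains up to dispatch_width micro-ops, then the decoders push up to
    decoder_throughput micro-ops into whatever room is left in the queue.
    """
    if min(queue_size, decoder_throughput, dispatch_width) <= 0:
        raise ValueError("queue size and widths must be positive")

    queue = deque()
    decoded = 0     # micro-ops pushed into the queue so far
    dispatched = 0  # micro-ops drained by the dispatch side so far
    cycles = 0
    while dispatched < num_uops:
        cycles += 1
        # Drain first: entries decoded this cycle are visible next cycle.
        for _ in range(min(dispatch_width, len(queue))):
            queue.popleft()
            dispatched += 1
        # Fill the free slots, limited by the decoder throughput.
        room = queue_size - len(queue)
        for _ in range(min(decoder_throughput, room, num_uops - decoded)):
            queue.append(decoded)
            decoded += 1
    return cycles
```

Lowering `decoder_throughput` below `dispatch_width` makes the frontend the limiting factor, which is the kind of experiment the -decoder-throughput flag is described as enabling.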
*	[X86] AMD Piledriver (BdVer2): fine-tune some latencies	Roman Lebedev	2019-03-28	12	-197/+197
	Based on llvm-exegesis measurements. Now that llvm-exegesis is ~2 orders of magnitude faster, and is a bit smarter, it is now possible to continue cleanup of the scheduler model. With this, there are no more latency inconsistencies for the opcodes that produce stable measurements, and only a few inconsistencies for unstable measurements (MMX_* opcodes, opcodes that llvm-exegesis measures by chaining - CMP, TEST, BT, SETcc, CVT, MOV, etc.) llvm-svn: 357169
*	[X86] Remove the _alt forms of (V)CMP instructions. Use a combination of ↵	Craig Topper	2019-03-18	31	-392/+392
	custom printing and custom parsing to achieve the same result and more. Similar to the previous change done for VPCOM and VPCMP. Differential Revision: https://reviews.llvm.org/D59468 llvm-svn: 356384
*	[X86] Remove the _alt forms of XOP VPCOM instructions. Use a combination of ↵	Craig Topper	2019-03-17	2	-64/+64
	custom printing and custom parsing to achieve the same result and more. Previously we had a regular form of the instruction used when the immediate was 0-7, and an _alt form that allowed the full 8-bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible. The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces: a "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegen-generated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b". The _alt form on the other hand parsed and printed like any other instruction with no specialness. With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax. I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd, which have 32 immediate values, 3 encoding flavors, 3 register sizes, etc., this didn't seem to scale well for clang binary size. So custom printing seemed a better trade-off. I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them. Differential Revision: https://reviews.llvm.org/D59398 llvm-svn: 356343
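The mnemonic rewriting described in this message (turning "vpcomtrueb" into "vpcomb" plus an immediate, then placing that immediate according to the syntax) can be sketched as follows. This is an illustration, not the X86AsmParser code: the helper names are hypothetical, and the condition-code table is an assumption based on the XOP VPCOM encoding (0-7 for lt, le, gt, ge, eq, neq, false, true).

```python
# Assumed XOP VPCOM condition-code encoding for immediates 0-7.
VPCOM_CC = {"lt": 0, "le": 1, "gt": 2, "ge": 3,
            "eq": 4, "neq": 5, "false": 6, "true": 7}

# Element-type suffixes; the unsigned two-letter forms are tried first.
SUFFIXES = ("ub", "uw", "ud", "uq", "b", "w", "d", "q")


def split_vpcom(mnemonic):
    """Split e.g. 'vpcomtrueb' into ('vpcomb', 7); return None if no match."""
    prefix = "vpcom"
    if not mnemonic.startswith(prefix):
        return None
    rest = mnemonic[len(prefix):]
    for suffix in SUFFIXES:
        if rest.endswith(suffix):
            cc = rest[:-len(suffix)]
            if cc in VPCOM_CC:
                return prefix + suffix, VPCOM_CC[cc]
    return None


def place_immediate(operands, imm, syntax):
    """Insert the immediate first for AT&T syntax and last for Intel syntax."""
    if syntax == "att":
        return ["$%d" % imm] + list(operands)
    return list(operands) + [str(imm)]
```

A plain `vpcomb` with an explicit immediate does not match the table, so `split_vpcom` returns None and the mnemonic would be left untouched.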
*	[X86] Correct scheduler information for rotate by constant for Haswell, ↵	Craig Topper	2019-03-07	4	-68/+68
	Broadwell, and Skylake. Rotate with explicit immediate is a single uop from Haswell on. An immediate of 1 has a dependency on the previous writer of flags, but the other immediate values do not. The implicit rotate by 1 instruction is 2 uops. But the flags are merged after the rotate uop so the data result does not see the flag dependency. But I don't think we have any way of modeling that. RORX is 1 uop without the load. 2 uops with the load. We currently model these with WriteShift/WriteShiftLd. Differential Revision: https://reviews.llvm.org/D59077 llvm-svn: 355636
*	[X86] Model ADC/SBB with immediate 0 more accurately in the Haswell ↵	Craig Topper	2019-03-07	1	-13/+13
	scheduler model. Haswell and possibly Sandybridge have an optimization for ADC/SBB with immediate 0 to use a single uop flow. This only applies to GR16/GR32/GR64 with an 8-bit immediate. It does not apply to GR8. It also does not apply to the implicit AX/EAX/RAX forms. Differential Revision: https://reviews.llvm.org/D59058 llvm-svn: 355635
*	[llvm-mca] Emit a message when no bottlenecks are identified.	Matt Davis	2019-03-07	1	-0/+16
	Summary: Since bottleneck hints are enabled via user request, it can be confusing if no bottleneck information is presented. Such is the case when no bottlenecks are identified. This patch emits a message in that case. Reviewers: andreadb Reviewed By: andreadb Subscribers: tschuett, gbedwell, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D59098 llvm-svn: 355628
*	[llvm-mca][X86] Add ADC/SBB with zero test cases	Simon Pilgrim	2019-03-06	11	-11/+803
	Some targets have fast-path handling for these patterns that we should model. llvm-svn: 355498
*	[MCA] Highlight kernel bottlenecks in the summary view.	Andrea Di Biagio	2019-03-04	3	-0/+263
	This patch adds a new flag named -bottleneck-analysis to print out information about throughput bottlenecks. MCA knows how to identify and classify dynamic dispatch stalls. However, it doesn't know how to analyze and highlight kernel bottlenecks. The goal of this patch is to teach MCA how to correlate increases in backend pressure to backend stalls (and therefore, the loss of throughput). From a Scheduler point of view, backend pressure is a function of the scheduler buffer usage (i.e. how the number of uOps in the scheduler buffers changes over time). Backend pressure increases (or decreases) when there is a mismatch between the number of opcodes dispatched, and the number of opcodes issued in the same cycle. Since buffer resources are limited, continuous increases in backend pressure would eventually lead to dispatch stalls. So, there is a strong correlation between dispatch stalls, and how backpressure changed over time. This patch teaches MCA how to identify situations where backend pressure increases due to unavailable pipeline resources or to data dependencies. Data dependencies may delay execution of instructions and therefore increase the time that uOps have to spend in the scheduler buffers. That often translates to an increase in backend pressure which may eventually lead to a bottleneck. Contention on pipeline resources may also delay execution of instructions, and lead to a temporary increase in backend pressure. Internally, the Scheduler classifies instructions based on whether register / memory operands are available or not. An instruction is marked as "ready to execute" only if data dependencies are fully resolved. 
Every cycle, the Scheduler attempts to execute all instructions that are ready to execute. If an instruction cannot execute because of unavailable pipeline resources, then the Scheduler internally updates a BusyResourceUnits mask with the ID of each unavailable resource. ExecuteStage is responsible for tracking changes in backend pressure. If backend pressure increases during a cycle because of contention on pipeline resources, then ExecuteStage sends a "backend pressure" event to the listeners. That event would contain information about instructions delayed by resource pressure, as well as the BusyResourceUnits mask. Note that ExecuteStage also knows how to identify situations where backpressure increased because of delays introduced by data dependencies. The SummaryView observes "backend pressure" events and prints out a "bottleneck report". Example of bottleneck report:
```
Cycles with backend pressure increase [ 99.89% ]
Throughput Bottlenecks:
  Resource Pressure       [ 0.00% ]
  Data Dependencies:      [ 99.89% ]
   - Register Dependencies [ 0.00% ]
   - Memory Dependencies   [ 99.89% ]
```
A bottleneck report is printed out only if increases in backend pressure eventually caused backend stalls. About the time complexity: Time complexity is linear in the number of instructions in the Scheduler::PendingSet. The average slowdown tends to be in the range of ~5-6%. For memory intensive kernels, the slowdown can be significant if flag -noalias=false is specified. In the worst case scenario I have observed a slowdown of ~30% when flag -noalias=false was specified. We can definitely recover part of that slowdown if we optimize class LSUnit (by doing extra bookkeeping to speedup queries). For now, this new analysis is disabled by default, and it can be enabled via flag -bottleneck-analysis. Users of MCA as a library can enable the generation of pressure events through the constructor of ExecuteStage. 
This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494 Differential Revision: https://reviews.llvm.org/D58728 llvm-svn: 355308
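The notion of backend pressure used in this message (a mismatch between uOps dispatched and uOps issued in the same cycle) can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the ExecuteStage or SummaryView logic: the function name and the per-cycle input lists are hypothetical, and it only counts the cycles in which buffer occupancy grew, without classifying the cause.

```python
def pressure_increase_percentage(dispatched, issued):
    """Return the percentage of cycles in which buffer occupancy grew.

    dispatched[i] and issued[i] are the uOps dispatched to, and issued from,
    the scheduler buffers in cycle i (hypothetical inputs for illustration).
    """
    if len(dispatched) != len(issued):
        raise ValueError("expected one dispatched/issued pair per cycle")
    if not dispatched:
        return 0.0

    in_buffers = 0        # uOps currently waiting in the scheduler buffers
    increase_cycles = 0   # cycles where occupancy went up
    for num_dispatched, num_issued in zip(dispatched, issued):
        before = in_buffers
        in_buffers += num_dispatched - num_issued
        if in_buffers > before:
            increase_cycles += 1
    return 100.0 * increase_cycles / len(dispatched)
```

For example, dispatching 2 uOps every cycle while issuing `[1, 2, 1, 3]` grows the occupancy in two of the four cycles.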
*	[MCA] Always check if scheduler resources are unavailable when reporting ↵	Andrea Di Biagio	2019-02-26	1	-1/+1
	dispatch stalls. Dispatch stall cycles may be associated with multiple dispatch stall events. Before this patch, each stall cycle was associated with a single stall event. This patch also improves a couple of code comments, and adds a helper method to query the Scheduler for dispatch stalls. llvm-svn: 354877
*	[X86] Correct some ADC/SBB with immediate scheduler data for Broadwell and ↵	Craig Topper	2019-02-24	3	-51/+51
	Skylake. Summary: The AX/EAX/RAX with immediate forms are 2 uops just like the AL with immediate. The modrm form with r8 and immediate is a single uop just like r16/r32/r64 with immediate. Reviewers: RKSimon, andreadb Reviewed By: RKSimon Subscribers: gbedwell, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58581 llvm-svn: 354754
*	[MCA] Correctly update register definitions in the PRF after move elimination.	Andrea Di Biagio	2019-02-18	1	-0/+119
	This patch fixes a bug where register writes performed by optimizable register moves were sometimes wrongly treated like partial register updates. Before this patch, llvm-mca wrongly predicted a 1.50 IPC for test reg-move-elimination-6.s (added by this patch). With this patch, llvm-mca correctly updates the register definitions in the PRF, and the IPC for that test is now correctly reported as 2. llvm-svn: 354271
*	[X86] Print all register forms of x87 fadd/fsub/fdiv/fmul as having two ↵	Craig Topper	2019-02-04	11	-484/+484
	arguments where one is %st. All of these instructions consume one encoded register and the other register is %st. They either write the result to %st or the encoded register. Previously we printed both arguments when the encoded register was written. And we printed one argument when the result was written to %st. For the stack popping forms the encoded register is always the destination and we didn't print both operands. This was inconsistent with gcc and objdump and just makes the output assembly code harder to read. This patch changes things to always print both operands making us consistent with gcc and objdump. The parser should still be able to handle the single register forms just as it did before. This also matches the GNU assembler behavior. llvm-svn: 353061
*	[X86] Print %st(0) as %st when it's implicit to the instruction. Continue ↵	Craig Topper	2019-02-04	11	-462/+462
	printing it as %st(0) when it's encoded in the instruction. This is a step back from the change I made in r352985. This appears to be more consistent with gcc and objdump behavior. llvm-svn: 353015
*	Revert r352985 "[X86] Print %st(0) as %st to match what gcc inline asm uses ↵	Craig Topper	2019-02-04	11	-594/+594
	as the clobber name to make MS inline asm work correctly" Looking into gcc and objdump behavior more, this was overly aggressive. If the register is encoded in the instruction we should print %st(0); if it's implicit we should print %st. I'll be making a more directed change in a future patch. llvm-svn: 353013
*	[X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber ↵	Craig Topper	2019-02-03	11	-594/+594
	name to make MS inline asm work correctly Summary: When calculating clobbers for MS style inline assembly we fail if the asm clobbers stack top because we print st(0) and try to pass it through the gcc register name check. This was found when I attempted to make emms/femms clobber all ST registers. If you use emms/femms in MS inline asm we would try to use st(0) as the clobber name but clang would think that wasn't a valid clobber name. This also matches what objdump disassembly prints. It's also what is printed by gcc -S. Reviewers: RKSimon, rnk, efriedma, spatel, andreadb, lebedev.ri Reviewed By: rnk Subscribers: eraman, gbedwell, lebedev.ri, llvm-commits Differential Revision: https://reviews.llvm.org/D57621 llvm-svn: 352985
*	[X86][BdVer2] Transfer delays from the integer to the floating point unit.	Roman Lebedev	2019-02-01	7	-46/+46
	Summary: I'm unable to find this number in the "AMD SOG for family 15h". llvm-exegesis measures the latencies of these instructions as `2`, which matches the latencies specified in "AMD SOG for family 15h". However if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver, Steamroller and Excavator pipeline", "Data delay between different execution domains", the int->ivec transfer is listed as `8`..`10`cy of additional latency. Also, Agner's "Instruction tables", for Piledriver, lists their latencies as `12`, which is consistent with `2cy` from exegesis / AMD SOG + `10cy` transfer delay. Additional data point comes from the fact that Agner's "Instruction tables", for Jaguar, lists their latencies as `8`; and "AMD SOG for family 16h" does state the `+6cy` int->ivec delay, which is consistent with instr latency of `1` or `2`. Reviewers: andreadb, RKSimon, craig.topper Reviewed By: andreadb Subscribers: gbedwell, courbet, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D57300 llvm-svn: 352861
*	[X86][Btver2] Improved latency/throughput model for scalar int-to-float ↵	Andrea Di Biagio	2019-01-29	4	-33/+34
	conversions. Account for bypass delays when computing the latency of scalar int-to-float conversions. On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG). This patch also fixes the number of micro opcodes for the register-memory variants of scalar int-to-float conversions. Differential Revision: https://reviews.llvm.org/D57148 llvm-svn: 352518
*	[NFC][MCA][X86][BdVer2] Cherry-pick int-to-ivec forwarding tests from BtVer2	Roman Lebedev	2019-01-27	3	-0/+705
	llvm-svn: 352317
*	[llvm-mca][X86] Add some missing DQI tests	Simon Pilgrim	2019-01-26	8	-4/+3954
	Match more of the coverage of test\CodeGen\X86\avx512-schedule.ll as discussed on D57244 llvm-svn: 352273
*	[llvm-mca][X86] Add missing shuffle tests	Simon Pilgrim	2019-01-25	8	-8/+3924
	Match the coverage of test\CodeGen\X86\avx512-shuffle-schedule.ll so we can get rid of -print-schedule (and fix PR37160) without losing schedule tests llvm-svn: 352179
*	[MC][X86] Correctly model additional operand latency caused by transfer ↵	Andrea Di Biagio	2019-01-23	2	-29/+29
	delays from the integer to the floating point unit. This patch adds a new ReadAdvance definition named ReadInt2Fpu. ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by data transfers from the integer unit to the floating point unit. ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all x86 models excluding BtVer2. That means this patch is a functional change for the Jaguar cpu model only. Tablegen definitions for instructions (V)PINSR* have been updated to account for the new ReadInt2Fpu. That read is mapped to the GPR input operand. On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch, that extra delay was added to the opcode latency. In practice, the insert opcode only executes for 1cy. Most of the latency is actually contributed by the so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR* latency is defined by expression f+1, where f is defined as a forwarding delay from the integer unit to the fpu. When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC (only when flag -print-schedule is specified), we now need to account for any extra forwarding delays. We do this by checking if scheduling classes declare any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td: "A negative advance effectively increases latency, which may be used for cross-domain stalls". When computing the instruction latency for the purpose of our scheduling tests, we now add any extra delay to the formula. This avoids regressing existing codegen and mca schedule tests. It comes with the cost of an extra (but very simple) hook in MCSchedModel. Differential Revision: https://reviews.llvm.org/D57056 llvm-svn: 351965
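The latency adjustment described in this message (adding the delay implied by negative ReadAdvance entries to the opcode latency) can be sketched as below. This is an illustration rather than the MCSchedModel hook: the function name is hypothetical, and taking the largest negative advance as the forwarding delay is an assumption of this sketch.

```python
def latency_with_forwarding_delay(opcode_latency, read_advance_cycles):
    """Add the delay implied by negative ReadAdvance entries to a latency.

    read_advance_cycles holds one signed cycle count per ReadAdvance entry.
    A negative value models a cross-domain stall. Using the largest such
    stall (rather than, say, their sum) is an assumption of this sketch.
    """
    delay = max((-cycles for cycles in read_advance_cycles if cycles < 0),
                default=0)
    return opcode_latency + delay
```

With the Jaguar numbers quoted above, a 1cy insert opcode and a ReadInt2Fpu advance of -6 give 7cy, matching the f+1 expression with f = 6.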
*	[llvm-mca][X86] Tidyup avx512 placeholder tests	Simon Pilgrim	2019-01-22	8	-156/+1348
	Ensure we keep avx512f/bw/dq + vl versions separate, add example broadcast tests - this should allow us to better the test coverage of test\CodeGen\X86\avx512-schedule.ll llvm-svn: 351848
*	[llvm-mca][X86] Add VPOPCNTDQ tests	Simon Pilgrim	2019-01-22	2	-0/+238
	Matches test coverage of test\CodeGen\X86\avx512vpopcntdq-schedule.ll llvm-svn: 351842
*	[llvm-mca][X86] Add missing CLWB/CLZERO/FSGSBASE/LWP/MWAITX/RDPID/SHA tests	Simon Pilgrim	2019-01-22	17	-0/+918
	We're getting pretty close to matching/exceeding test coverage of the test\CodeGen\X86\*-schedule.ll files, which should allow us to get rid of -print-schedule and fix PR37160 llvm-svn: 351836
*	[llvm-mca][X86] Add missing enter/leave, invlpg/invlpga, rdmsr/wrmsr, rdpmc ↵	Simon Pilgrim	2019-01-22	11	-13/+363
	and rdtsc/rdtscp tests llvm-svn: 351835
*	[llvm-mca][X86] Add missing mfence/pinsrw tests	Simon Pilgrim	2019-01-22	11	-11/+132
	llvm-svn: 351831
*	[llvm-mca][X86] Add missing monitor/mwait tests	Simon Pilgrim	2019-01-22	11	-10/+98
	These technically should be under a MONITOR cpuid bit, but we tag them as SSE3 so I've done that here as well. llvm-svn: 351829
*	[llvm-mca][X86] Add missing vperm2i128 tests	Simon Pilgrim	2019-01-22	6	-6/+48
	llvm-svn: 351828
*	[llvm-mca][X86] Add missing tzcntw tests	Simon Pilgrim	2019-01-22	8	-8/+64
	llvm-svn: 351827
*	[MCA] Add tests for int-to-fpu transfer delays. NFC	Andrea Di Biagio	2019-01-22	3	-0/+612
	llvm-svn: 351822
*	[X86][BtVer2] SSE2 vector shifts have local forwarding disabled	Simon Pilgrim	2019-01-22	2	-48/+48
	Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops have local forwarding disabled, adding +1cy to the use latency for the result. Differential Revision: https://reviews.llvm.org/D57026 llvm-svn: 351817
*	[X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled	Simon Pilgrim	2019-01-22	1	-8/+8
	Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops have local forwarding disabled, adding +1cy to the use latency for the result. Differential Revision: https://reviews.llvm.org/D57022 llvm-svn: 351815
*	[X86][BtVer2] Update latency of mmx horizontal operations	Simon Pilgrim	2019-01-21	1	-12/+12
	D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty. Confirmed with @andreadb llvm-svn: 351755
*	[X86][BtVer2] Update the WriteLoad latency.	Andrea Di Biagio	2019-01-21	6	-17/+17
	r327630 introduced new write definitions for float/vector loads. Before that revision, WriteLoad was used by both integer/float (scalar/vector) loads. So, WriteLoad had to conservatively declare a latency of 5cy. That is because the load-to-use latency for float/vector loads is 5cy. Now that we have dedicated writes for float/vector loads, there is no reason why we should keep the latency of WriteLoad at 5cy. At the moment, WriteLoad is used by scalar integer loads only; we can assume an optimistic 3cy latency for them. This patch changes that latency from 5cy to 3cy, and regenerates the affected scheduling/mca tests. Differential Revision: https://reviews.llvm.org/D56922 llvm-svn: 351742
*	[X86][BtVer2] Update latency of horizontal operations.	Andrea Di Biagio	2019-01-16	7	-95/+95
	On Jaguar, horizontal adds/subs have local forwarding disabled. That means we pay a compulsory extra cycle of write-back stage, and the value is not available until the end of that stage. This patch changes the latency of horizontal operations by adding an extra cycle. With this patch, latency numbers now match what is reported by perf. I plan to send another patch to also 'fix' the latency of shuffle operations (on Jaguar, local forwarding is disabled for vector shuffles too). Differential Revision: https://reviews.llvm.org/D56777 llvm-svn: 351366
*	[llvm-mca] Update tests for Exynos (NFC)	Evandro Menezes	2019-01-11	4	-0/+46
	Update test cases for Exynos M4. llvm-svn: 350961
*	[llvm-mca] Update the Exynos test cases (NFC)	Evandro Menezes	2019-01-08	1	-18/+18
	Add more entropy to the test cases. llvm-svn: 350662
*	[llvm-mca] Rename directory for the Cortex tests (NFC)	Evandro Menezes	2018-12-19	2	-0/+0
	llvm-svn: 349688
*	[llvm-mca] Update Exynos test cases (NFC)	Evandro Menezes	2018-12-19	2	-108/+0
	llvm-svn: 349687
*	[AArch64] Improve the Exynos M3 pipeline model	Evandro Menezes	2018-12-19	1	-1/+1
	llvm-svn: 349652
*	[llvm-mca] Split test (NFC)	Evandro Menezes	2018-12-19	2	-29/+56
	Split the Exynos test of the register offset addressing mode into separate loads and stores tests. llvm-svn: 349651
*	[llvm-mca] Improve test (NFC)	Evandro Menezes	2018-12-18	1	-18/+56
	Add more instruction variations for Exynos. llvm-svn: 349567
*	[llvm-mca] Update the Exynos test cases (NFC)	Evandro Menezes	2018-12-18	5	-49/+60
	Add more entropy to the test cases. llvm-svn: 349537
*	[MCA] Add support for BeginGroup/EndGroup.	Andrea Di Biagio	2018-12-17	1	-10/+10
	llvm-svn: 349354
*	[MCA] Don't assume that createMCInstrAnalysis() always returns a valid pointer.	Andrea Di Biagio	2018-12-17	2	-0/+75
	Class InstrBuilder wrongly assumed that llvm targets were always able to return a non-null pointer when createMCInstrAnalysis() was called on them. This was causing crashes when simulating executions for targets that don't provide an MCInstrAnalysis object. This patch fixes the issue by making MCInstrAnalysis optional. llvm-svn: 349352
*	[AArch64] Refactor the Exynos scheduling predicates	Evandro Menezes	2018-12-10	3	-50/+47
	Refactor the scheduling predicates based on `MCInstPredicate`, in this case for the Exynos processors. Differential revision: https://reviews.llvm.org/D55345 llvm-svn: 348774
*	[llvm-mca] Add new tests for Exynos (NFC)	Evandro Menezes	2018-12-10	3	-0/+150
	llvm-svn: 348766