bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[X86] Remove MCInstLower code that drops operands from some CALL and TAILJMP ↵	Craig Topper	2019-08-22	1	-18/+8
\| \| \| \| \| \| \| \| \| \|	instructions. Add asserts to verify operand count It appears the FIXME here was handled at some point. r159728 from 2012 seems to be at least aportion of fixing it. Differential Revision: https://reviews.llvm.org/D66570 llvm-svn: 369665
*	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	Andrea Di Biagio	2019-08-22	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Single operand MUL instructions that implicitly set EAX have the following latency/throughput profile (see below): imul %cl # latency: 3cy - uOPs: 1 - 1 JMul imul %cx # latency: 3cy - uOPs: 3 - 3 JMul imul %ecx # latency: 3cy - uOPs: 2 - 2 JMul imul %rcx # latency: 6cy - uOPs: 2 - 4 JMul mul %cl # latency: 3cy - uOPs: 1 - 1 JMul mul %cx # latency: 3cy - uOPs: 3 - 3 JMul mul %ecx # latency: 3cy - uOPs: 2 - 2 JMul mul %rcx # latency: 6cy - uOPs: 2 - 4 JMul Excluding the 64bit variant, which has a latency of 6cy, every other instruction has a latency of 3cy. However, the number of decoded macro-opcodes (as well as the resource cyles) depend on the MUL size. The two operand MULs have a more predictable profile (see below): imul %dx, %dx # latency: 3cy - uOPs: 1 - 1 JMul imul %edx, %edx # latency: 3cy - uOPs: 1 - 1 JMul imul %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul imul $3, %dx, %dx # latency: 4cy - uOPs: 2 - 2 JMul imul $3, %ecx, %ecx # latency: 3cy - uOPs: 1 - 1 JMul imul $3, %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul This patch updates the values in the Jaguar scheduling model and regenerates llvm-mca tests. Differential Revision: https://reviews.llvm.org/D66547 llvm-svn: 369661
*	[X86][BtVer2] Fix latency and throughput of XCHG and XADD.	Andrea Di Biagio	2019-08-22	1	-1/+91
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum throughput for XCHG is 1 IPC. The byte exchange has worse latency and decodes to 1 extra uOP; maximum observed throughput is 0.5 IPC. ``` xchgb %cl, %dl # Latency: 2cy - uOPs: 3 - 2 ALU xchgw %cx, %dx # Latency: 1cy - uOPs: 2 - 2 ALU xchgl %ecx, %edx # Latency: 1cy - uOPs: 2 - 2 ALU xchgq %rcx, %rdx # Latency: 1cy - uOPs: 2 - 2 ALU ``` The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy. The resource usage is similar to the XCHGrr variants. The biggest difference is obviously the bus-locking, which prevents the LS to issue other memory uOPs in parallel until the unlocking store uOP is executed. ``` xchgb %cl, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgw %cx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgl %ecx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgq %rcx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy ``` The exchanged in/out register operand becomes available after 11cy from the start of execution. Added test xchg.s to verify that we correctly see that register write committed in 11cy (and not 16cy). Reg-reg XADD instructions have the same latency/throughput than the byte exchange (register-register variant). ``` xaddb %cl, %dl # latency: 2cy - uOPs: 3 - 3 ALU xaddw %cx, %dx # latency: 2cy - uOPs: 3 - 3 ALU xaddl %ecx, %edx # latency: 2cy - uOPs: 3 - 3 ALU xaddq %rcx, %rdx # latency: 2cy - uOPs: 3 - 3 ALU ``` The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register operand becomes available in 3cy (it matches the 'load-to-use latency'). ``` xaddb %cl, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddw %cx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddl %ecx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddq %rcx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU ``` The atomic XADD variants execute in 16cy. The in/out register operand is available after 11cy from the start of execution. ``` lock xaddb %cl, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddw %cx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddl %ecx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddq %rcx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy ``` Added test xadd.s to verify those latencies as well as read-advance values. Differential Revision: https://reviews.llvm.org/D66535 llvm-svn: 369642
*	[MVT] Add MVT equivalent to EVT::getHalfNumVectorElementsVT() helper. NFCI.	Simon Pilgrim	2019-08-22	1	-21/+11
\| \| \| \| \| \|	Allows for some cleanup in a lot of SSE/AVX vector splitting code llvm-svn: 369640
*	[X86] Lower the cost of v2i32->v2f64 sint_to_fp under vector widening ↵	Craig Topper	2019-08-22	1	-0/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	legalization. I don't really understand the costs we're using for fp_to_sint, but prior to widening legalization we used 20 as the cost for this via the v2i64->v2f64 entry. That number seems better than the 40 we got with widening legalization. So now we need either a v2i32->v2f64 entry or a v4i32->v2f64 entry depending on whether AVX is enabled or not since we skip the first SSE2 table look up under AVX. llvm-svn: 369628
*	[X86] Making X86OptimizeLEAs pass public. NFC	Pengfei Wang	2019-08-22	3	-26/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reviewers: wxiao3, LuoYuanke, andrew.w.kaylor, craig.topper, annita.zhang, liutianle, pengfei, xiangzhangllvm, RKSimon, spatel, andreadb Reviewed By: RKSimon Subscribers: andreadb, hiraditya, llvm-commits Tags: #llvm Patch by Gen Pei (gpei) Differential Revision: https://reviews.llvm.org/D65933 llvm-svn: 369612
*	[X86] Correct the scheduler classes for TAILJMP and TCRETURN CodeGenOnly ↵	Craig Topper	2019-08-21	1	-24/+25
\| \| \| \| \| \| \| \| \| \| \| \|	instructions. We had an odd combination of WriteJump applied to some memory instructions and WriteJumpLd applied to register and immediate instructions. Thsi should hopefully assign them all correctly. llvm-svn: 369599
*	[X86] Replace a couple hardcoded '5's with X86::AddrNumOperands for ↵	Craig Topper	2019-08-21	1	-2/+3
\| \| \| \| \| \|	readability. NFC llvm-svn: 369598
*	[AArch64][GlobalISel] Add support for narrowScalar of G_ZEXT	Amara Emerson	2019-08-21	1	-1/+3
\| \| \| \| \| \| \| \|	We do this by merging the source with the high bits set to 0. Differential Revision: https://reviews.llvm.org/D66181 llvm-svn: 369480
*	[DAGCombiner][X86] Teach visitCONCAT_VECTORS to combine (concat_vectors ↵	Craig Topper	2019-08-20	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	(concat_vectors X, Y), undef)) -> (concat_vectors X, Y, undef, undef) I also had to add a new combine to X86's combineExtractSubvector to prevent a regression. This helps our vXi1 code see the full concat operation and allow it optimize undef to a zero if there is already a zero in the concat. This helped us use a movzx instead of an AND in some of the tests. In those tests, one concat comes from SelectionDAGBuilder and the second comes from type legalization of v4i1->i4 bitcasts which uses an additional concat. Though these changes weren't my original motivation. I'm looking at making X86ISelLowering's narrowShuffle emit a concat_vectors instead of an insert_subvector since concat_vectors is more canonical during early DAG combine. This patch helps prevent a regression from my experiments with that. Differential Revision: https://reviews.llvm.org/D66456 llvm-svn: 369459
*	Revert [WinEH] Allocate space in funclets stack to save XMM CSRs	Reid Kleckner	2019-08-20	3	-135/+26
\| \| \| \| \| \| \| \|	This reverts r367088 (git commit 9ad565f70ec5fd3531056d7c939302d4ea970c83) And the follow up fix r368631 / e9865b9b31bb2e6bc742dc6fca8f9f9517c3c43e llvm-svn: 369457
*	[X86] Add a DAG combine to transform (i8 (bitcast (v8i1 (extract_subvector ↵	Craig Topper	2019-08-20	1	-0/+12
\| \| \| \| \| \| \| \| \| \|	(v16i1 X), 0)))) -> (i8 (trunc (i16 (bitcast (v16i1 X))))) on KNL target Without AVX512DQ we don't have KMOVB so we can't really copy 8-bits of a k-register to a GPR. We have to copy 16 bits instead. We do this even if the DAG copy is from v8i1->v16i1. If we detect the (i8 (bitcast (v8i1 (extract_subvector (v16i1 X), 0)))) we should rewrite the types to match the copy we do support. By doing this, we can help known bits to propagate without losing the upper 8 bits of the input to the extract_subvector. This allows some zero extends to be removed since we have an isel pattern to use kmovw for (zero_extend (i16 (bitcast (v16i1 X))). Differential Revision: https://reviews.llvm.org/D66489 llvm-svn: 369434
*	[X86] Add isel patterns for (i64 (zext (i8 (bitcast (v16i1 X))))) to use a ↵	Craig Topper	2019-08-20	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	KMOVW and a SUBREG_TO_REG. Similar for i8 and anyextend. We already had patterns for extending to i32 to take advantage of the impliciting zeroing of the upper bits of a 32-bit GPR that is done by KMOVW/KMOVB. But the extend might be all the way to i64, in which case the existing patterns would fail and we'd get a KMOVW/B followed by a MOVZX. By adding patterns for i64 we can use the fact that KMOVW/B zero the upper bits of the 32-bit GPR and the normal property that 32-bit GPR writes implicitly zero the upper 32-bits of the full 64-bit GPR. The anyextend patterns are slightly different since we don't care about the upper zeros. For the i8->i64 I think this avoids selecting the anyextend as a MOVZX to prevent a partial register issue that doesn't exist. For i16->i64 I think we would have just emitted an insert_subreg on top of the extract_subreg that the vXi16->i16 bitcast pattern emits. The register coalescer or peephole pass should combine those, but this saves that work and makes i8/16 consistent. llvm-svn: 369431
*	[TargetMachine] Don't try to create COFFSTUB references on windows on non-COFF	Martin Storsjo	2019-08-20	1	-0/+3
\| \| \| \| \| \| \| \|	This avoids spurious relocation types for windows/elf targets. Differential Revision: https://reviews.llvm.org/D66401 llvm-svn: 369426
*	[X86][BtVer2] Use ReadAfterLd entries for the register operands of CMPXCHG.	Andrea Di Biagio	2019-08-20	1	-3/+12
\| \| \| \| \| \|	This is a follow-up of r369365. llvm-svn: 369412
*	[X86] Use isNullConstant instead of getConstantOperandVal == 0. NFC	Craig Topper	2019-08-20	1	-2/+2
\| \| \| \|	llvm-svn: 369410
*	[X86][BtVer2] Fix latency and throughput of atomic INC/DEC/NEG/NOT.	Andrea Di Biagio	2019-08-20	1	-0/+14
\| \| \| \| \| \| \| \| \|	Latency and throughput of LOCK INC/DEC/NEG/NOT is always 19cy. Number of uOPs is still 1. Differential Revision: https://reviews.llvm.org/D66469 llvm-svn: 369388
*	[X86][Btver2] Fix latency and throughput of CMPXCHG instructions.	Andrea Di Biagio	2019-08-20	5	-4/+140
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC. Throughput is superiorly limited to 0.33 because of the implicit in/out dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the same memory location, store-to-load forwarding occurs and values for sequent loads are quickly forwarded from the store buffer. Interestingly, the functionality in LLVM that computes the reciprocal throughput doesn't seem to know about RMW instructions. That functionality only looks at the "consumed resource cycles" for the throughput computation. It should be fixed/improved by a future patch. In particular, for RMW instructions, that logic should also take into account for the write latency of in/out register operands. An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to ~17cy/inst due to cache locking, which prevents other memory uOPs to start executing before the "lock releasing" store uOP. CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less macro opcode. Their latency tend to be the same as the other RR/RM variants. RR variants are relatively fast 3cy (but still microcoded - 5 macro opcodes). CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load forwarding. That means, throughput is clearly limited by the in/out dependency on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs for the Integer pipes). I have reused the same mix of consumed resource from the other CMPXCHG instructions for CMPXCHG8B too. LOCK CMPXCHG8B is instead 18cycles. CMPXCHG16B is 32cycles. Up to 38cycles when the LOCK prefix is specified. Due to the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38) cycles dependeing on whether the LOCK prefix is specified or not. I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x microcode from CMPXCHG8B. So, I have speculatively set the JALU01 consumption to 2x the resource cycles used for CMPXCHG8B. The two new hasLockPrefix() functions are used by the btver2 scheduling model check if a MCInst/MachineInst has a LOCK prefix. Calls to hasLockPrefix() have been encoded in predicates of variant scheduling classes that describe lat/thr of CMPXCHG. Differential Revision: https://reviews.llvm.org/D66424 llvm-svn: 369365
*	[X86] Add back the -x86-experimental-vector-widening-legalization comand ↵	Craig Topper	2019-08-20	2	-143/+1227
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	line flag and all associated code, but leave it enabled by default Google is reporting performance issues with the new default behavior and have asked for a way to switch back to the old behavior while we investigate and make fixes. I've restored all of the code that had since been removed and added additional checks of the command flag onto code paths that are not otherwise guarded by a check of getTypeAction. I've also modified the cost model tables to hopefully get us back to the previous costs. Hopefully we won't need to support this for very long since we have no test coverage of the old behavior so we can very easily break it. llvm-svn: 369332
*	[X86] Teach lowerV4I32Shuffle to only use broadcasts if the mask has more ↵	Craig Topper	2019-08-19	1	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	than one undef element. Prioritize shifts over broadcast in lowerV8I16Shuffle. The motivating case are the changes in vector-reduce-add.ll where we were doing extra work in the scalar domain instead of shuffling. There may be some one use check that needs to be looked into there, but this patch sidesteps the issue by avoiding broadcasts that aren't really broadcasting. Differential Revision: https://reviews.llvm.org/D66071 llvm-svn: 369287
*	[X86] Teach lower1BitShuffle to match right shifts with upper zero elements ↵	Craig Topper	2019-08-19	1	-19/+20
\| \| \| \| \| \| \| \| \| \|	on types that don't natively support KSHIFT. We can support these by widening to a supported type, then shifting all the way to the left and then back to the right to ensure that we shift in zeroes. llvm-svn: 369232
*	[X86] Fix the lower1BitShuffle code added in r369215 to correctly pass the ↵	Craig Topper	2019-08-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	widened vector to the KSHIFT node. Not sure how to test this as we have tests that exercise this code, but nothing failed for the types not matching. Since all the k-registers use equivalent register classes everything just ends up working. llvm-svn: 369228
*	[X86] Teach lower1BitShuffle to match KSHIFTR that doesn't use Zeroable and ↵	Craig Topper	2019-08-19	1	-0/+48
\| \| \| \| \| \| \| \| \| \| \|	only relies on undef. This allows us to widen the type when the KSHIFTR instruction doesn't exist for the type. If we need to shift in zeroes into the upper elements we would need more work to guarantee zeroes when widening. llvm-svn: 369227
*	[X86] Teach lower1BitShuffle to recognize padding a subvector with zeros ↵	Craig Topper	2019-08-19	1	-7/+16
\| \| \| \| \| \| \| \| \|	with V2 as the source and V1 as the zero vector. Shuffle canonicalization can swap the sources so the zero vector might be V1 and the subvector that's being padded can be V2. llvm-svn: 369226
*	[X86] Add a special case to LowerCONCAT_VECTORSvXi1 to handle concatenating ↵	Craig Topper	2019-08-18	1	-14/+30
\| \| \| \| \| \| \| \| \| \|	zero vectors followed by one non-zero vector followed by undef vectors. For such a case we should only need a KSHIFTL, but we were previously generating a KSHIFTL followed by a KSHIFTR because we mistakenly believed we need to zero the undef elements. llvm-svn: 369224
*	[X86] Replace uses of getZeroVector for vXi1 vectors with DAG.getConstant.	Craig Topper	2019-08-18	1	-4/+4
\| \| \| \| \| \|	vXi1 vectors don't need special handling. llvm-svn: 369222
*	[X86] Improve lower1BitShuffle handling for KSHIFTL on narrow vectors.	Craig Topper	2019-08-18	1	-8/+24
\| \| \| \| \| \| \|	We can insert the value into a larger legal type and shift that by the desired amount. llvm-svn: 369215
*	Fix signed/unsigned comparison warning. NFCI.	Simon Pilgrim	2019-08-18	1	-2/+2
\| \| \| \|	llvm-svn: 369213
*	[X86] isTargetShuffleEquivalent - add BUILD_VECTOR matching	Simon Pilgrim	2019-08-18	1	-3/+21
\| \| \| \| \| \| \| \| \| \|	Add similar functionality to isShuffleEquivalent - if the mask elements don't match, try matching the BUILD_VECTOR scalars instead. As target shuffles need to handle SM_Sentinel values, this can get a bit tricky, so commit just adds actual mask element index handling - full SM_SentinelZero support will be added when the need arises. Also, enables support in matchVectorShuffleWithPACK llvm-svn: 369212
*	[X86] isTargetShuffleEquivalent - early out on illegal shuffle masks. NFCI.	Simon Pilgrim	2019-08-18	1	-8/+10
\| \| \| \| \| \|	Simplifies shuffle mask comparisons by just bailing out if the shuffle mask has any out of range values - will make an upcoming patch much simpler. llvm-svn: 369211
*	[X86] Add a one use check to the combineStore code that handles ↵	Craig Topper	2019-08-17	1	-1/+1
\| \| \| \| \| \| \| \| \|	v16i16->v16i8 truncate+store by extending to v16i32 and then emitting a v16i32->v16i8 truncstore. This prevent us from emitting a separate truncate and a truncating store instruction. llvm-svn: 369200
*	Revert [X86] SimplifyDemandedVectorElts - attempt to recombine target ↵	Jordan Rupprecht	2019-08-16	1	-17/+0
\| \| \| \| \| \| \| \| \| \|	shuffle using DemandedElts mask (reapplied) This reverts r368662 (git commit 1a8d790cf5f89c1df718844f13e934e39bef6ef5) The compile-time regression repro is in https://bugs.llvm.org/show_bug.cgi?id=43024 llvm-svn: 369167
*	[X86] Use Register/MCRegister in more places in X86	Craig Topper	2019-08-16	9	-43/+45
\| \| \| \| \| \| \| \| \| \|	This was a quick pass through some obvious places. I haven't tried the clang-tidy check. I also replaced the zeroes in getX86SubSuperRegister with X86::NoRegister which is the real sentinel name. Differential Revision: https://reviews.llvm.org/D66363 llvm-svn: 369151
*	[X86] resolveTargetShuffleInputs - add DemandedElts variant. NFCI.	Simon Pilgrim	2019-08-16	1	-3/+10
\| \| \| \| \| \|	Nothing calls this yet, everything still goes through the non (all) DemandedElts wrapper. llvm-svn: 369136
*	[X86] combineExtractWithShuffle - handle extract(truncate(x), 0)	Simon Pilgrim	2019-08-16	1	-1/+11
\| \| \| \| \| \|	Eventually we need to generalize combineExtractWithShuffle to handle all faux shuffles and handle truncate (and X86ISD::VTRUNC etc.) there, but we're not ready yet (still creates nodes on the fly, incomplete DemandedElts support, bad use of recursive Depth limit). llvm-svn: 369134
*	[X86] Alphabetize pass initialization definitions. NFCI.	Simon Pilgrim	2019-08-16	1	-1/+1
\| \| \| \|	llvm-svn: 369126
*	[X86] Remove unused include. NFCI.	Simon Pilgrim	2019-08-16	1	-1/+0
\| \| \| \| \| \|	We don't use anything from TargetOptions.h directly and its included via TargetLowering.h anyhow. llvm-svn: 369110
*	[X86] Manually reimplement getTargetInsertSubreg in ↵	Craig Topper	2019-08-16	1	-2/+6
\| \| \| \| \| \| \| \| \| \|	X86DAGToDAGISel::matchBitExtract so we can call insertDAGNode on the target constant. This is needed to maintain the topological sort order. Fixes PR42992. llvm-svn: 369084
*	Apply llvm-prefer-register-over-unsigned from clang-tidy to LLVM	Daniel Sanders	2019-08-15	25	-236/+233
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: This clang-tidy check is looking for unsigned integer variables whose initializer starts with an implicit cast from llvm::Register and changes the type of the variable to llvm::Register (dropping the llvm:: where possible). Partial reverts in: X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister X86FixupLEAs.cpp - Some functions return unsigned and arguably should be MCRegister X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister HexagonBitSimplify.cpp - Function takes BitTracker::RegisterRef which appears to be unsigned& MachineVerifier.cpp - Ambiguous operator==() given MCRegister and const Register PPCFastISel.cpp - No Register::operator-=() PeepholeOptimizer.cpp - TargetInstrInfo::optimizeLoadInstr() takes an unsigned& MachineTraceMetrics.cpp - MachineTraceMetrics lacks a suitable constructor Manual fixups in: ARMFastISel.cpp - ARMEmitLoad() now takes a Register& instead of unsigned& HexagonSplitDouble.cpp - Ternary operator was ambiguous between unsigned/Register HexagonConstExtenders.cpp - Has a local class named Register, used llvm::Register instead of Register. PPCFastISel.cpp - PPCEmitLoad() now takes a Register& instead of unsigned& Depends on D65919 Reviewers: arsenm, bogner, craig.topper, RKSimon Reviewed By: arsenm Subscribers: RKSimon, craig.topper, lenary, aemerson, wuzish, jholewinski, MatzeB, qcolombet, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, wdng, nhaehnle, sbc100, jgravelle-google, kristof.beyls, hiraditya, aheejin, kbarton, fedor.sergeev, javed.absar, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, tpr, PkmX, jocewei, jsji, Petar.Avramovic, asbirlea, Jim, s.egerton, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D65962 llvm-svn: 369041
*	[X86] Add custom type legalization for bitcasting mmx to v2i32/v4i16/v8i8 to ↵	Craig Topper	2019-08-15	3	-0/+21
\| \| \| \| \| \|	use movq2dq instead of going through memory. llvm-svn: 369031
*	[X86] Improve cost model for subvector extraction of less than 128-bit vectors	Craig Topper	2019-08-15	1	-0/+33
\| \| \| \| \| \| \| \|	Now that we're using widening legalization. We need to improve our extract_subvector cost model for these types. This patch begins by modeling these as a subvector extract followed by a permute. I've left FIXMEs in the code for future improvements. Differential Revision: https://reviews.llvm.org/D65892 llvm-svn: 369022
*	[llvm] Migrate llvm::make_unique to std::make_unique	Jonas Devlieghere	2019-08-15	7	-21/+21
\| \| \| \| \| \| \| \|	Now that we've moved to C++14, we no longer need the llvm::make_unique implementation from STLExtras.h. This patch is a mechanical replacement of (hopefully) all the llvm::make_unique instances across the monorepo. llvm-svn: 369013
*	[SDAG][x86] check for relaxed math when matching an FP reduction	Sanjay Patel	2019-08-15	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the last step in an FP add reduction allows reassociation and doesn't care about -0.0, then we are free to recognize that computation as a reduction that may reorder the intermediate steps. This is requested directly by PR42705: https://bugs.llvm.org/show_bug.cgi?id=42705 and solves PR42947 (if horizontal math instructions are actually faster than the alternative): https://bugs.llvm.org/show_bug.cgi?id=42947 Differential Revision: https://reviews.llvm.org/D66236 llvm-svn: 368995
*	[X86] Add isel pattern to match VZEXT_MOVL and a v2i64 scalar_to_vector ↵	Craig Topper	2019-08-15	1	-0/+4
\| \| \| \| \| \| \| \| \|	bitcasted from x86mmx to MOVQ2DQ. We already had the pattern for just the scalar to vector and bitcast, but not the case where we wanted zeroes in the high half of the xmm. llvm-svn: 368972
*	[X86] Make sure load is non-volatile in the MMX_X86movdq2q (loadv2i64) isel ↵	Craig Topper	2019-08-15	1	-1/+1
\| \| \| \| \| \| \| \| \|	pattern. This pattern will narrow the load so we should make sure its not volatile. llvm-svn: 368971
*	[X86] Remove unneeded isel pattern for v4f32->v4i32 fp_to_sint and ↵	Craig Topper	2019-08-15	1	-3/+0
\| \| \| \| \| \| \| \| \| \|	conversion to MMX. fp_to_sint is turned into X86cvttp2si during isel preprocessing. The other redundant isel patterns were removed previously, but I missed this one because its in the MMX td file. llvm-svn: 368968
*	[X86] Disable custom type legalization for v2i32/v4i16/v8i8->i64.	Craig Topper	2019-08-15	1	-2/+1
\| \| \| \| \| \|	The default legalization can take care of this. llvm-svn: 368967
*	[X86] Disable custom type legalization for v2i32/v4i16/v8i8->f64 bitcast.	Craig Topper	2019-08-15	1	-1/+2
\| \| \| \| \| \| \|	The generic legalization handles this in the same way so just use that. llvm-svn: 368966
*	[X86] Remove some unreachable code from LowerBITCAST.	Craig Topper	2019-08-15	1	-42/+26
\| \| \| \|	llvm-svn: 368965
*	[X86] Remove some dead code and combine some repeated code that's left.	Craig Topper	2019-08-15	1	-17/+3
\| \| \| \| \| \| \| \|	If the width is 256 bits, then we must have AVX so the else here was unnecessary. Once that's removed then the >= 256 bit code is identical to the 128 bit code with a different VT so combine them. llvm-svn: 368956