path: root/llvm/lib/Target/X86
Commit message | Author | Age | Files | Lines
...
* [X86] Remove an unnecessary 'if' that prevented treating INT64_MAX and -INT64_MAX as power of 2 minus 1 in the multiply expansion code. | Craig Topper | 2018-07-27 | 1 | -38/+36
  Not sure why they were being explicitly excluded, but I believe all the math inside the if works. I changed the absolute value to be uint64_t instead of int64_t so INT64_MIN+1 wouldn't signed-wrap.
  llvm-svn: 338101
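  For reference, the identity the expansion relies on is easy to verify standalone; the following is an illustration (function name and test value mine), not code from the patch:

```
#include <cstdint>

// Multiply by 2^n - 1 rewritten as a shift and a subtract; n == 63
// covers INT64_MAX. Doing the arithmetic on uint64_t makes the
// wraparound well defined instead of signed overflow.
constexpr uint64_t mulPow2MinusOne(uint64_t X, unsigned N) {
  return (X << N) - X;
}
static_assert(mulPow2MinusOne(12345, 63) ==
                  12345 * static_cast<uint64_t>(INT64_MAX),
              "shift-and-subtract expansion of 2^63 - 1");
```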
* [X86] Add matching for another pattern of PMADDWD. | Craig Topper | 2018-07-27 | 1 | -0/+123
  Summary: This is the pattern you get from the loop vectorizer for something like this:

```
int16_t A[1024];
int16_t B[1024];
int32_t C[512];

void pmaddwd() {
  for (int i = 0; i != 512; ++i)
    C[i] = (A[2*i]*B[2*i]) + (A[2*i+1]*B[2*i+1]);
}
```

  In this case we will have (add (mul (build_vector), (build_vector)), (mul (build_vector), (build_vector))). This is different from the pattern we currently match, which has the build_vectors between an add and a single multiply. I'm not sure what C code would get you that pattern.

  Reviewers: RKSimon, spatel, zvi
  Reviewed By: zvi
  Subscribers: llvm-commits
  Differential Revision: https://reviews.llvm.org/D49636
  llvm-svn: 338097
* [X86] When removing sign extends from gather/scatter indices, make sure we handle UpdateNodeOperands finding an existing node to CSE with. | Craig Topper | 2018-07-27 | 1 | -15/+20
  If this happens, the operands aren't updated and the existing node is returned. Make sure we pass this existing node up to the DAG combiner so that a proper replacement happens. Otherwise we get stuck in an infinite loop with an unoptimized node.
  llvm-svn: 338090
* [x86/SLH] Extract the logic to trace predicate state through calls to a helper function with a nice overview comment. NFC. | Chandler Carruth | 2018-07-26 | 1 | -19/+39
  This is a preparatory refactoring toward implementing another component of the mitigation here that was described in the design document but hadn't been implemented yet.
  llvm-svn: 338016
* [X86] Don't use CombineTo to skip adding new nodes to the DAGCombiner worklist in combineMul. | Craig Topper | 2018-07-26 | 1 | -5/+1
  I'm not sure if this was trying to avoid optimizing the new nodes further, or maybe to prevent a cycle if something tried to reform the multiply. But I don't think it's a reliable way to do that. If the user of the expanded multiply is visited by the DAGCombiner after this conversion happens, the DAGCombiner will check its operands, see that they haven't been visited by the DAGCombiner before, and will then add the first node to the worklist. This process will repeat until all the new nodes are visited. So this seems like an unreliable prevention at best.

  So this patch just returns the new nodes like any other combine. If this starts causing problems we can try to add target-specific nodes or something to more directly prevent optimizations.

  Now that we handle the combine normally, we can combine any negates the mul expansion creates into their users, since those will be visited now.
  llvm-svn: 338007
* [X86] Remove some unnecessary explicit calls to DCI.AddToWorkList. | Craig Topper | 2018-07-26 | 1 | -10/+0
  These calls were making sure some newly created nodes were added to the worklist, but the DAGCombiner has internal support for ensuring it has visited all nodes. Any time it visits a node, it ensures the operands have been queued to be visited as well. This means we only need to return the last new node; the DAGCombiner will take care of adding its inputs, thus walking backwards through all the new nodes.
  llvm-svn: 337996
* CodeGen: Cleanup regmask construction; NFC | Matthias Braun | 2018-07-26 | 1 | -3/+3
  - Avoid duplication of regmask size calculation.
  - Simplify allocateRegisterMask() call.
  - Rename allocateRegisterMask() to allocateRegMask() to be consistent with naming in MachineOperand.
  llvm-svn: 337986
* [COFF] Hoist constant pool handling from X86AsmPrinter into AsmPrinter | Martin Storsjo | 2018-07-25 | 2 | -26/+0
  In SVN r334523, the first half of comdat constant pool handling was hoisted from X86WindowsTargetObjectFile (which, despite the name, was only used for MSVC targets) into the arch-independent TargetLoweringObjectFileCOFF, but the other half of the handling was left behind in X86AsmPrinter::GetCPISymbol.

  With only half of the handling in place, inconsistent comdat sections/symbols are created, causing issues with both GNU binutils (avoided for X86 in SVN r335918) and with the MS linker, which would complain like this:

  fatal error LNK1143: invalid or corrupt file: no symbol for COMDAT section 0x4

  Differential Revision: https://reviews.llvm.org/D49644
  llvm-svn: 337950
* [x86/SLH] Sink the return hardening into the main block-walk + hardening code. | Chandler Carruth | 2018-07-25 | 1 | -26/+17
  This consolidates all our hardening calls, and simplifies the code a bit. It seems much more clear to handle all of these together. No functionality changed here.
  llvm-svn: 337895
* [x86/SLH] Improve name and comments for the main hardening function. | Chandler Carruth | 2018-07-25 | 1 | -174/+190
  This function actually does two things: it traces the predicate state through each of the basic blocks in the function (as that isn't directly handled by the SSA updater) *and* it hardens everything necessary in the block as it goes. These need to be done together so that we have the currently active predicate state to use at each point of the hardening.

  However, this also made obvious that the flag to disable actual hardening of loads was flawed -- it also disabled tracing the predicate state across function calls within the body of each block. So this patch sinks this debugging flag test to correctly guard just the hardening of loads.

  Unless load hardening was disabled, no functionality should change with this patch.
  llvm-svn: 337894
* [X86] Use X86ISD::MUL_IMM instead of ISD::MUL for multiplies we intend to be selected to LEA. | Craig Topper | 2018-07-25 | 1 | -1/+2
  This prevents other combines from possibly disturbing it.
  llvm-svn: 337890
* [x86/SLH] Teach the x86 speculative load hardening pass to harden against v1.2 BCBS attacks directly. | Chandler Carruth | 2018-07-25 | 1 | -0/+200
  Attacks using Spectre v1.2 (a subset of BCBS) are described in the paper here: https://people.csail.mit.edu/vlk/spectre11.pdf

  The core idea is to speculatively store over the address in a vtable, jumptable, or other target of indirect control flow that will be subsequently loaded. Speculative execution after such a store can forward the stored value to subsequent loads, and if called or jumped to, the speculative execution will be steered to this potentially attacker-controlled address.

  Up until now, this could be mitigated by enabling retpolines. However, that is a relatively expensive technique to mitigate this particular flavor, especially because in most cases SLH will have already mitigated this. To fully mitigate this with SLH, we need to do two core things:
  1) Unfold loads from calls and jumps, allowing the loads to be post-load hardened.
  2) Force hardening of incoming registers even if we didn't end up needing to harden the load itself.

  The reason we need to do these two things is because hardening calls and jumps from this particular variant is importantly different from hardening against leak of secret data. Because the "bad" data here isn't a secret, but in fact speculatively stored by the attacker, it may be loaded from any address, regardless of whether it is read-only memory, mapped memory, or a "hardened" address. The only 100% effective way to harden these instructions is to harden their operand itself. But to the extent possible, we'd like to take advantage of all the other hardening going on; we just need a fallback in case none of that happened to cover the particular input to the control transfer instruction.

  For users of SLH, currently they are paying 2% to 6% performance overhead for retpolines, but this mechanism is expected to be substantially cheaper. However, it is worth reminding folks that this does not mitigate all of the things retpolines do -- most notably, variant #2 is not in *any way* mitigated by this technique. So users of SLH may still want to enable retpolines, and the implementation is carefully designed to gracefully leverage retpolines to avoid the need for further hardening here when they are enabled.

  Differential Revision: https://reviews.llvm.org/D49663
  llvm-svn: 337878
* [X86] Use a shift plus an LEA for multiplying by a constant that is a power of 2 plus 2/4/8. | Craig Topper | 2018-07-25 | 1 | -0/+18
  The LEA allows us to combine an add and the multiply by 2/4/8 together, so we just need a shift for the larger power of 2.
  llvm-svn: 337875
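  As a concrete illustration (my sketch, not the patch itself), multiplying by 36 = 32 + 4 becomes one shift plus one LEA-shaped add:

```
#include <cstdint>

// x * 36 = x * (32 + 4): the shift produces 32*x, and a single LEA
// (base + index*4) folds the add and the multiply by 4 together.
constexpr uint64_t mul36(uint64_t X) { return (X << 5) + X * 4; }
static_assert(mul36(7) == 7 * 36, "shift plus LEA for pow2 + 4");
```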
* [X86] Expand mul by pow2 + 2 using a shift and two adds, similar to what we do for pow2 - 2. | Craig Topper | 2018-07-25 | 1 | -11/+15
  llvm-svn: 337874
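  For example (again an illustrative sketch, not the patch), 18 = 16 + 2 expands to one shift and two adds:

```
#include <cstdint>

// x * 18 = x * (16 + 2) = (x << 4) + x + x: one shift, two adds.
constexpr uint64_t mul18(uint64_t X) { return (X << 4) + X + X; }
static_assert(mul18(9) == 9 * 18, "pow2 + 2 via shift and two adds");
```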
* [X86] Use a two-LEA sequence for multiply by 37, 41, and 73. | Craig Topper | 2018-07-24 | 1 | -0/+9
  These fit a pattern used by 11, 21, and 19.
  llvm-svn: 337871
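  Each of these constants decomposes as x + (x + x*s1)*s2 with scales in {2, 4, 8}, i.e. exactly two base + index*scale LEAs; a compile-time check (mine, not from the patch):

```
#include <cstdint>

// One LEA computes T = x + x*8 = 9x; a second LEA computes x + T*4 = 37x.
constexpr uint64_t mul37(uint64_t X) { return X + (X + X * 8) * 4; }
// Similarly 41x = x + (x + x*4)*8 and 73x = x + (x + x*8)*8.
constexpr uint64_t mul41(uint64_t X) { return X + (X + X * 4) * 8; }
constexpr uint64_t mul73(uint64_t X) { return X + (X + X * 8) * 8; }

static_assert(mul37(3) == 3 * 37 && mul41(3) == 3 * 41 && mul73(3) == 3 * 73,
              "two-LEA decompositions");
```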
* [X86] Change multiply by 26 to use two multiplies by 5 and an add instead of multiply by 3 and 9 and a subtract. | Craig Topper | 2018-07-24 | 1 | -7/+7
  Same number of operations, but ending in an add is friendlier due to it being commutable.
  llvm-svn: 337869
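  In other words, 26 = 5*5 + 1 rather than 27 - 1 = 3*9 - 1; an illustrative check (not from the patch):

```
#include <cstdint>

// Old: 26x = (x*3)*9 - x. New: 26x = (x*5)*5 + x, ending in a
// commutable add, which gives instruction selection more freedom.
constexpr uint64_t mul26(uint64_t X) { return (X * 5) * 5 + X; }
static_assert(mul26(11) == 11 * 26, "5*5 + 1 decomposition");
```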
* [X86] When expanding a multiply by a negative of one less than a power of 2, like 31, don't generate a negate of a subtract that we'll never optimize. | Craig Topper | 2018-07-24 | 1 | -10/+12
  We generated a subtract for the power of 2 minus one, then negated the result. The negate can be optimized away by swapping the subtract operands, but DAG combine doesn't know how to do that, and we don't add any of the new nodes to the worklist anyway. This patch makes us explicitly emit the swapped subtract.
  llvm-svn: 337858
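  Concretely, for a multiplier of -31 (illustrative sketch, not the patch):

```
#include <cstdint>

// Naive: -31x = -((x << 5) - x), a subtract followed by a negate.
// Swapping the subtract operands folds the negate away: -31x = x - (x << 5).
constexpr uint64_t mulMinus31(uint64_t X) { return X - (X << 5); }
static_assert(mulMinus31(5) == static_cast<uint64_t>(-31) * 5,
              "swapped subtract absorbs the negate");
```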
* [X86] Generalize the multiply-by-30 lowering to generic multiply by power of 2 minus 2. | Craig Topper | 2018-07-24 | 1 | -15/+10
  Use a left shift and 2 subtracts like we do for 30. Move this out from behind the slow-LEA check since it doesn't even use an LEA. Use this for multiply by 14 as well.
  llvm-svn: 337856
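  For example, 14 = 16 - 2 (illustrative sketch, not the patch):

```
#include <cstdint>

// x * (2^n - 2) = (x << n) - x - x: one shift and two subtracts.
// n == 4 gives 14; n == 5 gives the original 30 case.
constexpr uint64_t mul14(uint64_t X) { return (X << 4) - X - X; }
static_assert(mul14(6) == 6 * 14, "pow2 minus 2 via shift and two subs");
```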
* [X86] Change multiply by 19 to use (9 * X) * 2 + X instead of (5 * X) * 4 - X. | Craig Topper | 2018-07-24 | 1 | -2/+2
  The new lowering can be done in 2 LEAs. The old code took 1 LEA, 1 shift, and 1 sub.
  llvm-svn: 337851
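  Both decompositions written out as a compile-time check (mine, not the patch); the new form maps to two LEAs because each step has the base + index*scale shape:

```
#include <cstdint>

// New: 19x = (9x)*2 + x, i.e. lea (x, x, 8) then lea (x, t, 2): two LEAs.
constexpr uint64_t mul19New(uint64_t X) { return (X * 9) * 2 + X; }
// Old: 19x = (5x)*4 - x: an LEA, a shift, and a subtract.
constexpr uint64_t mul19Old(uint64_t X) { return (X * 5) * 4 - X; }
static_assert(mul19New(7) == 7 * 19 && mul19Old(7) == 7 * 19,
              "both lower 19x; the new form needs only LEAs");
```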
* [MachineOutliner][NFC] Move target frame info into OutlinedFunction | Jessica Paquette | 2018-07-24 | 2 | -9/+10
  Just some gardening here. Similar to how we moved call information into Candidates, this moves outlined frame information into OutlinedFunction. This allows us to remove TargetCostInfo entirely.

  Anywhere where we returned a TargetCostInfo struct, we now return an OutlinedFunction. This establishes OutlinedFunctions as more of a general repeated sequence, and Candidates as occurrences of those repeated sequences.
  llvm-svn: 337848
* [x86] Teach the x86 backend that it can fold between TCRETURNm* and TCRETURNr* and fix latent bugs with register class updates. | Chandler Carruth | 2018-07-24 | 2 | -0/+33
  Summary: Enabling this fully exposes a latent bug in the instruction folding: we never update the register constraints for the register operands when fusing a load into another operation. The fused form could, in theory, have different register constraints on its operands. And in fact, TCRETURNm* needs its memory operands to use tailcall-compatible registers.

  I've updated the folding code to re-constrain all the registers after they are mapped onto their new instruction.

  However, we still can't enable folding in the general case from TCRETURNr* to TCRETURNm* because doing so may require more registers to be available during the tail call. If the call itself uses all but one register, and the folded load would require both a base and index register, there will not be enough registers to allocate the tail call.

  It would be better, IMO, to teach the register allocator to *unfold* TCRETURNm* when it runs out of registers (or to specifically check the number of registers available during the TCRETURNr*), but I'm not going to try and solve that for now. Instead, I've just blocked the forward folding from r -> m, leaving LLVM free to unfold from m -> r as that doesn't introduce new register pressure constraints.

  The down side is that I don't have anything that will directly exercise this. Instead, I will be immediately using this in my SLH patch. =/

  Still worse, without allowing the TCRETURNr* -> TCRETURNm* fold, I don't have any tests that demonstrate the failure to update the memory operand register constraints. This patch still seems correct, but I'm nervous about the degree of testing due to this. Suggestions?

  Reviewers: craig.topper
  Subscribers: sanjoy, mcrosier, hiraditya, llvm-commits
  Differential Revision: https://reviews.llvm.org/D49717
  llvm-svn: 337845
* [MachineOutliner][NFC] Make Candidates own their call information | Jessica Paquette | 2018-07-24 | 2 | -24/+29
  Before this, TCI contained all the call information for each Candidate. This moves that information onto the Candidates. As a result, each Candidate can now supply how it ought to be called. Thus, Candidates will be able to, say, call the same function in cheaper ways when possible. This also removes that information from TCI, since it's no longer used there. A follow-up patch for the AArch64 outliner will demonstrate this.
  llvm-svn: 337840
* [x86/SLH] Extract the core register hardening logic to a low-level helper and restructure the post-load hardening to use this. | Chandler Carruth | 2018-07-24 | 1 | -36/+73
  This isn't as trivial as I would have liked because the post-load hardening used a trick that only works for it, where it swapped in a temporary register to the load rather than replacing anything. However, there is a simple way to do this without that trick that allows this to easily reuse a friendly API for hardening a value in a register. That API will in turn be usable in subsequent patches.

  This also technically changes the position at which we insert the subreg extraction for the predicate state, but that never resulted in an actual instruction, so tests don't change at all.
  llvm-svn: 337825
* [x86/SLH] Tidy up a comment, using doxygen structure and wording it to be more accurate and understandable. | Chandler Carruth | 2018-07-24 | 1 | -5/+7
  llvm-svn: 337822
* [x86/SLH] Simplify the code for hardening a loaded value. NFC. | Chandler Carruth | 2018-07-24 | 1 | -20/+15
  This is in preparation for extracting this into a re-usable utility in this code.
  llvm-svn: 337785
* [x86/SLH] Remove complex SHRX-based post-load hardening. | Chandler Carruth | 2018-07-24 | 1 | -73/+10
  This code was really nasty, had several bugs in it originally, and wasn't carrying its weight. While on Zen we have all 4 ports available for SHRX, on all of the Intel parts with Agner's tables, SHRX can only execute on 2 ports, giving it 1/2 the throughput of OR. Worse, all too often this pattern required two SHRX instructions in a chain, hurting the critical path by a lot.

  Even if we end up needing to save/restore EFLAGS, that is no longer so bad. We pay for a uop to save the flag, but we very likely get fusion when it is used by forming a test/jCC pair or something similar. In practice, I don't expect the SHRX to be a significant savings here, so I'd like to avoid the complex code required. We can always resurrect this if/when someone has a specific performance issue addressed by it.
  llvm-svn: 337781
* Re-land r335297 "[X86] Implement more of x86-64 large and medium PIC code models" | Reid Kleckner | 2018-07-23 | 6 | -29/+132
  Don't try to generate large PIC code for non-ELF targets. Neither COFF nor MachO have relocations for large position independent code, and users have been using "large PIC" code models to JIT 64-bit code for a while now. With this change, if they are generating ELF code, their JITed code will truly be PIC, but if they target MachO or COFF, it will contain 64-bit immediates that directly reference external symbols. For a JIT, that's perfectly fine.
  llvm-svn: 337740
* [NFC][MCA] ZnVer1: Update RegisterFile to identify false dependencies on partially written registers. | Roman Lebedev | 2018-07-23 | 1 | -1/+1
  Summary: Pretty mechanical follow-up for D49196.

  As microarchitecture.pdf notes, in "20 AMD Ryzen pipeline", "20.8 Register renaming and out-of-order schedulers":
    The integer register file has 168 physical registers of 64 bits each. The floating point register file has 160 registers of 128 bits each.
  And in "20.14 Partial register access":
    The processor always keeps the different parts of an integer register together. ... An instruction that writes to part of a register will therefore have a false dependence on any previous write to the same register or any part of it.

  Reviewers: andreadb, courbet, RKSimon, craig.topper, GGanesh
  Reviewed By: GGanesh
  Subscribers: gbedwell, llvm-commits
  Differential Revision: https://reviews.llvm.org/D49393
  llvm-svn: 337676
* [x86/SLH] Fix a bug where we would harden tail calls twice -- once as a call, and then again as a return. | Chandler Carruth | 2018-07-23 | 1 | -1/+5
  Also added a comment to try and explain better why we would be doing what we're doing when hardening the (non-call) returns.
  llvm-svn: 337673
* [x86/SLH] Rename and comment the main hardening function. NFC. | Chandler Carruth | 2018-07-23 | 1 | -4/+21
  This provides an overview of the algorithm used to harden specific loads. It also brings our terminology further in line with hardening rather than checking.

  Differential Revision: https://reviews.llvm.org/D49583
  llvm-svn: 337667
* [X86] Remove the max vector width restriction from combineLoopMAddPattern and rely on splitOpsAndApply to handle splitting. | Craig Topper | 2018-07-22 | 1 | -7/+1
  This seems to be a net improvement. There's still an issue under avx512f where we have a 512-bit vpaddd, but not vpmaddwd, so we end up doing two 256-bit vpmaddwds and inserting the results before a 512-bit vpaddd. It might be better to do two 512-bit paddds with zeros in the upper half. Same number of instructions, but breaks a dependency.
  llvm-svn: 337656
* Revert "[X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use ↵Benjamin Kramer2018-07-202-48/+17
| | | | | | | | SimplifyDemandedVectorElts" This reverts commit r337547. It triggers an infinite loop. llvm-svn: 337617
* [X86] Remove isel patterns for MOVSS/MOVSD ISD opcodes with integer types. | Craig Topper | 2018-07-20 | 4 | -79/+8
  Ideally our ISD node types going into the isel table would have types consistent with their instruction domain. This prevents us having to duplicate patterns with different types for the same instruction. Unfortunately, it seems our shuffle combining is currently relying on this a little to remove some bitcasts. This seems to enable some switching between shufps and shufd. Hopefully there's some way we can address this in the combining.

  Differential Revision: https://reviews.llvm.org/D49280
  llvm-svn: 337590
* [X86] Remove what appear to be unnecessary uses of DCI.CombineTo | Craig Topper | 2018-07-20 | 1 | -19/+13
  CombineTo is most useful when you need to replace multiple results, avoid the worklist management, or need to do something else after the combine, etc. Otherwise you should be able to just return the new node and let DAGCombiner go through its usual worklist code.

  All of the places changed in this patch look to be standard cases where we should be able to use the more standard behavior of just returning the new node.

  Differential Revision: https://reviews.llvm.org/D49569
  llvm-svn: 337589
* [X86][XOP] Fix SUB constant folding for VPSHA/VPSHL shift lowering | Simon Pilgrim | 2018-07-20 | 1 | -4/+3
  We can safely use getConstant here as we're still lowering, which allows constant folding to kick in and simplify the vector shift codegen. Noticed while working on D49562.
  llvm-svn: 337578
* [X86][SSE] Use SplitOpsAndApply to improve HADD/HSUB lowering | Simon Pilgrim | 2018-07-20 | 1 | -8/+20
  Improve AVX1 256-bit vector HADD/HSUB matching by using SplitOpsAndApply to split into 128-bit instructions.
  llvm-svn: 337568
* [X86][AVX] Add support for i16 256-bit vector horizontal op redundant shuffle removal | Simon Pilgrim | 2018-07-20 | 1 | -1/+3
  llvm-svn: 337566
* [X86][AVX] Add support for 32/64-bit 256-bit vector horizontal op redundant shuffle removal | Simon Pilgrim | 2018-07-20 | 1 | -5/+11
  llvm-svn: 337561
* [X86][AVX] Convert X86ISD::VBROADCAST demanded elts combine to use SimplifyDemandedVectorElts | Simon Pilgrim | 2018-07-20 | 2 | -17/+48
  This is an early step towards using SimplifyDemandedVectorElts for target shuffle combining - this merely moves the existing X86ISD::VBROADCAST simplification code to use the SimplifyDemandedVectorElts mechanism.

  Adds X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode to handle X86ISD::VBROADCAST - in time we can support all target shuffles (and other ops) here.
  llvm-svn: 337547
* Improved sched model for X86 BSWAP* instrs. | Andrew V. Tischenko | 2018-07-20 | 11 | -79/+35
  Differential Revision: https://reviews.llvm.org/D49477
  llvm-svn: 337537
* [x86/SLH] Clean up helper naming for return instruction handling and remove dead declaration of a call instruction handling helper. | Chandler Carruth | 2018-07-19 | 1 | -4/+26
  This moves to the 'harden' terminology that I've been trying to settle on for returns. It also adds a really detailed comment explaining what all we're trying to accomplish with return instructions and why. Hopefully this makes it much more clear what exactly is being "hardened".

  Differential Revision: https://reviews.llvm.org/D49571
  llvm-svn: 337510
* [X86][AVX] Use extract_subvector to reduce vector op widths (PR36761) | Simon Pilgrim | 2018-07-19 | 1 | -0/+25
  We have a number of cases where we fail to reduce vector op widths, performing the op in a larger vector and then extracting a subvector. This is often because by default it would create illegal types.

  This peephole patch attempts to handle a few common cases detailed in PR36761, which typically involved extension+conversion to vX2f64 types.

  Differential Revision: https://reviews.llvm.org/D49556
  llvm-svn: 337500
* [X86] Fix some 'return SDValue()' after DCI.CombineTo to instead return the output of CombineTo | Craig Topper | 2018-07-19 | 1 | -13/+7
  Returning SDValue() means nothing was changed. Returning the result of CombineTo returns the first argument of CombineTo. This is specially detected by DAGCombiner as meaning that something changed, but worklist management was already taken care of.

  I think the only real effect of this change is that we now properly update the Statistic that counts the number of combines performed. That's the only thing between the check for null and the check for N in the DAGCombiner.
  llvm-svn: 337491
* [X86][BtVer2] correctly model the latency/throughput of LEA instructions. | Andrea Di Biagio | 2018-07-19 | 7 | -14/+89
  This patch fixes the latency/throughput of LEA instructions in the BtVer2 scheduling model.

  On Jaguar, a 3-operand LEA has a latency of 2cy and a reciprocal throughput of 1. That is because it uses one cycle of SAGU followed by 1cy of ALU1. An LEA with a "Scale" operand is also slow, and it has the same latency profile as the 3-operand LEA. An LEA16r has a latency of 3cy and a throughput of 0.5 (i.e. RThroughput of 2.0).

  This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td. The tablegen backend (for instruction-info) expands that definition into this (file X86GenInstrInfo.inc):

```
static bool isThreeOperandsLEA(const MachineInstr &MI) {
  return ((MI.getOpcode() == X86::LEA32r || MI.getOpcode() == X86::LEA64r ||
           MI.getOpcode() == X86::LEA64_32r || MI.getOpcode() == X86::LEA16r) &&
          MI.getOperand(1).isReg() && MI.getOperand(1).getReg() != 0 &&
          MI.getOperand(3).isReg() && MI.getOperand(3).getReg() != 0 &&
          ((MI.getOperand(4).isImm() && MI.getOperand(4).getImm() != 0) ||
           (MI.getOperand(4).isGlobal())));
}
```

  A similar method is generated in the X86_MC namespace, and included into X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h).

  Back to the BtVer2 scheduling model: a new scheduling predicate named JSlowLEAPredicate now checks if either the instruction is a three-operand LEA, or it is an LEA with a Scale value different than 1. A variant scheduling class uses that new predicate to correctly select the appropriate latency profile.

  Differential Revision: https://reviews.llvm.org/D49436
  llvm-svn: 337469
* [x86/SLH] Major refactoring of SLH implementation. There are two big changes that are intertwined here: | Chandler Carruth | 2018-07-19 | 1 | -174/+204
  1) Extracting the tracing of predicate state through the CFG to its own function.
  2) Creating a struct to manage the predicate state used throughout the pass.

  Doing #1 necessitates and motivates the particular approach for #2, as now the predicate management is spread across different functions focused on different aspects of it. A number of simplifications then fell out as a direct consequence.

  I went with an Optional to make it more natural to construct the MachineSSAUpdater object.

  This is probably the single largest outstanding refactoring step I have. Things get a bit more surgical from here. My current goal, beyond generally making this maintainable long-term, is to implement several improvements to how we do interprocedural tracking of predicate state. But I don't want to do that until the predicate state management and tracing is in a reasonably clear state.

  Differential Revision: https://reviews.llvm.org/D49427
  llvm-svn: 337446
* Fix spelling mistake in comments. NFCI. | Simon Pilgrim | 2018-07-19 | 1 | -2/+2
  llvm-svn: 337442
* [X86][SSE] Canonicalize scalar fp arithmetic shuffle patterns | Simon Pilgrim | 2018-07-18 | 1 | -2/+31
  As discussed on PR38197, this canonicalizes MOVS*(N0, OP(N0, N1)) --> MOVS*(N0, SCALAR_TO_VECTOR(OP(N0[0], N1[0])))

  This returns the scalar-fp codegen lost by rL336971. Additionally it handles the OP(N1, N0) case for commutable (FADD/FMUL) ops.

  Differential Revision: https://reviews.llvm.org/D49474
  llvm-svn: 337419
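  For intuition, the MOVS*(N0, OP(N0, N1)) shape corresponds to scalar SSE instructions such as addss, which compute in the low lane and pass the upper lanes of the first operand through; a small standalone illustration (mine, not from the patch):

```
#include <immintrin.h>
#include <cassert>

int main() {
  __m128 A = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
  __m128 B = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
  // addss: lane 0 holds A[0] + B[0]; lanes 1-3 are copied from A,
  // which is exactly the MOVSS(A, FADD(A, B)) pattern above.
  __m128 R = _mm_add_ss(A, B);
  float Out[4];
  _mm_storeu_ps(Out, R);
  assert(Out[0] == 11.0f && Out[1] == 2.0f && Out[2] == 3.0f && Out[3] == 4.0f);
  return 0;
}
```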
* [X86][SSE] Remove BLENDPD canonicalization from combineTargetShuffle | Simon Pilgrim | 2018-07-18 | 1 | -25/+0
  When rL336971 removed the scalar-fp isel patterns, we lost the need for this canonicalization - commutation/folding can handle everything else.
  llvm-svn: 337387
* [X86] Enable commuting of VUNPCKHPD to VMOVLHPS to enable load folding by using VMOVLPS with a modified address. | Craig Topper | 2018-07-18 | 3 | -16/+38
  This required an annoying amount of tablegen multiclass changes to make only VUNPCKHPDZ128rr commutable.
  llvm-svn: 337357
* [X86] Remove patterns that mix X86ISD::MOVLHPS/MOVHLPS with v2i64/v2f64 types. | Craig Topper | 2018-07-18 | 2 | -33/+0
  X86ISD::MOVLHPS/MOVHLPS should now only be emitted in SSE1-only mode. This means the v2i64/v2f64 types would be illegal, thus we don't need these patterns.
  llvm-svn: 337349