bcm5719-llvm - Project Ortega BCM5719 LLVM

	Commit message (Collapse)	Author	Age	Files	Lines
*	[PowerPC] fix trivial typos in comment, NFC	Hiroshi Inoue	2019-04-09	3	-6/+6
\| \| \| \|	llvm-svn: 357981
*	[X86] Have EVEX2VEX tablegenerator use HasVEX_L and HasEVEX_L2 fields ↵	Craig Topper	2019-04-09	1	-4/+1
\| \| \| \| \| \| \| \| \| \|	instead of the composite EVEX_LL field. Remove the EVEX_LL field. NFCI The composite existed to simplify some other tablegen code and not really in an important way. Remove the combined field and just calculate the vector size using two ifs. llvm-svn: 357972
*	[X86] Use VEX_WIG for VPINSRB/W and VPEXTRB/W to match what is done for EVEX.	Craig Topper	2019-04-09	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	The instructions document this as W0 for the VEX encoding. But there's a footnote mentioning that VEX.W is ignored in 64-bit mode. And the main VEX encoding description says the VEX.W bit is ignored for instructions that are equivalent to a legacy SSE instruction that uses REX.W to select a GPR which would apply here. By making this match EVEX we can remove a special case of allowing EVEX2VEX to turn an EVEX.WIG instruction into VEX.W0. llvm-svn: 357971
*	[X86] Split the VEX_WPrefix in X86Inst tablegen class into 3 separate fields ↵	Craig Topper	2019-04-09	1	-8/+8
\| \| \| \| \| \|	with clear meanings. llvm-svn: 357970
*	AMDGPU/GlobalISel: Implement call lowering for shaders returning values	Tom Stellard	2019-04-09	1	-3/+73
\| \| \| \| \| \| \| \| \| \|	Reviewers: arsenm, nhaehnle Subscribers: kzhuravl, jvesely, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, volkan, llvm-commits Differential Revision: https://reviews.llvm.org/D57166 llvm-svn: 357964
*	[PowerPC] initialize SchedModel according to platform.	Chen Zheng	2019-04-09	1	-0/+1
\| \| \| \| \| \|	Differential Revision: https://reviews.llvm.org/D60177 llvm-svn: 357962
*	[X86] Derive ssmem and sdmem from X86MemOperand. NFCI	Craig Topper	2019-04-09	1	-12/+2
\| \| \| \| \| \|	This changes the operand type from v4f32/v2f64 to iPTR which seems more correct. But that doesn't seem to do anything other than change the comments in X86GenDAGISel.inc. Probably because we use a ComplexPattern to do the matching so there's no autogenerated code to change. llvm-svn: 357959
*	[X86] Fix a couple lowering functions that called ReplaceAllUsesOfValueWith ↵	Craig Topper	2019-04-08	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \|	for the newly created code and then return SDValue(). Use MERGE_VALUES instead. Returning SDValue() makes the caller think custom lowering was unsuccessful and then it will fall back to trying to expand the original node. This expanded code will end up with no users and end up being pruned later. But it was unnecessary work to create it. Instead return a MERGE_VALUES with all the results so the caller knows something changed. The caller can handle the replacements. For one of the cases I had to use UNDEF as a dummy value for a result we know is unused. This should get pruned later. llvm-svn: 357935
*	[x86] make 8-bit shl undesirable	Sanjay Patel	2019-04-08	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I was looking at a potential DAGCombiner fix for 1 of the regressions in D60278, and it caused severe regression test pain because x86 TLI lies about the desirability of 8-bit shift ops. We've hinted at making all 8-bit ops undesirable for the reason in the code comment: // TODO: Almost no 8-bit ops are desirable because they have no actual // size/speed advantages vs. 32-bit ops, but they do have a major // potential disadvantage by causing partial register stalls. ...but that leads to massive diffs and exposes all kinds of optimization holes itself. Differential Revision: https://reviews.llvm.org/D60286 llvm-svn: 357912
*	[X86] Make LowerOperationWrapper more robust. Remove now unnecessary ↵	Craig Topper	2019-04-08	1	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ReplaceAllUsesWith from LowerMSCATTER. Previously LowerOperationWrapper took the number of results from the original node and counted that many results from the new node. This was intended to drop chain operands from FP_TO_SINT lowering that uses X87 with memory operations to stack temporaries. The final load had an extra chain output that needs to be ignored. Unfortunately, it didn't work with scatter which has 2 result operands, the mask output which is discarded and a chain output. The chain output is the one that is needed but it comes second and it would be dropped by the previous logic here. To work around this we were doing a ReplaceAllUses in the lowering code so that the generic legalization code wouldn't see any uses to replace since it had been given the wrong result/type. After this change we take the LowerOperation result directly if the original node has one result. This allows us to directly return the chain from scatter or the load data from the FP_TO_SINT case. When the original node has multiple results we'll ensure the returned node has the same number and copy them over. For cases where the original node has multiple results and the new code for some reason has even more results, MERGE_VALUES can be used to pass only the needed results. llvm-svn: 357887
*	[X86] Use (SUBREG_TO_REG (MOV32rm)) for extloadi64i8/extloadi64i16 when the ↵	Craig Topper	2019-04-07	2	-3/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	load is 4 byte aligned or better and not volatile. Summary: Previously we would use MOVZXrm8/MOVZXrm16, but those are longer encodings. This is similar to what we do in the loadi32 predicate. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D60341 llvm-svn: 357875
*	Reapply [ValueTracking] Support min/max selects in computeConstantRange()	Nikita Popov	2019-04-07	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for min/max flavor selects in computeConstantRange(), which allows us to fold comparisons of a min/max against a constant in InstSimplify. This fixes an infinite InstCombine loop, with the test case taken from D59378. Relative to the previous iteration, this contains some adjustments for AMDGPU med3 tests: The AMDGPU target runs InstSimplify prior to codegen, which ends up constant folding some existing med3 tests after this change. To preserve these tests a hidden -amdgpu-scalar-ir-passes option is added, which allows disabling scalar IR passes (that use InstSimplify) for testing purposes. Differential Revision: https://reviews.llvm.org/D59506 llvm-svn: 357870
*	[CostModel][X86] Masked load legalization requires a binary-shuffle not a ↵	Simon Pilgrim	2019-04-07	1	-2/+2
\| \| \| \| \| \| \| \|	select (PR39812) Expansion/truncation is better described by SK_PermuteTwoSrc than SK_Select llvm-svn: 357864
*	[X86][SSE] SimplifyDemandedBitsForTargetNode - Add initial PACKSS support	Simon Pilgrim	2019-04-07	1	-0/+19
\| \| \| \| \| \| \| \| \| \|	In the case where we only want the sign bit (e.g. when using PACKSS truncation of comparison results for MOVMSK) then we can just demand the sign bit of the source operands. This makes use of the fact that PACKSS saturates out of range values to the min/max int values - so the sign bit is always preserved. Differential Revision: https://reviews.llvm.org/D60333 llvm-svn: 357859
*	[X86] When converting (x << C1) AND C2 to (x AND (C2>>C1)) << C1 during ↵	Craig Topper	2019-04-06	1	-6/+13
\| \| \| \| \| \|	isel, try using andl over andq by favoring 32-bit unsigned immediates. llvm-svn: 357848
*	[X86] combineBitcastvxi1 - provide dst VT and src SDValue directly. NFCI.	Simon Pilgrim	2019-04-06	1	-19/+17
\| \| \| \| \| \|	Prep work to make it easier to reuse the BITCAST->MOVMSK combine in other cases. llvm-svn: 357847
*	[X86] Use a signed mask in foldMaskedShiftToScaledMask to enable a shorter ↵	Craig Topper	2019-04-06	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \|	immediate encoding. This function reorders AND and SHL to enable the SHL to fold into an LEA. The upper bits of the AND will be shifted out by the SHL so it doesn't matter what mask value we use for these bits. By using sign bits from the original mask in these upper bits we might enable a shorter immediate encoding to be used. llvm-svn: 357846
*	Fix spelling mistake. NFCI.	Simon Pilgrim	2019-04-06	1	-1/+1
\| \| \| \|	llvm-svn: 357843
*	[AMDGPU] Sort out and rename multiple CI/VI predicates	Stanislav Mekhanoshin	2019-04-06	14	-85/+82
\| \| \| \| \| \|	Differential Revision: https://reviews.llvm.org/D60346 llvm-svn: 357835
*	[X86] Enable tail calls for CallingConv::Swift	Francis Visoiu Mistrih	2019-04-05	1	-0/+2
\| \| \| \| \| \|	It's currently only enabled on AArch64 (enabled in r281376). llvm-svn: 357809
*	[X86] Preserve operand flag when expanding TCRETURNri	Francis Visoiu Mistrih	2019-04-05	3	-2/+11
\| \| \| \| \| \| \| \| \|	The expansion of TCRETURNri(64) would not keep operand flags like undef/renamable/etc. which can result in machine verifier issues. Also add plumbing to be able to use `-run-pass=x86-pseudo`. llvm-svn: 357808
*	[AMDGPU] Add MachineDCE pass after RenameIndependentSubregs	Stanislav Mekhanoshin	2019-04-05	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Detect dead lanes can create some dead defs. Then RenameIndependentSubregs will break a REG_SEQUENCE which may use these dead defs. At this point a dead instruction can be removed but we do not run a DCE anymore. MachineDCE was only running before live variable analysis. The patch adds a means to preserve LiveIntervals and SlotIndexes in case it works past this. Differential Revision: https://reviews.llvm.org/D59626 llvm-svn: 357805
*	[X86] Merge the different Jcc instructions for each condition code into ↵	Craig Topper	2019-04-05	21	-232/+177
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	single instructions that store the condition code as an operand. Summary: This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between Jcc instructions and condition codes. Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser. Reviewers: spatel, lebedev.ri, courbet, gchatelet, RKSimon Reviewed By: RKSimon Subscribers: MatzeB, qcolombet, eraman, hiraditya, arphaman, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D60228 llvm-svn: 357802
*	[X86] Merge the different SETcc instructions for each condition code into ↵	Craig Topper	2019-04-05	20	-256/+290
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	single instructions that store the condition code as an operand. Summary: This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between SETcc instructions and condition codes. Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser. Reviewers: andreadb, courbet, RKSimon, spatel, lebedev.ri Reviewed By: andreadb Subscribers: hiraditya, lebedev.ri, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D60138 llvm-svn: 357801
*	[X86] Merge the different CMOV instructions for each condition code into ↵	Craig Topper	2019-04-05	31	-422/+460
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	single instructions that store the condition code as an immediate. Summary: Reorder the condition code enum to match their encodings. Move it to MC layer so it can be used by the scheduler models. This avoids needing an isel pattern for each condition code. And it removes translation switches for converting between CMOV instructions and condition codes. Now the printer, encoder and disassembler take care of converting the immediate. We use InstAliases to handle the assembly matching. But we print using the asm string in the instruction definition. The instruction itself is marked IsCodeGenOnly=1 to hide it from the assembly parser. This does complicate the scheduler models a little since we can't assign the A and BE instructions to a separate class now. I plan to make similar changes for SETcc and Jcc. Reviewers: RKSimon, spatel, lebedev.ri, andreadb, courbet Reviewed By: RKSimon Subscribers: gchatelet, hiraditya, kristina, lebedev.ri, jdoerfert, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D60041 llvm-svn: 357800
*	[AMDGPU] predicate and feature refactoring	Stanislav Mekhanoshin	2019-04-05	18	-196/+245
\| \| \| \| \| \| \| \| \|	We have done some predicate and feature refactoring lately but did not upstream it. This is to sync. Differential revision: https://reviews.llvm.org/D60292 llvm-svn: 357791
*	Change some dyn_cast to more appropriate isa. NFC	Fangrui Song	2019-04-05	3	-3/+3
\| \| \| \|	llvm-svn: 357773
*	AMDGPU/GlobalISel: Fix non-power-of-2 select	Matt Arsenault	2019-04-05	1	-0/+1
\| \| \| \|	llvm-svn: 357762
*	[DAGCombiner][x86] scalarize splatted vector FP ops	Sanjay Patel	2019-04-05	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are a variety of vector patterns that may be profitably reduced to a scalar op when scalar ops are performed using a subset (typically, the first lane) of the vector register file. For x86, this is true for float/double ops and element 0 because insert/extract is just a sub-register rename. Other targets should likely enable the hook in a similar way. Differential Revision: https://reviews.llvm.org/D60150 llvm-svn: 357760
*	[SelectionDAG] Compute known bits of CopyFromReg	Piotr Sobczak	2019-04-05	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: Teach SelectionDAG how to compute known bits of ISD::CopyFromReg if the virtual reg used has one def only. This can be particularly useful when calling isBaseWithConstantOffset() with the ISD::CopyFromReg argument, as more optimizations may get enabled in the result. Also add a missing truncation on X86, found by testing of this patch. Change-Id: Id1c9fceec862d118c54a5b53adf72ada5d6daefa Reviewers: bogner, craig.topper, RKSimon Reviewed By: RKSimon Subscribers: lebedev.ri, nemanjai, jvesely, nhaehnle, javed.absar, jsji, jdoerfert, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D59535 llvm-svn: 357745
*	[X86] Promote i16 SRA instructions to i32	Craig Topper	2019-04-05	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \|	We already promote SRL and SHL to i32. This will introduce sign extends sometimes which might be harder to deal with than the zero we use for promoting SRL. I ran this through some of our internal benchmark lists and didn't see any major regressions. I think there might be some DAG combine improvement opportunities in the test changes here. Differential Revision: https://reviews.llvm.org/D60278 llvm-svn: 357743
*	[IR] Refactor attribute methods in Function class (NFC)	Evandro Menezes	2019-04-04	31	-63/+63
\| \| \| \| \| \| \| \|	Rename the functions that query the optimization kind attributes. Differential revision: https://reviews.llvm.org/D60287 llvm-svn: 357731
*	Revert [X86] When using Win64 ABI, exit with error if SSE is disabled for ↵	James Y Knight	2019-04-04	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	varargs It unnecessarily breaks previously-working code which used varargs, but didn't pass any float/double arguments (such as EDK2). Also revert the fixup on top of that: Revert [X86] Fix a test from r357317 This reverts r357317 (git commit d413f41de6baf500e5d20c638375447e18777db2) This reverts r357380 (git commit 7af32444b9b17719ebabb6bee6eb52465acc8507) llvm-svn: 357718
*	[WebAssembly] Add new explicit relocation types for PIC relocations	Sam Clegg	2019-04-04	4	-26/+63
\| \| \| \| \| \| \| \|	See https://github.com/WebAssembly/tool-conventions/pull/106 Differential Revision: https://reviews.llvm.org/D59907 llvm-svn: 357710
*	[x86] eliminate unnecessary broadcast of horizontal op	Sanjay Patel	2019-04-04	1	-4/+14
\| \| \| \| \| \| \|	This is another pattern that comes up if we more aggressively scalarize FP ops. llvm-svn: 357703
*	[RISCV] Support assembling TLS add and associated modifiers	Lewis Revill	2019-04-04	9	-11/+202
\| \| \| \| \| \| \| \| \| \|	This patch adds support in the MC layer for parsing and assembling the 4-operand add instruction needed for TLS addressing. This also involves parsing the %tprel_hi, %tprel_lo and %tprel_add operand modifiers. Differential Revision: https://reviews.llvm.org/D55341 llvm-svn: 357698
*	[SystemZ] Bugfix in isFusableLoadOpStorePattern()	Jonas Paulsson	2019-04-04	1	-15/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This function is responsible for checking the legality of fusing an instance of load -> op -> store into a single operation. In the SystemZ backend the check was incomplete and a test case emerged with a cycle in the instruction selection DAG as a result. Instead of using the NodeIds to determine node relationships, hasPredecessorHelper() now is used just like in the X86 backend. This handled the failing tests and as well gave a few additional transformations on benchmarks. The SystemZ isFusableLoadOpStorePattern() is now a very near copy of the X86 function, and it seems this could be made a utility function in common code instead. Review: Ulrich Weigand https://reviews.llvm.org/D60255 llvm-svn: 357688
*	[ARM GlobalISel] Support DBG_VALUE	Diana Picus	2019-04-04	1	-0/+7
\| \| \| \| \| \|	Make sure we can map and select DBG_VALUE. llvm-svn: 357681
*	[AArch64][AsmParser] Fix .arch_extension directive parsing	Sander de Smalen	2019-04-04	1	-8/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes .arch_extension directive parsing to handle a wider range of architecture extension options. The existing parser was parsing extensions as an identifier which breaks for extensions containing a "-", such as the "tlb-rmi" extension. The extension is now parsed as a string. This is consistent with the extension parsing in the .arch and .cpu directive parsing. Patch by Cullen Rhodes (c-rhodes) Reviewed By: SjoerdMeijer Differential Revision: https://reviews.llvm.org/D60118 llvm-svn: 357677
*	[X86] Use INSERT_SUBREG rather than SUBREG_TO_REG when creating LEA64_32 ↵	Craig Topper	2019-04-04	1	-13/+8
\| \| \| \| \| \| \| \| \|	during isel. SUBREG_TO_REG is supposed to be used to assert that we know the upper bits are zero. But that isn't the case here. We've done no analysis of the inputs. llvm-svn: 357673
*	[WebAssembly] EmscriptenEHSjLj: Don't abort if __THREW__ is defined	Sam Clegg	2019-04-04	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows __THREW__ to be defined in the current module, although it is still required to be a GlobalVariable. In emscripten we want to be able to compile the source code that defines this symbol. Previously we were avoiding this by not running this pass when building that compiler-rt library, but I have changed it to build using the normal compiler path: https://github.com/emscripten-core/emscripten/pull/8391 Differential Revision: https://reviews.llvm.org/D60232 llvm-svn: 357665
*	[X86] Remove CustomInserters for RDPKRU/WRPKRU. Use some custom lowering and ↵	Craig Topper	2019-04-04	4	-52/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	new ISD opcodes instead. These inserters inserted some instructions to zero some registers and copied from virtual registers to physical registers. This change instead inserts the zeros directly into the DAG at lowering time using new ISD opcodes that take the extra zeroes as inputs. The zeros will then go through isel on their own to select the MOV32r0 pseudo. Then we just need to mention the physical registers directly in the isel patterns and the isel table and InstrEmitter will take care of inserting the necessary copies to/from physical registers. llvm-svn: 357659
*	[X86] Remove CustomInserter pseudos for MONITOR/MONITORX/CLZERO. Use custom ↵	Craig Topper	2019-04-03	5	-84/+78
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	instruction selection instead. This custom inserter existed so we could do a weird thing where we pretended that the instructions support a full address mode instead of taking a pointer in EAX/RAX. I think this was largely so we could be pointer size agnostic in the isel pattern. To make this work we would then put the address into an LEA into EAX/RAX in front of the instruction after isel. But the LEA is overkill when we just have a base pointer. So we end up using the LEA as a slower MOV instruction. With this change we now just do custom selection during isel instead and just assign the incoming address of the intrinsic into EAX/RAX based on its size. After the intrinsic is selected, we can let isel take care of selecting an LEA or other operation to do any address computation needed in this basic block. I've also split the instruction into a 32-bit mode version and a 64-bit mode version so the implicit use is properly sized based on the pointer. Without this we get comments in the assembly output about killing eax and defining rax or vice versa depending on whether we define the instruction to use EAX/RAX. llvm-svn: 357652
*	[x86] fold shuffles of h-ops that have an undef operand	Sanjay Patel	2019-04-03	1	-2/+2
\| \| \| \| \| \| \|	If an operand is undef, we can assume it's the same as the other operand. llvm-svn: 357644
*	[x86] eliminate movddup of horizontal op	Sanjay Patel	2019-04-03	1	-2/+11
\| \| \| \| \| \| \| \| \| \| \| \|	This pattern would show up as a regression if we more aggressively convert vector FP ops to scalar ops. There's still a missed optimization for the v4f64 legal case (AVX) because we create that h-op with an undef operand. We should probably just duplicate the operands for that pattern to avoid trouble. llvm-svn: 357642
*	[IR] Create new method in `Function` class (NFC)	Evandro Menezes	2019-04-03	2	-2/+2
\| \| \| \| \| \| \| \| \|	Create method `optForNone()` testing for the function level equivalent of `-O0` and refactor appropriately. Differential revision: https://reviews.llvm.org/D59852 llvm-svn: 357638
*	AMDGPU: Split block for si_end_cf	Matt Arsenault	2019-04-03	5	-17/+128
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Relying on no spill or other code being inserted before this was precarious. It relied on code diligently checking isBasicBlockPrologue which is likely to be forgotten. Ideally this could be done earlier, but this doesn't work because of phis. Any other instruction can't be placed before them, so we have to accept the position being incorrect during SSA. This avoids regressions in the fast register allocator rewrite from inverting the direction. llvm-svn: 357634
*	[X86] Extend boolean arguments to inline-asm according to getBooleanType	Krzysztof Parzyszek	2019-04-03	1	-2/+7
\| \| \| \| \| \|	Differential Revision: https://reviews.llvm.org/D60208 llvm-svn: 357615
*	[X86][AVX] combineHorizontalPredicateResult - split any/allof v16i16/v32i8 ↵	Simon Pilgrim	2019-04-03	1	-1/+8
\| \| \| \| \| \| \| \|	reduction on AVX1 Perform the 2 x 128-bit lo/hi OR/AND on the vectors before calling PMOVMSKB on the 128-bit result. llvm-svn: 357611
*	[X86][AVX] combineHorizontalPredicateResult - support v16i16/v32i8 reduction ↵	Simon Pilgrim	2019-04-03	1	-6/+3
\| \| \| \| \| \| \| \|	on AVX1 Use getPMOVMSKB helper which splits v32i8 MOVMSK calls on pre-AVX2 targets. llvm-svn: 357608
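
The r357848 entry above describes rewriting `(x << C1) AND C2` as `(x AND (C2>>C1)) << C1` and preferring masks that fit a 32-bit unsigned immediate. As a rough illustration of why the two forms agree, here is a minimal Python sketch. It is not taken from the LLVM sources: the function names, the 64-bit register width, and the `fits_u32` helper are assumptions made purely for this example.

```python
# Illustrative sketch only; names and the 64-bit width are assumptions,
# not code from the X86 backend.
MASK64 = (1 << 64) - 1


def shl_then_and(x, c1, c2):
    """(x << C1) AND C2, truncated to 64 bits."""
    return ((x << c1) & MASK64) & c2


def and_then_shl(x, c1, c2):
    """(x AND (C2 >> C1)) << C1, truncated to 64 bits."""
    return ((x & (c2 >> c1)) << c1) & MASK64


def fits_u32(value):
    """True if value can be encoded as a 32-bit unsigned immediate."""
    return 0 <= value <= 0xFFFFFFFF


x = 0x123456789ABCDEF0
c1 = 8
c2 = 0xFFFFFFFF00  # too wide for 32 bits before shifting

# Both orderings produce the same value, because the low C1 bits of
# (x << C1) are already zero.
same = shl_then_and(x, c1, c2) == and_then_shl(x, c1, c2)

# After shifting, the mask fits a 32-bit unsigned immediate.
narrow = fits_u32(c2 >> c1)
```

In this example the original mask needs 40 bits, while the shifted mask `C2 >> C1` fits in 32 bits, which is the kind of case where a narrower AND could be chosen.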