| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
| |
These aren't documented in the Intel Intrinsics Guide, but are supported by gcc and icc.
Includes these intrinsics:
_ktestc_mask8_u8, _ktestz_mask8_u8, _ktest_mask8_u8
_ktestc_mask16_u8, _ktestz_mask16_u8, _ktest_mask16_u8
_ktestc_mask32_u8, _ktestz_mask32_u8, _ktest_mask32_u8
_ktestc_mask64_u8, _ktestz_mask64_u8, _ktest_mask64_u8
llvm-svn: 341265
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds:
_cvtmask8_u32, _cvtmask16_u32, _cvtmask32_u32, _cvtmask64_u64
_cvtu32_mask8, _cvtu32_mask16, _cvtu32_mask32, _cvtu64_mask64
_load_mask8, _load_mask16, _load_mask32, _load_mask64
_store_mask8, _store_mask16, _store_mask32, _store_mask64
These are currently missing from the Intel Intrinsics Guide webpage.
llvm-svn: 341251
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the following intrinsics:
_kshiftli_mask8
_kshiftli_mask16
_kshiftli_mask32
_kshiftli_mask64
_kshiftri_mask8
_kshiftri_mask16
_kshiftri_mask32
_kshiftri_mask64
llvm-svn: 341234
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the following intrinsics:
_kadd_mask64
_kadd_mask32
_kadd_mask16
_kadd_mask8
These are missing from the Intel Intrinsics Guide, but are implemented by both gcc and icc.
llvm-svn: 340879
|
|
|
|
|
|
|
|
| |
names for 16 bit masks.
This matches gcc and icc despite not being documented in the Intel Intrinsics Guide.
llvm-svn: 340798
|
|
|
|
|
|
|
|
|
|
| |
64-bit mask registers.
This also adds a second intrinsic name for the 16-bit mask versions.
These intrinsics match gcc and icc. They just aren't published in the Intel Intrinsics Guide so I only recently found they existed.
llvm-svn: 340719
|
|
|
|
|
|
| |
k-registers.
llvm-svn: 340718
|
|
|
|
|
|
|
|
| |
avx512dqintrin.h and avx512bwintrin.h.
This is preparation for adding removing min_vector_width 512 from some intrinsics.
llvm-svn: 340717
|
|
|
|
|
|
| |
Fixes test failure after r340713
llvm-svn: 340714
|
|
|
|
|
|
| |
registers.
llvm-svn: 340713
|
|
|
|
|
|
|
|
|
|
|
|
| |
r337619 added __shiftleft128 / __shiftright128 as functions in intrin.h.
Microsoft's STL plans on using these functions, and they're using intrin0.h
which just has declarations of built-ins to not pull in the huge intrin.h
header in the standard library headers. That requires that these functions are
real built-ins.
https://reviews.llvm.org/D50907
llvm-svn: 340048
|
|
|
|
|
|
| |
builtin instead.
llvm-svn: 339845
|
|
|
|
|
|
| |
builtin instead.
llvm-svn: 339843
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
These macros are defined in the C11 standard and can be defined based on
the __*_HAS_DENORM__ default macros.
Reviewers: bruno, rsmith, doug.gregor
Subscribers: llvm-commits, enh, srhines
Differential Revision: https://reviews.llvm.org/D37302
llvm-svn: 339284
|
|
|
|
|
|
|
|
| |
This matches how GCC defines this struct.
Differential Revision: https://reviews.llvm.org/D50380
llvm-svn: 339170
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The code defines __FLOAT_H and then includes the next <float.h>, which is
guarded on __FLOAT_H so it gets skipped entirely. This commit uses the header
guard __CLANG_FLOAT_H, like other headers (such as limits.h) do.
Reviewers: jfb
Subscribers: dexonsmith, cfe-commits
Differential Revision: https://reviews.llvm.org/D50276
llvm-svn: 339016
|
|
|
|
|
|
| |
sed -Ei 's/[[:space:]]+$//' include/**/*.{def,h,td} lib/**/*.{cpp,h}
llvm-svn: 338291
|
|
|
|
|
|
|
|
|
|
|
|
| |
Carefully match the pattern matched by ISel so that this produces shld / shrd
(unless Subtarget->isSHLDSlow() is true).
Thanks to Craig Topper for providing the LLVM IR pattern that gets successfully
matched.
Fixes PR37755.
llvm-svn: 337619
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
CUDA-9.2 made all integer SIMD functions into compiler builtins,
so clang no longer has access to the implementation of these
functions in either headers of libdevice and has to provide
its own implementation.
This is mostly a 1:1 mapping to a corresponding PTX instructions
with an exception of vhadd2/vhadd4 that don't have an equivalent
instruction and had to be implemented with a bit hack.
Performance of this implementation will be suboptimal for SM_50
and newer GPUs where PTXAS generates noticeably worse code for
the SIMD instructions compared to the code it generates
for the inline assembly generated by nvcc (or used to come
with CUDA headers).
Differential Revision: https://reviews.llvm.org/D49274
llvm-svn: 337587
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Added the following intrinsics:
_BitScanForward, _BitScanReverse, _BitScanForward64, _BitScanReverse64
_InterlockedAnd64, _InterlockedDecrement64, _InterlockedExchange64,
_InterlockedExchangeAdd64, _InterlockedExchangeSub64,
_InterlockedIncrement64, _InterlockedOr64, _InterlockedXor64.
Reviewers: compnerd, mstorsjo, rnk, javed.absar
Reviewed By: mstorsjo
Subscribers: kristof.beyls, chrib, llvm-commits
Differential Revision: https://reviews.llvm.org/D49445
llvm-svn: 337327
|
|
|
|
| |
llvm-svn: 337321
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch lowers the _mm[256|512]_cvtepi{64|32|16}_epi{32|16|8} intrinsics to
native IR in cases where the result's length is less than 128 bits.
The resulting IR for 256-bit inputs is folded into VPMOV instructions, while for
128-bit inputs the vpshufb (or, in the 64-to-32-bit case, vinsertps)
instructions are generated instead
Differential Revision: https://reviews.llvm.org/D48712
llvm-svn: 336643
|
|
|
|
|
|
|
|
|
|
| |
rounding version of the fma intrinsics.
The rounding mode is checked in CGBuiltin.cpp to generate the correct intrinsic call.
Making this switch switchs the masking to use the i8 bitcast to <8 x i1> and extract i1 version of the IR for the mask. Previously we ended up with a scalar 'and' plus an icmp.
llvm-svn: 336637
|
|
|
|
|
|
|
|
|
|
|
|
| |
is suitable for use in scalar mask intrinsics.
This will convert the i8 mask argument to <8 x i1> and extract an i1 and then emit a select instruction. This replaces the '(__U & 1)" and ternary operator used in some of intrinsics. The old sequence was lowered to a scalar and and compare. The new sequence uses an i1 vector that will interoperate better with other mask intrinsics.
This removes the need to handle div_ss/sd specially in CGBuiltin.cpp. A follow up patch will add the GCCBuiltin name back in llvm and remove the custom handling.
I made some adjustments to legacy move_ss/sd intrinsics which we reused here to do a simpler extract and insert instead of 2 extracts and two inserts or a shuffle.
llvm-svn: 336622
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
width. Add a min_vector_width function attribute and tag all x86 instrinsics with it
This is part of an ongoing attempt at making 512 bit vectors illegal in the X86 backend type legalizer due to CPU frequency penalties associated with wide vectors on Skylake Server CPUs. We want the loop vectorizer to be able to emit IR containing wide vectors as intermediate operations in vectorized code and allow these wide vectors to be legalized to 256 bits by the X86 backend even though we are targetting a CPU that supports 512 bit vectors. This is similar to what happens with an AVX2 CPU, the vectorizer can emit wide vectors and the backend will split them. We want this splitting behavior, but still be able to use new Skylake instructions that work on 256-bit vectors and support things like masking and gather/scatter.
Of course if the user uses explicit vector code in their source code we need to not split those operations. Especially if they have used any of the 512-bit vector intrinsics from immintrin.h. And we need to make it so that merely using the intrinsics produces the expected code in order to be backwards compatible.
To support this goal, this patch adds a new IR function attribute "min-legal-vector-width" that can indicate the need for a minimum vector width to be legal in the backend. We need to ensure this attribute is set to the largest vector width needed by any intrinsics from immintrin.h that the function uses. The inliner will be reponsible for merging this attribute when a function is inlined. We may also need a way to limit inlining in the future as well, but we can discuss that in the future.
To make things more complicated, there are two different ways intrinsics are implemented in immintrin.h. Either as an always_inline function containing calls to builtins(can be target specific or target independent) or vector extension code. Or as a macro wrapper around a taget specific builtin. I believe I've removed all cases where the macro was around a target independent builtin.
To support the always_inline function case this patch adds attribute((min_vector_width(128))) that can be used to tag these functions with their vector width. All x86 intrinsic functions that operate on vectors have been tagged with this attribute.
To support the macro case, all x86 specific builtins have also been tagged with the vector width that they require. Use of any builtin with this property will implicitly increase the min_vector_width of the function that calls it. I've done this as a new property in the attribute string for the builtin rather than basing it on the type string so that we can opt into it on a per builtin basis and avoid any impact to target independent builtins.
There will be future work to support vectors passed as function arguments and supporting inline assembly. And whatever else we can find that isn't covered by this patch.
Special thanks to Chandler who suggested this direction and reviewed a preview version of this patch. And thanks to Eric Christopher who has had many conversations with me about this issue.
Differential Revision: https://reviews.llvm.org/D48617
llvm-svn: 336583
|
|
|
|
| |
llvm-svn: 336499
|
|
|
|
|
|
|
|
|
|
| |
and hardcoded _MM_FROUND_CUR_DIRECTION internally.
I believe these have been broken since their introduction into clang.
I've enhanced the tests for these intrinsics to using a real rounding mode and checking all the intrinsic arguments instead of just the name.
llvm-svn: 336498
|
|
|
|
|
|
|
|
| |
shuffle builtins instead of generic __builtin_shufflevector.
I added the builtins for 128, 256, and 512 bits recently but looks like I failed to convert to using the 512 bit one.
llvm-svn: 336488
|
|
|
|
|
|
|
|
| |
that cause extra bitcasts to be emitted in the IR.
Found via imprecise grepping of the -O0 IR. There could still be more bugs out there.
llvm-svn: 336487
|
|
|
|
|
|
|
|
| |
We had the mask versions of the rounding intrinsics, but not one without masking.
Also change the rounding tests to not use the CUR_DIRECTION rounding mode.
llvm-svn: 336470
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
All of these found by grepping through IR from the builtin tests for extra trunc and zext/sext instructions that shouldn't have been there.
Some of these were real bugs where we lost bits from the user input:
_mm512_mask_broadcast_f32x8
_mm512_maskz_broadcast_f32x8
_mm512_mask_broadcast_i32x8
_mm512_maskz_broadcast_i32x8
_mm256_mask_cvtusepi16_storeu_epi8
llvm-svn: 336042
|
|
|
|
|
|
| |
instead.
llvm-svn: 336036
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: Tests in a separate change to the test-suite.
Reviewers: rsmith, tra
Subscribers: lahwaacz, sanjoy, cfe-commits
Differential Revision: https://reviews.llvm.org/D48151
llvm-svn: 336026
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Fixes PR37753: min/max can't be called from __host__ __device__
functions in C++14 mode.
Testcase in a separate test-suite commit.
Reviewers: rsmith
Subscribers: sanjoy, lahwaacz, cfe-commits
Differential Revision: https://reviews.llvm.org/D48036
llvm-svn: 336025
|
|
|
|
|
|
| |
builtins instead.
llvm-svn: 335945
|
|
|
|
|
|
|
|
|
|
|
|
| |
__stosw/d/q to mark registers/memory as modified
The inline assembly for these didn't mark that edi, esi, ecx are modified by movs/stos instruction. It also didn't mark that memory is modified.
This issue was reported to llvm-dev last year http://lists.llvm.org/pipermail/cfe-dev/2017-November/055863.html but no bug was ever filed.
Differential Revision: https://reviews.llvm.org/D48448
llvm-svn: 335270
|
|
|
|
|
|
|
|
|
|
| |
ms-intrinsics.c to not issue warnings
ud2 and int2c were missing declarations entirely. And the bitscans were only under x86_64, but they seem to be in BuiltinsARM.def as well and are tested by ms_intrinsics.c
Differential Revision: https://reviews.llvm.org/D48187
llvm-svn: 335259
|
|
|
|
|
|
|
|
|
|
|
|
| |
other intrinsics and remove undef shuffle indices.
Similar to what was done to max/min recently.
These already reduced the vector width to 256 and 128 bit as we go unlike the original max/min code.
Differential Revision: https://reviews.llvm.org/D48346
llvm-svn: 335253
|
|
|
|
|
|
| |
select in IR instead.
llvm-svn: 335200
|
|
|
|
|
|
| |
redefining it.
llvm-svn: 335086
|
|
|
|
|
|
|
|
| |
better use of other functions and to reduce width to 256 and 128 bits were possible.""
Test has been updated to reflect the IRGen.
llvm-svn: 335075
|
|
|
|
|
|
|
|
| |
better use of other functions and to reduce width to 256 and 128 bits were possible."
The test changes are failing the buildbot and its going to take me some time to fix it.
llvm-svn: 335072
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
other functions and to reduce width to 256 and 128 bits were possible.
We only need to use 512 bit vectors all the way through v8i64 reductions since those max instructions are new to avx512f and only available in 512 bits until SKX.
For v16i32 and floating point we have legacy 128/256 bit instructions we can use.
I've tried to use other intrinsics to reduce the verbosity of the code and avoid having to mention all the shuffles. I've also removed all the -1 shuffle indices so the output sequence is fully specified and not left to backend optimization.
Differential Revision: https://reviews.llvm.org/D47401
llvm-svn: 335070
|
|
|
|
|
|
|
|
|
|
|
|
| |
__builtin_ia32_pslldqi128_byteshift and similar for other sizes. Remove the multiply by 8 from the header files.
The previous names took the shift amount in bits to match gcc and required a multiply by 8 in the header. This creates a misleading error message when we check the range of the immediate to the builtin since the allowed range also got multiplied by 8.
This commit changes the builtins to use a byte shift amount to match the underlying instruction and the Intel intrinsic.
Fixes the remaining issue from PR37795.
llvm-svn: 334773
|
|
|
|
|
|
|
|
|
|
| |
_InterlockedExchange_HLEAcquire/Release and _InterlockedCompareExchange_HLEAcquire/Release for MSVC compatibility.
Clang/LLVM doesn't have a way to pass an HLE hint through to the X86 backend to emit HLE prefixed instructions. So this is a good short term fix.
Differential Revision: https://reviews.llvm.org/D47672
llvm-svn: 334751
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: Lowering add, sub, mul, and div mask scalar intrinsic calls
to native IR.
Reviewers: craig.topper, RKSimon, spatel, sroland
Reviewed By: craig.topper
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D47979
llvm-svn: 334741
|
|
|
|
|
|
| |
builtins. Use select builtins instead.
llvm-svn: 334577
|
|
|
|
| |
llvm-svn: 334385
|
|
|
|
|
|
| |
builtins. Use select in IR instead.
llvm-svn: 334359
|
|
|
|
|
|
|
|
|
|
| |
others.
I'd like to make the select builtins require an avx512f, avx512bw, or avx512vl fature to match what is normally required to get masking. Truncate is special in that there are instructions with a 128/256-bit masked result even without avx512vl.
By using special buitlins we can emit a select without using the 128/256-bit select builtins.
llvm-svn: 334331
|