| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
| |
llvm-svn: 356359
|
|
|
|
|
|
|
|
|
| |
These are used to help convert OR->LEA when needed to avoid avoid a copy. They
aren't need after register allocation.
Happens to remove an ugly goto from X86MCCodeEmitter.cpp
llvm-svn: 356356
|
|
|
|
|
|
| |
All the other instructions are printed with a preceeding tab.
llvm-svn: 356355
|
|
|
|
|
|
|
|
|
|
|
|
| |
for all other printing functions.
The only thing the print methods currently need to know is the string to print for the memory size in intel syntax.
This patch merges the functions based on this string. If we ever need something else in the future, its easy to split them back out.
This reduces the number of cases in the assembly printers. It shrinks the intel printer to only use 7 bytes per instruction instead of 8.
llvm-svn: 356352
|
|
|
|
|
|
|
|
|
|
| |
of custom printing and custom parsing to achieve the same result and more
Similar to the previous patch for VPCOM.
Differential Revision: https://reviews.llvm.org/D59398
llvm-svn: 356344
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
custom printing and custom parsing to achieve the same result and more
Previously we had a regular form of the instruction used when the immediate was 0-7. And _alt form that allowed the full 8 bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.
The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".
The _alt form on the other hand parsed and printed like any other instruction with no specialness.
With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.
I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd which have 32 immediate values and 3 encoding flavors, 3 register sizes, etc. This didn't seem to scale well for clang binary size. So custom printing seemed a better trade off.
I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.
Differential Revision: https://reviews.llvm.org/D59398
llvm-svn: 356343
|
|
|
|
|
|
| |
Replaces existing i1-only fold.
llvm-svn: 356325
|
|
|
|
|
|
| |
Improved constant folding for PEXTRB/PEXTRW will be added in a future commit
llvm-svn: 356324
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780
If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`
Reviewers: craig.topper, andreadb, RKSimon
Reviewed By: craig.topper
Subscribers: llvm-commits, andreadb
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59412
llvm-svn: 356301
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP
Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.
There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?
# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>
// Future preliminary libc++ code, from Marshall Clow.
namespace std {
template <class _Tp>
__inline _Tp midpoint(_Tp __a, _Tp __b) noexcept {
using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type;
int __sign = 1;
_Up __m = __a;
_Up __M = __b;
if (__a > __b) {
__sign = -1;
__m = __b;
__M = __a;
}
return __a + __sign * _Tp(_Up(__M - __m) >> 1);
}
} // namespace std
template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
std::numeric_limits<T>::max());
std::vector<T> v;
v.reserve(count);
std::generate_n(std::back_inserter(v), count,
[&dis, &gen]() { return dis(gen); });
assert(v.size() == count);
return v;
}
struct RandRand {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
return std::make_pair(getVectorOfRandomNumbers<T>(count),
getVectorOfRandomNumbers<T>(count));
}
};
struct ZeroRand {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
return std::make_pair(std::vector<T>(count, T(0)),
getVectorOfRandomNumbers<T>(count));
}
};
template <class T, class Gen>
void BM_StdMidpoint(benchmark::State& state) {
const size_t Length = state.range(0);
const std::pair<std::vector<T>, std::vector<T>> Data =
Gen::template Gen<T>(Length);
const std::vector<T>& a = Data.first;
const std::vector<T>& b = Data.second;
assert(a.size() == Length && b.size() == a.size());
benchmark::ClobberMemory();
benchmark::DoNotOptimize(a);
benchmark::DoNotOptimize(a.data());
benchmark::DoNotOptimize(b);
benchmark::DoNotOptimize(b.data());
for (auto _ : state) {
for (size_t i = 0; i < Length; i++) {
const auto calculated = std::midpoint(a[i], b[i]);
benchmark::DoNotOptimize(calculated);
}
}
state.SetComplexityN(Length);
state.counters["midpoints"] =
benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
state.counters["midpoints/sec"] =
benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
const size_t BytesRead = 2 * sizeof(T) * Length;
state.counters["bytes_read/iteration"] =
benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
benchmark::Counter::OneK::kIs1024);
state.counters["bytes_read/sec"] = benchmark::Counter(
BytesRead, benchmark::Counter::kIsIterationInvariantRate,
benchmark::Counter::OneK::kIs1024);
}
template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
const size_t L2SizeBytes = 2 * 1024 * 1024;
// What is the largest range we can check to always fit within given L2 cache?
const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
/*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}
// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
->Apply(CustomArguments<uint8_t>);
// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
->Apply(CustomArguments<uint8_t>);
```
```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW
RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm
2019-03-06 21:53:31
Running ./llvm-cmov-bench-OLD
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 1.78, 1.81, 1.36
----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072 300398 ns 300404 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s
BM_StdMidpoint<int32_t, RandRand>_BigO 2.29 N 2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS 2 % 2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072 300433 ns 300433 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO 2.29 N 2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS 2 % 2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536 169857 ns 169858 ns 4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s
BM_StdMidpoint<int64_t, RandRand>_BigO 2.59 N 2.59 N
BM_StdMidpoint<int64_t, RandRand>_RMS 3 % 3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536 169770 ns 169771 ns 4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO 2.59 N 2.59 N
BM_StdMidpoint<uint64_t, RandRand>_RMS 3 % 3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144 591169 ns 591179 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s
BM_StdMidpoint<int16_t, RandRand>_BigO 2.25 N 2.25 N
BM_StdMidpoint<int16_t, RandRand>_RMS 1 % 1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144 591264 ns 591274 ns 1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO 2.25 N 2.25 N
BM_StdMidpoint<uint16_t, RandRand>_RMS 1 % 1 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288 2983669 ns 2983689 ns 235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s
BM_StdMidpoint<int8_t, RandRand>_BigO 5.69 N 5.69 N
BM_StdMidpoint<int8_t, RandRand>_RMS 0 % 0 %
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288 2668398 ns 2668419 ns 262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO 5.09 N 5.09 N
BM_StdMidpoint<uint8_t, RandRand>_RMS 0 % 0 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072 300887 ns 300887 ns 2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO 2.29 N 2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS 2 % 2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536 169634 ns 169634 ns 4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO 2.59 N 2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS 3 % 3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144 592252 ns 592255 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO 2.26 N 2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS 1 % 1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288 987295 ns 987309 ns 711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO 1.88 N 1.88 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS 1 % 1 %
RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW
2019-03-06 21:56:58
Running ./llvm-cmov-bench-NEW
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 1.17, 1.46, 1.30
----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072 300878 ns 300880 ns 2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s
BM_StdMidpoint<int32_t, RandRand>_BigO 2.29 N 2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS 2 % 2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072 300231 ns 300226 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO 2.29 N 2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS 2 % 2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536 170819 ns 170777 ns 4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s
BM_StdMidpoint<int64_t, RandRand>_BigO 2.60 N 2.60 N
BM_StdMidpoint<int64_t, RandRand>_RMS 3 % 3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536 171705 ns 171708 ns 4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO 2.62 N 2.62 N
BM_StdMidpoint<uint64_t, RandRand>_RMS 3 % 3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144 592510 ns 592516 ns 1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s
BM_StdMidpoint<int16_t, RandRand>_BigO 2.26 N 2.26 N
BM_StdMidpoint<int16_t, RandRand>_RMS 1 % 1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144 614823 ns 614823 ns 1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO 2.33 N 2.33 N
BM_StdMidpoint<uint16_t, RandRand>_RMS 4 % 4 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288 1073181 ns 1073201 ns 650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s
BM_StdMidpoint<int8_t, RandRand>_BigO 2.05 N 2.05 N
BM_StdMidpoint<int8_t, RandRand>_RMS 1 % 1 %
BM_StdMidpoint<uint8_t, RandRand>/524288 1071010 ns 1071020 ns 653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO 2.05 N 2.05 N
BM_StdMidpoint<uint8_t, RandRand>_RMS 1 % 1 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072 300413 ns 300416 ns 2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO 2.29 N 2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS 2 % 2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536 169667 ns 169669 ns 4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO 2.59 N 2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS 3 % 3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144 591396 ns 591404 ns 1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO 2.26 N 2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS 1 % 1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288 1069421 ns 1069413 ns 655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO 2.04 N 2.04 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS 0 % 0 %
Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072 +0.0016 +0.0016 300398 300878 300404 300880
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072 -0.0007 -0.0007 300433 300231 300433 300226
<...>
BM_StdMidpoint<int64_t, RandRand>/65536 +0.0057 +0.0054 169857 170819 169858 170777
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536 +0.0114 +0.0114 169770 171705 169771 171708
<...>
BM_StdMidpoint<int16_t, RandRand>/262144 +0.0023 +0.0023 591169 592510 591179 592516
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144 +0.0398 +0.0398 591264 614823 591274 614823
<...>
BM_StdMidpoint<int8_t, RandRand>/524288 -0.6403 -0.6403 2983669 1073181 2983689 1073201
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288 -0.5986 -0.5986 2668398 1071010 2668419 1071020
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072 -0.0016 -0.0016 300887 300413 300887 300416
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536 +0.0002 +0.0002 169634 169667 169634 169669
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144 -0.0014 -0.0014 592252 591396 592255 591404
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288 +0.0832 +0.0832 987295 1069421 987309 1069413
```
What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are all performant, even the 8-bit case.
That is because there we are computing mid point between zero and some random number,
thus if the branch predictor is in use, it is in optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.
# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
which may mean that well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`,
`cmov` is up to +10% worse than well-predicted branch.
* However, i do not believe this is a concern. If the branch is well predicted, then the PGO
will also say that it is well predicted, and LLVM will happily expand cmov back into branch:
https://godbolt.org/z/P5ufig
# What about partial register stalls?
I'm not really able to answer that.
What i can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have branch)
in ~50% of cases you will have to pay branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td: let MispredictPenalty = 16;
X86SchedHaswell.td: let MispredictPenalty = 16;
X86SchedSandyBridge.td: let MispredictPenalty = 16;
X86SchedSkylakeClient.td: let MispredictPenalty = 14;
X86SchedSkylakeServer.td: let MispredictPenalty = 14;
X86ScheduleBdVer2.td: let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td: let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td: let MispredictPenalty = 10;
X86ScheduleZnver1.td: let MispredictPenalty = 17;
```
.. which it can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPU's.
For intel CPU's, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.
In short, i'd say this is an improvement, at least on this microbenchmark.
Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 | PR40965 ]].
Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic
Reviewed By: craig.topper, andreadb
Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists
Tags: #llvm, #libc
Differential Revision: https://reviews.llvm.org/D59035
llvm-svn: 356300
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use TargetConstant to save a conversion in the isel table.
The asm parser generates the immediate without the SAE bit. So for consistency we should generate the MCInst the same way from CodeGen.
Since they are now both the same, remove the masking from the printer and replace with an llvm_unreachable.
Use a target constant since we're rebuilding the node anyway. Then we don't have to have isel convert it. Saves about 500 bytes from the isel table.
llvm-svn: 356294
|
|
|
|
|
|
|
|
| |
bitcast(scalar_to_vector(i32 anyext(x)))
Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.
llvm-svn: 356292
|
|
|
|
|
|
|
|
| |
The existing lowering code is accidentally correct for unordered atomics as far as I can tell. An unordered atomic has no memory ordering, and simply requires the actual load or store to be done as a single well aligned instruction. As such, relax the restriction while adding tests to ensure the lowering remains correct in the future.
Differential Revision: https://reviews.llvm.org/D57803
llvm-svn: 356280
|
|
|
|
| |
llvm-svn: 356270
|
|
|
|
|
|
| |
Prep work for PR40203
llvm-svn: 356249
|
|
|
|
|
|
|
|
|
|
|
|
| |
use the rotr opcode.
These instructions used to use rotl with a bitwidth-1 immediate. I changed the immediate to 1,
but failed to change the opcode.
Thankfully this seems to have not caused a functional issue because we now had two rotl by 1 patterns,
but the correct ones were earlier and took priority. So we just missed some optimization.
llvm-svn: 356164
|
|
|
|
|
|
|
|
|
| |
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.
llvm-svn: 356158
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to the introduction of funnel shift intrinsics we could count on rotate
by immediates prefering to use rotl since that's what MatchRotate would check
first. The or+shift pattern doesn't have a direction so one must be chosen
arbitrarily.
With funnel shift, there is a direction and fshr will try to use rotr first.
While fshl will try to use rotl first.
This patch adds the isel patterns for rotr to complement the rotl patterns. I've
put the rotr by 1 patterns in the instruction patterns. And moved the rotl by
bitwidth-1 patterns to separate Pat patterns.
Fixes PR41057.
llvm-svn: 356121
|
|
|
|
|
|
|
|
|
|
|
|
| |
The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well.
We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass.
I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue.
Differential Revision: https://reviews.llvm.org/D59308
llvm-svn: 356078
|
|
|
|
|
|
| |
SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 356067
|
|
|
|
|
|
|
|
|
|
| |
Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.
This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.
The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.
llvm-svn: 356064
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700
The bug was introduced recently here:
rL355741
This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8).
llvm-svn: 356053
|
|
|
|
| |
llvm-svn: 356046
|
|
|
|
|
|
|
|
|
|
| |
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.
This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on.
Differential Revision: https://reviews.llvm.org/D58888
llvm-svn: 356043
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add break statements in Object/ELF.cpp since the code should consider the
generic tags for Hexagon, MIPS, and PPC. Add a test (copied from llvm-readobj)
to show that this works correctly (earlier versions of this patch would have
asserted).
The warnings in X86ELFObjectWriter.cpp are actually false-positives since
the nested switch() handles all possible values and returns in all cases.
Make this explicit by adding llvm_unreachable's.
Differential Revision: https://reviews.llvm.org/D58837
llvm-svn: 356037
|
|
|
|
|
|
| |
AAD will print without an immediate when the immediate is 10.
llvm-svn: 355997
|
|
|
|
|
|
| |
A faulting_op is one that has specified behavior when a fault occurs, generally redirecting control flow to another location. This change just adds a comment to the assembly output which makes it both human readable, and machine checkable w/o having to parse the FaultMap section. This is used to split a test file into two parts, so that I can (in a near future commit) easily extend the test file to demonstrate another case.
llvm-svn: 355982
|
|
|
|
| |
llvm-svn: 355955
|
|
|
|
|
|
|
|
| |
This makes SandyBridge inherit back to Westmere/Nehalem.
Make bdver1-4 inherit from each other and btver2 inherit from btver1.
llvm-svn: 355935
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
lists into a ProcessorFeatures class.
ProcFeatures was a class that just concatenated two feature lists together and gave it a name. We used it to inherit features between CPUs.
ProcModel took a two CPU feature lists and concatenated them before deferring to ProcessorModel. This was to allow inherited features and specific features to be passed to each CPU.
Both of these allowed for only very rigid CPU inheritance rules.
With this patch we now store all of the lists we were using for inheritance in one object and do any list oncatenation we want there. Then we just pass whatever list we want from this class into the ProcessorModel class for each CPU.
Hopefully this gives us more flexibility to build up feature lists in whatever ways we think make sense. Perhaps untangling ISA flags and tuning flags.
I've only touched the CPUs that were directly affected by the removal of the ProcModel and ProcFeatures classes. We should move more of the feature lists into ProcessorFeatures.
llvm-svn: 355872
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: It is incomplete and has no users AFAIK.
Reviewers: pcc, vitalybuka
Subscribers: srhines, kubamracek, mgorny, krytarowski, eraman, hiraditya, jdoerfert, #sanitizers, llvm-commits, thakis
Tags: #sanitizers, #llvm
Differential Revision: https://reviews.llvm.org/D59154
llvm-svn: 355870
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
AMDGPU target run out of Subtarget feature flags hitting the limit of 64.
AssemblerPredicates uses at most uint64_t for their representation.
At the same time CodeGen has exhausted this a long time ago and switched
to a FeatureBitset with the current limit of 192 bits.
This patch completes transition to the bitset for feature bits extending
it to asm matcher and MC code emitter.
Differential Revision: https://reviews.llvm.org/D59002
llvm-svn: 355839
|
|
|
|
| |
llvm-svn: 355810
|
|
|
|
|
|
|
|
| |
extractelement.
I suspect if this pattern was seen, DAG combine would just change the type of the store to eliminate the bitcast.
llvm-svn: 355809
|
|
|
|
|
|
| |
They mean the same thing, but 'HasAVX, NoAVX512' only appears in this one place. Every other place uses UseAVX.
llvm-svn: 355808
|
|
|
|
|
|
| |
After this we no longer need to match FROUND_CURRENT or FROUND_NO_EXC during isel so I remove those.
llvm-svn: 355807
|
|
|
|
| |
llvm-svn: 355806
|
|
|
|
|
|
| |
direction nodes. Remove rounding mode operand.
llvm-svn: 355805
|
|
|
|
|
|
|
|
| |
_SAE. Remove SAE operand.
No need to explicitly store it and match it during isel.
llvm-svn: 355804
|
|
|
|
| |
llvm-svn: 355803
|
|
|
|
|
|
|
|
| |
Split VFPROUNDS_RND/VFPEXT(S)_RND into versions without rounding operand.
For VFPEXT(S) we only need current rounding mode and an SAE version. Neither need extra operand.
llvm-svn: 355802
|
|
|
|
|
|
|
|
| |
_RND. Remove rounding operand.
The operand could only be the SAE encoding so no need to include it.
llvm-svn: 355801
|
|
|
|
|
|
|
|
| |
SAE ISD opcode.
Remove matching of FROUND_CURRENT and FROUND_NO_EXC for these nodes from isel table.
llvm-svn: 355800
|
|
|
|
|
|
|
|
|
|
| |
tables.
Instead I plan to have dedicated nodes for FROUND_CURRENT and FROUND_NO_EXC.
This patch starts with FADDS/FSUBS/FMULS/FDIVS/FMAXS/FMINS/FSQRTS.
llvm-svn: 355799
|
|
|
|
|
|
|
|
| |
We had patterns using X86ISD::SCALAR_SINT_TO_FP_RND/SCALAR_UINT_TO_FP_RND for
these instructions. There's nothing to round. Instead, we use a regular
sint_to_fp/uint_to_fp and a movsd as the pattern for these.
llvm-svn: 355796
|
|
|
|
|
|
| |
This would convert a signed 32-bit integer to double precision with rounding. But there's nothing to round.
llvm-svn: 355795
|
|
|
|
| |
llvm-svn: 355792
|
|
|
|
| |
llvm-svn: 355790
|
|
|
|
|
|
|
|
|
|
| |
valid rounding modes are lowered. Update tests accordingly
Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates.
With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure.
llvm-svn: 355789
|
|
|
|
|
|
| |
The code in here handles nodes with 6 or 7 operands. But only the 6 operand case is ever used these days.
llvm-svn: 355788
|