| Commit message | Author | Age | Files | Lines |
|
That comment is misleading given the ongoing discussion in the referenced
bug, so leave the discussion to the bug report. Also add a FIXME noting a
future change.
llvm-svn: 238653
|
best approach of each.
For vNi16, we use the SHL + ADD + SRL pattern, which seems clearly the best.
For vNi32, we use the PUNPCK + PSADBW + PACKUSWB pattern. In some cases
there is a huge improvement with this in IACA's estimated throughput --
over 2x higher throughput! -- but the measurements are too good to be
true. In one narrow case, the SHL + ADD + SHL + ADD + SRL pattern looks
slightly faster, but I'm not sure I believe any of the measurements at
this point. Both use the exact same uops though. Hard to be confident of
anything past that.
If anyone wants to collect very detailed (Agner-level) timings with the
result of this patch, or with the i32 case replaced with SHL + ADD + SHL
+ ADD + SRL, I'd be very interested. Note that you'll need to test it on
both Ivybridge and Haswell, with each of SSE3, SSSE3, and AVX selected,
as I saw unique behavior in each of these buckets with IACA, all of which
should be checked against measured performance.
But this patch is still a useful improvement by dropping duplicate work
and getting the much nicer PSADBW lowering for v2i64.
I'd still like to rephrase this in terms of generic horizontal sum. It's
a bit lame to have a special case of that just for popcount.
llvm-svn: 238652
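As an illustration of the vNi16 pattern referred to above -- a minimal sketch
in SSE2 intrinsics, not the code generated by this patch -- per-byte counts
can be folded into per-i16 counts with exactly the SHL + ADD + SRL sequence:

  #include <emmintrin.h>

  // cnt8 holds a population count (0..8) in each byte. Shift each 16-bit lane
  // left by 8 so the low byte's count lines up with the high byte's, add, then
  // shift right by 8 to leave the combined 16-bit count.
  __m128i bytes_to_i16_counts(__m128i cnt8) {
    __m128i shifted = _mm_slli_epi16(cnt8, 8);        // SHL
    __m128i summed  = _mm_add_epi8(cnt8, shifted);    // ADD
    return _mm_srli_epi16(summed, 8);                 // SRL
  }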
|
The plan was to move the whole table into the already existing ArchExtNames,
but some fields depend on a table-generated file, and we don't yet have that
facility on the generic lib/Support side.
Until the minimum set of target-specific table-generated files is available
in a generic fashion to these libraries, we'll have to keep it in the ASM
parser.
llvm-svn: 238651
|
lowering into a helper function.
NFC.
llvm-svn: 238650
|
The first named data member is the field used to default initialize the
union. An IndirectFieldDecl can introduce the first named data member
of a union.
llvm-svn: 238649
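A minimal sketch of the kind of code this handles (my illustration, not a test
from the patch): a member of an anonymous aggregate is modeled as an
IndirectFieldDecl, yet it can still be the union's first named data member and
therefore the field that default initialization goes through:

  union U {
    union { int x; };  // anonymous union: 'x' is introduced into U as an
                       // IndirectFieldDecl, and is U's first named data member
    float f;
  };
  U u = {};            // initialized through 'x', per the rule described above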
|
typeIsConvertibleTo was just calling baseClassOf(this) on the argument
passed to it, but there weren't different signatures for baseClassOf, so
passing 'this' didn't really do anything interesting. typeIsConvertibleTo
could have just been a non-virtual method in RecTy. But since that would
be kind of a silly method, I instead re-distributed the logic from
baseClassOf into typeIsConvertibleTo.
llvm-svn: 238648
|
llvm-svn: 238647
|
the conversions in convertInitializerTo directly. This saves a bunch of vtable entries. NFC
llvm-svn: 238646
|
llvm-svn: 238645
|
llvm-svn: 238644
|
Running indvar before Polly is useful as it eliminates the zexts that commonly
appear when a 32-bit induction variable (type int) is used on a 64-bit system.
These zexts confuse our delinearization and prevent, for example, the
successful delinearization of the nussinov kernel in polybench-c-4.1.
This fixes http://llvm.org/PR23426
Suggested-by: Xing Su <xsu.llvm@outlook.com>
llvm-svn: 238643
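A sketch of the kind of source this affects (my illustration, assuming a
simple loop nest rather than the actual nussinov kernel): 32-bit induction
variables indexing a large array on a 64-bit target force sign/zero
extensions into the subscript computation, which indvars removes by widening
the induction variables to 64 bits:

  // With 'int' induction variables on a 64-bit target, addressing A[i][j]
  // extends i and j to 64 bits; those extensions are what block the
  // delinearization until the induction variables are widened.
  void kernel(int n, double A[][1024]) {
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j)
        A[i][j] += 1.0;
  }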
|
helper that skips creating a cast when it isn't necessary.
It's really somewhat concerning that this was caused by the presence
of a no-op bitcast, but...
llvm-svn: 238642
|
.safeseh adds an entry to the .sxdata section to register all the
appropriate functions which may handle an exception. This entry is not
a relocation to the symbol but instead the symbol table index of the
function.
llvm-svn: 238641
|
shorter one. NFC.
In addition to being much shorter to type and requiring fewer arguments,
this change saves over 30 lines from this one file, all wasted on total
boilerplate...
llvm-svn: 238640
|
spelling. NFC.
llvm-svn: 238639
|
around a value using its existing SDLoc.
Start using this in just one function to save omg lines of code.
llvm-svn: 238638
|
shifting vectors of bytes as x86 doesn't have direct support for that.
This removes a bunch of redundant masking in the generated code for SSE2
and SSE3.
In order to avoid the really significant code size growth this would
have triggered, I also factored the completely repetitive logic for
shifting and masking into two lambdas, which in turn makes all of this
much easier to read IMO.
llvm-svn: 238637
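For reference, the underlying trick being lowered here -- a hedged sketch in
SSE2 intrinsics, not the patch's code -- is to shift the 16-bit lanes and then
mask away the bits that leaked across byte boundaries; that masking is the
part that was previously being emitted redundantly:

  #include <emmintrin.h>

  // Logical left shift of every byte by a constant s in [0, 7]. SSE2 has no
  // byte-granularity shift, so shift 16-bit lanes and clear the bits that
  // crossed over from the lower byte of each lane.
  __m128i shl_v16i8(__m128i v, int s) {
    __m128i wide = _mm_slli_epi16(v, s);
    __m128i mask = _mm_set1_epi8((char)(0xFFu << s));
    return _mm_and_si128(wide, mask);
  }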
|
in-register LUT technique.
Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html
The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.
On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
The base implementation of this, and most of the work, was done by Bruno
in a follow up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.
I have extended Bruno's patch in the following ways:
0) I committed the new tests with baseline sequences so this shows
a diff, and regenerated the tests using the update scripts.
1) Bruno had noticed and mentioned in IRC a redundant mask that
I removed.
2) I introduced a particular optimization for the i32 vector cases where
we use PSHL + PSADBW to compute the low i32 popcounts, and PSHUFD
+ PSADBW to compute doubled high i32 popcounts. This takes advantage
of the fact that to line up the high i32 popcounts we have to shift
them anyways, and we can shift them by one fewer bit to effectively
divide the count by two. While the PSHUFD based horizontal add is no
faster, it doesn't require registers or load traffic the way a mask
would, and provides more ILP as it happens on different ports with
high throughput.
3) I did some code cleanups throughout to simplify the implementation
logic.
4) I refactored it to continue to use the parallel bitmath lowering when
SSSE3 is not available to preserve the performance of that version on
SSE2 targets where it is still much better than scalarizing as we'll
still do a bitmath implementation of popcount even in scalar code
there.
With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.
With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.
Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.
Differential Revision: http://reviews.llvm.org/D10084
llvm-svn: 238636
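To make the core of the technique concrete, here is a hedged sketch in SSSE3
intrinsics (my illustration, not the committed lowering): PSHUFB looks up the
popcount of each low and high nibble in an in-register table, the per-byte
count is the sum of the two, and a single PSADBW against zero then yields the
v2i64 counts:

  #include <tmmintrin.h>

  // Per-byte population count via an in-register nibble LUT and PSHUFB.
  __m128i popcnt_epi8(__m128i v) {
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);  // popcount(0..15)
    const __m128i low_nibbles = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(v, low_nibbles);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_nibbles);
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),               // PSHUFB
                        _mm_shuffle_epi8(lut, hi));
  }
  // A single PSADBW against zero then sums the bytes of each 64-bit half,
  // giving the v2i64 population counts directly:
  //   __m128i cnt64 = _mm_sad_epu8(popcnt_epi8(v), _mm_setzero_si128());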
|
a separate routine, generalize it to work for all the integer vector
sizes, and do general code cleanups.
This dramatically improves lowerings of byte and short element vector
popcount, but more importantly it will make the introduction of the
LUT-approach much cleaner.
The biggest cleanup I've done is to just force the legalizer to do the
bitcasting we need. We run these iteratively now and it makes the code
much simpler IMO. Other changes were minor, and mostly naming and
splitting things up in a way that makes it more clear what is going on.
The other significant change is to use a different final horizontal sum
approach. This is the same number of instructions as the old method, but
shifts left instead of right so that we can clear everything but the
final sum with a single shift right. This seems likely better than
a mask, which will usually have to be loaded from memory. It is
certainly fewer uops. Also, this will be temporary. This and the LUT
approach share the need for horizontal adds to finish the computation,
and we have more clever approaches than this one that I'll switch over
to.
llvm-svn: 238635
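For context, the parallel bitmath lowering referred to here is the classic
bit-twiddling popcount; a hedged scalar sketch of the same arithmetic (my
illustration, not the vector code in the patch):

  #include <cstdint>

  // Classic parallel popcount: form 2-bit, then 4-bit, then 8-bit partial
  // sums, then horizontally add the four byte sums with a multiply and shift.
  uint32_t popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (v * 0x01010101u) >> 24;
  }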
|
llvm-svn: 238634
|
It's reachable from user input.
Bug found with AFL fuzz.
llvm-svn: 238633
|
r238503 fixed the problem of too-small shift types by promoting them
during legalization, but the correct solution is to promote only the
operands that actually demand promotion.
This fixes a crash on an out-of-tree target caused by trying to
promote an operand that can't be promoted.
llvm-svn: 238632
|
llvm-svn: 238631
|
llvm-svn: 238630
|
llvm-svn: 238629
|
for the standalone configuration.
llvm-svn: 238628
|
It turns out that _except_handler3 and _except_handler4 really use the
same stack allocation layout, at least today. They just make different
choices about encoding the LSDA.
This is in preparation for lowering llvm.eh.exceptioninfo().
llvm-svn: 238627
|
llvm-svn: 238626
|
We catch most of the various other __fp16 implicit conversions to
float, but not this one:
  __fp16 a;
  int i;
  ...
  a += i;
For which we used to generate something 'fun' like:
  %conv = sitofp i32 %i to float
  %1 = tail call i16 @llvm.convert.to.fp16.f32(float %conv)
  %add = add i16 %0, %1
Instead, when we have an __fp16 LHS and an integer RHS, we should
use float as the result type.
While there, add a bunch of missing tests for mixed
__fp16/integer expressions.
llvm-svn: 238625
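In source-level terms, the intended semantics are roughly the following (my
paraphrase of the rule stated above, not code from the patch):

  __fp16 a;
  int i;
  // With an __fp16 LHS and an integer RHS, the compound assignment should
  // compute in float and convert back, i.e. behave as if written:
  a = (__fp16)((float)a + (float)i);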
|
llvm::sys::path::is_relative in PlatformAndroid::GetFile.
http://reviews.llvm.org/D10141
llvm-svn: 238624
|
http://reviews.llvm.org/D10080
llvm-svn: 238623
|
llvm-svn: 238622
|
This is done by creating a named shared memory region, unlinking it
and setting up a private (i.e. copy-on-write) mapping of that instead
of a regular anonymous mapping. I've experimented with regular
(sparse) files, but they cannot be scaled to the size of the MSan shadow
mapping, at least on Linux/X86_64 and ext3 fs.
Controlled by a common flag, decorate_proc_maps, disabled by default.
This patch has a few shortcomings:
* not all mappings are annotated, especially in TSan.
* our handling of memset() of shadow via mmap() puts small anonymous
mappings inside larger named mappings, which looks ugly and can, in
theory, hit the mapping number limit.
llvm-svn: 238621
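A hedged sketch of the underlying trick (illustrative only; the function name
and error handling are mine, not the sanitizer runtime's): create a named
shared memory object, unlink it so only the name survives in /proc/self/maps,
and map it MAP_PRIVATE so the pages stay copy-on-write like an anonymous
mapping:

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  // Returns a copy-on-write mapping that appears in /proc/self/maps under
  // /dev/shm/<name> (deleted), or nullptr on failure.
  void *MapDecorated(size_t size, const char *name) {
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) return nullptr;
    shm_unlink(name);                     // drop the name; the mapping keeps it
    if (ftruncate(fd, size) != 0) { close(fd); return nullptr; }
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    return p == MAP_FAILED ? nullptr : p;
  }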
|
Summary:
These intrinsics should take a generic input address space and output a
non-generic address space.
Test Plan: no
Reviewers: jholewinski, eliben
Reviewed By: eliben
Subscribers: eliben, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D10132
llvm-svn: 238620
|
The value in 'ebp' acts as an implicit argument to the outlined
handlers, and is recovered with frameaddress(1).
llvm-svn: 238619
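In source-level terms (a hedged illustration, not the generated code), the
outlined handler recovers that implicit argument with the frame-address
intrinsic one level up:

  // Inside the outlined handler: the parent function's frame pointer, i.e.
  // the 'ebp' value described above, via llvm.frameaddress(1).
  void *parent_frame = __builtin_frame_address(1);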
|
The new mechanism is less code, and fixes the case where all inputs
are archives.
Differential Revision: http://reviews.llvm.org/D10136
llvm-svn: 238618
|
This completes the mechanical part of merging MCSymbol and MCSymbolData.
llvm-svn: 238617
|
We need to apply to FreeBSD a change equivalent to r238549.
llvm.org/pr23699
llvm-svn: 238616
|
We need to apply to FreeBSD a change equivalent to r238549.
llvm.org/pr23699
llvm-svn: 238615
|
We were getting "#define __ARM_ARCH_7 -S__ 1" which is really not a good idea.
llvm-svn: 238614
|
llvm-svn: 238612
|
The getData member function is next.
llvm-svn: 238611
|
llvm-svn: 238610
|
As a transition hack leave MCSymbolData as a typedef of MCSymbol. I will be
removing that in a second.
llvm-svn: 238609
|
llvm-svn: 238608
|
Another step in merging MCSymbol and MCSymbolData.
llvm-svn: 238607
|
concatenate it with the current working directory if needed.
llvm-svn: 238606
|
Summary: Depends on D9728.
Reviewers: ovyalov, zturner, clayborg
Reviewed By: clayborg
Subscribers: lldb-commits
Differential Revision: http://reviews.llvm.org/D9806
llvm-svn: 238605
|
Summary:
This should solve the issue of sending denormalized paths over gdb-remote
if we stick to GetPath(false) in GDBRemoteCommunicationClient, and let the
server handle any denormalization.
Reviewers: ovyalov, zturner, vharron, clayborg
Reviewed By: clayborg
Subscribers: tberghammer, emaste, lldb-commits
Differential Revision: http://reviews.llvm.org/D9728
llvm-svn: 238604
|
Skip the g++ 4.6 test if we're not going to build any C++ source.
If a test has C++ source files we automatically determine which C++
compiler to use based on $CC (for example, clang++ if CC=clang).
However, this is not done for tests without C++ source, so CXX will
be GNU make's default of g++. This produces spurious "g++: not found"
errors in test run output on systems without gcc/g++.
Differential Revision: http://reviews.llvm.org/D10122
llvm-svn: 238603
|