| Commit message (Collapse) | Author | Age | Files | Lines |
| ... | |
| |
|
|
|
|
|
| |
The PPC GetSymbolFromOperand already prefixed stubs of MO_ExternalSymbol, so
this should be a nop.
llvm-svn: 196059
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a scheduling model for the POWER7 (P7) core, and enables the
machine-instruction scheduler when targeting the P7. Scheduling for the P7,
like earlier ooo PPC cores, requires considering both dispatch group hazards,
and functional unit resources and latencies. These are both modeled in a
combined itinerary. Dispatch group formation is still handled by the post-RA
scheduler (which still needs to be updated for the P7, but nevertheless does a
pretty good job).
One interesting aspect of this change is that I've also enabled to use of AA
duing CodeGen for the P7 (just as it is for the embedded cores). The benchmark
results seem to support this decision (see below), and while this is normally
useful for in-order cores, and not for ooo cores like the P7, I think that the
dispatch slot hazards are enough like in-order resources to make the AA useful.
Test suite significant performance differences (where negative is a speedup,
and positive is a regression) vs. the current situation:
MultiSource/Benchmarks/BitBench/drop3/drop3
with AA: N/A
without AA: -28.7614% +/- 19.8356%
(significantly against AA)
MultiSource/Benchmarks/FreeBench/neural/neural
with AA: -17.7406% +/- 11.2712%
without AA: N/A
(significantly in favor of AA)
MultiSource/Benchmarks/SciMark2-C/scimark2
with AA: -11.2079% +/- 1.80543%
without AA: -11.3263% +/- 2.79651%
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt
with AA: -41.8649% +/- 17.0053%
without AA: -34.5256% +/- 23.7072%
MultiSource/Benchmarks/mafft/pairlocalalign
with AA: 25.3016% +/- 17.8614%
without AA: 38.6629% +/- 14.9391%
(significantly in favor of AA)
MultiSource/Benchmarks/sim/sim
with AA: N/A
without AA: 13.4844% +/- 7.18195%
(significantly in favor of AA)
SingleSource/Benchmarks/BenchmarkGame/Large/fasta
with AA: 15.0664% +/- 6.70216%
without AA: 12.7747% +/- 8.43043%
SingleSource/Benchmarks/BenchmarkGame/puzzle
with AA: 82.2713% +/- 26.3567%
without AA: 75.7525% +/- 41.1842%
SingleSource/Benchmarks/Misc/flops-2
with AA: -37.1621% +/- 20.7964%
without AA: -35.2342% +/- 20.2999%
(significantly in favor of AA)
These are 99.5% confidence intervals from 5 runs per configuration. Regarding
the choice to turn on AA during CodeGen, of these results, four seem
significantly in favor of using AA, and one seems significantly against. I'm
not making this decision based on these numbers alone, but these results
seem consistent with results I have from other tests, and so I think that, on
balance, using AA is a win.
llvm-svn: 195981
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
In preparation for adding scheduling definitions for the POWER7, split some PPC
itinerary classes so that the P7's latencies and hazards can be better
described. For the most part, this means differentiating indexed from non-index
pre-increment loads and stores. Also, differentiate single from
double-precision sqrt.
No functionality change intended (except for a more-specific latency for
single-precision sqrt on the A2).
llvm-svn: 195980
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On the PPC A2, instructions are only issued after their input operands are
ready. Model this by specifying that input operands are read at dispatch (0
cycles after issue). This changes all input operand latencies from 1 to 0.
Significant test-suite performance changes (these are 99.5% confidence
intervals on 6 runs for both before and after):
speedups:
MultiSource/Benchmarks/sim/sim
-1.21915% +/- 0.175063%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt
-1.23946% +/- 1.05133%
SingleSource/Benchmarks/Misc/flops-2
-1.24237% +/- 0.681362%
MultiSource/Applications/JM/lencod/lencod
-1.33992% +/- 0.757498%
MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt
-1.51802% +/- 1.21468%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt
-2.18818% +/- 1.28605%
MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt
-2.21977% +/- 1.19499%
SingleSource/Benchmarks/BenchmarkGame/spectral-norm
-2.29822% +/- 0.671871%
MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl
-2.40975% +/- 0.355931%
SingleSource/Benchmarks/Misc/fp-convert
-2.41899% +/- 1.04751%
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl
-2.50349% +/- 0.126765%
SingleSource/Benchmarks/Misc/flops-3
-3.00214% +/- 0.700795%
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt
-3.56995% +/- 3.2929%
MultiSource/Applications/sgefa/sgefa
-4.24908% +/- 2.00413%
MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk
-18.1294% +/- 3.96489%
regressions:
MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl
1.03249% +/- 0.178547%
MultiSource/Applications/hexxagon/hexxagon
1.16597% +/- 0.285235%
MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt
1.39576% +/- 1.07855%
SingleSource/Benchmarks/Misc-C++/stepanov_v1p2
1.71539% +/- 0.173182%
MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1
1.90013% +/- 0.866472%
MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl
2.39854% +/- 1.05914%
MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl
2.4402% +/- 0.817904%
MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl
5.87997% +/- 3.3172%
MultiSource/Benchmarks/Trimaran/netbench-crc/netbench-crc
9.02643% +/- 5.79591%
MultiSource/Benchmarks/VersaBench/bmm/bmm
10.3517% +/- 1.227%
Obviously, there are data points on both sides of this; but I think, overall,
this supports making the change.
llvm-svn: 195951
|
| |
|
|
|
|
|
| |
Some of the older PPC processor definitions don't have associated
SchedMachineModels; correct this for the PPC440.
llvm-svn: 195949
|
| |
|
|
|
|
|
|
| |
The operand latencies for loads and stores in the PPC440 itinerary were wrong
(the store operands are all inputs, and the "with update" (pre-increment)
instructions need a latency for the additional output).
llvm-svn: 195948
|
| |
|
|
|
|
|
|
|
|
|
|
| |
The operand latencies for the PPC440 should be specified relative to dispatch,
not relative to the initial fetch-and-decode stages. Because most instructions
(ignoring bypass) wait in dispatch until their operands are ready, this is
modeled as reading input operands "at dispatch" (0 cycles after issue), and so
every input and output operand has 4 cycles subtracted from it.
This could alter scheduling slightly, but I don't expect a large effect.
llvm-svn: 195947
|
| |
|
|
|
|
|
|
|
|
| |
Modeling the fetch and decode units in the PPC440 itinerary does not add
anything to the hazard detection capability (and so modeling them just wastes
compile time).
No functionality change intended.
llvm-svn: 195946
|
| |
|
|
|
|
|
|
|
|
| |
I think, in principle, intrinsics_gen may be added explicitly.
That said, it can be added incidentally, since each target already has dependencies to llvm-tblgen.
Almost all source files depend on both CommonTaleGen and intrinsics_gen.
Explicit add_dependencies() have been pruned under lib/Target.
llvm-svn: 195929
|
| |
|
|
|
|
|
|
|
| |
CommonTableGen.
add_public_tablegen_target adds *CommonTableGen to LLVM_COMMON_DEPENDS.
LLVM_COMMON_DEPENDS affects add_llvm_library (and other add_target stuff) within its scope.
llvm-svn: 195927
|
| |
|
|
|
|
| |
sets them.
llvm-svn: 195921
|
| |
|
|
| |
llvm-svn: 195911
|
| |
|
|
|
|
|
|
|
|
|
| |
Instead of sharing functional unit names between the various PPC itineraries,
give each core its own unit names prefixed with the core name. This follows
the convention used by other backends (such as ARM), and removes a non-obvious
ordering dependency between the various PPCSchedule*.td files.
No functionality change intended.
llvm-svn: 195908
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the IIC_ prefix to the instruction itinerary class names, giving the
PPC backend a naming convention for itinerary classes that is more consistent
with that used by the X86 and ARM backends.
Instruction scheduling in the PPC backend needs a bunch of cleanup and
improvement (especially for the ooo cores). This is just a preliminary step.
No functionality change intended.
llvm-svn: 195890
|
| |
|
|
| |
llvm-svn: 195884
|
| |
|
|
| |
llvm-svn: 195807
|
| |
|
|
|
|
|
| |
The instruction definitions incorrectly specified that popcntd and popcntw have
record forms; they do not. This mistake was causing invalid code generation.
llvm-svn: 195272
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Masking operations (where only some number of the low bits are being kept) are
selected to rldicl(x, 0, mb). If x is a logical right shift (which would become
rldicl(y, 64-n, n)), we might be able to fold the two instructions together:
rldicl(rldicl(x, 64-n, n), 0, mb) -> rldicl(x, 64-n, mb) for n <= mb
The right shift is really a left rotate followed by a mask, and if the explicit
mask is a more-restrictive sub-mask of the mask implied by the shift, only one
rldicl is needed.
llvm-svn: 195185
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This patch removes most of the trivial cases of weak vtables by pinning them to
a single object file. The memory leaks in this version have been fixed. Thanks
Alexey for pointing them out.
Differential Revision: http://llvm-reviews.chandlerc.com/D2068
Reviewed by Andy
llvm-svn: 195064
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This change is incorrect. If you delete virtual destructor of both a base class
and a subclass, then the following code:
Base *foo = new Child();
delete foo;
will not cause the destructor for members of Child class. As a result, I observe
plently of memory leaks. Notable examples I investigated are:
ObjectBuffer and ObjectBufferStream, AttributeImpl and StringSAttributeImpl.
llvm-svn: 194997
|
| |
|
|
|
|
|
|
|
|
|
| |
This patch removes most of the trivial cases of weak vtables by pinning them to
a single object file.
Differential Revision: http://llvm-reviews.chandlerc.com/D2068
Reviewed by Andy
llvm-svn: 194865
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Stop folding constant adds into GEP when the type size doesn't match.
Otherwise, the adds' operands are effectively being promoted, changing the
conditions of an overflow. Results are different when:
sext(a) + sext(b) != sext(a + b)
Problem originally found on x86-64, but also fixed issues with ARM and PPC,
which used similar code.
<rdar://problem/15292280>
Patch by Duncan Exon Smith!
llvm-svn: 194840
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On non-Darwin PPC systems, we currently strip off the register name prefix
prior to instruction printing. So instead of something like this:
mr r3, r4
we print this:
mr 3, 4
The first form is the default on Darwin, and is understood by binutils, but not
yet understood by our integrated assembler. Once our integrated-as understands
full register names as well, this temporary option will be replaced by tying
this functionality to the verbose-asm option. The numeric-only form is
compatible with legacy assemblers and tools, and is also gcc's default on most
PPC systems. On the other hand, it is harder to read, and there are some
analysis tools that expect full register names.
llvm-svn: 194384
|
| |
|
|
| |
llvm-svn: 193796
|
| |
|
|
| |
llvm-svn: 193627
|
| |
|
|
|
|
| |
as we don't actually use it to emit any code yet.
llvm-svn: 192837
|
| |
|
|
|
|
| |
We had a MCAsmInfoCOFF, but no common class for all the ELF MCAsmInfos before.
llvm-svn: 192760
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This patch fixes an old FIXME by creating a MCTargetStreamer interface
and moving the target specific functions for ARM, Mips and PPC to it.
The ARM streamer is still declared in a common place because it is
used from lib/CodeGen/ARMException.cpp, but the Mips and PPC are
completely hidden in the corresponding Target directories.
I will send an email to llvmdev with instructions on how to use this.
llvm-svn: 192181
|
| |
|
|
|
|
| |
They haven't been used for a long time. Patch by MathOnNapkins.
llvm-svn: 192099
|
| |
|
|
|
|
|
|
| |
When generating code for shared libraries, even local calls may be
intercepted, so we need a nop after the call for the linker to fix up the
TOC. Test case adapted from the one provided in PR17354.
llvm-svn: 191440
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When asked to pad an irregular number of bytes, we should fill with
zeros. This is consistent with the behavior specified in the AIX
Assembler Language Reference as well as other LLVM and binutils
assemblers.
N.B. There is a small deviation from binutils' PPC assembler:
when handling pads which are greater than 4 bytes but not mod 4,
binutils will not emit any NOP sequences at all and only use zeros.
This may or may not be a bug but there is no excellent rationale as to
why that behavior is important to emulate. If that behavior is needed,
we can change writeNopData() to behave in the same way.
This fixes PR17352.
llvm-svn: 191426
|
| |
|
|
| |
llvm-svn: 191421
|
| |
|
|
|
|
|
|
|
| |
Encodings were checked against the Power ISA documents and double
checked against binutils.
This fixes PR17350.
llvm-svn: 191419
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The binutils assembler supports a mode called DOLLAR_DOT which treats
the dollar sign token as a reference to the current program counter if
the dollar sign doesn't precede a constant or identifier.
This commit adds a new MCAsmInfo flag stating whether or not a given
target supports this interpretation of the dollar sign token; by
default, this flag is not enabled.
Further, enable this flag for PPC. The system assembler for AIX and
binutils both support using the dollar sign in this manner.
This fixes PR17353.
llvm-svn: 191368
|
| |
|
|
| |
llvm-svn: 191362
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, the DAGISel function WalkChainUsers was spotting that it
had entered already-selected territory by whether a node was a
MachineNode (amongst other things). Since it's fairly common practice
to insert MachineNodes during ISelLowering, this was not the correct
check.
Looking around, it seems that other nodes get their NodeId set to -1
upon selection, so this makes sure the same thing happens to all
MachineNodes and uses that characteristic to determine whether we
should stop looking for a loop during selection.
This should fix PR15840.
llvm-svn: 191165
|
| |
|
|
|
|
|
|
| |
Pre-increment loads are microcoded on the A2, and the address increment occurs
only after the load completes. As a result, the latency of the GPR address
update is an additional 2 cycles on top of the load latency.
llvm-svn: 191156
|
| |
|
|
|
|
|
|
| |
Documenting a design choice to generate only medium model sequences for TLS
addresses at this time. Small and large code models could be supported if
necessary.
llvm-svn: 190883
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Large code model on PPC64 requires creating and referencing TOC entries when
using the addis/ld form of addressing. This was not being done in all cases.
The changes in this patch to PPCAsmPrinter::EmitInstruction() fix this. Two
test cases are also modified to reflect this requirement.
Fast-isel was not creating correct code for loading floating-point constants
using large code model. This also requires the addis/ld form of addressing.
Previously we were using the addis/lfd shortcut which is only applicable to
medium code model. One test case is modified to reflect this requirement.
llvm-svn: 190882
|
| |
|
|
|
|
|
|
|
| |
Fast-isel generates a COPY_TO_REGCLASS for widening f32 to f64, which
is a nop on PPC64. This is needed to keep the register class system
happy, but on the fast-isel path it is not removed before emit as it
is for DAG select. Ignore this op when emitting instructions.
llvm-svn: 190795
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a re-commit of r190764, with an extra check to make sure that we're not
performing the transformation on illegal types (a small test case has been
added for this as well).
Original commit message:
The PPC backend uses a target-specific DAG combine to turn unaligned Altivec
loads into a permutation-based sequence when possible. Unfortunately, the
target-specific DAG combine is not always called on all loads of interest
(sometimes the routines in DAGCombine call CombineTo such that the new node and
users are not added to the worklist); allowing the combine to trigger early
(before type legalization) mitigates this problem. Because the autovectorizers
only create legal vector types, I don't expect a lot of cases where this
optimization is enabled by type legalization in practice.
llvm-svn: 190771
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is causing test-suite failures.
Original commit message:
The PPC backend uses a target-specific DAG combine to turn unaligned Altivec
loads into a permutation-based sequence when possible. Unfortunately, the
target-specific DAG combine is not always called on all loads of interest
(sometimes the routines in DAGCombine call CombineTo such that the new node and
users are not added to the worklist); allowing the combine to trigger early
(before type legalization) mitigates this problem. Because the autovectorizers
only create legal vector types, I don't expect a lot of cases where this
optimization is enabled by type legalization in practice.
llvm-svn: 190765
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
The PPC backend uses a target-specific DAG combine to turn unaligned Altivec
loads into a permutation-based sequence when possible. Unfortunately, the
target-specific DAG combine is not always called on all loads of interest
(sometimes the routines in DAGCombine call CombineTo such that the new node and
users are not added to the worklist); allowing the combine to trigger early
(before type legalization) mitigates this problem. Because the autovectorizers
only create legal vector types, I don't expect a lot of cases where this
optimization is enabled by type legalization in practice.
llvm-svn: 190764
|
| |
|
|
|
|
| |
As it turns out, not a problem in practice, but it should be there.
llvm-svn: 190720
|
| |
|
|
| |
llvm-svn: 190640
|
| |
|
|
|
|
|
|
|
|
| |
When a structure is passed by value, and that structure contains a vector
member, according to the PPC ABI, the structure will receive enhanced alignment
(so that the vector within the structure will always be aligned).
This should resolve PR16641.
llvm-svn: 190636
|
| |
|
|
|
|
|
|
|
|
|
| |
In fast-math mode sqrt(x) is calculated using the fast expansion of the
reciprocal of the reciprocal sqrt expansion. The reciprocal and reciprocal
sqrt expansions use the associated estimate instructions along with some Newton
iterations. Unfortunately, as a result, sqrt(0) was being calculated as NaN,
which is not correct. Now we explicitly return a result of zero if the input is
zero.
llvm-svn: 190624
|
| |
|
|
|
|
| |
FreeBSD kernel.
llvm-svn: 190618
|
| |
|
|
|
|
|
|
| |
Use the new instruction deprecation feature to mark mftb (now replaced with
mfspr) and dst (along with the other Altivec cache control instructions) as
deprecated when targeting cores supporting at least ISA v2.03.
llvm-svn: 190605
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The 'Deprecated' class allows you to specify a SubtargetFeature that the
instruction is deprecated on.
The 'ComplexDeprecationPredicate' class allows you to define a custom
predicate that is called to check for deprecation.
For example:
ComplexDeprecationPredicate<"MCR">
would mean you would have to define the following function:
bool getMCRDeprecationInfo(MCInst &MI, MCSubtargetInfo &STI,
std::string &Info)
Which returns 'false' for not deprecated, and 'true' for deprecated
and store the warning message in 'Info'.
The MCTargetAsmParser constructor was chaned to take an extra argument of
the MCInstrInfo class, so out-of-tree targets will need to be changed.
llvm-svn: 190598
|