...
| 
allowing us to distinguish the encodings that use shifted registers from those that use shifted immediates. This is necessary to allow the fixed-length decoder to distinguish things like BICS vs. LDRH.
llvm-svn: 135693
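To illustrate the decoding problem this describes: once two instructions share a fixed-length bit pattern, the decoder has to branch on the discriminating field before committing to an opcode. Everything in this sketch (the field position, the helper name) is invented for the example and is not the actual ARM encoding or LLVM disassembler code.

```cpp
// Invented illustration, not the real ARM encoding: BICS and LDRH can only
// be told apart by whether the operand is a shifted register or a shifted
// immediate, so the decoder must inspect that field explicitly.
#include <cstdint>

enum class Opcode { BICS, LDRH };

Opcode decodeOverlappingPattern(uint32_t Insn) {
  // Hypothetical discriminator: assume bit 4 selects the register-shifted
  // operand form (field position chosen only for the sketch).
  bool ShiftedRegisterForm = (Insn >> 4) & 1;
  return ShiftedRegisterForm ? Opcode::BICS : Opcode::LDRH;
}
```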
| 
ARM MC code from target.
llvm-svn: 135636
| 
sink them into MC layer.
- Added MCInstrInfo, which captures the tablegen generated static data. Changed
TargetInstrInfo so it's based off MCInstrInfo.
llvm-svn: 134021
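A minimal sketch of the layering this change introduces, using LLVM's class names but heavily simplified (the real InitMCInstrInfo takes additional tables): the tablegen-generated per-opcode static data lives in MCInstrInfo, and TargetInstrInfo inherits from it to add codegen-level hooks.

```cpp
// Simplified sketch of the MCInstrInfo / TargetInstrInfo split.
class MCInstrDesc; // tablegen-generated static data for one opcode

class MCInstrInfo {
  const MCInstrDesc *Desc = nullptr; // opcode-indexed table
  unsigned NumOpcodes = 0;

public:
  // Called with the tablegen-generated table (real signature has more args).
  void InitMCInstrInfo(const MCInstrDesc *D, unsigned N) {
    Desc = D;
    NumOpcodes = N;
  }
  const MCInstrDesc &get(unsigned Opcode) const { return Desc[Opcode]; }
};

// Codegen-level queries now sit on top of the MC-level static data, so the
// MC layer no longer depends on codegen.
class TargetInstrInfo : public MCInstrInfo {
  // ...virtual hooks that need MachineFunction-level context...
};
```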
| 
first operand. This operand is lowered away by the time we reach MachineInstrs, so the actual register-allocation handling of them doesn't need to change.
This is intended to support using REG_SEQUENCE SDNodes with type MVT::untyped, and is part of the long road to eliminating some of the hacks we currently use to support register pairs and other strange constraints, particularly on ARM NEON.
llvm-svn: 133178
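For concreteness, a sketch of building such a node, modeled on how the ARM backend pairs D registers (ARM::QPRRegClassID, ARM::dsub_0/dsub_1, and TargetOpcode::REG_SEQUENCE are real LLVM identifiers; the helper itself is illustrative):

```cpp
// Sketch: a REG_SEQUENCE machine node whose first operand is the target
// register class, producing an MVT::untyped value for a register pair.
SDNode *createDRegPair(SelectionDAG &DAG, const SDLoc &DL, SDValue V0,
                       SDValue V1) {
  SDValue RC = DAG.getTargetConstant(ARM::QPRRegClassID, DL, MVT::i32);
  SDValue Sub0 = DAG.getTargetConstant(ARM::dsub_0, DL, MVT::i32);
  SDValue Sub1 = DAG.getTargetConstant(ARM::dsub_1, DL, MVT::i32);
  const SDValue Ops[] = {RC, V0, Sub0, V1, Sub1};
  // MVT::untyped: the pair has no single legal value type of its own.
  return DAG.getMachineNode(TargetOpcode::REG_SEQUENCE, DL, MVT::untyped, Ops);
}
```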
| 
to load/store i64 values. Since there's no current support to explicitly
declare such restrictions, implement it by using specific hardcoded register
pairs during isel.
llvm-svn: 132248
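A hypothetical sketch of what "hardcoded register pairs during isel" can look like with SelectionDAG APIs; the pair choice (R2/R3) and the helper are illustrative, not the registers this commit actually picked:

```cpp
// Pin the two halves of an i64 into a fixed register pair; the glue edge
// keeps the copies adjacent so the consuming instruction sees the pair
// intact. R2/R3 are an illustrative choice.
std::pair<SDValue, SDValue> pinI64ToRegPair(SelectionDAG &DAG, const SDLoc &DL,
                                            SDValue Chain, SDValue Lo,
                                            SDValue Hi) {
  Chain = DAG.getCopyToReg(Chain, DL, ARM::R2, Lo, SDValue());
  SDValue Glue = Chain.getValue(1);
  Chain = DAG.getCopyToReg(Chain, DL, ARM::R3, Hi, Glue);
  return {Chain, Chain.getValue(1)}; // chain and glue for the user node
}
```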
| 
llvm-svn: 130557
| 
Making use of VFP / NEON floating point multiply-accumulate / subtraction is
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has latency that is one cycle shorter than a pair
   of vmul + vadd, a RAW hazard during the first few cycles (4 on Cortex-A8?)
   can cause an additional pipeline stall. So it's frequently better to simply
   codegen vmul + vadd.
2. A vmla followed by a vmul, vmadd, or vsub causes the second fp instruction
   to stall for 4 cycles. We need to schedule them apart.
3. A vmla followed by a vmla is a special case. Obviously, issuing back-to-back
   RAW vmla + vmla is very bad. But this isn't ideal either:
     vmul
     vadd
     vmla
   Instead, we want to expand the second vmla:
     vmla
     vmul
     vadd
   Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
   faster.
Up to now, isel has simply avoided codegen'ing fp vmla / vmls. This works well
enough, but it isn't the optimal solution. This patch attempts to make it
possible to use vmla / vmls in cases where it is profitable:
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
   compute both an fmul and an fmla.
C. Add additional isel checks for vmla, avoiding cases where vmla is feeding
   into fp instructions (except for the #3 exceptional case).
D. Add an ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc expansion of vmla / vmls when it's likely the
   vmla / vmls will trigger one of the special hazards.
Enable these fp vmlx codegen changes for Cortex-A9.
llvm-svn: 129775
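A toy sketch of the hazard-recognizer idea in item D, assuming nothing about the real ARMHazardRecognizer beyond what the message says: remember whether the last issued instruction was a vmla/vmls, and report a hazard when a dependent fp instruction tries to issue right behind it.

```cpp
// Toy model of the vmla/vmls issue hazard described above.
enum HazardType { NoHazard, Hazard };

struct VMLxHazardModel {
  bool LastWasVMLx = false; // previous issued instruction was vmla/vmls

  // A dependent fp instruction immediately after a vmla/vmls stalls, so ask
  // the scheduler to put something else in between.
  HazardType getHazardType(bool IsFPInstr, bool ReadsVMLxResult) const {
    return (LastWasVMLx && IsFPInstr && ReadsVMLxResult) ? Hazard : NoHazard;
  }

  void emitInstruction(bool IsVMLx) { LastWasVMLx = IsVMLx; }
};
```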
| 
llvm-svn: 129738
| 
llvm-svn: 127899
| 
can. As Nate pointed out, VTBL isn't super performant, but it *has* to be better
than this:
_shuf:
@ BB#0:       @ %entry
  push        {r4, r7, lr}
  add         r7, sp, #4
  sub         sp, #12
  mov         r4, sp
  bic         r4, r4, #7
  mov         sp, r4
  mov         r2, sp
  vmov        d16, r0, r1
  orr         r0, r2, #6
  orr         r3, r2, #7
  vst1.8      {d16[0]}, [r3]
  vst1.8      {d16[5]}, [r0]
  subs        r4, r7, #4
  orr         r0, r2, #5
  vst1.8      {d16[4]}, [r0]
  orr         r0, r2, #4
  vst1.8      {d16[4]}, [r0]
  orr         r0, r2, #3
  vst1.8      {d16[0]}, [r0]
  orr         r0, r2, #2
  vst1.8      {d16[2]}, [r0]
  orr         r0, r2, #1
  vst1.8      {d16[1]}, [r0]
  vst1.8      {d16[3]}, [r2]
  vldr.64     d16, [sp]
  vmov        r0, r1, d16
  mov         sp, r4
  pop         {r4, r7, pc}
The "illegal" testcase in vext.ll is no longer illegal.
<rdar://problem/9078775>
llvm-svn: 127630
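For contrast, the same 8-byte shuffle written with the AArch32 NEON table-lookup intrinsic; the index vector is read off the lane stores in the sequence above, and the function name is only for the example:

```cpp
#include <arm_neon.h>

// One vtbl1 replaces the whole stack-spill sequence: each result byte picks
// a source byte by index.
uint8x8_t shuf(uint8x8_t v) {
  static const uint8_t Indices[8] = {3, 1, 2, 0, 4, 4, 5, 0};
  return vtbl1_u8(v, vld1_u8(Indices));
}
```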
| 
llvm-svn: 127509
| 
llvm-svn: 127090
| 
llvm-svn: 126477
| 
have their low bits set to zero. This allows us to optimize
out explicit stack alignment code like in stack-align.ll:test4 when
it is redundant.
Doing this causes the code generator to start turning FI+cst into
FI|cst all over the place, which is general goodness (that is the
canonical form) except that various pieces of the code generator
don't handle OR aggressively. Fix this by introducing a new
SelectionDAG::isBaseWithConstantOffset predicate, and using it
in places that are looking for ADD(X,CST). The ARM backend in
particular was missing a lot of addressing mode folding opportunities
around OR.
llvm-svn: 125470
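The predicate's logic, sketched close to (but not verbatim) the LLVM implementation: an OR only behaves like an ADD when the constant's set bits are known to be zero in the base, so the "addition" cannot carry.

```cpp
// Sketch of SelectionDAG::isBaseWithConstantOffset's logic.
bool isBaseWithConstantOffset(const SelectionDAG &DAG, SDValue Op) {
  if (Op.getOpcode() != ISD::ADD && Op.getOpcode() != ISD::OR)
    return false;
  auto *Cst = dyn_cast<ConstantSDNode>(Op.getOperand(1));
  if (!Cst)
    return false;
  // base | cst == base + cst only if base has zeros in all of cst's bits.
  if (Op.getOpcode() == ISD::OR &&
      !DAG.MaskedValueIsZero(Op.getOperand(0), Cst->getAPIntValue()))
    return false;
  return true;
}
```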
| 
The vld1-lane, vld1-dup and vst1-lane instructions do not yet support using
post-increment versions, but all the rest of the NEON load/store instructions
should be handled now.
llvm-svn: 125014
| 
These operations are expanded to pairs of loads or stores, and the first one
uses the address register update to produce the address for the second one.
So far, the second load/store has also updated the address register, just
for convenience, since that output has never been used. In anticipation of
actually supporting post-increment updates for these operations, this changes
the non-updating operations to use a non-updating load/store for the second
instruction.
llvm-svn: 125013
| 
TargetInstrInfo:
Change produceSameValue() to take MachineRegisterInfo as an optional argument.
When in SSA form, targets can use it to make more aggressive equality analysis.
Machine LICM:
1. Eliminate isLoadFromConstantMemory; use MI.isInvariantLoad instead.
2. Fix a bug which prevented CSE of instructions which are not re-materializable.
3. Use the improved form of produceSameValue.
ARM:
1. Teach ARM produceSameValue to look past some PIC labels.
2. Look for operands from different loads of different constant pool entries
   which have the same values.
3. Re-implement PIC GA materialization using movw + movt. Combine the pair with
   an "add pc" or "ldr [pc]" to form pseudo instructions. This makes it possible
   to re-materialize the instruction, allowing machine LICM to hoist the set of
   instructions out of the loop and making it possible to CSE them. It's a bit
   hacky, but it significantly improves code quality.
4. Some minor bug fixes as well.
With the fixes, using movw + movt to materialize GAs significantly outperforms
the load-from-constant-pool method. 186.crafty and 255.vortex improved > 20%,
254.gap and 176.gcc ~10%.
llvm-svn: 123905
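The hook change, sketched with that era's pointer-based signature (simplified; MachineInstr and MachineRegisterInfo are the real LLVM classes, the wrapper class name is not):

```cpp
class MachineInstr;
class MachineRegisterInfo;

class TargetInstrInfoSketch {
public:
  // When MRI is non-null (i.e., the function is still in SSA form), an
  // implementation may chase virtual-register defs, e.g. to prove that two
  // loads of different constant-pool entries produce identical values.
  virtual bool produceSameValue(const MachineInstr *MI0,
                                const MachineInstr *MI1,
                                const MachineRegisterInfo *MRI = nullptr) const;
  virtual ~TargetInstrInfoSketch() = default;
};
```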
| 
llvm-svn: 123823
| 
movw    r0, :lower16:(L_foo$non_lazy_ptr-(LPC0_0+4))
        movt    r0, :upper16:(L_foo$non_lazy_ptr-(LPC0_0+4))
LPC0_0:
        add     r0, pc, r0
It's not yet enabled by default, as some tests are failing. I suspect bugs in
downstream tools.
llvm-svn: 123619
| 
earlyclobber stuff. This should fix PRs 2313 and 8157.
Unfortunately, no testcase, since it'd be dependent on register
assignments.
llvm-svn: 122663
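For reference, what an earlyclobber operand looks like at the source level in GNU inline asm (a generic ARM example, not the PR testcases): the '&' tells the register allocator the output is written before the inputs are dead, so it must not share their registers.

```cpp
int scale_add(int a, int b) {
  int out;
  // "=&r": earlyclobber output; without '&', out could be assigned the same
  // register as b, and the first mov would clobber it.
  asm("mov %0, %1\n\t"
      "add %0, %0, %2"
      : "=&r"(out)
      : "r"(a), "r"(b));
  return out;
}
```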
| 
llvm-svn: 122539
| 
something that just glues two nodes together, even if it is
sometimes used for flags.
llvm-svn: 122310
| 
llvm-svn: 122017
| 
Canonicalize on tLDRpci and remove tLDRcp.
llvm-svn: 121920
| 
llvm-svn: 121820
| 
llvm-svn: 121815
| 
particular, we want
   ldr r2, [r3]
to be equivalent to
   ldr r2, [r3, #0]
and not
   ldr r2, [r3, r0]
llvm-svn: 121808
| 
instruction based on the t_addrmode_s# mode and what it returned. There is some
obvious badness to this. In particular, it's hard to do MC encoding when the
instruction may change out from underneath you after the t_addrmode_s# variable
is finally resolved.
The solution is to revert a long-ago change that merged the reg/reg and reg/imm
versions, and to add several new addressing modes. They no longer have
extraneous operands associated with them; i.e., if it's reg/reg, we don't have
to have a dummy zero immediate tacked onto the SDNode.
There are some obvious cleanups here, which will happen shortly.
llvm-svn: 121747
| 
Alignments smaller than the total size of the memory being loaded or stored,
unless the alignment is 8 bytes, are not allowed. Add tests for this, too.
llvm-svn: 121506
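Restated as a predicate (a sketch of the rule as described, not the actual LLVM code): an alignment annotation is accepted only if it covers the whole access, or is exactly 8 bytes.

```cpp
// Sketch of the NEON load/store alignment rule described above.
bool isAcceptableNEONAlignment(unsigned AlignBytes, unsigned AccessBytes) {
  return AlignBytes >= AccessBytes || AlignBytes == 8;
}
```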
| 
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has latency that is one cycle shorter than a pair
   of vmul + vadd, a RAW hazard during the first few cycles (4 on Cortex-A8?)
   can cause an additional pipeline stall. So it's frequently better to simply
   codegen vmul + vadd.
2. A vmla followed by a vmul, vmadd, or vsub causes the second fp instruction
   to stall for 4 cycles. We need to schedule them apart.
3. A vmla followed by a vmla is a special case. Obviously, issuing back-to-back
   RAW vmla + vmla is very bad. But this isn't ideal either:
     vmul
     vadd
     vmla
   Instead, we want to expand the second vmla:
     vmla
     vmul
     vadd
   Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
   faster.
Up to now, isel has simply avoided codegen'ing fp vmla / vmls. This works well
enough, but it isn't the optimal solution. This patch attempts to make it
possible to use vmla / vmls in cases where it is profitable:
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
   compute both an fmul and an fmla.
C. Add additional isel checks for vmla, avoiding cases where vmla is feeding
   into fp instructions (except for the #3 exceptional case).
D. Add an ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc expansion of vmla / vmls when it's likely the
   vmla / vmls will trigger one of the special hazards.
Work in progress; only A+B are enabled.
llvm-svn: 120960
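Item B boils down to a single-use check during isel. A sketch assuming SelectionDAG types (operand order simplified; the commuted (fadd x, (fmul ...)) form would be handled analogously, and the helper name is only for the example):

```cpp
// Only fold (fadd (fmul a, b), c) into vmla when the fmul has no other
// users; otherwise we'd end up computing both an fmul and an fmla.
bool canFoldIntoVMLA(SDValue FAdd) {
  if (FAdd.getOpcode() != ISD::FADD)
    return false;
  SDValue Mul = FAdd.getOperand(0);
  return Mul.getOpcode() == ISD::FMUL && Mul.hasOneUse();
}
```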
| 
The encoding for alignment in VLD4-dup instructions is still a work in progress.
llvm-svn: 120356
| 
llvm-svn: 120312
| 
llvm-svn: 120236
| 
llvm-svn: 119866
| 
in a register. These immediates aren't free.
llvm-svn: 119558
| 
llvm-svn: 118968
| 
llvm-svn: 118951
| 
llvm-svn: 118935
| 
with a SimpleValueType, while an EVT supports equality and
inequality comparisons with SimpleValueType.
llvm-svn: 118169
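The comparisons the message refers to look like this in backend code (assuming LLVM's EVT/MVT types; the helper names are only for the example):

```cpp
#include "llvm/CodeGen/ValueTypes.h"
using namespace llvm;

// An EVT compares directly against an MVT::SimpleValueType.
bool isWord(EVT VT) { return VT == MVT::i32; }
bool isNotDouble(EVT VT) { return VT != MVT::f64; }
```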
| 
parts. Represent the operation mode as an optional operand instead.
rdar://8614429
llvm-svn: 118137
| 
This is another part of the fix for Radar 8599955.
llvm-svn: 117976
| 
complex load / store addressing mode) when they have higher cost and
when they have more than one use.
llvm-svn: 117509
| 
explicit about the operands. Split out the different variants into separate
instructions. This gives us the ability to, among other things, assign
different scheduling itineraries to the variants. rdar://8477752.
llvm-svn: 117409
| 
llvm-svn: 117050
| 
llvm-svn: 116776
| 
llvm-svn: 115890
| 
llvm-svn: 115884
| 
which require the use of the shifter operand. This will be used to split
the ldr/str instructions such that the versions needing the shifter operand
can get a different scheduling itinerary, since in some cases the use of the
shifter can cause different scheduling than the simpler forms.
llvm-svn: 115066
| 
llvm-svn: 115043
| 
llvm-svn: 114709