bcm5719-llvm/llvm/test/Transforms/LoopVectorize/ARM, branch meklort-10.0.1

Project Ortega BCM5719 LLVM
https://git.raptorcs.com/git/bcm5719-llvm/atom?h=meklort-10.0.1
2020-01-09T14:03:25+00:00

[ARM][MVE] MVE-I should not be disabled by -mfpu=none
2020-01-09T14:03:25+00:00 Momchil Velikov momchil.velikov@arm.com 2020-01-09T13:47:52+00:00 urn:sha1:173b711e83d7b61a46f55eb44f03ea98f69a1dd6

Architecturally, it's allowed to have MVE-I without an FPU, thus -mfpu=none should not disable MVE-I, or moves to/from FP-registers.

This patch removes `+/-fpregs` from the features unconditionally added to the target feature list depending on the FPU, and moves the logic to the Clang driver, where the negative form (`-fpregs`) is conditionally added to the target features list for the cases of `-mfloat-abi=soft`, or `-mfpu=none` without either `+mve` or `+mve.fp`. Only the negative form is added by the driver; the positive one is derived from other features in the backend.

Differential Revision: https://reviews.llvm.org/D71843

[LV] Still vectorise when tail-folding can't find a primary inducation variable
2020-01-09T09:14:00+00:00 Sjoerd Meijer sjoerd.meijer@arm.com 2020-01-09T09:14:00+00:00 urn:sha1:8f1887456ab4ba24a62ccb19d0d04b08972a0289

This addresses a vectorisation regression for tail-folded loops that are counting down, e.g. loops as simple as this:

    void foo(char *A, char *B, char *C, uint32_t N) {
      while (N > 0) {
        *C++ = *A++ + *B++;
        N--;
      }
    }

These are loops that can be vectorised, but when tail-folding is requested, it can't find a primary induction variable, which we do need for predicating the loop. As a result, the loop isn't vectorised at all, even though it can be vectorised when tail-folding is not attempted.

So, this adds a check for the primary induction variable where we decide how to lower the scalar epilogue. I.e., when there isn't a primary induction variable, a scalar epilogue loop is allowed (i.e. don't request tail-folding) so that vectorisation could still be triggered.
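To make the role of the induction variable concrete, here is a minimal sketch in C of the counting-down loop from the message next to a scalar emulation of a tail-folded loop. It is not code from the patch: the names `add_down`, `add_tail_folded` and `VF` are hypothetical, and the "vector" body is emulated lane by lane. The point it tries to show is that the per-lane predicate `i + lane < N` needs an index `i` that starts at 0 and counts up, which the counting-down form does not provide directly.

```c
#include <stdint.h>

/* Counting-down form, as in the commit message: N is the only counter
   and the pointers are bumped on every iteration. */
void add_down(const char *A, const char *B, char *C, uint32_t N) {
  while (N > 0) {
    *C++ = *A++ + *B++;
    N--;
  }
}

/* Hypothetical vectorisation factor, chosen only for illustration. */
enum { VF = 4 };

/* Scalar emulation of a tail-folded (predicated) loop. There is no
   scalar epilogue: the last block is handled by masking off lanes whose
   index is past the end. The 64-bit index avoids wrap-around for large N. */
void add_tail_folded(const char *A, const char *B, char *C, uint32_t N) {
  for (uint64_t i = 0; i < N; i += VF) {
    for (uint64_t lane = 0; lane < VF; lane++) {
      if (i + lane < N) /* lane predicate derived from the up-counting index */
        C[i + lane] = A[i + lane] + B[i + lane];
    }
  }
}
```

Both functions should produce the same output for any `N`; the second simply makes the dependence on an up-counting index explicit.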
Having this check for the primary induction variable makes sense anyway, and in addition, in a follow-up of this I will look into discovering the primary induction variable earlier for counting-down loops, so that these can also be tail-folded.

Differential revision: https://reviews.llvm.org/D72324

Migrate function attribute "no-frame-pointer-elim" to "frame-pointer"="all" as cleanups after D56351
2019-12-24T23:57:33+00:00 Fangrui Song maskray@google.com 2019-12-24T23:52:21+00:00 urn:sha1:502a77f125f43ffde57af34d3fd1b900248a91cd

[ARM] Add missing REQUIRES: asserts to test. NFC
2019-12-09T11:43:43+00:00 David Green david.green@arm.com 2019-12-09T11:43:23+00:00 urn:sha1:d6642ed1c867f97fdf951aac751c7854fbc7c51f

[ARM] Enable MVE masked loads and stores
2019-12-09T11:37:34+00:00 David Green david.green@arm.com 2019-12-08T16:10:01+00:00 urn:sha1:b1aba0378e52be51cfb7fb6f03417ebf408d66cc

With the extra optimisations we have done, these should now be fine to enable by default, which is what this patch does.

Differential Revision: https://reviews.llvm.org/D70968

[ARM] Teach the Arm cost model that a Shift can be folded into other instructions
2019-12-09T10:24:33+00:00 David Green david.green@arm.com 2019-12-08T15:33:24+00:00 urn:sha1:be7a1070700e591732b254e29f2dd703325fb52a

This attempts to teach the cost model in Arm that code such as:

    %s = shl i32 %a, 3
    %r = and i32 %s, %b

Can under Arm or Thumb2 become:

    and r0, r1, r2, lsl #3

So the cost of the shift can essentially be free. To do this without trying to artificially adjust the cost of the "and" instruction, it needs to get the users of the shl and check if they are a type of instruction that the shift can be folded into. And so it needs to have access to the actual instruction in getArithmeticInstrCost, which if available is added as an extra parameter, much like getCastInstrCost.

We otherwise limit it to shifts with a single user, which should hopefully handle most of the cases.
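The rule described above can be sketched in C as a small decision function. This is an illustration only, not the LLVM implementation: the `opcode` enum, `can_fold_shift`, `shift_cost` and the 0/1 cost values are all hypothetical. The set of foldable users follows the IR operations named in the commit message (Add, Sub, And, Or, Xor and ICmp).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcode set for illustration; not LLVM's own enum. */
enum opcode { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR, OP_ICMP, OP_MUL };

/* True if a shift feeding this kind of user could be folded into it
   as a shifted operand, per the list in the commit message. */
bool can_fold_shift(enum opcode user) {
  switch (user) {
  case OP_ADD:
  case OP_SUB:
  case OP_AND:
  case OP_OR:
  case OP_XOR:
  case OP_ICMP:
    return true;
  default:
    return false;
  }
}

/* Illustrative cost rule: a shift is free only when it has exactly one
   user and that user can absorb it. The values 0 and 1 are assumptions. */
int shift_cost(unsigned num_users, enum opcode only_user) {
  if (num_users == 1 && can_fold_shift(only_user))
    return 0;
  return 1;
}

/* Source-level shape of the example: a shift whose single user is an AND. */
uint32_t and_shifted(uint32_t a, uint32_t b) {
  return (a << 3) & b;
}
```

Restricting the rule to single-user shifts keeps it conservative: a shift with several users would still need to be materialised unless every user could fold it.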
The list of instructions that the shift can be folded into includes ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR, ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and ICmp.

Differential Revision: https://reviews.llvm.org/D70966

[ARM] Additional tests and minor formatting. NFC
2019-12-09T10:24:33+00:00 David Green david.green@arm.com 2019-12-08T15:26:32+00:00 urn:sha1:f008b5b8ce724d60f0f0eeafceee0119c42022d4

This adds some extra cost model tests for shifts, and makes some minor adjustments to some Neon code to make it clear what it applies to. Both NFC.

[ARM] Disable VLD4 under MVE
2019-12-08T10:37:29+00:00 David Green david.green@arm.com 2019-12-08T09:58:03+00:00 urn:sha1:3a6eb5f16054e8c0f41a37542a5fc806016502a0

Alas, using half the available vector registers in a single instruction is just too much for the register allocator to handle. The mve-vldst4.ll test here fails when these instructions are enabled at present. This patch disables the generation of VLD4 and VST4 by adding a mve-max-interleave-factor option, which we currently default to 2.

Differential Revision: https://reviews.llvm.org/D71109

[LV] PreferPredicateOverEpilog respecting option
2019-11-21T14:06:10+00:00 Sjoerd Meijer sjoerd.meijer@arm.com 2019-11-21T14:03:28+00:00 urn:sha1:901cd3b3f62d0c700e5d2c3f97eff97d634bec5e

Follow-up of cb47b8783: don't query TTI->preferPredicateOverEpilogue when option -prefer-predicate-over-epilog is set to false, i.e. when we prefer not to predicate the loop.

Differential Revision: https://reviews.llvm.org/D70382

[ARM] MVE interleaving load and stores.
2019-11-19T18:37:30+00:00 David Green david.green@arm.com 2019-11-19T18:37:21+00:00 urn:sha1:882f23caeae5ad3ec1806eb6ec387e3611649d54

Now that we have the intrinsics, we can add VLD2/4 and VST2/4 lowering for MVE.
This works the same way as Neon, recognising the load/shuffle combinations and converting them into intrinsics in a pre-isel pass, which just calls getMaxSupportedInterleaveFactor, lowerInterleavedLoad and lowerInterleavedStore.

The main difference to Neon is that we do not have a VLD3 instruction. Otherwise most of the code works very similarly, with just some minor differences in the form of the intrinsics to work around. VLD3 is disabled by making isLegalInterleavedAccessType return false for those cases.

We may need some other future adjustments, such as VLD4 taking up half the available registers and so perhaps needing a higher cost. This patch should get the basics in though.

Differential Revision: https://reviews.llvm.org/D69392
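For readers unfamiliar with interleaved accesses, the following C sketch shows the kind of source loops that give rise to factor-2 load and store groups. It is an illustration, not a test from the patch: the names `split_pairs` and `merge_pairs` are hypothetical, and whether a given compiler actually selects VLD2/VST2 for these loops depends on the target options and the cost model.

```c
#include <stddef.h>
#include <stdint.h>

/* Factor-2 interleaved load: consecutive elements of `in` alternate
   between two logical streams, which are separated into `even` and `odd`. */
void split_pairs(const int32_t *in, int32_t *even, int32_t *odd, size_t n) {
  for (size_t i = 0; i < n; i++) {
    even[i] = in[2 * i];
    odd[i] = in[2 * i + 1];
  }
}

/* Factor-2 interleaved store: the inverse operation, writing the two
   streams back into a single alternating array. */
void merge_pairs(const int32_t *even, const int32_t *odd, int32_t *out,
                 size_t n) {
  for (size_t i = 0; i < n; i++) {
    out[2 * i] = even[i];
    out[2 * i + 1] = odd[i];
  }
}
```

A loop with three streams (stride 3) has the same shape, but according to the message above it would not be lowered to an interleaving instruction on MVE, since there is no VLD3.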