path: root/polly/lib/Transform/ScheduleOptimizer.cpp
Commit message | Author | Date | Files | Lines
...
* Introduce another level of metadata to distinguish non-aliasing accesses | Roman Gareev | 2017-03-22 | 1 | -0/+16
  Introduce another level of alias metadata to distinguish the individual
  non-aliasing accesses that have inter-iteration alias-free base pointers
  marked with "Inter iteration alias-free" mark nodes. It can be used, for
  example, to distinguish different stores (loads) produced by unrolling of
  the innermost loops and, subsequently, to sink (hoist) them by LICM.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D30606
  llvm-svn: 298510
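  A second level of alias scopes of this kind can be sketched with LLVM's
  MDBuilder; the following is only an illustration (the helper and scope names
  are invented, not Polly's actual code):

    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/MDBuilder.h"
    #include "llvm/IR/Metadata.h"

    using namespace llvm;

    // Give two stores that share an inter-iteration alias-free base pointer
    // their own scopes, so later passes (e.g. LICM) can tell them apart.
    static void addSecondLevelAliasScopes(StoreInst *S1, StoreInst *S2) {
      LLVMContext &Ctx = S1->getContext();
      MDBuilder MDB(Ctx);
      MDNode *Domain = MDB.createAnonymousAliasScopeDomain("polly.2nd.level");
      MDNode *Scope1 = MDB.createAnonymousAliasScope(Domain, "polly.store.1");
      MDNode *Scope2 = MDB.createAnonymousAliasScope(Domain, "polly.store.2");
      // Each store lives in its own scope and declares no-alias with the other.
      S1->setMetadata(LLVMContext::MD_alias_scope, MDNode::get(Ctx, {Scope1}));
      S1->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, {Scope2}));
      S2->setMetadata(LLVMContext::MD_alias_scope, MDNode::get(Ctx, {Scope2}));
      S2->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, {Scope1}));
    }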
* [ScheduleOptimiser] fix typos in top comment [NFC] | Siddharth Bhat | 2017-03-17 | 1 | -2/+2
  coice -> choice
  Transations -> Transactions
  llvm-svn: 298095
* [ScheduleOptimizer] Allow tiling after fusion | Tobias Grosser | 2017-03-12 | 1 | -8/+30
  In ScheduleOptimizer::isTileableBand(), allow the case in which the band
  node's child is an isl_schedule_sequence_node and its grandchildren are
  isl_schedule_leaf_nodes. This case can arise when two or more statements
  are fused by the isl scheduler. The tile_after_fusion.ll test has two
  statements in separate loop nests and checks whether they are tiled after
  being fused when polly-opt-fusion equals "max".
  Reviewers: grosser
  Subscribers: gareevroman, pollydev
  Tags: #polly
  Contributed-by: Theodoros Theodoridis <theodort@student.ethz.ch>
  Differential Revision: https://reviews.llvm.org/D30815
  llvm-svn: 297587
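  Roughly, the relaxed check walks the band's child and accepts a sequence
  whose filters lead to leaf nodes; an illustrative sketch against the isl C
  API (not the exact Polly implementation):

    #include <isl/schedule_node.h>

    // Accept a band whose only child is a sequence of filters over leaf
    // nodes, as produced when the isl scheduler fuses several statements.
    static bool childIsSequenceOfLeaves(isl_schedule_node *Band) {
      isl_schedule_node *Child = isl_schedule_node_get_child(Band, 0);
      bool Result = false;
      if (isl_schedule_node_get_type(Child) == isl_schedule_node_sequence) {
        Result = true;
        int N = isl_schedule_node_n_children(Child);
        for (int i = 0; i < N; ++i) {
          isl_schedule_node *Filter = isl_schedule_node_get_child(Child, i);
          // Each sequence child is a filter node; look through it.
          isl_schedule_node *Grandchild = isl_schedule_node_get_child(Filter, 0);
          if (isl_schedule_node_get_type(Grandchild) != isl_schedule_node_leaf)
            Result = false;
          isl_schedule_node_free(Grandchild);
          isl_schedule_node_free(Filter);
        }
      }
      isl_schedule_node_free(Child);
      return Result;
    }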
* Make optimizations based on pattern matching be enabled by default | Roman Gareev | 2017-02-23 | 1 | -1/+1
  Currently, pattern based optimizations of Polly can identify matrix
  multiplication and optimize it according to the BLIS matmul optimization
  pattern (see ScheduleTreeOptimizer for details). This patch enables
  optimizations based on pattern matching by default.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D30293
  llvm-svn: 295958
* [FIX] Fix the typo in ScheduleOptimizer.cpp. | Roman Gareev | 2017-02-16 | 1 | -1/+2
  llvm-svn: 295292
* Check reduction dependencies in case of the matrix multiplication optimization | Roman Gareev | 2017-02-11 | 1 | -3/+6
  To determine parameters of the matrix multiplication, we check RAW
  dependencies that can be expressed using only reduction dependencies.
  Consequently, we should check the reduction dependencies, if this is the
  case.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>, Sven Verdoolaege
  <skimo-polly@kotnet.org>, Michael Kruse <llvm@meinersbur.de>
  Differential Revision: https://reviews.llvm.org/D29814
  llvm-svn: 294836
* [FIX] Fix the potential issue of containsOnlyMatMulDep. | Roman Gareev | 2017-02-11 | 1 | -0/+2
  llvm-svn: 294835
* [NFC] Fix the style issue of lib/Transform/ScheduleOptimizer.cpp. | Roman Gareev | 2017-02-11 | 1 | -2/+1
  llvm-svn: 294834
* [NFC] Fix style issues of lib/Transform/ScheduleOptimizer.cpp. | Roman Gareev | 2017-02-11 | 1 | -9/+6
  llvm-svn: 294831
* Use the size of the widest type of the matrix multiplication operands | Roman Gareev | 2017-02-11 | 1 | -8/+48
  The size of the operand type is one of the parameters required to determine
  the BLIS micro-kernel. We use the size of the widest type of the matrix
  multiplication operands in case there are several different types.
  Reviewed-by: Michael Kruse <llvm@meinersbur.de>
  Differential Revision: https://reviews.llvm.org/D29269
  llvm-svn: 294828
* Isolate a set of partial tile prefixes in case of the matrix multiplication optimization | Roman Gareev | 2017-02-09 | 1 | -6/+72
  Isolate a set of partial tile prefixes to allow hoisting and sinking out of
  the unrolled innermost loops produced by the optimization of the matrix
  multiplication.
  In case it cannot be proved that the number of loop iterations can be
  evenly divided by the tile sizes and we tile and unroll the point loop, isl
  generates conditional expressions. Subsequently, these conditional
  expressions can prevent stores and loads of the unrolled loops from being
  sunk and hoisted.
  The patch isolates a set of partial tile prefixes that have exactly Mr x Nr
  iterations of the two innermost loops produced by the loop tiling of the
  matrix multiplication optimization, where Mr and Nr are parameters of the
  micro-kernel. This helps to get rid of the conditional expressions of the
  unrolled innermost loops. This approach can probably be replaced with
  padding in the future.
  In case of, for example, the gemm from Polybench/C 3.2 and parametric loop
  bounds, it helps to increase the performance from 7.98 GFlops (27.71% of
  theoretical peak) to 21.47 GFlops (74.57% of theoretical peak). Hence, we
  get the same performance as in case of scalar loop bounds.
  It also causes a compile-time regression. The compile time is increased
  from 0.795 seconds to 0.837 seconds in case of scalar loop bounds and from
  1.222 seconds to 1.490 seconds in case of parametric loop bounds.
  Reviewed-by: Michael Kruse <llvm@meinersbur.de>
  Differential Revision: https://reviews.llvm.org/D29244
  llvm-svn: 294564
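  The isolation itself is expressed through the band's AST build options; a
  schematic sketch of setting the isolate option (the option string below is
  a placeholder, while the real set is computed from the partial schedule of
  the band):

    #include <isl/schedule_node.h>
    #include <isl/union_set.h>

    // Ask isl's AST generator to isolate the full Mr x Nr tile prefixes, so
    // the unrolled innermost loops are emitted without guarding conditionals.
    static isl_schedule_node *setIsolateOption(isl_schedule_node *Band) {
      isl_ctx *Ctx = isl_schedule_node_get_ctx(Band);
      // Placeholder set: isolate point-loop instances with full 4x4 tiles.
      const char *Str =
          "{ isolate[[] -> [i0, i1]] : 0 <= i0 < 4 and 0 <= i1 < 4 }";
      isl_union_set *Options = isl_union_set_read_from_str(Ctx, Str);
      return isl_schedule_node_band_set_ast_build_options(Band, Options);
    }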
* [NFC] Make ScheduleTreeOptimizer::optimizeBand return a schedule node optimized with optimizeMatMulPattern | Roman Gareev | 2017-02-08 | 1 | -1/+1
  This patch makes ScheduleTreeOptimizer::optimizeBand return a schedule node
  optimized with optimizeMatMulPattern. Otherwise, it could not use the
  isolate option, because standardBandOpts could try to tile a band node with
  an anchored subtree and fail, since the use of the isolate option causes
  any tree containing the node to be considered anchored. Furthermore, it is
  not intended to apply the standard optimizations when the matrix
  multiplication has been detected.
  llvm-svn: 294444
* A new algorithm for identification of a SCoP statement that implements a matrix multiplication | Roman Gareev | 2017-02-02 | 1 | -260/+396
  The current identification of a SCoP statement that implements a matrix
  multiplication does not help to identify different permutations of the
  loops that contain it or to check for dependencies, which can prevent it
  from being optimized. It also requires external determination of the
  operands of the matrix multiplication. This patch contains the
  implementation of a new algorithm that helps to avoid these issues. It also
  modifies the test cases that generate matrix multiplications with
  linearized accesses, because the new algorithm does not support them.
  Reviewed-by: Michael Kruse <llvm@meinersbur.de>, Tobias Grosser
  <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D28357
  llvm-svn: 293890
* Update to recent formatting changes | Tobias Grosser | 2017-02-01 | 1 | -3/+2
  llvm-svn: 293756
* Update the documentation on how the packing transformation is implemented | Roman Gareev | 2017-01-29 | 1 | -2/+19
  Add a simple example to update the documentation on how the packing
  transformation is implemented.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>, Michael Kruse
  <llvm@meinersbur.de>
  Differential Revision: https://reviews.llvm.org/D28021
  llvm-svn: 293429
* Adjust formatting to commit r292110 [NFC] | Tobias Grosser | 2017-01-16 | 1 | -5/+8
  llvm-svn: 292123
* ScheduleOptimizer: Allow to set register width in command line | Tobias Grosser | 2017-01-14 | 1 | -1/+11
  We use this option to set a fixed register width in our test cases to make
  sure the results are identical across platforms.
  llvm-svn: 292002
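  Such a switch is typically declared with LLVM's cl::opt machinery; a
  minimal sketch with an invented option name (the real flag in
  ScheduleOptimizer.cpp may be named and described differently):

    #include "llvm/Support/CommandLine.h"

    using namespace llvm;

    // Hypothetical flag: lets tests pin the assumed vector register width so
    // results do not depend on the host target; -1 means "query the target".
    static cl::opt<int> ExampleVectorRegisterBitwidth(
        "polly-example-vector-register-bitwidth",
        cl::desc("Assumed size in bits of a vector register"), cl::Hidden,
        cl::init(-1), cl::ZeroOrMore);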
* Specify the default values of the cache parameters | Roman Gareev | 2016-12-25 | 1 | -19/+35
  If the parameters of the target cache (i.e., cache level sizes, cache level
  associativities) are not specified or have wrong values, we use ones for
  the parameters of the macro-kernel and do not perform data-layout
  optimizations of the matrix multiplication. In this patch we specify the
  default values of the cache parameters to be able to apply the pattern
  matching optimizations even in this case. Since there are no typical values
  of these parameters, we use the parameters of Intel Core i7-3820
  SandyBridge, which also help to attain high performance on an IBM POWER
  System S822 and an IBM Power 730 Express server.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D28090
  llvm-svn: 290518
* ScheduleOptimizer: Fix spelling of option '-polly-target-throughput-vector-fma' | Tobias Grosser | 2016-12-23 | 1 | -4/+4
  througput -> throughput
  llvm-svn: 290418
* Change the determination of parameters of macro-kernel | Roman Gareev | 2016-12-21 | 1 | -1/+10
  Typically processor architectures do not include an L3 cache, which means
  that Nc, the corresponding parameter of the macro-kernel, is, for all
  practical purposes, redundant ([1]). However, small values of the parameter
  can cause redundant packing of the same elements of the matrix A, the first
  operand of the matrix multiplication. At the same time, big values of the
  parameter Nc can cause segmentation faults in case the available stack is
  exceeded.
  This patch adds an option to specify the parameter Nc as a multiple of the
  parameter of the micro-kernel Nr.
  In case of Intel Core i7-3820 SandyBridge and the following options,
  clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8 -mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm -polly-target-latency-vector-fma=8
  it helps to improve the performance from 11.303 GFlops/sec (39.247% of
  theoretical peak) to 17.896 GFlops/sec (62.14% of theoretical peak).
  Refs.:
  [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D28019
  llvm-svn: 290256
* [Polly] Use three-dimensional arrays to store packed operands of the matrix multiplication | Roman Gareev | 2016-12-21 | 1 | -25/+19
  Previously we had two-dimensional accesses to store packed operands of the
  matrix multiplication for the sake of simplicity of the packed arrays.
  However, addition of the third dimension helps to simplify the
  corresponding memory access, reduce the execution time of isl operations
  applied to it, and consequently reduce the compile time of Polly. For
  example, in case of Intel Core i7-3820 SandyBridge and the following
  options,
  clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8 -mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm -polly-target-latency-vector-fma=7
  it helps to reduce the compile time from about 361.456 seconds to about
  0.816 seconds.
  Reviewed-by: Michael Kruse <llvm@meinersbur.de>, Tobias Grosser
  <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D27878
  llvm-svn: 290251
* Restrict ranges of extension maps | Roman Gareev | 2016-12-15 | 1 | -1/+20
  To prevent copy statements from accessing arrays out of bounds, ranges of
  their extension maps are restricted, according to the constraints of
  domains.
  Reviewed-by: Michael Kruse <llvm@meinersbur.de>
  Differential Revision: https://reviews.llvm.org/D25655
  llvm-svn: 289815
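  Conceptually, the restriction is an intersection of each extension map's
  range with the corresponding domain constraints; a minimal isl sketch
  (names are illustrative, not the actual Polly helper):

    #include <isl/union_map.h>
    #include <isl/union_set.h>

    // Clamp the range of an extension map to the given domain so that the
    // generated copy statements never access the packed arrays out of bounds.
    static isl_union_map *restrictExtensionRange(isl_union_map *Extension,
                                                 isl_union_set *Domain) {
      return isl_union_map_intersect_range(Extension, Domain);
    }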
* [NFC] Fix typos in getMacroKernelParams. | Roman Gareev | 2016-12-15 | 1 | -5/+3
  llvm-svn: 289808
* The order of the loops defines the data reused in the BLIS implementation of gemm ([1]) | Roman Gareev | 2016-12-15 | 1 | -29/+31
  In particular, elements of the matrix B, the second operand of the matrix
  multiplication, are reused between iterations of the innermost loop. To
  keep the reused data in cache, only elements of the matrix A, the first
  operand of the matrix multiplication, should be evicted during an iteration
  of the innermost loop. To provide such a cache replacement policy, elements
  of the matrix A can, in particular, be loaded first and, consequently, be
  least recently used.
  In our case matrices are stored in row-major order instead of the
  column-major order used in the BLIS implementation ([1]). One way to
  address this is to change the order of the loops of the loop nest
  accordingly. However, that would make elements of the matrix A be reused in
  the innermost loop and, consequently, require loading elements of the
  matrix B first, which we cannot provide, since the LLVM vectorizer always
  generates loads from the matrix A before loads from the matrix B.
  Consequently, we only change the BLIS micro-kernel and the computation of
  its parameters instead. In particular, reused elements of the matrix B are
  successively multiplied by specific elements of the matrix A.
  Refs.:
  [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D25653
  llvm-svn: 289806
* [ScheduleOptimizer] Fix memory leak. NFC. | Michael Kruse | 2016-12-12 | 1 | -1/+3
  llvm-svn: 289434
* Perform copying to created arrays according to the packing transformation | Roman Gareev | 2016-09-14 | 1 | -13/+117
  This is the fourth patch to apply the BLIS matmul optimization pattern on
  matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).
  BLIS implements gemm as three nested loops around a macro-kernel, plus two
  packing routines. The macro-kernel is implemented in terms of two
  additional loops around a micro-kernel. The micro-kernel is a loop around a
  rank-1 (i.e., outer product) update. In this change we perform copying to
  created arrays, which is the last step to implement the packing
  transformation.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D23260
  llvm-svn: 281441
* Store the size of the outermost dimension in case of newly created arrays that require memory allocation | Roman Gareev | 2016-09-12 | 1 | -5/+4
  We do not need the size of the outermost dimension in most cases, but if we
  allocate memory for newly created arrays, that size is needed.
  Reviewed-by: Michael Kruse <llvm@meinersbur.de>
  Differential Revision: https://reviews.llvm.org/D23991
  llvm-svn: 281234
* Drop '@brief' from doxygen comments | Tobias Grosser | 2016-09-02 | 1 | -28/+26
  LLVM's coding guidelines suggest not to use @brief for one-sentence doxygen
  comments, to improve readability. Switch this once and for all to ensure
  people do not copy @brief comments from other parts of Polly when writing
  new code.
  llvm-svn: 280468
* Add a flag to dump SCoP optimized with the IslScheduleOptimizer pass | Roman Gareev | 2016-08-21 | 1 | -0/+10
  Dump polyhedral descriptions of Scops optimized with the isl scheduling
  optimizer and the set of post-scheduling transformations applied on the
  schedule tree to be able to check the work of the IslScheduleOptimizer
  pass at the polyhedral level.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: https://reviews.llvm.org/D23740
  llvm-svn: 279395
* Perform replacement of access relations and creation of new arrays according to the packing transformation | Roman Gareev | 2016-08-15 | 1 | -0/+215
  This is the third patch to apply the BLIS matmul optimization pattern on
  matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).
  BLIS implements gemm as three nested loops around a macro-kernel, plus two
  packing routines. The macro-kernel is implemented in terms of two
  additional loops around a micro-kernel. The micro-kernel is a loop around a
  rank-1 (i.e., outer product) update. In this change we perform replacement
  of the access relations and create empty arrays, which are steps to
  implement the packing transformation. In subsequent changes we will
  implement copying to created arrays.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: http://reviews.llvm.org/D22187
  llvm-svn: 278666
* Fix a couple of spelling mistakes | Tobias Grosser | 2016-08-03 | 1 | -1/+1
  llvm-svn: 277569
* Apply all necessary tilings and interchangings to get a macro-kernel | Roman Gareev | 2016-07-25 | 1 | -0/+95
  This is the second patch to apply the BLIS matmul optimization pattern on
  matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).
  BLIS implements gemm as three nested loops around a macro-kernel, plus two
  packing routines. The macro-kernel is implemented in terms of two
  additional loops around a micro-kernel. The micro-kernel is a loop around a
  rank-1 (i.e., outer product) update. In this change we create the BLIS
  macro-kernel by applying a combination of tiling and interchanging. In
  subsequent changes we will implement the packing transformation.
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: http://reviews.llvm.org/D21491
  llvm-svn: 276627
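  The tiling step itself amounts to a call to isl_schedule_node_band_tile
  with the cache-level tile sizes; a hedged sketch (the helper name is
  invented, the sizes are illustrative, and the interchange step is omitted):

    #include <isl/schedule_node.h>
    #include <isl/space.h>
    #include <isl/val.h>

    // Tile the three gemm loops with cache-level tile sizes Mc, Kc, Nc.
    // Assumes the band has exactly three members; the real code also
    // permutes band members afterwards to form the macro-kernel loop order.
    static isl_schedule_node *createMacroKernelTiles(isl_schedule_node *Band,
                                                     int Mc, int Kc, int Nc) {
      isl_ctx *Ctx = isl_schedule_node_get_ctx(Band);
      isl_space *Space = isl_schedule_node_band_get_space(Band);
      isl_multi_val *Sizes = isl_multi_val_zero(Space);
      int TileSizes[3] = {Mc, Kc, Nc};
      for (int i = 0; i < 3; ++i)
        Sizes = isl_multi_val_set_val(Sizes, i,
                                      isl_val_int_from_si(Ctx, TileSizes[i]));
      return isl_schedule_node_band_tile(Band, Sizes);
    }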
* [NFC] Refactor creation of the BLIS micro-kernel and improve documentation | Roman Gareev | 2016-07-25 | 1 | -5/+28
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  llvm-svn: 276616
* Propagate on-error status | Tobias Grosser | 2016-06-30 | 1 | -1/+2
  This ensures that the error status set with -polly-on-isl-error-abort is
  maintained even after running DependenceInfo and ScheduleOptimizer. Both
  passes temporarily set the error status to CONTINUE as the dependence
  analysis uses a compute-out and the scheduler may not be able to derive a
  schedule. In both cases we want to not abort, but to handle the error
  gracefully. Before this commit, we always set the error reporting to ABORT
  after these passes. After this commit, we use the error reporting mode that
  was active earlier.
  This comes without a test case as this would require us to introduce
  (memory) errors which would trigger the isl errors.
  llvm-svn: 274272
* Simplify: get isl_ctx only once [NFC] | Tobias Grosser | 2016-06-30 | 1 | -10/+11
  ... instead of calling S.getIslCtx() many times.
  llvm-svn: 274271
* clang-tidy: Add llvm namespace comments | Tobias Grosser | 2016-06-23 | 1 | -1/+1
  llvm commonly adds a comment to the closing brace of a namespace to
  indicate which namespace is closed. clang-tidy provides with
  llvm-namespace-comment a handy tool to check for this habit. We use it to
  ensure we consistently use namespace comments in Polly.
  There are slightly different styles in how namespaces are closed in LLVM.
  As there is no large difference between the different comment styles we go
  for the style clang-tidy suggests by default.
  To reproduce this fix run:
  for i in `ls tools/polly/lib/*/*.cpp`; do
    clang-tidy -checks='-*,llvm-namespace-comment' -p build $i -fix \
      -header-filter=".*"
  done
  This cleanup was suggested by Eugene Zelenko <eugene.zelenko@gmail.com> in
  http://reviews.llvm.org/D21488 and was split out to increase readability.
  llvm-svn: 273621
* clang-tidy: apply modern-use-nullptr fixes | Tobias Grosser | 2016-06-22 | 1 | -2/+2
  Instead of using 0 or NULL use the C++11 nullptr symbol when referencing
  null pointers.
  This cleanup was suggested by Eugene Zelenko <eugene.zelenko@gmail.com> in
  http://reviews.llvm.org/D21488 and was split out to increase readability.
  llvm-svn: 273435
* [NFC] Use isl_schedule_node_band_n_member to get the number of dimensions of a band node. | Roman Gareev | 2016-06-22 | 1 | -9/+4
  llvm-svn: 273400
* Apply all necessary tilings and unrollings to get a micro-kernel | Roman Gareev | 2016-06-22 | 1 | -6/+49
  This is the first patch to apply the BLIS matmul optimization pattern on
  matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).
  BLIS implements gemm as three nested loops around a macro-kernel, plus two
  packing routines. The macro-kernel is implemented in terms of two
  additional loops around a micro-kernel. The micro-kernel is a loop around a
  rank-1 (i.e., outer product) update. In this change we create the BLIS
  micro-kernel by applying a combination of tiling and unrolling. In
  subsequent changes we will add the extraction of the BLIS macro-kernel and
  implement the packing transformation.
  Contributed-by: Roman Gareev <gareevroman@gmail.com>
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: http://reviews.llvm.org/D21140
  llvm-svn: 273397
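  After register tiling, the point loops are unrolled by setting their AST
  loop type; an illustrative sketch against the isl C API (not the exact
  Polly code):

    #include <isl/ast_type.h>
    #include <isl/schedule_node.h>

    // Mark every member of the (register-tiled) point band for unrolling, so
    // the AST generator emits the Mr x Nr rank-1 update as straight-line code.
    static isl_schedule_node *markBandForUnroll(isl_schedule_node *PointBand) {
      int N = isl_schedule_node_band_n_member(PointBand);
      for (int i = 0; i < N; ++i)
        PointBand = isl_schedule_node_band_member_set_ast_loop_type(
            PointBand, i, isl_ast_loop_unroll);
      return PointBand;
    }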
* [NFC] Outline the application of register tiling. | Roman Gareev | 2016-06-12 | 1 | -7/+14
  llvm-svn: 272515
* [NFC] "#include <ciso646>" is unnecessary, because "and", "or" were replaced by "&&", "||". | Roman Gareev | 2016-06-08 | 1 | -1/+0
  llvm-svn: 272168
* [NFC] Check that a parameter of ScheduleTreeOptimizer::isMatrMultPattern contains a correct partial schedule | Roman Gareev | 2016-06-04 | 1 | -0/+3
  llvm-svn: 271780
* [FIX] Fix potential issue related to subtraction from an unsigned 0 in circularShiftOutputDims | Roman Gareev | 2016-06-03 | 1 | -1/+3
  Reported-by: Mehdi Amini <mehdi.amini@apple.com>
  Contributed-by: Michael Kruse <llvm@meinersbur.de>
  Differential Revision: http://reviews.llvm.org/D20969
  llvm-svn: 271705
* [GSoC 2016] [Polly] [FIX] Determination of statements that contain matrix multiplication | Roman Gareev | 2016-05-31 | 1 | -7/+7
  Fix small issues related to characters, operators and descriptions of
  tests.
  Differential Revision: http://reviews.llvm.org/D20806
  llvm-svn: 271264
* Decouple SCoP building logic from pass | Johannes Doerfert | 2016-05-31 | 1 | -1/+1
  Created a new pass ScopInfoRegionPass. As the name suggests, it is a region
  pass and it is there to preserve compatibility with our existing Polly
  passes. ScopInfoRegionPass will return a SCoP object for a valid region
  while the creation of the SCoP stays in the ScopInfo class.
  Contributed-by: Utpal Bora <cs14mtech11017@iith.ac.in>
  Reviewed-by: Tobias Grosser <tobias@grosser.es>, Johannes Doerfert
  <doerfert@cs.uni-saarland.de>
  Differential Revision: http://reviews.llvm.org/D20770
  llvm-svn: 271259
* MSVC compile fix: #include <ciso646>. NFC. | Michael Kruse | 2016-05-30 | 1 | -0/+1
  This header is required to make the ISO 646 alternative operator spellings
  ("and", "or" instead of "&&", "||") work. Should these operators be
  replaced by the standard ones as already suggested by Johannes, also remove
  this #include again.
  llvm-svn: 271206
* Determination of statements that contain matrix multiplication | Roman Gareev | 2016-05-28 | 1 | -5/+111
  Add determination of statements that contain, in particular, matrix
  multiplications and can be optimized with [1] to try to get close-to-peak
  performance. It can be enabled via polly-pm-based-opts, which is false by
  default.
  Refs:
  [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
  Contributed-by: Roman Gareev <gareevroman@gmail.com>
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: http://reviews.llvm.org/D20575
  llvm-svn: 271128
* [ScheduleOptimizer] Add -polly-opt-outer-coincidence option. | Michael Kruse | 2016-05-02 | 1 | -0/+20
  Add a command line switch to set the
  isl_options_set_schedule_outer_coincidence option. ISL then tries to build
  schedules where the outer member of a band satisfies the coincidence
  constraints.
  In practice this allows loop skewing for more parallelism in inner loops.
  llvm-svn: 268222
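  The switch boils down to forwarding a boolean to an isl scheduler option; a
  minimal sketch of the wiring (the flag handling shown here is an
  assumption):

    #include <isl/ctx.h>
    #include <isl/schedule.h>

    // Ask isl's scheduler to construct bands whose outermost member satisfies
    // the coincidence constraints, enabling skewing for outer parallelism.
    // OuterCoincidence would come from a flag such as -polly-opt-outer-coincidence.
    static void configureOuterCoincidence(isl_ctx *Ctx, bool OuterCoincidence) {
      isl_options_set_schedule_outer_coincidence(Ctx, OuterCoincidence ? 1 : 0);
    }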
* Allow the client of DependenceInfo to obtain dependences at different granularities. | Hongbin Zheng | 2016-03-03 | 1 | -1/+2
  llvm-svn: 262591
* Annotation of SIMD loops | Roman Gareev | 2016-02-23 | 1 | -0/+4
  Use 'mark' nodes to annotate a SIMD loop during ScheduleTransformation and
  skip parallelism checks.
  The buildbot shows the following compile/execution time changes:
  Compile time improvements:
                     Δ        Previous   Current   σ
    …/gesummv        -6.06%   0.2640     0.2480    0.0055
    …/gemver         -4.46%   0.4480     0.4280    0.0044
    …/covariance     -4.31%   0.8360     0.8000    0.0065
    …/adi            -3.23%   0.9920     0.9600    0.0065
    …/doitgen        -2.53%   0.9480     0.9240    0.0090
    …/3mm            -2.33%   1.0320     1.0080    0.0087
  Execution time regressions:
                     Δ        Previous   Current   σ
    …/viterbi        1.70%    5.1840     5.2720    0.0074
    …/smallpt        1.06%    12.4920    12.6240   0.0040
  Reviewed-by: Tobias Grosser <tobias@grosser.es>
  Differential Revision: http://reviews.llvm.org/D14491
  llvm-svn: 261620
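  Inserting such a mark node above the prepared band is a small isl schedule
  tree operation; a minimal sketch (the mark name "SIMD" matches the intent
  of the change but is shown here only as an assumption):

    #include <isl/id.h>
    #include <isl/schedule_node.h>

    // Wrap the band prepared for vectorization in a mark node, so later
    // phases can recognize it and skip redundant parallelism checks.
    static isl_schedule_node *markSIMDBand(isl_schedule_node *Band) {
      isl_ctx *Ctx = isl_schedule_node_get_ctx(Band);
      isl_id *MarkId = isl_id_alloc(Ctx, "SIMD", nullptr);
      return isl_schedule_node_insert_mark(Band, MarkId);
    }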