Diffstat (limited to 'llvm/docs/Frontend')
-rw-r--r-- | llvm/docs/Frontend/PerformanceTips.rst | 60 |
1 files changed, 57 insertions, 3 deletions
diff --git a/llvm/docs/Frontend/PerformanceTips.rst b/llvm/docs/Frontend/PerformanceTips.rst
index 5d7ad590464..d8c04651f0a 100644
--- a/llvm/docs/Frontend/PerformanceTips.rst
+++ b/llvm/docs/Frontend/PerformanceTips.rst
@@ -53,13 +53,25 @@ Other things to consider
 #. Make sure that a DataLayout is provided (this will likely become required in
    the near future, but is certainly important for optimization).
 
-#. Add nsw/nuw/fast-math flags as appropriate
+#. Add nsw/nuw flags as appropriate. Reasoning about overflow is
+   generally hard for an optimizer so providing these facts from the frontend
+   can be very impactful. For languages which need overflow semantics,
+   consider using the :ref:`overflow intrinsics <int_overflow>`.
+
+#. Use fast-math flags on floating point operations if legal. If you don't
+   need strict IEEE floating point semantics, there are a number of additional
+   optimizations that can be performed. This can be highly impactful for
+   floating point intensive computations.
+
+#. Use inbounds on geps. This can help to disambiguate some aliasing queries.
 
 #. Add noalias/align/dereferenceable/nonnull to function arguments and return
    values as appropriate
 
-#. Mark functions as readnone/readonly/nounwind when known (especially for
-   external functions)
+#. Mark functions as readnone/readonly or noreturn/nounwind when known. The
+   optimizer will try to infer these flags, but may not always be able to.
+   Manual annotations are particularly important for external functions that
+   the optimizer can not analyze.
 
 #. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
    analysis), prefer GEPs
@@ -85,9 +97,51 @@ Other things to consider
    and may not be well optimized by the current optimizer. Depending on your
    source language, you may consider using fences instead.
 
+#. If calling a function which is known to throw an exception (unwind), use
+   an invoke with a normal destination which contains an unreachable
+   instruction. This form conveys to the optimizer that the call returns
+   abnormally. For an invoke which neither returns normally or requires unwind
+   code in the current function, you can use a noreturn call instruction if
+   desired. This is generally not required because the optimizer will convert
+   an invoke with an unreachable unwind destination to a call instruction.
+
 #. If you language uses range checks, consider using the IRCE pass. It is not
    currently part of the standard pass order.
 
+#. For languages with numerous rarely executed guard conditions (e.g. null
+   checks, type checks, range checks) consider adding an extra execution or
+   two of LoopUnswith and LICM to your pass order. The standard pass order,
+   which is tuned for C and C++ applications, may not be sufficient to remove
+   all dischargeable checks from loops.
+
+#. Use profile metadata to indicate statically known cold paths, even if
+   dynamic profiling information is not available. This can make a large
+   difference in code placement and thus the performance of tight loops.
+
+#. When generating code for loops, try to avoid terminating the header block of
+   the loop earlier than necessary. If the terminator of the loop header
+   block is a loop exiting conditional branch, the effectiveness of LICM will
+   be limited for loads not in the header. (This is due to the fact that LLVM
+   may not know such a load is safe to speculatively execute and thus can't
+   lift an otherwise loop invariant load unless it can prove the exiting
+   condition is not taken.) It can be profitable, in some cases, to emit such
+   instructions into the header even if they are not used along a rarely
+   executed path that exits the loop. This guidance specifically does not
+   apply if the condition which terminates the loop header is itself invariant,
+   or can be easily discharged by inspecting the loop index variables.
+
+#. In hot loops, consider duplicating instructions from small basic blocks
+   which end in highly predictable terminators into their successor blocks.
+   If a hot successor block contains instructions which can be vectorized
+   with the duplicated ones, this can provide a noticeable throughput
+   improvement. Note that this is not always profitable and does involve a
+   potentially large increase in code size.
+
+#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
+   of predecessors). Among other issues, the register allocator is known to
+   perform badly with confronted with such structures. The only exception to
+   this guidance is that a unified return block with high in-degree is fine.
+
 p.s. If you want to help improve this document, patches expanding any of the
 above items into standalone sections of their own with a more complete
 discussion would be very welcome.
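
Note on the overflow tip in the first hunk: the patch points frontends that need defined overflow behaviour at the overflow intrinsics instead of plain ``add`` with nsw/nuw. The sketch below is not part of the patch. It shows the source-level shape of such a check in C++ using the GCC/Clang extension ``__builtin_add_overflow``, which Clang typically lowers to ``llvm.sadd.with.overflow`` for signed operands. The helper name ``checked_add`` is hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper (not from the patch): checked signed 32-bit addition.
// Returns true and stores a + b in `sum` when the addition does not overflow.
// Returns false when it overflows; `sum` then holds the wrapped result.
// Assumes a GCC- or Clang-compatible compiler, since __builtin_add_overflow
// is a compiler extension rather than standard C++.
bool checked_add(std::int32_t a, std::int32_t b, std::int32_t &sum) {
    return !__builtin_add_overflow(a, b, &sum);
}
```

A frontend for a language with trapping or checked arithmetic would emit the equivalent intrinsic call directly and branch on the returned overflow bit, so the optimizer sees the overflow condition explicitly.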