1 files changed, 57 insertions, 3 deletions
diff --git a/llvm/docs/Frontend/PerformanceTips.rst b/llvm/docs/Frontend/PerformanceTips.rst
index 5d7ad590464..d8c04651f0a 100644
--- a/llvm/docs/Frontend/PerformanceTips.rst
+++ b/llvm/docs/Frontend/PerformanceTips.rst
@@ -53,13 +53,25 @@ Other things to consider
 #. Make sure that a DataLayout is provided (this will likely become required in
    the near future, but is certainly important for optimization).
 
-#. Add nsw/nuw/fast-math flags as appropriate
+#. Add nsw/nuw flags as appropriate.  Reasoning about overflow is 
+   generally hard for an optimizer so providing these facts from the frontend 
+   can be very impactful.  For languages which need overflow semantics, 
+   consider using the :ref:`overflow intrinsics <int_overflow>`.
+
+#. Use fast-math flags on floating point operations if legal.  If you don't 
+   need strict IEEE floating point semantics, there are a number of additional 
+   optimizations that can be performed.  This can be highly impactful for 
+   floating point intensive computations.
+
+#. Use inbounds on geps.  This can help to disambiguate some aliasing queries.
 
 #. Add noalias/align/dereferenceable/nonnull to function arguments and return 
    values as appropriate
 
-#. Mark functions as readnone/readonly/nounwind when known (especially for 
-   external functions)
+#. Mark functions as readnone/readonly or noreturn/nounwind when known.  The 
+   optimizer will try to infer these flags, but may not always be able to.  
+   Manual annotations are particularly important for external functions that 
+   the optimizer can not analyze.
 
 #. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing 
    analysis), prefer GEPs
@@ -85,9 +97,51 @@ Other things to consider
    and may not be well optimized by the current optimizer.  Depending on your
    source language, you may consider using fences instead.
 
+#. If calling a function which is known to throw an exception (unwind), use 
+   an invoke with a normal destination which contains an unreachable 
+   instruction.  This form conveys to the optimizer that the call returns 
+   abnormally.  For an invoke which neither returns normally or requires unwind
+   code in the current function, you can use a noreturn call instruction if 
+   desired.  This is generally not required because the optimizer will convert
+   an invoke with an unreachable unwind destination to a call instruction.
+
 #. If you language uses range checks, consider using the IRCE pass.  It is not 
    currently part of the standard pass order.
 
+#. For languages with numerous rarely executed guard conditions (e.g. null 
+   checks, type checks, range checks) consider adding an extra execution or 
+   two of LoopUnswith and LICM to your pass order.  The standard pass order, 
+   which is tuned for C and C++ applications, may not be sufficient to remove 
+   all dischargeable checks from loops.
+
+#. Use profile metadata to indicate statically known cold paths, even if 
+   dynamic profiling information is not available.  This can make a large 
+   difference in code placement and thus the performance of tight loops.
+
+#. When generating code for loops, try to avoid terminating the header block of
+   the loop earlier than necessary.  If the terminator of the loop header 
+   block is a loop exiting conditional branch, the effectiveness of LICM will
+   be limited for loads not in the header.  (This is due to the fact that LLVM 
+   may not know such a load is safe to speculatively execute and thus can't 
+   lift an otherwise loop invariant load unless it can prove the exiting 
+   condition is not taken.)  It can be profitable, in some cases, to emit such 
+   instructions into the header even if they are not used along a rarely 
+   executed path that exits the loop.  This guidance specifically does not 
+   apply if the condition which terminates the loop header is itself invariant,
+   or can be easily discharged by inspecting the loop index variables.
+
+#. In hot loops, consider duplicating instructions from small basic blocks 
+   which end in highly predictable terminators into their successor blocks.  
+   If a hot successor block contains instructions which can be vectorized 
+   with the duplicated ones, this can provide a noticeable throughput
+   improvement.  Note that this is not always profitable and does involve a 
+   potentially large increase in code size.
+
+#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
+   of predecessors).  Among other issues, the register allocator is known to 
+   perform badly with confronted with such structures.  The only exception to 
+   this guidance is that a unified return block with high in-degree is fine.
+
 p.s. If you want to help improve this document, patches expanding any of the 
 above items into standalone sections of their own with a more complete 
 discussion would be very welcome.