This article is about compiler optimization for loops. For the more specific technique of overhead reduction of loop nests, see Loop nest optimization.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most of the execution time of a scientific program is spent in loops; as such, many compiler optimization techniques have been developed to make them faster.

Representation of computation and transformations

Since instructions inside loops can be executed repeatedly, it is frequently not possible to give a bound on the number of instruction executions that will be affected by a loop optimization. This presents challenges when reasoning about the correctness and benefits of a loop optimization, specifically in the choice of representation for the computation being optimized and for the optimization(s) being performed.[1]

Optimization via a sequence of loop transformations

Loop optimization can be viewed as the application of a sequence of specific loop transformations (listed below, or described in Compiler transformations for high-performance computing[2]) to the source code or intermediate representation, with each transformation having an associated test for legality. A transformation (or sequence of transformations) generally must preserve the temporal sequence of all dependencies if it is to preserve the result of the program (i.e., be a legal transformation). Evaluating the benefit of a transformation or sequence of transformations within this approach can be quite difficult, because the application of one beneficial transformation may require the prior use of one or more other transformations that, by themselves, would reduce performance.

Common loop transformations include:

Fission or distribution – loop fission attempts to break a loop into multiple loops over the same index range, with each new loop taking only part of the original loop's body. This can improve locality of reference, both of the data being accessed in the loop and of the code in the loop's body (sketched after this list).

Fusion or combining – this combines the bodies of two adjacent loops that would iterate the same number of times (whether or not that number is known at compile time), as long as they make no reference to each other's data.

Interchange or permutation – these optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of reference, depending on the array's layout (sketched after this list).

Inversion – this technique changes a standard while loop into a do/while (a.k.a. repeat/until) loop wrapped in an if conditional, reducing the number of jumps by two in cases where the loop is executed. Doing so duplicates the condition check (increasing the size of the code) but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile time and is known to be side-effect-free, the initial if guard can be skipped (sketched after this list).

Loop-invariant code motion – this can vastly improve efficiency by moving a computation from inside the loop to outside of it, computing a value just once before the loop begins, if the result of the calculation will be the same for every loop iteration (i.e., a loop-invariant quantity). This is particularly important for the address-calculation expressions generated by loops over arrays. For a correct implementation, this technique must be used together with inversion, because not all code is safe to move outside the loop (sketched after this list).

Parallelization – a special case of automatic parallelization focusing on loops, restructuring them to run efficiently on multiprocessor systems. It can be done automatically by compilers (automatic parallelization) or manually (by inserting parallel directives such as OpenMP).

Reversal – a subtle optimization that reverses the order in which values are assigned to the index variable. This can help eliminate dependencies and thus enable other optimizations. Certain architectures provide looping constructs at the assembly level that count in a single direction only (e.g., decrement-jump-if-not-zero).[5]

Scheduling – this divides a loop into multiple parts that may be run concurrently on multiple processors.

Skewing – this technique is applied to a nested loop iterating over a multidimensional array in which each iteration of the inner loop depends on previous iterations; it rearranges the array accesses so that the only dependencies are between iterations of the outer loop.

Software pipelining – a type of out-of-order execution of loop iterations used to hide the latencies of processor function units.

Splitting or peeling – this attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops that have the same bodies but iterate over different portions of the index range. A special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop (sketched after this list).

Tiling or blocking – reorganizes a loop to iterate over blocks of data sized to fit in the cache (sketched after this list).

Vectorization – attempts to run as many of the loop iterations as possible at the same time on a SIMD system.

Unrolling – duplicates the body of the loop multiple times in order to decrease the number of times the loop condition is tested and the number of jumps, both of which hurt performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches and increased program load time), but requires that the number of iterations be known at compile time (except in the case of just-in-time compilation). Care must also be taken to ensure that repeatedly recalculating indexed variables does not incur a greater overhead than advancing pointers would have in the original loop (sketched after this list).

Unswitching – moves a conditional from inside a loop to outside of it by duplicating the loop's body and placing a version of it inside each of the if and else clauses of the conditional (sketched after this list).

Sectioning or strip-mining – introduced for vector processors, loop sectioning is a loop transformation technique for enabling SIMD (single instruction, multiple data) encodings of loops and for improving memory performance. Each vector operation is done for a size less than or equal to the maximum vector length of the given vector machine (sketched after this list).[3][4]
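The sketches that follow illustrate several of these transformations in C. They are schematic: the function names, array sizes, and constants are invented for illustration, and a compiler would apply such rewrites to its intermediate representation rather than to source text. First, fission:

    /* Before fission: one loop updates two independent arrays, so the two
       data streams compete for the same cache. */
    void update(double *a, double *b, int n) {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + 1.0;
            b[i] = b[i] * 2.0;
        }
    }

    /* After fission: each loop touches one array over the same index range. */
    void update_fissioned(double *a, double *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] + 1.0;
        for (int i = 0; i < n; i++)
            b[i] = b[i] * 2.0;
    }

Read bottom to top, the same pair illustrates fusion, which is legal here because neither loop references data produced by the other.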
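Interchange, sketched for a C array, whose layout is row-major; the size N is an arbitrary example value:

    #define N 1024

    /* Before: the inner loop varies the row index, so consecutive accesses
       are N elements apart in memory. */
    void zero_colwise(double a[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0;
    }

    /* After interchange: the inner loop walks consecutive elements of one
       row, matching the row-major layout. */
    void zero_rowwise(double a[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
    }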
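Inversion, with the duplicated test guarding a bottom-tested loop:

    /* Before: a top-tested loop.  Each iteration executes a conditional
       branch at the top plus an unconditional jump back to the test. */
    long sum_while(const long *a, int n) {
        long s = 0;
        int i = 0;
        while (i < n) {
            s += a[i];
            i++;
        }
        return s;
    }

    /* After inversion: a do/while wrapped in an if.  The test is duplicated,
       but an executing iteration now ends in a single conditional branch. */
    long sum_inverted(const long *a, int n) {
        long s = 0;
        int i = 0;
        if (i < n) {
            do {
                s += a[i];
                i++;
            } while (i < n);
        }
        return s;
    }

When the initial value of the condition is known at compile time and is side-effect-free, the if guard can be dropped, as the item above notes.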
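Loop-invariant code motion; the hoisted expression here is a side-effect-free multiplication, so it is safe to evaluate even if the loop body never runs:

    /* Before: x * y is recomputed on every iteration even though it cannot
       change inside the loop. */
    void scale(double *a, int n, double x, double y) {
        for (int i = 0; i < n; i++)
            a[i] = x * y * a[i];
    }

    /* After hoisting: the invariant product is computed once, before the
       loop.  Hoisted code that could fault or have side effects would have
       to stay under a guard proving the loop runs at least once, which is
       why the text pairs this transformation with inversion. */
    void scale_licm(double *a, int n, double x, double y) {
        double xy = x * y;
        for (int i = 0; i < n; i++)
            a[i] = xy * a[i];
    }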
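Loop peeling, removing a first-iteration special case from the loop body:

    /* Before: the test i == 0 is evaluated n times but is true only once. */
    void smooth(double *a, int n) {
        for (int i = 0; i < n; i++) {
            if (i == 0)
                a[i] = a[i] / 2.0;
            else
                a[i] = (a[i] + a[i - 1]) / 2.0;
        }
    }

    /* After peeling the first iteration: the loop body loses the branch. */
    void smooth_peeled(double *a, int n) {
        if (n > 0)
            a[0] = a[0] / 2.0;
        for (int i = 1; i < n; i++)
            a[i] = (a[i] + a[i - 1]) / 2.0;
    }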
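Tiling, shown for a matrix transpose; the tile edge B is a tuning parameter chosen so that the tiles in use fit in cache, and the values here are placeholders (the sketch assumes M is a multiple of B):

    #define M 2048
    #define B 64   /* illustrative tile edge; assumes M % B == 0 */

    /* Before: accesses to dst stride by M, so for large M nearly every
       access touches a different cache line. */
    void transpose(double *dst, const double *src) {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
                dst[j * M + i] = src[i * M + j];
    }

    /* After tiling: the iteration space is covered in B x B blocks small
       enough to stay cache-resident while they are in use. */
    void transpose_tiled(double *dst, const double *src) {
        for (int ii = 0; ii < M; ii += B)
            for (int jj = 0; jj < M; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[j * M + i] = src[i * M + j];
    }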
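Unrolling by a factor of four, with a second loop picking up the iterations left over when n is not a multiple of four:

    /* Before: one test and one backward jump per element. */
    long sum(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* After unrolling by 4: the loop condition is tested roughly a quarter
       as often. */
    long sum_unrolled(const long *a, int n) {
        long s = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)   /* epilogue for the remaining 0-3 elements */
            s += a[i];
        return s;
    }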
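Unswitching a loop-invariant conditional:

    /* Before: the flag cannot change inside the loop, yet it is tested on
       every iteration. */
    void step(double *a, const double *b, int n, int add) {
        for (int i = 0; i < n; i++) {
            if (add)
                a[i] += b[i];
            else
                a[i] -= b[i];
        }
    }

    /* After unswitching: the conditional moves outside the loop, and each
       clause receives its own copy of the loop body. */
    void step_unswitched(double *a, const double *b, int n, int add) {
        if (add) {
            for (int i = 0; i < n; i++)
                a[i] += b[i];
        } else {
            for (int i = 0; i < n; i++)
                a[i] -= b[i];
        }
    }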
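Strip-mining, with VL standing in for the maximum vector length of the target machine (the value is illustrative):

    #define VL 8   /* stand-in for the machine's maximum vector length */

    /* The single loop is divided into strips of at most VL iterations;
       each strip then corresponds to one vector operation of legal size. */
    void vadd(double *c, const double *a, const double *b, int n) {
        for (int is = 0; is < n; is += VL) {
            int lim = (is + VL < n) ? is + VL : n;
            for (int i = is; i < lim; i++)
                c[i] = a[i] + b[i];
        }
    }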
The unimodular transformation framework

The unimodular transformation approach uses a single unimodular matrix to describe the combined result of a sequence of many of the above transformations. Central to this approach is the view of the set of all executions of a statement within n loops as a set of integer points in an n-dimensional space, with the points being executed in lexicographical order. For example, the executions of a statement nested inside an outer loop with index i and an inner loop with index j can be associated with the pairs of integers (i, j). Applying a unimodular transformation corresponds to multiplying the points within this space by the matrix. For example, the interchange of the two loops corresponds to the matrix

    \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
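As a check that this matrix does express interchange, multiplying an iteration point (i, j) by it swaps the two indices, so the statement instance is visited at position (j, i) of the transformed nest:

    \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
    \begin{bmatrix} i \\ j \end{bmatrix}
    =
    \begin{bmatrix} j \\ i \end{bmatrix}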
A unimodular transformation is legal if it preserves the temporal sequence of all dependencies; measuring the performance impact of a unimodular transformation is more difficult. Imperfectly nested loops and some transformations (such as tiling) do not fit easily into this framework.

The polyhedral or constraint-based framework

The polyhedral model handles a wider class of programs and transformations than the unimodular framework. The set of executions of a set of statements within a possibly imperfectly nested set of loops is seen as the union of a set of polytopes representing the executions of the statements. Affine transformations are applied to these polytopes, producing a description of a new execution order. The boundaries of the polytopes, the data dependencies, and the transformations are often described using systems of constraints, and this approach is often referred to as a constraint-based approach to loop optimization. For example, a single statement within an outer loop 'for i := 0 to n' and an inner loop 'for j := 0 to i+2' is executed once for each (i, j) pair such that 0 <= i <= n and 0 <= j <= i+2.

Once again, a transformation is legal if it preserves the temporal sequence of all dependencies. Estimating the benefits of a transformation, or finding the best transformation for a given code on a given computer, remains (as of 2010) a subject of ongoing research.

See also

Loop nest optimization
Polytope model
Scalable parallelism
Scalable locality

References

1. In the book Reasoning About Program Transformations, Jean-François Collard discusses in depth the general question of representing executions of programs, rather than program text, in the context of static optimization.
2. David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler transformations for high-performance computing. Report No. UCB/CSD 93/781, Computer Science Division–EECS, University of California, Berkeley, Berkeley, California 94720, November 1993 (available at CiteSeer). Introduces compiler analyses such as data dependence analysis and interprocedural analysis, as well as a very complete list of loop transformations.
3. "7.6.3.1 Strip-Mining (Sun Studio 12: Fortran Programming Guide)".
4. "Intel Developer Zone".
5. "8051 Instruction Set". www.win.tue.nl. Retrieved 2019-12-09.
6. Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997. Section 20.4.2 discusses loop optimization.
7. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.