Loop optimization

This article is about compiler optimization for loops. For the more specific technique of overhead reduction of loop nests, see Loop nest optimization.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most execution time of a scientific program is spent on loops; as such, many compiler optimization techniques have been developed to make them faster.
Representation of computation and transformations

Since instructions inside loops can be executed repeatedly, it is frequently not possible to give a bound on the number of instruction executions that will be impacted by a loop optimization. This presents challenges when reasoning about the correctness and benefits of a loop optimization, specifically the representations of the computation being optimized and the optimization(s) being performed.
Optimization via a sequence of loop transformations

Loop optimization can be viewed as the application of a sequence of specific loop transformations (listed below or in Compiler transformations for high-performance computing) to the source code or intermediate representation, with each transformation having an associated test for legality. A transformation (or sequence of transformations) generally must preserve the temporal sequence of all dependencies if it is to preserve the result of the program (i.e., be a legal transformation). Evaluating the benefit of a transformation or sequence of transformations can be quite difficult within this approach, as the application of one beneficial transformation may require the prior use of one or more other transformations that, by themselves, would result in reduced performance.
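As a minimal sketch of how one transformation can enable another, consider loop inversion followed by loop-invariant code motion. The C functions below are illustrative only (the names and parameters are not from any particular compiler's output); inversion by itself merely duplicates the condition test, but the guard it introduces is what makes hoisting the invariant product clearly profitable even when the loop may execute zero times.

    /* Original: the loop may execute zero times, so work hoisted above it
       would be done needlessly whenever i >= n. */
    void fill_original(int *a, int i, int n, int x, int y) {
        while (i < n) {
            a[i] = x * y;   /* x * y is loop-invariant */
            i++;
        }
    }

    /* After loop inversion (an if-guarded do/while) followed by
       loop-invariant code motion: the product is computed once,
       and only when the loop is known to execute. */
    void fill_transformed(int *a, int i, int n, int x, int y) {
        if (i < n) {
            int t = x * y;  /* hoisted out of the loop body */
            do {
                a[i] = t;
                i++;
            } while (i < n);
        }
    }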
Common loop transformations include:

- Fission or distribution – loop fission attempts to break a loop into multiple loops over the same index range, but each new loop takes only part of the original loop's body. This can improve locality of reference, both of the data being accessed in the loop and the code in the loop's body.
- Fusion or combining – this combines the bodies of two adjacent loops that would iterate the same number of times (whether or not that number is known at compile time), as long as they make no reference to each other's data.
- Interchange or permutation – these optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of reference, depending on the array's layout.
- Inversion – this technique changes a standard while loop into a do/while (a.k.a. repeat/until) loop wrapped in an if conditional, reducing the number of jumps by two for cases where the loop is executed. Doing so duplicates the condition check (increasing the size of the code) but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile-time and is known to be side-effect-free, the if-guard can be skipped.
- Loop-invariant code motion – this can vastly improve efficiency by moving a computation from inside the loop to outside of it, computing a value just once before the loop begins, if the resultant quantity of the calculation will be the same for every loop iteration (i.e., a loop-invariant quantity). This is particularly important with address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with inversion, because not all code is safe to be moved outside the loop, as the sketch above illustrates.
- Parallelization – a special case of automatic parallelization focusing on loops, restructuring them to run efficiently on multiprocessor systems. It can be done automatically by compilers (automatic parallelization) or manually (inserting parallel directives like OpenMP).
- Reversal – a subtle optimization that reverses the order in which values are assigned to the index variable. This can help eliminate dependencies and thus enable other optimizations. Certain architectures utilize looping constructs at assembly level that count in a single direction only (e.g., decrement-jump-if-not-zero).
- Scheduling – this divides a loop into multiple parts that may be run concurrently on multiple processors.
- Skewing – this technique is applied to a nested loop iterating over a multidimensional array, where each iteration of the inner loop depends on previous iterations, and rearranges its array accesses so that the only dependencies are between iterations of the outer loop.
- Software pipelining – a type of out-of-order execution of loop iterations to hide the latencies of processor function units.
- Splitting or peeling – this attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different portions of the index range. A special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.
- Tiling or blocking – reorganizes a loop to iterate over blocks of data sized to fit in the cache (see the second sketch after this list).
- Vectorization – attempts to run as many of the loop iterations as possible at the same time on a SIMD system.
- Unrolling – duplicates the body of the loop multiple times, in order to decrease the number of times the loop condition is tested and the number of jumps, which hurt performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches and increased program load time), but requires that the number of iterations be known at compile time (except in the case of just-in-time compilation). Care must also be taken to ensure that multiple re-calculation of indexed variables is not a greater overhead than advancing pointers within the original loop (see the first sketch after this list).
- Unswitching – moves a conditional from inside a loop to outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional.
- Sectioning or strip-mining – introduced for vector processors, loop-sectioning is a loop-transformation technique for enabling SIMD (single instruction, multiple data) encodings of loops and improving memory performance. This involves each vector operation being done for a size less than or equal to the maximum vector length on a given vector machine.
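A minimal sketch of unrolling in C (the function names are illustrative, and the unrolled version assumes the trip count is a multiple of four; a general-purpose version would need a cleanup loop for the remainder):

    /* Original loop: the condition is tested and a jump taken once per element. */
    void add_one(int *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] += 1;
    }

    /* Unrolled by four: the test and jump execute n/4 times instead of n times.
       This sketch assumes n is a multiple of 4. */
    void add_one_unrolled(int *a, int n) {
        for (int i = 0; i < n; i += 4) {
            a[i]     += 1;
            a[i + 1] += 1;
            a[i + 2] += 1;
            a[i + 3] += 1;
        }
    }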
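And a sketch of tiling (blocking) applied to a matrix transpose. The dimensions, block size, and function names are illustrative assumptions; choosing the block size so that the data touched by one block fits in cache is the point of the transformation.

    #define N 1024
    #define B 64    /* illustrative block size, chosen so a block fits in cache */

    /* Original: for large N, the rows of src and the columns of dst
       keep evicting each other from the cache. */
    void transpose(double dst[N][N], double src[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                dst[j][i] = src[i][j];
    }

    /* Tiled (blocked): the same work is done one B x B block at a time,
       so the data a block touches stays cache-resident while it is processed.
       N is assumed to be a multiple of B. */
    void transpose_tiled(double dst[N][N], double src[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[j][i] = src[i][j];
    }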
The unimodular transformation framework

The unimodular transformation approach uses a single unimodular matrix to describe the combined result of a sequence of many of the above transformations. Central to this approach is the view of the set of all executions of a statement within n loops as a set of integer points in an n-dimensional space, with the points being executed in lexicographical order. For example, the executions of a statement nested inside an outer loop with index i and an inner loop with index j can be associated with the pairs of integers (i, j). The application of a unimodular transformation corresponds to the multiplication of the points within this space by the matrix. For example, the interchange of two loops corresponds to the matrix

    \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}

A unimodular transformation is legal if it preserves the temporal sequence of all dependencies; measuring the performance impact of a unimodular transformation is more difficult. Imperfectly nested loops and some transformations (such as tiling) do not fit easily into this framework.
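As a concrete sketch (illustrative C, not the output of any particular tool): applying that matrix to each iteration vector (i, j) yields the new coordinates (j, i), which corresponds exactly to the interchanged loop nest.

    #define ROWS 4
    #define COLS 256

    /* Original nest: iterations execute in lexicographical order of (i, j). */
    void scale(double a[ROWS][COLS]) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] *= 2.0;
    }

    /* After multiplying each iteration vector (i, j) by [[0, 1], [1, 0]],
       the new outer coordinate is j and the new inner coordinate is i:
       the loops are interchanged. The transformation is legal here because
       no iteration depends on another. */
    void scale_interchanged(double a[ROWS][COLS]) {
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                a[i][j] *= 2.0;
    }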
The polyhedral or constraint-based framework

The polyhedral model handles a wider class of programs and transformations than the unimodular framework. The set of executions of a set of statements within a possibly imperfectly nested set of loops is seen as the union of a set of polytopes representing the executions of the statements. Affine transformations are applied to these polytopes, producing a description of a new execution order. The boundaries of the polytopes, the data dependencies, and the transformations are often described using systems of constraints, and this approach is often referred to as a constraint-based approach to loop optimization. For example, a single statement within an outer loop 'for i := 0 to n' and an inner loop 'for j := 0 to i+2' is executed once for each (i, j) pair such that 0 <= i <= n and 0 <= j <= i+2.

Once again, a transformation is legal if it preserves the temporal sequence of all dependencies. Estimating the benefits of a transformation, or finding the best transformation for a given code on a given computer, remain the subject of ongoing research as of the time of this writing (2010).
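A minimal sketch of why the constraint view is useful (the array bound, function names, and statement are illustrative assumptions): interchanging the triangular nest above still enumerates the same integer points, but because the domain is not rectangular the new loop bounds must be derived from the constraint system rather than read off a rectangle.

    #define MAXN 100   /* illustrative bound; n is assumed to be at most MAXN */

    /* The triangular domain from the text, written out directly:
       statement S runs once for every (i, j) with 0 <= i <= n and 0 <= j <= i+2. */
    void fill(int n, int b[MAXN + 1][MAXN + 3]) {
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= i + 2; j++)
                b[i][j] = i + j;                   /* S */
    }

    /* The same set of executions with the loops interchanged. The new bounds
       follow from the constraints 0 <= i <= n and 0 <= j <= i+2; automating
       exactly this kind of bound computation is what the constraint-based
       framework provides. The reordering is legal here because each (i, j)
       writes its own element independently. */
    void fill_interchanged(int n, int b[MAXN + 1][MAXN + 3]) {
        for (int j = 0; j <= n + 2; j++)
            for (int i = (j - 2 > 0 ? j - 2 : 0); i <= n; i++)
                b[i][j] = i + j;                   /* S */
    }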
See also

- Loop nest optimization
- Polytope model
- Scalable parallelism
- Scalable locality
References

- David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler transformations for high-performance computing. Report No. UCB/CSD 93/781, Computer Science Division-EECS, University of California, Berkeley, Berkeley, California 94720, November 1993 (available at CiteSeer). Introduces compiler analysis such as data dependence analysis and interprocedural analysis, as well as a very complete list of loop transformations.
- Steven S. Muchnick, Advanced Compiler Design and Implementation. 1997 Morgan Kaufmann. Section 20.4.2 discusses loop optimization.
- R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.
- In the book Reasoning About Program Transformations, Jean-Francois Collard discusses in depth the general question of representing executions of programs rather than program text in the context of static optimization.
- "7.6.3.1 Strip-Mining (Sun Studio 12: Fortran Programming Guide)".
- "Intel Developer Zone".
- "8051 Instruction Set". www.win.tue.nl. Retrieved 2019-12-09.