Knowledge

Loop nest optimization

Source đź“ť

36: 1186:
accumulators simultaneously, this code can keep a single floating point adder with a latency of 4 busy nearly all the time (problem #1). However, the code does not address the third problem. (Nor does it address the cleanup work necessary when N is odd. Such details will be left out of the following discussion.)
2510:
The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the Pentium 4 has 256KB (or more, depending on the model) high-bandwidth level-2 cache as well as the
2004:
the machine's memory system will keep up with the floating point unit and the code will run at maximum performance. The 16KB cache of the Pentium 4 is not quite big enough: if ib=24 and kb=64 were chosen instead, 12KB of the cache would be used—avoiding completely filling it, which is desirable so
1499:
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the
1495:
This code would run quite acceptably on a Cray Y-MP (built in the early 1980s), which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it
1532:
The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of A is loaded, and the entire matrix B is loaded. For the entire calculation, C is stored once (that's good), A is loaded into the cache once (assuming a stripe of A
1540:
The next step to reduce the memory traffic is to make ib as large as possible. It needs to be larger than the "balance" number reported by streams. In the case of one particular 2.8 GHz Pentium 4 system used for this example, the balance number is 16.5. The second code example above
1185:
The original loop calculates the result for one entry in the result matrix at a time. By calculating a small block of entries simultaneously, the following loop reuses each loaded value twice, so that the inner loop has four loads and four multiply–adds, thus solving problem #2. By carrying four
1897:
With this code, ib can be set to any desired parameter, and the number of loads of the B matrix will be reduced by that factor. This freedom has a cost: NĂ—ib slices of the A matrix are being kept in the cache. As long as that fits, this code will not be limited by the memory system.
2493:
The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably
1901:
So what size matrix fits? The example system, a 2.8 GHz Pentium 4, has a 16KB primary data cache. With ib=20, the slice of the A matrix in this code will be larger than the primary cache when N > 100. For problems larger than that, another trick is needed.
2524:. Unfortunately, the extra levels of blocking will incur still more loop overhead, which for some problem sizes on some hardware may be more time-consuming than any shortcomings in the hardware's ability to stream data from the L2 cache. 190:
until it is reused. The partitioning of loop iteration space leads to partitioning of a large array into smaller blocks, thus fitting accessed array elements into cache size, enhancing cache reuse and eliminating cache size requirements.
2519:
Block the loops again, again for the level-2 cache sizes. With a total of three levels of blocking (for the register file, for the L1 cache, and the L2 cache), the code will minimize the required bandwidth at each level of the
2515:
Adjust the block sizes for the level-2 cache. This will stress the processor's ability to keep many instructions in flight simultaneously, and there is a good chance it will be unable to achieve full bandwidth from the level-2
2506:
loop. This is one of the values of such a compiler: while it is straightforward to code the simple cases of this optimization, keeping all the details correct as the code is replicated and transformed is an error-prone process.
936:
It is not always easy to decide what value of tiling size is optimal for one loop because it demands an accurate estimate of accessed array regions in the loop and the cache size of the target machine. The order of loop nests
1905:
That trick is reducing the size of the stripe of the B matrix by blocking the k loop so that the stripe is of size ib Ă— kb. Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of
2532:
is designed to take advantage of any available cache, no matter what its size is. This automatically takes advantage of two or more levels of memory hierarchy, if available. Cache-oblivious algorithms for
1504:. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop. 1507:
Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and that more registers can help the compiler and programmer reduce the need for memory bandwidth. This
1533:
fits in the cache with a stripe of B), but B is loaded N/ib times, where ib is the size of the strip in the C matrix, for a total of N/ib doubleword loads from main memory. In the code above,
1500:
accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers. A 3x3 block requires 13, which will not work on a machine with just 8 floating point registers in the
1181:
Typical PC memory systems can only sustain one 8-byte doubleword per 10–30 double-precision multiply–adds, so values loaded into the cache must be reused many times.
156:
occur when one loop is inside of another loop.) One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common
1950: 1496:
can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!
382:
The following is an example of matrix vector multiplication. There are three arrays, each with 100 elements. The code does not partition the arrays into smaller sizes.
1992: 941:) also plays an important role in achieving better cache performance. Explicit blocking requires choosing a tile size based on these factors. By contrast, 2681:. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991. 2724: 2871: 2005:
the C and B arrays have to have some room to flow through. These numbers come within 20% of the peak floating-point speed of the processor.
186:
Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the
100: 72: 2644: 2905: 53: 2635: 79: 2614: 2584: 2534: 920:
is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example,
2717: 2670: 86: 3077: 119: 2954: 2855: 68: 2694:, showing the overall balance between floating point operations and memory operations for many different computers 3103: 2996: 2710: 1492:
iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.
57: 1545:
over i. (Technically, this is actually the second time i is blocked, as the first time was the factor of 2.)
2891: 1541:
cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is
2975: 3026: 2991: 2678: 2604: 17: 2915: 2790: 2770: 2529: 942: 93: 2775: 1168: 46: 1911: 2923: 2900: 2876: 1959: 3052: 3047: 3001: 2928: 2752: 2747: 954: 149: 3082: 3006: 2850: 2691: 965:
where A, B, and C are NĂ—N arrays. Subscripts, for the following description, are in form
8: 3062: 2933: 2825: 2733: 1164: 953:
Many large mathematical operations on computers end up spending much of their time doing
145: 2528:
Rather than specifically tune for one particular cache size, as in the first example, a
152:
optimization or parallelization or another loop overhead reduction of the loop nests. (
3057: 3021: 2840: 2830: 2780: 2938: 2762: 2610: 2580: 2573: 2551: 1509: 1163:
Floating point additions take some number of cycles to complete. In order to keep an
3072: 3011: 2881: 2860: 2820: 2800: 2657: 2546: 2521: 1175: 938: 133: 3067: 1501: 3042: 3016: 2815: 2810: 2795: 2495: 157: 3097: 1526: 1517:
CPUs, who intended to build machines more parallel than the general purpose
2785: 2702: 153: 2865: 945:
are designed to make efficient use of cache without explicit blocking.
2570: 563:
After loop tiling is applied using 2 * 2 blocks, the code looks like:
2959: 187: 160: 35: 137: 2697: 1167:
with multiple cycle latency busy, the code must update multiple
2602: 2679:
The cache performance and optimizations of blocked algorithms
1522: 2698:"CHiLL: Composable High-Level Loop Transformation Framework" 2662: 2649: 2571:
Steven Muchnick; Muchnick and Associates (15 August 1997).
1514: 1954:
memory transfers. B is still transferred N/ib times, for
166:
The technique used to produce this optimization is called
1518: 249:
can be blocked with a block size B by replacing it with
144:(LNO) is an optimization technique that applies a set of 1174:
Machines can typically do just one memory operation per
377: 2606:
Compilation Techniques for Reconfigurable Architectures
374:
is a function returning the minimum of its arguments.
2564: 1962: 1914: 60:. Unsourced material may be challenged and removed. 2603:JoĂŁo M.P. Cardoso; Pedro C. Diniz (2 April 2011). 2572: 1986: 1944: 1178:, so values loaded must be reused at least twice. 948: 3095: 2502:iterations, to remove the if statement from the 2596: 2498:the kk == 0 iteration off from the rest of the 928:) may cross cache lines, causing cache misses. 2872:Induction variable recognition and elimination 2718: 2677:M. S. Lam, E. E. Rothberg, and M. E. Wolf. 2732: 2725: 2711: 2639:. Supercomputing'89, pages 655–664, 1989. 2609:. Springer Science & Business Media. 120:Learn how and when to remove this message 908:. The accessed chunk of array a is also 2906:Sparse conditional constant propagation 2575:Advanced Compiler Design Implementation 14: 3096: 1525:CPUs, adopted 32-entry floating-point 2706: 900:The original loop iteration space is 378:Example: matrix-vector multiplication 27:Technique in computer software design 2645:A Data Locality Optimizing Algorithm 58:adding citations to reliable sources 29: 2674:. Kluwer Academic Publishers. 2000. 1159:There are three problems to solve: 24: 2627: 2511:level-1 cache. There is a choice: 25: 3115: 2685: 2856:Common subexpression elimination 34: 2997:Compile-time function execution 45:needs additional citations for 949:Example: matrix multiplication 931: 13: 1: 2557: 2976:Interprocedural optimization 2655:Irigoin, F. and Triolet, R. 2000:2*N/kb + N/ib < N/balance 7: 3027:Profile-guided optimization 2992:Bounds-checking elimination 2671:Loop Tiling for Parallelism 2636:More Iteration Space Tiling 2540: 2008:Here is the code with loop 1484:This code has had both the 181: 10: 3120: 2791:Loop-invariant code motion 1945:{\displaystyle 2*N^{3}/kb} 943:cache-oblivious algorithms 176:strip mine and interchange 3035: 2984: 2968: 2947: 2914: 2890: 2839: 2771:Automatic parallelization 2761: 2740: 2692:Streams benchmark results 2665:'88, pages 319–329, 1988. 2530:cache-oblivious algorithm 2642:Wolf, M. E. and Lam, M. 2014: 1987:{\displaystyle N^{3}/ib} 1547: 1188: 974: 565: 384: 251: 196: 69:"Loop nest optimization" 2776:Automatic vectorization 2652:'91, pages 30–44, 1991. 1996:transfers. So long as 3104:Compiler optimizations 2924:Instruction scheduling 2901:Global value numbering 2877:Live-variable analysis 2806:Loop nest optimization 2734:Compiler optimizations 2658:Supernode Partitioning 1988: 1946: 142:loop nest optimization 3053:Control-flow analysis 3048:Array-access analysis 3002:Dead-code elimination 2960:Tail-call elimination 2929:Instruction selection 2753:Local value numbering 2748:Peephole optimization 2535:matrix multiplication 1989: 1947: 955:matrix multiplication 3083:Value range analysis 3007:Expression templates 2851:Available expression 1960: 1912: 957:. The operation is: 146:loop transformations 136:and particularly in 54:improve this article 3063:Dependence analysis 2934:Register allocation 2826:Software pipelining 2579:. Morgan Kaufmann. 972:The basic loop is: 148:for the purpose of 3058:Data-flow analysis 3022:Partial evaluation 2831:Strength reduction 2781:Induction variable 1984: 1942: 1513:is why vendors of 3091: 3090: 2939:Rematerialization 2616:978-0-387-09671-1 2586:978-1-55860-320-2 2552:Loop optimization 1510:register pressure 194:An ordinary loop 130: 129: 122: 104: 16:(Redirected from 3111: 3073:Pointer analysis 3012:Inline expansion 2882:Use-define chain 2861:Constant folding 2821:Loop unswitching 2801:Loop interchange 2727: 2720: 2713: 2704: 2703: 2621: 2620: 2600: 2594: 2593: 2578: 2568: 2522:memory hierarchy 2505: 2501: 2489: 2486: 2483: 2480: 2477: 2474: 2471: 2468: 2465: 2462: 2459: 2456: 2453: 2450: 2447: 2444: 2441: 2438: 2435: 2432: 2429: 2426: 2423: 2420: 2417: 2414: 2411: 2408: 2405: 2402: 2399: 2396: 2393: 2390: 2387: 2384: 2381: 2378: 2375: 2372: 2369: 2366: 2363: 2360: 2357: 2354: 2351: 2348: 2345: 2342: 2339: 2336: 2333: 2330: 2327: 2324: 2321: 2318: 2315: 2312: 2309: 2306: 2303: 2300: 2297: 2294: 2291: 2288: 2285: 2282: 2279: 2276: 2273: 2270: 2267: 2264: 2261: 2258: 2255: 2252: 2249: 2246: 2243: 2240: 2237: 2234: 2231: 2228: 2225: 2222: 2219: 2216: 2213: 2210: 2207: 2204: 2201: 2198: 2195: 2192: 2189: 2186: 2183: 2180: 2177: 2174: 2171: 2168: 2165: 2162: 2159: 2156: 2153: 2150: 2147: 2144: 2141: 2138: 2135: 2132: 2129: 2126: 2123: 2120: 2117: 2114: 2111: 2108: 2105: 2102: 2099: 2096: 2093: 2090: 2087: 2084: 2081: 2078: 2075: 2072: 2069: 2066: 2063: 2060: 2057: 2054: 2051: 2048: 2045: 2042: 2039: 2036: 2033: 2030: 2027: 2024: 2021: 2018: 2011: 1995: 1993: 1991: 1990: 1985: 1977: 1972: 1971: 1953: 1951: 1949: 1948: 1943: 1935: 1930: 1929: 1893: 1890: 1887: 1884: 1881: 1878: 1875: 1872: 1869: 1866: 1863: 1860: 1857: 1854: 1851: 1848: 1845: 1842: 1839: 1836: 1833: 1830: 1827: 1824: 1821: 1818: 1815: 1812: 1809: 1806: 1803: 1800: 1797: 1794: 1791: 1788: 1785: 1782: 1779: 1776: 1773: 1770: 1767: 1764: 1761: 1758: 1755: 1752: 1749: 1746: 1743: 1740: 1737: 1734: 1731: 1728: 1725: 1722: 1719: 1716: 1713: 1710: 1707: 1704: 1701: 1698: 1695: 1692: 1689: 1686: 1683: 1680: 1677: 1674: 1671: 1668: 1665: 1662: 1659: 1656: 1653: 1650: 1647: 1644: 1641: 1638: 1635: 1632: 1629: 1626: 1623: 1620: 1617: 1614: 1611: 1608: 1605: 1602: 1599: 1596: 1593: 1590: 1587: 1584: 1581: 1578: 1575: 1572: 1569: 1566: 1563: 1560: 1557: 1554: 1551: 1491: 1487: 1480: 1477: 1474: 1471: 1468: 1465: 1462: 1459: 1456: 1453: 1450: 1447: 1444: 1441: 1438: 1435: 1432: 1429: 1426: 1423: 1420: 1417: 1414: 1411: 1408: 1405: 1402: 1399: 1396: 1393: 1390: 1387: 1384: 1381: 1378: 1375: 1372: 1369: 1366: 1363: 1360: 1357: 1354: 1351: 1348: 1345: 1342: 1339: 1336: 1333: 1330: 1327: 1324: 1321: 1318: 1315: 1312: 1309: 1306: 1303: 1300: 1297: 1294: 1291: 1288: 1285: 1282: 1279: 1276: 1273: 1270: 1267: 1264: 1261: 1258: 1255: 1252: 1249: 1246: 1243: 1240: 1237: 1234: 1231: 1228: 1225: 1222: 1219: 1216: 1213: 1210: 1207: 1204: 1201: 1198: 1195: 1192: 1155: 1152: 1149: 1146: 1143: 1140: 1137: 1134: 1131: 1128: 1125: 1122: 1119: 1116: 1113: 1110: 1107: 1104: 1101: 1098: 1095: 1092: 1089: 1086: 1083: 1080: 1077: 1074: 1071: 1068: 1065: 1062: 1059: 1056: 1053: 1050: 1047: 1044: 1041: 1038: 1035: 1032: 1029: 1026: 1023: 1020: 1017: 1014: 1011: 1008: 1005: 1002: 999: 996: 993: 990: 987: 984: 981: 978: 968: 939:loop interchange 927: 923: 896: 893: 890: 887: 884: 881: 878: 875: 872: 869: 866: 863: 860: 857: 854: 851: 848: 845: 842: 839: 836: 833: 830: 827: 824: 821: 818: 815: 812: 809: 806: 803: 800: 797: 794: 791: 788: 785: 782: 779: 776: 773: 770: 767: 764: 761: 758: 755: 752: 749: 746: 743: 740: 737: 734: 731: 728: 725: 722: 719: 716: 713: 710: 707: 704: 701: 698: 695: 692: 689: 686: 683: 680: 677: 674: 671: 668: 665: 662: 659: 656: 653: 650: 647: 644: 641: 638: 635: 632: 629: 626: 623: 620: 617: 614: 611: 608: 605: 602: 599: 596: 593: 590: 587: 584: 581: 578: 575: 572: 569: 559: 556: 553: 550: 547: 544: 541: 538: 535: 532: 529: 526: 523: 520: 517: 514: 511: 508: 505: 502: 499: 496: 493: 490: 487: 484: 481: 478: 475: 472: 469: 466: 463: 460: 457: 454: 451: 448: 445: 442: 439: 436: 433: 430: 427: 424: 421: 418: 415: 412: 409: 406: 403: 400: 397: 394: 391: 388: 373: 366: 363: 360: 357: 354: 351: 348: 345: 342: 339: 336: 333: 330: 327: 324: 321: 318: 315: 312: 309: 306: 303: 300: 297: 294: 291: 288: 285: 282: 279: 276: 273: 270: 267: 264: 261: 258: 255: 245: 242: 239: 236: 233: 230: 227: 224: 221: 218: 215: 212: 209: 206: 203: 200: 170:, also known as 134:computer science 125: 118: 114: 111: 105: 103: 62: 38: 30: 21: 3119: 3118: 3114: 3113: 3112: 3110: 3109: 3108: 3094: 3093: 3092: 3087: 3068:Escape analysis 3036:Static analysis 3031: 2980: 2964: 2943: 2916:Code generation 2910: 2886: 2842: 2835: 2757: 2736: 2731: 2688: 2630: 2628:Further reading 2625: 2624: 2617: 2601: 2597: 2587: 2569: 2565: 2560: 2543: 2503: 2499: 2491: 2490: 2487: 2484: 2481: 2478: 2475: 2472: 2469: 2466: 2463: 2460: 2457: 2454: 2451: 2448: 2445: 2442: 2439: 2436: 2433: 2430: 2427: 2424: 2421: 2418: 2415: 2412: 2409: 2406: 2403: 2400: 2397: 2394: 2391: 2388: 2385: 2382: 2379: 2376: 2373: 2370: 2367: 2364: 2361: 2358: 2355: 2352: 2349: 2346: 2343: 2340: 2337: 2334: 2331: 2328: 2325: 2322: 2319: 2316: 2313: 2310: 2307: 2304: 2301: 2298: 2295: 2292: 2289: 2286: 2283: 2280: 2277: 2274: 2271: 2268: 2265: 2262: 2259: 2256: 2253: 2250: 2247: 2244: 2241: 2238: 2235: 2232: 2229: 2226: 2223: 2220: 2217: 2214: 2211: 2208: 2205: 2202: 2199: 2196: 2193: 2190: 2187: 2184: 2181: 2178: 2175: 2172: 2169: 2166: 2163: 2160: 2157: 2154: 2151: 2148: 2145: 2142: 2139: 2136: 2133: 2130: 2127: 2124: 2121: 2118: 2115: 2112: 2109: 2106: 2103: 2100: 2097: 2094: 2091: 2088: 2085: 2082: 2079: 2076: 2073: 2070: 2067: 2064: 2061: 2058: 2055: 2052: 2049: 2046: 2043: 2040: 2037: 2034: 2031: 2028: 2025: 2022: 2019: 2016: 2009: 1973: 1967: 1963: 1961: 1958: 1957: 1955: 1931: 1925: 1921: 1913: 1910: 1909: 1907: 1895: 1894: 1891: 1888: 1885: 1882: 1879: 1876: 1873: 1870: 1867: 1864: 1861: 1858: 1855: 1852: 1849: 1846: 1843: 1840: 1837: 1834: 1831: 1828: 1825: 1822: 1819: 1816: 1813: 1810: 1807: 1804: 1801: 1798: 1795: 1792: 1789: 1786: 1783: 1780: 1777: 1774: 1771: 1768: 1765: 1762: 1759: 1756: 1753: 1750: 1747: 1744: 1741: 1738: 1735: 1732: 1729: 1726: 1723: 1720: 1717: 1714: 1711: 1708: 1705: 1702: 1699: 1696: 1693: 1690: 1687: 1684: 1681: 1678: 1675: 1672: 1669: 1666: 1663: 1660: 1657: 1654: 1651: 1648: 1645: 1642: 1639: 1636: 1633: 1630: 1627: 1624: 1621: 1618: 1615: 1612: 1609: 1606: 1603: 1600: 1597: 1594: 1591: 1588: 1585: 1582: 1579: 1576: 1573: 1570: 1567: 1564: 1561: 1558: 1555: 1552: 1549: 1489: 1485: 1482: 1481: 1478: 1475: 1472: 1469: 1466: 1463: 1460: 1457: 1454: 1451: 1448: 1445: 1442: 1439: 1436: 1433: 1430: 1427: 1424: 1421: 1418: 1415: 1412: 1409: 1406: 1403: 1400: 1397: 1394: 1391: 1388: 1385: 1382: 1379: 1376: 1373: 1370: 1367: 1364: 1361: 1358: 1355: 1352: 1349: 1346: 1343: 1340: 1337: 1334: 1331: 1328: 1325: 1322: 1319: 1316: 1313: 1310: 1307: 1304: 1301: 1298: 1295: 1292: 1289: 1286: 1283: 1280: 1277: 1274: 1271: 1268: 1265: 1262: 1259: 1256: 1253: 1250: 1247: 1244: 1241: 1238: 1235: 1232: 1229: 1226: 1223: 1220: 1217: 1214: 1211: 1208: 1205: 1202: 1199: 1196: 1193: 1190: 1157: 1156: 1153: 1150: 1147: 1144: 1141: 1138: 1135: 1132: 1129: 1126: 1123: 1120: 1117: 1114: 1111: 1108: 1105: 1102: 1099: 1096: 1093: 1090: 1087: 1084: 1081: 1078: 1075: 1072: 1069: 1066: 1063: 1060: 1057: 1054: 1051: 1048: 1045: 1042: 1039: 1036: 1033: 1030: 1027: 1024: 1021: 1018: 1015: 1012: 1009: 1006: 1003: 1000: 997: 994: 991: 988: 985: 982: 979: 976: 966: 951: 934: 925: 921: 898: 897: 894: 891: 888: 885: 882: 879: 876: 873: 870: 867: 864: 861: 858: 855: 852: 849: 846: 843: 840: 837: 834: 831: 828: 825: 822: 819: 816: 813: 810: 807: 804: 801: 798: 795: 792: 789: 786: 783: 780: 777: 774: 771: 768: 765: 762: 759: 756: 753: 750: 747: 744: 741: 738: 735: 732: 729: 726: 723: 720: 717: 714: 711: 708: 705: 702: 699: 696: 693: 690: 687: 684: 681: 678: 675: 672: 669: 666: 663: 660: 657: 654: 651: 648: 645: 642: 639: 636: 633: 630: 627: 624: 621: 618: 615: 612: 609: 606: 603: 600: 597: 594: 591: 588: 585: 582: 579: 576: 573: 570: 567: 561: 560: 557: 554: 551: 548: 545: 542: 539: 536: 533: 530: 527: 524: 521: 518: 515: 512: 509: 506: 503: 500: 497: 494: 491: 488: 485: 482: 479: 476: 473: 470: 467: 464: 461: 458: 455: 452: 449: 446: 443: 440: 437: 434: 431: 428: 425: 422: 419: 416: 413: 410: 407: 404: 401: 398: 395: 392: 389: 386: 380: 371: 368: 367: 364: 361: 358: 355: 352: 349: 346: 343: 340: 337: 334: 331: 328: 325: 322: 319: 316: 313: 310: 307: 304: 301: 298: 295: 292: 289: 286: 283: 280: 277: 274: 271: 268: 265: 262: 259: 256: 253: 247: 246: 243: 240: 237: 234: 231: 228: 225: 222: 219: 216: 213: 210: 207: 204: 201: 198: 184: 126: 115: 109: 106: 63: 61: 51: 39: 28: 23: 22: 15: 12: 11: 5: 3117: 3107: 3106: 3089: 3088: 3086: 3085: 3080: 3078:Shape analysis 3075: 3070: 3065: 3060: 3055: 3050: 3045: 3043:Alias analysis 3039: 3037: 3033: 3032: 3030: 3029: 3024: 3019: 3017:Jump threading 3014: 3009: 3004: 2999: 2994: 2988: 2986: 2982: 2981: 2979: 2978: 2972: 2970: 2966: 2965: 2963: 2962: 2957: 2951: 2949: 2945: 2944: 2942: 2941: 2936: 2931: 2926: 2920: 2918: 2912: 2911: 2909: 2908: 2903: 2897: 2895: 2888: 2887: 2885: 2884: 2879: 2874: 2869: 2863: 2858: 2853: 2847: 2845: 2837: 2836: 2834: 2833: 2828: 2823: 2818: 2816:Loop unrolling 2813: 2811:Loop splitting 2808: 2803: 2798: 2796:Loop inversion 2793: 2788: 2783: 2778: 2773: 2767: 2765: 2759: 2758: 2756: 2755: 2750: 2744: 2742: 2738: 2737: 2730: 2729: 2722: 2715: 2707: 2701: 2700: 2695: 2687: 2686:External links 2684: 2683: 2682: 2675: 2666: 2653: 2640: 2629: 2626: 2623: 2622: 2615: 2595: 2585: 2562: 2561: 2559: 2556: 2555: 2554: 2549: 2542: 2539: 2526: 2525: 2517: 2015: 2002: 2001: 1983: 1980: 1976: 1970: 1966: 1941: 1938: 1934: 1928: 1924: 1920: 1917: 1548: 1527:register files 1189: 1183: 1182: 1179: 1172: 975: 963: 962: 950: 947: 933: 930: 566: 385: 379: 376: 252: 197: 183: 180: 158:linear algebra 128: 127: 42: 40: 33: 26: 9: 6: 4: 3: 2: 3116: 3105: 3102: 3101: 3099: 3084: 3081: 3079: 3076: 3074: 3071: 3069: 3066: 3064: 3061: 3059: 3056: 3054: 3051: 3049: 3046: 3044: 3041: 3040: 3038: 3034: 3028: 3025: 3023: 3020: 3018: 3015: 3013: 3010: 3008: 3005: 3003: 3000: 2998: 2995: 2993: 2990: 2989: 2987: 2983: 2977: 2974: 2973: 2971: 2967: 2961: 2958: 2956: 2955:Deforestation 2953: 2952: 2950: 2946: 2940: 2937: 2935: 2932: 2930: 2927: 2925: 2922: 2921: 2919: 2917: 2913: 2907: 2904: 2902: 2899: 2898: 2896: 2893: 2889: 2883: 2880: 2878: 2875: 2873: 2870: 2867: 2864: 2862: 2859: 2857: 2854: 2852: 2849: 2848: 2846: 2844: 2838: 2832: 2829: 2827: 2824: 2822: 2819: 2817: 2814: 2812: 2809: 2807: 2804: 2802: 2799: 2797: 2794: 2792: 2789: 2787: 2784: 2782: 2779: 2777: 2774: 2772: 2769: 2768: 2766: 2764: 2760: 2754: 2751: 2749: 2746: 2745: 2743: 2739: 2735: 2728: 2723: 2721: 2716: 2714: 2709: 2708: 2705: 2699: 2696: 2693: 2690: 2689: 2680: 2676: 2673: 2672: 2667: 2664: 2660: 2659: 2654: 2651: 2647: 2646: 2641: 2638: 2637: 2632: 2631: 2618: 2612: 2608: 2607: 2599: 2592: 2588: 2582: 2577: 2576: 2567: 2563: 2553: 2550: 2548: 2547:Duff's device 2545: 2544: 2538: 2536: 2531: 2523: 2518: 2514: 2513: 2512: 2508: 2497: 2013: 2006: 1999: 1998: 1997: 1981: 1978: 1974: 1968: 1964: 1939: 1936: 1932: 1926: 1922: 1918: 1915: 1903: 1899: 1546: 1544: 1538: 1536: 1530: 1528: 1524: 1520: 1516: 1512: 1511: 1505: 1503: 1497: 1493: 1187: 1180: 1177: 1173: 1170: 1166: 1162: 1161: 1160: 973: 970: 960: 959: 958: 956: 946: 944: 940: 929: 919: 915: 911: 907: 903: 564: 383: 375: 250: 195: 192: 189: 179: 177: 173: 172:loop blocking 169: 164: 162: 159: 155: 151: 147: 143: 139: 135: 124: 121: 113: 102: 99: 95: 92: 88: 85: 81: 78: 74: 71: â€“  70: 66: 65:Find sources: 59: 55: 49: 48: 43:This article 41: 37: 32: 31: 19: 2805: 2669: 2656: 2643: 2634: 2605: 2598: 2590: 2574: 2566: 2527: 2509: 2492: 2007: 2003: 1904: 1900: 1896: 1542: 1539: 1534: 1531: 1508: 1506: 1498: 1494: 1483: 1184: 1176:multiply-add 1171:in parallel. 1169:accumulators 1158: 971: 964: 952: 935: 917: 913: 909: 905: 901: 899: 562: 381: 369: 248: 193: 185: 175: 171: 167: 165: 154:Nested loops 141: 131: 116: 110:January 2021 107: 97: 90: 83: 76: 64: 52:Please help 47:verification 44: 2868:elimination 2786:Loop fusion 2741:Basic block 2537:are known. 932:Tiling size 168:loop tiling 18:Loop tiling 2948:Functional 2866:Dead store 2633:Wolfe, M. 2558:References 926:j = 1 to n 161:algorithms 80:newspapers 2841:Data-flow 2012:blocked. 1919:∗ 3098:Category 2843:analysis 2668:Xue, J. 2541:See also 182:Overview 150:locality 140:design, 138:compiler 2591:tiling. 1994:⁠ 1956:⁠ 1952:⁠ 1908:⁠ 1543:blocked 961:C = AĂ—B 916:. When 94:scholar 2969:Global 2894:-based 2613:  2583:  2516:cache. 1537:is 2. 370:where 96:  89:  82:  75:  67:  2985:Other 2496:split 2473:acc11 2461:acc10 2449:acc01 2437:acc00 2410:acc11 2392:acc10 2374:acc01 2356:acc00 2293:acc11 2281:acc10 2269:acc01 2257:acc00 2239:acc11 2233:acc10 2227:acc01 2221:acc00 1880:acc11 1868:acc10 1856:acc01 1844:acc00 1817:acc11 1799:acc10 1781:acc01 1763:acc00 1709:acc11 1703:acc10 1697:acc01 1691:acc00 1523:68000 1470:acc11 1458:acc10 1446:acc01 1434:acc00 1407:acc11 1389:acc10 1371:acc01 1353:acc00 1299:acc11 1293:acc10 1287:acc01 1281:acc00 1165:adder 922:i = 1 372:min() 188:cache 101:JSTOR 87:books 2763:Loop 2663:POPL 2650:PLDI 2611:ISBN 2581:ISBN 2329:< 2251:else 2173:< 2128:< 2083:< 2038:< 1742:< 1661:< 1616:< 1571:< 1521:and 1515:RISC 1488:and 1332:< 1257:< 1212:< 1115:< 1061:< 1019:< 823:< 763:< 718:< 649:< 510:< 456:< 359:.... 320:< 275:< 220:< 73:news 2892:SSA 2648:. 2308:for 2152:for 2107:for 2062:for 2017:for 1721:for 1640:for 1595:for 1550:for 1519:x86 1502:ISA 1311:for 1236:for 1191:for 1094:for 1040:for 998:for 977:int 912:by 904:by 826:min 802:for 766:min 742:for 697:for 628:for 622:100 613:int 568:int 489:for 435:for 429:100 420:int 387:int 323:min 299:for 254:for 241:... 199:for 174:or 132:In 56:by 3100:: 2661:. 2589:. 2500:kk 2413:+= 2395:+= 2377:+= 2359:+= 2347:++ 2338:kb 2332:kk 2320:kk 2212:== 2209:kk 2203:if 2191:+= 2182:ib 2176:ii 2164:ii 2140:+= 2098:kb 2095:+= 2092:kk 2080:kk 2068:kk 2053:ib 2050:+= 2047:ii 2035:ii 2023:ii 1820:+= 1802:+= 1784:+= 1766:+= 1754:++ 1679:+= 1670:ib 1664:ii 1652:ii 1628:+= 1586:ib 1583:+= 1580:ii 1568:ii 1556:ii 1535:ib 1529:. 1410:+= 1392:+= 1374:+= 1356:+= 1344:++ 1269:+= 1224:+= 1136:+= 1124:++ 1070:++ 1028:++ 969:. 924:, 853:++ 847:); 793:++ 787:); 730:+= 661:+= 522:++ 468:++ 347:++ 344:); 287:+= 229:++ 178:. 163:. 2726:e 2719:t 2712:v 2619:. 2504:i 2488:} 2485:} 2482:} 2479:} 2476:; 2470:= 2467:C 2464:; 2458:= 2455:C 2452:; 2446:= 2443:C 2440:; 2434:= 2431:C 2428:} 2425:; 2422:A 2419:* 2416:B 2407:; 2404:A 2401:* 2398:B 2389:; 2386:A 2383:* 2380:B 2371:; 2368:A 2365:* 2362:B 2353:{ 2350:) 2344:k 2341:; 2335:+ 2326:k 2323:; 2317:= 2314:k 2311:( 2305:} 2302:; 2299:C 2296:= 2290:; 2287:C 2284:= 2278:; 2275:C 2272:= 2266:; 2263:C 2260:= 2254:{ 2248:; 2245:0 2242:= 2236:= 2230:= 2224:= 2218:) 2215:0 2206:( 2200:{ 2197:) 2194:2 2188:i 2185:; 2179:+ 2170:i 2167:; 2161:= 2158:i 2155:( 2149:{ 2146:) 2143:2 2137:j 2134:; 2131:N 2125:j 2122:; 2119:0 2116:= 2113:j 2110:( 2104:{ 2101:) 2089:; 2086:N 2077:; 2074:0 2071:= 2065:( 2059:{ 2056:) 2044:; 2041:N 2032:; 2029:0 2026:= 2020:( 2010:k 1982:b 1979:i 1975:/ 1969:3 1965:N 1940:b 1937:k 1933:/ 1927:3 1923:N 1916:2 1892:} 1889:} 1886:} 1883:; 1877:= 1874:C 1871:; 1865:= 1862:C 1859:; 1853:= 1850:C 1847:; 1841:= 1838:C 1835:} 1832:; 1829:A 1826:* 1823:B 1814:; 1811:A 1808:* 1805:B 1796:; 1793:A 1790:* 1787:B 1778:; 1775:A 1772:* 1769:B 1760:{ 1757:) 1751:k 1748:; 1745:N 1739:k 1736:; 1733:0 1730:= 1727:k 1724:( 1718:; 1715:0 1712:= 1706:= 1700:= 1694:= 1688:{ 1685:) 1682:2 1676:i 1673:; 1667:+ 1658:i 1655:; 1649:= 1646:i 1643:( 1637:{ 1634:) 1631:2 1625:j 1622:; 1619:N 1613:j 1610:; 1607:0 1604:= 1601:j 1598:( 1592:{ 1589:) 1577:; 1574:N 1565:; 1562:0 1559:= 1553:( 1490:j 1486:i 1479:} 1476:} 1473:; 1467:= 1464:C 1461:; 1455:= 1452:C 1449:; 1443:= 1440:C 1437:; 1431:= 1428:C 1425:} 1422:; 1419:A 1416:* 1413:B 1404:; 1401:A 1398:* 1395:B 1386:; 1383:A 1380:* 1377:B 1368:; 1365:A 1362:* 1359:B 1350:{ 1347:) 1341:k 1338:; 1335:N 1329:k 1326:; 1323:0 1320:= 1317:k 1314:( 1308:; 1305:0 1302:= 1296:= 1290:= 1284:= 1278:{ 1275:) 1272:2 1266:j 1263:; 1260:N 1254:j 1251:; 1248:0 1245:= 1242:j 1239:( 1233:{ 1230:) 1227:2 1221:i 1218:; 1215:N 1209:i 1206:; 1203:0 1200:= 1197:i 1194:( 1154:} 1151:} 1148:; 1145:B 1142:* 1139:A 1133:C 1130:) 1127:k 1121:; 1118:N 1112:k 1109:; 1106:0 1103:= 1100:k 1097:( 1091:; 1088:0 1085:= 1082:C 1079:{ 1076:) 1073:j 1067:; 1064:N 1058:j 1055:; 1052:0 1049:= 1046:j 1043:( 1037:{ 1034:) 1031:i 1025:; 1022:N 1016:i 1013:; 1010:0 1007:= 1004:i 1001:( 995:; 992:k 989:, 986:j 983:, 980:i 967:C 937:( 918:n 914:n 910:n 906:n 902:n 895:} 892:} 889:} 886:} 883:; 880:b 877:* 874:a 871:+ 868:c 865:= 862:c 859:{ 856:) 850:y 844:n 841:, 838:2 835:+ 832:j 829:( 820:y 817:; 814:j 811:= 808:y 805:( 799:{ 796:) 790:x 784:n 781:, 778:2 775:+ 772:i 769:( 760:x 757:; 754:i 751:= 748:x 745:( 739:{ 736:) 733:2 727:j 724:; 721:n 715:j 712:; 709:0 706:= 703:j 700:( 694:; 691:0 688:= 685:c 682:; 679:0 676:= 673:c 670:{ 667:) 664:2 658:i 655:; 652:n 646:i 643:; 640:0 637:= 634:i 631:( 625:; 619:= 616:n 610:; 607:c 604:, 601:b 598:, 595:a 592:, 589:y 586:, 583:x 580:, 577:j 574:, 571:i 558:} 555:} 552:; 549:b 546:* 543:a 540:+ 537:c 534:= 531:c 528:{ 525:) 519:j 516:; 513:n 507:j 504:; 501:0 498:= 495:j 492:( 486:; 483:0 480:= 477:c 474:{ 471:) 465:i 462:; 459:n 453:i 450:; 447:0 444:= 441:i 438:( 432:; 426:= 423:n 417:; 414:c 411:, 408:b 405:, 402:a 399:, 396:j 393:, 390:i 365:} 362:} 356:{ 353:) 350:i 341:B 338:+ 335:j 332:, 329:N 326:( 317:i 314:; 311:j 308:= 305:i 302:( 296:{ 293:) 290:B 284:j 281:; 278:N 272:j 269:; 266:0 263:= 260:j 257:( 244:} 238:{ 235:) 232:i 226:; 223:N 217:i 214:; 211:0 208:= 205:i 202:( 123:) 117:( 112:) 108:( 98:· 91:· 84:· 77:· 50:. 20:)

Index

Loop tiling

verification
improve this article
adding citations to reliable sources
"Loop nest optimization"
news
newspapers
books
scholar
JSTOR
Learn how and when to remove this message
computer science
compiler
loop transformations
locality
Nested loops
linear algebra
algorithms
cache
loop interchange
cache-oblivious algorithms
matrix multiplication
adder
accumulators
multiply-add
ISA
register pressure
RISC
x86

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.

↑