36:
1186:
accumulators simultaneously, this code can keep a single floating point adder with a latency of 4 busy nearly all the time (problem #1). However, the code does not address the third problem. (Nor does it address the cleanup work necessary when N is odd. Such details will be left out of the following discussion.)
2510:
The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the
Pentium 4 has 256KB (or more, depending on the model) high-bandwidth level-2 cache as well as the
2004:
the machine's memory system will keep up with the floating point unit and the code will run at maximum performance. The 16KB cache of the
Pentium 4 is not quite big enough: if ib=24 and kb=64 were chosen instead, 12KB of the cache would be used—avoiding completely filling it, which is desirable so
1499:
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the
1495:
This code would run quite acceptably on a Cray Y-MP (built in the early 1980s), which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz
Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it
1532:
The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of A is loaded, and the entire matrix B is loaded. For the entire calculation, C is stored once (that's good), A is loaded into the cache once (assuming a stripe of A
1540:
The next step to reduce the memory traffic is to make ib as large as possible. It needs to be larger than the "balance" number reported by streams. In the case of one particular 2.8 GHz
Pentium 4 system used for this example, the balance number is 16.5. The second code example above
1185:
The original loop calculates the result for one entry in the result matrix at a time. By calculating a small block of entries simultaneously, the following loop reuses each loaded value twice, so that the inner loop has four loads and four multiply–adds, thus solving problem #2. By carrying four
1897:
With this code, ib can be set to any desired parameter, and the number of loads of the B matrix will be reduced by that factor. This freedom has a cost: NĂ—ib slices of the A matrix are being kept in the cache. As long as that fits, this code will not be limited by the memory system.
2493:
The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably
1901:
So what size matrix fits? The example system, a 2.8 GHz
Pentium 4, has a 16KB primary data cache. With ib=20, the slice of the A matrix in this code will be larger than the primary cache when N > 100. For problems larger than that, another trick is needed.
2524:. Unfortunately, the extra levels of blocking will incur still more loop overhead, which for some problem sizes on some hardware may be more time-consuming than any shortcomings in the hardware's ability to stream data from the L2 cache.
190:
until it is reused. The partitioning of loop iteration space leads to partitioning of a large array into smaller blocks, thus fitting accessed array elements into cache size, enhancing cache reuse and eliminating cache size requirements.
2519:
Block the loops again, again for the level-2 cache sizes. With a total of three levels of blocking (for the register file, for the L1 cache, and the L2 cache), the code will minimize the required bandwidth at each level of the
2515:
Adjust the block sizes for the level-2 cache. This will stress the processor's ability to keep many instructions in flight simultaneously, and there is a good chance it will be unable to achieve full bandwidth from the level-2
2506:
loop. This is one of the values of such a compiler: while it is straightforward to code the simple cases of this optimization, keeping all the details correct as the code is replicated and transformed is an error-prone process.
936:
It is not always easy to decide what value of tiling size is optimal for one loop because it demands an accurate estimate of accessed array regions in the loop and the cache size of the target machine. The order of loop nests
1905:
That trick is reducing the size of the stripe of the B matrix by blocking the k loop so that the stripe is of size ib Ă— kb. Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of
2532:
is designed to take advantage of any available cache, no matter what its size is. This automatically takes advantage of two or more levels of memory hierarchy, if available. Cache-oblivious algorithms for
1504:. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
1507:
Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and that more registers can help the compiler and programmer reduce the need for memory bandwidth. This
1533:
fits in the cache with a stripe of B), but B is loaded N/ib times, where ib is the size of the strip in the C matrix, for a total of N/ib doubleword loads from main memory. In the code above,
1500:
accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers. A 3x3 block requires 13, which will not work on a machine with just 8 floating point registers in the
1181:
Typical PC memory systems can only sustain one 8-byte doubleword per 10–30 double-precision multiply–adds, so values loaded into the cache must be reused many times.
156:
occur when one loop is inside of another loop.) One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common
1950:
1496:
can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run slower on the 2.8 GHz
Pentium 4 than on the 166 MHz Y-MP!
382:
The following is an example of matrix vector multiplication. There are three arrays, each with 100 elements. The code does not partition the arrays into smaller sizes.
1992:
941:) also plays an important role in achieving better cache performance. Explicit blocking requires choosing a tile size based on these factors. By contrast,
2681:. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.
2724:
2871:
2005:
the C and B arrays have to have some room to flow through. These numbers come within 20% of the peak floating-point speed of the processor.
186:
Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the
100:
72:
2644:
2905:
53:
2635:
79:
2614:
2584:
2534:
920:
is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example,
2717:
2670:
86:
3077:
119:
2954:
2855:
68:
2694:, showing the overall balance between floating point operations and memory operations for many different computers
3103:
2996:
2710:
1492:
iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.
57:
1545:
over i. (Technically, this is actually the second time i is blocked, as the first time was the factor of 2.)
2891:
1541:
cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is
2975:
3026:
2991:
2678:
2604:
17:
2915:
2790:
2770:
2529:
942:
93:
2775:
1168:
46:
1911:
2923:
2900:
2876:
1959:
3052:
3047:
3001:
2928:
2752:
2747:
954:
149:
3082:
3006:
2850:
2691:
965:
where A, B, and C are NĂ—N arrays. Subscripts, for the following description, are in form
8:
3062:
2933:
2825:
2733:
1164:
953:
Many large mathematical operations on computers end up spending much of their time doing
145:
2528:
Rather than specifically tune for one particular cache size, as in the first example, a
152:
optimization or parallelization or another loop overhead reduction of the loop nests. (
3057:
3021:
2840:
2830:
2780:
2938:
2762:
2610:
2580:
2573:
2551:
1509:
1163:
Floating point additions take some number of cycles to complete. In order to keep an
3072:
3011:
2881:
2860:
2820:
2800:
2657:
2546:
2521:
1175:
938:
133:
3067:
1501:
3042:
3016:
2815:
2810:
2795:
2495:
157:
3097:
1526:
1517:
CPUs, who intended to build machines more parallel than the general purpose
2785:
2702:
153:
2865:
945:
are designed to make efficient use of cache without explicit blocking.
2570:
563:
After loop tiling is applied using 2 * 2 blocks, the code looks like:
2959:
187:
160:
35:
137:
2697:
1167:
with multiple cycle latency busy, the code must update multiple
2602:
2679:
The cache performance and optimizations of blocked algorithms
1522:
2698:"CHiLL: Composable High-Level Loop Transformation Framework"
2662:
2649:
2571:
Steven
Muchnick; Muchnick and Associates (15 August 1997).
1514:
1954:
memory transfers. B is still transferred N/ib times, for
166:
The technique used to produce this optimization is called
1518:
249:
can be blocked with a block size B by replacing it with
144:(LNO) is an optimization technique that applies a set of
1174:
Machines can typically do just one memory operation per
377:
2606:
Compilation
Techniques for Reconfigurable Architectures
374:
is a function returning the minimum of its arguments.
2564:
1962:
1914:
60:. Unsourced material may be challenged and removed.
2603:JoĂŁo M.P. Cardoso; Pedro C. Diniz (2 April 2011).
2572:
1986:
1944:
1178:, so values loaded must be reused at least twice.
948:
3095:
2502:iterations, to remove the if statement from the
2596:
2498:the kk == 0 iteration off from the rest of the
928:) may cross cache lines, causing cache misses.
2872:Induction variable recognition and elimination
2718:
2677:M. S. Lam, E. E. Rothberg, and M. E. Wolf.
2732:
2725:
2711:
2639:. Supercomputing'89, pages 655–664, 1989.
2609:. Springer Science & Business Media.
120:Learn how and when to remove this message
908:. The accessed chunk of array a is also
2906:Sparse conditional constant propagation
2575:Advanced Compiler Design Implementation
14:
3096:
1525:CPUs, adopted 32-entry floating-point
2706:
900:The original loop iteration space is
378:Example: matrix-vector multiplication
27:Technique in computer software design
2645:A Data Locality Optimizing Algorithm
58:adding citations to reliable sources
29:
2674:. Kluwer Academic Publishers. 2000.
1159:There are three problems to solve:
24:
2627:
2511:level-1 cache. There is a choice:
25:
3115:
2685:
2856:Common subexpression elimination
34:
2997:Compile-time function execution
45:needs additional citations for
949:Example: matrix multiplication
931:
13:
1:
2557:
2976:Interprocedural optimization
2655:Irigoin, F. and Triolet, R.
2000:2*N/kb + N/ib < N/balance
7:
3027:Profile-guided optimization
2992:Bounds-checking elimination
2671:Loop Tiling for Parallelism
2636:More Iteration Space Tiling
2540:
2008:Here is the code with loop
1484:This code has had both the
181:
10:
3120:
2791:Loop-invariant code motion
1945:{\displaystyle 2*N^{3}/kb}
943:cache-oblivious algorithms
176:strip mine and interchange
3035:
2984:
2968:
2947:
2914:
2890:
2839:
2771:Automatic parallelization
2761:
2740:
2692:Streams benchmark results
2665:'88, pages 319–329, 1988.
2530:cache-oblivious algorithm
2642:Wolf, M. E. and Lam, M.
2014:
1987:{\displaystyle N^{3}/ib}
1547:
1188:
974:
565:
384:
251:
196:
69:"Loop nest optimization"
2776:Automatic vectorization
2652:'91, pages 30–44, 1991.
1996:transfers. So long as
3104:Compiler optimizations
2924:Instruction scheduling
2901:Global value numbering
2877:Live-variable analysis
2806:Loop nest optimization
2734:Compiler optimizations
2658:Supernode Partitioning
1988:
1946:
142:loop nest optimization
3053:Control-flow analysis
3048:Array-access analysis
3002:Dead-code elimination
2960:Tail-call elimination
2929:Instruction selection
2753:Local value numbering
2748:Peephole optimization
2535:matrix multiplication
1989:
1947:
955:matrix multiplication
3083:Value range analysis
3007:Expression templates
2851:Available expression
1960:
1912:
957:. The operation is:
146:loop transformations
136:and particularly in
54:improve this article
3063:Dependence analysis
2934:Register allocation
2826:Software pipelining
2579:. Morgan Kaufmann.
972:The basic loop is:
148:for the purpose of
3058:Data-flow analysis
3022:Partial evaluation
2831:Strength reduction
2781:Induction variable
1984:
1942:
1513:is why vendors of
3091:
3090:
2939:Rematerialization
2616:978-0-387-09671-1
2586:978-1-55860-320-2
2552:Loop optimization
1510:register pressure
194:An ordinary loop
130:
129:
122:
104:
16:(Redirected from
3111:
3073:Pointer analysis
3012:Inline expansion
2882:Use-define chain
2861:Constant folding
2821:Loop unswitching
2801:Loop interchange
2727:
2720:
2713:
2704:
2703:
2621:
2620:
2600:
2594:
2593:
2578:
2568:
2522:memory hierarchy
2505:
2501:
2489:
2486:
2483:
2480:
2477:
2474:
2471:
2468:
2465:
2462:
2459:
2456:
2453:
2450:
2447:
2444:
2441:
2438:
2435:
2432:
2429:
2426:
2423:
2420:
2417:
2414:
2411:
2408:
2405:
2402:
2399:
2396:
2393:
2390:
2387:
2384:
2381:
2378:
2375:
2372:
2369:
2366:
2363:
2360:
2357:
2354:
2351:
2348:
2345:
2342:
2339:
2336:
2333:
2330:
2327:
2324:
2321:
2318:
2315:
2312:
2309:
2306:
2303:
2300:
2297:
2294:
2291:
2288:
2285:
2282:
2279:
2276:
2273:
2270:
2267:
2264:
2261:
2258:
2255:
2252:
2249:
2246:
2243:
2240:
2237:
2234:
2231:
2228:
2225:
2222:
2219:
2216:
2213:
2210:
2207:
2204:
2201:
2198:
2195:
2192:
2189:
2186:
2183:
2180:
2177:
2174:
2171:
2168:
2165:
2162:
2159:
2156:
2153:
2150:
2147:
2144:
2141:
2138:
2135:
2132:
2129:
2126:
2123:
2120:
2117:
2114:
2111:
2108:
2105:
2102:
2099:
2096:
2093:
2090:
2087:
2084:
2081:
2078:
2075:
2072:
2069:
2066:
2063:
2060:
2057:
2054:
2051:
2048:
2045:
2042:
2039:
2036:
2033:
2030:
2027:
2024:
2021:
2018:
2011:
1995:
1993:
1991:
1990:
1985:
1977:
1972:
1971:
1953:
1951:
1949:
1948:
1943:
1935:
1930:
1929:
1893:
1890:
1887:
1884:
1881:
1878:
1875:
1872:
1869:
1866:
1863:
1860:
1857:
1854:
1851:
1848:
1845:
1842:
1839:
1836:
1833:
1830:
1827:
1824:
1821:
1818:
1815:
1812:
1809:
1806:
1803:
1800:
1797:
1794:
1791:
1788:
1785:
1782:
1779:
1776:
1773:
1770:
1767:
1764:
1761:
1758:
1755:
1752:
1749:
1746:
1743:
1740:
1737:
1734:
1731:
1728:
1725:
1722:
1719:
1716:
1713:
1710:
1707:
1704:
1701:
1698:
1695:
1692:
1689:
1686:
1683:
1680:
1677:
1674:
1671:
1668:
1665:
1662:
1659:
1656:
1653:
1650:
1647:
1644:
1641:
1638:
1635:
1632:
1629:
1626:
1623:
1620:
1617:
1614:
1611:
1608:
1605:
1602:
1599:
1596:
1593:
1590:
1587:
1584:
1581:
1578:
1575:
1572:
1569:
1566:
1563:
1560:
1557:
1554:
1551:
1491:
1487:
1480:
1477:
1474:
1471:
1468:
1465:
1462:
1459:
1456:
1453:
1450:
1447:
1444:
1441:
1438:
1435:
1432:
1429:
1426:
1423:
1420:
1417:
1414:
1411:
1408:
1405:
1402:
1399:
1396:
1393:
1390:
1387:
1384:
1381:
1378:
1375:
1372:
1369:
1366:
1363:
1360:
1357:
1354:
1351:
1348:
1345:
1342:
1339:
1336:
1333:
1330:
1327:
1324:
1321:
1318:
1315:
1312:
1309:
1306:
1303:
1300:
1297:
1294:
1291:
1288:
1285:
1282:
1279:
1276:
1273:
1270:
1267:
1264:
1261:
1258:
1255:
1252:
1249:
1246:
1243:
1240:
1237:
1234:
1231:
1228:
1225:
1222:
1219:
1216:
1213:
1210:
1207:
1204:
1201:
1198:
1195:
1192:
1155:
1152:
1149:
1146:
1143:
1140:
1137:
1134:
1131:
1128:
1125:
1122:
1119:
1116:
1113:
1110:
1107:
1104:
1101:
1098:
1095:
1092:
1089:
1086:
1083:
1080:
1077:
1074:
1071:
1068:
1065:
1062:
1059:
1056:
1053:
1050:
1047:
1044:
1041:
1038:
1035:
1032:
1029:
1026:
1023:
1020:
1017:
1014:
1011:
1008:
1005:
1002:
999:
996:
993:
990:
987:
984:
981:
978:
968:
939:loop interchange
927:
923:
896:
893:
890:
887:
884:
881:
878:
875:
872:
869:
866:
863:
860:
857:
854:
851:
848:
845:
842:
839:
836:
833:
830:
827:
824:
821:
818:
815:
812:
809:
806:
803:
800:
797:
794:
791:
788:
785:
782:
779:
776:
773:
770:
767:
764:
761:
758:
755:
752:
749:
746:
743:
740:
737:
734:
731:
728:
725:
722:
719:
716:
713:
710:
707:
704:
701:
698:
695:
692:
689:
686:
683:
680:
677:
674:
671:
668:
665:
662:
659:
656:
653:
650:
647:
644:
641:
638:
635:
632:
629:
626:
623:
620:
617:
614:
611:
608:
605:
602:
599:
596:
593:
590:
587:
584:
581:
578:
575:
572:
569:
559:
556:
553:
550:
547:
544:
541:
538:
535:
532:
529:
526:
523:
520:
517:
514:
511:
508:
505:
502:
499:
496:
493:
490:
487:
484:
481:
478:
475:
472:
469:
466:
463:
460:
457:
454:
451:
448:
445:
442:
439:
436:
433:
430:
427:
424:
421:
418:
415:
412:
409:
406:
403:
400:
397:
394:
391:
388:
373:
366:
363:
360:
357:
354:
351:
348:
345:
342:
339:
336:
333:
330:
327:
324:
321:
318:
315:
312:
309:
306:
303:
300:
297:
294:
291:
288:
285:
282:
279:
276:
273:
270:
267:
264:
261:
258:
255:
245:
242:
239:
236:
233:
230:
227:
224:
221:
218:
215:
212:
209:
206:
203:
200:
170:, also known as
134:computer science
125:
118:
114:
111:
105:
103:
62:
38:
30:
21:
3119:
3118:
3114:
3113:
3112:
3110:
3109:
3108:
3094:
3093:
3092:
3087:
3068:Escape analysis
3036:Static analysis
3031:
2980:
2964:
2943:
2916:Code generation
2910:
2886:
2842:
2835:
2757:
2736:
2731:
2688:
2630:
2628:Further reading
2625:
2624:
2617:
2601:
2597:
2587:
2569:
2565:
2560:
2543:
2503:
2499:
2491:
2490:
2487:
2484:
2481:
2478:
2475:
2472:
2469:
2466:
2463:
2460:
2457:
2454:
2451:
2448:
2445:
2442:
2439:
2436:
2433:
2430:
2427:
2424:
2421:
2418:
2415:
2412:
2409:
2406:
2403:
2400:
2397:
2394:
2391:
2388:
2385:
2382:
2379:
2376:
2373:
2370:
2367:
2364:
2361:
2358:
2355:
2352:
2349:
2346:
2343:
2340:
2337:
2334:
2331:
2328:
2325:
2322:
2319:
2316:
2313:
2310:
2307:
2304:
2301:
2298:
2295:
2292:
2289:
2286:
2283:
2280:
2277:
2274:
2271:
2268:
2265:
2262:
2259:
2256:
2253:
2250:
2247:
2244:
2241:
2238:
2235:
2232:
2229:
2226:
2223:
2220:
2217:
2214:
2211:
2208:
2205:
2202:
2199:
2196:
2193:
2190:
2187:
2184:
2181:
2178:
2175:
2172:
2169:
2166:
2163:
2160:
2157:
2154:
2151:
2148:
2145:
2142:
2139:
2136:
2133:
2130:
2127:
2124:
2121:
2118:
2115:
2112:
2109:
2106:
2103:
2100:
2097:
2094:
2091:
2088:
2085:
2082:
2079:
2076:
2073:
2070:
2067:
2064:
2061:
2058:
2055:
2052:
2049:
2046:
2043:
2040:
2037:
2034:
2031:
2028:
2025:
2022:
2019:
2016:
2009:
1973:
1967:
1963:
1961:
1958:
1957:
1955:
1931:
1925:
1921:
1913:
1910:
1909:
1907:
1895:
1894:
1891:
1888:
1885:
1882:
1879:
1876:
1873:
1870:
1867:
1864:
1861:
1858:
1855:
1852:
1849:
1846:
1843:
1840:
1837:
1834:
1831:
1828:
1825:
1822:
1819:
1816:
1813:
1810:
1807:
1804:
1801:
1798:
1795:
1792:
1789:
1786:
1783:
1780:
1777:
1774:
1771:
1768:
1765:
1762:
1759:
1756:
1753:
1750:
1747:
1744:
1741:
1738:
1735:
1732:
1729:
1726:
1723:
1720:
1717:
1714:
1711:
1708:
1705:
1702:
1699:
1696:
1693:
1690:
1687:
1684:
1681:
1678:
1675:
1672:
1669:
1666:
1663:
1660:
1657:
1654:
1651:
1648:
1645:
1642:
1639:
1636:
1633:
1630:
1627:
1624:
1621:
1618:
1615:
1612:
1609:
1606:
1603:
1600:
1597:
1594:
1591:
1588:
1585:
1582:
1579:
1576:
1573:
1570:
1567:
1564:
1561:
1558:
1555:
1552:
1549:
1489:
1485:
1482:
1481:
1478:
1475:
1472:
1469:
1466:
1463:
1460:
1457:
1454:
1451:
1448:
1445:
1442:
1439:
1436:
1433:
1430:
1427:
1424:
1421:
1418:
1415:
1412:
1409:
1406:
1403:
1400:
1397:
1394:
1391:
1388:
1385:
1382:
1379:
1376:
1373:
1370:
1367:
1364:
1361:
1358:
1355:
1352:
1349:
1346:
1343:
1340:
1337:
1334:
1331:
1328:
1325:
1322:
1319:
1316:
1313:
1310:
1307:
1304:
1301:
1298:
1295:
1292:
1289:
1286:
1283:
1280:
1277:
1274:
1271:
1268:
1265:
1262:
1259:
1256:
1253:
1250:
1247:
1244:
1241:
1238:
1235:
1232:
1229:
1226:
1223:
1220:
1217:
1214:
1211:
1208:
1205:
1202:
1199:
1196:
1193:
1190:
1157:
1156:
1153:
1150:
1147:
1144:
1141:
1138:
1135:
1132:
1129:
1126:
1123:
1120:
1117:
1114:
1111:
1108:
1105:
1102:
1099:
1096:
1093:
1090:
1087:
1084:
1081:
1078:
1075:
1072:
1069:
1066:
1063:
1060:
1057:
1054:
1051:
1048:
1045:
1042:
1039:
1036:
1033:
1030:
1027:
1024:
1021:
1018:
1015:
1012:
1009:
1006:
1003:
1000:
997:
994:
991:
988:
985:
982:
979:
976:
966:
951:
934:
925:
921:
898:
897:
894:
891:
888:
885:
882:
879:
876:
873:
870:
867:
864:
861:
858:
855:
852:
849:
846:
843:
840:
837:
834:
831:
828:
825:
822:
819:
816:
813:
810:
807:
804:
801:
798:
795:
792:
789:
786:
783:
780:
777:
774:
771:
768:
765:
762:
759:
756:
753:
750:
747:
744:
741:
738:
735:
732:
729:
726:
723:
720:
717:
714:
711:
708:
705:
702:
699:
696:
693:
690:
687:
684:
681:
678:
675:
672:
669:
666:
663:
660:
657:
654:
651:
648:
645:
642:
639:
636:
633:
630:
627:
624:
621:
618:
615:
612:
609:
606:
603:
600:
597:
594:
591:
588:
585:
582:
579:
576:
573:
570:
567:
561:
560:
557:
554:
551:
548:
545:
542:
539:
536:
533:
530:
527:
524:
521:
518:
515:
512:
509:
506:
503:
500:
497:
494:
491:
488:
485:
482:
479:
476:
473:
470:
467:
464:
461:
458:
455:
452:
449:
446:
443:
440:
437:
434:
431:
428:
425:
422:
419:
416:
413:
410:
407:
404:
401:
398:
395:
392:
389:
386:
380:
371:
368:
367:
364:
361:
358:
355:
352:
349:
346:
343:
340:
337:
334:
331:
328:
325:
322:
319:
316:
313:
310:
307:
304:
301:
298:
295:
292:
289:
286:
283:
280:
277:
274:
271:
268:
265:
262:
259:
256:
253:
247:
246:
243:
240:
237:
234:
231:
228:
225:
222:
219:
216:
213:
210:
207:
204:
201:
198:
184:
126:
115:
109:
106:
63:
61:
51:
39:
28:
23:
22:
15:
12:
11:
5:
3117:
3107:
3106:
3089:
3088:
3086:
3085:
3080:
3078:Shape analysis
3075:
3070:
3065:
3060:
3055:
3050:
3045:
3043:Alias analysis
3039:
3037:
3033:
3032:
3030:
3029:
3024:
3019:
3017:Jump threading
3014:
3009:
3004:
2999:
2994:
2988:
2986:
2982:
2981:
2979:
2978:
2972:
2970:
2966:
2965:
2963:
2962:
2957:
2951:
2949:
2945:
2944:
2942:
2941:
2936:
2931:
2926:
2920:
2918:
2912:
2911:
2909:
2908:
2903:
2897:
2895:
2888:
2887:
2885:
2884:
2879:
2874:
2869:
2863:
2858:
2853:
2847:
2845:
2837:
2836:
2834:
2833:
2828:
2823:
2818:
2816:Loop unrolling
2813:
2811:Loop splitting
2808:
2803:
2798:
2796:Loop inversion
2793:
2788:
2783:
2778:
2773:
2767:
2765:
2759:
2758:
2756:
2755:
2750:
2744:
2742:
2738:
2737:
2730:
2729:
2722:
2715:
2707:
2701:
2700:
2695:
2687:
2686:External links
2684:
2683:
2682:
2675:
2666:
2653:
2640:
2629:
2626:
2623:
2622:
2615:
2595:
2585:
2562:
2561:
2559:
2556:
2555:
2554:
2549:
2542:
2539:
2526:
2525:
2517:
2015:
2002:
2001:
1983:
1980:
1976:
1970:
1966:
1941:
1938:
1934:
1928:
1924:
1920:
1917:
1548:
1527:register files
1189:
1183:
1182:
1179:
1172:
975:
963:
962:
950:
947:
933:
930:
566:
385:
379:
376:
252:
197:
183:
180:
158:linear algebra
128:
127:
42:
40:
33:
26:
9:
6:
4:
3:
2:
3116:
3105:
3102:
3101:
3099:
3084:
3081:
3079:
3076:
3074:
3071:
3069:
3066:
3064:
3061:
3059:
3056:
3054:
3051:
3049:
3046:
3044:
3041:
3040:
3038:
3034:
3028:
3025:
3023:
3020:
3018:
3015:
3013:
3010:
3008:
3005:
3003:
3000:
2998:
2995:
2993:
2990:
2989:
2987:
2983:
2977:
2974:
2973:
2971:
2967:
2961:
2958:
2956:
2955:Deforestation
2953:
2952:
2950:
2946:
2940:
2937:
2935:
2932:
2930:
2927:
2925:
2922:
2921:
2919:
2917:
2913:
2907:
2904:
2902:
2899:
2898:
2896:
2893:
2889:
2883:
2880:
2878:
2875:
2873:
2870:
2867:
2864:
2862:
2859:
2857:
2854:
2852:
2849:
2848:
2846:
2844:
2838:
2832:
2829:
2827:
2824:
2822:
2819:
2817:
2814:
2812:
2809:
2807:
2804:
2802:
2799:
2797:
2794:
2792:
2789:
2787:
2784:
2782:
2779:
2777:
2774:
2772:
2769:
2768:
2766:
2764:
2760:
2754:
2751:
2749:
2746:
2745:
2743:
2739:
2735:
2728:
2723:
2721:
2716:
2714:
2709:
2708:
2705:
2699:
2696:
2693:
2690:
2689:
2680:
2676:
2673:
2672:
2667:
2664:
2660:
2659:
2654:
2651:
2647:
2646:
2641:
2638:
2637:
2632:
2631:
2618:
2612:
2608:
2607:
2599:
2592:
2588:
2582:
2577:
2576:
2567:
2563:
2553:
2550:
2548:
2547:Duff's device
2545:
2544:
2538:
2536:
2531:
2523:
2518:
2514:
2513:
2512:
2508:
2497:
2013:
2006:
1999:
1998:
1997:
1981:
1978:
1974:
1968:
1964:
1939:
1936:
1932:
1926:
1922:
1918:
1915:
1903:
1899:
1546:
1544:
1538:
1536:
1530:
1528:
1524:
1520:
1516:
1512:
1511:
1505:
1503:
1497:
1493:
1187:
1180:
1177:
1173:
1170:
1166:
1162:
1161:
1160:
973:
970:
960:
959:
958:
956:
946:
944:
940:
929:
919:
915:
911:
907:
903:
564:
383:
375:
250:
195:
192:
189:
179:
177:
173:
172:loop blocking
169:
164:
162:
159:
155:
151:
147:
143:
139:
135:
124:
121:
113:
102:
99:
95:
92:
88:
85:
81:
78:
74:
71: –
70:
66:
65:Find sources:
59:
55:
49:
48:
43:This article
41:
37:
32:
31:
19:
2805:
2669:
2656:
2643:
2634:
2605:
2598:
2590:
2574:
2566:
2527:
2509:
2492:
2007:
2003:
1904:
1900:
1896:
1542:
1539:
1534:
1531:
1508:
1506:
1498:
1494:
1483:
1184:
1176:multiply-add
1171:in parallel.
1169:accumulators
1158:
971:
964:
952:
935:
917:
913:
909:
905:
901:
899:
562:
381:
369:
248:
193:
185:
175:
171:
167:
165:
154:Nested loops
141:
131:
116:
110:January 2021
107:
97:
90:
83:
76:
64:
52:Please help
47:verification
44:
2868:elimination
2786:Loop fusion
2741:Basic block
2537:are known.
932:Tiling size
168:loop tiling
18:Loop tiling
2948:Functional
2866:Dead store
2633:Wolfe, M.
2558:References
926:j = 1 to n
161:algorithms
80:newspapers
2841:Data-flow
2012:blocked.
1919:∗
3098:Category
2843:analysis
2668:Xue, J.
2541:See also
182:Overview
150:locality
140:design,
138:compiler
2591:tiling.
1994:
1956:
1952:
1908:
1543:blocked
961:C = AĂ—B
916:. When
94:scholar
2969:Global
2894:-based
2613:
2583:
2516:cache.
1537:is 2.
370:where
96:
89:
82:
75:
67:
2985:Other
2496:split
2473:acc11
2461:acc10
2449:acc01
2437:acc00
2410:acc11
2392:acc10
2374:acc01
2356:acc00
2293:acc11
2281:acc10
2269:acc01
2257:acc00
2239:acc11
2233:acc10
2227:acc01
2221:acc00
1880:acc11
1868:acc10
1856:acc01
1844:acc00
1817:acc11
1799:acc10
1781:acc01
1763:acc00
1709:acc11
1703:acc10
1697:acc01
1691:acc00
1523:68000
1470:acc11
1458:acc10
1446:acc01
1434:acc00
1407:acc11
1389:acc10
1371:acc01
1353:acc00
1299:acc11
1293:acc10
1287:acc01
1281:acc00
1165:adder
922:i = 1
372:min()
188:cache
101:JSTOR
87:books
2763:Loop
2663:POPL
2650:PLDI
2611:ISBN
2581:ISBN
2329:<
2251:else
2173:<
2128:<
2083:<
2038:<
1742:<
1661:<
1616:<
1571:<
1521:and
1515:RISC
1488:and
1332:<
1257:<
1212:<
1115:<
1061:<
1019:<
823:<
763:<
718:<
649:<
510:<
456:<
359:....
320:<
275:<
220:<
73:news
2892:SSA
2648:.
2308:for
2152:for
2107:for
2062:for
2017:for
1721:for
1640:for
1595:for
1550:for
1519:x86
1502:ISA
1311:for
1236:for
1191:for
1094:for
1040:for
998:for
977:int
912:by
904:by
826:min
802:for
766:min
742:for
697:for
628:for
622:100
613:int
568:int
489:for
435:for
429:100
420:int
387:int
323:min
299:for
254:for
241:...
199:for
174:or
132:In
56:by
3100::
2661:.
2589:.
2500:kk
2413:+=
2395:+=
2377:+=
2359:+=
2347:++
2338:kb
2332:kk
2320:kk
2212:==
2209:kk
2203:if
2191:+=
2182:ib
2176:ii
2164:ii
2140:+=
2098:kb
2095:+=
2092:kk
2080:kk
2068:kk
2053:ib
2050:+=
2047:ii
2035:ii
2023:ii
1820:+=
1802:+=
1784:+=
1766:+=
1754:++
1679:+=
1670:ib
1664:ii
1652:ii
1628:+=
1586:ib
1583:+=
1580:ii
1568:ii
1556:ii
1535:ib
1529:.
1410:+=
1392:+=
1374:+=
1356:+=
1344:++
1269:+=
1224:+=
1136:+=
1124:++
1070:++
1028:++
969:.
924:,
853:++
847:);
793:++
787:);
730:+=
661:+=
522:++
468:++
347:++
344:);
287:+=
229:++
178:.
163:.
2726:e
2719:t
2712:v
2619:.
2504:i
2488:}
2485:}
2482:}
2479:}
2476:;
2470:=
2467:C
2464:;
2458:=
2455:C
2452:;
2446:=
2443:C
2440:;
2434:=
2431:C
2428:}
2425:;
2422:A
2419:*
2416:B
2407:;
2404:A
2401:*
2398:B
2389:;
2386:A
2383:*
2380:B
2371:;
2368:A
2365:*
2362:B
2353:{
2350:)
2344:k
2341:;
2335:+
2326:k
2323:;
2317:=
2314:k
2311:(
2305:}
2302:;
2299:C
2296:=
2290:;
2287:C
2284:=
2278:;
2275:C
2272:=
2266:;
2263:C
2260:=
2254:{
2248:;
2245:0
2242:=
2236:=
2230:=
2224:=
2218:)
2215:0
2206:(
2200:{
2197:)
2194:2
2188:i
2185:;
2179:+
2170:i
2167:;
2161:=
2158:i
2155:(
2149:{
2146:)
2143:2
2137:j
2134:;
2131:N
2125:j
2122:;
2119:0
2116:=
2113:j
2110:(
2104:{
2101:)
2089:;
2086:N
2077:;
2074:0
2071:=
2065:(
2059:{
2056:)
2044:;
2041:N
2032:;
2029:0
2026:=
2020:(
2010:k
1982:b
1979:i
1975:/
1969:3
1965:N
1940:b
1937:k
1933:/
1927:3
1923:N
1916:2
1892:}
1889:}
1886:}
1883:;
1877:=
1874:C
1871:;
1865:=
1862:C
1859:;
1853:=
1850:C
1847:;
1841:=
1838:C
1835:}
1832:;
1829:A
1826:*
1823:B
1814:;
1811:A
1808:*
1805:B
1796:;
1793:A
1790:*
1787:B
1778:;
1775:A
1772:*
1769:B
1760:{
1757:)
1751:k
1748:;
1745:N
1739:k
1736:;
1733:0
1730:=
1727:k
1724:(
1718:;
1715:0
1712:=
1706:=
1700:=
1694:=
1688:{
1685:)
1682:2
1676:i
1673:;
1667:+
1658:i
1655:;
1649:=
1646:i
1643:(
1637:{
1634:)
1631:2
1625:j
1622:;
1619:N
1613:j
1610:;
1607:0
1604:=
1601:j
1598:(
1592:{
1589:)
1577:;
1574:N
1565:;
1562:0
1559:=
1553:(
1490:j
1486:i
1479:}
1476:}
1473:;
1467:=
1464:C
1461:;
1455:=
1452:C
1449:;
1443:=
1440:C
1437:;
1431:=
1428:C
1425:}
1422:;
1419:A
1416:*
1413:B
1404:;
1401:A
1398:*
1395:B
1386:;
1383:A
1380:*
1377:B
1368:;
1365:A
1362:*
1359:B
1350:{
1347:)
1341:k
1338:;
1335:N
1329:k
1326:;
1323:0
1320:=
1317:k
1314:(
1308:;
1305:0
1302:=
1296:=
1290:=
1284:=
1278:{
1275:)
1272:2
1266:j
1263:;
1260:N
1254:j
1251:;
1248:0
1245:=
1242:j
1239:(
1233:{
1230:)
1227:2
1221:i
1218:;
1215:N
1209:i
1206:;
1203:0
1200:=
1197:i
1194:(
1154:}
1151:}
1148:;
1145:B
1142:*
1139:A
1133:C
1130:)
1127:k
1121:;
1118:N
1112:k
1109:;
1106:0
1103:=
1100:k
1097:(
1091:;
1088:0
1085:=
1082:C
1079:{
1076:)
1073:j
1067:;
1064:N
1058:j
1055:;
1052:0
1049:=
1046:j
1043:(
1037:{
1034:)
1031:i
1025:;
1022:N
1016:i
1013:;
1010:0
1007:=
1004:i
1001:(
995:;
992:k
989:,
986:j
983:,
980:i
967:C
937:(
918:n
914:n
910:n
906:n
902:n
895:}
892:}
889:}
886:}
883:;
880:b
877:*
874:a
871:+
868:c
865:=
862:c
859:{
856:)
850:y
844:n
841:,
838:2
835:+
832:j
829:(
820:y
817:;
814:j
811:=
808:y
805:(
799:{
796:)
790:x
784:n
781:,
778:2
775:+
772:i
769:(
760:x
757:;
754:i
751:=
748:x
745:(
739:{
736:)
733:2
727:j
724:;
721:n
715:j
712:;
709:0
706:=
703:j
700:(
694:;
691:0
688:=
685:c
682:;
679:0
676:=
673:c
670:{
667:)
664:2
658:i
655:;
652:n
646:i
643:;
640:0
637:=
634:i
631:(
625:;
619:=
616:n
610:;
607:c
604:,
601:b
598:,
595:a
592:,
589:y
586:,
583:x
580:,
577:j
574:,
571:i
558:}
555:}
552:;
549:b
546:*
543:a
540:+
537:c
534:=
531:c
528:{
525:)
519:j
516:;
513:n
507:j
504:;
501:0
498:=
495:j
492:(
486:;
483:0
480:=
477:c
474:{
471:)
465:i
462:;
459:n
453:i
450:;
447:0
444:=
441:i
438:(
432:;
426:=
423:n
417:;
414:c
411:,
408:b
405:,
402:a
399:,
396:j
393:,
390:i
365:}
362:}
356:{
353:)
350:i
341:B
338:+
335:j
332:,
329:N
326:(
317:i
314:;
311:j
308:=
305:i
302:(
296:{
293:)
290:B
284:j
281:;
278:N
272:j
269:;
266:0
263:=
260:j
257:(
244:}
238:{
235:)
232:i
226:;
223:N
217:i
214:;
211:0
208:=
205:i
202:(
123:)
117:(
112:)
108:(
98:·
91:·
84:·
77:·
50:.
20:)
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.