
Multiply–accumulate operation

In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply–add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC unit); the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a:

    a ← a + (b × c)

When done with floating-point numbers, it might be performed with two roundings (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).

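As a minimal sketch of the pattern in C (the function name and the use of double are mine; C99's fma() from <math.h> is the standard way to request the fused form explicitly):

    #include <math.h>

    /* The MAC pattern a <- a + (b * c), accumulated across two arrays.
     * The plain form "a += b[i] * c[i]" rounds twice per step; fma()
     * performs the same update with a single rounding. */
    double dot(const double *b, const double *c, int n)
    {
        double a = 0.0;
        for (int i = 0; i < n; ++i)
            a = fma(b[i], c[i], a);
        return a;
    }

When building with GCC or Clang, link with -lm for the math library.
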
Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle, the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers. Percy Ludgate was the first to conceive a MAC in his Analytical Machine of 1909, and the first to exploit a MAC for division (using multiplication seeded by reciprocal, via the convergent series (1+x)⁻¹). The first modern processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors.

In floating-point arithmetic

When done with integers, the operation is typically exact (computed modulo some power of two). However, floating-point numbers have only a certain amount of mathematical precision. That is, digital floating-point arithmetic is generally not associative or distributive. (See Floating-point arithmetic § Accuracy problems.) Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add). IEEE 754-2008 specifies that it must be performed with one rounding, yielding a more accurate result.

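The difference is easy to observe in C. In the sketch below (the value of a is one I chose so that its square is not exactly representable), the separately rounded expression cancels to zero, while the fused form recovers the rounding error of the product; compile with contraction disabled (for example, gcc -ffp-contract=off) so that the first expression really is evaluated with two roundings:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1p-27;     /* a*a is not exactly representable */
        double p = a * a;             /* product rounded to nearest double */

        double two_roundings = a * a - p;      /* rounds twice: exactly 0.0 */
        double one_rounding  = fma(a, a, -p);  /* rounds once: 2^-54, the exact
                                                  rounding error of the product */
        printf("%g %g\n", two_roundings, one_rounding);
        return 0;
    }
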
Fused multiply–add

A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation performed in one step (fused operation), with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.

A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:

- Dot product
- Matrix multiplication
- Polynomial evaluation (e.g., with Horner's rule; see the sketch after this list)
- Newton's method for evaluating functions (from the inverse function)
- Convolutions and artificial neural networks
- Multiplication in double-double arithmetic

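For polynomial evaluation, each Horner step maps directly onto one fused operation. A minimal sketch (the function name and calling convention are mine):

    #include <math.h>

    /* Horner's rule with one fused multiply-add per coefficient:
     * evaluates p(x) = c[0] + c[1]*x + ... + c[n]*x^n. */
    double horner_fma(const double c[], int n, double x)
    {
        double r = c[n];
        for (int i = n - 1; i >= 0; --i)
            r = fma(r, x, c[i]);   /* r = r*x + c[i], rounded once */
        return r;
    }
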
Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly. If x² − y² is evaluated as ((x × x) − y × y) (following Kahan's suggested notation in which redundant parentheses direct the compiler to round the (x × x) term first) using fused multiply–add, then the result may be negative even when x = y, due to the first multiplication discarding low significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.

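A small sketch of the pitfall (the starting value is one I picked so that x·x rounds upward; the fused line spells out explicitly what a contracting compiler may emit for x*x - y*y). Compile with contraction disabled, e.g. gcc -ffp-contract=off, so the unfused line keeps its two separate roundings:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0 + 0x3p-28;   /* chosen so that x*x rounds upward */
        double y = x;

        double unfused = x * x - y * y;        /* each product rounded: exactly 0.0 */
        double fused   = fma(x, x, -(y * y));  /* single rounding: small negative value */

        printf("unfused %g -> sqrt %g\n", unfused, sqrt(unfused));
        printf("fused   %g -> sqrt %g\n", fused, sqrt(fused));   /* sqrt of a negative: NaN */
        return 0;
    }
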
When implemented inside a microprocessor, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.

Another benefit of including this instruction is that it allows an efficient software implementation of division (see division algorithm) and square root (see methods of computing square roots) operations, thus eliminating the need for dedicated hardware for those operations.

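A sketch of why FMA helps here, assuming a rough initial reciprocal estimate is already available (for instance from a small table or a hardware approximation instruction); each Newton–Raphson step is just two fused operations. This only outlines the idea: producing a correctly rounded quotient or square root requires an additional, carefully rounded final step, as in the Goldschmidt-style algorithms referenced below.

    #include <math.h>

    /* Newton-Raphson refinement of a reciprocal estimate x ~ 1/d:
     * x <- x + x*(1 - d*x), built from two fused operations per step. */
    double recip_refine(double d, double x)
    {
        for (int i = 0; i < 4; ++i) {
            double e = fma(-d, x, 1.0);   /* e = 1 - d*x in a single rounding */
            x = fma(x, e, x);             /* x = x + x*e */
        }
        return x;
    }
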
2004: 1977: 1912: 1890: 1854: 1786: 1197: 947: 898: 869: 856:
Lyakhov, Pavel; Valueva, Maria; Valuev, Georgii; Nagornov, Nikolai (January 2020).
504:
The fused multiply–add operation was introduced as "multiply–add fused" in the IBM
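In scalar C, the arithmetic that such an instruction combines looks like the chain below (this is only an illustration of the data flow; a real dot-product instruction may group and round the partial sums differently):

    #include <math.h>

    /* Scalar equivalent of a four-element dot-product step:
     * a0*b0 + a1*b1 + a2*b2 + a3*b3 as chained fused operations. */
    double dot4(const double a[4], const double b[4])
    {
        double r = a[0] * b[0];
        r = fma(a[1], b[1], r);
        r = fma(a[2], b[2], r);
        r = fma(a[3], b[3], r);
        return r;
    }
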
Support

The FMA operation is included in IEEE 754-2008.

The Digital Equipment Corporation (DEC) VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single FMA step. This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977.

The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function and the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with standard pragmas (#pragma STDC FP_CONTRACT). The GCC and Clang C compilers do such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma, this can be globally controlled by the -ffp-contract command line option.

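A brief sketch of how the two mechanisms look in source (the function and variable names are mine):

    #include <math.h>

    #pragma STDC FP_CONTRACT OFF       /* forbid automatic contraction here */

    double residual(double a, double b, double c)
    {
        double p = a * b + c;          /* kept as a separate multiply and add */
        double q = fma(a, b, c);       /* explicitly requested single rounding */
        return q - p;                  /* nonzero whenever the two roundings differ */
    }

With GCC the pragma is reported as unimplemented (see the bug report in the references), so the equivalent control is the command-line option -ffp-contract=off (or =fast to allow contraction everywhere).
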
121: 172:). The first modern processors to be equipped with MAC units were 1774: 1641: 1586: 1502: 1487: 1451: 1382: 889:
The fused multiply–add operation was introduced as "multiply–add fused" in the IBM POWER1 (1990) processor, but has been added to numerous other processors since then:

- HP PA-8000 (1996) and above
- Hitachi SuperH SH-4 (1998)
- SCE-Toshiba Emotion Engine (1999)
- Intel Itanium (2001)
- STI Cell (2006)
- Fujitsu SPARC64 VI (2007) and above
- (MIPS-compatible) Loongson-2F (2008)
- Elbrus-8SV (2018)
- x86 processors with the FMA3 and/or FMA4 instruction set
  - AMD Bulldozer (2011, FMA4 only)
  - AMD Piledriver (2012, FMA3 and FMA4)
  - AMD Steamroller (2014)
  - AMD Excavator (2015)
  - AMD Zen (2017, FMA3 only)
  - Intel Haswell (2013, FMA3 only)
  - Intel Skylake (2015, FMA3 only)
- ARM processors with VFPv4 and/or NEONv2:
  - ARM Cortex-M4F (2010)
  - ARM Cortex-A5 (2012)
  - ARM Cortex-A7 (2013)
  - ARM Cortex-A15 (2012)
  - Qualcomm Krait (2012)
  - Apple A6 (2012)
  - All ARMv8 processors
  - STM32 Cortex-M33 (VFMA operation)
- Fujitsu A64FX has "Four-operand FMA with Prefix Instruction".
- IBM z/Architecture (since 1998)
- GPUs and GPGPU boards:
  - AMD GPUs (2009) and newer: TeraScale 2 "Evergreen"-series based and Graphics Core Next-based
  - Nvidia GPUs (2010) and newer: Fermi-based (2010), Kepler-based (2012), Maxwell-based (2014), Pascal-based (2016), Volta-based (2017)
  - Intel GPUs since Sandy Bridge
  - Intel MIC (2012)
  - ARM Mali T600 Series (2012) and above
- Vector processors:
  - NEC SX-Aurora TSUBASA
- RISC-V instruction set (2010)

See also

- Compound operator

References

- Lyakhov, Pavel; Valueva, Maria; Valuev, Georgii; Nagornov, Nikolai (January 2020). "A Method of Increasing Digital Filter Performance Based on Truncated Multiply-Accumulate Units". Applied Sciences. 10 (24): 9052. doi:10.3390/app10249052.
- Tung Thanh Hoang; Sjalander, M.; Larsson-Edefors, P. (May 2009). "Double Throughput Multiply-Accumulate unit for FlexCore processor enhancements". 2009 IEEE International Symposium on Parallel & Distributed Processing. pp. 1–7. doi:10.1109/IPDPS.2009.5161212. ISBN 978-1-4244-3751-1.
- Kang, Jongsung; Kim, Taewhan (2020-03-01). "PV-MAC: Multiply-and-accumulate unit structure exploiting precision variability in on-device convolutional neural networks". Integration. 71: 76–85. doi:10.1016/j.vlsi.2019.11.003. ISSN 0167-9260.
- "The Feasibility of Ludgate's Analytical Machine". Archived from the original on 2019-08-07.
- Montoye, R. K.; Hokenek, E.; Runyon, S. L. (January 1990). "Design of the IBM RISC System/6000 floating-point execution unit". IBM Journal of Research and Development. 34 (1): 59–70. doi:10.1147/rd.341.0059.
- Markstein, Peter (November 2004). Software Division and Square Root Using Goldschmidt's Algorithms. 6th Conference on Real Numbers and Computers. CiteSeerX 10.1.1.85.9648.
- Kahan, William (1996-05-31). "IEEE Standard 754 for Binary Floating-Point Arithmetic".
- Quinnell, Eric (May 2007). Floating-Point Fused Multiply–Add Architectures (PhD thesis).
- Whitehead, Nathan; Fit-Florea, Alex (2011). "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs". Nvidia.
- "Godson-3 Emulates x86: New MIPS-Compatible Chinese Processor Has Extensions for x86 Translation".
- Hollingsworth, Brent (October 2012). "New "Bulldozer" and "Piledriver" Instructions". AMD Developer Central.
- "Intel adds 22nm octo-core 'Haswell' to CPU design roadmap". The Register. Archived from the original on 2012-02-17.
- "VAX instruction of the week: POLY".
- "Bug 20785 - Pragma STDC * (C99 FP) unimplemented". gcc.gnu.org.
- "Optimize Options (Using the GNU Compiler Collection (GCC))". gcc.gnu.org.
- "STM32 Cortex-M33 MCUs programming manual" (PDF). ST.
- "fmadd instrs". IBM.
- "mad - ps". 20 November 2019.