
Single instruction, multiple data


These computers had many limited-functionality processors that would work in parallel. For example, each of the 65,536 single-bit processors in a Thinking Machines CM-2 would execute the same instruction at the same time, allowing it, for instance, to logically combine 65,536 pairs of bits at a time, using a hypercube-connected network or processor-dedicated RAM to find its operands. Supercomputing moved away from the SIMD approach when inexpensive scalar

RISC-V's vector extension uses an alternative approach: instead of exposing the sub-register-level details to the programmer, the instruction set abstracts them out as a few "vector registers" that use the same interfaces across all CPUs with this instruction set. The hardware handles all alignment issues and "strip-mining" of loops. Machines with different vector sizes are able to run the same code. LLVM calls this vector type "vscale".
to be distinguished from older ones, the newer architectures are then considered "short-vector" architectures, as earlier SIMD and vector supercomputers had vector lengths from 64 to 64,000. A modern supercomputer is almost always a cluster of MIMD computers, each of which implements (short-vector) SIMD instructions.
with embedded SIMD have largely taken over this task from the CPU. Some systems also include permute functions that re-pack elements inside vectors, making them particularly useful for data processing and compression. They are also used in cryptography. The trend of general-purpose computing on GPUs
to target maximal cache optimality, though this technique will require more intermediate state. Note: batch-pipeline systems (for example, GPUs or software rasterization pipelines) are most advantageous for cache control when implemented with SIMD intrinsics, but they are not exclusive to SIMD features.
Different architectures provide different register sizes (e.g. 64, 128, 256 and 512 bits) and instruction sets, meaning that programmers must provide multiple implementations of vectorized code to operate optimally on any given CPU. In addition, the possible set of SIMD instructions grows with each
All of these developments have been oriented toward support for real-time graphics, and are therefore geared toward processing in two, three, or four dimensions, usually with vector lengths of between two and sixteen words, depending on data type and architecture. When new SIMD architectures need
of an image consists of three values for the brightness of the red (R), green (G) and blue (B) portions of the color. To change the brightness, the R, G and B values are read from memory, a value is added to (or subtracted from) them, and the resulting values are written back out to memory. Audio
Consumer software is typically expected to work on a range of CPUs covering multiple generations, which could limit the programmer's ability to use new SIMD instructions to improve the computational performance of a program. The solution is to include multiple versions of the same code that uses
With a SIMD processor there are two improvements to this process. For one, the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "retrieve this pixel, now retrieve the next pixel", a SIMD processor will have a single
Emscripten, Mozilla's C/C++-to-JavaScript compiler, with extensions can enable compilation of C++ programs that make use of SIMD intrinsics or GCC-style vector code to the SIMD API of JavaScript, resulting in equivalent speedups compared to scalar code. It also supports (and now prefers) the
The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and audio/video processing during the 1990s, demand grew for this particular type of computing power, and
SIMD within a register (SWAR) is a range of techniques and tricks used for performing SIMD in general-purpose registers on hardware that does not provide any direct support for SIMD instructions. This can be used to exploit parallelism in certain algorithms even on hardware that does not support SIMD directly.
confused matters somewhat, but today the system seems to have settled down (after AMD adopted SSE) and newer compilers should result in more SIMD-enabled software. Intel and AMD now both provide optimized math libraries that use SIMD instructions, and open source alternatives like
Instances of these types are immutable and in optimized code are mapped directly to SIMD registers. Operations expressed in Dart typically are compiled into a single instruction without any overhead. This is similar to C and C++ intrinsics. Benchmarks for
Instead of providing a SIMD datatype, compilers can also be hinted to auto-vectorize some loops, potentially taking some assertions about the lack of data dependency. This is not as flexible as manipulating SIMD variables directly, but is easier to use.
instruction that effectively says "retrieve n pixels" (where n is a number that varies from design to design). For a variety of reasons, this can take much less time than retrieving each pixel individually, as with a traditional CPU design.
software was at first slow, due to a number of problems. One was that many of the early SIMD instruction sets tended to slow overall performance of the system due to the re-use of existing floating point registers. Other systems, like
and controlled by a general purpose CPU) and is geared towards the huge datasets required by 3D and video processing applications. It differs from traditional ISAs by being SIMD from the ground up with no separate scalar registers.
there are simultaneous (parallel) computations, but each unit performs the exact same instruction at any given moment (just with different data). SIMD is particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio.
executing its own instruction stream, or as a coprocessor driven by ordinary CPU instructions. 3D graphics applications tend to lend themselves well to SIMD processing as they rely heavily on operations with 4-dimensional vectors.
guaranteeing the generation of vector code. Intel, AltiVec, and ARM NEON provide extensions widely adopted by the compilers targeting their CPUs. (More complex operations are the task of vector math libraries.)
The SIMD tripling of four 8-bit numbers. The CPU loads 4 numbers at once, multiplies them all in one SIMD multiplication, and saves them all at once back to RAM. In theory, the speed can be multiplied by 4.
Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.
The ordinary tripling of four 8-bit numbers. The CPU loads one 8-bit number into R1, multiplies it with R2, and then saves the answer from R3 back to RAM. This process is repeated for each number.
As using FMV requires code modification on GCC and Clang, vendors more commonly use library multi-versioning: this is easier to achieve as only compiler switches need to be changed.
Another advantage is that the instruction operates on all loaded data in a single operation. In other words, if the SIMD system works by loading up eight data points at once, the
that works similarly to the GCC extension. LLVM's libcxx seems to implement it. For GCC and libstdc++, a wrapper library that builds on top of the GCC extension is available.
An application that may take advantage of SIMD is one where the same value is being added to (or subtracted from) a large number of data points, a common operation in many
440:; the eight values are processed in parallel even on a non-superscalar processor, and a superscalar processor may be able to perform multiple SIMD operations in parallel. 2055: 715:
Small-scale (64 or 128 bits) SIMD became popular on general-purpose CPUs in the early 1990s and continued through 1997 and later with Motion Video Instructions (MVI) for Alpha.
takes the extensions a step further by abstracting them into a universal interface that can be used on any platform by providing a way of defining SIMD datatypes. The
, therefore assembly language programming is rarely needed. Additionally, many of the systems that would benefit from SIMD were supplied by Apple itself, for example iTunes and QuickTime.

ClearSpeed's CSX600 (2004) has 96 cores each with two double-precision floating point units, while the CSX700 (2008) has 192. Stream Processors is headed by computer architect Bill Dally.

; programmers familiar with one particular architecture may not expect this. Worse: the alignment may change from one revision or "compatible" processor to another.
also supports FMV. The setup is similar to GCC and Clang in that the code defines what instruction sets to compile for, but cloning is manually done via inlining.
, offered support for data types that were not interesting to a wide audience and had expensive context-switching instructions to switch between using the
Currently, implementing an algorithm with SIMD instructions usually requires human labor; most compilers do not generate SIMD instructions from a typical
operation being applied to the data will happen to all eight values at the same time. This parallelism is separate from the parallelism provided by a
An order of magnitude increase in code size is not uncommon when compared to equivalent scalar or equivalent vector code, and an order of magnitude
As of August 2020, the WebAssembly interface remains unfinished, but its portable 128-bit SIMD feature has already seen some use in many engines.
since Clang 7 allow for a simplified approach, with the compiler taking care of function duplication and selection. GCC and Clang require explicit
The MMX instruction set shared a register file with the floating-point stack, which caused inefficiencies when mixing floating-point and MMX code. However,
in the program or a library is duplicated and compiled for many instruction set extensions, and the program decides which one to use at run-time.
FMV, manually coded in assembly language, is quite commonly used in a number of performance-critical libraries such as glibc and libjpeg-turbo.
Even though Apple has stopped using PowerPC processors in their products, further development of AltiVec is continued in several PowerPC and
Further complexity arises in avoiding dependences within a series, such as a string of characters, since independence between elements is required for vectorization.
system. Since then, there have been several extensions to the SIMD instruction sets for both architectures. Advanced Vector Extensions AVX,
SIMDe: a portable implementation of platform-specific intrinsics for other platforms (e.g. SSE intrinsics for ARM NEON), using C/C++ headers
now chooses at runtime processor-specific implementations of its own math operations, including the use of SIMD-capable instructions.
is duplicated for many instruction set extensions, and the operating system or the program decides which one to load at run-time.
Conte, G.; Tommesani, S.; Zanichelli, F. (2000). "The long and winding road to high-performance image processing with MMX/SSE".
programming language, bringing the benefits of SIMD to web programs for the first time. The interface consists of two types:
package, available on NuGet, implements SIMD datatypes. Java also has a new proposed API for SIMD instructions available in
4221: 1223:. However, by 2017, SIMD.js has been taken out of the ECMAScript standard queue in favor of pursuing a similar interface in 4448: 3871: 2995: 2258: 2102: 1794: 243: 4471: 3881: 3022: 489:
Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring
323: 123: 2045: 1734: 4814: 4360: 2149: 985: 743: 335: 288:, which could operate on a "vector" of data with a single instruction. Vector processing was especially popularized by 4216: 4466: 3189: 3017: 2369: 1542: 981: 102: 254:, both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution. 4045: 4004: 3567: 2460: 2426: 2421: 2340: 2305: 1348:
Larger scale commercial SIMD processors are available from ClearSpeed Technology, Ltd. and Stream Processors, Inc.
1220: 1177:
In 2013 John McCutchan announced that he had created a high-performance interface to SIMD instruction sets for the
1.1 desktops in 1994 to accelerate MPEG decoding. Sun Microsystems introduced SIMD integer instructions in its "
Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
149: 551: 4809: 4769: 4603: 4454: 4141: 3179: 3027: 3000: 2861: 2475: 2436: 2293: 663: 316: 165: 144: 4819: 2056:
Article about Optimizing the Rendering Pipeline of Animated Models Using the Intel Streaming SIMD Extensions
ZiiLabs produced a SIMD-type processor for use on mobile devices, such as media players and mobile phones.
It is common for publishers of the SIMD instruction sets to make their own C/C++ language extensions with
in the 1970s and 1980s. Vector processing architectures are now considered separate from SIMD computers:
139: 1239:
It has generally proven difficult to find sustainable commercial applications for SIMD-only processors.
labels in the code to "clone" functions, while ICC does so automatically (under the command-line option
17 in an incubator module. It also has a safe fallback mechanism on unsupported CPUs to simple loops.
may not easily benefit from SIMD; however, it is theoretically possible to vectorize comparisons and
new register size. Unfortunately, for legacy support reasons, the older versions cannot be retired.
483: 465: 258: 1045:
Clang compiler also implements the feature, with an analogous interface defined in the IR. Rust's
203:. SIMD can be internal (part of the hardware design) and it can be directly accessible through an 4829: 4744: 4739: 4598: 4189: 3952: 3735: 3629: 3593: 3510: 3494: 3336: 3125: 3084: 3072: 2935: 2849: 2770: 2535: 2196: 2139: 1608: 1143: 1115:
either older or newer SIMD technologies, and pick one that best fits the user's CPU at run-time (
1001: 550:
but is still far better than non-predicated SIMD. Detailed comparative examples are given in the
469: 231: 1591: 4483: 4415: 4319: 4211: 4166: 3758: 3730: 3640: 3605: 3354: 3348: 3330: 3064: 3058: 2962: 2866: 2757: 2696: 2558: 2201: 1216: 922: 790: 774: 343: 1215:
and Intel announced at IDF 2013 that they are implementing McCutchan's specification for both
449:
Not all algorithms can be vectorized easily. For example, a flow-control-heavy task like code
4575: 4535: 4488: 4478: 4273: 4136: 4075: 3932: 3841: 3587: 3299: 3117: 2876: 2844: 2802: 2714: 2515: 2330: 2320: 2310: 2300: 2270: 2253: 2118: 1369: 1318: 1197: 1067: 891: 822: 751: 507: 479:
Programming with particular SIMD instruction sets can involve numerous low-level challenges.
437: 358: 2016: 1860: 1670: 1434: 960:
offered a rich system and can be programmed using increasingly sophisticated compilers from
4515: 4403: 4398: 4388: 4375: 4171: 3962: 3898: 3484: 3206: 3096: 3043: 2575: 2288: 2144: 2126: 1525:
Lee, R.B. (1995). "Realtime MPEG video via software decompression on a PA-RISC processor".
1139: 1031: 293: 219: 31: 17: 425:
would likewise, for volume control, multiply both Left and Right channels simultaneously.
8: 4678: 4633: 4433: 4299: 4009: 3994: 3814: 3665: 3647: 3611: 3599: 3253: 3200: 2977: 2893: 2775: 2630: 2525: 2384: 1395: 1131: 1094:
hint. This OpenMP interface has replaced a wide set of nonstandard extensions, including
956:
had somewhat more success, even though they entered the SIMD market later than the rest.
899: 490: 297: 200: 83: 1169:
supports LMV and this functionality is adopted by the Intel-backed Clear Linux project.
63:
4703: 4552: 4525: 4350: 4314: 4304: 4263: 4105: 4085: 4080: 4061: 3866: 3858: 3710: 3685: 3489: 3364: 2888: 2829: 2709: 2441: 2169: 2040: 1776: 1548: 1507: 1290: 1027: 903: 834: 659: 319: 247: 2065: 1695: 4749: 4425: 4383: 4278: 3819: 3786: 3702: 3634: 3535: 3525: 3515: 3446: 3441: 3436: 3359: 3288: 3194: 3154: 2787: 2737: 2687: 2663: 2545: 2485: 2480: 2362: 2278: 1538: 1279: 930: 907: 886: 810: 747: 547: 334:
microprocessor vendors turned to SIMD to meet the demand. Hewlett-Packard introduced
196: 1511: 841:. The Xetal has 320 16-bit processor elements especially designed for vision tasks. 719:. SIMD instructions can be found, to one degree or another, on most CPUs, including 4759: 4558: 4493: 4340: 4156: 4151: 4146: 4115: 3989: 3922: 3908: 3763: 3670: 3624: 3431: 3426: 3421: 3416: 3411: 3401: 3271: 3238: 3149: 3144: 3053: 2905: 2900: 2883: 2871: 2810: 2374: 2352: 2238: 2216: 2134: 2066:
Introduction to Parallel Computing from LLNL Lawrence Livermore National Laboratory
2050: 1552: 1530: 1499: 1491: 1451: 1430: 1116: 1000:
as well as AltiVec. Apple was the dominant purchaser of PowerPC chips from IBM and
794: 564: 473: 277: 215: 1566: 1356:. Their Storm-1 processor (2007) contains 80 SIMD cores controlled by a MIPS CPU. 1301:
was unusual in that one of its vector-float units could function as an autonomous
4623: 4563: 4498: 4345: 4335: 4268: 4258: 4100: 4090: 3903: 3888: 3836: 3740: 3715: 3552: 3545: 3396: 3391: 3386: 3325: 3233: 3223: 2945: 2780: 2732: 2495: 2379: 2347: 2248: 2243: 2164: 2072: 1275: 1247: 782: 736: 622: 618: 1990: 1802: 1488:
Proc. Fifth IEEE Int'l Workshop on Computer Architectures for Machine Perception
207:(ISA), but it should not be confused with an ISA. SIMD describes computers with 4754: 4570: 4227: 4120: 4014: 3848: 3831: 3824: 3720: 3577: 3314: 3228: 3159: 2742: 2704: 2653: 2648: 2643: 2357: 2181: 1972:
Jensen, Peter; Jibaja, Ivan; Hu, Ningxin; Gohman, Dan; McCutchan, John (2015).
1632: 1534: 1205: 1038: 1258:
applications like conversion between various video standards and frame rates (
1208:
visualization show near 400% speedup compared to scalar code written in Dart.
461:
Large register files increase power consumption and required chip area.
365:
architecture in 1996. This sparked the introduction of the much more powerful
4803: 4643: 4520: 3809: 3725: 2765: 2747: 2540: 2233: 1635:, showing RSA implemented using a non-SIMD SSE2 integer multiply instruction. 1495: 1298: 1271: 630: 416:
applications. One example would be changing the brightness of an image. Each
312: 227: 223: 2668: 1455: 1194: 4243: 4019: 3957: 3773: 3750: 3562: 3283: 2221: 2077: 1946: 1527:
digest of papers Compcon '95. Technologies for the Information Superhighway
1338: 980:. However, in 2006, Apple computers moved to Intel x86 processors. Apple's 802: 641: 281: 1254:. The GAPP's recent incarnations have become a powerful tool in real-time 906:. Compilers also often lacked support, requiring programmers to resort to 873:
SIMD instructions are widely used to process 3D graphics, although modern
531:
effectiveness (work done per instruction) is achievable with Vector ISAs.
4764: 3804: 3768: 3479: 3451: 3309: 3164: 2087: 1503: 1224: 856: 211:
that perform the same operation on multiple data points simultaneously.
3690: 3680: 3675: 3657: 3557: 3530: 2792: 2625: 2595: 2315: 1879:"Transparent use of library packages optimized for Intel® architecture" 1645: 1353: 1349: 1286: 1212: 1124: 953: 874: 413: 347: 327: 235: 1720: 472:
in compilers is an active area of computer science research. (Compare
4638: 4613: 4030: 3781: 3778: 3520: 2590: 2568: 1620: 1307: 1297:
has incorporated a SIMD processor somewhere in its architecture. The
1071: 1005: 977: 716: 583: 270: 2061:"Yeppp!": cross-platform, open-source SIMD library from Georgia Tech 1762: 1567:"AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X Review" 864: 4688: 4668: 4593: 3796: 2615: 1842: 961: 370: 1211:
McCutchan's work on Dart, now called SIMD.js, has been adopted by
4693: 4673: 4648: 4283: 2605: 2563: 1385: 1326: 1311: 1079: 957: 845: 830: 818: 740: 732: 724: 450: 397: 389: 373: 366: 339: 234:
designs include SIMD instructions to improve the performance of
30:"SIMD" redirects here. For the cryptographic hash function, see 4663: 4658: 2620: 2585: 2550: 2081: 1410: 1087: 973: 918: 895: 778: 673: 521: 301: 3078: 2610: 2580: 1166: 1147: 989: 879: 838: 786: 766: 417: 1749:"The JIT finally proposed. JIT and SIMD are getting married" 380:
systems. Intel responded in 1999 by introducing the all-new
307:
The first era of modern SIMD computers was characterized by
4698: 4628: 4618: 3942: 3090: 3010: 2600: 1928: 1611:, showing how SSE2 is used to implement SHA hash algorithms 1405: 1400: 1377: 1373: 1330: 1259: 1095: 1042: 1015: 997: 993: 947: 814: 798: 770: 763: 759: 645: 511: 393: 385: 351: 289: 170: 542:
as "Associative Processing", more commonly known today as
300:
does not, due to Flynn's work (1966, 1972) pre-dating the
4608: 4585: 2530: 2520: 1322: 1263: 969: 965: 914: 720: 626: 362: 1947:"tc39/ecmascript_simd: SIMD numeric type for EcmaScript" 1590:
Patterson, David; Waterman, Andrew (18 September 2017).
1485: 357:
The first widely deployed desktop SIMD was with Intel's
180: 350:
microprocessor. MIPS followed suit with their similar
1435:"Some Computer Organizations and Their Effectiveness" 1317:
A later processor that used vector processing is the
1250:
and taken to the commercial sector by their spin-off
326:
approaches based on commodity processors such as the
2051:
Short Vector Extensions in Commercial Microprocessor
1971: 1185:
Float32x4, 4 single precision floating point values.
848:
SIMD instructions process 512 bits of data at once.
1735:"RyuJIT: The next-generation JIT compiler for .NET" 1285:A more ubiquitous application for SIMD is found in 1321:used in the Playstation 3, which was developed by 330:became more powerful, and interest in SIMD waned. 1589: 59:may be compromised due to out-of-date information 4801: 1623:, showing a stream cipher implemented using SSE2 1242:One that has had some measure of success is the 882:) may lead to wider use of SIMD in the future. 563:Examples of SIMD supercomputers (not including 238:use. SIMD has three different subcategories in 150:Associative processing (predicated/masked SIMD) 269:The first use of SIMD instructions was in the 4046: 2103: 1633:Subject: up to 1.4x RSA throughput using SSE2 346:" instruction set extensions in 1995, in its 261:(GPUs) are often wide SIMD implementations. 3108:Computer performance by orders of magnitude 1974:"SIMD in JavaScript via C++ and Emscripten" 1130:Library multi-versioning (LMV): the entire 4053: 4039: 2117: 2110: 2096: 1902: 1234: 1119:). There are two main camps of solutions: 837:, developed several SIMD processors named 829:'s instruction set is heavily SIMD based. 392:are developed by Intel. AMD supports AVX, 1991:"Porting SIMD code targeting WebAssembly" 1423: 1333:. It uses a number of SIMD processors (a 546:SIMD. This approach is not as compact as 34:. For the Scottish statistical tool, see 1109: 863: 855: 179: 1675:Using the GNU Compiler Collection (GCC) 1021: 662:, models 1 and 2 (CM-1 and CM-2), from 615:Geometric-Arithmetic Parallel Processor 14: 4802: 4060: 2009:"ZiiLABS ZMS-05 ARM 9 Media Processor" 1592:"SIMD Instructions Considered Harmful" 917:had a slow start. The introduction of 821:. The IBM, Sony, Toshiba co-developed 36:Scottish index of multiple deprivation 4034: 2091: 1871: 1429: 3079:Floating-point operations per second 1843:"Function multi-versioning in GCC 6" 1479: 1406:Single Program, Multiple Data (SPMD) 1337:architecture, each with independent 246:. 
SIMT should not be confused with 40: 1905:"Bringing SIMD to the web via Dart" 1524: 1231:WebAssembly 128-bit SIMD proposal. 1123:Function multi-versioning (FMV): a 1053:) uses this interface, and so does 24: 2046:Cracking Open The Pentium 3 (1999) 1172: 1060:C++ has an experimental interface 744:Multimedia Acceleration eXtensions 145:Pipelined processing (packed SIMD) 25: 4841: 2034: 1823: 1188:Int32x4, 4 32-bit integer values. 946:have started to appear (see also 538:takes another approach, known in 189:Single instruction, multiple data 184:Single instruction, multiple data 4783: 4782: 4005:Semiconductor device fabrication 1008:designs from Freescale and IBM. 443: 45: 4254:Analysis of parallel algorithms 3980:History of general-purpose CPUs 2207:Nondeterministic Turing machine 2001: 1995:Emscripten 1.40.1 documentation 1983: 1965: 1939: 1921: 1896: 1853: 1835: 1817: 1787: 1769: 1755: 1741: 1727: 1713: 1688: 1663: 1638: 1621:Salsa20 speed; Salsa20 software 1155: 1151: 1103: 1099: 1091: 1075: 1061: 1050: 1046: 594:ICL Distributed Array Processor 280:of the early 1970s such as the 273:, which was completed in 1966. 2160:Deterministic finite automaton 1826:"OMP5.1: Loop Transformations" 1801:. 18 July 2012. 
1246:, which was developed by 1160:Rust programming language 536:Scalable Vector Extension 493:) and can be inefficient. 259:graphics processing units 4000:Hardware security module 3343:Digital signal processor 3320:Graphics processing unit 3132:Graphics processing unit 1496:10.1109/CAMP.2000.875989 1202:3D vertex transformation 785:'s ARC Video subsystem, 4745:Embarrassingly parallel 4740:Deterministic algorithm 3953:Dynamic voltage scaling 3736:Memory address register 3630:Branch target predictor 3594:Address generation unit 3337:Physics processing unit 3126:Central processing unit 3085:Transactions per second 3073:Instructions per second 2996:Array processing (SIMT) 2140:Stored-program computer 1456:10.1109/TC.1972.5009071 1235:Commercial applications 1144:GNU Compiler Collection 1062:std::experimental::simd 1002:Freescale Semiconductor 470:Automatic vectorization 468:program, for instance. 276:SIMD was the basis for 140:Array processing (SIMT) 4460:Associative processing 4416:Non-blocking algorithm 4222:Clustered multi-thread 3759:Hardwired control unit 3641:Memory management unit 3606:Memory management unit 3355:Secure cryptoprocessor 3349:Tensor Processing Unit 3331:Vision processing unit 3065:Cycles per instruction 3059:Instructions per cycle 3006:Associative processing 2697:Instruction pipelining 2119:Processor technologies 1795:"Tutorial pragma simd" 1700:Clang 11 documentation 1289:: nearly every modern 1076:System.Numerics.Vector 1012:SIMD within a register 870: 861: 296:includes them whereas 216:data level parallelism 214:Such machines exploit 185: 4576:Hardware acceleration 4489:Superscalar processor 4479:Dataflow architecture 4076:Distributed computing 3842:Sum-addressed decoder 3588:Arithmetic logic unit 2715:Classic RISC pipeline 2669:Epiphany architecture 2516:Motorola 68000 series 1861:"2045-target-feature" 1763:"JEP 338: Vector API" 1276:image noise reduction 1198:matrix multiplication 1110:SIMD multi-versioning 867: 859: 544:"Predicated" 
(masked) 438:superscalar processor 286:Texas Instruments ASC 278:vector supercomputers 240:Flynn's 1972 Taxonomy 183: 111:Multiple data streams 4810:Classes of computers 4455:Pipelined processing 4404:Explicit parallelism 4399:Implicit parallelism 4389:Dataflow programming 3963:Performance per watt 3541:replacement policies 3207:Package on a package 3097:Performance per watt 3001:Pipelined processing 2771:Tomasulo's algorithm 2576:Clipper architecture 2432:Application-specific 2145:Finite-state machine 1929:"SIMD in JavaScript" 1883:Clear Linux* Project 1737:. 30 September 2013. 1529:. pp. 186–192. 1470:"MIMD1 - XP/S, CM-5" 1325:in cooperation with 1032:operator overloading 1022:Programmer interface 32:SIMD (hash function) 4679:Parallel Extensions 4484:Pipelined processor 3995:Digital electronics 3648:Instruction decoder 3600:Floating-point unit 3254:Soft microprocessor 3201:System in a package 2776:Reservation station 2306:Transport-triggered 1671:"Vector Extensions" 1132:programming library 1028:intrinsic functions 568: 197:parallel processing 4825:Parallel computing 4553:Massively parallel 4531:distributed shared 4351:Cache invalidation 4315:Instruction window 4106:Manycore processor 4086:Massively parallel 4081:Parallel computing 4062:Parallel computing 3867:Integrated circuit 3711:Processor register 3365:Baseband processor 2710:Operand forwarding 2170:Cellular automaton 2071:2013-06-10 at the 1805:on 4 December 2020 1721:"VcDevel/std-simd" 1433:(September 1972). 
1821: 1815: 1814: 1812: 1810: 1791: 1785: 1784: 1773: 1767: 1766: 1759: 1753: 1752: 1745: 1739: 1738: 1731: 1725: 1724: 1717: 1711: 1710: 1708: 1706: 1692: 1686: 1685: 1683: 1681: 1667: 1661: 1660: 1658: 1656: 1642: 1636: 1630: 1624: 1618: 1612: 1606: 1600: 1599: 1587: 1581: 1580: 1578: 1577: 1571:www.phoronix.com 1563: 1557: 1556: 1522: 1516: 1515: 1483: 1477: 1476: 1474: 1466: 1460: 1459: 1439: 1427: 1396:Flynn's taxonomy 1270:formats, etc.), 1256:video processing 1157: 1153: 1117:dynamic dispatch 1105: 1101: 1093: 1092:#pragma omp simd 1077: 1063: 1052: 1048: 569: 561: 540:Flynn's Taxonomy 435: 298:Flynn's Taxonomy 252:hardware threads 248:software threads 201:Flynn's taxonomy 84:Flynn's taxonomy 80: 79: 73: 70: 64: 57:factual accuracy 49: 48: 41: 21: 4845: 4844: 4840: 4839: 4838: 4836: 4835: 4834: 4800: 4799: 4798: 4793: 4774: 4718: 4624:Coarray Fortran 4580: 4564:Beowulf cluster 4420: 4370: 4361:Synchronization 4346:Cache coherence 4336:Multiprocessing 4324: 4288: 4269:Cost efficiency 4264:Gustafson's law 4232: 4176: 4125: 4101:Multiprocessing 4091:Cloud computing 4064: 4059: 4029: 4024: 4010:Tick–tock model 3968: 3924: 3913: 3853: 3837:Address decoder 3791: 3745: 3741:Program counter 3716:Status register 3697: 3652: 3612:Load–store unit 3579: 3572: 3499: 3468: 3369: 3326:Image processor 3301: 3294: 3264: 3258: 3234:Microcontroller 3224:Embedded system 3212: 3112: 3045: 3034: 2972: 2922: 2820: 2797: 2781:Re-order buffer 2752: 2733:Data dependency 2719: 2678: 2508: 2502: 2401: 2400:Instruction set 2394: 2380:Multiprocessing 2348:Cache hierarchy 2341:Register/memory 2265: 2165:Queue automaton 2121: 2116: 2073:Wayback Machine 2037: 2032: 2031: 2022: 2020: 2007: 2006: 2002: 1989: 1988: 1984: 1976: 1970: 1966: 1956: 1954: 1945: 1944: 1940: 1927: 1926: 1922: 1914: 1907: 1901: 1897: 1887: 1885: 1877: 1876: 1872: 1859: 1858: 1854: 1841: 1840: 1836: 1828: 1822: 1818: 1808: 1806: 1793: 1792: 1788: 1775: 1774: 1770: 1761: 1760: 1756: 1751:. 7 April 2014. 
1747: 1746: 1742: 1733: 1732: 1728: 1719: 1718: 1714: 1704: 1702: 1694: 1693: 1689: 1679: 1677: 1669: 1668: 1664: 1654: 1652: 1644: 1643: 1639: 1631: 1627: 1619: 1615: 1607: 1603: 1588: 1584: 1575: 1573: 1565: 1564: 1560: 1545: 1523: 1519: 1484: 1480: 1472: 1468: 1467: 1463: 1437: 1428: 1424: 1419: 1362: 1266:, NTSC to/from 1248:Lockheed Martin 1237: 1175: 1173:SIMD on the web 1112: 1074:in RyuJIT. The 1024: 854: 713: 623:Lockheed Martin 619:Martin Marietta 560: 446: 433: 410: 267: 195:) is a type of 74: 68: 65: 62: 54:This article's 50: 46: 39: 28: 23: 22: 15: 12: 11: 5: 4843: 4833: 4832: 4830:SIMD computing 4827: 4822: 4817: 4812: 4795: 4794: 4792: 4791: 4779: 4776: 4775: 4773: 4772: 4767: 4762: 4757: 4755:Race condition 4752: 4747: 4742: 4737: 4732: 4726: 4724: 4720: 4719: 4717: 4716: 4711: 4706: 4701: 4696: 4691: 4686: 4681: 4676: 4671: 4666: 4661: 4656: 4651: 4646: 4641: 4636: 4631: 4626: 4621: 4616: 4611: 4606: 4601: 4596: 4590: 4588: 4582: 4581: 4579: 4578: 4573: 4568: 4567: 4566: 4556: 4550: 4549: 4548: 4543: 4538: 4533: 4528: 4523: 4513: 4512: 4511: 4506: 4499:Multiprocessor 4496: 4491: 4486: 4481: 4476: 4475: 4474: 4469: 4464: 4463: 4462: 4457: 4452: 4441: 4430: 4428: 4422: 4421: 4419: 4418: 4413: 4412: 4411: 4406: 4401: 4391: 4386: 4380: 4378: 4372: 4371: 4369: 4368: 4363: 4358: 4353: 4348: 4343: 4338: 4332: 4330: 4326: 4325: 4323: 4322: 4317: 4312: 4307: 4302: 4296: 4294: 4290: 4289: 4287: 4286: 4281: 4276: 4271: 4266: 4261: 4256: 4251: 4246: 4240: 4238: 4234: 4233: 4231: 4230: 4228:Hardware scout 4225: 4219: 4214: 4209: 4203: 4198: 4192: 4186: 4184: 4182:Multithreading 4178: 4177: 4175: 4174: 4169: 4164: 4159: 4154: 4149: 4144: 4139: 4133: 4131: 4127: 4126: 4124: 4123: 4121:Systolic array 4118: 4113: 4108: 4103: 4098: 4093: 4088: 4083: 4078: 4072: 4070: 4066: 4065: 4058: 4057: 4050: 4043: 4035: 4026: 4025: 4023: 4022: 4017: 4015:Pin grid array 4012: 4007: 4002: 3997: 3992: 3987: 3982: 3976: 3974: 3970: 3969: 3967: 3966: 3960: 3955: 3950: 3945: 3940: 
3935: 3929: 3927: 3919: 3918: 3915: 3914: 3912: 3911: 3906: 3901: 3896: 3891: 3886: 3885: 3884: 3879: 3874: 3863: 3861: 3855: 3854: 3852: 3851: 3849:Barrel shifter 3846: 3845: 3844: 3839: 3832:Binary decoder 3829: 3828: 3827: 3817: 3812: 3807: 3801: 3799: 3793: 3792: 3790: 3789: 3784: 3776: 3771: 3766: 3761: 3755: 3753: 3747: 3746: 3744: 3743: 3738: 3733: 3728: 3723: 3721:Stack register 3718: 3713: 3707: 3705: 3699: 3698: 3696: 3695: 3694: 3693: 3688: 3678: 3673: 3668: 3662: 3660: 3654: 3653: 3651: 3650: 3645: 3644: 3643: 3632: 3627: 3622: 3621: 3620: 3614: 3603: 3597: 3591: 3584: 3582: 3571: 3570: 3565: 3560: 3555: 3550: 3549: 3548: 3543: 3538: 3533: 3528: 3523: 3513: 3507: 3505: 3501: 3500: 3498: 3497: 3492: 3487: 3482: 3476: 3474: 3470: 3469: 3467: 3466: 3465: 3464: 3454: 3449: 3444: 3439: 3434: 3429: 3424: 3419: 3414: 3409: 3404: 3399: 3394: 3389: 3383: 3381: 3375: 3374: 3371: 3370: 3368: 3367: 3362: 3357: 3352: 3346: 3340: 3334: 3328: 3323: 3317: 3315:AI accelerator 3312: 3306: 3304: 3296: 3295: 3293: 3292: 3286: 3281: 3278:Multiprocessor 3275: 3268: 3266: 3260: 3259: 3257: 3256: 3251: 3246: 3241: 3236: 3231: 3229:Microprocessor 3226: 3220: 3218: 3217:By application 3211: 3210: 3204: 3198: 3192: 3187: 3182: 3177: 3172: 3167: 3162: 3160:Tile processor 3157: 3152: 3147: 3142: 3141: 3140: 3129: 3122: 3120: 3114: 3113: 3111: 3110: 3105: 3100: 3094: 3088: 3082: 3076: 3070: 3069: 3068: 3056: 3050: 3048: 3040: 3039: 3036: 3035: 3033: 3032: 3031: 3030: 3020: 3015: 3014: 3013: 3008: 3003: 2998: 2988: 2982: 2980: 2974: 2973: 2971: 2970: 2965: 2960: 2955: 2954: 2953: 2948: 2946:Hyperthreading 2938: 2932: 2930: 2928:Multithreading 2924: 2923: 2921: 2920: 2915: 2910: 2909: 2908: 2898: 2897: 2896: 2891: 2881: 2880: 2879: 2874: 2864: 2859: 2858: 2857: 2852: 2841: 2839: 2832: 2826: 2825: 2822: 2821: 2819: 2818: 2813: 2807: 2805: 2799: 2798: 2796: 2795: 2790: 2785: 2784: 2783: 2778: 2768: 2762: 2760: 2754: 2753: 2751: 2750: 2745: 2740: 2735: 2729: 2727: 2721: 2720: 2718: 2717: 
2712: 2707: 2705:Pipeline stall 2701: 2699: 2690: 2684: 2683: 2680: 2679: 2677: 2676: 2671: 2666: 2661: 2658: 2657: 2656: 2654:z/Architecture 2651: 2646: 2641: 2633: 2628: 2623: 2618: 2613: 2608: 2603: 2598: 2593: 2588: 2583: 2578: 2573: 2572: 2571: 2566: 2561: 2553: 2548: 2543: 2538: 2533: 2528: 2523: 2518: 2512: 2510: 2504: 2503: 2501: 2500: 2499: 2498: 2488: 2483: 2478: 2473: 2468: 2463: 2458: 2457: 2456: 2446: 2445: 2444: 2434: 2429: 2424: 2419: 2413: 2411: 2404: 2396: 2395: 2393: 2392: 2387: 2382: 2377: 2372: 2367: 2366: 2365: 2360: 2358:Virtual memory 2350: 2345: 2344: 2343: 2338: 2333: 2328: 2318: 2313: 2308: 2303: 2298: 2297: 2296: 2286: 2281: 2275: 2273: 2267: 2266: 2264: 2263: 2262: 2261: 2256: 2251: 2246: 2236: 2231: 2226: 2225: 2224: 2219: 2214: 2209: 2204: 2199: 2194: 2189: 2182:Turing machine 2179: 2178: 2177: 2172: 2167: 2162: 2157: 2152: 2142: 2137: 2131: 2129: 2123: 2122: 2115: 2114: 2107: 2100: 2092: 2086: 2085: 2075: 2063: 2058: 2053: 2048: 2043: 2036: 2035:External links 2033: 2030: 2029: 2000: 1982: 1964: 1938: 1920: 1917:on 2013-12-03. 1895: 1870: 1852: 1834: 1816: 1786: 1781:www.openmp.org 1768: 1754: 1740: 1726: 1712: 1687: 1662: 1650:Stack Overflow 1637: 1625: 1613: 1609:RE: SSE2 speed 1601: 1582: 1558: 1543: 1517: 1478: 1461: 1450:(9): 948–960. 1421: 1420: 1418: 1415: 1414: 1413: 1408: 1403: 1398: 1393: 1388: 1361: 1358: 1319:Cell Processor 1236: 1233: 1206:Mandelbrot set 1190: 1189: 1186: 1174: 1171: 1136: 1135: 1128: 1111: 1108: 1070:added SIMD to 1039:GNU C Compiler 1023: 1020: 954:Apple Computer 875:graphics cards 853: 850: 823:Cell Processor 752:MMX and iwMMXt 712: 709: 706: 705: 696: 692: 691: 682: 678: 677: 676:MP-1 and MP-2 671: 667: 666: 657: 653: 652: 639: 635: 634: 621:(continued at 612: 608: 607: 602: 598: 597: 591: 587: 586: 581: 577: 576: 573: 559: 556: 518: 517: 516: 515: 514:corrects this. 504: 500: 497: 494: 487: 484:data alignment 477: 462: 459: 445: 442: 409: 406: 369:system in the 313:supercomputers 266: 263: 230:. 
Most modern 176: 175: 174: 173: 168: 160: 159: 155: 154: 153: 152: 147: 142: 134: 133: 129: 128: 127: 126: 121: 113: 112: 108: 107: 106: 105: 100: 92: 91: 87: 86: 76: 75: 53: 51: 44: 26: 9: 6: 4: 3: 2: 4842: 4831: 4828: 4826: 4823: 4821: 4818: 4816: 4813: 4811: 4808: 4807: 4805: 4790: 4781: 4780: 4777: 4771: 4768: 4766: 4763: 4761: 4758: 4756: 4753: 4751: 4748: 4746: 4743: 4741: 4738: 4736: 4733: 4731: 4728: 4727: 4725: 4721: 4715: 4712: 4710: 4707: 4705: 4702: 4700: 4697: 4695: 4692: 4690: 4687: 4685: 4682: 4680: 4677: 4675: 4672: 4670: 4667: 4665: 4662: 4660: 4657: 4655: 4652: 4650: 4647: 4645: 4644:Global Arrays 4642: 4640: 4637: 4635: 4632: 4630: 4627: 4625: 4622: 4620: 4617: 4615: 4612: 4610: 4607: 4605: 4602: 4600: 4597: 4595: 4592: 4591: 4589: 4587: 4583: 4577: 4574: 4572: 4571:Grid computer 4569: 4565: 4562: 4561: 4560: 4557: 4554: 4551: 4547: 4544: 4542: 4539: 4537: 4534: 4532: 4529: 4527: 4524: 4522: 4519: 4518: 4517: 4514: 4510: 4507: 4505: 4502: 4501: 4500: 4497: 4495: 4492: 4490: 4487: 4485: 4482: 4480: 4477: 4473: 4470: 4468: 4465: 4461: 4458: 4456: 4453: 4450: 4447: 4446: 4445: 4442: 4440: 4437: 4436: 4435: 4432: 4431: 4429: 4427: 4423: 4417: 4414: 4410: 4407: 4405: 4402: 4400: 4397: 4396: 4395: 4392: 4390: 4387: 4385: 4382: 4381: 4379: 4377: 4373: 4367: 4364: 4362: 4359: 4357: 4354: 4352: 4349: 4347: 4344: 4342: 4339: 4337: 4334: 4333: 4331: 4327: 4321: 4318: 4316: 4313: 4311: 4308: 4306: 4303: 4301: 4298: 4297: 4295: 4291: 4285: 4282: 4280: 4277: 4275: 4272: 4270: 4267: 4265: 4262: 4260: 4257: 4255: 4252: 4250: 4247: 4245: 4242: 4241: 4239: 4235: 4229: 4226: 4223: 4220: 4218: 4215: 4213: 4210: 4207: 4204: 4202: 4199: 4196: 4193: 4191: 4188: 4187: 4185: 4183: 4179: 4173: 4170: 4168: 4165: 4163: 4160: 4158: 4155: 4153: 4150: 4148: 4145: 4143: 4140: 4138: 4135: 4134: 4132: 4128: 4122: 4119: 4117: 4114: 4112: 4109: 4107: 4104: 4102: 4099: 4097: 4094: 4092: 4089: 4087: 4084: 4082: 4079: 4077: 4074: 4073: 4071: 4067: 4063: 4056: 4051: 4049: 4044: 4042: 
4037: 4036: 4033: 4021: 4018: 4016: 4013: 4011: 4008: 4006: 4003: 4001: 3998: 3996: 3993: 3991: 3988: 3986: 3983: 3981: 3978: 3977: 3975: 3971: 3964: 3961: 3959: 3956: 3954: 3951: 3949: 3946: 3944: 3941: 3939: 3936: 3934: 3931: 3930: 3928: 3926: 3920: 3910: 3907: 3905: 3902: 3900: 3897: 3895: 3892: 3890: 3887: 3883: 3880: 3878: 3875: 3873: 3870: 3869: 3868: 3865: 3864: 3862: 3860: 3856: 3850: 3847: 3843: 3840: 3838: 3835: 3834: 3833: 3830: 3826: 3823: 3822: 3821: 3818: 3816: 3813: 3811: 3810:Demultiplexer 3808: 3806: 3803: 3802: 3800: 3798: 3794: 3788: 3785: 3783: 3780: 3777: 3775: 3772: 3770: 3767: 3765: 3762: 3760: 3757: 3756: 3754: 3752: 3748: 3742: 3739: 3737: 3734: 3732: 3731:Memory buffer 3729: 3727: 3726:Register file 3724: 3722: 3719: 3717: 3714: 3712: 3709: 3708: 3706: 3704: 3700: 3692: 3689: 3687: 3684: 3683: 3682: 3679: 3677: 3674: 3672: 3669: 3667: 3666:Combinational 3664: 3663: 3661: 3659: 3655: 3649: 3646: 3642: 3639: 3638: 3636: 3633: 3631: 3628: 3626: 3623: 3618: 3615: 3613: 3610: 3609: 3607: 3604: 3601: 3598: 3595: 3592: 3589: 3586: 3585: 3583: 3581: 3575: 3569: 3566: 3564: 3561: 3559: 3556: 3554: 3551: 3547: 3544: 3542: 3539: 3537: 3534: 3532: 3529: 3527: 3524: 3522: 3519: 3518: 3517: 3514: 3512: 3509: 3508: 3506: 3502: 3496: 3493: 3491: 3488: 3486: 3483: 3481: 3478: 3477: 3475: 3471: 3463: 3460: 3459: 3458: 3455: 3453: 3450: 3448: 3445: 3443: 3440: 3438: 3435: 3433: 3430: 3428: 3425: 3423: 3420: 3418: 3415: 3413: 3410: 3408: 3405: 3403: 3400: 3398: 3395: 3393: 3390: 3388: 3385: 3384: 3382: 3380: 3376: 3366: 3363: 3361: 3358: 3356: 3353: 3350: 3347: 3344: 3341: 3338: 3335: 3332: 3329: 3327: 3324: 3321: 3318: 3316: 3313: 3311: 3308: 3307: 3305: 3303: 3297: 3290: 3287: 3285: 3282: 3279: 3276: 3273: 3270: 3269: 3267: 3261: 3255: 3252: 3250: 3247: 3245: 3242: 3240: 3237: 3235: 3232: 3230: 3227: 3225: 3222: 3221: 3219: 3215: 3208: 3205: 3202: 3199: 3196: 3193: 3191: 3188: 3186: 3183: 3181: 3178: 3176: 3173: 3171: 3168: 3166: 3163: 3161: 3158: 3156: 
3153: 3151: 3148: 3146: 3143: 3139: 3136: 3135: 3133: 3130: 3127: 3124: 3123: 3121: 3119: 3115: 3109: 3106: 3104: 3101: 3098: 3095: 3092: 3089: 3086: 3083: 3080: 3077: 3074: 3071: 3066: 3063: 3062: 3060: 3057: 3055: 3052: 3051: 3049: 3047: 3041: 3029: 3026: 3025: 3024: 3021: 3019: 3016: 3012: 3009: 3007: 3004: 3002: 2999: 2997: 2994: 2993: 2992: 2989: 2987: 2984: 2983: 2981: 2979: 2975: 2969: 2966: 2964: 2961: 2959: 2956: 2952: 2949: 2947: 2944: 2943: 2942: 2939: 2937: 2934: 2933: 2931: 2929: 2925: 2919: 2916: 2914: 2911: 2907: 2904: 2903: 2902: 2899: 2895: 2892: 2890: 2887: 2886: 2885: 2882: 2878: 2875: 2873: 2870: 2869: 2868: 2865: 2863: 2860: 2856: 2853: 2851: 2848: 2847: 2846: 2843: 2842: 2840: 2836: 2833: 2831: 2827: 2817: 2814: 2812: 2809: 2808: 2806: 2804: 2800: 2794: 2791: 2789: 2786: 2782: 2779: 2777: 2774: 2773: 2772: 2769: 2767: 2766:Scoreboarding 2764: 2763: 2761: 2759: 2755: 2749: 2748:False sharing 2746: 2744: 2741: 2739: 2736: 2734: 2731: 2730: 2728: 2726: 2722: 2716: 2713: 2711: 2708: 2706: 2703: 2702: 2700: 2698: 2694: 2691: 2689: 2685: 2675: 2672: 2670: 2667: 2665: 2662: 2659: 2655: 2652: 2650: 2647: 2645: 2642: 2640: 2637: 2636: 2634: 2632: 2629: 2627: 2624: 2622: 2619: 2617: 2614: 2612: 2609: 2607: 2604: 2602: 2599: 2597: 2594: 2592: 2589: 2587: 2584: 2582: 2579: 2577: 2574: 2570: 2567: 2565: 2562: 2560: 2557: 2556: 2554: 2552: 2549: 2547: 2544: 2542: 2541:Stanford MIPS 2539: 2537: 2534: 2532: 2529: 2527: 2524: 2522: 2519: 2517: 2514: 2513: 2511: 2505: 2497: 2494: 2493: 2492: 2489: 2487: 2484: 2482: 2479: 2477: 2474: 2472: 2469: 2467: 2464: 2462: 2459: 2455: 2452: 2451: 2450: 2447: 2443: 2440: 2439: 2438: 2435: 2433: 2430: 2428: 2425: 2423: 2420: 2418: 2415: 2414: 2412: 2408: 2405: 2403: 2402:architectures 2397: 2391: 2388: 2386: 2383: 2381: 2378: 2376: 2373: 2371: 2370:Heterogeneous 2368: 2364: 2361: 2359: 2356: 2355: 2354: 2351: 2349: 2346: 2342: 2339: 2337: 2334: 2332: 2329: 2327: 2324: 2323: 2322: 2321:Memory access 2319: 2317: 2314: 2312: 
2309: 2307: 2304: 2302: 2299: 2295: 2292: 2291: 2290: 2287: 2285: 2282: 2280: 2277: 2276: 2274: 2272: 2268: 2260: 2257: 2255: 2254:Random-access 2252: 2250: 2247: 2245: 2242: 2241: 2240: 2237: 2235: 2234:Stack machine 2232: 2230: 2227: 2223: 2220: 2218: 2215: 2213: 2210: 2208: 2205: 2203: 2200: 2198: 2195: 2193: 2190: 2188: 2185: 2184: 2183: 2180: 2176: 2173: 2171: 2168: 2166: 2163: 2161: 2158: 2156: 2153: 2151: 2150:with datapath 2148: 2147: 2146: 2143: 2141: 2138: 2136: 2133: 2132: 2130: 2128: 2124: 2120: 2113: 2108: 2106: 2101: 2099: 2094: 2093: 2090: 2083: 2079: 2076: 2074: 2070: 2067: 2064: 2062: 2059: 2057: 2054: 2052: 2049: 2047: 2044: 2042: 2039: 2038: 2019:on 2011-07-18 2018: 2014: 2010: 2004: 1996: 1992: 1986: 1975: 1968: 1952: 1948: 1942: 1935:. 8 May 2014. 1934: 1930: 1924: 1913: 1906: 1899: 1884: 1880: 1874: 1866: 1862: 1856: 1848: 1844: 1838: 1827: 1820: 1804: 1800: 1796: 1790: 1782: 1778: 1772: 1764: 1758: 1750: 1744: 1736: 1730: 1722: 1716: 1701: 1697: 1691: 1676: 1672: 1666: 1651: 1647: 1641: 1634: 1629: 1622: 1617: 1610: 1605: 1597: 1593: 1586: 1572: 1568: 1562: 1554: 1550: 1546: 1544:0-8186-7029-0 1540: 1536: 1532: 1528: 1521: 1513: 1509: 1505: 1504:11381/2297671 1501: 1497: 1493: 1489: 1482: 1471: 1465: 1457: 1453: 1449: 1445: 1444: 1436: 1432: 1426: 1422: 1412: 1409: 1407: 1404: 1402: 1399: 1397: 1394: 1392: 1389: 1387: 1383: 1379: 1375: 1371: 1367: 1364: 1363: 1357: 1355: 1351: 1346: 1343: 1340: 1336: 1332: 1328: 1324: 1320: 1315: 1313: 1309: 1304: 1300: 1299:PlayStation 2 1296: 1292: 1288: 1283: 1281: 1277: 1273: 1272:deinterlacing 1269: 1265: 1261: 1257: 1253: 1249: 1245: 1240: 1232: 1228: 1226: 1222: 1218: 1214: 1209: 1207: 1203: 1199: 1196: 1187: 1184: 1183: 1182: 1180: 1170: 1168: 1163: 1161: 1152:target_clones 1149: 1145: 1141: 1133: 1129: 1126: 1122: 1121: 1120: 1118: 1107: 1097: 1089: 1083: 1081: 1073: 1069: 1065: 1058: 1056: 1044: 1040: 1035: 1033: 1029: 1019: 1017: 1013: 1009: 1007: 1003: 999: 995: 991: 987: 983: 979: 975: 971: 967: 
963: 959: 955: 951: 949: 945: 941: 937: 932: 928: 924: 920: 916: 911: 909: 905: 901: 897: 893: 888: 883: 881: 876: 866: 858: 849: 847: 842: 840: 836: 832: 828: 824: 820: 817:(MaDMaX) and 816: 812: 808: 804: 800: 796: 792: 788: 784: 780: 776: 772: 768: 765: 761: 757: 753: 749: 745: 742: 738: 734: 730: 726: 722: 718: 704: 700: 697: 694: 693: 690: 686: 683: 680: 679: 675: 672: 669: 668: 665: 661: 658: 655: 654: 651: 647: 643: 640: 637: 636: 632: 631:Silicon Optix 628: 624: 620: 616: 613: 610: 609: 606: 603: 600: 599: 595: 592: 589: 588: 585: 582: 579: 578: 574: 571: 570: 566: 555: 553: 549: 545: 541: 537: 532: 530: 525: 523: 513: 509: 505: 501: 498: 495: 492: 488: 485: 481: 480: 478: 475: 471: 467: 463: 460: 456: 452: 448: 447: 444:Disadvantages 441: 439: 430: 426: 424: 419: 415: 405: 401: 399: 395: 391: 387: 383: 379: 375: 372: 368: 364: 360: 355: 353: 349: 345: 341: 337: 331: 329: 328:Intel i860 XP 325: 321: 320:CM-1 and CM-2 318: 314: 310: 305: 303: 299: 295: 291: 287: 283: 279: 274: 272: 262: 260: 255: 253: 249: 245: 241: 237: 233: 229: 228:digital audio 225: 224:digital image 221: 217: 212: 210: 206: 202: 198: 194: 190: 182: 172: 169: 167: 164: 163: 162: 161: 157: 156: 151: 148: 146: 143: 141: 138: 137: 136: 135: 131: 130: 125: 122: 120: 117: 116: 115: 114: 110: 109: 104: 101: 99: 96: 95: 94: 93: 89: 88: 85: 82: 81: 72: 60: 58: 52: 43: 42: 37: 33: 19: 4443: 4329:Coordination 4259:Amdahl's law 4195:Simultaneous 4020:Chip carrier 3958:Clock gating 3877:Mixed-signal 3774:Write buffer 3751:Control unit 3563:Clock signal 3302:accelerators 3284:Cypress PSoC 2990: 2941:Simultaneous 2758:Out-of-order 2390:Neuromorphic 2271:Architecture 2229:Belt machine 2222:Zeno machine 2155:Hierarchical 2021:. Retrieved 2017:the original 2012: 2003: 1994: 1985: 1967: 1955:. Retrieved 1950: 1941: 1932: 1923: 1912:the original 1898: 1886:. Retrieved 1882: 1873: 1864: 1855: 1846: 1837: 1819: 1807:. Retrieved 1803:the original 1798: 1789: 1780: 1771: 1757: 1743: 1729: 1715: 1703:. 
Retrieved 1699: 1690: 1678:. Retrieved 1674: 1665: 1653:. Retrieved 1649: 1640: 1628: 1616: 1604: 1595: 1585: 1574:. Retrieved 1570: 1561: 1526: 1520: 1487: 1481: 1464: 1447: 1441: 1425: 1347: 1344: 1316: 1312:Direct3D 9.0 1284: 1241: 1238: 1229: 1221:SpiderMonkey 1210: 1191: 1176: 1164: 1137: 1113: 1100:#pragma simd 1084: 1066: 1059: 1036: 1025: 1011: 1010: 952: 912: 884: 872: 843: 809:technology, 714: 703:Pyxsys, Inc. 644:(MPP), from 533: 528: 526: 519: 455:"batch flow" 454: 431: 427: 411: 402: 356: 348:UltraSPARC I 332: 315:such as the 306: 282:CDC Star-100 275: 268: 256: 213: 192: 188: 187: 118: 66: 55: 4765:Scalability 4526:distributed 4409:Concurrency 4376:Programming 4217:Cooperative 4206:Speculative 4142:Instruction 3805:Multiplexer 3769:Data buffer 3480:Single-core 3452:bit slicing 3310:Coprocessor 3165:Coprocessor 3046:performance 2968:Cooperative 2958:Speculative 2918:Distributed 2877:Superscalar 2862:Instruction 2830:Parallelism 2803:Speculative 2635:System/3x0 2507:Instruction 2284:Von Neumann 2197:Post–Turing 1957:8 September 1888:8 September 1339:local store 1287:video games 1278:, adaptive 1225:WebAssembly 1090:4.0+ has a 1047:packed_simd 220:concurrency 4804:Categories 4770:Starvation 4509:asymmetric 4244:PRAM model 4212:Preemptive 3925:management 3820:Multiplier 3681:Logic gate 3671:Sequential 3578:Functional 3558:Clock rate 3531:Data cache 3504:Components 3485:Multi-core 3473:Core count 2963:Preemptive 2867:Pipelining 2850:Bit-serial 2793:Wide-issue 2738:Structural 2660:Tilera ISA 2626:MicroBlaze 2596:ETRAX CRIS 2491:Comparison 2336:Load–store 2316:Endianness 2023:2010-05-24 1705:16 January 1680:16 January 1655:16 January 1576:2023-07-13 1417:References 1354:Bill Dally 1350:ClearSpeed 1213:ECMAScript 1125:subroutine 793:and VIS2, 689:Wavetracer 625:, then at 558:Chronology 529:or greater 506:The early 414:multimedia 408:Advantages 376:and IBM's 236:multimedia 218:, but not 69:March 2017 4504:symmetric 4249:PEM model 3859:Circuitry 3779:Microcode 
3703:Registers 3546:coherence 3521:CPU cache 3379:Word size 3044:Processor 2688:Execution 2591:DEC Alpha 2569:Power ISA 2385:Cognitive 2192:Universal 1308:Microsoft 1068:Microsoft 1051:std::sims 1006:Power ISA 978:QuickTime 904:registers 685:Zephyr DC 670:1987-1996 638:1983-1991 584:ILLIAC IV 271:ILLIAC IV 4735:Deadlock 4723:Problems 4689:pthreads 4669:OpenHMPP 4594:Ateji PX 4555:computer 4426:Hardware 4293:Elements 4279:Slowdown 4190:Temporal 4172:Pipeline 3797:Datapath 3490:Manycore 3462:variable 3300:Hardware 2936:Temporal 2616:OpenRISC 2311:Cellular 2301:Dataflow 2294:modified 2069:Archived 1809:9 August 1799:CilkPlus 1512:13180531 1360:See also 1262:to/from 1102:, GCC's 962:Motorola 913:SIMD on 910:coding. 902:and MMX 852:Software 844:Intel's 711:Hardware 575:Example 371:Motorola 354:system. 304:(1977). 284:and the 158:See also 4694:RaftLib 4674:OpenACC 4649:GPUOpen 4639:C++ AMP 4614:Charm++ 4356:Barrier 4300:Process 4284:Speedup 4069:General 3973:Related 3904:Quantum 3894:Digital 3889:Boolean 3787:Counter 3686:Quantum 3447:512-bit 3442:256-bit 3437:128-bit 3280:(MPSoC) 3265:on chip 3263:Systems 3081:(FLOPS) 2894:Process 2743:Control 2725:Hazards 2611:Itanium 2606:Unicore 2564:PowerPC 2289:Harvard 2249:Pointer 2244:Counter 2202:Quantum 2013:ZiiLabs 1847:lwn.net 1596:SIGARCH 1553:2262046 1386:AVX-512 1327:Toshiba 1252:Teranex 1158:). 
The 1080:OpenJDK 958:AltiVec 940:SIMDx86 936:libSIMD 846:AVX-512 831:Philips 819:MIPS-3D 746:(MAX), 741:PA-RISC 733:PowerPC 725:AltiVec 627:Teranex 451:parsing 398:AVX-512 390:AVX-512 374:PowerPC 367:AltiVec 340:PA-RISC 311:-style 265:History 257:Modern 4787:  4664:OpenCL 4659:OpenMP 4604:Chapel 4521:shared 4516:Memory 4451:(SIMT) 4394:Models 4305:Thread 4237:Theory 4208:(SpMT) 4162:Memory 4147:Thread 4130:Levels 3909:Switch 3899:Analog 3637:(IMC) 3608:(MMU) 3457:others 3432:64-bit 3427:48-bit 3422:32-bit 3417:24-bit 3412:16-bit 3407:15-bit 3402:12-bit 3239:Mobile 3155:Stream 3150:Barrel 3145:Vector 3134:(GPU) 3093:(SUPS) 3061:(IPC) 2913:Memory 2906:Vector 2889:Thread 2872:Scalar 2674:Others 2621:RISC-V 2586:SuperH 2555:Power 2551:MIPS-X 2526:PDP-11 2375:Fabric 2127:Models 2082:GitHub 1951:GitHub 1933:01.org 1551:  1541:  1510:  1411:OpenCL 1293:since 1204:, and 1088:OpenMP 1057:2.0+. 974:iTunes 919:3DNow! 896:3DNow! 833:, now 779:3DNow! 771:SSE4.x 674:MasPar 596:(DAP) 554:page. 534:ARM's 522:RISC-V 396:, and 302:Cray-1 4634:Dryad 4599:Boost 4320:Array 4310:Fiber 4224:(CMT) 4197:(SMT) 4111:GPGPU 3965:(PPW) 3923:Power 3815:Adder 3691:Array 3658:Logic 3619:(TLB) 3602:(FPU) 3596:(AGU) 3590:(ALU) 3580:units 3516:Cache 3397:8-bit 3392:4-bit 3387:1-bit 3351:(TPU) 3345:(DSP) 3339:(PPU) 3333:(VPU) 3322:(GPU) 3291:(NoC) 3274:(SoC) 3209:(PoP) 3203:(SiP) 3197:(MCM) 3138:GPGPU 3128:(CPU) 3118:Types 3099:(PPW) 3087:(TPS) 3075:(IPS) 3067:(CPI) 2838:Level 2649:S/390 2644:S/370 2639:S/360 2581:SPARC 2559:POWER 2442:TRIPS 2410:Types 2078:simde 1977:(PDF) 1915:(PDF) 1908:(PDF) 1829:(PDF) 1549:S2CID 1508:S2CID 1473:(PDF) 1438:(PDF) 1167:Glibc 1148:Clang 1055:Swift 1014:, or 990:XCode 944:SLEEF 931:Intel 880:GPGPU 839:Xetal 787:SPARC 767:SSSE3 748:Intel 717:Alpha 701:from 699:Xplor 687:from 617:from 418:pixel 378:POWER 4699:ROCm 4629:CUDA 4619:Cilk 4586:APIs 4546:COMA 4541:NUMA 4472:MIMD 4467:MISD 4444:SIMD 4439:SISD 4167:Loop 4157:Data 4152:Task 3943:ACPI 3676:Glue 3568:FIFO 
3511:Core 3249:ASIP 3190:CPLD 3185:FPOA 3180:FPGA 3175:ASIC 3028:SPMD 3023:MIMD 3018:MISD 3011:SWAR 2991:SIMD 2986:SISD 2901:Data 2884:Task 2855:Word 2601:M32R 2546:MIPS 2509:sets 2476:ZISC 2471:NISC 2466:OISC 2461:MISC 2454:EPIC 2449:VLIW 2437:EDGE 2427:RISC 2422:CISC 2331:HUMA 2326:NUMA 1959:2019 1890:2019 1811:2020 1707:2020 1682:2020 1657:2020 1539:ISBN 1448:C-21 1378:SSE3 1374:SSE2 1335:NUMA 1331:Sony 1329:and 1295:1998 1268:HDTV 1260:NTSC 1244:GAPP 1219:and 1179:Dart 1156:/Qax 1096:Cilk 1072:.NET 1043:LLVM 1037:The 1016:SWAR 998:SSE3 996:and 994:SSE2 984:and 982:APIs 976:and 968:and 948:libm 942:and 925:and 894:and 815:MDMX 811:MIPS 807:Neon 799:MAJC 769:and 764:SSE3 760:SSE2 731:for 727:and 695:2001 681:1991 656:1985 646:NASA 629:and 611:1981 601:1976 590:1974 580:1974 572:Year 512:SSE2 423:DSPs 394:AVX2 388:and 386:AVX2 352:MDMX 324:MIMD 290:Cray 244:SIMT 193:SIMD 171:MPMD 166:SPMD 124:MIMD 119:SIMD 103:MISD 98:SISD 18:SIMD 4714:ZPL 4709:TBB 4704:UPC 4684:PVM 4654:MPI 4609:HPX 4536:UMA 4137:Bit 3938:APM 3933:PMU 3825:CPU 3782:ROM 3553:Bus 3170:PAL 2845:Bit 2631:LMC 2536:ARM 2531:x86 2521:VAX 2080:on 1531:doi 1500:hdl 1492:doi 1452:doi 1370:MMX 1323:IBM 1310:'s 1303:DSP 1264:PAL 1195:4Ă—4 1098:'s 970:GNU 966:IBM 950:). 929:by 927:SSE 923:AMD 921:by 915:x86 900:FPU 892:MMX 835:NXP 827:SPU 825:'s 805:'s 803:ARM 797:'s 795:Sun 791:VIS 789:'s 783:ARC 777:'s 775:AMD 756:SSE 750:'s 739:'s 729:SPE 723:'s 721:IBM 508:MMX 434:add 382:SSE 363:x86 359:MMX 344:VIS 336:MAX 250:or 232:CPU 199:in 4806:: 3872:3D 2011:. 1993:. 1949:. 1931:. 1881:. 1863:. 1845:. 1797:. 1779:. 1698:. 1673:. 1648:. 1594:. 1569:. 1547:. 1537:. 1506:. 1498:. 1490:. 1446:. 1440:. 1384:, 1380:, 1376:, 1372:, 1368:, 1274:, 1217:V8 1200:, 1142:, 964:, 938:, 869:4. 813:' 801:, 781:, 773:, 762:, 758:, 754:, 737:HP 735:, 633:) 567:) 476:.) 4054:e 4047:t 4040:v 2111:e 2104:t 2097:v 2026:. 1997:. 1979:. 1961:. 1892:. 1867:. 1849:. 1831:. 1813:. 1783:. 1765:. 1709:. 1684:. 1659:. 1598:. 1579:. 1555:. 
1533:: 1514:. 1502:: 1494:: 1475:. 1458:. 1454:: 988:( 878:( 648:/ 466:C 191:( 71:) 67:( 61:. 38:. 20:)

Flynn's taxonomy:
  SISD (single instruction, single data)
  MISD (multiple instruction, single data)
  SIMD (single instruction, multiple data)
    Array processing (SIMT)
    Pipelined processing (packed SIMD)
    Associative processing (predicated/masked SIMD)
  MIMD (multiple instruction, multiple data)
  SPMD (single program, multiple data)
  MPMD (multiple program, multiple data)
