181:, or PC is a register that holds the address that is presented to the instruction memory. The address is presented to instruction memory at the start of a cycle. Then during the cycle, the instruction is read out of instruction memory, and at the same time, a calculation is done to determine the next PC. The next PC is calculated by incrementing the PC by 4, and by choosing whether to take that as the next PC or to take the result of a branch/jump calculation as the next PC. Note that in classic RISC, all instructions have the same length. (This is one thing that separates RISC from CISC ). In the original RISC designs, the size of an instruction is 4 bytes, so always add 4 to the instruction address, but don't use PC + 4 for the case of a taken branch, jump, or exception (see
632:
205:. In the case of CISC micro-coded instructions, once fetched from the instruction cache, the instruction bits are shifted down the pipeline, where simple combinational logic in each pipeline stage produces control signals for the datapath directly from the instruction bits. In those CISC designs, very little decoding is done in the stage traditionally called the decode stage. A consequence of this lack of decoding is that more instruction bits have to be used to specifying what the instruction does. That leaves fewer bits for things like register indices.
27:
639:
522:
720:: Depending on the design of the delayed branch and the branch conditions, it is determined whether the instruction immediately following the branch instruction is executed even if the branch is taken. Instead of taking an IPC penalty for some fraction of branches either taken (perhaps 60%) or not taken (perhaps 40%), branch delay slots take an IPC penalty for those branches into which the compiler could not schedule the branch delay slot. The SPARC, MIPS, and MC88K designers designed a branch delay slot into their ISAs.
784:), may not want wrapping arithmetic. Some architectures (e.g. MIPS), define special addition operations that branch to special locations on overflow, rather than wrapping the result. Software at the target location is responsible for fixing the problem. This special branch is called an exception. Exceptions differ from regular branches in that the target address is not specified by the instruction itself, and the branch decision is dependent on the outcome of the instruction.
300:
instruction's destination register. On real silicon, this can be a hazard (see below for more on hazards). That is because one of the source registers being read in decode might be the same as the destination register being written in writeback. When that happens, then the same memory cells in the register file are being both read and written the same time. On silicon, many implementations of memory cells will not operate correctly when read and written at the same time.
407:
142:
229:
target computation generally required a 16 bit add and a 14 bit incrementer. Resolving the branch in the decode stage made it possible to have just a single-cycle branch mis-predict penalty. Since branches were very often taken (and thus mis-predicted), it was very important to keep this penalty low.
796:
Exceptions are different from branches and jumps, because those other control flow changes are resolved in the decode stage. Exceptions are resolved in the writeback stage. When an exception is detected, the following instructions (earlier in the pipeline) are marked as invalid, and as they flow to
299:
During this stage, both single cycle and two cycle instructions write their results into the register file. Note that two different stages are accessing the register file at the same time—the decode stage is reading two source registers, at the same time that the writeback stage is writing a previous
228:
The decode stage ended up with quite a lot of hardware: MIPS has the possibility of branching if two registers are equal, so a 32-bit-wide AND tree runs in series after the register file read, making a very long critical path through this stage (which means fewer cycles per second). Also, the branch
821:
There are two strategies to handle the suspend/resume problem. The first is a global stall signal. This signal, when activated, prevents instructions from advancing down the pipeline, generally by gating off the clock to the flip-flops at the start of each stage. The disadvantage of this strategy
220:
If the instruction decoded is a branch or jump, the target address of the branch or jump is computed in parallel with reading the register file. The branch condition is computed in the following cycle (after the register file is read), and if the branch is taken or if the instruction is a jump, the
755:
The most serious drawback to delayed branches is the additional control complexity they entail. If the delay slot instruction takes an exception, the processor has to be restarted on the branch, rather than that next instruction. Exceptions then have essentially two addresses, the exception address
713:
Branch Likely: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch was taken. The compiler can always fill the branch delay slot on such a branch, and since branches are more often taken than not, such branches have a smaller IPC penalty than
674:
machine relied on the compiler to add the NOP instructions in this case, rather than having the circuitry to detect and (more taxingly) stall the first two pipeline stages. Hence the name MIPS: Microprocessor without
Interlocked Pipeline Stages. It turned out that the extra NOP instructions added
502:
Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and cause the multiplexers to select the most recent data. These bypass multiplexers make it possible for the pipeline to
279:
During this stage, single cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both one and two cycle instructions always write their results in the same stage of the pipeline so that just one write port to the register file can be used, and
216:
At the same time the register file is read, instruction issue logic in this stage determines if the pipeline is ready to execute the instruction in this stage. If not, the issue logic causes both the
Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the input flip flops do
817:
Occasionally, either the data or instruction cache does not contain a required datum or instruction. In these cases, the CPU must suspend operation until the cache can be filled with the necessary data, and then must resume execution. The problem of filling the cache with the required data (and
808:
changes to the software visible state in the program order. This in-order commit happens very naturally in the classic RISC pipeline. Most instructions write their results to the register file in the writeback stage, and so those writes automatically happen in program order. Store instructions,
800:
To make it easy (and fast) for the software to fix the problem and restart the program, the CPU must take a precise exception. A precise exception means that all instructions up to the excepting instruction have been executed, and the excepting instruction and everything afterwards have not been
768:
The simplest solution, provided by most architectures, is wrapping arithmetic. Numbers greater than the maximum possible encoded value have their most significant bits chopped off until they fit. In the usual integer number system, 3000000000+3000000000=6000000000. With unsigned 32 bit wrapping
825:
Another strategy to handle suspend/resume is to reuse the exception logic. The machine takes an exception on the offending instruction, and all further instructions are invalidated. When the cache has been filled with the necessary data, the instruction that caused the cache miss restarts. To
325:
Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch. If the ALU were used in the decode stage for that purpose, an ALU instruction followed by a branch would have seen both
709:
Predict Not Taken: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as if it were a NOP), and one cycle's
266:
operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multicycle
153:, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). The vertical axis is successive instructions; the horizontal axis is time. So in the green column, the earliest instruction is in WB stage, and the latest instruction is undergoing instruction fetch.
208:
All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, the indexes of these two registers are identified within the instruction, and the indexes are presented to the register memory, as the address. Thus the two registers named are read from the
240:
The ALU is responsible for performing boolean operations (and, or, not, nand, nor, xor, xnor) and also for performing integer addition and subtraction. Besides the result, the ALU typically provides status bits such as whether or not the result was 0, or if an overflow occurred.
822:
is that there are a large number of flip flops, so the global stall signal takes a long time to propagate. Since the machine generally has to stall in the same cycle that it identifies the condition requiring the stall, the stall signal becomes a speed-limiting critical path.
726:: In parallel with fetching each instruction, guess if the instruction is a branch or jump, and if so, guess the target. On the cycle after a branch or jump, fetch the instruction at the guessed target. When the guess is wrong, flush the incorrectly fetched target.
675:
by the compiler expanded the program binaries enough that the instruction cache hit rate was reduced. The stall hardware, although expensive, was put back into later designs to improve instruction cache hit rate, at which point the acronym no longer made sense.
503:
execute simple instructions with just the latency of the ALU, the multiplexer, and a flip-flop. Without the multiplexers, the latency of writing and then reading the register file would have to be included in the latency of these instructions.
463:
it is normally written-back. The solution to this problem is a pair of bypass multiplexers. These multiplexers sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each multiplexer selects between:
690:
The branch resolution recurrence goes through quite a bit of circuitry: the instruction cache read, register file read, branch condition compute (which involves a 32-bit compare on the MIPS CPUs), and the next instruction address
251:
Register-Register
Operation (Single-cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute
607:
since it floats in the pipeline, like an air bubble in a water pipe, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a
809:
however, write their results to the Store Data Queue in the access stage. If the store instruction takes an exception, the Store Data Queue entry is invalidated so that it is not written to the cache data SRAM later.
741:
Compilers typically have some difficulty finding logically independent instructions to place after the branch (the instruction after the branch is called the delay slot), so that they must insert NOPs into the delay
590:- they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the
255:
Memory
Reference (Two-cycle latency). All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the
769:
arithmetic, 3000000000+3000000000=1705032704 (6000000000 mod 2^32). This may not seem terribly useful. The largest benefit of wrapping arithmetic is that every operation has a well defined result.
694:
Because branch and jump targets are calculated in parallel to the register read, RISC ISAs typically do not have instructions that branch to a register+offset address. Jump to register is supported.
582:
instruction is already through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the
797:
the end of the pipe their results are discarded. The program counter is set to the address of a special exception handler, and special registers are written with the exception location and cause.
667:'s Decode stage. This causes quite a performance hit, as the processor spends a lot of time processing nothing, but clock speeds can be increased as there is less forwarding logic to wait for.
663:
can be solved by stalling the first stage by three cycles until write-back is achieved, and the data in the register file is correct, causing the correct register value to be fetched by the
237:
The
Execute stage is where the actual computation occurs. Typically this stage consists of an ALU, and also a bit shifter. It may also include a multiple cycle multiplier and divider.
697:
On any branch taken, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch
166:. The term "latency" is used in computer science often and means the time from when an operation starts until it completes. Thus, instruction fetch has a latency of one
444:. Write-back of this normally occurs in cycle 5 (green box). Therefore, the value read from the register file and passed to the ALU (in the Execute stage of the
326:
instructions attempt to use the ALU simultaneously. It is simple to resolve this conflict by designing a specialized branch target adder into the decode stage.
686:
The classic RISC pipeline resolves branches in the decode stage, which means the branch resolution recurrence is two cycles long. There are three implications:
510:
in time - the data cannot be bypassed back to an earlier stage if it has not been processed yet. In the case above, the data is passed forward (by the time the
1956:
730:
Delayed branches were controversial, first, because their semantics are complicated. A delayed branch specifies that the jump to a new location happens
403:
The instruction fetch and decode stages send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:
482:
The current register pipeline of the access stage (which is either a loaded value or a forwarded ALU result, this provides bypassing of two stages):
928:
748:
processors, which fetch multiple instructions per cycle and must have some form of branch prediction, do not benefit from delayed branches. The
2067:
1250:
826:
expedite data cache miss handling, the instruction can be restarted so that its access cycle happens one cycle after the data cache is filled.
221:
PC in the first stage is assigned the branch target, rather than the incremented PC that has been computed. Some architectures made use of the
756:
and the restart address, and generating and distinguishing between the two correctly in all cases has been a source of bugs for later designs.
1769:
1047:
1926:
1492:
1309:
2280:
335:
Data hazards occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file.
1272:
818:
potentially writing back to memory the evicted cache line) is not specific to the pipeline organization, and is not discussed here.
1921:
1993:
2275:
1746:
788:
247:
Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:
2864:
2690:
1814:
1077:
921:
2700:
1841:
162:
The instructions reside in memory that takes one cycle to read. This memory can be dedicated to SRAM, or an
Instruction
765:
Suppose a 32-bit RISC processes an ADD instruction that adds two large numbers, and the result does not fit in 32 bits.
125:. During operation, each pipeline stage works on one instruction at a time. Each of these stages consists of a set of
968:
896:
734:
the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch.
2008:
1836:
1809:
1188:
835:
70:
48:
41:
2823:
2386:
1279:
1245:
1240:
1159:
1124:
88:
2859:
2798:
2695:
2096:
2003:
1804:
1025:
914:
586:
instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are
1824:
1543:
978:
670:
This data hazard can be detected quite easily when the program's machine code is written by the compiler. The
309:
1998:
1846:
1819:
1680:
1294:
1255:
1112:
781:
84:
2435:
2197:
1673:
1634:
1289:
1284:
1218:
1030:
647:
A pipeline interlock does not have to be used with any data forwarding, however. The first example of the
2062:
1759:
1457:
1154:
777:
323:
Structural hazards occur when two instructions might attempt to use the same resources at the same time.
2712:
2359:
1776:
1267:
1235:
1005:
993:
973:
2803:
2766:
2756:
1144:
201:
Another thing that separates the first RISC machines from earlier CISC machines, is that RISC has no
2818:
2225:
2161:
2138:
1988:
1950:
1786:
1736:
1731:
1208:
1102:
1010:
35:
2771:
2554:
2448:
2412:
2329:
2313:
2155:
1944:
1903:
1891:
1754:
1668:
1589:
1354:
1015:
958:
190:
126:
91:
2577:
2549:
2459:
2424:
2173:
2167:
2149:
1883:
1877:
1781:
1685:
1576:
1515:
1377:
1020:
698:
259:
118:
52:
2751:
2660:
2406:
2118:
1936:
1695:
1663:
1621:
1334:
1149:
1139:
1129:
1119:
1089:
1072:
937:
222:
787:
The most common kind of software-visible exception on one of the classic RISC machines is a
2781:
2717:
2303:
2025:
1915:
1862:
1394:
1107:
963:
945:
468:
A register file read port (i.e. the output of the decode stage, as in the naive pipeline):
122:
8:
2828:
2813:
2633:
2484:
2466:
2430:
2418:
2072:
2019:
1796:
1712:
1594:
1449:
1344:
1203:
130:
2685:
2677:
2529:
2504:
2308:
2183:
1707:
1648:
1528:
1260:
988:
603:
491:
348:
170:(if using single-cycle SRAM or if the instruction was in the cache). Thus, during the
225:(ALU) in the Execute stage, at the cost of slightly decreased instruction throughput.
2638:
2605:
2521:
2453:
2354:
2344:
2334:
2265:
2260:
2255:
2178:
2107:
2013:
1973:
1606:
1556:
1506:
1482:
1364:
1304:
1299:
1181:
1097:
892:
723:
717:
631:
186:
171:
163:
150:
99:
2808:
2741:
2727:
2582:
2489:
2443:
2250:
2245:
2240:
2235:
2230:
2220:
2090:
2057:
1968:
1963:
1872:
1724:
1719:
1702:
1690:
1629:
1193:
1171:
1057:
1035:
953:
2722:
2707:
2655:
2559:
2534:
2371:
2364:
2215:
2210:
2205:
2144:
2052:
2042:
1764:
1599:
1551:
1314:
1198:
1166:
1067:
1062:
983:
886:
737:
Delayed branches have been criticized as a poor short-term choice in ISA design:
178:
283:
For direct mapped and virtually tagged data caching, the simplest by far of the
2833:
2667:
2650:
2643:
2539:
2396:
2133:
2047:
1978:
1561:
1523:
1472:
1467:
1462:
1176:
1000:
871:
856:
263:
107:
638:
521:
2853:
2628:
2544:
1584:
1566:
1359:
1052:
773:
752:
ISA left out delayed branches, as it was intended for superscalar processors.
671:
314:
for situations where instructions in a pipeline would produce wrong answers.
210:
1487:
574:
is not present in the data cache until after the Memory Access stage of the
217:
not accept new bits, thus no new calculations take place during that cycle.
185:, below). (Note that some modern machines use more complicated algorithms (
2838:
2776:
2592:
2569:
2381:
2102:
1040:
339:
In the classic RISC pipeline, Data hazards are avoided in one of two ways:
117:
Each of these classic scalar RISC designs fetches and tries to execute one
2623:
2587:
2298:
2270:
2128:
1983:
906:
745:
455:
back to the
Execute stage (i.e. to the red circle in the diagram) of the
167:
705:
There are four schemes to solve this performance problem with branches:
2509:
2499:
2494:
2476:
2376:
2349:
1611:
1444:
1414:
1134:
772:
But the programmer, especially if programming in a language supporting
417:, without hazard consideration, the data hazard progresses as follows:
406:
288:
684:
Control hazards are caused by conditional and unconditional branching.
2600:
2597:
2339:
1409:
1387:
749:
284:
202:
141:
94:(RISC CPUs) used a very similar architectural solution, now called a
174:
stage, a 32-bit instruction is fetched from the instruction memory.
121:. The main common concept of each design is a five-stage execution
2615:
1434:
475:
The current register pipeline of the ALU (to bypass by one stage):
1424:
1382:
1439:
1404:
1369:
276:
If data memory needs to be accessed, it is done in this stage.
267:
instruction wrote their results to a separate set of registers.
1897:
1429:
1399:
103:
2761:
1909:
1829:
1419:
262:(Many cycle latency). Integer multiply and divide and all
146:
354:
Suppose the CPU is executing the following piece of code:
1349:
1339:
213:. In the MIPS design, the register file had 32 entries.
111:
244:
The bit shifter is responsible for shift and rotations.
291:
are used, one storing data and the other storing tags.
136:
486:
arrow. Note that this requires the data to be passed
451:
Instead, we must pass the data that was computed by
133:that operates on the outputs of those flip-flops.
884:
872:"RISC I: A Reduced Instruction Set VLSI Computer"
857:"RISC I: A Reduced Instruction Set VLSI Computer"
528:
2851:
436:is fetched from the register file. However, the
885:Hennessy, John L.; Patterson, David A. (2011).
888:Computer Architecture, A Quantitative Approach
533:However, consider the following instructions:
440:instruction has not yet written its result to
922:
710:opportunity to finish an instruction is lost.
1927:Computer performance by orders of magnitude
936:
929:
915:
514:is ready for the register in the ALU, the
193:) to guess the next instruction address.)
869:
854:
804:To take precise exceptions, the CPU must
424:instruction calculates the new value for
71:Learn how and when to remove this message
490:in time by one cycle. If this occurs, a
342:
140:
34:This article includes a list of general
432:operation is decoded, and the value of
308:Hennessy and Patterson coined the term
2852:
812:
506:Note that the data can only be passed
910:
317:
196:
1898:Floating-point operations per second
157:
137:The classic five stage RISC pipeline
20:
701:penalty, which is adequately large.
13:
678:
520:
498:operation until the data is ready.
448:operation, red box) is incorrect.
405:
40:it lacks sufficient corresponding
14:
2876:
891:(5th ed.). Morgan Kaufmann.
836:Iron law of processor performance
285:numerous data cache organizations
2824:Semiconductor device fabrication
870:Patterson, David (12 May 1981).
855:Patterson, David (12 May 1981).
637:
630:
271:
89:reduced instruction set computer
25:
2799:History of general-purpose CPUs
1026:Nondeterministic Turing machine
624:Problem resolved using a bubble
614:
578:instruction. By this time, the
570:The data read from the address
329:
145:Basic five-stage pipeline in a
979:Deterministic finite automaton
863:
848:
601:This NOP is termed a pipeline
529:Solution B. Pipeline interlock
494:must be inserted to stall the
1:
1770:Simultaneous and heterogenous
874:. Isca '81. pp. 443–457.
859:. Isca '81. pp. 443–457.
841:
760:
110:, and later the notional CPU
2454:Integrated memory controller
2436:Translation lookaside buffer
1635:Memory dependence prediction
1078:Random-access stored program
1031:Probabilistic Turing machine
398:; Writes r10 & r3 to r11
294:
85:history of computer hardware
7:
2865:Superscalar microprocessors
1910:Synaptic updates per second
829:
619:Bypassing backwards in time
347:Bypassing is also known as
10:
2881:
2314:Heterogeneous architecture
1236:Orthogonal instruction set
1006:Alternating Turing machine
994:Quantum cellular automaton
655:and the second example of
518:has already computed it).
303:
232:
2804:Microprocessor chronology
2791:
2767:Dynamic frequency scaling
2740:
2676:
2614:
2568:
2520:
2475:
2395:
2322:
2291:
2196:
2117:
2081:
2035:
1935:
1922:Cache performance metrics
1861:
1795:
1745:
1656:
1647:
1620:
1575:
1542:
1514:
1505:
1325:
1228:
1217:
1088:
944:
428:. In the same cycle, the
2819:Hardware security module
2162:Digital signal processor
2139:Graphics processing unit
1951:Graphics processing unit
535:
356:
280:it is always available.
260:Multi-cycle Instructions
191:branch target prediction
114:invented for education.
92:central processing units
2772:Dynamic voltage scaling
2555:Memory address register
2449:Branch target predictor
2413:Address generation unit
2156:Physics processing unit
1945:Central processing unit
1904:Transactions per second
1892:Instructions per second
1815:Array processing (SIMT)
959:Stored-program computer
377:; Writes r3 - r4 to r10
55:more precise citations.
2860:Instruction processing
2578:Hardwired control unit
2460:Memory management unit
2425:Memory management unit
2174:Secure cryptoprocessor
2168:Tensor Processing Unit
2150:Vision processing unit
1884:Cycles per instruction
1878:Instructions per cycle
1825:Associative processing
1516:Instruction pipelining
938:Processor technologies
525:
410:
154:
2661:Sum-addressed decoder
2407:Arithmetic logic unit
1534:Classic RISC pipeline
1488:Epiphany architecture
1335:Motorola 68000 series
524:
409:
343:Solution A. Bypassing
223:Arithmetic logic unit
144:
119:instruction per cycle
96:classic RISC pipeline
2782:Performance per watt
2360:replacement policies
2026:Package on a package
1916:Performance per watt
1820:Pipelined processing
1590:Tomasulo's algorithm
1395:Clipper architecture
1251:Application-specific
964:Finite-state machine
123:instruction pipeline
16:Instruction pipeline
2814:Digital electronics
2467:Instruction decoder
2419:Floating-point unit
2073:Soft microprocessor
2020:System in a package
1595:Reservation station
1125:Transport-triggered
813:Cache miss handling
131:combinational logic
129:to hold state, and
98:. Those CPUs were:
2686:Integrated circuit
2530:Processor register
2184:Baseband processor
1529:Operand forwarding
989:Cellular automaton
714:the previous kind.
610:pipeline interlock
526:
411:
349:operand forwarding
318:Structural hazards
197:Instruction decode
155:
2847:
2846:
2736:
2735:
2355:Instruction cache
2345:Scratchpad memory
2192:
2191:
2179:Network processor
2108:Network on a chip
2063:Ultra-low-voltage
2014:Multi-chip module
1857:
1856:
1643:
1642:
1630:Branch prediction
1607:Register renaming
1501:
1500:
1483:VISC architecture
1305:Quantum computing
1300:VISC architecture
1182:Secondary storage
1098:Microarchitecture
1058:Register machines
724:Branch Prediction
718:Branch Delay Slot
645:
644:
187:branch prediction
172:Instruction Fetch
158:Instruction fetch
151:Instruction Fetch
81:
80:
73:
2872:
2809:Processor design
2701:Power management
2583:Instruction unit
2444:Branch predictor
2393:
2392:
2091:System on a chip
2033:
2032:
1873:Transistor count
1797:Flynn's taxonomy
1654:
1653:
1512:
1511:
1315:Addressing modes
1226:
1225:
1172:Memory hierarchy
1036:Hypercomputation
954:Abstract machine
931:
924:
917:
908:
907:
902:
876:
875:
867:
861:
860:
852:
666:
662:
658:
654:
650:
641:
634:
615:
597:
593:
585:
581:
577:
573:
566:
563:
560:
557:
554:
551:
548:
545:
542:
539:
517:
513:
497:
485:
478:
471:
458:
454:
447:
443:
439:
435:
431:
427:
423:
420:In cycle 3, the
399:
396:
393:
390:
387:
384:
381:
378:
375:
372:
369:
366:
363:
360:
183:delayed branches
76:
69:
65:
62:
56:
51:this article by
42:inline citations
29:
28:
21:
2880:
2879:
2875:
2874:
2873:
2871:
2870:
2869:
2850:
2849:
2848:
2843:
2829:Tick–tock model
2787:
2743:
2732:
2672:
2656:Address decoder
2610:
2564:
2560:Program counter
2535:Status register
2516:
2471:
2431:Load–store unit
2398:
2391:
2318:
2287:
2188:
2145:Image processor
2120:
2113:
2083:
2077:
2053:Microcontroller
2043:Embedded system
2031:
1931:
1864:
1853:
1791:
1741:
1639:
1616:
1600:Re-order buffer
1571:
1552:Data dependency
1538:
1497:
1327:
1321:
1220:
1219:Instruction set
1213:
1199:Multiprocessing
1167:Cache hierarchy
1160:Register/memory
1084:
984:Queue automaton
940:
935:
905:
899:
880:
879:
868:
864:
853:
849:
844:
832:
815:
763:
681:
679:Control hazards
664:
660:
656:
652:
648:
595:
591:
583:
579:
575:
571:
568:
567:
564:
561:
558:
555:
552:
549:
546:
543:
540:
537:
531:
515:
511:
495:
483:
476:
469:
456:
452:
445:
441:
437:
433:
429:
425:
421:
401:
400:
397:
394:
391:
388:
385:
382:
379:
376:
373:
370:
367:
364:
361:
358:
345:
332:
320:
306:
297:
274:
235:
199:
179:Program Counter
160:
139:
77:
66:
60:
57:
47:Please help to
46:
30:
26:
17:
12:
11:
5:
2878:
2868:
2867:
2862:
2845:
2844:
2842:
2841:
2836:
2834:Pin grid array
2831:
2826:
2821:
2816:
2811:
2806:
2801:
2795:
2793:
2789:
2788:
2786:
2785:
2779:
2774:
2769:
2764:
2759:
2754:
2748:
2746:
2738:
2737:
2734:
2733:
2731:
2730:
2725:
2720:
2715:
2710:
2705:
2704:
2703:
2698:
2693:
2682:
2680:
2674:
2673:
2671:
2670:
2668:Barrel shifter
2665:
2664:
2663:
2658:
2651:Binary decoder
2648:
2647:
2646:
2636:
2631:
2626:
2620:
2618:
2612:
2611:
2609:
2608:
2603:
2595:
2590:
2585:
2580:
2574:
2572:
2566:
2565:
2563:
2562:
2557:
2552:
2547:
2542:
2540:Stack register
2537:
2532:
2526:
2524:
2518:
2517:
2515:
2514:
2513:
2512:
2507:
2497:
2492:
2487:
2481:
2479:
2473:
2472:
2470:
2469:
2464:
2463:
2462:
2451:
2446:
2441:
2440:
2439:
2433:
2422:
2416:
2410:
2403:
2401:
2390:
2389:
2384:
2379:
2374:
2369:
2368:
2367:
2362:
2357:
2352:
2347:
2342:
2332:
2326:
2324:
2320:
2319:
2317:
2316:
2311:
2306:
2301:
2295:
2293:
2289:
2288:
2286:
2285:
2284:
2283:
2273:
2268:
2263:
2258:
2253:
2248:
2243:
2238:
2233:
2228:
2223:
2218:
2213:
2208:
2202:
2200:
2194:
2193:
2190:
2189:
2187:
2186:
2181:
2176:
2171:
2165:
2159:
2153:
2147:
2142:
2136:
2134:AI accelerator
2131:
2125:
2123:
2115:
2114:
2112:
2111:
2105:
2100:
2097:Multiprocessor
2094:
2087:
2085:
2079:
2078:
2076:
2075:
2070:
2065:
2060:
2055:
2050:
2048:Microprocessor
2045:
2039:
2037:
2036:By application
2030:
2029:
2023:
2017:
2011:
2006:
2001:
1996:
1991:
1986:
1981:
1979:Tile processor
1976:
1971:
1966:
1961:
1960:
1959:
1948:
1941:
1939:
1933:
1932:
1930:
1929:
1924:
1919:
1913:
1907:
1901:
1895:
1889:
1888:
1887:
1875:
1869:
1867:
1859:
1858:
1855:
1854:
1852:
1851:
1850:
1849:
1839:
1834:
1833:
1832:
1827:
1822:
1817:
1807:
1801:
1799:
1793:
1792:
1790:
1789:
1784:
1779:
1774:
1773:
1772:
1767:
1765:Hyperthreading
1757:
1751:
1749:
1747:Multithreading
1743:
1742:
1740:
1739:
1734:
1729:
1728:
1727:
1717:
1716:
1715:
1710:
1700:
1699:
1698:
1693:
1683:
1678:
1677:
1676:
1671:
1660:
1658:
1651:
1645:
1644:
1641:
1640:
1638:
1637:
1632:
1626:
1624:
1618:
1617:
1615:
1614:
1609:
1604:
1603:
1602:
1597:
1587:
1581:
1579:
1573:
1572:
1570:
1569:
1564:
1559:
1554:
1548:
1546:
1540:
1539:
1537:
1536:
1531:
1526:
1524:Pipeline stall
1520:
1518:
1509:
1503:
1502:
1499:
1498:
1496:
1495:
1490:
1485:
1480:
1477:
1476:
1475:
1473:z/Architecture
1470:
1465:
1460:
1452:
1447:
1442:
1437:
1432:
1427:
1422:
1417:
1412:
1407:
1402:
1397:
1392:
1391:
1390:
1385:
1380:
1372:
1367:
1362:
1357:
1352:
1347:
1342:
1337:
1331:
1329:
1323:
1322:
1320:
1319:
1318:
1317:
1307:
1302:
1297:
1292:
1287:
1282:
1277:
1276:
1275:
1265:
1264:
1263:
1253:
1248:
1243:
1238:
1232:
1230:
1223:
1215:
1214:
1212:
1211:
1206:
1201:
1196:
1191:
1186:
1185:
1184:
1179:
1177:Virtual memory
1169:
1164:
1163:
1162:
1157:
1152:
1147:
1137:
1132:
1127:
1122:
1117:
1116:
1115:
1105:
1100:
1094:
1092:
1086:
1085:
1083:
1082:
1081:
1080:
1075:
1070:
1065:
1055:
1050:
1045:
1044:
1043:
1038:
1033:
1028:
1023:
1018:
1013:
1008:
1001:Turing machine
998:
997:
996:
991:
986:
981:
976:
971:
961:
956:
950:
948:
942:
941:
934:
933:
926:
919:
911:
904:
903:
898:978-0123838728
897:
881:
878:
877:
862:
846:
845:
843:
840:
839:
838:
831:
828:
814:
811:
774:large integers
762:
759:
758:
757:
753:
743:
728:
727:
721:
715:
711:
703:
702:
695:
692:
680:
677:
643:
642:
635:
627:
626:
621:
598:instructions.
536:
530:
527:
500:
499:
480:
473:
415:naive pipeline
357:
344:
341:
331:
328:
319:
316:
305:
302:
296:
293:
273:
270:
269:
268:
264:floating-point
257:
253:
234:
231:
198:
195:
159:
156:
149:machine (IF =
138:
135:
79:
78:
33:
31:
24:
15:
9:
6:
4:
3:
2:
2877:
2866:
2863:
2861:
2858:
2857:
2855:
2840:
2837:
2835:
2832:
2830:
2827:
2825:
2822:
2820:
2817:
2815:
2812:
2810:
2807:
2805:
2802:
2800:
2797:
2796:
2794:
2790:
2783:
2780:
2778:
2775:
2773:
2770:
2768:
2765:
2763:
2760:
2758:
2755:
2753:
2750:
2749:
2747:
2745:
2739:
2729:
2726:
2724:
2721:
2719:
2716:
2714:
2711:
2709:
2706:
2702:
2699:
2697:
2694:
2692:
2689:
2688:
2687:
2684:
2683:
2681:
2679:
2675:
2669:
2666:
2662:
2659:
2657:
2654:
2653:
2652:
2649:
2645:
2642:
2641:
2640:
2637:
2635:
2632:
2630:
2629:Demultiplexer
2627:
2625:
2622:
2621:
2619:
2617:
2613:
2607:
2604:
2602:
2599:
2596:
2594:
2591:
2589:
2586:
2584:
2581:
2579:
2576:
2575:
2573:
2571:
2567:
2561:
2558:
2556:
2553:
2551:
2550:Memory buffer
2548:
2546:
2545:Register file
2543:
2541:
2538:
2536:
2533:
2531:
2528:
2527:
2525:
2523:
2519:
2511:
2508:
2506:
2503:
2502:
2501:
2498:
2496:
2493:
2491:
2488:
2486:
2485:Combinational
2483:
2482:
2480:
2478:
2474:
2468:
2465:
2461:
2458:
2457:
2455:
2452:
2450:
2447:
2445:
2442:
2437:
2434:
2432:
2429:
2428:
2426:
2423:
2420:
2417:
2414:
2411:
2408:
2405:
2404:
2402:
2400:
2394:
2388:
2385:
2383:
2380:
2378:
2375:
2373:
2370:
2366:
2363:
2361:
2358:
2356:
2353:
2351:
2348:
2346:
2343:
2341:
2338:
2337:
2336:
2333:
2331:
2328:
2327:
2325:
2321:
2315:
2312:
2310:
2307:
2305:
2302:
2300:
2297:
2296:
2294:
2290:
2282:
2279:
2278:
2277:
2274:
2272:
2269:
2267:
2264:
2262:
2259:
2257:
2254:
2252:
2249:
2247:
2244:
2242:
2239:
2237:
2234:
2232:
2229:
2227:
2224:
2222:
2219:
2217:
2214:
2212:
2209:
2207:
2204:
2203:
2201:
2199:
2195:
2185:
2182:
2180:
2177:
2175:
2172:
2169:
2166:
2163:
2160:
2157:
2154:
2151:
2148:
2146:
2143:
2140:
2137:
2135:
2132:
2130:
2127:
2126:
2124:
2122:
2116:
2109:
2106:
2104:
2101:
2098:
2095:
2092:
2089:
2088:
2086:
2080:
2074:
2071:
2069:
2066:
2064:
2061:
2059:
2056:
2054:
2051:
2049:
2046:
2044:
2041:
2040:
2038:
2034:
2027:
2024:
2021:
2018:
2015:
2012:
2010:
2007:
2005:
2002:
2000:
1997:
1995:
1992:
1990:
1987:
1985:
1982:
1980:
1977:
1975:
1972:
1970:
1967:
1965:
1962:
1958:
1955:
1954:
1952:
1949:
1946:
1943:
1942:
1940:
1938:
1934:
1928:
1925:
1923:
1920:
1917:
1914:
1911:
1908:
1905:
1902:
1899:
1896:
1893:
1890:
1885:
1882:
1881:
1879:
1876:
1874:
1871:
1870:
1868:
1866:
1860:
1848:
1845:
1844:
1843:
1840:
1838:
1835:
1831:
1828:
1826:
1823:
1821:
1818:
1816:
1813:
1812:
1811:
1808:
1806:
1803:
1802:
1800:
1798:
1794:
1788:
1785:
1783:
1780:
1778:
1775:
1771:
1768:
1766:
1763:
1762:
1761:
1758:
1756:
1753:
1752:
1750:
1748:
1744:
1738:
1735:
1733:
1730:
1726:
1723:
1722:
1721:
1718:
1714:
1711:
1709:
1706:
1705:
1704:
1701:
1697:
1694:
1692:
1689:
1688:
1687:
1684:
1682:
1679:
1675:
1672:
1670:
1667:
1666:
1665:
1662:
1661:
1659:
1655:
1652:
1650:
1646:
1636:
1633:
1631:
1628:
1627:
1625:
1623:
1619:
1613:
1610:
1608:
1605:
1601:
1598:
1596:
1593:
1592:
1591:
1588:
1586:
1585:Scoreboarding
1583:
1582:
1580:
1578:
1574:
1568:
1567:False sharing
1565:
1563:
1560:
1558:
1555:
1553:
1550:
1549:
1547:
1545:
1541:
1535:
1532:
1530:
1527:
1525:
1522:
1521:
1519:
1517:
1513:
1510:
1508:
1504:
1494:
1491:
1489:
1486:
1484:
1481:
1478:
1474:
1471:
1469:
1466:
1464:
1461:
1459:
1456:
1455:
1453:
1451:
1448:
1446:
1443:
1441:
1438:
1436:
1433:
1431:
1428:
1426:
1423:
1421:
1418:
1416:
1413:
1411:
1408:
1406:
1403:
1401:
1398:
1396:
1393:
1389:
1386:
1384:
1381:
1379:
1376:
1375:
1373:
1371:
1368:
1366:
1363:
1361:
1360:Stanford MIPS
1358:
1356:
1353:
1351:
1348:
1346:
1343:
1341:
1338:
1336:
1333:
1332:
1330:
1324:
1316:
1313:
1312:
1311:
1308:
1306:
1303:
1301:
1298:
1296:
1293:
1291:
1288:
1286:
1283:
1281:
1278:
1274:
1271:
1270:
1269:
1266:
1262:
1259:
1258:
1257:
1254:
1252:
1249:
1247:
1244:
1242:
1239:
1237:
1234:
1233:
1231:
1227:
1224:
1222:
1221:architectures
1216:
1210:
1207:
1205:
1202:
1200:
1197:
1195:
1192:
1190:
1189:Heterogeneous
1187:
1183:
1180:
1178:
1175:
1174:
1173:
1170:
1168:
1165:
1161:
1158:
1156:
1153:
1151:
1148:
1146:
1143:
1142:
1141:
1140:Memory access
1138:
1136:
1133:
1131:
1128:
1126:
1123:
1121:
1118:
1114:
1111:
1110:
1109:
1106:
1104:
1101:
1099:
1096:
1095:
1093:
1091:
1087:
1079:
1076:
1074:
1073:Random-access
1071:
1069:
1066:
1064:
1061:
1060:
1059:
1056:
1054:
1053:Stack machine
1051:
1049:
1046:
1042:
1039:
1037:
1034:
1032:
1029:
1027:
1024:
1022:
1019:
1017:
1014:
1012:
1009:
1007:
1004:
1003:
1002:
999:
995:
992:
990:
987:
985:
982:
980:
977:
975:
972:
970:
969:with datapath
967:
966:
965:
962:
960:
957:
955:
952:
951:
949:
947:
943:
939:
932:
927:
925:
920:
918:
913:
912:
909:
900:
894:
890:
889:
883:
882:
873:
866:
858:
851:
847:
837:
834:
833:
827:
823:
819:
810:
807:
802:
798:
794:
792:
791:
785:
783:
779:
775:
770:
766:
754:
751:
747:
744:
740:
739:
738:
735:
733:
725:
722:
719:
716:
712:
708:
707:
706:
700:
696:
693:
689:
688:
687:
685:
676:
673:
672:Stanford MIPS
668:
640:
636:
633:
629:
628:
625:
622:
620:
617:
616:
613:
611:
606:
605:
599:
589:
534:
523:
519:
509:
504:
493:
489:
481:
474:
467:
466:
465:
462:
449:
418:
416:
408:
404:
355:
352:
350:
340:
337:
336:
327:
324:
315:
313:
312:
301:
292:
290:
286:
281:
277:
272:Memory access
265:
261:
258:
254:
250:
249:
248:
245:
242:
238:
230:
226:
224:
218:
214:
212:
211:register file
206:
204:
194:
192:
188:
184:
180:
175:
173:
169:
165:
152:
148:
143:
134:
132:
128:
124:
120:
115:
113:
109:
105:
101:
97:
93:
90:
87:, some early
86:
75:
72:
64:
61:December 2012
54:
50:
44:
43:
37:
32:
23:
22:
19:
2839:Chip carrier
2777:Clock gating
2696:Mixed-signal
2593:Write buffer
2570:Control unit
2382:Clock signal
2121:accelerators
2103:Cypress PSoC
1760:Simultaneous
1577:Out-of-order
1533:
1209:Neuromorphic
1090:Architecture
1048:Belt machine
1041:Zeno machine
974:Hierarchical
887:
865:
850:
824:
820:
816:
805:
803:
799:
795:
789:
786:
771:
767:
764:
736:
731:
729:
704:
691:multiplexer.
683:
682:
669:
659:followed by
651:followed by
646:
623:
618:
609:
602:
600:
587:
569:
532:
507:
505:
501:
487:
460:
450:
419:
414:
412:
402:
353:
346:
338:
334:
333:
330:Data hazards
322:
321:
310:
307:
298:
282:
278:
275:
246:
243:
239:
236:
227:
219:
215:
207:
200:
182:
176:
161:
116:
95:
82:
67:
58:
39:
18:
2624:Multiplexer
2588:Data buffer
2299:Single-core
2271:bit slicing
2129:Coprocessor
1984:Coprocessor
1865:performance
1787:Cooperative
1777:Speculative
1737:Distributed
1696:Superscalar
1681:Instruction
1649:Parallelism
1622:Speculative
1454:System/3x0
1326:Instruction
1103:Von Neumann
1016:Post–Turing
746:Superscalar
168:clock cycle
106:, Motorola
53:introducing
2854:Categories
2744:management
2639:Multiplier
2500:Logic gate
2490:Sequential
2397:Functional
2377:Clock rate
2350:Data cache
2323:Components
2304:Multi-core
2292:Core count
1782:Preemptive
1686:Pipelining
1669:Bit-serial
1612:Wide-issue
1557:Structural
1479:Tilera ISA
1445:MicroBlaze
1415:ETRAX CRIS
1310:Comparison
1155:Load–store
1135:Endianness
842:References
801:executed.
761:Exceptions
459:operation
127:flip-flops
36:references
2678:Circuitry
2598:Microcode
2522:Registers
2365:coherence
2340:CPU cache
2198:Word size
1863:Processor
1507:Execution
1410:DEC Alpha
1388:Power ISA
1204:Cognitive
1011:Universal
488:backwards
295:Writeback
203:microcode
2616:Datapath
2309:Manycore
2281:variable
2119:Hardware
1755:Temporal
1435:OpenRISC
1130:Cellular
1120:Dataflow
1113:modified
830:See also
790:TLB miss
2792:Related
2723:Quantum
2713:Digital
2708:Boolean
2606:Counter
2505:Quantum
2266:512-bit
2261:256-bit
2256:128-bit
2099:(MPSoC)
2084:on chip
2082:Systems
1900:(FLOPS)
1713:Process
1562:Control
1544:Hazards
1430:Itanium
1425:Unicore
1383:PowerPC
1108:Harvard
1068:Pointer
1063:Counter
1021:Quantum
588:stalled
508:forward
304:Hazards
233:Execute
83:In the
49:improve
2728:Switch
2718:Analog
2456:(IMC)
2427:(MMU)
2276:others
2251:64-bit
2246:48-bit
2241:32-bit
2236:24-bit
2231:16-bit
2226:15-bit
2221:12-bit
2058:Mobile
1974:Stream
1969:Barrel
1964:Vector
1953:(GPU)
1912:(SUPS)
1880:(IPC)
1732:Memory
1725:Vector
1708:Thread
1691:Scalar
1493:Others
1440:RISC-V
1405:SuperH
1374:Power
1370:MIPS-X
1345:PDP-11
1194:Fabric
946:Models
895:
806:commit
782:Scheme
776:(e.g.
742:slots.
604:bubble
492:bubble
484:purple
461:before
311:hazard
287:, two
256:cycle.
252:stage.
38:, but
2784:(PPW)
2742:Power
2634:Adder
2510:Array
2477:Logic
2438:(TLB)
2421:(FPU)
2415:(AGU)
2409:(ALU)
2399:units
2335:Cache
2216:8-bit
2211:4-bit
2206:1-bit
2170:(TPU)
2164:(DSP)
2158:(PPU)
2152:(VPU)
2141:(GPU)
2110:(NoC)
2093:(SoC)
2028:(PoP)
2022:(SiP)
2016:(MCM)
1957:GPGPU
1947:(CPU)
1937:Types
1918:(PPW)
1906:(TPS)
1894:(IPS)
1886:(CPI)
1657:Level
1468:S/390
1463:S/370
1458:S/360
1400:SPARC
1378:POWER
1261:TRIPS
1229:Types
750:Alpha
732:after
562:->
544:->
479:arrow
472:arrow
413:In a
392:->
371:->
289:SRAMs
164:Cache
108:88000
104:SPARC
2762:ACPI
2495:Glue
2387:FIFO
2330:Core
2068:ASIP
2009:CPLD
2004:FPOA
1999:FPGA
1994:ASIC
1847:SPMD
1842:MIMD
1837:MISD
1830:SWAR
1810:SIMD
1805:SISD
1720:Data
1703:Task
1674:Word
1420:M32R
1365:MIPS
1328:sets
1295:ZISC
1290:NISC
1285:OISC
1280:MISC
1273:EPIC
1268:VLIW
1256:EDGE
1246:RISC
1241:CISC
1150:HUMA
1145:NUMA
893:ISBN
778:Lisp
594:and
477:blue
189:and
177:The
147:RISC
100:MIPS
2757:APM
2752:PMU
2644:CPU
2601:ROM
2372:Bus
1989:PAL
1664:Bit
1450:LMC
1355:ARM
1350:x86
1340:VAX
780:or
699:IPC
665:AND
661:AND
653:AND
649:SUB
596:AND
584:AND
580:AND
572:adr
565:r11
553:r10
550:AND
547:r10
541:adr
516:SUB
512:AND
496:AND
470:red
457:AND
453:SUB
446:AND
442:r10
438:SUB
434:r10
430:AND
426:r10
422:SUB
395:r11
383:r10
380:AND
374:r10
359:SUB
112:DLX
2856::
2691:3D
793:.
657:LD
612:.
592:LD
576:LD
559:r3
538:LD
389:r3
368:r4
362:r3
351:.
102:,
930:e
923:t
916:v
901:.
556:,
386:,
365:,
74:)
68:(
63:)
59:(
45:.
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.