
Paxos (computer science)


                     Phase2Start(N,null)
|  | | | | | | | | | | | | | | |
|  | | | | | | | | | | | | | | |  !! Concurrent commuting proposals
X------- ?|-----?|-?|-?| | | Propose(ReadA)
X-----------?|-----?|-?|-?| | | Propose(ReadB)
| | X------X-------------->|->| Accepted(N,<ReadA,ReadB>)
| | |<--------X--X-------->|->| Accepted(N,<ReadB,ReadA>)
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |  !! No Conflict, both accepted
| | | | | | | | Stable = <ReadA, ReadB>
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |  !! Concurrent conflicting proposals
X-----------?|-----?|-?|-?| | | Propose(<WriteB,ReadA>)
| X--------?|-----?|-?|-?| | | Propose(ReadB)
| | X------X-------------->|->| Accepted(N,<WriteB,ReadA> . <ReadB>)
| | |<--------X--X-------->|->| Accepted(N,<ReadB> . <WriteB,ReadA>)
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |  !! Conflict detected, leader chooses
| | | | | | | | commutative order:
| | | | | | | | V = <ReadA, WriteB, ReadB>
| | X----->|->|->| | | Phase2Start(N+1,V)
| | |<-----X- X- X-------->|->| Accepted(N+1,V)
| | | | | | | | Stable = <ReadA, ReadB> .
| | | | | | | |          <ReadA, WriteB, ReadB>
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |  !! More conflicting proposals
X-----------?|-----?|-?|-?| | | Propose(WriteA)
| X--------?|-----?|-?|-?| | | Propose(ReadA)
| | X------X-------------->|->| Accepted(N+1,<WriteA> . <ReadA>)
| | |<--------X- X-------->|->| Accepted(N+1,<ReadA> . <WriteA>)
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |  !! Leader chooses order:
| | | | | | | | W = <WriteA, ReadA>
| | X----->|->|->| | | Phase2Start(N+2,W)
| | |<-----X- X- X-------->|->| Accepted(N+2,W)
| | | | | | | | Stable = <ReadA, ReadB> .
| | | | | | | |          <ReadA, WriteB, ReadB> .
| | | | | | | |          <WriteA, ReadA>
| | | | | | | |
| | | | |
| | | | |  !! LEADER FAILS
| | | | | | |  !! NEW LEADER (knows last number was 1)
| X--------->|->|->| | | Prepare(2)
| |<---------X--X--X | | Promise(2,{null,null,null})
| | | | | | | |  !! OLD LEADER recovers
| | | | | | | |  !! OLD LEADER tries 2, denied
| X------------>|->|->| | | Prepare(2)
| |<------------X--X--X | | Nack(2)
| | | | | | | |  !! OLD LEADER tries 3
| X------------>|->|->| | | Prepare(3)
| |<------------X--X--X | | Promise(3,{null,null,null})
| | | | | | | |  !! NEW LEADER proposes, denied
| | X--------->|->|->| | | Accept!(2,Va)
| | |<---------X--X--X | | Nack(3)
| | | | | | | |  !! NEW LEADER tries 4
| | X--------->|->|->| | | Prepare(4)
| | |<---------X--X--X | | Promise(4,{null,null,null})
| | | | | | | |  !! OLD LEADER proposes, denied
| X------------>|->|->| | | Accept!(3,Vb)
| |<------------X--X--X | | Nack(4)
| | | | | | | | ... and so on ...

| | | | X------------------>|->| Accepted(1,V1)  !
| | | | | | | | | |  !! FAIL !!
| | | | | | | | | |
X--------------->|->|->|->| | | Prepare(2)
|<---------------X--X--X--X | | Promise(2,{null,null,null,null})
X--------------->| | | | | | Accept!(2,V2)
| | | | X--------------->|->| Accepted(2,V2)  !
| | | | | | | | |  !! FAIL !!
| | | | | | | | |
X--------->|---->|->|->| | | Prepare(3)
|<---------X-----X--X--X | | Promise(3,{V1,null,null,null})
X--------------->|->| | | | Accept!(3,V1)
| | | | X--X--------->|->| Accepted(3,V1)  !
| | | | | | | |  !! FAIL !!
| | | | | | | |
X------>|->|------->| | | Prepare(4)
|<------X--X--|--|--X | | Promise(4,{V1(1),V2(2),null})
X------>|->|->|->|->| | | Accept!(4,V2)
| X--X--X--X--X------>|->| Accepted(4,V2)

| | | | |  !! received in different order
| | | | | | | | |  !! by the Acceptors
| X--------------?|-?|-?|-?| | | Accept!(N,I,V)
X-----------------?|-?|-?|-?| | | Accept!(N,I,W)
| | | | | | | | |
| | | | | | | | |  !! Acceptors disagree on value
| | |<-------X--X->|->|----->|->| Accepted(N,I,V)
| | |<-------|<-|<-X--X----->|->| Accepted(N,I,W)
| | | | | | | | |
| | | | | | | | |  !! Detect collision & recover
| | |<-------X--X--X--X----->|->| Accepted(N+1,I,W)
|<---------------------------------X--X Response(W)
| | | | | | | | |

| | | | | |  !! by the Acceptors
| X--------------?|-?|-?|-?| | | Accept!(N,I,V)
X-----------------?|-?|-?|-?| | | Accept!(N,I,W)
| | | | | | | | |
| | | | | | | | |  !! Acceptors disagree on value
| | |<-------X--X->|->|----->|->| Accepted(N,I,V)
| | |<-------|<-|<-X--X----->|->| Accepted(N,I,W)
| | | | | | | | |
| | | | | | | | |  !! Detect collision & recover
| | X------->|->|->|->| | | Accept!(N+1,I,W)
| | |<-------X--X--X--X----->|->| Accepted(N+1,I,W)
|<---------------------------------X--X Response(W)
| | | | | | | | |

         { Acceptors }
Proposer Main      Aux   Learner
| | | | | |
-- Phase 2 --
X----------->|->|->| | | Accept!(N,I,V)
| | |  ! | | --- FAIL! ---
|<-----------X--X--------------->| Accepted(N,I,V)
| | | | |
-- Failure detected (only 2 accepted) --
X----------->|->|------->| | Accept!(N,I,V) (re-transmit, include Aux)
|<-----------X--X--------X------>| Accepted(N,I,V)
| | | | |
-- Reconfigure : Quorum = 2 --
X----------->|->| | | Accept!(N,I+1,W) (Aux not participating)
|<-----------X--X--------------->| Accepted(N,I+1,W)
| | | | |

| | | | | |
| | | | | |  !! Leader fails during broadcast !!
| X------------>| | | | | Accept!(1,V)
| |  ! | | | |
| | | | | |
| | | | | |  !! NEW LEADER !!
| X--------->|->|->| | | Prepare(2)
| |<---------X--X--X | | Promise(2,{V, null, null})
| X--------->|->|->| | | Accept!(2,V)
| |<---------X--X--X------>|->| Accepted(2,V)
|<---------------------------------X--X Response
| | | | | | |
accepted at round N. For instance, if the coordinator and the acceptor accepted at round N the sequences <WriteB, ReadB> and <ReadB, ReadA> respectively, the acceptor will spontaneously accept <WriteB, ReadB, ReadA> at round N+1. With this variation, the cost of recovery is a single message delay, which is obviously optimal. Notice here that the use of a unique quorum at a round does not harm liveness: any process in this quorum is a read quorum for the prepare phase of the next rounds.
"With only two processors p and q, one processor cannot distinguish failure of the other processor from failure of the communication medium. A third processor is needed. However, that third processor does not have to participate in choosing the sequence of commands. It must take action only in case p or q fails, after which it does nothing while either p or q continues to operate the system by itself. The third processor can therefore be a small/slow/cheap one, or a processor primarily devoted to other tasks."
| X--------?|-?|-?|-?| Accept!(N,I,V)
X-----------?|-?|-?|-?| Accept!(N,I,W)
| | | | | |
| | | | | |  !! Servers disagree on value
| | X<>X->|->| Accepted(N,I,V)
| | |<-|<-X<>X Accepted(N,I,W)
| | | | | |
| | | | | |  !! Detect collision & recover
| | X<>X<>X<>X Accepted(N+1,I,W)
|<-----------X--X--X--X Response(W)
| | | | | |
includes the Acceptor that has accepted V1, and must propose it. The Proposer manages to get two Acceptors to accept it before failing. At this point, three Acceptors have accepted V1, but not for the same identifier. Finally, a new Proposer prepares the majority that has not seen the largest accepted identifier. The value associated with the largest identifier in that majority is V2, so it must propose it. This Proposer then gets all Acceptors to accept V2, achieving consensus.
and remove other sources of delay on the leader's critical path. Doing so enables Derecho to sustain the full bidirectional RDMA data rate. In contrast, although traditional Paxos protocols can be migrated to an RDMA network by simply mapping the message send operations to native RDMA operations, doing so leaves round-trip delays on the critical path. In high-speed RDMA networks, even small delays can be large enough to prevent utilization of the full potential bandwidth.
"A proposer can send its proposal only to the leader rather than to all coordinators. However, this requires that the result of the leader-selection algorithm be broadcast to the proposers, which might be expensive. So, it might be better to let the proposer send its proposal to all coordinators. (In that case, only the coordinators themselves need to know who the leader is.)"

), there has been substantial interest in optimizing Paxos to leverage hardware offloading, in which the network interface card and network routers provide reliability and network-layer congestion control, freeing the host CPU for other tasks. The Derecho C++ Paxos library is an open-source Paxos implementation that explores this option.

| | | | |  !! FAIL !!
| | | | | |
X--------->|->| | | Prepare(2)
|<---------X--X | | Promise(2,{V1,null})
X------>|->|->| | | Accept!(2,V1)
|<------X--X--X------>|->| Accepted(2,V1)
| | | | | |

| | | |  !! FAIL !!
| | | | | |
X--------->|->| | | Prepare(2)
|<---------X--X | | Promise(2,{null,null})
X------>|->|->| | | Accept!(2,V2)
|<------X--X--X------>|->| Accepted(2,V2)
| | | | | |
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Accept!(N,I,V)
| | X<>X<>X | | Verify(N,I,V) - BROADCAST
|
Client Leader Acceptor Learner
| | | | | | | | |
| X------->|->|->|->| | | Any(N,I,Recovery)
| | | | | | | | |
| | | | | | | | |  !! Concurrent conflicting proposals
| | | |
Client Leader Acceptor Learner
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |  !! Concurrent conflicting proposals
| | | | | | | | |  !! received in different order
| | |
Client Leader Acceptor Learner
| | | | | | | | |
X--------->|->|->|->| | | Any(N,I,Recovery)
| | | | | | | |
X------------------->|->|->|->| | | Accept!(N,I,W)
|
Client Proposer Acceptor Learner
| | | | | | |
--- First Request ---
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(N)
| |<---------X--X--X | | Promise(N,I,{Va,Vb,Vc})
Proposer Acceptor Learner
| | | | | | | | |
X--------------->|->|->|->|->| | | Prepare(1)
|<---------------X--X--X--X--X | | Promise(1,{null,null,null,null,null})
X--------------->| | | | | | | Accept!(1,V1)
Client Proposer Acceptor Learner
| | | | | | |
X----->| | | | | | Request
| X------------>|->|->| | | Prepare(1)
| |<------------X--X--X | | Promise(1,{null,null,null})
|  !
Client Proposer Acceptor Learner
| | | | | | |
X----->| | | | | | Request
| X------------>|->|->| | | Prepare(1)
| |<------------X--X--X | | Promise(1,{Va, Vb, Vc})
| |
processors: in other words, the number of non-faulty processes must be strictly greater than the number of faulty processes. However, using reconfiguration, a protocol may be employed which survives any number of total failures as long as no more than F fail simultaneously. For Paxos protocols, these
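The fault-tolerance arithmetic described above can be sketched as a pair of helper functions (the function names are ours, purely for illustration): with n acceptors and majority quorums, the protocol tolerates f simultaneous failures only while the non-faulty processors outnumber the faulty ones.

```python
# Illustrative sketch of the quorum/failure arithmetic; these helpers are
# hypothetical, not part of any Paxos specification.

def majority_quorum(n):
    """Smallest number of acceptors forming a majority of n."""
    return n // 2 + 1

def max_simultaneous_failures(n):
    """Largest f such that n - f (non-faulty) is strictly greater than f."""
    return (n - 1) // 2

# 2F+1 processors survive F simultaneous failures:
assert max_simultaneous_failures(5) == 2
# Any two majority quorums of the same n must intersect:
assert 2 * majority_quorum(5) > 5
```

Because any two majorities intersect, a value accepted by one quorum is always visible to the prepare phase run against any later quorum; this is what makes the majority rule safe.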
The above message flow shows that Generalized Paxos can leverage operation semantics to avoid collisions when the spontaneous ordering of the network fails. In practice, this allows the protocol to be quicker than Fast Paxos. However, when a collision occurs, Generalized Paxos needs two additional
Client Servers
| | | |
--- First Request ---
X-------->| | | Request
| X->|->| Prepare(N)
| |<-X--X Promise(N, I, {Va, Vb})
| X->|->| Accept!(N, I, Vn)
| X<>X<>X Accepted(N, I)
|<--------X
|
Proposer Acceptor Learner
| | | | | | |
X--------->|->|->| | | Prepare(1)
|<---------X--X--X | | Promise(1,{null,null,null})
X--------->|->| | | | Accept!(1,V1)
| | X--X--------->|->| Accepted(1,V1)  !
|
Proposer Acceptor Learner
| | | | | | |
X--------->|->|->| | | Prepare(1)
|<---------X--X--X | | Promise(1,{null,null,null})
X--------->| | | | | Accept!(1,V1)
| | X------------>|->| Accepted(1,V1)  !
| |
In this case, a Proposer fails after proposing a value, but before the agreement is reached. Specifically, it fails in the middle of the Accept message, so only one Acceptor of the Quorum receives the value. Meanwhile, a new Leader (a Proposer) is elected (but this is not shown in detail). Note that
Derecho offers both a classic Paxos, with data durability across full shutdown/restart sequences, and vertical Paxos (atomic multicast), for in-memory replication and state-machine synchronization. The Paxos protocols employed by Derecho needed to be adapted to maximize asynchronous data streaming
In the general case, such round trips are unavoidable and come from the fact that multiple commands can be accepted during a round. This makes the protocol more expensive than Paxos when conflicts are frequent. Fortunately, two refinements of Generalized Paxos are possible to improve recovery
Client Servers
| | | | | |
| | X->|->|->| Any(N,I,Recovery)
| | | | | |
| | | | | |  !! Concurrent conflicting proposals
| | | | | |  !! received in different order
| | | | | |  !! by the Servers
The failure scenario is the same for both protocols: each Learner waits to receive F+1 identical messages from different Acceptors. If this does not occur, the Acceptors themselves will also be aware of it (since they exchanged each other's messages in the broadcast round), and correct Acceptors
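The learner rule just stated can be sketched as follows (an illustration of the rule, not code from any Byzantine Paxos implementation): with up to F faulty Acceptors, a Learner only trusts a value reported identically by at least F+1 distinct Acceptors.

```python
from collections import Counter

# Hypothetical sketch of the "F+1 identical messages" learner rule.

def learn(messages, f):
    """messages maps acceptor id -> reported value.
    Returns the learned value, or None if no value has yet been
    reported identically by at least f + 1 distinct acceptors."""
    if not messages:
        return None
    tally = Counter(messages.values())
    value, count = tally.most_common(1)[0]
    return value if count >= f + 1 else None
```

With f = 1, two matching reports suffice, and a single faulty Acceptor reporting a different value cannot cause a wrong value to be learned.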
Generalized consensus explores the relationship between the operations of the replicated state machine and the consensus protocol that implements it. The main discovery involves optimizations of Paxos when conflicting proposals could be applied in any order, i.e., when the proposed operations are
In the following case, one Proposer achieves acceptance of value V1 by one Acceptor before failing. A new Proposer prepares the Acceptors that never accepted V1, allowing it to propose V2. This Proposer is able to get one Acceptor to accept V2 before failing. A new Proposer finds a majority that
The most complex case is when multiple Proposers believe themselves to be Leaders. For instance, the current leader may fail and later recover, but the other Proposers have already re-selected a new leader. The recovered leader has not learned this yet and attempts to begin one round in conflict
Notice that a Proposer in Paxos could propose "I am the leader," (or, for example, "Proposer X is the leader"). Because of the agreement and validity guarantees of Paxos, if accepted by a Quorum, then the Proposer is now known to be the leader to all other nodes. This satisfies the needs of leader
The final optimization occurs when the leader specifies a recovery technique in advance, allowing the Acceptors to perform the collision recovery themselves. Thus, uncoordinated collision recovery can occur in three message delays (and only two message delays if all Learners are also Acceptors).
In the following case, one Proposer achieves acceptance of value V1 by two Acceptors before failing. A new Proposer may start another round, but it is now impossible for that proposer to prepare a majority that doesn't include at least one Acceptor that has accepted V1. As such, even though the
Second, if both rounds N and N+1 use a unique and identical centered quorum, when an acceptor detects a collision at round N, it spontaneously proposes at round N+1 a sequence suffixing both (i) the sequence accepted at round N by the coordinator and (ii) the greatest non-conflicting prefix it
Client Leader Acceptor Learner
| | | | | | | |  !! New Leader Begins Round
| | X----->|->|->| | | Prepare(N)
| | |<-----X- X- X | | Promise(N,null)
| | X----->|->|->| | |
Client Proposer Acceptor Learner
| | | | | | |
--- Following Requests ---
X-------->| | | | | | Request
| X--------->|->|->| | | Accept!(N,I+1,W)
| |<---------X--X--X------>|->|
The Paxos family of protocols includes a spectrum of trade-offs between the number of processors, number of message delays before learning the agreed value, the activity level of individual participants, number of messages sent, and types of failures. Although no deterministic fault-tolerant
Client Acceptor Learner
| | |  ! | |  !! One Acceptor is faulty
X----->|->|->! | | Accept!(N,I,V)
| X<>X<>X------>|->| Accepted(N,I,{V,W}) - BROADCAST
| | |  ! | |  !! Learners receive 2
This concept is further generalized into ever-growing sequences of commutative operations, some of which are known to be stable (and thus may be executed). The protocol tracks these sequences ensuring that all proposed operations of one sequence are stabilized before allowing any operation
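The commutativity test implied above can be illustrated with read/write operations on named registers (a common example; the helper names here are hypothetical): two operations conflict exactly when they touch the same register and at least one of them is a write, and conflict-free operations may appear in either order in a stable sequence.

```python
# Illustrative sketch of operation commutativity for generalized consensus.
# An operation is modelled as a (kind, register) pair, e.g. ("read", "A").

def conflicts(op1, op2):
    """Two operations conflict iff they touch the same register
    and at least one of them is a write."""
    (kind1, reg1), (kind2, reg2) = op1, op2
    return reg1 == reg2 and "write" in (kind1, kind2)

def commutes_with_all(op, sequence):
    """True if op can be ordered freely with every operation in sequence."""
    return all(not conflicts(op, other) for other in sequence)
```

Under this test, <ReadA, ReadB> and <ReadB, ReadA> are equivalent sequences, while WriteB conflicts with ReadB and therefore forces the leader to pick one order.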
The simplest error cases are the failure of an Acceptor (when a Quorum of Acceptors remains alive) and failure of a redundant Learner. In these cases, the protocol requires no "recovery" (i.e. it still succeeds): no additional rounds or messages are required, as shown below (in the next two
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(1)
| |<---------X--X--X | | Promise(1,{Va,Vb,Vc})
|
|<---------X--X | | Promise(1,{Va, Vb, null})
| X--------->|->| | | Accept!(1,V)
| |<---------X--X--------->|->| Accepted(1,V)
|<---------------------------------X--X Response
| | | | | |
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(1)
| | | |  ! | |  !! FAIL !!
|
X--------->|->|->| | | Accept!(1,V)
| |<---------X--X--X------>|->| Accepted(1,V)
| | | | | |  !  !! FAIL !!
|<---------------------------------X Response
| | | | | |
This reduction in processor requirements comes at the expense of liveness; if too many main processors fail in a short time, the system must halt until the auxiliary processors can reconfigure the system. During stable periods, the auxiliary processors take no part in the protocol.
and Shraer. Derecho, a C++ software library for cloud-scale state machine replication, offers a Paxos protocol that has been integrated with self-managed virtually synchronous membership. This protocol matches the Keidar and Shraer optimality bounds, and maps efficiently to modern
. State machine replication is a technique for converting an algorithm into a fault-tolerant, distributed implementation. Ad-hoc techniques may leave important cases of failures unresolved. The principled approach proposed by Lamport et al. ensures all cases are handled safely.
In the following case, one Proposer achieves acceptance of value V1 by one Acceptor before failing. A new Proposer prepares the Acceptors that never accepted V1, allowing it to propose V2. Then V2 is accepted by all Acceptors, including the one that initially accepted V1.
protocol. However, gbcast is unusual in supporting durability and addressing partitioning failures. Most reliable multicast protocols lack these properties, which are required for implementations of the state machine replication model. This point is elaborated in a paper by
). Because each identifier is unique to a Proposer and only one value may be proposed per identifier, all Acceptors that accept the same identifier thereby accept the same value. These facts result in a few counter-intuitive scenarios that do not impact correctness:
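One common way to obtain identifiers that are unique per Proposer (an illustration; the scheme and names are ours, not mandated by the protocol) is to pair a round counter with the proposer's id and compare the pairs lexicographically:

```python
# Hypothetical identifier scheme: (round, proposer_id) pairs ordered
# lexicographically. Distinct proposers can never emit equal identifiers
# because the proposer_id component always differs.

def next_id(last_seen_round, proposer_id):
    """Produce an identifier strictly above every round seen so far."""
    return (last_seen_round + 1, proposer_id)

a = next_id(0, proposer_id=1)
b = next_id(0, proposer_id=2)
assert a != b and b > a   # unique, and still totally ordered
```

Any scheme works as long as identifiers are totally ordered and no two Proposers can produce the same one.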
messages to ensure that, despite failures, only a single value can be chosen. However, if an acceptor does learn what value has been chosen, it can store the value in stable storage and erase any other information it has saved there. If the acceptor later receives a
Client Acceptor Learner
| | | | | |
X----->|->|->| | | Accept!(N,I,V)
| X<>X<>X------>|->| Accepted(N,I,V) - BROADCAST
|<-------------------X--X Response(V)
| | | | | |
This protocol is the most basic of the Paxos family. Each "instance" (or "execution") of the basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds. A successful round has 2 phases: phase 1 (which is divided into parts
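A single instance of the round structure just described can be sketched in a few dozen lines (a deliberately simplified, in-memory illustration: class and function names are ours, and network failure and contention are not modelled):

```python
# Minimal single-instance Basic Paxos sketch (illustrative only).

class Acceptor:
    def __init__(self):
        self.promised = 0        # highest identifier promised (phase 1b)
        self.accepted = None     # (identifier, value) or None

    def prepare(self, n):        # phase 1b: Promise or ignore/Nack
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):  # phase 2b: Accepted or Nack
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def run_round(acceptors, n, value):
    """One proposer round; returns the chosen value, or None on failure."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]           # phase 1a/1b
    grants = [prev for kind, prev in replies if kind == "promise"]
    if len(grants) < quorum:
        return None
    prior = [p for p in grants if p is not None]
    if prior:                    # must adopt the highest-numbered value
        value = max(prior)[1]
    oks = sum(a.accept(n, value) == "accepted" for a in acceptors)
    return value if oks >= quorum else None
```

Running a second round with a different value against the same acceptors still returns the first value, illustrating that once consensus is reached it cannot change.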
different commands
| | |  ! | |  !! Correct Acceptors notice error and choose
| X<>X<>X------>|->| Accepted(N,I,V) - BROADCAST
|<-------------------X--X Response(V)
| | |  ! | |
In the diagram below, there is 1 Client, 1 Proposer, 3 Acceptors (i.e. the Quorum size is 3) and 2 Learners (represented by the 2 vertical lines). This diagram represents the case of a first round, which is successful (i.e. no process in the network fails).
), in which the amount of durable state could be large. The protocol attempts to make progress even during periods when some bounded number of replicas are unresponsive. There is also a mechanism to drop a permanently failed replica or to add a new replica.
message to the Proposer and every Learner (which can typically be the Proposers themselves). Learners learn the decided value only after receiving Accepted messages from a majority of Acceptors, which means not after receiving just the first Accept
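The learning rule described above can be sketched as a small tally (an illustration with hypothetical names, not code from any implementation): a Learner treats a value as decided only once a majority of Acceptors has reported Accepted for the same proposal identifier.

```python
from collections import defaultdict

# Illustrative sketch of a Learner counting Accepted messages.

class Learner:
    def __init__(self, n_acceptors):
        self.quorum = n_acceptors // 2 + 1
        self.reports = defaultdict(set)  # (identifier, value) -> acceptor ids

    def on_accepted(self, acceptor_id, identifier, value):
        """Record one Accepted message; return the value once a majority
        of distinct acceptors has reported it, else None."""
        key = (identifier, value)
        self.reports[key].add(acceptor_id)
        return value if len(self.reports[key]) >= self.quorum else None
```

Tracking acceptor ids in a set also makes retransmitted Accepted messages from the same Acceptor harmless.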
The OpenReplica replication service uses Paxos to maintain replicas for an open access system that enables users to create fault-tolerant objects. It provides high performance through concurrent rounds and flexibility through dynamic membership
By merging roles, the protocol "collapses" into an efficient client-master-replica style deployment, typical of the database community. The benefit of the Paxos protocols (including implementations with merged roles) is the guarantee of its
) use the same leader, so phase 1 (of these subsequent instances of the basic Paxos protocol), which consists of the Prepare and Promise sub-phases, is skipped. Note that the Leader should be stable, i.e. it should not crash or change.
messages that tell it the hash of a value v that it must use in its Phase2a action without telling it the actual value of v. If that happens, the leader cannot execute its Phase2a action until it communicates with some process that knows
in 1988, in the context of distributed transactions. Notwithstanding this prior work, Paxos offered a particularly elegant formalism, and included one of the earliest proofs of safety for a fault-tolerant distributed consensus protocol.
Proposer doesn't see the existing consensus, the Proposer's only option is to propose the value already agreed upon. New Proposers can continually increase the identifier to restart the process, but the consensus can never be changed.
in a network of unreliable or fallible processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communications may experience failures.
The following diagrams represent several cases/situations of the application of the Basic Paxos protocol. Some cases show how the Basic Paxos protocol copes with the failure of certain (redundant) components of the distributed system.
| X--------->|->|->| | | Accept!(N,I,V)
| |<---------X--X--X------>|->| Accepted(N,I,V)
|<---------------------------------X--X Response
| | | | | | |
In order to illustrate Generalized Paxos, the example below shows a message flow between two concurrently executing clients and a replicated state machine implementing read/write operations over two distinct registers A and B.
Paxos may also be extended to support arbitrary failures of the participants, including lying, fabrication of messages, collusion with other participants, selective non-participation, etc. These types of failures are called
had demonstrated the solvability of consensus in a broad family of "partially synchronous" systems. Paxos has strong similarities to a protocol used for agreement in "viewstamped replication", first published by Oki and
Jha, Sagar; Behrens, Jonathan; Gkountouvas, Theo; Milano, Matthew; Song, Weijia; Tremel, Edward; van Renesse, Robbert; Zink, Sydney; Birman, Ken (April 2019). "Derecho: Fast State Machine Replication for Cloud Services".
In the following diagram, only one instance (or "execution") of the basic Paxos protocol, with an initial Leader (a Proposer), is shown. Note that a Multi-Paxos consists of several instances of the basic Paxos protocol.
), then to recover at round N+1 from a collision at round N, the coordinator skips phase 1 and proposes at phase 2 the sequence it accepted last during round N. This reduces the cost of recovery to a single round trip.
In order to simplify the presentation of Paxos, the following assumptions and definitions are made explicit. Techniques to broaden the applicability are known in the literature, and are not covered in this article.
island in Greece, where Lamport wrote that the parliament had to function "even though legislators continually wandered in and out of the parliamentary Chamber". It was later published as a journal article in 1998.
to reduce end-to-end message delays. In Basic Paxos, the message delay from client request to learning is 3 message delays. Fast Paxos allows 2 message delays, but requires that (1) the system be composed of
messages just to a quorum of acceptors. As long as all acceptors in that quorum are working and can communicate with the leader and the learners, there is no need for acceptors not in the quorum to do anything.
guaranteed to terminate, and thus does not have the liveness property. This is supported by the Fischer Lynch Paterson impossibility result (FLP), which states that a consistency protocol can only have two of safety, liveness, and fault tolerance.
Client Servers
X-------->| | | Request
| X->|->| Accept!(N,I+1,W)
| X<>X<>X Accepted(N,I+1)
|<--------X | | Response
| | | |
"We can save messages at the cost of an extra message delay by having a single distinguished learner that informs the other learners when it finds out that a value has been chosen. Acceptors then send
is less than or equal to any previous proposal number received by the Acceptor, the Acceptor needn't respond and can ignore the proposal. However, for the sake of optimization, sending a denial, or
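The Acceptor's choice between silence and an explicit denial can be sketched as follows (an illustration; the function name and state layout are hypothetical): the Nack is purely an optimization that lets the Proposer stop retrying a doomed identifier sooner.

```python
# Illustrative sketch of an Acceptor's Prepare handler with the optional
# Nack optimization described above.

def on_prepare(state, n, send_nacks=True):
    """state holds 'promised', the highest identifier promised so far.
    Returns the reply to send, or None if the Acceptor stays silent."""
    if n <= state["promised"]:
        # Stale proposal: either ignore it, or deny it explicitly.
        return ("nack", state["promised"]) if send_nacks else None
    state["promised"] = n
    return ("promise", n)
```

Either behaviour is safe; sending the Nack only changes how quickly the Proposer learns it must pick a higher identifier.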
In most deployments of Paxos, each participating process acts in three roles: Proposer, Acceptor, and Learner. This reduces the message complexity significantly, without sacrificing correctness:
is included along with each value and is incremented in each round by the same Leader. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.
is not the value to be proposed; it is simply a unique identifier of this initial message by the Proposer. In fact, the Prepare message needn't contain the proposed value (often denoted by
The following diagram represents the first "instance" of a basic Paxos protocol, when the roles of the Proposer, Acceptor and Learner are collapsed to a single role, called the "Server".
A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. If each command is the result of a single instance of the
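Consuming such a stream of decided values can be sketched as follows (our own illustration, not from the article): each consensus instance I decides one command, and a replica applies commands strictly in instance order, buffering any instance that is decided out of order.

```python
# Illustrative sketch of a replica applying a stream of decided commands
# in instance order (names are hypothetical).

class Replica:
    def __init__(self):
        self.next_instance = 1
        self.pending = {}   # instance number -> decided command
        self.log = []       # commands applied to the state machine, in order

    def on_decided(self, instance, command):
        """Record a decided command; apply every command whose turn
        has come, in strictly increasing instance order."""
        self.pending[instance] = command
        while self.next_instance in self.pending:
            self.log.append(self.pending.pop(self.next_instance))
            self.next_instance += 1
```

Because every replica applies the same commands in the same order, all replicas of the state machine stay consistent.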
A number of optimisations can be performed to reduce the number of exchanged messages, to improve the performance of the protocol, etc. A few of these optimisations are reported below.
for the state machine. In such cases, the conflicting operations can both be accepted, avoiding the delays required for resolving conflicts and re-proposing the rejected operations.
A common deployment of the Multi-Paxos consists in collapsing the role of the Proposers, Acceptors and Learners to "Servers". So, in the end, there are only "Clients" and "Servers".
Paxos protocols are members of a theoretical class of solutions to a problem formalized as uniform agreement with crash failures. Lower bounds for this problem have been proved by
). See below the description of the phases. Remember that we assume an asynchronous model, so e.g. a processor may be in one phase while another processor may be in another.
to its proposal. If any Acceptors had previously accepted any proposal, then they'll have sent their values to the Proposer, who now must set the value of its proposal,
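The selection rule just described amounts to one small function (a sketch with a hypothetical name): given the (identifier, value) pairs reported in Promise messages, with None standing for an Acceptor that had accepted nothing, the Proposer adopts the value with the highest identifier and may use its own value only when all reports are None.

```python
# Illustrative sketch of the Proposer's value-selection rule.

def choose_value(promises, own_value):
    """promises: list of (identifier, value) pairs or None entries,
    one per Promise received. Returns the value the Proposer must use."""
    prior = [p for p in promises if p is not None]
    return max(prior)[1] if prior else own_value
```

This rule is what prevents a late Proposer from overwriting a value that may already have been chosen by an earlier round.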
|<---------X--X--X--X------>|->| Accepted(N,I,W)
|<------------------------------------X--X Response(W)
| | | | | | | |
In the subsequent instances of the basic Paxos protocol, with the same leader as in the previous instances of the basic Paxos protocol, the phase 1 can be skipped.
An example involving three main acceptors, one auxiliary acceptor and quorum size of three, showing failure of one main processor and subsequent reconfiguration:
messages for either v or its hash from a quorum of acceptors, and at least one of those messages contains v rather than its hash. However, a leader could receive
The Acceptors wait for a Prepare message from any of the Proposers. When an Acceptor receives a Prepare message, the Acceptor must examine the identifier number,
If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.
|<---------X--X--X------>|->| Accepted(N,V)
|<---------------------------------X--X Response(V)
| | | | | | |
In the following diagram, one of the Acceptors in the Quorum fails, so the Quorum size becomes 2. In this case, the Basic Paxos protocol still succeeds.
Reconfigurable state machines have strong ties to prior work on reliable group multicast protocols that support dynamic group membership, for example
messages only to the distinguished learner. In most applications, the roles of leader and distinguished learner are performed by the same processor.
with the current leader. In the diagram below, 4 unsuccessful rounds are shown, but there could be more (as suggested at the bottom of the diagram).
. The Promise must include the highest number among the Proposals that the Acceptor previously accepted, along with the corresponding accepted value.
Responses not shown. Note: message abbreviations differ from previous message flows due to specifics of the protocol, see for a full discussion.
. If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose the value it originally wanted to propose, say
is higher than every previous proposal number received by the Acceptor (from any Proposer), then the Acceptor must return a message (called a
In Paxos, clients send commands to a leader. During normal operation, the leader receives a client's command, assigns it a new command number
(also called "consistency"), Paxos defines three properties and ensures the first two are always held, regardless of the pattern of failures:
messages to the leader and the leader can inform the learners when a value has been chosen. However, this adds an extra message delay.
product to implement a general purpose fault-tolerant virtual machine used to run the configuration and control components of the
), Paxos guarantees safety (consistency), and the conditions that could prevent it from making progress are difficult to provoke.
Conflicting proposals with coordinated recovery. Note: the protocol does not specify how to handle the dropped client request.
to tolerate F failures with F+1 main processors and F auxiliary processors by dynamically reconfiguring after each failure.
If value C has been proposed, then eventually learner L will learn some value (if sufficient processors remain non-faulty).
election because there is a single node believing it is the leader and a single node known to be the leader at all times.
for a solution which tolerates corrupted messages that arise from arbitrary/malicious behavior of the messaging channels.)
I. Gupta, R. van Renesse, and K. P. Birman, 2000, A Probabilistically Correct Leader Election Protocol for Large Groups,
Microsoft uses Paxos in the Autopilot cluster management service from Bing, and in Windows Server Failover Clustering.
"Finally, observe that phase 1 is unnecessary for round 1 ... The leader of round 1 can begin the round by sending an
Accepted(N,I+1,W)
|<---------------------------------X--X Response
| | | | | | |
message are "null" the first time a proposal is made (since no Acceptor has accepted a value before in this round).
If the Acceptor has not already promised (in Phase 1b) to only consider proposals having an identifier greater than
to them. A Proposer should not initiate Paxos if it cannot communicate with enough Acceptors to constitute a Quorum.
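The Acceptor side of the two phases described here can be illustrated in a few lines. This is a minimal single-node sketch in Python; the message names (Promise, Accepted, Nack) follow the article's terminology, but the class and its API are hypothetical, not from any particular implementation:

```python
# Illustrative sketch of a Paxos Acceptor's state machine (not a real library).
class Acceptor:
    def __init__(self):
        self.promised_n = None   # highest proposal number promised so far
        self.accepted_n = None   # number of the proposal whose value was accepted
        self.accepted_v = None   # the accepted value itself, if any

    def on_prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n,
        reporting any previously accepted (number, value) pair.
        Both are None ("null") the first time a proposal is made."""
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n
            return ("Promise", n, self.accepted_n, self.accepted_v)
        return ("Nack", n)

    def on_accept(self, n, v):
        """Phase 2b: accept (n, v) if and only if no promise was made
        to a higher-numbered proposal in the meantime."""
        if self.promised_n is None or n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("Accepted", n, v)
        return ("Nack", n)
```

A Proposer that gathers Promise replies from a Quorum of such Acceptors may then issue its Accept message; an old Proposer's lower-numbered Accept is answered with a Nack.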
The Paxos protocol was first submitted in 1989 and named after a fictional legislative consensus system used on the
adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors:
message, instead of performing its Phase1b or Phase2b action, it can simply inform the leader of the chosen value.
already promised (in Phase 1b of the Paxos protocol) to only consider proposals having an identifier greater than
Processors with stable storage may re-join the protocol after failures (following a crash-recovery failure model).
) to the Proposer, indicating that the Acceptor will ignore all future proposals numbered less than or equal to n.
As Paxos's point is to ensure fault tolerance and it guarantees safety, it cannot also guarantee liveness.
round trips to recover. This situation is illustrated with operations WriteB and ReadB in the above schema.
In the following case, one of the (redundant) Learners fails, but the Basic Paxos protocol still succeeds.
. However, the Paxos protocol guarantees that consensus is permanent and the chosen value is immutable.
, which must be greater than any number previously used in a Prepare message by this Proposer. Note that
Lamport, Leslie; Malkhi, Dahlia; Zhou, Lidong (2009). "Vertical paxos and primary-backup replication".
Birman, Kenneth; Joseph, Thomas (February 1987). "Reliable Communication in the Presence of Failures".
faults (instead of the classic 2f+1), and (2) the Client to send its request to multiple destinations.
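The acceptor counts here follow from simple quorum arithmetic. As a quick illustration, the formulas 2f+1 and 3f+1 are from the text, while the helper names are mine:

```python
def crash_acceptors(f):
    """Classic (crash-fault) Paxos: tolerating f failed acceptors
    requires 2f + 1 acceptors, so any two majorities intersect."""
    return 2 * f + 1

def byzantine_acceptors(f):
    """Byzantine variants: tolerating f arbitrarily faulty acceptors
    requires 3f + 1 acceptors."""
    return 3 * f + 1

# With f = 1: three acceptors suffice for crash faults,
# but four are needed once faults may be Byzantine.
```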
for a solution that tolerates failures that arise from arbitrary/malicious behavior of the processes.)
a value may achieve a majority across Acceptors (with different identifiers) only to later be changed
"Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems"
as usual. This coordinated recovery technique requires four message delays from Client to Learner.
, to the value associated with the highest proposal number reported by the Acceptors, let's call it
), response would tell the Proposer that it can stop its attempt to create consensus with proposal
No two distinct learners can learn different values (or there can't be more than one decided value)
message in Fast Byzantine Paxos is sent to all Acceptors and all Learners, while Fast Paxos sends
negotiation algorithm for fault-tolerant and consistent replication of file data and metadata.
messages to the leader and every Learner achieving two message delays from Client to Learner.
, to a Quorum of Acceptors with the chosen value for its proposal, v, and the proposal number
PODC '06: Proceedings of the 25th Annual ACM Symposium on Principles of Distributed Computing
Lamport, Leslie; Malkhi, Dahlia; Zhou, Lidong (March 2010). "Reconfiguring a State Machine".
Amazon Elastic Container Service uses Paxos to maintain a consistent view of cluster state.
With the emergence of very high speed reliable datacenter networks that support remote DMA (
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
1716: 1146:
message to the Acceptors directly. The Acceptors would respond as in Basic Paxos, sending
uses Paxos as part of the monitor processes to agree which OSDs are up and in the cluster.
"Instead of sending the value v, the leader can send a hash of v to some acceptors in its
Paxos is usually used where durability is required (for example, to replicate a file or a
; Shraer, Alexander (2006). "Timeliness, failure-detectors, and consensus performance.".
there are 2 rounds in this case (rounds proceed vertically, from the top to the bottom).
Messages are delivered without corruption. (That is, Byzantine failures don't occur. See
Please expand the article to include this information. Further details may exist on the
Processors do not collude, lie, or otherwise attempt to subvert the protocol. (That is,
. 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011).
instance of the consensus algorithm by sending messages to a set of acceptor processes.
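The per-command instance numbering in Multi-Paxos can be sketched as follows. This is an illustrative fragment only: the class name and log representation are my own, and the actual consensus rounds run among the acceptors for each instance are elided:

```python
class MultiPaxosLeader:
    """Sketch of a Multi-Paxos leader assigning command numbers (illustrative)."""
    def __init__(self):
        self.next_instance = 0
        self.log = {}  # instance number -> command chosen in that instance

    def on_client_command(self, command):
        i = self.next_instance        # assign the command a new number i
        self.next_instance += 1
        # A real leader would now run instance i of the consensus algorithm
        # by sending messages to the acceptors; here we only record the slot.
        self.log[i] = command
        return i
```

Because a steady leader can skip Phase 1 for subsequent instances, each client command then costs only the Accept/Accepted round trips.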
Acceptors may continue to accept proposals after an identifier has achieved a majority
Heroku uses Doozerd which implements Paxos for its consistent distributed data store.
removes this extra delay, since the client sends commands directly to the Acceptors.
message should be interpreted as a "request", as in "Accept this proposal, please!".
115: 43: 2684: 2633: 2161: 2116: 2041: 928: 773: 595:
If a Proposer receives Promises from a Quorum of Acceptors, it needs to set a value
Proceedings of the Third Symposium on Operating Systems Design and Implementation
Chandra, Tushar; Griesemer, Robert; Redstone, Joshua (2007). "Paxos made live".
First, if the coordinator is part of every quorum of acceptors (round N is said
Intuitively, if the leader has no value to propose, then a client could send an
980: 880: 760:). In these cases, another round must be started with a higher proposal number. 1652:
have implemented Paxos within their DConE active-active replication technology.
1625: 1451: 1435: 954:
In this case, subsequent instances of the basic Paxos protocol (represented by
798: 728: 669: 123: 103: 39: 2407: 2056: 1378:
In practice, a commute occurs only when operations are proposed concurrently.
, one possible permutation equivalent to the previous order is the following:
Note that consensus is achieved when a majority of Acceptors accept the same
127: 94: 71: 2566: 2549:
Proceedings of the 28th ACM symposium on Principles of distributed computing
2374: 2187: 2153: 1618:
in order to keep replicas consistent in case of failure. Chubby is used by
"Microsoft Research – Emerging Technology, Computer, and Software Research"
Amazon DynamoDB uses the Paxos algorithm for leader election and consensus.
distributed SQL database uses Paxos for distributed transaction resolution.
<1:Read(A), 2:Read(B), 5:Read(A), 3:Write(B), 4:Read(B), 6:Write(A)>
<1:Read(A), 2:Read(B), 3:Write(B), 4:Read(B), 5:Read(A), 6:Write(A)>
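Whether two such orderings are equivalent can be checked mechanically: two operations commute unless they touch the same object and at least one of them is a write. A sketch of that check follows; the representation and helper names are my own, not part of the protocol:

```python
def conflicts(op1, op2):
    """Two operations conflict iff they touch the same object and one writes."""
    (kind1, obj1), (kind2, obj2) = op1, op2
    return obj1 == obj2 and "W" in (kind1, kind2)

def equivalent(seq_a, seq_b):
    """True iff both sequences hold the same operations and every
    conflicting pair appears in the same relative order in both."""
    if sorted(seq_a) != sorted(seq_b):
        return False
    pos_in_b = {op: i for i, op in enumerate(seq_b)}
    for i, x in enumerate(seq_a):
        for y in seq_a[i + 1:]:
            if conflicts(x[1], y[1]) and pos_in_b[x] > pos_in_b[y]:
                return False
    return True

# The two orderings from the text, as (index, (kind, object)); "R" = Read, "W" = Write.
a = [(1, ("R", "A")), (2, ("R", "B")), (3, ("W", "B")),
     (4, ("R", "B")), (5, ("R", "A")), (6, ("W", "A"))]
b = [(1, ("R", "A")), (2, ("R", "B")), (5, ("R", "A")),
     (3, ("W", "B")), (4, ("R", "B")), (6, ("W", "A"))]
```

Here `equivalent(a, b)` holds because only non-conflicting pairs were reordered; moving 3:Write(B) past 4:Read(B) instead would make the check fail.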
Messages are sent asynchronously and may take arbitrarily long to deliver.
2703: 2175: 1468: 1221: 134: 90: 67: 2881: 2071: 2033: 1980: 1963: 1879: 1153:
If the leader detects a collision, it resolves the collision by sending
consensus protocol can guarantee progress in an asynchronous network (a
Kolbeck, Björn; Högqvist, Mikael; Stender, Jan; Hupfeld, Felix (2011).
1541: in this section. Unsourced material may be challenged and removed. 752:
messages, or when the Proposer does not receive a Quorum of responses (
111: 2098: 2107: 1201:
Message flow: Fast Paxos with uncoordinated recovery, collapsed roles
2274: 2220: 1694:
NoSQL database uses Paxos for Light Weight Transaction feature only.
1516: 1481: 1441: 1023:"Acceptors do not care what value is chosen. They simply respond to 1697: 1655: 1649: 1619: 894:
Basic Paxos where new Proposers cannot change an existing consensus
78: 2258: 1622:
which is now in production in Google Analytics and other products.
1177: 399:
A typical implementation's message flow is covered in the section
2726:"The Chubby lock service for loosely-coupled distributed systems" 1802:"Time, Clocks and the Ordering of Events in a Distributed System" 929:
Graphic representation of the flow of messages in the Multi-Paxos
774:
Graphic representation of the flow of messages in the basic Paxos
"Impossibility of distributed consensus with one faulty process"
Pease, Marshall; Shostak, Robert; Lamport, Leslie (April 1980).
523: 2598:
Lamport, Leslie; Shostak, Robert; Pease, Marshall (July 1982).
messages. A learner will learn that v is chosen if it receives
or, in case none of the Acceptors previously accepted a value,
118: 2003:
Dwork, Cynthia; Lynch, Nancy; Stockmeyer, Larry (April 1988).
in this table indicates operations which are non-commutative.
message) as the accepted value (of the Protocol), and send an
1681: 1347: 981:
Multi-Paxos when roles are collapsed and the leader is steady
Basic Paxos where a multi-identifier majority is insufficient
829: 142:
datacenter hardware (but uses TCP if RDMA is not available).
SRDS'11: 30th IEEE Symposium on Reliable Distributed Systems
International Conference on Dependable Systems and Networks
Basic Paxos where an Acceptor accepts Two Different Values
In general, a consensus algorithm can make progress using
NoSQL database uses Paxos for Light Weight Transactions.
protocol, a significant amount of overhead would result.
https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf
https://www.usenix.org/system/files/atc22-elhemali.pdf
Van Renesse, Robbert; Altinbuken, Deniz (2015-02-17).
Message flow: Fast Byzantine Multi-Paxos, steady state
ACM Transactions on Programming Languages and Systems
Rounds fail when multiple Proposers send conflicting
372: 352: 250: 215: 2692:
IEEE Transactions on Dependable and Secure Computing
Martin, Jean-Philippe; Alvisi, Lorenzo (July 2006).
ACM Symposium on Principles of Distributed Computing
processors, despite the simultaneous failure of any
187:
Processors can send messages to any other processor.
“Flease - Lease Coordination without a Lock Server”
830:Basic Paxos when a redundant learner fails 89:The topic predates the protocol. In 1988, 30:Consensus protocols are the basis for the 2615: 2556: 2512:"The Paxos Family of Consensus Protocols" 2432: 2244:"Lower Bounds for Asynchronous Consensus" 2235: 2143: 2106: 2054: 2023: 1979: 1929: 1869: 1848: 1819: 1770: 1601:Learn how and when to remove this message 1434:Byzantine Paxos introduced by Castro and 1382:Message flow: Generalized Paxos (example) 1354:A possible sequence of operations : 1169:Message flow: Fast Paxos, non-conflicting 535:about handling the first Prepare message. 478:and sends the Prepare message containing 16:Family of protocols for solving consensus 2770: 2482: 2480: 2478: 2476: 2474: 2399: 1842: 797: 2326: 2313: 2304: 2300: 2298: 2296: 2294: 2292: 2241: 2194: 1961: 1911: 1799: 950:Multi-Paxos when phase 1 can be skipped 499:, of that message. There are two cases: 204: 2943: 2790: 2591: 2509: 1894: 160:Processors operate at arbitrary speed. 2655:"Practical Byzantine Fault Tolerance" 2525:Pierre, Sutra; Marc, Shapiro (2011). 2487:Lamport, Leslie; Massa, Mike (2004). 2471: 2440:"Leader Election, Why Should I Care?" 2123: 1957: 1955: 1901:Leslie Lamport's history of the paper 1793: 805:Here, V is the last of (Va, Vb, Vc). 782:Note that the values returned in the 668:, from a Proposer, it must accept it 655: 336: 22:is a family of protocols for solving 2883:https://www.allthingsdistributed.com 2527:"Fast Genuine Generalized Consensus" 2518: 2356: 2289: 2208:ACM Transactions on Computer Systems 2087:ACM Transactions on Computer Systems 2055:Oki, Brian; Liskov, Barbara (1988). 
1996: 1918:ACM Transactions on Computer Systems 1905: 1744: 1539:adding citations to reliable sources 1510: 1478:will re-broadcast the agreed value: 1214: 966:Multi-Paxos when roles are collapsed 765:Paxos can be used to select a leader 729:Acceptors can accept multiple values 517: 486: 436: 197: 174: 2723: 2078: 1157:messages for a new round which are 742: 586: 449:creates a message, which we call a 163:Processors may experience failures. 13: 2885:, Dr Werner Vogels- (2015-07-20). 1952: 1421: 1334: 1327: 1311: 1289: 1282: 1266: 1068:"Instead of each acceptor sending 921:To achieve this, the round number 818:Basic Paxos when an Acceptor fails 14: 2967: 2329:"Generalized Consensus and Paxos" 2048: 1722:Chandra–Toueg consensus algorithm 1085:message with any proposed value." 977:| Response | | | | 842:Basic Paxos when a Proposer fails 2600:"The Byzantine Generals Problem" 1643:services offered by the cluster. 1515: 1491:Adapting Paxos for RDMA networks 1346: 1335: 1328: 1312: 1290: 1283: 1267: 1127: 1095: 992: 946:where V = last of (Va, Vb, Vc). 912: 522: 470:The Proposer chooses at least a 400: 114:'s work in 1985 and 1987 on the 2956:Fault-tolerant computer systems 2926: 2875: 2851: 2827: 2803: 2746: 2735: 2717: 2676: 2646: 2540: 2458: 2259:"Paxos Made Moderately Complex" 2250: 1526:needs additional citations for 1206:(merged Acceptor/Learner roles) 1111:Message flow: Cheap Multi-Paxos 691:, it should register the value 1395: 1089: 906: 406: 273:Safety and liveness properties 145: 1: 2406:Quesada Torres, Luis (2018). 1800:Lamport, Leslie (July 1978). 1737: 1550:"Paxos" computer science 1121: 475: 446: 154: 2891:www.allthingsdistributed.com 2863:opensource.docs.scylladb.com 1912:Lamport, Leslie (May 1998). 1465:messages only to Learners): 1135:acceptors to tolerate up to 934:Multi-Paxos without failures 790:Basic Paxos without failures 471: 450: 7: 1710: 10: 2972: 2685:"Fast Byzantine Consensus" 1962:Fischer, M. (April 1985). 
1914:"The Part-Time Parliament" 1336: 1329: 1313: 1291: 1284: 1268: 1231: 809:Error cases in basic Paxos 581: 431: 181: 84: 1807:Communications of the ACM 1637:IBM SAN Volume Controller 309:Termination (or liveness) 32:state machine replication 2422:Lamport, Leslie (2001). 2327:Lamport, Leslie (2005). 2305:Lamport, Leslie (2005). 2242:Lamport, Leslie (2004). 1849:Schneider, Fred (1990). 1616:distributed lock service 565:negative acknowledgement 2787:. WANdisco white paper. 2567:10.1145/1582716.1582783 2375:10.1145/1281100.1281103 2188:10.1145/1146381.1146408 2154:10.1145/1753171.1753191 1507:Production use of Paxos 1126:Fast Paxos generalizes 1010:"A leader can send its 2951:Distributed algorithms 2510:Turner, Bryan (2007). 2344:Cite journal requires 1641:storage virtualization 1222:commutative operations 1108: 802: 723:(rather than the same 695:(of the just received 616:The Proposer sends an 533:is missing information 389: 380: 366:, and then begins the 360: 277:In order to guarantee 258: 238: 237:{\displaystyle n=2F+1} 2776:Aahlad et al.(2011). 2626:10.1145/357172.357176 2263:ACM Computing Surveys 1931:10.1145/279227.279229 1858:ACM Computing Surveys 1821:10.1145/359545.359563 1772:10.1145/322186.322188 1104: 801: 381: 361: 343: 259: 239: 116:virtually synchronous 62:proved in a paper by 36:distributed computing 2704:10.1109/TDSC.2006.35 2468:, Cornell University 2369:. pp. 398–407. 1717:Two generals problem 1535:improve this article 1241:Commutativity Table 1094:Cheap Paxos extends 370: 350: 248: 213: 205:Number of processors 2493:Proceedings of the 2446:. 13 September 2013 2412:. Google TechTalks. 
2409:The Paxos Algorithm 2072:10.1145/62546.62549 2034:10.1145/42282.42283 1981:10.1145/3149.214121 1880:10.1145/98163.98167 1658:uses a Paxos-based 1364:commutes with both 1242: 316:Note that Paxos is 2783:2016-04-15 at the 2758:Microsoft Research 2012:Journal of the ACM 1968:Journal of the ACM 1429:Byzantine failures 1240: 803: 636:message is either 376: 356: 337:Typical deployment 254: 234: 171:Byzantine failures 38:, as suggested by 2912:External link in 2424:Paxos Made Simple 2269:(3): 42:1–42:36. 2099:10.1145/7351.7478 2066:. pp. 8–17. 1611: 1610: 1603: 1585: 1388: 1343: 1342: 1215:Generalized Paxos 1207: 1193: 1184: 814:diagrams/cases). 721:identifier number 556: 555: 394:safety properties 379:{\displaystyle i} 359:{\displaystyle i} 257:{\displaystyle F} 173:don't occur. See 140:remote DMA (RDMA) 2963: 2935: 2930: 2924: 2923: 2917: 2916: 2910: 2908: 2900: 2898: 2897: 2879: 2873: 2872: 2870: 2869: 2855: 2849: 2848: 2846: 2845: 2831: 2825: 2824: 2822: 2821: 2807: 2801: 2794: 2788: 2774: 2768: 2767: 2765: 2764: 2750: 2744: 2739: 2733: 2732: 2730: 2721: 2715: 2714: 2712: 2710: 2689: 2680: 2674: 2673: 2671: 2669: 2659: 2650: 2644: 2643: 2641: 2640: 2619: 2595: 2589: 2588: 2560: 2544: 2538: 2537: 2531: 2522: 2516: 2515: 2507: 2501: 2500: 2484: 2469: 2466:Technical Report 2462: 2456: 2455: 2453: 2451: 2436: 2430: 2420: 2414: 2413: 2403: 2397: 2396: 2360: 2354: 2353: 2347: 2342: 2340: 2332: 2324: 2311: 2310: 2302: 2287: 2286: 2254: 2248: 2247: 2239: 2233: 2232: 2201: 2192: 2191: 2172: 2166: 2165: 2147: 2127: 2121: 2120: 2110: 2082: 2076: 2075: 2052: 2046: 2045: 2027: 2009: 2000: 1994: 1993: 1983: 1959: 1950: 1949: 1947: 1946: 1933: 1909: 1903: 1898: 1892: 1891: 1873: 1855: 1846: 1840: 1839: 1837: 1836: 1823: 1797: 1791: 1790: 1788: 1787: 1774: 1748: 1692:Apache Cassandra 1686:Apache ZooKeeper 1606: 1599: 1595: 1592: 1586: 1584: 1543: 1519: 1511: 1386: 1371: 1367: 1363: 1350: 1339: 1338: 1332: 1331: 1316: 1315: 1294: 1293: 1287: 1286: 1271: 1270: 1243: 1239: 1205: 
1191: 1182: 924: 743:When rounds fail 551: 548: 542: 526: 518: 385: 383: 382: 377: 365: 363: 362: 357: 263: 261: 260: 255: 243: 241: 240: 235: 42:and surveyed by 2971: 2970: 2966: 2965: 2964: 2962: 2961: 2960: 2941: 2940: 2939: 2938: 2931: 2927: 2914: 2913: 2911: 2902: 2901: 2895: 2893: 2880: 2876: 2867: 2865: 2857: 2856: 2852: 2843: 2841: 2833: 2832: 2828: 2819: 2817: 2809: 2808: 2804: 2795: 2791: 2785:Wayback Machine 2775: 2771: 2762: 2760: 2752: 2751: 2747: 2740: 2736: 2728: 2724:Burrows, Mike. 2722: 2718: 2708: 2706: 2687: 2681: 2677: 2667: 2665: 2657: 2651: 2647: 2638: 2636: 2596: 2592: 2577: 2558:10.1.1.150.1791 2545: 2541: 2529: 2523: 2519: 2508: 2504: 2485: 2472: 2463: 2459: 2449: 2447: 2438: 2437: 2433: 2421: 2417: 2405: 2404: 2400: 2385: 2361: 2357: 2345: 2343: 2334: 2333: 2325: 2314: 2303: 2290: 2275:10.1145/2673577 2255: 2251: 2240: 2236: 2221:10.1145/3302258 2202: 2195: 2173: 2169: 2145:10.1.1.212.2168 2128: 2124: 2083: 2079: 2053: 2049: 2007: 2001: 1997: 1960: 1953: 1944: 1942: 1910: 1906: 1899: 1895: 1853: 1847: 1843: 1834: 1832: 1798: 1794: 1785: 1783: 1749: 1745: 1740: 1713: 1607: 1596: 1590: 1587: 1544: 1542: 1532: 1520: 1509: 1493: 1488: 1484: 1475: 1471: 1448: 1444: 1424: 1422:Byzantine Paxos 1398: 1393: 1384: 1376: 1369: 1365: 1361: 1358: 1234: 1217: 1212: 1203: 1198: 1189: 1180: 1175: 1171: 1124: 1119: 1113: 1092: 995: 990: 983: 978: 968: 963: 952: 944: 936: 931: 922: 909: 904: 896: 891: 883: 878: 870: 865: 857: 852: 844: 839: 832: 827: 820: 811: 792: 776: 767: 745: 661: 592: 584: 552: 546: 543: 536: 527: 492: 442: 434: 409: 371: 368: 367: 351: 348: 347: 339: 331:fault tolerance 275: 249: 246: 245: 214: 211: 210: 207: 198:Byzantine Paxos 184: 175:Byzantine Paxos 157: 148: 87: 17: 12: 11: 5: 2969: 2959: 2958: 2953: 2937: 2936: 2925: 2874: 2850: 2826: 2802: 2789: 2769: 2745: 2734: 2716: 2698:(3): 202–215. 2675: 2645: 2617:10.1.1.64.2312 2610:(3): 382–401. 
2590: 2575: 2539: 2517: 2502: 2470: 2457: 2431: 2415: 2398: 2383: 2355: 2346:|journal= 2312: 2288: 2249: 2234: 2193: 2167: 2122: 2077: 2047: 2025:10.1.1.13.3423 2018:(2): 288–323. 1995: 1974:(2): 374–382. 1951: 1924:(2): 133–169. 1904: 1893: 1871:10.1.1.69.1536 1864:(4): 299–319. 1841: 1814:(7): 558–565. 1792: 1765:(2): 228–234. 1742: 1741: 1739: 1736: 1735: 1734: 1729: 1724: 1719: 1712: 1709: 1708: 1707: 1704: 1701: 1695: 1689: 1679: 1672: 1666: 1663: 1653: 1647: 1644: 1633: 1629: 1626:Google Spanner 1623: 1609: 1608: 1523: 1521: 1514: 1508: 1505: 1492: 1489: 1485: 1483: 1480: 1472: 1470: 1467: 1445: 1443: 1440: 1423: 1420: 1419: 1418: 1414: 1397: 1394: 1390: 1383: 1380: 1374: 1356: 1341: 1340: 1333: 1326: 1324: 1322: 1318: 1317: 1310: 1308: 1306: 1304: 1300: 1299: 1297: 1295: 1288: 1281: 1277: 1276: 1274: 1272: 1265: 1263: 1259: 1258: 1255: 1252: 1249: 1246: 1233: 1230: 1216: 1213: 1209: 1202: 1199: 1195: 1186: 1179: 1176: 1172: 1170: 1167: 1123: 1120: 1117: 1112: 1109: 1091: 1088: 1087: 1086: 1078: 1077: 1065: 1064: 1060: 1059: 1042: 1041: 1020: 1019: 1007: 1006: 994: 991: 987: 982: 979: 975: 967: 964: 960: 951: 948: 941: 935: 932: 930: 927: 908: 905: 901: 895: 892: 888: 882: 879: 875: 869: 866: 862: 856: 853: 849: 843: 840: 836: 831: 828: 824: 819: 816: 810: 807: 791: 788: 775: 772: 766: 763: 762: 761: 744: 741: 717: 716: 715: 714: 708: 707: 706: 705: 682: 681: 670:if and only if 660: 654: 646: 645: 613: 612: 591: 585: 583: 580: 579: 578: 557: 554: 553: 530: 528: 521: 501: 500: 491: 485: 484: 483: 467: 466: 441: 435: 433: 430: 408: 405: 375: 355: 338: 335: 314: 313: 310: 307: 304: 295:Agreement (or 293: 290: 287:non-triviality 274: 271: 267:configurations 253: 233: 230: 227: 224: 221: 218: 206: 203: 202: 201: 194: 191: 188: 183: 180: 179: 178: 167: 164: 161: 156: 153: 147: 144: 86: 83: 44:Fred Schneider 40:Leslie Lamport 15: 9: 6: 4: 3: 2: 2968: 2957: 2954: 2952: 2949: 2948: 2946: 2934: 2929: 2921: 2906: 2892: 2888: 2884: 2878: 2864: 2860: 2854: 2840: 2836: 
2830: 2816: 2812: 2806: 2799: 2793: 2786: 2782: 2779: 2773: 2759: 2755: 2749: 2743: 2738: 2727: 2720: 2705: 2701: 2697: 2693: 2686: 2679: 2663: 2656: 2649: 2635: 2631: 2627: 2623: 2618: 2613: 2609: 2605: 2601: 2594: 2586: 2582: 2578: 2576:9781605583969 2572: 2568: 2564: 2559: 2554: 2550: 2543: 2535: 2528: 2521: 2513: 2506: 2498: 2496: 2490: 2489:"Cheap Paxos" 2483: 2481: 2479: 2477: 2475: 2467: 2461: 2445: 2441: 2435: 2428: 2425: 2419: 2411: 2410: 2402: 2394: 2390: 2386: 2384:9781595936165 2380: 2376: 2372: 2368: 2367: 2359: 2351: 2338: 2330: 2323: 2321: 2319: 2317: 2308: 2301: 2299: 2297: 2295: 2293: 2284: 2280: 2276: 2272: 2268: 2264: 2260: 2253: 2245: 2238: 2230: 2226: 2222: 2218: 2214: 2210: 2209: 2200: 2198: 2189: 2185: 2181: 2177: 2171: 2163: 2159: 2155: 2151: 2146: 2141: 2137: 2133: 2126: 2118: 2114: 2109: 2104: 2100: 2096: 2092: 2088: 2081: 2073: 2069: 2065: 2064: 2058: 2051: 2043: 2039: 2035: 2031: 2026: 2021: 2017: 2013: 2006: 1999: 1991: 1987: 1982: 1977: 1973: 1969: 1965: 1958: 1956: 1941: 1937: 1932: 1927: 1923: 1919: 1915: 1908: 1902: 1897: 1889: 1885: 1881: 1877: 1872: 1867: 1863: 1859: 1852: 1845: 1831: 1827: 1822: 1817: 1813: 1809: 1808: 1803: 1796: 1782: 1778: 1773: 1768: 1764: 1760: 1759: 1754: 1747: 1743: 1733: 1730: 1728: 1727:State machine 1725: 1723: 1720: 1718: 1715: 1714: 1705: 1702: 1699: 1696: 1693: 1690: 1687: 1683: 1680: 1677: 1676:MariaDB Xpand 1673: 1670: 1667: 1664: 1661: 1657: 1654: 1651: 1648: 1645: 1642: 1638: 1634: 1630: 1627: 1624: 1621: 1617: 1613: 1612: 1605: 1602: 1594: 1583: 1580: 1576: 1573: 1569: 1566: 1562: 1559: 1555: 1552: â€“  1551: 1547: 1546:Find sources: 1540: 1536: 1530: 1529: 1524:This section 1522: 1518: 1513: 1512: 1504: 1500: 1498: 1479: 1466: 1464: 1460: 1455: 1453: 1439: 1437: 1432: 1430: 1415: 1412: 1408: 1407: 1406: 1402: 1389: 1379: 1373: 1355: 1352: 1349: 1325: 1323: 1320: 1319: 1309: 1307: 1305: 1302: 1301: 1298: 1296: 1279: 1278: 1275: 1273: 1264: 1261: 1260: 1256: 1253: 1250: 1247: 1245: 1244: 1238: 
1229: 1225: 1223: 1208: 1194: 1185: 1166: 1162: 1160: 1156: 1151: 1149: 1145: 1140: 1138: 1134: 1129: 1116: 1107: 1103: 1099: 1097: 1084: 1080: 1079: 1075: 1071: 1067: 1066: 1062: 1061: 1056: 1052: 1048: 1044: 1043: 1039: 1035: 1030: 1026: 1022: 1021: 1017: 1013: 1009: 1008: 1004: 1000: 999: 998: 993:Optimisations 986: 974: 971: 959: 957: 947: 940: 926: 919: 916: 914: 900: 887: 874: 861: 848: 835: 823: 815: 806: 800: 796: 787: 785: 780: 771: 759: 755: 751: 747: 746: 740: 738: 734: 730: 726: 722: 712: 711: 710: 709: 702: 698: 694: 690: 686: 685: 684: 683: 679: 675: 671: 667: 663: 662: 659: 653: 651: 643: 639: 635: 631: 627: 623: 619: 615: 614: 610: 606: 602: 598: 594: 593: 590: 576: 572: 571: 566: 562: 558: 550: 540: 534: 531:This article 529: 525: 520: 519: 515: 511: 507: 503: 502: 498: 494: 493: 490: 481: 477: 473: 469: 468: 464: 460: 456: 452: 448: 444: 443: 440: 429: 427: 423: 419: 415: 404: 402: 397: 395: 388: 387: 373: 353: 342: 334: 332: 328: 324: 319: 311: 308: 305: 302: 298: 294: 291: 288: 285:Validity (or 284: 283: 282: 280: 270: 268: 251: 231: 228: 225: 222: 219: 216: 199: 195: 192: 189: 186: 185: 176: 172: 168: 165: 162: 159: 158: 152: 143: 141: 136: 131: 129: 125: 120: 117: 113: 108: 105: 100: 96: 92: 82: 80: 75: 73: 69: 65: 61: 55: 52: 47: 45: 41: 37: 33: 28: 25: 21: 2928: 2894:. Retrieved 2890: 2877: 2866:. Retrieved 2862: 2853: 2842:. Retrieved 2838: 2829: 2818:. Retrieved 2814: 2805: 2792: 2772: 2761:. Retrieved 2757: 2748: 2737: 2719: 2707:. Retrieved 2695: 2691: 2678: 2666:. Retrieved 2661: 2648: 2637:. Retrieved 2607: 2603: 2593: 2548: 2542: 2533: 2520: 2505: 2492: 2465: 2460: 2448:. Retrieved 2444:Elastic Blog 2443: 2434: 2426: 2418: 2408: 2401: 2365: 2358: 2337:cite journal 2307:"Fast Paxos" 2266: 2262: 2252: 2237: 2212: 2206: 2179: 2176:Keidar, Idit 2170: 2138:(1): 63–73. 2135: 2131: 2125: 2090: 2086: 2080: 2060: 2050: 2015: 2011: 1998: 1971: 1967: 1943:. Retrieved 1921: 1917: 1907: 1896: 1861: 1857: 1844: 1833:. 
Retrieved 1811: 1805: 1795: 1784:. Retrieved 1762: 1756: 1746: 1597: 1591:October 2018 1588: 1578: 1571: 1564: 1557: 1545: 1533:Please help 1528:verification 1525: 1501: 1494: 1476: 1462: 1458: 1456: 1449: 1433: 1425: 1410: 1403: 1399: 1385: 1377: 1359: 1353: 1344: 1235: 1226: 1218: 1204: 1190: 1181: 1163: 1158: 1154: 1152: 1147: 1143: 1141: 1136: 1132: 1125: 1114: 1105: 1100: 1093: 1082: 1073: 1069: 1054: 1050: 1046: 1037: 1033: 1028: 1024: 1015: 1011: 1002: 996: 984: 972: 969: 955: 953: 945: 937: 920: 917: 910: 897: 884: 871: 858: 845: 833: 821: 812: 804: 793: 783: 781: 777: 768: 757: 753: 749: 724: 720: 718: 700: 696: 692: 688: 677: 673: 665: 657: 649: 647: 641: 637: 633: 629: 625: 621: 617: 608: 604: 600: 596: 588: 574: 568: 564: 560: 544: 532: 513: 509: 505: 496: 488: 479: 462: 458: 454: 438: 425: 421: 417: 413: 410: 398: 390: 345: 344: 340: 330: 326: 322: 317: 315: 300: 296: 286: 278: 276: 266: 208: 149: 132: 109: 88: 76: 56: 48: 34:approach to 29: 19: 18: 2915:|last= 2450:27 February 2132:SIGACT News 1396:Performance 1128:Basic Paxos 1096:Basic Paxos 1090:Cheap Paxos 913:Basic Paxos 907:Multi-Paxos 407:Basic Paxos 401:Multi-Paxos 297:consistency 146:Assumptions 2945:Categories 2896:2024-09-19 2868:2024-09-19 2844:2024-09-19 2820:2024-09-19 2763:2024-09-19 2639:2007-02-02 2497:(DSN 2004) 1945:2007-02-02 1835:2007-02-02 1786:2007-02-02 1738:References 1561:newspapers 1366:3:Write(B) 1345:Note that 1122:Fast Paxos 656:Phase 2b: 587:Phase 2a: 487:Phase 1b: 437:Phase 1a: 155:Processors 130:and Zhou. 99:Stockmeyer 2664:: 173–186 2612:CiteSeerX 2553:CiteSeerX 2393:207164635 2283:0360-0300 2229:218482757 2140:CiteSeerX 2108:1813/6534 2093:: 47–76. 2020:CiteSeerX 1990:207660233 1866:CiteSeerX 1830:215822405 1698:ScyllaDB 1688:from v1.9 1457:Note the 1370:4:Read(B) 1362:5:Read(A) 1321:Write(B) 1280:Write(A) 1257:Write(B) 704:message). 
620:message, 547:July 2024 539:talk page 476:Acceptors 24:consensus 2905:cite web 2839:DataStax 2781:Archived 2634:55899582 2162:15189602 2117:11224827 2042:17007235 1711:See also 1656:XtreemFS 1650:WANdisco 1632:changes. 1620:Bigtable 1463:Accepted 1459:Accepted 1411:centered 1303:Read(B) 1262:Read(A) 1251:Write(A) 1159:Accepted 1148:Accepted 1074:Accepted 1070:Accepted 1051:Accepted 1003:Accepted 758:Accepted 701:Accepted 658:Accepted 642:(n, v=x) 638:(n, v=z) 447:Proposer 327:liveness 79:database 72:Paterson 2815:MariaDB 2731:. OSDI. 2709:5 March 2668:5 March 2585:2763624 1781:6429068 1575:scholar 1254:Read(B) 1248:Read(A) 1232:Example 1155:Accept! 1144:Accept! 1083:Accept! 1055:Promise 1047:Accept! 1038:Accept! 1034:Prepare 1029:Accept! 1025:Prepare 1016:Accept! 1012:Prepare 784:Promise 754:Promise 750:Prepare 672:it has 630:Prepare 582:Phase 2 510:Promise 489:Promise 451:Prepare 439:Prepare 432:Phase 1 182:Network 124:Lamport 85:History 64:Fischer 2632:  2614:  2583:  2573:  2555:  2391:  2381:  2281:  2227:  2160:  2142:  2115:  2040:  2022:  1988:  1940:421028 1938:  1888:678818 1886:  1868:  1828:  1779:  1577:  1570:  1563:  1556:  1548:  1452:Alvisi 1436:Liskov 1405:time. 1360:Since 735:, and 697:Accept 666:(n, v) 650:Accept 634:Accept 622:(n, v) 618:Accept 589:Accept 472:Quorum 329:, and 323:safety 301:safety 279:safety 135:Keidar 128:Malkhi 119:gbcast 112:Birman 104:Liskov 60:result 2729:(PDF) 2688:(PDF) 2658:(PDF) 2630:S2CID 2581:S2CID 2530:(PDF) 2389:S2CID 2225:S2CID 2215:(2). 


