Understanding consensus and Paxos in distributed systems (ifeanyi.co)
107 points by kiyanwang on May 17, 2016 | hide | past | favorite | 25 comments


Readers would also be interested in this excellent Raft consensus algorithm visualization: http://thesecretlivesofdata.com/raft/

As well as: "Ice cream & distributed systems": https://brooker.co.za/blog/2014/10/25/ice-cream.html


Since I think that Raft is relatively easy to grok and implement, what situations make Paxos a better choice these days?


Nice links


If you're interested in learning more about consensus algorithms, there's a great website that describes many of the variants: http://paxos.systems/.

The author also published an excellent paper on the subject called "Paxos Made Moderately Complex" (http://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf)


I like how the article steps through and examines the key points of Paxos with a relatable scenario. Still a complicated subject for sure. Makes me want to go home and implement a working version for myself.


It's surprisingly easy to fool yourself into thinking you've successfully implemented Paxos.

You may want to try 6.824: Distributed Systems https://pdos.csail.mit.edu/6.824/ I learned a lot from working through the online coursework.

Alternatively, download the labs and see the wide variety of unit tests. There are a few devilishly difficult ones.


Here you can find an example in JavaScript that visualizes the Paxos algorithm: http://harry.me/blog/2014/12/27/neat-algorithms-paxos/


If you want a system that's going to "just work" and not be complicated or potentially error-prone (what happens when too many nodes don't respond?), use synchronous distributed design. Paxos has its uses, but I would argue they're few and far between.


What is "synchronous distributed design"? Where all nodes have to respond and the entire system blocks until they do?


To answer your question: fundamentally, a synchronous distributed design has the nodes in the system set an upper limit on how long things take to happen, like maximum RTT and computation time.

Algorithms based on synchronous systems are much simpler, but they're also unrealistic both in theory and usually in practice (without clever hacks handling edge cases). For example, it's difficult to set a time limit on how long a node on the internet will take to respond. One may reasonably say 10 seconds max, but what happens on the day it takes 10.1 seconds? The system collapses. Asynchronous systems don't consider time at all, so they don't suffer from those classes of problems, but algorithms based on them are more complex.
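The timeout-based failure detection described above can be sketched as follows (a minimal illustration, not from the article; the 10-second bound and the function name are assumptions for the example):

```python
def node_status(response_time, max_rtt=10.0):
    """Classify a node under the synchronous model's assumed time bound.

    response_time is the observed reply latency in seconds, or None if
    no reply arrived. Under the synchronous assumption, a node that
    answers in 10.1 seconds is indistinguishable from a crashed one.
    """
    if response_time is not None and response_time <= max_rtt:
        return "alive"
    return "failed"  # slow and dead look identical past the bound
```

This is exactly the fragility the comment describes: the whole design rests on `max_rtt` being a true bound, and one slow day breaks the model.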


Depending on your needs, that may be sufficient. It's certainly much simpler if you can tolerate a higher rate of downtime.


2PC is a nice algorithm for it.


Right... 2PC is where all nodes have to respond and the entire system blocks until they do.
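A bare-bones sketch of two-phase commit from the coordinator's side, showing exactly the blocking behavior described above (the `Participant` class and method names are invented for illustration, not from any particular system):

```python
class Participant:
    """Toy participant that votes yes/no in the prepare phase."""
    def __init__(self, vote_yes):
        self.vote_yes = vote_yes
        self.state = "idle"
    def prepare(self, txn):
        self.state = "prepared"
        return self.vote_yes
    def commit(self, txn):
        self.state = "committed"
    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1: prepare. Each call blocks; one dead node stalls the round.
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        # Phase 2: everyone voted yes, so commit everywhere.
        for p in participants:
            p.commit(txn)
        return True
    # Any "no" vote (or a timeout treated as "no") aborts everywhere.
    for p in participants:
        p.abort(txn)
    return False
```

Note that a real implementation also has to persist votes to survive a coordinator crash; that recovery problem is precisely where 2PC blocks and Paxos does not.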


Interesting. I'm working on an auction platform, and I created something called Atomic Store [1] to enforce consistency in an Akka-based append-only log deployed on a cluster. But it relies on the logs being singletons. I suppose another way to do it would be to have the logs not be singletons, but have them use two-phase commit, or something similar, to keep the whole thing synchronized.

[1] https://github.com/artsy/atomic-store


If you don't care about tolerating a single failure, then yes, pretending the network is synchronous can simplify your system.


Uh. Synchronous distributed systems tolerate failure just fine. It's a distributed system. You expect failure. Hell, failure happens in all kinds of systems and components. If your applications (and system) can't cope with that you won't last long.


What exactly do you mean by "synchronous"? Do you have a particular system in mind?

If you mean "block waiting for all replicas to respond", then the failure mode is quite different from Paxos. I don't know of one such system that can tolerate two simultaneous failures or network partitions, unlike Paxos.


Well first of all, for Paxos to survive 2 failures it would need 5 nodes; Paxos won't work very well for smaller networks, and a bad network design [or lack of good maintenance] could take out a majority or all of your nodes, making the whole system worthless.
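The "5 nodes for 2 failures" figure comes from Paxos's majority-quorum requirement: to tolerate f failures you need 2f + 1 nodes, so that any two majorities still intersect. A quick sketch of the arithmetic (function names are my own):

```python
def nodes_needed(failures):
    """Minimum cluster size for Paxos to tolerate `failures` crashes."""
    return 2 * failures + 1

def failures_tolerated(nodes):
    """Crashes a Paxos cluster of `nodes` can survive (majority must remain)."""
    return (nodes - 1) // 2
```

Note the even-size penalty: a 4-node cluster tolerates no more failures than a 3-node one, which is part of why small Paxos deployments are awkward.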

Second, most systems don't need atomic write operations to succeed across network partitions, when network partition is even a concern (for general operations), which it isn't usually. If your network is partitioned, you can't really "decide" anything until it un-partitions, at which point you don't need to come to consensus if you have simpler methods of solving discrepancy - like using a higher level decision process than block-level or database-level. Even after the discrepancy is solved, network-wide changes may have invalidated the initial partitioned operations. Paxos tries to over-simplify a complex operation.

When it is a concern, application design often trumps system design in terms of reliable operation. It completely depends on your network, system, platform and application design. In general, you won't need Paxos to reap the benefits of distributed systems, and synchronous systems will give you more benefits and less headaches.


Synchronous systems have the same problems you describe in your first paragraph, but worse (they tolerate fewer failures), so I'm not sure what you're trying to say.

I don't buy the rest of your post for two reasons. First, partitions are not only problematic for writes: externally consistent systems generally cannot serve reads during a partition either. Furthermore, it is not true that partitions are not a realistic worry; I know of at least one large system that encounters partitions on a weekly basis.

Second, you seem to be arguing that for some reason it is easier to solve consensus at the application level. This is simply not true, otherwise we wouldn't have consistent databases.


I'm saying it depends on what you're doing... They may fail immediately, or they will simply survive implicitly because the operation was atomic, or it was a read operation and it didn't matter anyway, etc.

Consistency is [sometimes] a lie. What are you gonna do, read from the database directly every single time a web server wants to use a user's session cookie? You cache it, you have cache controls, you try to invalidate caches or update them if a new operation changes the session, but there's totally a possible race condition that will be resolved if someone tries a write on an invalid session. There are plenty of valid, harmless operations on stale data due to network partition; it's not the end of the world.

Of course some designs are more vulnerable to network partition than others (some designs only work on one network, some require multiple networks) so there are of course cases where you need something like Paxos. I'm saying it's less common than people want to believe.

I find higher level consensus easier because you have more context of what's going on. Session-based operations, for example; if the same operation at the same time is done by two different sessions, you can compare the timing of the operation, when each session was last created/updated, or simply kick back the operation to the sessions and inform them of the conflict and ask them to resolve. In any of these cases the application caught the conflict before it had to do a network communication hokey-pokey, and it can be programmed to make these decisions automatically, too. Letting Paxos decide might result in an immediate fix, but the users might not ever be informed of the consensus decision and one might be confused as to what the actual resulting operation was.
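The session-level resolution described above might look something like this sketch (the tuple shape and function name are assumptions for illustration; a real system would surface unresolved conflicts back to the users, as the comment suggests):

```python
def resolve(op_a, op_b):
    """Resolve two conflicting session operations at the application level.

    Each op is a (session_id, timestamp, value) tuple. If the values
    agree there is no real conflict; otherwise pick the later write
    (or, in a real system, kick the conflict back to both sessions).
    """
    if op_a[2] == op_b[2]:
        return op_a  # identical values: nothing to arbitrate
    return max(op_a, op_b, key=lambda op: op[1])
```

The trade-off the comment points at is visible here: the application has context (session ids, timestamps, the option to ask the user), which a block-level consensus round doesn't.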


Then again, a bad network design and lack of good maintenance will not work well for any network. Also, the probability of 2 of 5 nodes failing during a given failover window should be quite low for a production-quality network.

There seems to be a flaw in your second point.

When the network partitions in a synchronous network, a node in any partition is not able to tell whether a partition has actually occurred or the unreachable nodes are dead. The article points this out as one of the main problems Paxos tries to solve. Your given scenario actually puts all the nodes in a vulnerable state where they are likely to make wrong decisions and become inconsistent.


If you'd like to venture into Byzantine fault tolerant consensus, check out Tendermint.

Draft thesis from Ethan Buchman: https://github.com/ebuchman/thesis/blob/master/Buchman_Ethan...

https://github.com/tendermint/tendermint


Check out lecture 10 here. Great stuff; it will be an edX course this fall: https://www.youtube.com/playlist?list=PL700757A5D4B3F368


This is a lucid step-by-step explanation of Paxos, which additionally doesn't seem to require any prior knowledge of consensus algorithms; it provides the background as needed. Engaging prose as well, e.g.:

"Note that it is possible for several bids to win an auction round. In fact, every active bidder must get a bid chosen in order to decide which candidate won the round. The goal is not agreement on which bidder/bid wins the round (they all do at some point), rather which candidate wins."

from a guy who, based on his profile, is maybe 20 or 21, which is a little depressing.


I love this! Thanks




