CAP twelve years later: How the "rules" have changed (2012) (infoq.com)
86 points by thunderbong on Jan 21, 2024 | 57 comments


I don’t think it should have been called the “CAP” theorem. I think of it as “CAL” (consistency–availability–latency). It has always struck me that the P in CAP is different from the other two characteristics: C and A are continuous values, but P is binary (presumably with a duration).

The “P” in CAP means “tolerance of partition events”. But that’s not really what most engineering teams are thinking about when they talk about the CAP theorem. Teams that want tolerance to network partitions are usually talking about building parallel instances, for example cloud providers implementing multiple regions. If they need some level of consistency across those instances, then it really becomes a discussion about latency.

If you think about it, what you’re really talking about with network partition events is how stale or latent the data on each side will become, and whether or not it will become consistent again, particularly without human intervention.

When you talk about it in terms of CAL, now all the properties are continuous, and you can discuss the trade-offs to each property with respect to network partitions.


Professor Daniel Abadi makes a similar point in his paper [1], where he improves on CAP with PACELC: if there is a partition (P), how does the system trade off between availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off between latency (L) and consistency (C)? (From his blog here: [2].)

Kleppmann makes an interesting critique [3] that neither is particularly useful, because both imply that you must pick either linearizability or total availability, when in practice there have been plenty of successful databases that present neither characteristic, and thus are neither CP nor AP.

[1]: https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf

[2]: https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-a...

[3]: https://martin.kleppmann.com/2015/05/11/please-stop-calling-...


https://ar5iv.labs.arxiv.org/html/2301.08906v1 - "...We have extended the CAP theorem to relate quantitative measures of these two properties to quantitative measures of communication and computation latency (L), obtaining a relation called the CAL theorem that is linear in a max-plus algebra. This paper shows how to use the CAL theorem in various ways to help design real-time systems..."


Thank you for the link. I’m glad other people are thinking this way, and with more formalism.


CAP really should have been formulated as:

When too many nodes fail or are unreachable, do you sacrifice consistency, or availability?

And even then it's misleading, because you can still be consistent and partially sacrifice availability (allow local or mergable updates) or be available but partially consistent (allow updates, but have potential conflicts to resolve, maybe manually, when the partition is healed.)

You can even make different A <-> C tradeoffs per request/transaction.
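That per-request dial can be sketched in a few lines. This is a hypothetical illustration in the style of tunable-consistency stores (e.g. Cassandra's consistency levels); the names and numbers here are assumptions, not any particular database's API.

```python
def required_acks(level: str, n_replicas: int) -> int:
    """Map a per-request consistency level to the number of replica
    acknowledgements the coordinator must collect."""
    levels = {
        "ONE": 1,                       # favors availability and latency
        "QUORUM": n_replicas // 2 + 1,  # balanced
        "ALL": n_replicas,              # favors consistency
    }
    return levels[level]

def can_serve(level: str, n_replicas: int, reachable: int) -> bool:
    """During a partition, a request succeeds only if enough replicas
    are reachable to satisfy its chosen level."""
    return reachable >= required_acks(level, n_replicas)
```

With 5 replicas and only 2 reachable, a request tagged "ONE" still succeeds while a "QUORUM" request fails, which is exactly the per-transaction A vs. C choice described above.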

Distributed systems are complex, who knew?


> When too many nodes fail or are unreachable

Don’t forget the nasty bit about partitions: if I think nodes 1-5 are reachable and the rest are unreachable, I can’t assume that the rest of the nodes are down or idle — some other client may think that nodes 1-5 are unreachable but nodes 6-15 are reachable.


The nasty bit you speak of is split brain, and it's addressed by C. C says that if you read, you always get the last thing written. If the nodes can't communicate and someone does a write, then the other half can't return what was written, so consistency is broken.


Well a partition means that some nodes are unreachable, at least from the perspective of some other nodes. The tricky part is that reachability is not necessarily consistent across the entire network.


That's correct


What does Bitcoin sacrifice? It seems to be always available and always consistent.


It’s not consistent.

It takes time for all nodes to come to agreement and in that time I am able to request _technically_ out of date info from a node.

Consistency is often the CAP property that isn't prioritized, because if the network isn't consistent for a few seconds, the application probably still works, and that's the case for Bitcoin.

The network is “eventually consistent”


Partition


At the hardware level, one thing that has changed over the years is the exponential growth in network speeds and feeds – within a datacenter and across datacenters.

Layer-3 IP Clos networks are pretty much a standard within a datacenter, making multiple redundant paths available. Across datacenters, optical circuit switched networks with lots of redundant paths are very common.

Modern network fabrics also have end-to-end mechanisms for bandwidth engineering based on per message QoS/priority set by applications as well as very accurate time-keeping.

Observability mechanisms and operational practices have also tremendously improved. A single node becoming unreachable from all other nodes is quite possible. But a single node being reachable from some subset of other nodes in the cluster and not others is nearly impossible. The only way a network partition can happen is due to misconfiguration. Purely layer-3 IP Clos networks with BGP-advertised multiple ECMP routes are a lot simpler than the layer-2 networks of the past. With modern switch operating systems, modern devops practices, Infrastructure-as-Code, etc., misconfiguration is nearly impossible.

At a higher layer, disaggregated storage and compute architectures are standard for any distributed database system. In this model, storage fabric is quite simple and much more distributed and redundant. Today, it is a lot easier to guarantee a consistent and highly available transactional write to a quorum of simple storage nodes spread across multiple datacenters in a region. On top of that, building higher-level isolation guaranteeing database transactions in a tier of ephemeral compute nodes is relatively much simpler.
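The quorum write described above can be sketched minimally. This is a toy model under stated assumptions, not any vendor's implementation: a `None` entry stands in for an unreachable storage node, and the write quorum size `w` is an assumed parameter.

```python
def quorum_write(replicas, key, value, w):
    """Send the write to every reachable replica; report success once
    at least `w` of them have acknowledged it."""
    acks = 0
    for node in replicas:
        if node is None:      # unreachable replica: no acknowledgement
            continue
        node[key] = value     # replica applies the write
        acks += 1
    return acks >= w          # durable iff a quorum acknowledged
```

With three replicas, one of which is down, a `w=2` write still succeeds, and the failed node's copy can be rebuilt in the background later.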

This kind of system is extremely resilient to failures at all levels: node failures are detected and lost processing capacity recovered in seconds; failed transactions are retried in milliseconds; redundancy lost to failed storage nodes is rebuilt in seconds; network capacity lost to failed circuits or intermediate switches is routed around in microseconds.

So, in today's context, we build databases that provide flexible transaction isolation guarantees that are selected appropriately on a per-transaction basis by the application. The application also specifies the latency budget it has available per transaction. With all these improvements, thankfully, application developers don't have to worry about the CAP theorem like they had to in the heyday of NoSQL databases in 2012.


This may not hold across geographic boundaries, which is the real problem, like between Asia, Europe, and North America. It's not so much a complete-outage problem as it is about bandwidth dropping very low on occasion.

When you're talking about say, high frequency trading systems, this is a big deal.

And anyone who deploys a service with all their data in a single data center is a pretty small-time player, or they don't need the uptime.


Could you provide links to these things for those who are stuck in the past and want to learn more?


Well, there's no one comprehensive link I can suggest. But you can find plenty of links on arxiv.org for research publications on these topics. You can look at the ones from big companies like Google, Microsoft and Meta/Facebook. You can also search in Youtube for various conference talks on the same topics.


This is a variation on Google's argument that "our network is so resilient we don't need to worry about partitions, so we can offer C and A and P", but it is a weak one, more and more so in the face of real present-day experience.

Due to the quorum requirements, the probability of failure increases with larger quorum sizes, as the number of intersections (and therefore potential points of failure) increases with more complex quorum arrangements.
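As a back-of-envelope check, one can compute the availability of a simple majority quorum, assuming independent per-node failures with probability `p` (an assumption; real failures are often correlated, which is what gives the intersection argument its bite).

```python
from math import comb

def quorum_availability(n: int, p: float) -> float:
    """Probability that at least a majority (n // 2 + 1) of n nodes is
    up, when each node fails independently with probability p."""
    q = n // 2 + 1
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k)
               for k in range(q, n + 1))
```

With `p = 0.01`, a 3-node majority quorum is up roughly 99.97% of the time and a 5-node one even more often, so under the independence assumption larger clusters actually help; the failure-probability argument therefore leans on correlated failures and more complex, non-majority quorum intersection patterns.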


> Due to the quorum requirements, probability of failure increases with larger quorum sizes

Well, that's not true due to two key design points:

- The size of the read or write quorum can be chosen on a per-transaction basis depending on what level of availability and consistency you want. The design goal is to lower the likelihood that the minimum number of quorum members your transaction needs will be unavailable.

- Given that the transaction processing nodes are ephemeral and independent from the storage nodes, if a transaction processing node designated as a member of a quorum dies, it can be instantly replaced with a new one, and all in-flight operations will be retried by the smart clients.


While all those design strategies are indeed valid, and adaptive quorum sizes have the potential to manage certain trade-offs, the practical implementation of such systems presents real challenges due to the inherent complexities of distributed systems. Tackling these complexities with those strategies should be approached with a degree of humility.

Overheads and latency issues in node replacement, particularly under conditions of unexpected high load or network difficulties, compound these challenges. These issues often manifest in correlated failures, which are more common than isolated node failures.

In this landscape we are still in compromise and trade-offs territory. I would refer to these two papers as insightful demonstrations of these challenges:

"Consistency vs. Availability in Distributed Real-Time Systems" - https://ar5iv.labs.arxiv.org/html/2301.08906v1

"Consistency models in distributed systems: A survey on definitions, disciplines, challenges and applications" - https://ar5iv.labs.arxiv.org/html/1902.03305


What database systems are you aware of that take advantage of these modern network advances? Sounds like Spanner, but any others?


Thanks for the write-up. Was interesting to read!


Related:

CAP Twelve Years Later: How the “Rules” Have Changed (2012) - https://news.ycombinator.com/item?id=10179628 - Sept 2015 (4 comments)

CAP Twelve Years Later: How the "Rules" Have Changed - https://news.ycombinator.com/item?id=4043260 - May 2012 (3 comments)


I’m still trying to understand whether the lack of a PutIfAbsent API in S3 (in other words, the ability to put an item only if no other item is already present) is a consequence of the strong read-after-write consistency they offer. Is this trade-off a corollary of the CAP theorem?

That PutIfAbsent API would solve so many problems for metadata formats used in the Big Data field like Delta or Iceberg!


Possibly a side effect of S3's eventual consistency model. Until a couple of years ago there was no guarantee that an object you had written to S3 would be visible to another process looking for it.

About two years ago AWS made a big announcement that they could finally guarantee a "read your own writes" consistency model. But ONLY within a single account.

If you know that you are not racing against eventual consistency, you can use HeadObject API call to check whether a given key exists in the bucket or not.
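A minimal sketch of that existence check, assuming a boto3-style S3 client. The helper name is hypothetical; only `head_object` and the 404 error shape reflect real boto3 behavior.

```python
def s3_key_exists(client, bucket: str, key: str) -> bool:
    """Return True if `key` exists in `bucket`, via a HEAD request.

    `client` is assumed to be a boto3 S3 client. Note: under the old
    eventual consistency model this check could race with a concurrent
    writer, which is why it is no substitute for a real PutIfAbsent.
    """
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as err:
        # boto3 raises botocore.exceptions.ClientError carrying a 404
        # code when the object is absent; anything else is re-raised.
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise
```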


"Until couple of years ago there was no guarantee that an object you had written to S3 would be visible to another process looking for it."

Yeah, over the last 5 years, but mostly before that change, I saw some strange intermittent problems with Hive where certain files would be registered in the Hive metadata by one process but not be visible when the data actually went to get queried by other processes, which led to jobs we were responsible for bombing out with missing-file errors.

What was truly strange was when this situation would last for hours.


The rules haven't changed; it's just that it's always about the details.

Most services, short of financial ones, choose to sacrifice consistency for availability. An example being social networks: it doesn't matter if I get "inconsistent" social media feeds depending on where I am, because my European friends' posts haven't had time to propagate across the database shards.

OTOH, for financial systems, you best believe they are going to choose consistency 9 times out of 10, the exceptions being very small amounts. I don't want someone to exploit network/node downtime in order to steal millions of dollars.

But the reality is that network outages rarely last very long. Then it's just a matter of making sure the nodes themselves are reliable.


There are situations that social networks need to worry about, consistency wise. Imagine a newly battered spouse unfriends their partner before making a post explaining why they're getting a divorce. If these two events get mixed up the perpetrator can end up reading the post.

Also, ironically enough, most of the global financial system actually operates very inconsistently. It can take days for transactions to clear, and there's a bazillion quirks based on local law and bank policy. So in practice banks use double entry accounting with reconciliation. If something got messed up they issue a new compensating transaction.


Financial systems have been exchanging coerence for availability for milenia.

It's just recently that anybody started to care about instantaneus coerence, instead of eventual.


> coerence

Did you mean coherence? As in consistency?

I'd have autocorrected in reading except that you wrote it twice which implies intention. However a quick search didn't turn up a definition.


Yep, English is not my first language.


No problem, thanks for learning it.


Like I said, depends on the amount. But I remember the days when you had to wait sometimes several days for a check to clear before the funds were available in your account.

Of course you could always write bad checks - but payee was on the hook for the fraud, not the bank ...


I remember about 2017 or 2018 giving a talk about Ethereum and asking how many people in the audience of maybe 400 had heard of the CAP theorem.

One hand went up in the back. That's when I knew our culture was doomed.

CAP is (for the record) why the blockchain is and is not "just a database" -- it's a database which takes a different position on CAP and as a result can be used for different things to other databases.


Would love to learn more. Guess: Is the “different position” that consistency can be achieved when a fault occurs by having the longest chain, instead of a leader/voting mechanism?


"Eventual consistency"

1) there is a network

2) on each block, a leader is selected to make the next block on the basis of a lottery -- each lottery ticket is a unit of computing done ("proof of work")

3) if the network splits, each subnet has a different leader and a different sequence of blocks from that point onwards

4) when the network comes back together, a new "agreed history" is formed from the last shared block, selecting the block with the most "work" done

and resulting in the transactions which were done and confirmed in the smaller subnet being thrown out and history rewritten as if they had never existed
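The four steps above amount to a one-line fork-choice rule. A toy sketch (the field names are made up; `work` stands in for accumulated proof-of-work):

```python
def reconcile(chain_a, chain_b):
    """After a partition heals, keep the fork with the most total work;
    transactions confirmed only on the losing fork are discarded."""
    total_work = lambda chain: sum(block["work"] for block in chain)
    return chain_a if total_work(chain_a) >= total_work(chain_b) else chain_b

# Two subnets extend the same shared history independently during a partition:
shared = [{"work": 10, "txs": ["t1"]}]
big_subnet   = shared + [{"work": 12, "txs": ["t2"]}, {"work": 11, "txs": ["t3"]}]
small_subnet = shared + [{"work": 9,  "txs": ["t4"]}]

winner = reconcile(big_subnet, small_subnet)
# "t4" was confirmed only on the smaller fork and is now gone from history.
```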

This is the genius of Satoshi -- he figured out that something this trashy was good enough to get something done. Nobody in an academic setting would ever have considered that a worthwhile solution, I don't think.

It was madness. But it worked. It's the same kind of genius as Ethernet just rebroadcasting packets until something works rather than doing Token Ring type negotiation about who can talk next.

Worse really is better at times.


Have they changed again in the last 12 years apart from the higher importance of cloud solutions and privacy legislation?


It didn't change; CAP was flawed from the very beginning. You need a richer vocabulary [0] to classify those systems.

[0] https://jepsen.io/consistency


I always thought CAP was weird because you can have both consistency and availability if you sync faster than requests come in to the system.


Like a spherical cow, there are some assumptions we need to make in a 101 class. For example, I remember my database professor saying that I need to assume my entire database does not fit in memory (RAM) because if it does, all bets are off.


The biggest fallacy of distributed systems: the network is always up.

You'd think that would be blindingly obvious, but based on every system in existence ... It's not.

Primarily I think it is marketing/sales since thinking properly about distributed systems is hard and executives don't want to hear about it.


That's what health checks are for yo! ;)


If you assume you can "sync faster than requests come in to the system", you've necessarily chosen C not-P, i.e. you don't have a distributed system (and in practice you've probably chosen not-A too).


CAP is a false choice. In practice, people choose (mostly C)+(mostly A)+(mostly P).


"Mostly consistent" is not good enough for many applications.

The thing most people pick in practice is CA, single master with hopefully a replicated backup.


Absolutely. And that's the right choice IF you can make it. Switching to CP or AP is something you should do because you're forced to.

That's something lost on a lot of modern engineers, I think. Distributed Systems are something you do because you have to, not because they're the best approach inherently.


You've misunderstood the theorem. It doesn't say you can't achieve any combination of mostly C/A/P; it says you can't have perfect C, A, and P at the same time.


That’s because for many use cases, you don’t need to be perfect.

The theorem is about being perfectly C, A, and P


Those things are binary. You have them or not. There's no 'mostly'.

If you have more than two hosts, you have P.

If you have delay to sync, you have C.

If you send inaccurate data instead of waiting you have A.

There's no 'mostly'.


If you choose consistency, then the only negative result of a distributed system is consistency-based latency, i.e. the time to become consistent, which reduces your response time to the client.

so there's a linear relationship between reducing sync time and reducing response time until at some point in reducing sync time you reduce response time below a level you give a shit about...

and you have all 3 of consistency, availability, and partitioning

and then you have beaten the CAP theory


In this case you are assuming not to have P. The traditional idea of "pick 2 of 3" with respect to CAP is weirdly formulated because you either have a distributed system, in which case P is a given and you must trade C vs A, or you don't have a distributed system and can therefore have both C and A.


I think there needs to be another variable which is response time.

Because you can have a well partitioned system with good data redundancy, and you can take time to let the system become consistent which reduces availability.

But the actual practical outcome of changing availability is changing response time.

Therefore, consistency time is directly and linearly related to response time.

And if the response time is below some threshold that you no longer care about, then you have beaten the CAP theorem.

So in a practical sense, you can beat the CAP theorem. In an abstract sense you cant.

But response time grounds the theorem in reality, because in every system there's a point past which optimization no longer matters.


Related: A Critique of CAP Theorem by Martin Kleppman

https://arxiv.org/abs/1509.05393

TL;DR: terms in CAP are not adequately defined and he proposes a delay-sensitive framework.


This paper takes the idea further and formalizes it better: https://ar5iv.labs.arxiv.org/html/2301.08906v1


The most misunderstood theorem of all time.


Please be more specific.


(Not the OP)

The "C" and "A" in the CAP theorem have very specific definitions in the context of the theorem, which aren't the same as, say, the "C" in ACID. The CAP theorem in its original formulation makes no claim about latency, or eventual consistency, or any of these things. It is a very trivial logical conclusion.

"Consistency" just means when a node is queried, you either get the latest read, or you get an error.

"Availability" just means when a node is queried, you always get something. You never get an error.

---

Given this, we can show with a simple example that in the face of a network partition, it is impossible to get C and A simultaneously.

Suppose there's a Leader node and a Follower node with the two being kept in sync, but say there's a network partition between the two, so the Follower cannot reach the Leader.

Say I changed my name in the Leader from "old-name" to "new-name".

If I ask the Follower node what my name is, what should it return?

The follower can't know that the update has occurred, so you can see that it can only do one of two things:

- The follower could return an error (it has C, but not A)

- The follower could return "old-name" (it has A, but not C)
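The two outcomes can be simulated in a few lines. This is a toy model of the example above, with a hypothetical mode flag choosing which property the follower gives up.

```python
class Follower:
    def __init__(self, mode):
        self.mode = mode          # "C" or "A"
        self.value = "old-name"   # last value replicated from the leader
        self.partitioned = False

    def read(self):
        if self.partitioned and self.mode == "C":
            # Consistent: refuse to answer rather than risk staleness.
            raise TimeoutError("cannot confirm latest value")
        # Available: answer with whatever we have, possibly stale.
        return self.value

# The leader has already been updated to "new-name", but the follower
# never saw the replication message:
f = Follower(mode="A")
f.partitioned = True
f.read()  # returns "old-name" -- available but not consistent
```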


Thanks for the explanation, I appreciate it, even if I am familiar with CAP. The answer I was looking for was instead what is normally misunderstood about it... :-)



