Raft Native: The Foundation for Streaming Data’s Best Future
Consensus is fundamental to consistent, distributed systems. To guarantee system availability in the event of inevitable crashes, systems need a way to ensure every node in the cluster remains in agreement, so that work can seamlessly transition between nodes when failures occur.
Consensus protocols such as Paxos, Raft and Viewstamped Replication (VSR) help drive resiliency for distributed systems by providing the logic for processes such as leader election, atomic configuration changes, synchronization and more.
As with all design elements, the different approaches to distributed consensus offer different tradeoffs. Paxos is among the oldest consensus protocols and is used in many systems, including Google Spanner, Apache Cassandra, Amazon DynamoDB and Neo4j.
Paxos achieves consensus through a leaderless, majority-wins protocol. While Paxos is effective in driving correctness, it is notoriously difficult to understand, implement and reason about. This is partly because it obscures many of the challenges in reaching consensus (such as leader election and reconfiguration), making it difficult to decompose into subproblems.
Raft (for reliable, replicated, redundant and fault-tolerant) can be thought of as an evolution of Paxos, focused on understandability. Raft achieves the same correctness as Paxos but is easier to understand and implement in the real world, which often translates into stronger reliability in practice.
For example, Raft uses a stable form of leadership, which simplifies replication log management. And its leader election process, driven through an elegant “heartbeat” system, is more compatible with the Kafka-producer model of pushing data to the partition leader, making it a natural fit for streaming data systems like Redpanda. More on this later.
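The heartbeat mechanism mentioned above can be sketched in a few lines. This is a minimal illustration of the idea from the Raft paper, not Redpanda's implementation; the class name, constants and timing values here are assumptions chosen for readability.

```python
import random

# Typical values from the Raft paper; real systems tune these per deployment.
ELECTION_TIMEOUT_RANGE_MS = (150, 300)

class Follower:
    """Minimal sketch of a follower tracking leader heartbeats."""

    def __init__(self):
        self.reset_election_timer()

    def reset_election_timer(self):
        # Randomized timeouts prevent split votes: followers
        # rarely time out at the same moment.
        self.timeout_ms = random.randint(*ELECTION_TIMEOUT_RANGE_MS)
        self.elapsed_ms = 0

    def on_heartbeat(self):
        # A heartbeat (an empty AppendEntries RPC) from the
        # current leader resets the election timer.
        self.reset_election_timer()

    def tick(self, ms):
        # Advance the local clock; returns True when this node
        # should transition to candidate and start an election.
        self.elapsed_ms += ms
        return self.elapsed_ms >= self.timeout_ms
```

As long as heartbeats keep arriving before the timer fires, the leader remains stable; only when a follower stops hearing from the leader does it start an election.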
Because Raft decomposes the consensus problem into distinct logical components (for example, making leader election a separate step before replication), it is a flexible protocol to adapt for complex, modern distributed systems. Such systems must maintain correctness and performance while scaling to petabytes of throughput, while remaining simple enough for new engineers hacking on the codebase to understand.
For these reasons, Raft has been rapidly adopted for today’s distributed and cloud native systems like MongoDB, CockroachDB, TiDB and Redpanda to achieve greater performance and transactional efficiency.
How Redpanda Implements Raft Natively to Accelerate Streaming Data
When Redpanda founder Alex Gallego determined that the world needed a new streaming data platform — to support the kind of gigabytes-per-second workloads that bring Apache Kafka to a crawl without major hardware investments — he decided to rewrite Kafka from the ground up.
The requirements for what would become Redpanda were: 1) it needed to be simple and lightweight to reduce the complexity and inefficiency of running Kafka clusters reliably at scale; 2) it needed to maximize the performance of modern hardware to provide low latency for large workloads; and 3) it needed to guarantee data safety even for very large throughputs.
The initial design for Redpanda used chain replication: Data is produced to node A, then replicated from A to B, B to C and so on. This was helpful in supporting throughput, but fell short for latency and performance, due to the inefficiencies of chain reconfiguration in the event of node downtime (say B crashes: Do you fail the write? Does A try to write to C?). It was also unnecessarily complex, as it would require an additional process to supervise the nodes and push reconfigurations to a quorum system.
Ultimately, Alex decided on Raft as the foundation for Redpanda consensus and replication, due to its understandability and strong leadership. Raft satisfied all of Redpanda’s high-level design requirements:
- Simplicity. Every Redpanda partition is a Raft group, so the entire platform reasons around Raft, including both metadata management and partition replication. This contrasts with the complexity of Kafka, where data replication is handled by ISR (in-sync replicas) and metadata management is handled by ZooKeeper (or KRaft), so two separate systems must reason with one another.
- Performance. The Redpanda Raft implementation can tolerate disturbances to a minority of replicas, so long as the leader and a majority of replicas are stable. In cases when a minority of replicas have a delayed response, the leader does not have to wait for their responses to progress, mitigating impact on latency. Redpanda is therefore more fault-tolerant and can deliver predictable performance at scale.
- Reliability. When Redpanda ingests events, they are written to a topic partition and appended to a log file on disk. Every topic partition then forms a Raft consensus group, consisting of a leader plus a number of followers, as specified by the topic’s replication factor. A Redpanda Raft group can tolerate ƒ failures given 2ƒ+1 nodes; for example, in a cluster with five nodes and a topic with a replication factor of five, two nodes can fail and the topic will remain operational. Redpanda leverages the Raft joint consensus protocol to provide consistency even during reconfiguration.
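The 2ƒ+1 arithmetic above is simple enough to make concrete. The following is a sketch of the quorum math only, not Redpanda code; the function names are invented for illustration.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a quorum."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """A Raft group of 2f+1 nodes tolerates f failures."""
    return (replicas - 1) // 2

def write_committed(acks: int, replicas: int) -> bool:
    # A write commits once a majority (leader plus followers)
    # has acknowledged it; slow minority replicas can't stall it.
    return acks >= majority(replicas)
```

With a replication factor of five, `majority(5)` is 3 and `tolerated_failures(5)` is 2, matching the example above: two of five nodes can fail and the topic remains operational.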
Redpanda also extends core Raft functionality in some critical ways to achieve the scalability, reliability and speed required of a modern, cloud native solution. Redpanda's enhancements to Raft tend to focus on Day 2 operations, such as ensuring the system runs reliably at scale. These innovations include changes to the election process, heartbeat generation and, critically, support for Apache Kafka.
Redpanda’s optimistic implementation of Raft is what enables it to be significantly faster than Kafka while still guaranteeing data safety. In fact, Jepsen testing has verified that Redpanda is a safe system without known consistency problems and a solid Raft-based consensus layer.
But What about KRaft?
While Redpanda takes a Raft-native approach, the legacy streaming data platforms have been laggards in adopting modern approaches to consensus. Kafka itself is a replicated distributed log, but it has historically relied on yet another replicated distributed log — Apache ZooKeeper — for metadata management and controller election.
This has been problematic for a few reasons: 1) Managing multiple systems introduces administrative burden; 2) Scalability is limited due to inefficient metadata handling and double caching; 3) Clusters can become very bloated and resource intensive — in fact, it is not uncommon to see clusters with equal numbers of ZooKeeper and Kafka nodes.
These limitations have not gone unacknowledged by Apache Kafka’s committers and maintainers, who are in the process of replacing ZooKeeper with a self-managed metadata quorum: Kafka Raft (KRaft).
This event-based flavor of Raft achieves metadata consensus via an event log, called a metadata topic, that improves recovery time and stability. KRaft is a positive development for the upstream Apache Kafka project because it helps alleviate pains around partition scalability and generally reduces the administrative challenges of Kafka metadata management.
Unfortunately, KRaft does not solve the problem of having two different systems for consensus in a Kafka cluster. In the new KRaft paradigm, KRaft partitions handle metadata and cluster management, but replication is handled by the brokers using ISR, so you still have these two distinct platforms and the inefficiencies that arise from that inherent complexity.
The engineers behind KRaft are upfront about these limitations, although some exaggerated vendor pronouncements have created ambiguity around the issue by suggesting that KRaft is far more transformative than it is.
Combining Raft with Performance Engineering: A New Standard for Streaming Data
As data industry leaders like CockroachDB, MongoDB, Neo4j and TiDB have demonstrated, Raft-based systems deliver simpler, faster and more reliable distributed data environments. Raft is becoming the standard consensus protocol for today’s distributed data systems because it marries particularly well with performance engineering to further boost the throughput of data processing.
For example, Redpanda combines Raft with speedy architectural ingredients to perform at least 10 times faster than Kafka at tail latencies (p99.99) when processing a 1GBps workload, on one-third the hardware, without compromising data safety.
Traditionally, GBps+ workloads have been a burden for Apache Kafka, but Redpanda can support them with double-digit millisecond latencies, while retaining Jepsen-verified reliability. How is this achieved? Redpanda is written in C++, and uses a thread-per-core architecture to squeeze the most performance out of modern chips and network cards. These elements work together to elevate the value of Raft for a distributed streaming data platform.
One example from Redpanda's internals: Because Redpanda bypasses the page cache and Kafka's Java virtual machine (JVM) dependency, it can embed hardware-level knowledge into its Raft implementation.
Typically, every time you write in Raft you have to flush to guarantee the durability of writes on disk. In Redpanda’s approach to Raft, smaller intermittent flushes are dropped in favor of a larger flush at the end of a call. While this introduces some additional latency per call, it reduces overall system latency and increases overall throughput, because it is reducing the total number of flush operations.
While there are many effective ways to ensure consistency and safety in distributed systems (blockchains do it with proof-of-work and proof-of-stake protocols), Raft is a proven approach that is flexible enough to be enhanced and adapted to new challenges.
As we enter a new world of data-driven possibilities, driven in part by AI and machine learning use cases, the future is in the hands of developers who can harness real-time data streams. Raft-based systems, combined with performance-engineered elements like C++ and thread-per-core architecture, are driving the future of data streaming for mission-critical applications.