Contributed"> eBPF or Not, Sidecars are the Future of the Service Mesh - The New Stack
Modal Title
Microservices / Networking / Service Mesh

eBPF or Not, Sidecars are the Future of the Service Mesh

eBPF and sidecars are not an either-or choice, and the assertion that eBPF needs to replace sidecars is a marketing construct, not an actual requirement.
Aug 12th, 2022 10:00am by William Morgan
Feature image via Shutterstock.

William Morgan
William is the co-founder and CEO of Buoyant, the creator of the open source service mesh project Linkerd. Prior to Buoyant, he was an infrastructure engineer at Twitter, where he helped move Twitter from a failing monolithic Ruby on Rails app to a highly distributed, fault-tolerant microservice architecture. He was a software engineer at Powerset, Microsoft, and Adap.tv and a research scientist at MITRE, and he holds an MS in computer science from Stanford University.

eBPF is a hot topic in the Kubernetes world, and the idea of using it to build a “sidecar-free service mesh” has generated recent buzz. Proponents of this idea claim that eBPF lets them reduce service mesh complexity by removing sidecars. What’s left unsaid is that this model simply replaces sidecar proxies with multitenant per-host proxies — a significant step backward for both security and operability that increases, not decreases, complexity.

The sidecar model represents a tremendous advancement for the industry. Sidecars allow the dynamic injection of functionality into the application at runtime, while — critically — retaining all the isolation guarantees achieved by containers. Moving from sidecars back to multitenant, shared proxies loses this critical isolation and results in significant regressions in security and operability.

In fact, the service mesh market has seen this firsthand: the first service mesh, Linkerd 1.0, offered a “sidecar-free” design circa 2017 using the same per-host proxy model, and the resulting challenges in operations, management, and security led directly to the sidecar-based design of Linkerd 2.0.

eBPF and sidecars are not an either-or choice, and the assertion that eBPF needs to replace sidecars is a marketing construct, not an actual requirement. eBPF has a future in the service mesh world, but it will be as eBPF and sidecars, not eBPF or sidecars.

eBPF in a Nutshell

To understand why, we first need to understand eBPF. eBPF is a powerful Linux kernel feature that allows applications to dynamically load and execute code directly within the kernel. This can provide a substantial performance boost: rather than continually moving data between kernel and application space for processing, we can do the processing within the kernel itself. This boost in performance means that eBPF opens up an entire class of applications that were previously infeasible, especially in areas like network observability.

But eBPF is not a magic bullet. eBPF programs are very limited, and for good reason: running code in the kernel is dangerous. To prevent bad actors, the kernel must impose significant constraints on eBPF code, not the least of which is the “verifier.” Before an eBPF program is allowed to execute, the verifier performs a series of rigorous static analysis checks on the program.

Automatic verification of arbitrary code is hard, and the consequences of errors are asymmetric: rejecting a perfectly safe program may be an annoyance to developers, but allowing an unsafe program to run would be a major kernel security vulnerability. Because of this, eBPF programs are highly constrained. They can’t block, or have unbounded loops, or even exceed a predefined size. The verifier must evaluate all possible execution paths, which means the overall complexity of an eBPF program is limited.

Thus, eBPF is suitable for only certain types of work. For example, functions that require limited state, e.g., “count the number of network packets that match an IP address and port,” are relatively straightforward to implement in eBPF. Programs that require accumulating state in non-trivial ways, e.g., “parse this HTTP/2 stream and do a regular expression match against a user-supplied configuration,” or even “negotiate this TLS handshake,” are either outright impossible to implement in eBPF or require Rube Goldberg levels of contortion.
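
As an illustration, here is a minimal sketch (assuming a libbpf-style toolchain; the port number is purely hypothetical) of the kind of stateless task that fits comfortably within those constraints: an XDP program that counts TCP packets destined for a given port. Note how every packet access is preceded by an explicit bounds check so the verifier can prove the program safe.

```c
// count_port.bpf.c -- a minimal sketch, not production code.
// Compile with: clang -O2 -g -target bpf -c count_port.bpf.c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// A single-slot array map holding the running packet count.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_tcp_8080(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // The verifier rejects the program unless every access is bounds-checked.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    // Count packets destined for port 8080 (a hypothetical example port).
    if (tcp->dest == bpf_htons(8080)) {
        __u32 key = 0;
        __u64 *count = bpf_map_lookup_elem(&pkt_count, &key);
        if (count)
            __sync_fetch_and_add(count, 1);
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```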

eBPF and the Service Mesh

Let’s turn now to service meshes. Can we replace our sidecars with eBPF?

As we might expect, given the limitations of eBPF, the answer is no — what the service mesh does is well beyond what pure eBPF is capable of. Service meshes handle all the complexities of modern cloud native networking. Linkerd, for example, initiates and terminates mutual TLS; retries requests in the event of transient failures; transparently upgrades connections from HTTP/1.x to HTTP/2; enforces authorization policies based on cryptographic workload identity; and much more.

Like most service meshes, Linkerd does this by inserting a proxy into each application pod — the proverbial sidecar. In Linkerd’s case, this is the ultralight Linkerd2-proxy “micro proxy,” written in Rust and designed to consume the least amount of system resources possible. This proxy intercepts and augments all TCP communication to and from the pod and is ultimately responsible for implementing the service mesh’s full feature set.

Some of the functionality in this proxy can be accomplished with eBPF. For example, sometimes the sidecar’s job is simply to proxy a TCP connection to a destination, without any L7 analysis or logic. This work could be offloaded to the kernel using eBPF. But the majority of what the sidecar does requires significant state and is impossible, or at best infeasible, to implement in eBPF.
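
To give a sense of what that offload can look like, here is an illustrative sketch (not Linkerd’s implementation) using a sockmap: an sk_msg program that forwards payload bytes between two already-established TCP sockets entirely in the kernel. The user-space loader that accepts the connections, populates the map, and attaches the program is omitted, and the port and slot convention are hypothetical.

```c
// splice.bpf.c -- an illustrative sketch of kernel-level TCP splicing.
// Assumes a user-space loader has stored both sockets in sock_map:
// slot 0 for the inbound connection, slot 1 for the outbound one.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 2);
    __type(key, __u32);
    __type(value, __u64);
} sock_map SEC(".maps");

SEC("sk_msg")
int splice_msg(struct sk_msg_md *msg)
{
    // Redirect each message to the opposite slot, so bytes flow between
    // the two sockets without ever being copied up into user space.
    // Port 4143 is a hypothetical marker for the inbound socket.
    __u32 peer = msg->local_port == 4143 ? 1 : 0;
    return bpf_msg_redirect_map(msg, &sock_map, peer, BPF_F_INGRESS);
}

char _license[] SEC("license") = "GPL";
```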

Thus, even with eBPF, the service mesh still needs user-space proxies.

The Case for the Sidecar

If we’re designing a service mesh, where we place the proxies is up to us. Architecturally, we could place them at the sidecar level, at the host level, at the cluster level, or even elsewhere. But from the operational and security perspective, there’s really only one answer: compared to any of the alternatives, sidecars provide substantial and concrete benefits to security, maintainability, and operability.

A sidecar proxy handles all the traffic to a single application instance. In effect, it acts as part of the application. This results in some significant advantages:

  • Sidecar proxy resource consumption scales with the traffic load to the application, so Kubernetes resource limits and requests are directly applicable.
  • The “blast radius” of a sidecar failure is limited to the pod, so Kubernetes’s pod lifecycle controls are directly applicable.
  • Upgrading sidecar proxies is handled the same way as upgrading application code, e.g., via rolling deployments.
  • The security boundary of a sidecar proxy is clearly delineated and tightly scoped: the sidecar proxy contains only the secret material pertaining to that pod, and acts as the enforcement point for the pod. This granular enforcement is central to zero trust approaches to network security.

By contrast, per-host proxies (and other forms of multitenancy, e.g., cluster-wide proxies) handle traffic to whichever arbitrary set of pods Kubernetes has scheduled on the host. This means all of the operational and security advantages of sidecars are lost:

  • Per-host proxy resource consumption is unpredictable. It is a function of Kubernetes’s dynamic scheduling decisions, meaning that resource limits and requests are no longer useful: you cannot tell ahead of time how many resources the proxy will require.
  • Per-host proxies must ensure fairness and QoS, or applications risk starvation. This is a non-trivial requirement, and no popular proxy is designed to handle this form of “contended multitenancy.”
  • The blast radius for per-host proxies is large and continuously changing. A failure in a per-host proxy will affect whatever arbitrary set of pods from arbitrary applications was scheduled on the host. Similarly, upgrading a per-host proxy will impact arbitrary applications to arbitrary degrees depending on which pods were scheduled on the machine.
  • The security story is… messy. A per-host proxy must contain the key material for all pods scheduled on that host and must perform enforcement on behalf of all applications scheduled on that host. This turns the proxy into a new attack vector vulnerable to the confused deputy problem, and any CVE or flaw in the proxy now has a dramatically larger security impact.

In short, sidecar proxies build on top of the isolation guarantees gained through containers, allowing Kubernetes and the kernel to enforce security and fairness. Per-host proxies step outside of those guarantees entirely, introducing significant complexities to operations, security, and maintenance.

So Where Do We Go from Here?

eBPF is a big advancement for networking, and it can optimize some of the service mesh’s work by moving it to the kernel. But even with eBPF, the service mesh will still require user-space proxies. Given that, the right approach is to combine eBPF and sidecars, not to avoid sidecars.

Proposing a sidecar-free service mesh with eBPF is putting the marketing cart before the engineering horse. Of course, “incrementally improving sidecars with eBPF” doesn’t have quite the same buzz factor as “goodbye sidecars,” but from the user perspective, it’s the right decision.

The sidecar model is a tremendous advancement for the industry. It is not without challenges, but it is the best approach we have, by far, to handle the full scope of cloud native networking while keeping the isolation guarantees achieved by adopting containers in the first place. eBPF can augment this model, but it cannot replace it.
