Can eBPF Agent in Kubernetes Be the Key to Better Observability?

Groundcover is boasting of better results using a new eBPF observability tool that it says outperformed a competitor by more than three times.

May 10th, 2023 5:00am by Loraine Lawson


Israeli startup Groundcover is using a new eBPF observability tool, called the Flora agent, that it says bests other application monitoring tools, including Datadog, OpenTelemetry and New Relic’s Pixie agent, when monitoring an application running on a Kubernetes node.

The Flora agent outperformed application performance monitoring (APM) competitor Datadog by more than three times, Groundcover stated, demonstrating minimal to zero overhead on the application’s CPU (+9%) and memory (+0%), while Datadog, OpenTelemetry and the Pixie agent had an overhead of 249%, 59% and 32% above the CPU baseline, respectively, and 227%, 27% and 9% above the memory baseline.
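
To make those percentages concrete, here is a minimal Python sketch of how overhead relative to a baseline can be computed and turned into a multiple of that baseline. The function names are illustrative and are not taken from Groundcover's benchmark; only the CPU percentages come from the figures reported above.

```python
def overhead_pct(baseline: float, measured: float) -> float:
    """Percentage by which `measured` exceeds `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (measured - baseline) / baseline * 100.0


def multiplier(overhead_percent: float) -> float:
    """Convert an overhead percentage into a multiple of the baseline."""
    return 1.0 + overhead_percent / 100.0


# CPU overhead figures as reported in the blog post (percent above baseline).
REPORTED_CPU_OVERHEAD = {
    "Flora": 9,
    "Pixie": 32,
    "OpenTelemetry": 59,
    "Datadog": 249,
}

for tool, pct in REPORTED_CPU_OVERHEAD.items():
    print(f"{tool}: application CPU at {multiplier(pct):.2f}x baseline")
```

Read this way, an application monitored by Datadog would run at roughly 3.49 times its baseline CPU, against about 1.09 times with Flora. That is one plausible reading of the "more than three times" claim, though the blog post may have derived it differently.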

“All other solutions but Flora raised the resource consumption of the application dramatically and in an unexpected manner, potentially causing the application to reach CPU throttling that might degrade its performance or even create an out-of-memory (OOM) crash in a limited environment,” CTO Yechezkel Rabinovich stated in a blog post. “Flora also proved to be highly efficient in the total resources it consumed, making it the most cost-effective solution at high scale.”

When combining the resources consumed by the different agents tested and the overhead measured on the monitored application, Flora consumed a total CPU that was similar to the one used by OpenTelemetry and the Pixie agent, but that was 73% less than the CPU consumed by Datadog, the blog post stated. “Additionally Flora consumed 74%, 77% and 96% less memory than Datadog, OpenTelemetry and the Pixie agent, respectively,” it added.

The Flora agent was released in April at KubeCon + CloudNativeCon Europe 2023. It leverages eBPF inside the kernel to access data about the application within Kubernetes.

Run Code Safely in the Kernel

“You can think about eBPF as a way to modify the kernel without the need of compiling a kernel module,” Rabinovich told The New Stack. “What that means in basic language is it’s something similar to what JavaScript did to browsers: you automatically can adapt your code to the different browsers, and the browsers now support running JavaScript, so you can run your code without the need to think about what kind of browser your customer has. So eBPF is kind of the same, in a way that you can write code and it will run safely in the kernel without the need to test every kernel version.”

In the past, it was hard, if not impossible, to get at some of the data that eBPF can reach. Developers had to instrument the application in order to get the data, Rabinovich explained. Often companies are still not getting 100% of the data for observability; some are struggling to achieve 10% to 15% of observability data, he added.

Observability is generally split into three types of data:

Logs
Metrics
Traces (which monitor the pathways for interactions, such as end-to-end transactions and what happens between services)
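
As a rough illustration of how these three signals differ in shape, the following Python sketch models each one as a small data class. The class names, field names and sample values are hypothetical and are not tied to any vendor's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogRecord:
    """A timestamped, free-form event emitted by a service."""
    timestamp: float
    service: str
    message: str


@dataclass
class MetricSample:
    """A numeric measurement identified by a name and a set of labels."""
    timestamp: float
    name: str
    value: float
    labels: dict = field(default_factory=dict)


@dataclass
class Span:
    """One step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start: float
    end: float


def trace_duration(spans: List[Span]) -> float:
    """End-to-end duration of a trace: latest end minus earliest start."""
    return max(s.end for s in spans) - min(s.start for s in spans)


# Hypothetical trace: a frontend request that calls a checkout service.
spans = [
    Span("t1", "a", None, "frontend", 0.00, 0.30),
    Span("t1", "b", "a", "checkout", 0.05, 0.25),
]
print(f"trace t1 took {trace_duration(spans):.2f}s")
```

The parent and child links between spans are what let a trace describe what happens between services, which logs and metrics on their own do not capture.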

Shahar Azulay, CEO of Groundcover, said the approach really makes a difference in large development shops, where it brings time to value close to zero.

“Traditional observability platforms require you to change your code,” he said. “Imagine what it does to time to value. We usually come across organizations with, say, 100 developers, so they’re already using different languages and a huge technology stack. To integrate OpenTelemetry, which is the recommendation of the community, or Datadog, you will have to go through each of these teams, each running through their own instructions and the fit of the specific stack, forwarding all that as a leader of the organization and pushing that to production. That takes weeks.”

With eBPF, one person, usually a DevOps engineer or site reliability engineer (SRE), can “just throw it into immediate installation on the cluster, and you’re covering everything,” Azulay said.

“Suddenly, you can align everyone to the same depth, because you’re observing stuff from the kernel level, not from the application level. And that’s a mind-blowing difference in how observability works and in which door an observability vendor uses to get into the organization: instead of the R&D team and the developers, they can go to the infrastructure,” Azulay said.

The test application was a basic HTTP server built in Golang (v1.19) that serves a configurable number of random JSON objects, performs a pre-configured amount of CPU-intensive work for each request it receives and returns its response in plaintext or gzip format, according to the blog post. The application was tested in several scenarios: as is (for the baseline), when instrumented according to the relevant Datadog and OpenTelemetry documentation, and when running on a Kubernetes node alongside New Relic’s Pixie agent and the Flora agent. Prometheus-based CPU and memory utilization metrics were generated for all test cases and were scraped and stored in a VictoriaMetrics database instance.
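
For readers who want a feel for this kind of workload, here is a minimal sketch of a comparable server. It is written in Python rather than Go and is not Groundcover's test application: the constants, the function names, the choice of hashing as the CPU-bound work and the use of the `Accept-Encoding` header to select gzip are all assumptions made for illustration.

```python
import gzip
import hashlib
import json
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

NUM_OBJECTS = 10   # hypothetical: how many random JSON objects to return
CPU_ROUNDS = 1000  # hypothetical: amount of CPU-bound work per request


def random_object() -> dict:
    """Build one small JSON-serializable object with random content."""
    return {
        "id": random.randint(0, 1_000_000),
        "name": "".join(random.choices(string.ascii_lowercase, k=8)),
    }


def burn_cpu(rounds: int) -> str:
    """Do CPU-bound work by hashing repeatedly (an arbitrary stand-in)."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def build_body(num_objects: int, cpu_rounds: int, use_gzip: bool) -> bytes:
    """Produce the response payload, optionally gzip-compressed."""
    burn_cpu(cpu_rounds)
    payload = json.dumps([random_object() for _ in range(num_objects)]).encode()
    return gzip.compress(payload) if use_gzip else payload


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Compress only when the client advertises gzip support.
        use_gzip = "gzip" in self.headers.get("Accept-Encoding", "")
        body = build_body(NUM_OBJECTS, CPU_ROUNDS, use_gzip)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        if use_gzip:
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the server and block until interrupted."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `serve()` starts the server. Raising `NUM_OBJECTS` or `CPU_ROUNDS` shifts the load toward serialization or computation, which is the sort of knob a benchmark like this needs to exercise an agent under different conditions.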

The infrastructure was a Kubernetes cluster with Node Taints that allowed Groundcover to isolate each deployment test case from the others. Every tested application flavor ran alongside the bare minimum components required for monitoring according to the relevant test case.

Groundcover used a k6 operator to generate the test load, with k6 test objects that executed from each of the separate node groups. Groundcover used a custom-built k6 image that also exposes Prometheus metrics, so it could collect client-side metrics as a sanity check. The results were analyzed in Grafana, through a Prometheus data source integration that queried the deployed VictoriaMetrics instance, according to Rabinovich’s blog post.
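
In Prometheus-style monitoring, CPU utilization is commonly derived from a cumulative CPU-seconds counter. The blog post does not say which metrics or queries Groundcover used, so the following Python sketch only illustrates the general idea with hypothetical samples. It is a simplified stand-in for what a `rate()` query does and ignores counter resets and extrapolation, which Prometheus handles.

```python
from typing import List, Tuple

# (unix timestamp in seconds, cumulative CPU seconds consumed so far)
Sample = Tuple[float, float]


def avg_cores_used(samples: List[Sample]) -> float:
    """Average CPU cores used between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrapes, 15 seconds apart, for a baseline and an instrumented run.
baseline = [(0.0, 100.0), (15.0, 103.0), (30.0, 106.0)]
instrumented = [(0.0, 200.0), (15.0, 204.5), (30.0, 209.0)]

base_cores = avg_cores_used(baseline)
inst_cores = avg_cores_used(instrumented)
print(f"baseline: {base_cores:.2f} cores, instrumented: {inst_cores:.2f} cores")
```

Comparing the two averages for the same load is what turns raw counters into the kind of above-baseline figures quoted in the benchmark.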

CNCF paid for travel and accommodations for The New Stack to attend the KubeCon + CloudNativeCon Europe 2023 conference.

Loraine Lawson is a veteran technology reporter who has covered technology issues from data integration to security for 25 years. Before joining The New Stack, she served as the editor of the banking technology site, Bank Automation News. She has...