How Adobe Uses OpenTelemetry Collector
They went on to explain how they use it to handle the massive amounts of observability data the company collects: metrics at 330 million unique series a day, span data at 3.6 terabytes a day and log data at over 1 petabyte a day.
Featherstone, senior manager for software development, explained that not all of this data flows through his team or the OTel collector, but “it’s a pretty good chunk.”
Distributed tracing led his team to OpenTelemetry. Adobe is largely made up of acquisitions, he explained, and with every new company brought in, people have their own opinions of the best cloud, this tool, that text editor, etc.
“With distributed tracing specifically, that becomes a huge challenge,” he said. “Imagine trying to stitch a trace across clouds, vendors, open source. So eventually, that’s what led us to the collector. But we were trying to build a distributed tracing platform based on Jaeger agents.” That was in 2019.
The team started rolling out the OTel Collector in April 2020 to replace the Jaeger agents. Originally the collector ingested only traces, but in September 2021 the team brought in metrics, and it’s looking to bring in logs as well.
The team instruments applications using OpenTelemetry libraries, primarily auto-instrumentation and primarily Java. It does some application enrichment, bringing in Adobe-specific data and enriching its pipelines as data flows to the collector. It has some custom extensions and processors, and the team does configuration by GitOps where possible.
“The collector is very dynamic extending to multiple destinations with one set of data and this was huge for us. …Sometimes we send collector data to other collectors to further process. So it’s the Swiss Army knife of observability,” Featherstone said.
His team at Adobe is called developer productivity, with a charter to help developers write better code, faster.
For Java services in particular, it has a base container: “If you’re using a Java image, you should go use this … It has a number of quality-of-life features already rolled into it, including the OpenTelemetry Java instrumentation in the jar. [The configuration is] pulled from our docs, and this is exactly how we configure it for Java.
“So we set the Jaeger endpoint to the local DaemonSet collector. We set the metrics exporter to Prometheus, we set the propagators, we set some extra resource attributes, we set the trace exporter to Jaeger. And we set the trace sampler to parent-based always off,” he said, pointing out that this is all rolled into the Java image.
So with these configurations, any Java service that spins up in Kubernetes at Adobe is already participating in tracing. Everything set up this way passes through the collector.
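Rolled into a base image, the settings Featherstone lists map onto the standard OpenTelemetry Java agent environment variables. A sketch, expressed as a Kubernetes container `env` block (the endpoint addresses and resource-attribute values are placeholders, not Adobe’s actual configuration):

```yaml
# Sketch of the base-image defaults described above, as Kubernetes
# container env vars. Variable names are the standard OpenTelemetry
# Java agent settings; all values here are illustrative.
env:
  - name: OTEL_TRACES_EXPORTER
    value: "jaeger"
  - name: OTEL_EXPORTER_JAEGER_ENDPOINT
    value: "http://$(HOST_IP):14250"      # local DaemonSet collector
  - name: OTEL_METRICS_EXPORTER
    value: "prometheus"
  - name: OTEL_PROPAGATORS
    value: "tracecontext,baggage"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "k8s.cluster.name=example,region=example-region"
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_always_off"       # respect the parent's decision
```

With `parentbased_always_off`, a service only records spans when its caller has already started a trace, which is what lets every Java pod participate in tracing without flooding the backend by default.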
“So everyone’s participating in tracing just by spinning this up,” he said. “For metrics, we’ve tried to reduce the friction, but people would still need to somehow go get those metrics out of that exporter. We’ve made that pretty easy, but it’s not automatic.” He said about 75% of what they run is Java, but they’re trying the same concept with Node.js, Python and other images.
Managing the Data
They do a lot of enrichment, as well as ensuring no secrets are sent as part of the tracing or metrics data, said Surana, Adobe’s cloud operations site reliability engineer for observability.
It uses multiple processors, including the redaction processor as well as a custom processor in the OpenTelemetry Collector, to eliminate certain fields they don’t want sent to the backend, such as personally identifiable information or other sensitive data. Processors are also used to enrich the data: adding fields such as service identifiers, Kubernetes clusters and region helps improve search.
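In collector configuration terms, that combination of scrubbing and enrichment could look something like the following sketch, pairing the contrib redaction processor with an attributes processor (all keys, patterns and values here are illustrative, not Adobe’s actual rules):

```yaml
processors:
  # Scrubbing: drop attributes not on an allow list and mask values
  # matching sensitive patterns (example pattern: card-like numbers).
  redaction:
    allow_all_keys: false
    allowed_keys:
      - http.method
      - http.status_code
    blocked_values:
      - "[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}"
  # Enrichment: stamp every span/metric with deployment context so
  # engineers can search by cluster or region in the backend.
  attributes:
    actions:
      - key: k8s.cluster.name
        value: example-cluster
        action: insert
      - key: cloud.region
        value: example-region
        action: insert
```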
“Adobe is built out of active acquisitions, and we run multiple different products in different ecosystems. There is a high possibility of service names colliding under different products, or of similar microservice names, so we wanted to ensure that doesn’t happen,” he said.
It also uses an Adobe-specific service registry, where every service has a unique ID attached to the service name. This allows any engineer at Adobe to uniquely identify a service in the single tracing backend.
“It [also] allows the engineers to quickly search on things. Even if they don’t know the service, or who owns that service, they can go look into our service registry, find out the engineering contact for that particular product or team and get on a call to resolve their issue,” Surana said.
They also send data to multiple export destinations.
“This is probably the most common use case,” he said. “Before the introduction of the OpenTelemetry Collector, engineering teams at Adobe had been using different processes and different libraries, in different formats. And they were sending it to vendor products and open source projects, and it was very hard for us to get the engineering teams to change their backend, or to make any small change in the backend code or their application code, because engineers have their own product features and product requests which they are working on.
“With the introduction of the OpenTelemetry Collector, as well as the OTLP [OpenTelemetry Protocol] format, this became super easy for us; we are able to send their data to multiple vendors, multiple toolings with just a few changes on our side.”
Last year, they were able to send the tracing data to three different backends at the same time to test out one engineering-specific use case.
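Fanning one stream out to several backends is a matter of listing multiple exporters in a single pipeline. A minimal sketch of a three-destination trace pipeline, with hypothetical endpoints standing in for the actual backends:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  batch: {}
exporters:
  # Three hypothetical destinations receiving the same trace stream.
  otlp/vendor-a:
    endpoint: vendor-a.example.com:4317
  otlp/vendor-b:
    endpoint: vendor-b.example.com:4317
  otlp/oss-backend:
    endpoint: tracing.example.internal:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/vendor-a, otlp/vendor-b, otlp/oss-backend]
```

Because the fan-out lives in the collector’s config, swapping or adding a backend never touches application code.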
They’re now sending data to another set of OTel collectors at the edge where they can do transformations including inverse sampling, rule-based sampling and throughput-based sampling.
He said they’re always looking into other ways to get richer insights while sending less data to the backend.
“This entire configuration is managed by git. We make use of the OpenTelemetry Operator Helm charts primarily for our infrastructure use case. … It takes away the responsibility from the engineers to be subject matter experts … and makes the configuration super easy,” he said.
Auto instrumentation with OpenTelemetry Operator allows engineers to just pass in a couple of annotations to instrument their service automatically for all the different signals without writing a single line of code.
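With the OpenTelemetry Operator, that annotation-based injection can be as small as one line on a pod template. A sketch, assuming an `Instrumentation` resource has already been created in the cluster (the Deployment and image names are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-java-service        # hypothetical service
spec:
  selector:
    matchLabels: { app: example-java-service }
  template:
    metadata:
      labels: { app: example-java-service }
      annotations:
        # This one annotation asks the OpenTelemetry Operator to inject
        # the Java agent at pod startup: traces, metrics and logs with
        # no application code changes.
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: app
          image: example/app:1.0
```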
“This is huge for us,” he said. “This takes developer productivity to the next level.”
They also built out a custom extension on top of the OpenTelemetry Collector using the custom authenticator interface. They had two key requirements for this authentication system: to be able to use a single system to securely send data to the different backends and to be able to secure it for both open source and vendor tools.
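The collector wires authenticators in as extensions that exporters reference by name, which is what lets one mechanism secure both open source and vendor destinations. Adobe’s extension is custom, but the shape of the configuration can be sketched with the stock basicauth extension as a stand-in:

```yaml
extensions:
  # Stand-in for Adobe's custom authenticator: any extension that
  # implements the collector's client-authenticator interface is
  # referenced from an exporter's auth setting the same way.
  basicauth/client:
    client_auth:
      username: ${env:BACKEND_USER}
      password: ${env:BACKEND_PASS}
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  otlp/backend:
    endpoint: backend.example.com:4317   # hypothetical backend
    auth:
      authenticator: basicauth/client    # attach credentials per exporter
service:
  extensions: [basicauth/client]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/backend]
```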
The OpenTelemetry Collector comes with a rich set of processors for building data pipelines, including an attributes processor that allows you to add attributes on top of log data and metric data. It allows you to transform, enrich or modify the data in transit without the application engineers doing anything. Adobe also uses it to improve search capabilities in its backends.
The memory limiter processor helps ensure the collector never runs out of memory, checking the memory needed to keep things in state. They also use the span metrics processor and service graph processor to generate metrics out of traces and build metrics dashboards on the fly.
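In current collector releases, span-to-metrics and service-graph generation ship as connectors, which bridge a trace pipeline into a metrics pipeline. A sketch combining them with the memory limiter (the limits and endpoints are illustrative, not Adobe’s values):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  # Check memory on an interval and push back before the collector
  # itself runs out of memory.
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
  batch: {}
connectors:
  spanmetrics: {}    # derive request/duration metrics from spans
  servicegraph: {}   # derive service-to-service edge metrics
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics, servicegraph]
    metrics:
      # The connectors act as receivers for the metrics pipeline.
      receivers: [spanmetrics, servicegraph]
      exporters: [prometheus]
```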
So What’s Next?
Two things, according to Featherstone: improving data quality, namely getting rid of data no one is going to look at, and rate limiting spans at the edge.
The collector provides the ability at the edge to create rules and drop some data.
“For metrics, imagine that we had the ability to aggregate right in the collector itself. You know, maybe we don’t need quite 15-second granularity, let’s dumb that down to five minutes, and then send that off,” Featherstone said.
“Another one might be sending some metrics to be stored for long term and sending some on to be further processed in some operational data lake or something like that. We have the ability to just pivot right in the collector and do all kinds of things.”
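One way that in-collector downsampling could be sketched is with the contrib interval processor, which aggregates matching metrics in memory and re-emits them once per interval. This is an assumption about how such a pipeline might be built, not Featherstone’s description of an existing setup:

```yaml
processors:
  # Assumption: the contrib interval processor. Instead of forwarding
  # every 15-second sample, it holds aggregated state and emits once
  # per interval (five minutes here, matching the example in the text).
  interval:
    interval: 5m
exporters:
  otlp/longterm:
    endpoint: metrics-store.example.com:4317   # hypothetical store
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [interval]
      exporters: [otlp/longterm]
```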
The second thing is rate-limiting spans at the edge.
“One of our edges is taking like 60 billion hits per day, and we’re trying to do tracing on that. That becomes a lot of data when you’re talking about piping it all the way down to somewhere to be stored. So we’re trying to figure out the right places to implement rate limiting, in which collectors and at what levels … just to prevent unknown bursts of traffic, that kind of thing,” he said.
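Rate limiting of spans can be expressed in the collector with the contrib tail sampling processor, which supports a `rate_limiting` policy type. An illustrative sketch (the thresholds are made up, and this is one possible approach rather than Adobe’s actual edge configuration):

```yaml
processors:
  tail_sampling:
    # Wait for late spans before making a per-trace decision.
    decision_wait: 10s
    policies:
      # Cap overall span throughput at the edge (value is illustrative).
      - name: cap-throughput
        type: rate_limiting
        rate_limiting:
          spans_per_second: 1000
      # Always keep error traces, regardless of the cap.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
```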
They’re also trying to pivot more to trace-first troubleshooting.
“We have so many east/west services that trying to do it through logs, trying to pull up the right log index for whatever team (and do I even have access to it?), is so slow and so hard to do that we’re trying to really shift the way that people are troubleshooting within Adobe to something like this, where we’ve made a lot of effort to make these traces pretty complete,” he said.
They’re also looking into how people go about troubleshooting and whether the tools they have provide the best way to do that.
They’re looking forward to integrating the OpenTelemetry logging libraries with core application libraries and running OTel collectors as sidecars to send metrics, traces and logs. They’re also exploring the new connector component and building a trace sampling extension at the edge to improve data quality.
Wrapping up, he lauded the collector’s plugin-based architecture and the ability to send data to different destinations with a single binary. Its rich set of extensions and processors gives a lot of flexibility with your data, he said.
“OpenTelemetry in general feels a lot to me, like the early days of Kubernetes where everybody was just kind of buzzing about it, and it started like we’re on the hockey stick path right now,” he said. “The community is awesome. The project is awesome. If you haven’t messed with the collector yet, you should definitely go check it out.”