Contributed"> No More FOMO: Efficiency in SLO-Driven Monitoring - The New Stack
Modal Title
Observability / Operations

No More FOMO: Efficiency in SLO-Driven Monitoring

When you have too much data, it can be difficult to know what to look at and how to prioritize your time. A set of service-level objectives can help prioritize what to monitor.
Jun 14th, 2023 10:00am by
Featued image for: No More FOMO: Efficiency in SLO-Driven Monitoring
Image by Gerd Altmann from Pixabay.

Observability is a concept that has been defined in various ways by different experts and practitioners. However, the core idea that underlies all these definitions is efficiency.

Efficiency means using the available resources in the best possible way to achieve the desired outcomes. In the current scenario, where every business is facing fierce competition and changing customer demands, efficiency is crucial for survival and growth. Resources include not only money, but also time, productivity, quality and strategy.

IT spending is often a reflection of the market conditions. When the market is booming, companies tend to spend more on IT projects and tools, without being too concerned about the value they are getting from them. This can create some problems, such as having too many tools that are not integrated or aligned with the business goals, wasting resources on unnecessary or redundant tasks, and losing visibility and control over the IT environment.

IT spend always correlates to market temperature.

Even the companies that spend heavily on cloud services are reconsidering their big decisions that involve significant, long-term investments. Companies are reassessing their existing substantial spend to ensure their investments can be aligned with revenues or future revenue potential.

Observability tools are also subject to the same review. It is essential that the total operating cost of observability tools can also be directly linked to revenue, customer satisfaction, growth in business innovation and operational efficiency.

Why Do We Need Monitoring?

  • If we had a system that would absolutely never fail, we wouldn’t need to monitor that system.
  • If we had a system for which we never have to worry about being performant, reliable or functional, we wouldn’t need to monitor that system.
  • If we had a system that self-corrects itself and auto-recovers from failures, we wouldn’t need to monitor that system.

None of the aforementioned points are true today, and it is obvious that we need to set up monitoring for our infrastructure and applications no matter what scale you operate.

What Is FOMO-Driven Monitoring?

When you are responsible for operating a critical production system, it is natural to want to collect as much monitoring data as possible. After all, the more data you have, the better equipped you will be to identify and troubleshoot problems. However, there are a number of challenges associated with collecting too much monitoring data.

Data Overload

One of the biggest challenges of collecting too much monitoring data is data overload. When you have too much data, it can be difficult to know what to look at and how to prioritize your time. This can lead to missed problems and delayed troubleshooting.

Storage Costs

Another challenge of collecting too much monitoring data is storage costs. Monitoring data can be very large, and storing it can be expensive. If you are not careful, you can quickly rack up a large bill for storage.

Reduced Visibility

When there is too much data, it can be difficult to see the big picture. This can make it difficult to identify trends and patterns that could indicate potential problems.

Increased Noise

More data also means more noise. This can make it difficult to identify important events and trends.

Security Concerns

Collecting too much monitoring data can also raise security concerns. If your monitoring data is not properly secured, it could be vulnerable to attack. This could lead to theft of sensitive data or disruption of your production systems.

FOMO-driven monitoring

Ultimately, an approach driven by the fear of missing out does not result in an optimal observability situation/setup and, in fact, can contribute to plenty of chaos, increased expenses, ambiguity between teams and overall increase in poor efficiency.

You can address this situation by being intentional in making decisions on all aspects of the observability pipeline including signal collection, dashboarding and alerting. Using service-level objectives SLOs is one of the strategies that offers plenty of benefits.

What Are SLOs?

An SLO is a target or goal for a specific service or system. A good SLO will define the level of performance your application needs, but not any higher than necessary.

SLOs help us set a target performance level for a system and measure the performance over a period of time.

Example SLO: An API’s p95 will not exceed 300ms response time

How Do You Set SLOs?

SLOs are generally set by customers. Yes, they are the ultimate authority. However, customers do not actually set SLOs as you can imagine. It is up to the business teams to tell the IT operations and development teams the expected performance and availability of a system.

For example, the business teams operating a marketing lead sign-up page can tell the IT teams that they want the page to load within 200ms at least 90% of the time. They would derive this conclusion by looking at the customer behavior already captured.

Now the IT teams can set the SLO for tracking by identifying SLIs(service-level indicators) in order to measure the SLOs over a period of time. SLIs are the specific metrics and query details of the metrics used to keep track of the SLO progression.

Here is what your observability life cycle looks like implementing an SLO-driven strategy.

SLO-driven strategy

There is an intentional loopback mechanism that is set in taking the SLO-driven strategy. Observability is never a settled problem. Organizations that do not continue reinventing their observability strategy fall behind very quickly, resulting in ambiguous tools, outdated processes and practices, which in the end increases overall operational cost while decreasing efficiency.

With this approach, you get the ability to scientifically measure your infrastructure and application performance over a period of time. Data collected as a result can be used to influence important decisions made on infrastructure spend which in turn helps improve further efficiency.

What Does This Tell Us?

Taking an SLO-first approach allows us to be intentional about the metrics to collect to meet commitments to business.

These are some of the benefits that organizations can achieve by following SLO-based observability strategy:

  • Results in improved signal vs. noise ratio
  • Reduces tool proliferation
  • Enriched monitoring data resulting in reduced MTTR/MTTI
  • Feedback loop provides continuous improvement opportunities
  • Connect monitoring costs in relation to business outcomes, hence able to justify spend to management

Use SLOs to drive your monitoring decisions:

  • Measure, revisit and review SLOs periodically based on outcomes
  • Improve observability posture through
    • Lower cost
    • Reduced issue resolution time
    • Increased team efficiency and innovation

Conclusion

We live in an era where efficiency is critical for organizational success. Observability costs can become uncontrollable if you do not have a proper strategy in place. SLO-driven observability strategy can help you set guardrails, track performance goals, business metrics and measure impact in a consistent manner while increasing operational efficiency and innovation.

Group Created with Sketch.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.