Saga Without the Headaches - The New Stack

The Saga Pattern is a great tool for durable microservices execution, but it can make maintenance difficult. Here's a recipe for making it work for your systems.
Mar 24th, 2023 10:00am

Part 1: The Problem with Sagas

We’ve all been at that point in a project when we realize that our software processes are more complex than we thought. Handling this process complexity has traditionally been painful, but it doesn’t have to be.

A landmark software development playbook called the Saga design pattern has helped us to cope with process complexity for over 30 years. It has served thousands of companies as they build more complex software to serve more demanding business processes.

This pattern’s downside is its higher cost and complexity.

In this post, we’ll first pick apart the traditional way of coding the Saga pattern to handle transaction complexity and look at why it isn’t working. Then, I’ll explain in more depth what happens to development teams that don’t keep an eye on this plumbing code issue. Finally, I’ll show you how to avoid the project rot that ensues.

Meeting the need for durable execution

The Saga pattern emerged to cope with a pressing need in complex software processes: durable execution. When the transactions you’re writing make a single, simple database call and get a quick response, you don’t need to accommodate anything outside that transaction in your code. However, things get more difficult when transactions rely on more than one database — or indeed on other transaction executions — to get things done.

For example, an application that books a car ride might need to check that the customer’s account is in good standing, then check their location, then examine which cars are in that area. Then it would need to book the ride, notify both the driver and the customer, then take the customer’s payment when the ride is done, writing everything to a central store that updates the driver and customer’s account histories.

Processes like these, which chain dependent transactions, need to keep track of data and state throughout the entire sequence of events. They must be able to survive problems that arise in the transaction flow. If a transaction takes more time than expected to return a result (perhaps a mobile connection falters for a moment or a database hits peak load and takes longer to respond), the software must adapt.

It must wait for the necessary transaction to complete, retrying until it succeeds and coordinating other transactions in the execution queue. If a transaction crashes before completion, the process must be able to roll back to a consistent state to preserve the integrity of the overall application.
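The retry-and-roll-back behavior described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the step names are hypothetical, and a real system would also persist its progress durably.

```python
import time

def run_with_retries(step, retries=3, delay=0.01):
    """Call a step until it succeeds or the retry budget is exhausted."""
    for attempt in range(retries):
        try:
            return step()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # back off briefly, then try again

def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps."""
    completed = []
    try:
        for action, compensate in steps:
            run_with_retries(action)
            completed.append(compensate)
    except Exception:
        # Roll back in reverse order to restore a consistent state.
        for compensate in reversed(completed):
            compensate()
        raise
```

For instance, if taking payment fails after the ride was booked, the booking’s compensation (cancelling the ride) runs before the error propagates, so the system never ends in a half-finished state.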

This is difficult enough in a use case that requires a response in seconds. Some applications might execute over hours or days, depending on the nature of the transactions and the process they support. The challenge for developers is maintaining the state of the process across the period of execution.

This kind of reliability — a transaction that cannot fail or time out — is known as a strong execution guarantee. It is the opposite of a volatile execution, which can cease to exist at any time without completing everything that it was supposed to do. Volatile executions can leave the system in an inconsistent state.

What seemed simple at the outset turns into a saga with our software as the central character. Developers must usher it through multiple steps on its journey to completion, ensuring that its state is preserved if something goes wrong.

Understanding the Saga pattern

The Saga pattern provides a road map for that journey. First discussed in a 1987 paper by Hector Garcia-Molina and Kenneth Salem, this pattern brings durable execution to complex processes by enabling them to communicate with each other. A central controller manages that service communication and transaction state.

The pattern offers developers the three things they need for durable execution: it strings together transactions to support long-running processes, it guarantees execution by retrying in the event of failure, and it preserves consistency by ensuring that a process either completes entirely or doesn’t complete at all.

However, there’s a heavy price to pay for using the Saga pattern. While there’s nothing wrong with the concept in principle, everything depends on the implementation. Developers have traditionally had to code the pattern themselves as part of their application. That makes its design, deployment and maintenance so difficult that the application can become a slave to the pattern, which ends up taking most of the developers’ time.


Coding the Saga pattern manually involves breaking up a coherent process into chunks and then wrapping them with code that manages their operation, including retrying them if they fail. The developer must also manage the scheduling and coordination of these tasks across different processes that depend on each other. They must juggle databases, queues and timers to manage this inter-process communication.
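A sketch of what that wrapping looks like in practice may help. This is hypothetical plumbing code, with a JSON file standing in for the durable store: each step’s completion is checkpointed so a crashed process can resume where it left off, and every new step means more of this bookkeeping.

```python
import json
import os

def run_process(steps, state_path):
    """Run named steps in order, checkpointing after each one so a
    crashed process can resume without repeating completed work."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))  # recover progress from the last run
    for name, step in steps:
        if name in done:
            continue  # finished before a crash; skip on resume
        step()
        done.add(name)
        with open(state_path, "w") as f:
            json.dump(sorted(done), f)  # persist progress durably
```

Even this toy version must decide where state lives, how to serialize it and what happens on partial writes — and it still says nothing about queues, timers or coordination across processes.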

Increasing the volume of software processes and dependencies requires more developer hours to create and maintain the plumbing infrastructure, which in turn drives up application cost. This increasing complexity also makes it more difficult for developers to prove the reliability and security of their code, which carries implications for operations and compliance.

Abstraction is the key

Abstraction is the key to retaining the Saga pattern’s durable execution benefits while discarding its negative baggage. Instead of leaving developers to code the pattern into their applications, we must hide the transaction sequencing from them by abstracting it to another level.

Abstraction is a well-understood technique in computing. It gives each application the illusion that it owns the underlying resource, sparing the developer from having to manage that resource directly. Virtualization systems do this with the help of a hypervisor. The TCP stack does it by retrying network connections automatically so that developers don’t have to write their own handshaking code. Relational databases do it when they roll back failed transactions invisibly to keep data consistent.

Running a separate platform to manage durable execution brings these benefits to transaction sequencing by creating what Temporal calls a workflow. Developers still have control over workflows, but they need not concern themselves with the underlying mechanics.
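As a toy illustration of the principle (this is not Temporal’s actual API, just a sketch of the idea), the platform can record each completed step’s result, so when a workflow is retried it replays past work that already succeeded instead of redoing it:

```python
class DurableRuntime:
    """Toy stand-in for a durable-execution platform: it remembers
    each step's result so retries replay past completed work."""

    def __init__(self):
        self.history = {}  # step name -> recorded result

    def step(self, name, fn):
        if name not in self.history:
            self.history[name] = fn()  # first attempt: really execute
        return self.history[name]      # later attempts: replay the result

    def run(self, workflow, retries=3):
        for attempt in range(retries):
            try:
                return workflow(self)
            except Exception:
                if attempt == retries - 1:
                    raise
```

The developer writes the workflow as ordinary sequential code calling `step`; the retries, state capture and replay live in the runtime rather than in the application.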

Abstracting durable execution to workflows brings several benefits aside from ease of implementation. A tried-and-tested workflow management layer makes complex transaction sequences less likely to fail than home-baked ad-hoc plumbing code. Eliminating thousands of lines of custom code for each project also makes the code that remains easier to maintain and reduces technical debt.

Developers see these benefits most clearly when debugging. Root cause analysis and remediation get exponentially harder when you’re having to mock and manage plumbing code, too. Workflows hide an entire layer of potential problems.

Productive developers are happy developers

Workflow-based durable execution boosts the developer experience. Instead of disappearing down the transaction-management rabbit hole, developers get to work on what’s really important to them. This improves morale and is likely to help retain them. With the number of open positions for software engineers in the US expected to grow by 25% between 2021 and 2031, competition for talent is intense. Companies can’t afford much attrition.

Companies have been moving in the right direction in their use of the Saga pattern to handle context switching in software processes. However, they can go further by abstracting these Saga patterns away from the application layer to a separate service. Doing this well could move software maturity forward years in an organization.

Part 2: Avoiding the Tipping Point

In the first half of this post, I talked about how burdensome it is to coordinate transactions and preserve the state at the application layer. Now, we’ll talk about how that sends software projects off-course and what you can do about it.

Any software engineering project of reasonable size runs into the need for durable execution.

Ideally, the cost and time involved in creating new software features would be consistent and calculable. Coding for durability shatters that consistency. It makes the effort involved with development look more like a hockey-stick curve than a linear slope.

The tipping point is where the time and effort spent on coding new features begins its upward spike. It’s when the true extent of managing long-term transactions becomes clear. I’ll describe what it is, why it happens and why hurriedly writing plumbing code isn’t the right way to handle it.

What triggers the tipping point

Life before the tipping point is generally good because the developer experience is linear. The application framework that the developers are using supports each new feature they add, with no nasty surprises. That enables the development team to scale up the application with predictable implementation times for new features.

This linear scale works as long as developers make quantitative changes, adding more of the same thing. Things often break when someone has to make a change that isn’t like the rest and discovers a shortcoming in the application framework. This is usually a qualitative change that demands a change in the way the application works.

This change might involve calls to multiple databases, or reliance on multiple dependent transactions for the first time. It might call on a software process that takes an unpredictable amount of time to deliver a result.
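To see why such a change is qualitative rather than quantitative, consider a minimal sketch (hypothetical names, with dicts standing in for two separate databases): if the process dies between the two writes, neither store alone can tell that the operation is half-done.

```python
def transfer(amount, ledger_a, ledger_b, crash_between=False):
    """Move funds across two independent stores with no coordination."""
    ledger_a["balance"] -= amount  # write to the first database
    if crash_between:
        # Simulate the process dying between the two writes.
        raise RuntimeError("process died mid-flight")
    ledger_b["balance"] += amount  # write to the second database
```

After a mid-flight crash, the first ledger shows the money gone while the second never received it: exactly the inconsistent state that plumbing code exists to detect and repair.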

The change might not be enough to force the tipping point at first, but life for developers will begin to change. They might write the plumbing code to manage the inter-process communication in a bid to guarantee execution and keep transactions consistent. But this is just the beginning. That code took time to write, and now, developers must expand it to cope with every new qualitative change that they introduce.

They’ll keep doing that for a while, but the rot gets worse. Eventually, developers are spending more time maintaining the plumbing code as they add more transactions. What was a linear development workload now becomes exponential. The time spent on development increases disproportionately with every new change.

The “Meeting of Doom”

Some people are unaware of the tipping point until it happens. Junior developers without the benefit of experience often wander into it unaware. Senior developers are often in the worst position of all; they know the tipping point is coming, but politics often renders them powerless to do anything other than wait and pick up the pieces.

Eventually, someone introduces a change that surfaces the problem. It is the straw that breaks the camel’s back. Perhaps a change breaks the software delivery schedule and someone with influence complains. Then, someone calls the “Meeting of Doom.”

This meeting is where the team admits that their current approach is unsustainable. The application has become so complex that these ad hoc plumbing changes are no longer supporting project schedules or budgets.

This realization takes developers through the five stages of grief:

  • Denial. This will have been happening for a while. People try to ignore the problem, arguing that it’ll be fine to continue as they are. This gives way to…
  • Anger. Someone in the meeting explains that this will not be fine. Their budgets are broken; their schedules are shot; and the problem needs fixing. They won’t take no for an answer. So people try…
  • Bargaining. People think of creative ways to prop things up for longer with more ad hoc changes. But eventually, they realize that this isn’t scalable, leading to…
  • Depression. Finally, developers realize that they’ll have to make more fundamental architectural changes. Their ad hoc plumbing code has taken on a life of its own and the tail is now wagging the dog. This goes hand in hand with…
  • Acceptance. Everyone leaves the meeting with a sense of doom and knows that nothing is going to be good after this. It’s time to cancel a few weekends and get to work.

That sense of doom is justified. As I explained, plumbing code is difficult to write and maintain, and from the tipping point onward it only gets harder. Suddenly, the linear programming experience developers are used to evaporates. They’re spending more time writing transaction-management code than they are working through software features on the Kanban board. That leads to developer burnout, and ultimately, attrition.

Preventing the tipping point

How can we avoid this tipping point, smoothing out the hockey-stick curve and preserving a linear ratio between software features and development times? The first suggestion is usually to accept defeat this time around and pledge to write the plumbing code from the beginning next time or reuse what you’ve already cobbled together.

That won’t work. It leaves us with the same problem, which is that the plumbing code will ultimately become unmanageable. Rather than a tipping point, the development would simply lose linearity earlier. You’d create a more gradual decline into development dysphoria beginning from the project’s inception.

Instead, the team needs to do what it should have done at the beginning: make a major architectural change that supports durable execution systematically.

We’ve already discussed abstraction as the way forward. Begin by abstracting the plumbing functions from the application layer into their own service layer before writing another line of project code. That will unburden developers by removing the non-linear work, enabling them to scale and keeping the time needed to implement new features constant.

This abstraction maintains the linear experience for programmers. They’ll always feel in control of their time, and certain that they’re getting things done. They will no longer need to consider strategic decisions around tasks such as caching and queuing. Neither will they have to worry about bolting together sprawling sets of software tools and libraries to manage those tasks.

The project managers will be just as happy as the developers with an abstracted set of transaction workflows. Certainty and predictability are key requirements for them, which makes the tipping point with its break from linear development especially problematic. Abstracting the task of transaction sequencing removes the unexpected developer workload and preserves that linearity, giving them the certainty they need to meet scheduling and budgetary commitments.

Tools that support this abstraction and the transformation of plumbing code into manageable workflows will help you preserve predictable software development practices, eliminating the dreaded tipping point and saving you the stress of project remediation. The best time to deploy these abstraction services is before your project begins, but even if your team is in crisis right now, they offer a way out of your predicament.

 
