In the Great Microservices Debate, Value Eats Size for Lunch https://thenewstack.io/in-the-great-microservices-debate-value-eats-size-for-lunch/ Tue, 13 Jun 2023

In May, an old hot topic in software design long thought to be settled was stirred up again, sparked by an article from the Amazon Prime Video architecture team about moving from serverless microservices to a monolith. This sparked some spirited takes and also a clamor for using the right architectural pattern for the right job.

Two interesting aspects stood out in the melee of ensuing conversations. First, the original article was more about the scaling challenges of serverless “lambdas” than about microservices per se: it covered state changes within Step Functions leading to higher costs, data transfer between lambdas and S3 storage, and so on.

It bears repeating that there are other, and possibly better, ways of implementing microservices than serverless alone. The choice of serverless lambdas is not synonymous with the choice of microservices. Choosing serverless as a deployment vehicle should be contingent on factors such as expected user load and call frequency patterns, among other things.

The second and more interesting aspect was about the size of the services (micro!) and this was the topic of most debates that emerged. How micro is micro? Is it a binary choice of micro versus monolith? Or is there a spectrum of choices of granularity? How should the size or granularity factor into the architecture?

Value-Based Services: Decoupling to Provide Value Independently

A key criterion for a service to stand alone as a separate code base and a separately deployable entity is that it should provide some value to the users — ideally the end users of the application. A useful heuristic for deciding whether a service satisfies this criterion is to ask whether most enhancements to the service would result in benefits perceivable by the user. If, in the vast majority of updates, the service can deliver that user benefit only by getting other services to release enhancements as well, then the service has failed the criterion.

Services Providing Shared Internal Value: Coupling Non-Divergent Dependent Paths

What about services that offer capabilities internally to other services and not directly to the end user? For instance, there might be a service that offers a certain specialty queuing that is required for the application. In such cases, the question becomes whether the capabilities provided by the service have just one internal client or several internal clients.

If a service ends up calling exactly one other service most of the time, apart from a few exceptional cases where the call path diverges, then there is little benefit in separating that service from its predominant dependency. Another useful heuristic: if a circuit breaks and a service is unable to reach one of its dependencies, can the calling service still provide anything at all to its users, or nothing?
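
As a rough illustration of that heuristic, here is a minimal TypeScript sketch; the product and recommendation services are hypothetical. If the caller can still return something useful when its dependency is unreachable, the two are better candidates for separation than if it can return nothing at all.

type Product = { id: string; name: string };
type ProductPage = { product: Product; recommendations: Product[] };

// Hypothetical local store and remote dependency client; names are illustrative only.
declare const productStore: { get(id: string): Promise<Product> };
declare const recommendationClient: { forProduct(id: string): Promise<Product[]> };

async function getProductPage(productId: string): Promise<ProductPage> {
  // Served from the service's own data, available even when the dependency is down.
  const product = await productStore.get(productId);

  let recommendations: Product[] = [];
  try {
    recommendations = await recommendationClient.forProduct(productId);
  } catch {
    // Circuit open or dependency unreachable: degrade gracefully instead of failing the whole page.
  }
  return { product, recommendations };
}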

Avoiding Talkative Services with Heavy Payloads

Providing value is also about the cost efficiency of designing as multiple services versus combining as a single service. One such aspect that was highlighted in the Prime Video case was chatty network calls. This could be a double whammy because it not only results in additional latency before a response goes back to the user, but it might also increase your bandwidth costs.

This becomes more problematic when large or numerous payloads move between services across network boundaries. To mitigate it, one could use a storage service so that the payload itself never needs to move around: only an identifier of the payload is passed, and only the services that need the payload consume it from storage.

However, even if only an ID is passed around, when several services along the call path need to inspect or operate on the payload, each of them must pull it down from the storage service, which nullifies the benefit and can even worsen the situation.
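
To make the mitigation concrete, here is a minimal TypeScript sketch assuming a generic object store and a queue client, both hypothetical: the producer stores the payload once and publishes only its identifier, and only a consumer that needs the payload fetches it.

type PayloadReference = { payloadId: string };

// Hypothetical object-store and queue clients; any S3-like store and message broker would do.
declare const objectStore: {
  put(key: string, body: Uint8Array): Promise<void>;
  get(key: string): Promise<Uint8Array>;
};
declare const queue: { publish(message: PayloadReference): Promise<void> };

// Producer: store the heavy payload once, then pass only its identifier downstream.
async function submitJob(payload: Uint8Array): Promise<void> {
  const payloadId = crypto.randomUUID(); // global crypto in Node 19+ and Deno
  await objectStore.put(payloadId, payload);
  await queue.publish({ payloadId });
}

// Consumer: only a service that actually needs the payload pulls it down by identifier.
async function handleJob(message: PayloadReference): Promise<void> {
  const payload = await objectStore.get(message.payloadId);
  // ...inspect or transform the payload here...
}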

How and where payloads are handled should be an important part of designing service boundaries, and it thereby influences how many services the system ends up with.

Testability and Deployability

Finally, one more consideration would be the cost of rapidly testing and deploying services. Consider a scenario wherein a majority of the time multiple services need to be simultaneously enhanced in order to provide a feature enhancement to the user.

Feature testing would involve testing all of those services together. This can create release bottlenecks or necessitate complex release-control and testing mechanisms such as feature flags or blue-green deployment of sets of services, among other things. This tendency is a sure sign of a disadvantageous proliferation of too many discrete parts.

Too many teams fall into the trap of shipping “service enhancements” in every release that do little for the end user, because pieces from a number of other services still need to come together. Such a highly coupled architecture complicates both dependency management and versioning, and it delays the delivery of end-user value.

Value-Based Services, Not ‘Micro’ Services!

An architecture should be able to deliver value to end users the majority of the time through the independent release of individual services. Coupling, dependencies, ease of testing and frequency of deployment matter more; the size of a service itself is useful mainly for applying reasonable limits so that no service becomes too gigantic or too nano-sized.

There may be other, more esoteric reasons for splitting out multiple services, such as the way our teams are organized (Conway’s law, anyone?) or the flexibility to mix languages and frameworks, but these are rarely real needs for providing value in enterprise software development.

One could very well have a performant cost-efficient architecture that delivers “value” with a diverse mix of services of various sizes — some big, some micro, and others somewhere in between. Think of it as a “value-based services architecture” rather than a “microservices-based architecture” that enables services to deliver value quickly and independently. Because value always eats size for lunch!

Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All https://thenewstack.io/amazon-prime-videos-microservices-move-doesnt-lead-to-a-monolith-after-all/ Tue, 13 Jun 2023

In any organizational structure, once you break down regular jobs into overly granularized tasks and delegate them to too many individuals, their messaging soon becomes unmanageable, and the organization stops growing.

Last March 22, in a blog post that went unnoticed for several weeks, Amazon Prime Video’s engineers reported that the service quality monitoring application they had originally built to determine quality-of-service (QoS) levels for streaming videos — an application built on a microservices platform — was failing, even at levels below 10 percent of service capacity.

What’s more, they had already applied a remedy: a solution their post described as “a monolith application.”

The change came at least five years after Prime Video — home of on-demand favorites such as “Game of Thrones” and “The Marvelous Mrs. Maisel” — successfully outbid traditional broadcast outlets for the live-streaming rights to carry NFL Thursday Night Football.

One of the leaders in on-demand streaming now found itself in the broadcasting business, serving an average of 16.6 million real-time viewers simultaneously. To keep up with live sports viewers’ expectations of their “networks” — in this case, CBS, NBC, or Fox — Prime Video’s evolution needed to accelerate.

It wasn’t happening. When the 2022 football season kicked off last September, too many of Prime Video’s tweets were prefaced with the phrase, “We’re sorry for the inconvenience.”

Prime Video engineers overcame these glitches, their blog reported, by consolidating QoS monitoring operations that had been separated into isolated AWS Step Functions and Lambda functions into a unified code module.

As initially reported, their results appeared to finally confirm many organizations’ suspicions, well-articulated over the last decade, that the costs incurred in maintaining system complexity and messaging overhead inevitably outweighed any benefits to be realized from having adopted microservices architecture.

Once that blog post awakened from its dormancy, several experts declared all of microservices architecture dead. “It’s clear that in practice, microservices pose perhaps the biggest siren song for needlessly complicating your system,” wrote Ruby on Rails creator David Heinemeier Hansson.  “Are we seeing a resurgence of the majestic monolith?” asked .NET MVP Milan Jovanović on Twitter. “I hope so.”

“That’s great news for Amazon because it will save a ton of money,” declared Jeff Delaney on his enormously popular YouTube channel Fireship, “but bad news for Amazon because it just lost a great revenue source.”

Yet there were other experts, including CodeOpinion.com’s Derek Comartin, who compared Prime’s “before” and “after” architectural diagrams and noticed some glaring disconnects between those diagrams and their accompanying narrative.

As world-class experts speaking with The New Stack also noticed, and as a high-ranking Amazon Web Services engineer finally confirmed for us, the solution Prime Video adopted not only fails to fit the profile of a monolithic application; in every respect that truly matters, including scalability and functionality, it is a more evolved microservice than what Prime Video had before.

That Dear Perfection

“This definitely isn’t a microservices-to-monolith story,” remarked Adrian Cockcroft, the former vice president of cloud architecture strategy at AWS, now an advisor for Nubank, in an interview with The New Stack. “It’s a Step Functions-to-microservices story. And I think one of the problems is the wrong labeling.”

Cockcroft, as many regular New Stack readers will know, is one of microservices architecture’s originators, and certainly its most outspoken champion. He has not been directly involved with Prime Video or AWS since becoming an advisor, but he’s familiar with what actually happened there, and he was an AWS executive when Prime’s stream quality monitoring project began. He described for us a kind of prototyping strategy where an organization utilizes AWS Step Functions, coupled with serverless orchestration, for visually modeling business processes.

With this adoption strategy, an architect can reorganize digital processes essentially at will, eventually discovering their best alignment with business processes. He’s intimately familiar with this methodology because it’s part of AWS’ best practices — advice which he himself co-authored. Speaking with us, Cockcroft praised the Prime Video team for having followed that advice.

As Cockcroft understands it, Step Functions was never intended to run processes at the scale of live NFL sports events. It’s not a staging system for processes whose eventual, production-ready state would need to become more algorithmic, more efficient, more consolidated. So the trick to making the Step Functions model workable for more than just prototyping is not just to make the model somewhat scalable, but also transitional.

“If you know you’re going to eventually do it at some scale,” said Cockcroft, “you may build it differently in the first place. So the question is, do you know how to do the thing, and do you know the scale you’re going to run it at? Those are two separate cases. If you don’t know either of those, or if you know it’s small-scale, complex, and you’re not exactly sure how it’s going to be built, then you want to build a prototype that’s going to be very fast to build.”

However, he suggested, if an organization knows from the outset its application will be very widely deployed and highly scalable, it should optimize for that situation by investing in more development time up-front. The Prime Video team did not have that luxury. In that case, Cockcroft said, the team was following best practices: building the best system they could, to accomplish the business objectives as they interpreted them at the time.

“A lot of workloads cost more to build than to run,” Cockcroft explained. “[For] a lot of internal corporate IT workloads, lots of things that are relatively small-scale, if you’re spending more on the developers than you are on the execution, then you want to optimize for saving developer time by building it super-quickly. And I think the first version… was optimized that way; it wasn’t intended to run at scale.”

As any Step Functions-based system becomes refined, according to those same best practices, the next stage of its evolution will be transitional. Part of that metamorphosis may involve, contrary to popular notions, service consolidation. Despite how Prime Video’s blog post described it, the result of consolidation is not a monolith. It’s now a fully-fledged microservice, capable of delivering those 90% cost reductions engineers touted.

“This is an independently scalable chunk of the overall Prime Video workload,” described Cockcroft. “If they’re not running a live stream at the moment, it would scale down or turn off — which is one reason to build it with Step Functions and Lambda functions to start with. And if there’s a live stream running, it scales up. That’s a microservice. The rest of Prime Video scales independently.”

The New Stack spoke with Ajay Nair, AWS’ general manager for Lambda and for its managed container service App Runner. Nair confirmed Cockcroft’s account in its entirety for how the project was initially framed in Step Functions, as well as how it ended up a scalable microservice.

Nair outlined for us a typical microservices development pattern. Here, the original application’s business processes may be too rigidly coupled together to allow for evolution and adaptation. So they’re decoupled and isolated. This decomposition enables developers to define the contracts that spell out each service’s expected inputs and outputs, requirements and outcomes. For the first time, business teams can directly observe the transactional activities that, in the application’s prior incarnations, had been entirely obscured by its complexity and unintended design constraints.

From there, Nair went on, software engineers may codify the isolated serverless functions as services. In so doing, they may further decompose some services — as AWS did for Amazon S3, which is now served by over 300 microservice classes. They may also consolidate other services. One possible reason: Observing their behavior may reveal they actually did not need to be scaled independently after all.

“It is a natural evolution of any architecture where services that are built get consolidated and redistributed,” said Nair. “The resulting capability still has a well-established contract, [and] has a single team managing and deploying it. So it technically meets the definition of a microservice.”

Breakdown

“I think the definition of a microservice is not necessarily crisp,” stated Brendan Burns, the co-creator of Kubernetes, now corporate vice president at Microsoft, in a note to The New Stack.

“I tend to think of it more in terms of capabilities around functionality, scaling, and team size,” Burns continued. “A microservice should be a consistent function or functions — this is like good object-oriented design. If your microservice is the CatAndDog() service, you might want to consider breaking that into Cat() and Dog() services. But if your microservice is ThatOneCatOnMyBlock(), it might be a sign that you have broken things down too far.”
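
As a loose illustration of the granularity rule Burns sketches, the hypothetical TypeScript contracts below contrast a service that bundles unrelated capabilities, services split along consistent functions, and one scoped too narrowly.

// Too coarse: unrelated capabilities behind a single contract.
interface CatAndDogService {
  adoptCat(name: string): Promise<void>;
  walkDog(name: string): Promise<void>;
}

// Split along consistent functions, as Burns suggests.
interface CatService {
  adopt(name: string): Promise<void>;
}
interface DogService {
  walk(name: string): Promise<void>;
}

// Too fine: scoped to a single instance rather than a capability.
interface ThatOneCatOnMyBlockService {
  feed(): Promise<void>;
}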

“The level of granularity that you decompose to,” explained F5 Networks Distinguished Engineer Lori MacVittie, speaking with The New Stack, “is still limited by the laws of physics, by network speed, by how much [code] you’re actually wrapping around. Could you do it? Could you do everything as functions inside a containerized environment, and make it work? Yes. It’d be slow as heck. People would not use it.”

Adrian Cockcroft advises that the interpretability of each service’s core purpose, even by a non-developer, should be a tenet of microservices architecture itself. That alone should guard against poor design choices.

“It should be simple enough for one person to understand how it works,” Cockcroft advocated. “There are lots of definitions of microservices, but basically, you’ve partitioned your problem into multiple, independent chunks that are scaled independently.”

“Everything we’re describing,” remarked F5’s MacVittie, “is just SOA without the standards… We’re doing the same thing; it’s the same pattern. You can take a look at the frameworks, objects, and hierarchies, and you’d be like, ‘This is not that much different than what we’ve been doing since we started this.’ We can argue about that. Who wins? Does it matter? Is Amazon going to say, ‘You’re right, that’s a big microservice, thank you?’ Does it change anything? No. They have solved a problem that they had, by changing how they design things. If they happen to stumble on what they should have been doing in the first place, according to the experts on the Internet, great. It worked for them. They’re saving money, and they did expose one of those problems with decomposing something too far, on a set of networks on the Internet that is not designed to handle it yet.

“We are kinda stuck by physics, right?” she continued.  “We’re unlikely to get any faster than we are right now, so we have to work around that.”

Perhaps you’ve noticed: Enterprise technology stories thrive on dichotomy. For any software architecture to be introduced to the reader as something of value, vendors and journalists frame it in opposition to some other architecture. When an equivalent system or methodology doesn’t yet exist, the new architecture may end up being portrayed as the harbinger of a revolution that overturns tradition.

One reason may be because the discussion online is being led either by vendors, or by journalists who tend to speak with vendors first.

“There is this ongoing disconnect between how software companies operate, and how the rest of the world operates,” remarked Platify Insights analyst Donnie Berkholz. “In a software company, you’ve got ten times the staffing and software engineering on a per capita basis across the company, as you do in many other companies. That gives you a lot of capacity and talent to do things that other people can’t keep up with.”

Maybe the big blazing “Amazon” brand obscured the fact — despite the business units’ proximity to one another — that Prime Video was a customer of AWS. With its engineers’ blog post, Prime joined an ongoing narrative that may have already spun out of control. Certain writers may have focused so intently upon selected facets of microservices architecture, that they let readers draw their own conclusions about what the alternatives to that architecture must look like. If microservices were, by definition, small (an aspect that one journalist in particular was guilty as hell of over-emphasizing), its evil counterpart must be big, or bigness itself.

Subsequently, in a similar confusion of scale, if Amazon Prime Video embraces a monolith, so must all of Amazon. Score one come-from-behind touchdown for monoliths in the fourth quarter, and cue the Thursday Night Football theme.

“We’ve seen the same thing happening over and over across the years,” mentioned Berkholz. “The leading-edge software companies, web companies, and startups encounter a problem because they’re operating at a different scale than most other companies. And a few years later, that problem starts to hit the masses.”

Buildup

The original “axis of evil” in the service-orientation dichotomy was 1999’s Big Ball of Mud. First put forth by Professors Brian Foote and Joseph Yoder of the University of Illinois at Urbana-Champaign, the Big Ball helped catalyze a resurgence in support for distributed systems architecture. It was seated at the discussion table where the monolith sits now, but not for the same reasons.

The Big Ball wasn’t a daunting tower of rigid, inflexible, tightly-coupled processes, but rather programs haphazardly heaped onto other programs, with data exchanged between them by means of file dumps onto floppy disks carried down office staircases in cardboard boxes. Amid the digital chaos of the 1990s and early 2000s, anything definable as not a Big Ball of Mud, was already halfway beautiful.

“Service Oriented Architecture was actually the same idea as microservices,” recalls Forrester senior analyst David Mooter. “The idea was, you create services that align with your business capabilities and your business operating model. Most organizations, what they heard was, ‘Just put stuff [places] and do a Web service,’ [the result being] you just make things SOAP. And when you create haphazard SOAP, you create Distributed Little Balls of Mud. SOA got a bad name because everyone was employing SOA worst practices.”

Mooter shared some of his latest opinions in a Forrester blog post entitled, “The Death of Microservices?” In an interview with us, he noted, “I think you’re seeing, with some of the reaction to this Amazon blog, when you do microservices worst practices, and you blame microservices rather than your poor architectural decisions, then everyone says microservices stink… Put aside microservices: Any buzzword tech trend cannot compensate for poor architectural decisions.”

The sheer fact that “Big Ball” is a nebulous, plastic metaphor has enabled almost any methodology or architecture that fell out of favor over the past quarter-century to become associated with it. When microservices make inroads with organizations, it’s the monolith that gets to wear the crown of thorns. More recently, with some clever phraseology, microservices have carried the moniker of shame.

“Our industry swings like a pendulum between innovation, experimentation, and growth (sometimes just called ‘peacetime’) and belt-tightening and pushing for efficiency (‘wartime’),” stated Laura Tacho, long-time friend of The New Stack, and a professional engineering coach.  “Of course, most companies have both scenarios going on in different pockets, but it’s obvious that we’re in a period of belt-tightening now. This is when some of those choices — for example, breaking things into microservices — can no longer be justified against the efficiency losses.”

Berkholz has been observing the same trend: “There’s been this push back-and-forth within the industry — some sort of a pendulum happening, from monolith to microservices and back again. Years ago, it was SOA and back again.”

Defenders of microservices against the mud-throwing that happens when the pendulum swings back say their architecture won’t be right for every case, or even every organization. That’s a problem. Whenever a market is perceived as being served by two or more equivalent, competing solutions, that market may correctly be portrayed as fragmented. Which is exactly the kind of market enterprises typically avoid participating in.

“Fragmentation implies that the problem hasn’t been well-solved for everybody yet,” Berkholz told us, “when there’s a lot of different solutions, and nobody’s consolidated on a single one that makes sense most of the time. That is something that companies watch. Is this a fragmented ecosystem, where it’s hard to make choices? Or is this an ecosystem where there’s a clear and obvious master?”

From time to time, Lori MacVittie told us, F5 Networks surveys its clients, asking them for the relative percentages of their applications portfolios they would describe as monoliths, microservices, mobile apps and middleware-infused client/server apps.  “Most organizations were operating at some percentage of each of those,” she told us. When the question was adjusted, asking only whether their apps were “traditional” or “modern,” the split usually has been 60/40, respectively.

“They’re doing both,” she said. “And within those, they’re doing different styles. Is that a mess? I don’t think so. They had specific uses for them.”

“I kind of feel like microservice-vs.-monolith isn’t a great argument,” stated Microsoft’s Brendan Burns. “It’s like arguing about vectors vs. linked lists or garbage collection vs. memory management. These designs are all tools — what’s important is to understand the value that you get from each, and when you can take advantage of that value. If you insist on microservicing everything, you’re definitely going to microservice some monoliths that probably you should have just left alone. But if you say, ‘We don’t do microservices,’ you’re probably leaving some agility, reliability and efficiency on the table.”

The Big Ball of Mud metaphor’s creators cited, as the reason software architectures become bloated and unwieldy, Conway’s Law: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” Advocates of microservices over the years have taken this notion a few steps further, suggesting business structures and even org charts should be deliberately remodeled to align with software, systems, and services.

When the proverbial pendulum swings back, notes Tacho, companies start reconsidering this notion. “Perhaps it’s not only Conway’s Law coming home to roost,” she told us, “but also, ‘Do market conditions allow us to take a gamble on ignoring Conway’s Law for the time being, so we could trade efficiency for innovation?’”

Continuing her war-and-peace metaphor, Tacho went on: “Everything’s a tradeoff. Past decisions to potentially slow development down and make processes less efficient due to microservices might have been totally fine during peacetime, but having to continuously justify those inefficiencies, especially during a period of belt-tightening, is tiresome. What surprises me sometimes is that rearchitecting a large codebase is not something that most companies would invest in during wartime. They simply have to have other priorities with a better ROI for the business, but big fish like Amazon have more flexibility.”

“The first thing you should look at is your business,” advised Forrester’s Mooter, “and what is the right architecture for that? Don’t start with microservices. Start with, what are the business outcomes you’re trying to achieve? What Forrester calls, ‘Outcome-Driven Architecture.’ How do we align our IT systems and infrastructure and applications, to optimize your ability to deliver that? It will change over time.”

“It’s definitely the case,” remarked Microsoft’s Burns, “that one of the benefits of microservices design is that it enables small teams to behave autonomously because they own very specific APIs with crisp contracts between teams. If the rest of your development culture prevents your small teams from operating autonomously, then you’re never going to gain the agility benefits of microservices. Of course, there are other benefits too, like increased resiliency and potentially improved efficiency from more optimal scaling. It’s not an all-or-nothing, but it’s also the case that an engineering culture that is structured for independence and autonomy is going to do better when implementing microservices. I don’t think that this is that much different than the cultural changes that were associated with the DevOps movement a decade ago.”

Prime Video made a huge business gamble on NFL football rights, and the jury is still out as to whether, over time, that gamble will pay off. That move lit a fire under certain sensitive regions of Prime Video’s engineering team. The capabilities they may have planned to deliver three to five years hence, were suddenly needed now. So they made an architectural shift — perhaps the one they’d planned on anyway, or maybe an adaptation. Did they enable business flexibility down the road, as their best practices advised? Or have they just tied Prime Video down to a service contract, to which their business will be forced to adapt forever? Viewed from that perspective, one could easily forget which option was the monolith, and which was the microservice.

It’s a dilemma we put to AWS’ Ajay Nair, and his response bears close scrutiny, not just by software engineers: “Building an evolvable architectural software system is a strategy, not a religion.”

Case Study: A WebAssembly Failure, and Lessons Learned https://thenewstack.io/webassembly/case-study-a-webassembly-failure-and-lessons-learned/ Thu, 25 May 2023

VANCOUVER — In their talk “Microservices and WASM, Are We There Yet?” at the Linux Foundation’s Open Source Summit North America, Kingdon Barrett, of Weaveworks, and Will Christensen, of Defense Unicorns, said they were as surprised as anyone that their talk was accepted, since they were newbies who had spent about three weeks delving into this nascent technology.

And their project failed. (Barrett argued, “It only sort of failed  … We accomplished the goal of the talk!”)

But they learned a lot about what WebAssembly, or Wasm, can and cannot do.

“Wasm has largely delivered on its promise in a browser and in apps, but what about for microservices?” the pair’s talk synopsis summarized. “We didn’t know either, so we tried to build a simple project that seemed fun, and learned Wasm for microservices is not as mature and a bit more complicated than running in the browser.”

“Are we there yet? Not really. There’s some caveats,” said Christensen. “But there are a lot of things that do work, but it’s not enough that I wouldn’t bet the farm on it kind of thing.”

Finding Wasm’s Limitations

Barrett, an open source support engineer at Weaveworks, called WebAssembly “this special compiled bytecode language that works on some kind of like a virtual machine that’s very native toward JavaScript. It’s definitely shown that it is significantly faster than, let’s say, JavaScript running with the JIT (just-in-time compiler).

“And when you write software to compile for it, you just need to treat it like a different target — like on x86 or Arm architectures; we can compile to a lot of different targets.”

The speakers found there are limitations or design constraints, if you will:

  • You cannot access the network in an unpermissioned way.
  • You cannot pass a string as an argument to a function.
  • You cannot access the file system unless you have specified the things that are permitted.

“There is no string type,” Barrett said. “As far as I can tell, you have to manage memory and count the bytes you’re going to pass. Make sure you don’t lose that number. That’s a little awkward, but there is a way around that as well.”
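
For readers unfamiliar with the workaround Barrett alludes to, here is a minimal sketch using the standard WebAssembly JavaScript API in Deno. The module's memory, alloc and greet exports are hypothetical and depend entirely on how the module was compiled.

// Sketch: write a string into the module's linear memory, then hand it a pointer and a byte count.
const wasmBytes = await Deno.readFile("./module.wasm");
const { instance } = await WebAssembly.instantiate(wasmBytes, {});
const { memory, alloc, greet } = instance.exports as unknown as {
  memory: WebAssembly.Memory;
  alloc: (len: number) => number;      // reserves len bytes, returns an offset
  greet: (ptr: number, len: number) => void;
};

const bytes = new TextEncoder().encode("hello from the host");
const ptr = alloc(bytes.length);                              // ask the module for space
new Uint8Array(memory.buffer, ptr, bytes.length).set(bytes);  // copy the string in
greet(ptr, bytes.length);                                     // pass pointer + length, not a string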

The talk was part of the OpenGovCon track at the conference.

“We came up with this concept, being the government space, that I thought was going to be really interesting for an ATO perspective” — authorized to operate — “which is, how do you enable continuous delivery while still maintaining a consistent environment?” Christensen said.

The government uses ATO certification to manage risk in contractors’ networks by evaluating the security controls for new and existing systems.

One of the big potential benefits for government contractors with Wasm, Christensen said, is the ability to use existing code and to retain people with deep knowledge in a particular language.

“You can use that, tweak it a little bit and get life out of it,” he said. “You may have some performance losses where there may be some nuances, but largely you can retain a lot of that domain language or that sort of domain knowledge and carry it over for the future.”

Barrett and Christensen set out to write a Kubernetes operator.

“I wanted to write something in Go … so all your functions for this or wherever you need come in the event hooks,” Christensen said.

Then, instead of keeping that state in a function or a class inside the monolithic operator design, the idea is that you can reference it in an external value store. It could be a Redis cache, database, or object storage. Wasm is small enough that a small binary can be loaded at initialization.

If cold start times are not a problem, you could write something that will take a request, pull a Wasm module, load it, run it and return the result.
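
That pattern might look roughly like the following sketch, using the standard WebAssembly JavaScript API; the module URL and its handle export are made up.

// Sketch: fetch a Wasm module on demand, run an exported function, return the result.
async function runModule(moduleUrl: string, input: number): Promise<number> {
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch(moduleUrl),
    {}, // imports the module expects, if any
  );
  const handle = instance.exports.handle as unknown as (n: number) => number;
  return handle(input);
}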

And, Christensen continued, “if you really want to get creative, you can shove it in as a config map inside of Kubernetes and … whatever you want to do, but the biggest thing is Wasm gets pulled in. And the idea is you call it almost like a function, and you just execute it.

“And each one of those executions would be a sandbox so you can control the exposure and security and what’s exposed throughout the entire operator. … You could statically compile the entire operator and control it that way. Anyone who wants to work in the sandbox with modules, they would have the freedom within the sandbox to execute. This is the dream. … Well, it didn’t work.”

The idea was that there would be stringent controls in a sandbox about how the runtime would be exposed to the Wasm module, which would include logging and traceability for compliance.

Runtimes and Languages

WebAssembly is being hailed for its ability to compile from any language, though Andrew Cornwall, a Forrester analyst, told The New Stack that it’s easier to compile languages that do not have garbage collectors, so languages such as Java, Python and interpreted languages tend to be more difficult to run in WebAssembly than languages such as C or Rust.

Barrett and Christensen took a few runtimes and languages for (ahem) a spin. Here’s what they found:

Fermyon Spin

The RuntimeClass resource has been available since Kubernetes v1.12. It's easy to get started with, but light on controls: the design requires privileged access to your nodes, and containerd shims control which nodes get provisioned with the runtime.

Kwasm

“There’s a field on the deployment class called runtimeClassName, and you can set that to whatever you want, as long as containerd knows what that means. So the Kwasm operator breaks into the host node, sets up some containerd configuration and imports a binary from wherever — this is not production ready,” Barrett said, unless you already had separate controls around all of those knobs and know how to authorize that type of grant safely.

He added, “Anyway, this was very easy to get your Wasm modules to run directly on Kubernetes this way, even though it does require privileged access to the nodes and it’s definitely not ATO.”

WASI/WAGI

WASI (WebAssembly System Interface) provides system interfaces; WAGI (WebAssembly Gateway Interface) permits standard IO to be treated as a connection.

“Basically, you don’t have to handle connections, the runtime handles that for you,” Barrett said. “That’s how I would summarize WAGI, and WASI is the system interface that makes that possible. You have standard input, standard output, you have the ability to share memory, and functions — you can import them or export them, call them from inside or outside of the Wasm, but only in ways that you permit.”

WasmEdge

WasmEdge Runtime, based on C++, became a Cloud Native Computing Foundation project in 2021.

The speakers extolled an earlier talk at the conference by Michael Yuan, a maintainer of the project, and urged attendees to look for it.

Wasmer/Wasmtime

Barrett and Christensen touted the documentation on these runtime projects.

“There are a lot of language examples that are pretty much parallel to what I went through … and it started to click for me,” Barrett said. “I didn’t really understand WASI at first, but going through those examples made it pretty clear.”

They’re designed to get you thinking about low-level constructs of Wasm:

  • What is possible with a function, memory, compiler.
  • How to exercise these directly from within the host language.
  • How to separate your business logic.
  • Constraints in these environments will help you scope your project’s deliverable functions down smaller and smaller.

Wasmtime or Wasmer run examples in WAT (WebAssembly Text Format), a textual representation of the Wasm binary format, something to keep in mind when working in a language like Go. If you’re trying to figure out how to call modules in Go and it’s not working, check out Wazero, the zero-dependency WebAssembly runtime written in Go, Barrett said.

Rust

It has first-class support and the most documentation, the speakers noted.

“If you have domain knowledge of Rust already, you can start exploring right now how to use Wasm in your production workflow,” Christensen said.

Node.js/Deno

Wasm was first designed for use in web browsers. There’s a lot of information out there already about the V8 engine running code that wasn’t just JavaScript in the browser. V8 is implemented in C++ with support for JavaScript. That same V8 engine is found at the heart of NodeJS and Deno. The browser-native JavaScript runtimes in something like Node.js or Deno are what made their use with Wasm so simple.

“A lot of the websites that had the integration already with the V8 engine, so we found that from the command line from a microservices perspective was kind of really easy to implement,” Christensen said.

“So the whole concept about the strings part, about passing it with a pointer, if you’re running Node.js and Deno, you can pass strings natively and you don’t even know it’s any different. …Using Deno, it was really simple to implement. …There are a lot of examples that we’ve discovered, one of which is ‘Hello World,’ actually works. I can compile it so it actually runs and can pass a string and get a string out simply from a web assembly module with Deno.”

Christensen said that Deno or Node.js currently provides the best combination of WASM support that is production ready with a sufficient developer experience.

A Few Caveats

“But a little bit of warning when you go to compile,” Christensen said. “What we have discovered is: all WASM is not compiled the same.”

There are three compilers for Wasm:

  • Singlepass doesn’t have the fastest runtime, but has the fastest compilation.
  • Cranelift is a main engine used in Wasmer and Wasmtime. It doesn’t have the fastest runtime; it’s much better, but it’s still a bizarre compilation.
  • LLVM has the slowest compile time. No one who’s ever used LLVM is surprised there, but it is the fastest runtime.

A Few Problems

Pointer functions for handling strings are problematic. String passing, specifically with Rust, even when done correctly, could decrease performance by up to 20 times, they said.

There is a significant difference between compiled and interpreted languages when compiled to a Wasm target. Wasm binaries for Ruby and Python may see 20 to 50MB penalties compared to Go or Rust because of the inclusion of the interpreter.

“And specifically, just because we’re compiling Ruby or Python to Wasm, you do need to compile the entire interpreter into it,” Christensen said. “So that means if you are expecting Wasm to be better for boot times and that kind of stuff, if you’re using an interpreted language, you are basically shoving the entire interpreter into the Wasm binary and then running your code to be on the interpreter. So please take note that it’s not a uniform experience.”

“If you’re using an interpreted language, it’s still interpreted in Wasm,” Barrett said. “If you’re passing the script itself into Wasm, the interpreter is compiled in Wasm but the script is still interpreted.”

And Christensen added, “You’re restricted to the runtime restrictions of the browser itself, which means sometimes they may be single-threaded. Good, bad, just be aware.”

A web browser, Deno and Node.js all use the V8 engine, meaning they all exhibit the same limitations when running Wasm.

And language threading needs to be known at runtime for both host and module.

“One thing I’ve noticed: in Go, if I use the HTTP module to do a request from a Wasm-compiled Go module from Deno, there is no way that I can turn around and make sure that’s not gonna break the threaded nature of Deno and that V8 engine,” Christensen said.

He added, “Maybe there’s an answer there, but I didn’t find it. So if you are just getting started and you’re just trying to mess around and try to find all that happening, just know that you may spend some time there.”

And what happens when you have a C dependency with your RubyGem?

Barrett said he didn’t try that at all.

“Most Ruby dependencies are probably native Ruby, not native extensions,” he said. “They’re pure Ruby, but a ‘native extension’ is Ruby compiling C code. And then you have to deal with C code now,” in addition to Ruby.

“Of course, C compiles to Wasm, so I’m sure there is a solution for this. But I haven’t found anyone who has solved it yet.”

It applies to some Python packages as well, Christensen said.

“They [Python eggs] are using the binary modules as well, so there is definitely no way to do a [native system] binary translation into Wasm — binary to binary,” he said. “So if you need to do it, you need to get your hands dirty, compile the library itself to Wasm, then compile whatever gem or package that function calls are there.”

The speakers said that in working with Wasm, they found that ChatGPT wasn’t very helpful and that debugging can be harsh.

So, Should You Be Excited about Wasm?

“Yes. There’s plenty of reasons to be excited,” Christensen said. “It may not be ready yet, but I definitely think it’s enough to move forward and start playing around yourself.”

When Wasm is fully mature, he said, it will have benefits in terms of tech workforce retention, especially in governmental organizations: “You can take existing workforce, you don’t have to re-hire and you can get longevity out of them. Especially to have all that wonderful domain knowledge and you don’t have to re-solve the same problem using a new tool.

“If you have a lot of JavaScript stuff, [you’ll have] better control over it and it runs faster, which is the whole reason why Wasm is interesting,” Christensen said. The reason is that JavaScript compiled to Wasm is much faster, as the V8 engine no longer has to do “just-in-time” operations.

“And then finally, I’m sure a lot of you have an ARM MacBook, and then you try to deploy something to the cloud,” he said. “And next thing you realize, ‘Oh look, my entire stack is in x86.’ Well, Wasm magically does take care of this. I did test this out on a Mac Mini and ran it on a brand new AMD 64 system and Deno couldn’t tell the difference.”

WebAssembly is ready to be tested, Christensen said, and the open source community is the way to make that happen.

“Let the maintainers know; start talking about it. Bring up issues. We need more working examples. That’s missing. We can’t even get ChatGPT to give us anything decent,” he said, so the community is relying on its members to experiment with it and share their experiences.

RabbitMQ Is Boring, and I Love It https://thenewstack.io/rabbitmq-is-boring-and-i-love-it/ Mon, 15 May 2023

RabbitMQ is boring. Very boring. And we tend not to think about boring things. RabbitMQ, like the electrical grid, is entirely uninteresting — until it stops working. The human brain is conditioned to recognize and respond to pain and peril more than peace, so we tend only to remember the traumas in life. In this post, I want to try to change that. Let’s talk about RabbitMQ, an open source message broker I’ve been using for the better part of 15 years — happily and bored.

My background is in, among other things, messaging and integration technologies. Unfortunately, legacy systems are often hostile, mostly because those who came before us did not foresee the highly distributed nature of today’s modern, API-dominant architectures, such as cloud native computing and microservices.

There are many ways to approach integration. In their book, “Enterprise Integration Patterns,” Gregor Hohpe and Bobby Woolf talk about four approaches: shared databases, messaging, remote procedure call (RPC) and file transfer. Integration is all about optionality and coupling: How do we take services that don’t know about each other and make them work together without overly coupling them by proximity or time? Messaging, that’s how. Messaging is the integration approach that comes with batteries included. It has all the benefits and few, if any, of the drawbacks of the three other integration styles. Messaging means I can sleep at night. Messaging is boring. I love boring.

With the support of multiple open protocols, such as AMQP 0.9, 1.0, MQTT, STOMP and others, RabbitMQ gives people options and flexibility and interoperability, so a service written in Python can communicate with another in C# or Java, and both can be none the wiser.

A Beautiful Indirection 

RabbitMQ has a straightforward programming model: Clients send messages to exchanges. The exchange acts as the broker’s front door, accepting incoming messages and routing them onward. An exchange looks at the incoming message and the message’s headers — and sometimes at one special header in particular, called a routing key — and decides to which queue (or queues) it should send the message. These exchanges can even send messages to other brokers. Queues are the thing consumers consume from. This beautiful indirection is why inserting an extra hop between a producer and a consumer is possible without affecting the producer or the consumer.
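
For anyone who has never touched the broker, a minimal sketch of that model using the Node.js amqplib client looks roughly like this; the exchange, queue and routing-key names are made up.

import amqp from "amqplib";

// Minimal sketch of the exchange -> binding -> queue flow.
const connection = await amqp.connect("amqp://localhost");
const channel = await connection.createChannel();

await channel.assertExchange("orders", "topic", { durable: true });
const { queue } = await channel.assertQueue("invoice-service", { durable: true });
await channel.bindQueue(queue, "orders", "order.created");

// The producer publishes to the exchange with a routing key; it never sees the queue.
channel.publish("orders", "order.created", Buffer.from(JSON.stringify({ orderId: 42 })));

// The consumer reads from the queue; it never sees the producer.
await channel.consume(queue, (message) => {
  if (message) {
    console.log("received:", message.content.toString());
    channel.ack(message);
  }
});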

It all seems so straightforward — and boring! — when you think about it. But you wouldn’t believe how many people got this stuff wrong from the get-go. Let’s look at Java Message Service (JMS), the Java-standardized API for messaging. It has no concept of an exchange, so it is impossible to reroute a message (without sidestepping the JMS interfaces) once a producer and a consumer connect. Meanwhile, some JMS brokers couple consumers and producers to the broker through the Java client driver used to talk to it. If the client supports version X of the broker, and someone has upgraded the broker to X+1, then the producer and the consumer may need to upgrade their Java client drivers to X+1.

Born in 2007, RabbitMQ was conceived due to the need for large banks to standardize their digital systems so they and their customers (that’s us) can transact more easily. RabbitMQ implemented the AMQP protocol from the jump; it’s still the most popular way to connect to the broker today. But it’s not the only way.

Let Me Count the Ways

As mentioned, RabbitMQ supports multiple protocols, which certainly offers choice, but there are other benefits as well. Take MQTT, which is popular in the Internet-of-Things space, where millions of clients — think microwaves, refrigerators and cars — might need to communicate with a single broker in a lightweight, efficient way. This work is ongoing and keeps getting better by the day. For example, Native MQTT was recently announced, dramatically reducing memory footprint and increasing scalability.

RabbitMQ supports federation and active/passive deployments. It has various approaches to storing the messages in RAM or on disk. It supports transactions. It’s speedy, and it guarantees the consistency of your data. However, RabbitMQ has traditionally served messaging and integration use cases, not stream processing pipelines.

The community around RabbitMQ is vibrant, burgeoning and fast-moving, and the last few years have been incredibly prolific. Have you tried RabbitMQ Streams? Streams are a new persistent and replicated data structure that models an append-only log with nondestructive consumer semantics. You can use Streams from a RabbitMQ client library as a plain ol’ queue or through a dedicated binary protocol plugin and associated clients for even better throughput and performance.
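
As a rough sketch of what that looks like from a regular AMQP client, a stream can be declared and read with amqplib; the queue name here is made up.

import type { Channel } from "amqplib";

// Assumes a channel obtained as in the earlier sketch.
declare const channel: Channel;

// A stream is declared like a queue, with the "stream" queue type.
await channel.assertQueue("clickstream", {
  durable: true,
  arguments: { "x-queue-type": "stream" },
});

// Stream consumers need a prefetch limit and manual acks; "x-stream-offset"
// controls where reading starts ("first" replays the log from the beginning).
await channel.prefetch(100);
await channel.consume(
  "clickstream",
  (message) => {
    if (message) {
      console.log(message.content.toString());
      channel.ack(message);
    }
  },
  { noAck: false, arguments: { "x-stream-offset": "first" } },
);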

To say it’s been successful would be an understatement. StackShare states that, among others, Reddit, Robinhood, Zillow, Backbase, Hello Fresh and Alibaba Travels all use RabbitMQ.

There are drivers for virtually every language, platform and paradigm. For example, I work on the Spring team, and we have several increasingly abstract ways by which you can use RabbitMQ, starting with the Spring for RabbitMQ foundational layer and going all the way up to the support in Spring Cloud Data Flow, our stream and batch processing stack.

RabbitMQ is open and extensible, supporting unique features with plugins to the server and extensions to the AMQP protocol.

The RabbitMQ site has a nonexhaustive list of some of the best ones. They include things like publisher confirms, dead letter exchanges, priority queues, per-message and per-queue TTL (time-to-live values tell RabbitMQ how long an unconsumed message is allowed to remain in a queue before being deleted), exchange-to-exchange bindings and so much more. These are extensions to the protocol itself, implemented in the broker.
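
As a hedged example of two of those extensions, per-queue message TTL and a dead letter exchange can be wired up through queue arguments in amqplib; the exchange and queue names are illustrative.

import type { Channel } from "amqplib";

// Assumes a channel obtained as in the earlier sketch.
declare const channel: Channel;

await channel.assertExchange("dlx", "fanout", { durable: true });

// Messages that sit in the queue longer than 60 seconds are dead-lettered to "dlx".
await channel.assertQueue("work", {
  durable: true,
  arguments: {
    "x-message-ttl": 60000,          // per-queue message TTL, in milliseconds
    "x-dead-letter-exchange": "dlx", // where expired or rejected messages are rerouted
  },
});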

Plugins are slightly different. Numerous plugins extend the broker proper and introduce new management, infrastructure and engine capabilities. For example, there are plugins to support Kubernetes service discovery, OAuth 2, LDAP, WAN federation, STOMP and so much more.

What does all this mean for me? It means that, like PostgreSQL, RabbitMQ is a Swiss army knife. Does it do everything that the costly alternatives from Tibco or IBM do? No. But I’ll bet it can do 95% of whatever I’d need, and it’ll do so in a way that leaves my options open in the future. (And you can’t beat the price!)

Maybe it’s just all those years spent wearing a pager or running into data centers with winter jackets on at 3 a.m., but I actively avoid running anything I can’t charge for directly. I prefer to have someone run RabbitMQ for me. It’s cheap and easy enough to do so on several different cloud providers or the Kubernetes distribution of your choice.

As a developer, RabbitMQ couldn’t be more boring. As an operator in production, RabbitMQ couldn’t be more boring. I love boring, and I love RabbitMQ.

If boring appeals to you too, I encourage you to learn more about RabbitMQ, how it works as an event-streaming broker, how it compares to Kafka and about the beta of RabbitMQ as a service.

How OpenSearch Visualizes Jaeger’s Distributed Tracing https://thenewstack.io/how-opensearch-visualizes-jaegars-distributed-tracing/ Thu, 11 May 2023

We all know how important observability is, and open source tooling is always a popular option, but the complexity of selecting that tooling is a perennial challenge. Typically, most organizations end up with several best-of-breed tools in use, spanning many different projects and databases.

As organizations continue to implement microservices-based architectures and cloud native technologies, operational data is becoming increasingly large and complex. Because of the distributed nature of the data, the old approach of sorting through logs is not scalable.

As a result, organizations are continuing to adopt distributed tracing as a way of gaining insight into their systems. Distributed tracing helps determine where to start investigating issues and ultimately reduces the time spent on root cause analysis. It serves as an observability signal that captures the entire lifecycle of a particular request as it traverses distributed services. Traces can have multiple service hops, called spans, that comprise the entire operation.
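
To make the trace-and-span vocabulary concrete, here is a minimal sketch using the OpenTelemetry JavaScript API. The tracer and span names are illustrative, and an OTel SDK with an exporter is assumed to be configured elsewhere in the process.

import { trace } from "@opentelemetry/api";

// Without a configured SDK these calls are no-ops, so the sketch is safe to run anywhere.
const tracer = trace.getTracer("checkout-service");

async function placeOrder(orderId: string): Promise<void> {
  // One span per hop; the parent span plus its children make up the trace for this request.
  await tracer.startActiveSpan("placeOrder", async (span) => {
    span.setAttribute("order.id", orderId);
    await tracer.startActiveSpan("chargeCard", async (child) => {
      // ...call the payment service here...
      child.end();
    });
    span.end();
  });
}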

Jaeger

One of the most popular open source solutions for distributed tracing is Jaeger. Jaeger is an open source, end-to-end solution hosted by the Cloud Native Computing Foundation (CNCF). Jaeger leverages data from OpenTelemetry (OTel)-based instrumentation SDKs and supports multiple open source data stores, such as Cassandra, OpenSearch and Elasticsearch, for trace storage.

While Jaeger does provide a UI solution for visualizing and analyzing traces along with monitoring data from Prometheus, OpenSearch now provides the option to visualize traces in OpenSearch Dashboards, the native OpenSearch visualization tool.

Trace Analytics

OpenSearch provides extensive support for log analytics and observability use cases. Starting with version 1.3, OpenSearch added support for distributed trace data analysis with the Observability feature. Using Observability, you can analyze the crucial rate, errors, and duration (RED) metrics in trace data. Additionally, you can evaluate various components of your system for latency and errors and pinpoint services that need attention.

The OpenSearch Project launched the trace analytics feature with support for OTel-compliant trace data provided by Data Prepper — the OpenSearch server-side data collector. To incorporate the popular Jaeger trace data format, OpenSearch 2.5 added support for it to the trace analytics feature in Observability.

With Observability, you can now filter traces to isolate the spans with errors in order to quickly identify the relevant logs. You can use the same feature-rich analysis capabilities for RED metrics, contextually linking traces and spans to their related logs, which are available for the Data Prepper trace data. The following image shows how you can view traces with Observability.

Keep in mind that the OTel and Jaeger formats have several differences, as outlined in OpenTelemetry to Jaeger Transformation in the OpenTelemetry documentation.

Try It out

To try out this new feature, see the Analyzing Jaeger trace data documentation. The documentation includes a Docker Compose file that shows you how to add sample data using a demo and then visualize it using trace analytics. To enable this feature, you need to set the --es.tags-as-fields.all flag to true, as described in the related GitHub issue. This is necessary because of an OpenSearch Dashboards limitation.

In Dashboards, you can see the top service and operation combinations with the highest latency and the greatest number of errors. Selecting any service or operation will automatically direct you to the Traces page with the appropriate filters applied, as shown in the following image. You can also investigate any trace or service on your own by applying various filters.

Next Steps

To try the OpenSearch trace analytics feature, check out the OpenSearch Playground or download the latest version of OpenSearch. We welcome your feedback on the community forum!

The post How OpenSearch Visualizes Jaeger’s Distributed Tracing appeared first on The New Stack.

Spring Cloud Gateway: The Swiss Army Knife of Cloud Development https://thenewstack.io/spring-cloud-gateway-the-swiss-army-knife-of-cloud-development/ Mon, 08 May 2023 13:47:40 +0000 https://thenewstack.io/?p=22707347

A microservice has to fulfill many functional and nonfunctional requirements. When implementing one, I mostly start with the happy path to see if I meet the functional requirements. For the nonfunctional requirements, like protecting my service or scaling various parts independently, I love to work with Spring Cloud Gateway, as this tiny and, IMHO, underrated tool is powerful even with just a few lines of configuration.

What Is Spring Cloud Gateway?

Spring Cloud Gateway is an open source, lightweight and highly customizable API gateway that provides routing, filtering and load-balancing functionality for microservices. It is built on top of Spring Framework and integrates easily with other Spring Cloud components.

If you’re new to Spring Cloud Gateway, this article outlines some common use cases where it can come in handy and requires minimal configuration.

How to Get Started with Spring Cloud Gateway

The easiest way to start experimenting with Spring Cloud Gateway is by using Spring Initializr. So let’s go to start.spring.io and generate a project stub. Pick the project type, language and the versions you want, and be sure to add the Spring Cloud Gateway dependency. Once you are done, hit the Download button.

What you get is a basic project structure like this:
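
(The exact layout depends on the build tool and options you picked; Spring Initializr generates application.properties by default, which is commonly renamed to application.yml as this article does.)

demo/
  pom.xml (or build.gradle)
  src/
    main/
      java/com/example/demo/DemoApplication.java
      resources/application.properties   -> rename to application.yml
    test/
      java/com/example/demo/DemoApplicationTests.java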

And this is a fully working, almost production-ready Spring Cloud Gateway. The important part is the application.yml, which will hold all further configuration. Now let’s add some magic.

Protect Services through Rate Limiting

Sometimes it’s necessary to protect your service from misbehaving clients to ensure availability for correctly behaving ones. In that case, Spring Cloud Gateway can help with its rate-limiting capabilities. By combining it with a KeyResolver, you can correctly identify all your clients and assign them a quota of requests they are allowed per second.

It also offers a burst mode, where you can get above the assigned quota for a short period of time to cope with sudden bursts of requests.

As the gateway, in that case, requires some sort of shared memory, you should combine it with an attached Redis instance; the built-in RedisRateLimiter needs the spring-boot-starter-data-redis-reactive dependency. This also allows you to scale your gateways horizontally.

spring:
  cloud:
    gateway:
      default-filters:
      - name: RequestRateLimiter
        args:
          # average number of requests per second a client is allowed
          # (tokens added to the token bucket each second)
          redis-rate-limiter.replenishRate: 10
          # maximum number of requests allowed in a single burst
          redis-rate-limiter.burstCapacity: 20
          # SpEL reference to the KeyResolver bean that identifies the client
          key-resolver: "#{@apiKeyResolver}"
      routes:
      - id: example
        uri: http://example.com
        predicates:
        - Path=/**


The KeyResolver should look like this. It is registered as a Spring bean named apiKeyResolver so that the SpEL expression above can find it, and it can hold any custom logic required to distinguish different clients. In this example, it's just the X-API-Key header, but an Authorization header or some sort of session cookie are other common choices.

package com.example;

import org.springframework.cloud.gateway.filter.ratelimit.KeyResolver;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

// Registered as the "apiKeyResolver" bean referenced by the key-resolver SpEL expression
@Component("apiKeyResolver")
public class ApiKeyResolver implements KeyResolver {

    @Override
    public Mono<String> resolve(ServerWebExchange exchange) {
        // Requests without an X-API-Key header resolve to an empty Mono and are rejected by default
        return Mono.justOrEmpty(exchange.getRequest().getHeaders().getFirst("X-API-Key"));
    }
}


With this, you can easily add a rate-limiting capability to your microservice without having to implement this within the service itself.

Adding a Global Namespace to Various Microservices

As services mature, it might become necessary to move from version 1 of an API to version 2. But implementing the new version in the same deployable unit carries a risk: while working on version 2, you might accidentally change something related to version 1. So why not leave version 1 as it is and implement the new version as a separate deployment unit?

A simple configuration can bring these two applications together under a common hostname so that they appear to be one unit of deployment.

spring:
  cloud:
    gateway:
      routes:
      - id: api-v1
        uri: http://v1api.example.com/
        predicates:
        - Path=/v1/**
      - id: api-v2
        uri: http://v2api.example.com/
        predicates:
        - Path=/v2/**


This allows us to access our microservice with the URLs https://api.example.com/v1/* and https://api.example.com/v2/* but have them be deployed and maintained separately.

Scale Subcontexts of Your Services Independently

In the previous tip, we showed how to bring together components that belong together. But this can extend to other scenarios as well. Let's assume our microservice is simply too big to be managed by a single team. We can use the same configuration to unite different logical parts of the service under one common umbrella so that they look like a single service.

spring:
  cloud:
    gateway:
      routes:
      - id: locations
        uri: http://locationapi.example.com/
        predicates:
        - Path=/v1/locations/**
      - id: weather
        uri: http://weatherapi.example.com/
        predicates:
        - Path=/v1/weather/**


This would also allow us to independently scale the different parts of the API as needed.

AB Test Your Service with a Small Customer Group

AB testing is common when developing new applications since it is a good way of testing whether your service meets requirements without having to roll it out completely. AB testing gives only a small group of users access to the new version of a service and asks them whether they like it. One group could be your colleagues within the company network, for example.

spring:
  cloud:
    gateway:
      routes:
      - id: example-a
        uri: http://example.com/service-a
        predicates:
        - Header=X-AB-Test, A
        - RemoteAddr=192.168.1.1/24
      - id: example-b
        uri: http://example.com/service-b
        predicates:
        - Header=X-AB-Test, B
        - RemoteAddr=192.168.10.1/24
      - id: example-default
        uri: http://example.com/service-default
        predicates:
        # catch-all route for every request that does not match the A/B predicates above
        - Path=/**


In this example, the gateway evaluates the routes and their predicates in the order they are defined, and the first route whose predicates all match is selected. This means:

  • Customers coming from the 192.168.1.1/24 network with the X-AB-Test header set to A will be presented with service variant A.
  • Customers coming from the 192.168.10.1/24 network with the X-AB-Test header set to B will be presented with service variant B.
  • All other customers will see the service-default variant, as its catch-all predicate matches every request.

Protect Services by Adding Authentication

This is not really a specific Spring Cloud Gateway feature, but it's also handy in combination with the gateway. Imagine you have a blackbox service and you want to enhance it without having the source code or the permission to change it.

In this case, a gateway as a reverse proxy in front can help add features like authentication.

spring:
  security:
    oauth2:
      resourceserver:
        jwt:
          issuer-uri: https://client.idp.com/oauth2/default
          audiences:
          - api://default
  cloud:
    gateway:
      routes:
      - id: api
        uri: http://api.example.com/
        predicates:
        - Path=/**


As seen here, the gateway part takes care of forwarding traffic to the target, while the Spring Security resource server configuration validates the JWT on every incoming request and rejects calls without a valid token. This requires the spring-boot-starter-oauth2-resource-server dependency on the classpath.

Add Audit Logging to Your Services

Another scenario for enhancing existing services might be adding an audit log to the service.

Want to know who's calling which of your service's operations? Add some audit logging. This needs a bit more implementation work, as I haven't found a ready-made filter for it.

spring:
  cloud:
    gateway:
      routes:
      - id: backend
        uri: http://backend.example.com
        predicates:
        - Path=/**
        filters:
        - name: RequestLogger


The RequestLogger can be implemented as a Spring bean, in the form of a gateway filter factory whose name matches the RequestLogger reference in the route configuration:

package com.example;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cloud.gateway.filter.GatewayFilter;
import org.springframework.cloud.gateway.filter.OrderedGatewayFilter;
import org.springframework.cloud.gateway.filter.factory.AbstractGatewayFilterFactory;
import org.springframework.core.Ordered;
import org.springframework.http.HttpHeaders;
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.stereotype.Component;

// The class name ending in GatewayFilterFactory lets the filter be referenced
// simply as "RequestLogger" in application.yml
@Component
public class RequestLoggerGatewayFilterFactory extends AbstractGatewayFilterFactory<Object> {

    private static final Logger LOGGER = LoggerFactory.getLogger(RequestLoggerGatewayFilterFactory.class);

    public RequestLoggerGatewayFilterFactory() {
        super(Object.class);
    }

    @Override
    public GatewayFilter apply(Object config) {
        // Wrap the filter with the highest precedence so the request is logged
        // before any other filter runs
        return new OrderedGatewayFilter((exchange, chain) -> {
            ServerHttpRequest request = exchange.getRequest();
            LOGGER.info("Request - method: {}, uri: {}, authHeader: {}",
                    request.getMethod(),
                    request.getURI(),
                    request.getHeaders().getFirst(HttpHeaders.AUTHORIZATION));
            return chain.filter(exchange);
        }, Ordered.HIGHEST_PRECEDENCE);
    }
}


In this implementation, the RequestLogger filter logs the HTTP method, Uniform Resource Identifier (URI) and Authorization header of each request using the SLF4J logging framework. The factory is implemented as a Spring @Component, which lets the filter be referenced by name in the filters property of the application.yml file. Wrapping the returned filter in an OrderedGatewayFilter with the highest precedence ensures that it runs first in the filter chain.

Protect Services by a Circuit Breaker

In some cases, rate limiting, as discussed previously, is not enough to operate a service safely, and if response times for your service get too high, it’s sometimes necessary to cut off traffic for a short period of time to let the service recover itself. This is where all the classical resilience patterns can help. A pattern like the circuit breaker has to be implemented outside of the affected service. The Spring Cloud Gateway can also help here.

spring:
  cloud:
    gateway:
      routes:
      - id: slow-service
        uri: http://example.com/slow-service
        predicates:
        - Path=/slow/**
        filters:
        - name: CircuitBreaker
          args:
            name: slow-service
            fallbackUri: forward:/fallback/slow-service
            statusCodes:
            - SERVICE_UNAVAILABLE
      - id: fast-service
        uri: http://example.com/fast-service
        predicates:
        - Path=/fast/**

# Resilience4j backs the CircuitBreaker filter; this requires the
# spring-cloud-starter-circuitbreaker-reactor-resilience4j dependency
resilience4j:
  timelimiter:
    instances:
      slow-service:
        timeoutDuration: 1s              # calls slower than 1,000 ms count as failures
  circuitbreaker:
    instances:
      slow-service:
        slidingWindowSize: 5
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 2
        failureRateThreshold: 50
        waitDurationInOpenState: 10s     # keep the circuit open for 10 seconds


In this example, the CircuitBreaker filter is backed by Resilience4j. Calls to the slow service that take longer than 1,000 milliseconds count as failures, and once the failure rate threshold is reached the circuit opens, cutting off traffic and giving the service 10 seconds to rest before trying to bring it back in. In the meantime, the fallback is used. The fallback can be a simple error message, a "Temporarily Unavailable" status code or maybe some more helpful implementation, depending on the use case.
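
Because the fallbackUri uses a forward: scheme, the gateway application itself must serve that path. A minimal sketch of such a fallback endpoint, with a placeholder class name and message, might look like this:

package com.example;

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SlowServiceFallbackController {

    // Handles requests forwarded by the CircuitBreaker filter's fallbackUri
    @RequestMapping("/fallback/slow-service")
    public ResponseEntity<String> slowServiceFallback() {
        return ResponseEntity
                .status(HttpStatus.SERVICE_UNAVAILABLE)
                .body("The service is temporarily unavailable. Please try again shortly.");
    }
}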

More Creative Ways of Using Spring Cloud Gateway

Like Lego bricks, these building blocks can be combined in endless ways to build whatever is needed. One of the most creative uses I have seen is the following:

A team encountered a challenge when using autoscaling in combination with Java applications. They realized that newly started applications were much slower than those that were already running. While this is normal behavior for Java applications due to the just-in-time (JIT) compiling process, it affects the end-user experience. The team put a Spring Cloud Gateway in front of all these services and configured it to load balance based on average response time, so backends with fast average response times would get more traffic than backend instances with slower average response times. Overall, this allowed freshly started instances to warm up their JIT compiler without negatively affecting overall performance, and it also helped reduce traffic to overloaded instances, leading to better performance for the end user.

As you can see, there are a lot of possibilities where this highly flexible API gateway can be used in a wide range of products. Due to its extensibility, you can easily add new features and capabilities by simply implementing custom filters or predicates and letting the traffic flow just the way you need. That’s why this is one of my favorite tools for shaping microservice landscapes.

The post Spring Cloud Gateway: The Swiss Army Knife of Cloud Development appeared first on The New Stack.

Return of the Monolith: Amazon Dumps Microservices for Video Monitoring https://thenewstack.io/return-of-the-monolith-amazon-dumps-microservices-for-video-monitoring/ Thu, 04 May 2023 14:23:21 +0000 https://thenewstack.io/?p=22707172

A blog post from the engineering team at Amazon Prime Video has been roiling the cloud native computing community with its explanation that, at least in the case of video monitoring, a monolithic architecture has produced superior performance to a microservices and serverless-led approach.

For a generation of engineers and architects raised on the superiority of microservices, the assertion is shocking indeed. In a microservices architecture, an application is broken into individual components, which then can be worked on and scaled independently.

“This post is an absolute embarrassment for Amazon as a company. Complete inability to build internal alignment or coordinated communications,” wrote analyst Donnie Berkholz, who recently started his own industry-analyst firm Platify.

“What makes this story unique is that Amazon was the original poster child for service-oriented architectures,” weighed in Ruby-on-Rails creator and Basecamp co-founder David Heinemeier Hansson, in a blog post Thursday. “Now the real-world results of all this theory are finally in, and it’s clear that in practice, microservices pose perhaps the biggest siren song for needlessly complicating your system. And serverless only makes it worse.”

In the original post, dated March 22, Amazon Prime Senior Software Development Engineer Marcin Kolny explained how moving its video stream monitoring service to a monolithic architecture reduced costs by 90%. It turns out that components from Amazon Web Services hampered scalability and drove up costs.

The Video Quality Analysis (VQA) team at Prime Video initiated the work.

The task was to monitor the thousands of video streams that Prime Video delivered to customers. Originally this work was done by a set of distributed components orchestrated by AWS Step Functions, a serverless orchestration service, and AWS Lambda, the AWS serverless compute service.

In theory, the use of serverless would allow the team to scale each service independently. It turned out, however, that at least for how the team implemented the components, they hit a hard scaling limit at only 5% of the expected load. The costs of scaling up to monitor thousands of video streams would also be unduly expensive, due to the need to send data across multiple components.

Initially, the team tried to optimize individual components, but this did not bring about significant improvements. So, the team moved all the components into a single process, hosting them on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS).

Takeaway

Kolny was careful to mention that the architectural decisions made by the video quality team may not work in all instances.

“Microservices and serverless components are tools that do work at high scale, but whether to use them over monolith has to be made on a case-by-case basis,” he wrote.

To be fair, the industry has been looking to temper the enthusiasm for microservices over the past decade, stressing that the pattern is only a good fit in some cases.

“As with many good ideas, this pattern turned toxic as soon as it was adopted outside its original context, and wreaked havoc once it got pushed into the internals of single-application architectures,” Hansson wrote. “In many ways, microservices is a zombie architecture. Another strain of an intellectual contagion that just refuses to die.”

The IT world is nothing if not cyclical: an architectural trend derided as hopelessly archaic one year can be the new hot thing the following year. Certainly, over the past decade when microservices ruled (and the decade before, when web services did), we’ve heard more than one joke in the newsroom about “monoliths being the next big thing.” Now it may actually come to pass.

The post Return of the Monolith: Amazon Dumps Microservices for Video Monitoring appeared first on The New Stack.

Cloud Native Basics: 4 Concepts to Know  https://thenewstack.io/cloud-native-basics-4-concepts-to-know/ Thu, 27 Apr 2023 16:42:17 +0000 https://thenewstack.io/?p=22706518

To stay competitive, companies must adjust and adapt their technology stack to accelerate their digital transformation. This means engineering teams now experience exponential data growth that is starting to outgrow underlying infrastructure. That requires durable infrastructure that can support rapid data growth and high availability. With cloud native architecture, companies can meet all their availability requirements and effectively store data in real time.

So what is cloud native? Well, cloud native is an approach to building and running applications that takes full advantage of cloud computing technology. If something is “cloud native,” it is designed and coded from the start of the development process to run on a cloud architecture such as Kubernetes.

At its core, cloud native is about designing applications as a collection of microservices, each of which can be deployed independently and scaled horizontally to meet demand. This allows for greater flexibility because developers can update specific services as needed, instead of updating the entire application.

Such agility lets engineering teams rapidly deploy and update applications through agile development, containers and orchestration. It also provides improved scalability because teams can easily spin up containers in response to traffic demand, which maximizes resource usage and reduces cost. Additionally, applications that are distributed across multiple servers or nodes mean that one component’s failure does not bring down the entire system.

The 4 Basic Cloud Native Components

Before your organization implements any sort of cloud native architecture, it’s important to understand its basic components. The four pillars of cloud native are microservices, DevOps, open source standards and containers.

No. 1: Microservices are the foundation of cloud native architecture because they offer several benefits, including scalability, fault tolerance and agility. Microservices are smaller and more focused than monolithic applications, which makes them easier to develop, test and deploy. This allows teams to move faster and respond more quickly to changing business requirements and application needs. Plus, a failure in one microservice does not cause an outage of the entire application. This means that developers can replace or update individual microservices and not disrupt the entire system.

No. 2: DevOps is a set of practices that emphasize collaboration and communication between development and operations teams. Its goal is to deliver software faster and more reliably. DevOps plays a critical role in enabling continuous delivery and deployment of cloud native architecture. DevOps teams collaborate to rapidly test and integrate code changes, and focus on automating as much of the deployment process as possible. Another key aspect of DevOps in a cloud native architecture is the use of Infrastructure as Code (IaC) tools, which allow for declarative configuration of infrastructure resources. DevOps’ focus on CI/CD enables products and features to be released to market faster, improves software quality, ensures that secure coding practices are met, reduces costs for the organization and improves collaboration between the development and operations teams.

No. 3: There are a variety of industrywide open source standards such as Kubernetes, Prometheus and the Open Container Initiative. These cloud native open source standards are important for several reasons:

  • They help organizations avoid vendor lock-in by ensuring that applications and infrastructure are not tied to any particular cloud provider or proprietary technology.
  • Open source standards promote interoperability between different cloud platforms, technologies and organizations to integrate their environments with a wide range of tools and services to meet business needs.
  • Open source standards foster innovation as they allow developers and organizations to collaborate on new projects and coding advancements for cloud native architectures across the industry.
  • Open source standards are developed through a community-driven process, which ensures that the needs and perspectives of a wide range of stakeholders are considered.

No. 4: Containers enable organizations to package applications into a standard format to easily deploy and run on any cloud platform. Orchestration, on the other hand, is the process of managing and automating the deployment, scaling and management of containerized applications (a minimal example manifest follows below). Containers and orchestration help build and manage scalable, portable and resilient applications. This allows businesses to quickly respond to market changes, which gives them a competitive advantage so they can constantly implement value-add features and keep customer-facing services online.
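
To make this concrete, a containerized service is typically described to the orchestrator in a declarative manifest. The sketch below is a minimal, hypothetical Kubernetes Deployment (the name and image are placeholders), which the platform uses to keep the desired number of replicas running.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3                      # the orchestrator keeps three copies running
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: example.com/checkout:1.4.2   # placeholder image
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 250m
            memory: 256Mi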

Chronosphere + Cloud Native 

Cloud native practices offer significant business benefits, including faster time-to-market, greater scalability, improved resilience, reduced costs, and better application agility and flexibility. With cloud native adoption, organizations can improve their software development processes and deliver better products and services to their customers.

When migrating to a cloud native architecture, teams must have observability software to oversee system health. Observability tools provide real-time visibility into system performance that help developers to quickly identify and resolve issues, optimize system performance and design better applications for the cloud.

Built specifically for cloud native environments, Chronosphere provides a full suite of observability tools for your organization to control data cardinality and understand costs with the Chronosphere control plane, and assist engineering teams with cloud native adoption.

The post Cloud Native Basics: 4 Concepts to Know  appeared first on The New Stack.

Kubernetes Evolution: From Microservices to Batch Processing Powerhouse https://thenewstack.io/kubernetes-evolution-from-microservices-to-batch-processing-powerhouse/ Sun, 16 Apr 2023 17:00:54 +0000 https://thenewstack.io/?p=22704735

Kubernetes has come a long way since its inception in 2014.

Initially focused on supporting microservice-based workloads, Kubernetes has evolved into a powerful and flexible tool for building batch-processing platforms. This transformation is driven by the growing demand for machine learning (ML) training capabilities, the shift of high-performance computing (HPC) systems to the cloud, and the evolution towards more loosely coupled mathematical models in the industry.

Recent work by PGS to use Kubernetes to build a compute platform equivalent to the world's seventh-ranked supercomputer, with 1.2 million vCPUs, but running in the cloud on Spot VMs, is a great highlight of this trend.

In its early days, Kubernetes was primarily focused on building features for microservice-based workloads. Its strong container orchestration capabilities made it ideal for managing the complexity of such applications.

However, batch workload users frequently preferred to rely on other frameworks like Slurm, Mesos, HTCondor, or Nomad. These frameworks provided the necessary features and scalability for batch processing tasks, but they lacked the vibrant ecosystem, community support, and integration capabilities offered by Kubernetes.

In recent years, the Kubernetes community has recognized the growing demand for batch processing support and has made significant investments in this direction. One such investment is the formation of the Batch Working Group, which has undertaken several initiatives to enhance Kubernetes’ batch processing capabilities.

The Batch Working Group has built numerous improvements into the Job API, making it more robust and flexible so it can support a wider range of batch processing workloads. The revamped API makes it easier for users to manage batch jobs and brings scalability, performance and reliability enhancements.
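
To make the shape of such a workload concrete, here is a minimal sketch of an Indexed Job manifest of the kind this API serves; the name, image and command are placeholders rather than anything from the Batch Working Group's examples.

apiVersion: batch/v1
kind: Job
metadata:
  name: render-frames
spec:
  completions: 100          # run 100 tasks in total
  parallelism: 10           # at most 10 pods at a time
  backoffLimit: 3           # retry failed pods up to 3 times
  completionMode: Indexed   # each pod receives a stable completion index
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/render-worker:latest   # placeholder image
        # Indexed pods can read their index from the JOB_COMPLETION_INDEX environment variable
        command: ["render-frame"]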

Kueue (https://kueue.sigs.k8s.io/) is a new job scheduler developed by the Batch Working Group, designed specifically for Kubernetes batch processing workloads. It offers advanced features such as job prioritization, backfilling, resource flavor orchestration and preemption, ensuring efficient and timely execution of batch jobs while keeping your resource usage at maximum efficiency.

The team is now working on building its integrations with various frameworks like Kubeflow, Ray, Spark and Airflow. These integrations allow users to leverage the power and flexibility of Kubernetes while utilizing the specialized capabilities of these frameworks, creating a seamless and efficient batch-processing experience.

There are also a number of other capabilities that the group is looking to deliver. These include job-level provisioning APIs in autoscaling, scheduler plugins, node-level runtime improvements and many others.

As Kubernetes continues to invest in batch processing support, it becomes an increasingly competitive option for users who previously relied on other frameworks. There are a number of advantages Kubernetes brings to the table, including:

  1. Extensive Multitenancy Features: Kubernetes provides robust security, auditing, and cost allocation features, making it an ideal choice for organizations managing multiple tenants and heterogeneous workloads.
  2. Rich Ecosystem and Community: Kubernetes boasts a thriving open-source community, with a wealth of tools and resources available to help users optimize their batch-processing tasks.
  3. Managed Hosting Services: Kubernetes is available as a managed service on all major cloud providers. This offers tight integrations with their compute stacks, enabling users to take advantage of unique capabilities, and simplified orchestration of harder-to-use scarce resources like Spot VMs or accelerators. Using these services will result in faster development cycles, more elasticity and lower total cost of ownership.
  4. Compute orchestration standardization and portability: Enterprises can choose a single API layer to wrap their computational resources to mix their batch and serving workloads. They can use Kubernetes to reduce lock-in to a single provider and get the flexibility of leveraging the best of all that the current cloud market has to offer.

Usually, a user's transition to Kubernetes also involves containerization of their batch workloads. Containers themselves have revolutionized the software development process, and for computational workloads they offer a great acceleration of release cycles, leading to much faster innovation.

Containers encapsulate an application and its dependencies in a single, self-contained unit, which can run consistently across different platforms and environments. They eliminate the “it works on my machine” problem. They enable rapid prototyping and faster iteration cycles. If combined with cloud hosting it allows agility that helps HPC and ML-oriented companies innovate faster.

The Kubernetes community still needs to solve a number of challenges, including the need for more advanced controls of the runtime on each host node, and the need for more advanced Job API support. HPC users are accustomed to having more control over the runtime.

Setting up large-scale platforms using Kubernetes on premises still requires a significant amount of skill and expertise. There is currently some fragmentation in the batch processing ecosystem, with different frameworks re-implementing common concepts (like Job, Job Group, Job Queueing)  in different ways. Going forward we’ll see these addressed with each Kubernetes release.

The evolution of Kubernetes from a microservices-focused platform to a powerful tool for batch processing demonstrates the adaptability and resilience of the Kubernetes community. By addressing the growing demand for ML training capabilities and the migration of HPC to the cloud, Kubernetes has become an increasingly attractive option for batch-processing workloads.

Kubernetes’ extensive multitenancy features, rich ecosystem, and managed hosting services on major cloud providers make it a great choice for organizations seeking to optimize their batch-processing tasks and tap into the power of the cloud. If you want to join the Batch Working Group and help contribute to Kubernetes then you can find all the details here. We have regular meetings, a Slack channel and an email group that you can join.

The post Kubernetes Evolution: From Microservices to Batch Processing Powerhouse appeared first on The New Stack.

What Is Container Monitoring? https://thenewstack.io/what-is-container-monitoring/ Wed, 05 Apr 2023 14:31:21 +0000 https://thenewstack.io/?p=22704515

Container monitoring is the process of collecting metrics on microservices-based applications running on a container platform. Containers are designed to spin up code and shut down quickly, which makes it essential to know when something goes wrong as downtime is costly and outages damage customer trust.

Containers are an essential part of any cloud native architecture, which makes it paramount to have software that can effectively monitor and oversee container health and optimize resources to ensure high infrastructure availability.

Let’s take a look at the components of container monitoring, how to select the right software and current offerings.

Benefits and Constraints of Containers

Containers provide IT teams with a more agile, scalable, portable and resilient infrastructure. Container monitoring tools are necessary, as they let engineers resolve issues more proactively, get detailed visualizations, access performance metrics and track changes. As engineers get all of this data in near-real time, there is good potential to reduce mean time to repair (MTTR).

Engineers must be aware of the limitations of containers: complexity and changing performance baselines. While containers can spin up quickly, they can increase infrastructure sprawl, which means greater environmental complexity. It also can be hard to define baseline performance as containerized infrastructure consistently changes.

Container monitoring must be specifically suited for the technology; legacy monitoring platforms, designed for virtualized environments, are inadequate and do not scale well with container environments. Cloud native architectures don’t rely on dedicated hardware like virtualized infrastructure, which changes monitoring requirements and processes.

How Container Monitoring Works

A container monitoring platform uses logs, tracing, notifications and analytics to gather data.

What Does Container Monitoring Data Help Users Do?

It allows users to:

  • Know when something is amiss
  • Triage the issue quickly
  • Understand the incident to prevent future occurrences

The software uses these methods to capture data on memory utilization, CPU use, CPU limits and memory limit — to name a few.

Distributed tracing is an essential part of container monitoring. Tracing helps engineers understand containerized application performance and behavior. It also provides a way to identify bottlenecks and latency problems, how changes affect the overall system and what fixes work best in specific situations. It’s very effective at providing insights into the path taken by an application through a collection of microservices when it’s making a call to another system.

More comprehensive container monitoring offerings account for all stack layers. They can also produce text-based error data such as “container restart” or “could not connect to database” for quicker incident resolution. Detailed container monitoring means users can learn which types of incidents affect container performance and how shared computing resources connect with each other.

How Do You Monitor Container Health?

Container monitoring requires multiple layers throughout the entire technology stack to collect metrics about the container and any supporting infrastructure, much like application monitoring. Engineers should make sure they can use container monitoring software to track the cluster manager, cluster nodes, the daemon, container and original microservice to get a full picture of container health.

For effective monitoring, engineers must create a connection across the microservices running in containers. Instead of using service-to-service communication for multiple independent services, engineers can implement a service mesh to manage communication across microservices. Doing so allows users to standardize communication among microservices, control traffic, streamline the distributed architecture and get visibility of end-to-end communication.

How to Select a Container Monitoring Tool

In the container monitoring software selection process, it’s important to identify which functions are essential, nice to have or unnecessary. Tools often include these features:

  • Alerts: Notifications that provide information to users about incidents when they occur.
  • Anomaly detection: A function that lets users have the system continuously oversee activity and compare against programmed baseline patterns.
  • Architecture visualization: A graphical depiction of services, integrations and infrastructure that support the container ecosystem.
  • Automation: A service that performs changes to mitigate container issues without human intervention.
  • API monitoring: A function that tracks containerized environment connections to identify anomalies, traffic and user access.
  • Configuration monitoring: A capability that lets users oversee rule sets, enforce policies and log changes within the environment.
  • Dashboards and visualization: The ability to present container data visually so users can quickly see how the system is performing.

Beyond specific features and functions, there are also user experience questions to ask about the software:

  • How quickly and easily can users add instrumentation to code?
  • What is the process for alarm, alert and automation?
  • Can users see each component and layer to isolate the source of failure?
  • Can users view entire application performance for both business and technical organizations?
  • Is it possible to proactively and reactively correlate events and logs to spot abnormalities?
  • Can the software analyze, display and alarm on any set of acquired metrics?

The right container monitoring software should make it easy for engineers to create alarms and automate actions when the system reaches certain resource usage thresholds.
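
As a rough illustration, in a Prometheus-compatible setup that scrapes cAdvisor and kube-state-metrics, such a threshold alarm might be sketched like this; the metric names are standard, but the 90% threshold, group name and alert name are arbitrary choices for the example.

groups:
- name: container-alerts
  rules:
  - alert: ContainerMemoryNearLimit
    expr: |
      (
        sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
        /
        sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
      ) > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is above 90% of its memory limit"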

When it comes to container management and monitoring, the industry offers a host of open source and open-source-managed offerings: Prometheus, Kubernetes, Jaeger, Linkerd, Fluentd and cAdvisor are a few examples.

Ways Chronosphere Can Monitor Containers 

Chronosphere’s offering is built for cloud native architectures and Kubernetes to help engineering teams that are collecting container data at scale. Chronosphere’s platform can monitor all standard data ingestion for Kubernetes clusters, such as pods and nodes, using standard ingestion protocols such as Prometheus.

Container monitoring software generates a lot of data. When combined with cloud native environment metrics, this creates a data overload that outpaces infrastructure growth. This makes it important to have tools that can help refine what data is useful so that it gets to the folks who need it the most and ends up on the correct dashboards.

The Control Plane can help users fine-tune which container metrics and traces the system ingests. Plus, with the Metrics Usage Analyzer, users are put back in control of which container observability data is being used and, more importantly, it points out when data is not used. Users decide which data is important after ingestion with the Control Plane, so their organization avoids excessive costs across its container and services infrastructure.

To see how Chronosphere can help you monitor your container environments, contact us for a demo today. 

The post What Is Container Monitoring? appeared first on The New Stack.

How to Fix Kubernetes Monitoring https://thenewstack.io/how-to-fix-kubernetes-monitoring/ Fri, 31 Mar 2023 17:00:32 +0000 https://thenewstack.io/?p=22703174

It’s astonishing how much data is emitted by Kubernetes out of the box. A simple three-node Kubernetes cluster with Prometheus will ship around 40,000 active series by default! Do we really need all that data?

It’s time to talk about the unspoken challenges of monitoring Kubernetes. The difficulties include not just the bloat and usability of metric data, but also the high churn rate of pod metrics, configuration complexity when running multiple deployments, and more.

This post is inspired by my recent episode of OpenObservability Talks, in which I spoke with Aliaksandr Valialkin, CTO of VictoriaMetrics, a company that offers the open source time series database and monitoring solution by the same name.

Let’s unpack Kubernetes monitoring.

A Bloat of Out-of-the-Box Default Metrics

One of the reasons that Prometheus has become so popular is the ease of getting started collecting metrics. Most of the tools and projects expose metrics in OpenMetrics format, so you just need to turn that on, and then install the Prometheus server to start scraping those metrics.

Prometheus Operator, the standard installation path, installs additional components for monitoring Kubernetes, such as kube-state-metrics, node-exporter and cAdvisor. Using the default Prometheus Operator to monitor even a small 3-node Kubernetes cluster results in around 40,000 different metrics! That’s the starting point, before even adding any applicative or custom metrics.

And this number keeps growing at a fast pace. Valialkin shared that since 2018, the amount of metrics exposed by Kubernetes has increased by three and a half times. This means users are flooded with monitoring data from Kubernetes. Are all these metrics really needed?

Not at all! In fact, the vast majority of these metrics aren’t used anywhere. Valialkin said that 75% of these metrics are never put to use in any dashboards or alert rules. I see quite a similar trend among Logz.io users.

The Metrics We Really Need

Metrics need to be actionable. If you don’t act on them, then don’t collect them. This is even more evident with managed Kubernetes solutions, in which end users don’t manage the underlying system anyway, so many of the exposed metrics are simply not actionable for them.

This drove us to compose a curated set of recommended metrics, essential Kubernetes metrics to be collected whether from self-hosted Kubernetes or from managed Kubernetes services such as EKS, AKS and GKE. We share our curated sets publicly as part of our Helm charts on GitHub (based on OpenTelemetry, kube-state-metrics and prometheus-node-exporter charts). VictoriaMetrics and other vendors have similarly created their curated lists.

However, we cannot rely on individual vendors to create such sets. And most end-users aren’t acquainted enough with the various metrics to determine themselves what they need, so they look for the defaults, preferring the safest bet of collecting everything so as not to lack important data later.

Rather, we should come together as the Kubernetes and cloud native community, vendors and end-users alike, and join forces to define a standard set of golden metrics for each component. Valialkin also believes that “third-party monitoring solutions should not install additional components for monitoring Kubernetes itself,” referring to additional components such as kube-state-metrics, node-exporter and cadvisor. He suggests that “all these metrics from such companions should be included in Kubernetes itself.”

I'd also add that we should look into removing unused labels. Do we really need the per-network-card or per-CPU-core details from prometheus-node-exporter? Each label adds a dimension to the metric and multiplies the number of time series.
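
One pragmatic way to act on this today is to drop series you know you will never query at scrape time. The sketch below uses Prometheus metric_relabel_configs; the job name and the regex of "unused" metrics are only examples and would differ per organization.

scrape_configs:
- job_name: node-exporter
  kubernetes_sd_configs:
  - role: endpoints
  metric_relabel_configs:
  # Drop whole metric families that no dashboard or alert rule uses (example regex)
  - source_labels: [__name__]
    regex: "node_softnet_.*|node_hwmon_.*|node_scrape_collector_.*"
    action: drop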

Microservices Proliferation

Kubernetes has made it easy to package, deploy and manage complex microservices architectures at scale with containers. The growth in the number of microservices results in an increased load on the monitoring system: Every microservice exposes system metrics, such as CPU, memory, and network utilization. On top of that, every microservice exposes its own set of application metrics, depending on the business logic it implements. In addition, the networking between the microservices needs to be monitored as well for latency, RPS and similar metrics. The proliferation of microservices generates a significant amount of telemetry data, which can get quite costly.

High Churn Rate of Pods

People move to Kubernetes to be more agile and release more frequently. This results in frequent deployments of new versions of microservices. With every deployment in Kubernetes, new instances of pods are created and deleted, in what is known as “pod churn.” The new pod gets a unique identifier, different from previous instances, even if it is essentially a new version of the same service instance.

I'd like to pause here and clarify an essential point about metrics. Metrics data is time series data. A time series is uniquely defined by the metric name and a set of label values. If one of the label values changes, then a new time series is created.

Back to our ephemeral pods, many practitioners use the pod name as a label within their metrics time series data. This means that with every new deployment and the associated pod churn, the old time series stops receiving new samples and is effectively terminated, while a new time series is initiated, which causes discontinuity in the logical metric data sequence.

Kubernetes workloads typically have high pod churn rates due to frequent deployments of new versions of a microservice, as well as autoscaling of pods based on incoming traffic, or resource constraints on the underlying nodes that require eviction and rescheduling of pods. The discontinuity of metric time series makes it difficult to apply continuous monitoring on the logical services and analyze trends over time on their respective metrics.

A potential solution can be to use the ReplicaSet or StatefulSet ID for the metric label, as these remain fixed as the set adds and removes pods. Valialkin, however, refers to this as somewhat of a hack, saying we should push as a community to have first-level citizen nomenclature in Kubernetes monitoring to provide consistent naming.
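
A rough sketch of that idea with Prometheus recording rules is shown below. It keys the aggregated series on labels that survive pod churn, here the namespace and container name; the rule name and expression are illustrative rather than a standard.

groups:
- name: pod-churn-smoothing
  rules:
  # The container name stays stable across rollouts, so summing away the pod
  # label yields one continuous series per namespace and container.
  - record: namespace_container:container_cpu_usage_seconds:rate5m
    expr: |
      sum by (namespace, container) (
        rate(container_cpu_usage_seconds_total{container!=""}[5m])
      )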

Configuration Complexity with Multiple Deployments

Organizations typically run hundreds and even thousands of different applications. When these applications are deployed on Kubernetes, this results in hundreds and thousands of deployment configurations, and multiple Prometheus scrape_config configurations defining how to scrape (pull) these metrics, which rules, filters and relabelings to apply, which ports to scrape and other settings. Hundreds and thousands of different configurations can quickly become unmanageable at scale. Furthermore, they can burden the Kubernetes API server, which needs to serve requests on all these different configurations.

As a community, we can benefit from a standard for service discovery of deployments and pods in Kubernetes on top of the Prometheus service discovery mechanism. In Valialkin’s vision, “in most cases Prometheus or some other monitoring system should automatically discover all the deployments, all the pods which need to be scraped to collect metrics without the need to write custom configuration per each deployment. And only in some exceptional cases when you need to customize something, then you can write these custom definitions for scraping.”
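
Until such a standard exists, a widely used community convention, not a built-in Kubernetes or Prometheus standard, is annotation-driven discovery: one generic scrape job keeps only the pods that opt in via annotations, roughly as sketched below, instead of one scrape_config per deployment.

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep only pods annotated with prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: "true"
    action: keep
  # Let pods override the metrics path via the prometheus.io/path annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    regex: (.+)
    target_label: __metrics_path__
  # Let pods declare the port to scrape via the prometheus.io/port annotation
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__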

Want to learn more? Check out the OpenObservability Talks episode: “Is Kubernetes Monitoring Flawed?” On Spotify, Apple Podcasts, or other podcast apps.

The post How to Fix Kubernetes Monitoring appeared first on The New Stack.

Saga Without the Headaches https://thenewstack.io/making-the-saga-pattern-work-without-all-the-headaches/ Fri, 24 Mar 2023 17:00:11 +0000 https://thenewstack.io/?p=22703126

Part 1: The Problem with Sagas

We’ve all been at that point in a project when we realize that our software processes are more complex than we thought. Handling this process complexity has traditionally been painful, but it doesn’t have to be.

A landmark software development playbook called the Saga design pattern has helped us to cope with process complexity for over 30 years. It has served thousands of companies as they build more complex software to serve more demanding business processes.

This pattern’s downside is its higher cost and complexity.

In this post, we’ll first pick apart the traditional way of coding the Saga pattern to handle transaction complexity and look at why it isn’t working. Then, I’ll explain in more depth what happens to development teams that don’t keep an eye on this plumbing code issue. Finally, I’ll show you how to avoid the project rot that ensues.

Meeting the need for durable execution

The Saga pattern emerged to cope with a pressing need in complex software processes: durable execution. When the transactions you’re writing make a single, simple database call and get a quick response, you don’t need to accommodate anything outside that transaction in your code. However, things get more difficult when transactions rely on more than one database — or indeed on other transaction executions — to get things done.

For example, an application that books a car ride might need to check that the customer’s account is in good standing, then check their location, then examine which cars are in that area. Then it would need to book the ride, notify both the driver and the customer, then take the customer’s payment when the ride is done, writing everything to a central store that updates the driver and customer’s account histories.

Processes like these, which chain dependent transactions, need to keep track of data and state throughout the entire sequence of events. They must be able to survive problems that arise in the transaction flow. If a transaction takes more time than expected to return a result (perhaps a mobile connection falters for a moment or a database hits peak load and takes longer to respond), the software must adapt.

It must wait for the necessary transaction to complete, retrying until it succeeds and coordinating other transactions in the execution queue. If a transaction crashes before completion, the process must be able to roll back to a consistent state to preserve the integrity of the overall application.

This is difficult enough in a use case that requires a response in seconds. Some applications might execute over hours or days, depending on the nature of the transactions and the process they support. The challenge for developers is maintaining the state of the process across the period of execution.

This kind of reliability — a transaction that cannot fail or time out — is known as a strong execution guarantee. It is the opposite of a volatile execution, which can cease to exist at any time without completing everything that it was supposed to do. Volatile executions can leave the system in an inconsistent state.

What seemed simple at the outset turns into a saga with our software as the central character. Developers have to usher it through multiple steps on its journey to completion, ensuring that its state is preserved if something happens.

Understanding the Saga pattern

The Saga pattern provides a road map for that journey. First discussed in a 1987 paper, this pattern brings durable execution to complex processes by enabling them to communicate with each other. A central controller manages that service communication and transaction state.

The pattern offers developers the three things they need for durable execution. It can string together transactions to support long-running processes and guarantee their execution by retrying in the event of failure. It also offers consistency by ensuring that either a process completes entirely or doesn’t complete at all.

However, there’s a heavy price to pay for using the Saga pattern. While there’s nothing wrong with the concept in principle, everything depends on the implementation. Developers have traditionally had to code the pattern themselves as part of their application. That makes its design, deployment and maintenance so difficult that the application can become a slave to the pattern, which ends up taking most of the developers’ time.

Eventually, developers are spending more time maintaining the plumbing code as they add more transactions. What was a linear development workload now becomes exponential. The time spent on development increases disproportionately with every new change.

Coding the Saga pattern manually involves breaking up a coherent process into chunks and then wrapping them with code that manages their operation, including retrying them if they fail. The developer must also manage the scheduling and coordination of these tasks across different processes that depend on each other. They must juggle databases, queues and timers to manage this inter-process communication.
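
To make that concrete, here is a deliberately naive sketch of what such hand-rolled plumbing tends to look like; the class and method names are hypothetical, and a real implementation would also need to persist its state so retries and rollbacks survive a crash.

package com.example;

import java.util.ArrayList;
import java.util.List;

// A naive, hand-rolled saga: each step is retried a few times, and on failure
// every previously completed step is compensated in reverse order.
public class RideBookingSaga {

    public interface Step {
        void execute() throws Exception;

        void compensate();
    }

    private final List<Step> completed = new ArrayList<>();

    public void run(List<Step> steps) {
        for (Step step : steps) {
            if (!executeWithRetry(step, 3)) {
                rollback();
                throw new IllegalStateException("Saga failed and was rolled back");
            }
            completed.add(step);
        }
    }

    private boolean executeWithRetry(Step step, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                step.execute();
                return true;
            } catch (Exception e) {
                // Real plumbing code needs queues, timers and a persisted state
                // store here so that retries survive process crashes.
            }
        }
        return false;
    }

    private void rollback() {
        for (int i = completed.size() - 1; i >= 0; i--) {
            completed.get(i).compensate();
        }
    }
}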

Increasing the volume of software processes and dependencies requires more developer hours to create and maintain the plumbing infrastructure, which in turn drives up application cost. This increasing complexity also makes it more difficult for developers to prove the reliability and security of their code, which carries implications for operations and compliance.

Abstraction is the key

Abstraction is the key to retaining the Saga pattern’s durable execution benefits while discarding its negative baggage. Instead of leaving developers to code the pattern into their applications, we must hide the transaction sequencing from them by abstracting it to another level.

Abstraction is a well-understood process in computing. It gives each application the illusion that it owns everything, eliminating the need for the developer to accommodate it. Virtualization systems do this with the help of a hypervisor. The TCP stack does it by retrying network connections automatically so that developers don’t have to write their own handshaking code. Relational databases do it when they roll back failed transactions invisibly to keep them consistent.

Running a separate platform to manage durable execution brings these benefits to transaction sequencing by creating what Temporal calls a workflow. Developers still have control over workflows, but they need not concern themselves with the underlying mechanics.

Abstracting durable execution to workflows brings several benefits aside from ease of implementation. A tried-and-tested workflow management layer makes complex transaction sequences less likely to fail than home-baked ad-hoc plumbing code. Eliminating thousands of lines of custom code for each project also makes the code that remains easier to maintain and reduces technical debt.

Developers see these benefits most clearly when debugging. Root cause analysis and remediation get exponentially harder when you’re having to mock and manage plumbing code, too. Workflows hide an entire layer of potential problems.

Productive developers are happy developers

Workflow-based durable execution boosts the developer experience. Instead of disappearing down the transaction management rabbit hole, they get to work on what’s really important to them. This improves morale and is likely to help retain them. With the number of open positions for software engineers in the US expected to grow by 25% between 2021 and 2031, competition for talent is intense. Companies can’t afford much attrition.

Companies have been moving in the right direction in their use of the Saga pattern to handle context switching in software processes. However, they can go further by abstracting these Saga patterns away from the application layer to a separate service. Doing this well could move software maturity forward years in an organization.

Part 2: Avoiding the Tipping Point

In the first half of this post, I talked about how burdensome it is to coordinate transactions and preserve the state at the application layer. Now, we’ll talk about how that sends software projects off-course and what you can do about it.

Any software engineering project of reasonable size runs into the need for durable execution.

Ideally, the cost and time involved in creating new software features would be consistent and calculable. Coding for durability shatters that consistency. It makes the effort involved with development look more like a hockey-stick curve than a linear slope.

The tipping point is where the time and effort spent on coding new features begins its upward spike. It’s when the true extent of managing long-term transactions becomes clear. I’ll describe what it is, why it happens and why hurriedly writing plumbing code isn’t the right way to handle it.

What triggers the tipping point

Life before the tipping point is generally good because the developer experience is linear. The application framework that the developers are using supports each new feature that the developer adds with no nasty surprises. That enables the development team to scale up the application with predictable implementation times for new features.

This linear scale works as long as developers make quantitative changes, adding more of the same thing. Things often break when someone has to make a change that isn’t like the rest and discovers a shortcoming in the application framework. This is usually a qualitative change that demands a change in the way the application works.

This change might involve calls to multiple databases, or reliance on multiple dependent transactions for the first time. It might call on a software process that takes an unpredictable amount of time to deliver a result.

The change might not be enough to force the tipping point at first, but life for developers will begin to change. They might write the plumbing code to manage the inter-process communication in a bid to guarantee execution and keep transactions consistent. But this is just the beginning. That code took time to write, and now, developers must expand it to cope with every new qualitative change that they introduce.

They’ll keep doing that for a while, but the rot gets worse. Eventually, developers are spending more time maintaining the plumbing code as they add more transactions. What was a linear development workload now becomes exponential. The time spent on development increases disproportionately with every new change.
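To make that burden concrete, here is a minimal, library-free Python sketch of the kind of hand-rolled compensation (“plumbing”) code described above. The step names are hypothetical, and a production version would also need persistence, retries and timeouts, which is exactly the code that keeps growing:

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[Step]) -> None:
    """Run each action in order; if one fails, undo the completed ones in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):
                try:
                    undo()
                except Exception:
                    pass  # real plumbing code needs retries, logging and alerting here
            raise


# Hypothetical usage: each lambda would call a real service in practice.
run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"), lambda: print("refund payment")),
    (lambda: print("schedule shipment"), lambda: print("cancel shipment")),
])
```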

The “Meeting of Doom”

Some people don’t see the tipping point coming until it happens. Junior developers, without the benefit of experience, often wander into it unaware. Senior developers are often in the worst position of all; they know the tipping point is coming, but politics often renders them powerless to do anything other than wait and pick up the pieces.

Eventually, someone introduces a change that surfaces the problem. It is the straw that breaks the camel’s back. Perhaps a change breaks the software delivery schedule and someone with influence complains. Then, someone calls the “Meeting of Doom.”

This meeting is where the team admits that their current approach is unsustainable. The application has become so complex that these ad hoc plumbing changes are no longer supporting project schedules or budgets.

This realization takes developers through the five stages of grief:

  • Denial. This will have been happening for a while. People try to ignore the problem, arguing that it’ll be fine to continue as they are. This gives way to…
  • Anger. Someone in the meeting explains that this will not be fine. Their budgets are broken; their schedules are shot; and the problem needs fixing. They won’t take no for an answer. So people try…
  • Bargaining. People think of creative ways to prop things up for longer with more ad hoc changes. But eventually, they realize that this isn’t scalable, leading to…
  • Depression. Finally, developers realize that they’ll have to make more fundamental architectural changes. Their ad hoc plumbing code has taken on a life of its own and the tail is now wagging the dog. This goes hand in hand with…
  • Acceptance. Everyone leaves the meeting with a sense of doom and knows that nothing is going to be good after this. It’s time to cancel a few weekends and get to work.

That sense of doom is justified. As I explained, plumbing code is difficult to write and maintain, and from the tipping point onward that burden only grows. Suddenly, the linear programming experience developers are used to evaporates. They’re spending more time writing transaction management code than they are working through software features on the Kanban board. That leads to developer burnout and, ultimately, attrition.

Preventing the tipping point

How can we avoid this tipping point, smoothing out the hockey-stick curve and preserving a linear ratio between software features and development times? The first suggestion is usually to accept defeat this time around and pledge to write the plumbing code from the beginning next time or reuse what you’ve already cobbled together.

That won’t work. It leaves us with the same problem, which is that the plumbing code will ultimately become unmanageable. Rather than a tipping point, the development would simply lose linearity earlier. You’d create a more gradual decline into development dysphoria beginning from the project’s inception.

Instead, the team needs to do what it should have done at the beginning: make a major architectural change that supports durable execution systematically.

We’ve already discussed abstraction as the way forward. Begin by abstracting the plumbing functions from the application layer into their own service layer before you write one more line of project code. That will unburden developers by removing the non-linear work, enabling them to scale and keeping the time needed to implement new features constant.

This abstraction maintains the linear experience for programmers. They’ll always feel in control of their time, and certain that they’re getting things done. They will no longer need to consider strategic decisions around tasks such as caching and queuing. Neither will they have to worry about bolting together sprawling sets of software tools and libraries to manage those tasks.

The project managers will be just as happy as the developers with an abstracted set of transaction workflows. Certainty and predictability are key requirements for them, which makes the tipping point with its break from linear development especially problematic. Abstracting the task of transaction sequencing removes the unexpected developer workload and preserves that linearity, giving them the certainty they need to meet scheduling and budgetary commitments.

Tools that support this abstraction and the transformation of plumbing code into manageable workflows will help you preserve predictable software development practices, eliminating the dreaded tipping point and saving you the stress of project remediation. The best time to deploy these abstraction services is before your project begins, but even if your team is in crisis right now, they offer a way out of your predicament.

 

The post Saga Without the Headaches appeared first on The New Stack.

]]>
What Is Microservices Architecture? https://thenewstack.io/microservices/what-is-microservices-architecture/ Thu, 23 Feb 2023 12:22:35 +0000 https://thenewstack.io/?p=22701069

Microservices are a hot topic when people talk about cloud native application development, and for good reason. Microservices architecture is

The post What Is Microservices Architecture? appeared first on The New Stack.

]]>

Microservices are a hot topic when people talk about cloud native application development, and for good reason.

Microservices architecture is a structured approach to deploying a collection of self-contained, independent services in an organization. It is a game changer compared with some past application development methodologies, allowing development teams to work independently and at cloud native scale.

Let’s dive into the history of application development, characteristics of microservices and what that means for cloud native observability.

Microservices Architecture vs. Monolithic Architecture

The easiest way to understand what microservices architecture does is to compare it to monolithic architecture.

Monolithic Architecture

Monolithic architecture, as the prefix “mono” implies, is software that is written with all components combined into a single executable. There are typically three advantages to this architecture:

  • Simple to develop: Many development tools support the creation of monolithic applications.
  • Simple to deploy: Deploy a single file or directory to your runtime.
  • Simple to scale: Scaling the application is easily done by running multiple copies behind a load balancer.

Microservices Architecture

Microservices are all about small, self-contained services. The advantages are that these services are highly maintainable and testable, loosely coupled with other services, independently deployable and developed by small, highly productive developer teams. Some service types for a microservices architecture may include the following examples.

  • Client
    Client-side services are used for collecting client requests, such as requests to search, build, etc.
  • Identity Provider
    Before being sent to an API gateway, requests from clients are processed by an identity provider, which is a service that creates, authenticates and manages digital identity information.
  • API Gateway
    An API (application programming interface) is the intermediary that allows two services to talk to each other. In the case of microservices, an API gateway is the entry point: It accepts requests from clients, calls the backend services needed to fulfill those requests and returns the correct response.
  • Database
    In microservices, each microservice usually has its own database, which is updated through an API service.
  • Message Formatting
    Services communicate in two ways: synchronously and asynchronously (see the sketch after this list).
    • Synchronous messaging is used when a client waits for a response; it is typically implemented as REST (representational state transfer) APIs over HTTP.
    • Asynchronous messaging is for scenarios where clients don’t wait for an immediate response from services. Common protocols for this type of messaging include AMQP, STOMP and MQTT.
  • Static Content
    Once microservices have finished communicating, any static content is sent to a cloud-based storage system that can directly deliver that content to the client.
  • Management
    Having a management element in a microservices structure can help monitor services and identify failures to keep everything running smoothly.
  • Service Discovery
    For both client-side and server-side services, service discovery is a tool that locates devices and services on a specific network.
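As a rough illustration of the two messaging styles, here is a short Python sketch; the endpoint, broker address and queue name are hypothetical placeholders, and it assumes the standard library’s urllib for the synchronous call and the pika AMQP client for the asynchronous publish:

```python
import json
import urllib.request

import pika

# Synchronous: the client blocks until the service responds (REST over HTTP).
with urllib.request.urlopen("https://api.example.com/orders/42", timeout=5) as resp:
    order = json.load(resp)

# Asynchronous: publish an event to a broker and carry on without waiting
# for whichever service eventually consumes it.
connection = pika.BlockingConnection(pika.ConnectionParameters("broker.example.com"))
channel = connection.channel()
channel.queue_declare(queue="order-events", durable=True)
channel.basic_publish(exchange="", routing_key="order-events",
                      body=json.dumps({"order_id": 42, "status": "created"}))
connection.close()
```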

The monolithic model is more traditional and certainly has some pros, but microservices are becoming more and more common in cloud native environments. The benefits below highlight the most defining differences between the two models.

Benefits of a Microservices Architecture

Microservices are showing lots of success in modern cloud native businesses. Businesses benefit from using this structure for a number of reasons.

Flexibility: Because each service is independent, the programming language can vary between all microservices (although it’s prudent to standardize as much as possible on one modern programming language).

Faster deployment: Not only are microservices easier to understand for most developers, but they’re also faster to deploy. Change one thing in the code for a monolithic structure, and it affects everything across the board. Microservices are independently deployed and don’t affect other services.

Scalability: If you run everything through one application, it’s hard to manage the massive scale of services as an application grows. Instead, a microservices architecture allows a team to modify the capabilities of an individual service instead of redeploying an entire system. This even applies to the scale of each service within an application. If a payment service, for example, is seeing more demand than other services, that microservice can be scaled as needed.

Isolated failures: With a monolithic architecture, a failure in one service can compromise the entire application. Microservices isolate each component, so if one service fails or has issues, the rest of the application can still function, although likely in a degraded state.

Ultimately, microservices architecture saves teams time, allows more granular modifications to each service and scales with the needs of each individual service and of the application as a whole.

Challenges

Microservices are incredibly useful, but they do come with their own challenges.

Increased complexity and dependency issues: The design of microservices favors individual services, but that also increases the number of users, the variety of user behavior and the interactions between services. This makes tracing individual traffic from frontend to backend more difficult and can cause dependency issues between services. Because microservices are so independent of each other, it’s not always easy to manage compatibility and other effects caused by the different versions and workloads in play.

Testing: Testing can be difficult across the entire application because each executed service route can vary so much. Flexibility is a great element of microservices, but the same diversity can be hard to observe consistently in a distributed deployment.

Managing communication systems: Even though services can easily communicate with each other, developers have to manage the architecture through which services communicate. APIs play an essential role in reliable and effective communication between services, which typically requires an API gateway. These gateways are helpful, but they can fail, introduce dependency problems or become a communication bottleneck.

Observability: Because microservices are distributed, monitoring and observability can be a challenge for developers and observability teams. It’s important to consider a monitoring system or platform to help oversee, troubleshoot and observe an entire microservices system.

Should You Adopt Microservices Architecture?

Sometimes a monolithic structure may be the way to go, but microservices architecture is growing for a reason. Ask yourself:

What Environment Are My Applications Working In?

Because most cloud native architectures are designed for microservices, microservices are the way to go if you want to get the full benefits of a cloud native environment. With applications moving to cloud-based settings, application development favors microservices architecture and will continue to do so moving forward.

How Does Your Team Function?

In a microservices architecture, the codebase can typically be managed by smaller teams. Still, development teams also need the tools to identify, monitor and manage the activity of different components, including if and how they interact with each other. Teams also need to determine which services are reusable so they don’t have to start from scratch when building a new service.

How Flexible Are Your Applications?

If you have to consistently modify your application or make adjustments, a microservices approach is going to be best because you can edit individual services instead of the entire monolithic system.

How Many Services Do You Have and Will There Be More?

If you have a lot of services or plan to continue growing — in a cloud-based platform, you should expect growth — then a microservices architecture is also ideal because monolithic application software doesn’t scale well.

Chronosphere Scales with Microservices

Microservices architecture is designed to make application software development in a cloud native environment simpler, not more difficult. With the challenges that come along with managing each individual service, it’s even more critical for teams to have observability solutions that scale with the growth of their business.

Chronosphere specializes in observability for a cloud native world so your team has greater control, reliable functionality and flexible scalability at every level of your application development. Monitoring microservices with a trustworthy and flexible platform greatly lowers risks, helps anticipate failures and empowers development teams to understand their data.

The post What Is Microservices Architecture? appeared first on The New Stack.

]]>
Java’s History Could Point the Way for WebAssembly https://thenewstack.io/webassembly/javas-history-could-point-the-way-for-webassembly/ Thu, 12 Jan 2023 11:00:28 +0000 https://thenewstack.io/?p=22697040

It’s hard to believe that it’s been over 20 years since the great dotcom crash happened in 2001, which continues

The post Java’s History Could Point the Way for WebAssembly appeared first on The New Stack.

]]>

It’s hard to believe that it’s been over 20 years since the great dotcom crash happened in 2001, which continues to serve as a harbinger of potential doom whenever the tech cycle is on a downward path. I remember quite distinctly hanging out with either unemployed or underemployed folks in the IT field shortly after the great crash in 2001. We were doing just that: hanging out with time on our hands.

During that time, one day in Brookdale Park in Montclair, N.J., just outside New York City, one of my friends was sitting on a park bench pounding away on his laptop, and he said there was this really cool thing for website creation called Java. It had actually been around for a long time by then, but he described how amazing it was that you could write Java code and deploy it wherever you wanted on websites. Java, of course, went on to play a key role in transforming the user experience on websites compared to the days of the 1990s, when HTML code provided the main elements of website design. Sure, why not, I’ll check it out, I said. And the rest is history: Java secured its place not only in web development, but across IT infrastructure.

Flash forward to today: There’s this thing called WebAssembly, or Wasm, which offers a very similar claim of write once, deploy anywhere. Not only for the web applications it was originally created for, but across networks and on anything that runs on a CPU.

Remind you of something?

“Wasm could be Java’s grandchild that follows the same broad principle of allowing developers to run the same code on any device, but at the same time Wasm fixes the fundamental issues that prevented the original vision of “Java on any device” from becoming reality,” Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack.

The Simple Case

Wasm has proven to be very effective in a number of different hardware environments, ranging from server-side to edge deployments and IoT devices, or wherever code can be run directly on a CPU. The code runs bundled in a neatly packaged Wasm executable that can be compared to a container or even a mini operating system, and it runs with significantly less — if any — configuration required for the code and the target. Essentially, applications can be deployed far beyond the confines of the web browser environment. The developer thus creates the code and deploys it. It can really be that simple, especially when PaaS solutions are used.

Most importantly, Wasm enables true “code once, deploy anywhere” capabilities, where the same code runs on any supported device without the need to recompile, Volk noted. “Wasm is not tied to one development language but supports Python and many other popular languages. Developers can run their code on shared environments on servers and other devices without having to worry about the underlying Kubernetes cluster or hypervisor,” Volk said. “They also receive unified logging and tracing for their microservices, right out of the box. This simplified developer experience is another big plus compared to Java.”

During a recent KubeCon + CloudNativeCon conference, a talk was given about using Wasm to replace Kafka for lower-latency data streaming. At the same time, Java continues to be used for networking apps even though alternatives can offer better performance, because developers simply like working with Java. So, even if Wasm’s runtime performance were not great — which it is — developers might still adopt it merely for its simplicity of use.

“One of the big pluses of Wasm is that it is very easy for developers to get started, just by deploying some code and instantly watching it run. This is one of these value propositions where it may take a little while to fully understand it, but once you get hooked, you don’t want to worry about the ins and outs of the underlying infrastructure anymore,” Volk said. “You can then decide if it makes sense to replace Kafka or if you just want to connect it to your Wasm app.”

Java’s entire “write once, run anywhere” promise is quite similar to WebAssembly’s, Fermyon Technologies CEO and co-founder Matt Butcher told The New Stack: “In fact, Luke [Wagner, the original author of WebAssembly] once told me that he considered Java to be 20 years of useful research that formed the basis of how to write the next generation (e.g. Wasm),” Butcher said.

Still Not the Same

There is one key difference between Java and Wasm: their security postures.

Wasm’s portability and consistency can make security and compliance much easier to manage (again, it runs in a binary format at the CPU level). Part of Wasm’s structural simplicity is that code is released in a closed sandbox environment, almost directly to the endpoint. Java’s (as well as .NET’s) default security posture is “to trust the code it is running,” meaning Java grants that code access to the file system, to the environment, to processes, and to the network, Butcher said.

“In contrast, Wasm’s default security posture is to not trust the code that is running in the language runtime. For Fermyon (being cloud- and edge-focused), this was the critical feature that made Wasm a good candidate for a cloud service,” Butcher said. “Because it is the same security posture that containers and virtual machines take. And it’s what makes it possible for us as cloud vendors to sell a service to a user without having to vet or approve the user’s code.”

In other words, there are exponentially more attack points to worry about when working with distributed containerized and microservices environments. Volk agreed with Butcher’s assessment: relying on the zero trust principle allows for multitenancy based on the same technologies, like mTLS and JWT, that are already being used for application containers running on Kubernetes. “This makes Wasm easy to safely try out in shared environments, which should lower the initial barriers to get started,” Volk said.

Another big difference between Java and Wasm — which can actually run in a Linux kernel — is that Java requires the JVM and additional resources, such as a garbage collector, Sehyo Chang, CTO at InfinyOn, told The New Stack. “Wasm, on the other hand, is very close to the underlying CPU and doesn’t need GC or other heavy glue logic,” Chang said. “This allows Wasm to run on a very low-power CPU suitable for running in embedded devices or IoT sensors to run everywhere.”

 

The post Java’s History Could Point the Way for WebAssembly appeared first on The New Stack.

]]>
Limiting the Deployment Blast Radius https://thenewstack.io/limiting-the-deployment-blast-radius/ Tue, 15 Nov 2022 16:14:52 +0000 https://thenewstack.io/?p=22692941

Complex application environments often require deployments to keep things running smoothly. But deployments, especially microservice updates, can be risky because

The post Limiting the Deployment Blast Radius appeared first on The New Stack.

]]>

Complex application environments often require deployments to keep things running smoothly. But deployments, especially microservice updates, can be risky because you never know what could go wrong.

In this article, we’ll explain what can go wrong with deployments and what you can do to limit the blast radius.

What Is a Blast Radius?

A blast radius is the area around an explosion where damage can occur. In the context of deployments, the blast radius is the scope of the potential impact that a deployment might have.

For example, if you deploy a new feature to your website, the blast radius might be the website itself. But if you’re deploying a new database schema, the blast radius might be the database and all the applications that use it. The problem with deployments is that they often have an infinite blast radius.

While we always expect some blast radius, an infinite blast radius means that anything could go wrong and cause problems. That’s bad.

What Causes Blast Damage?

Poor Planning

Hastily developed and scheduled deployments are often the leading causes of infinite blast radius. When you rush a deployment, you’re more likely to make mistakes. These mistakes can include forgetting to update the documentation, accidentally breaking something in production, or not giving other interested parties, like dependent service owners, a chance to reflect and respond to the deployment.

Poor Communication

Have you ever woken up to frantic calls that your app is not working only to discover that another team had an unplanned deployment and didn’t inform you that it was going ahead? I have.

Telling someone at the last minute that they’re doing a deployment, without giving them adequate time to prepare, is another source of trouble. Not communicating is a recipe for disaster.

Lack of Testing

One of the most important things you can do to limit a deployment’s blast radius is to test it thoroughly before pushing it to production. That means testing in a staging environment configured as close to production as possible. It also means doing things like unit testing and end-to-end testing.

By thoroughly testing your code before deploying it, you can catch any potential issues and fix them before they cause problems in production.

Incomplete or Unclear Requirements

If your developers had to guess what the requirements were, they most likely didn’t have a clear test plan either. Unclear requirements can lead to code that works in development but breaks in production. In addition, it can lead to code that doesn’t play well with other systems. It can also lead to features that don’t meet users’ needs.

To avoid this, make sure you have a clear and complete set of requirements before starting development. Precise requirements will help ensure that your developers understand what they need to build and that they can test it properly before deploying it.

Configuration Errors

The most common way that deployments go wrong is when configuration changes occur. For example, you might change the database settings and forget to update the application. Or you might change the way you serve your website and break the links to all of your other websites. Configuration changes are often the cause of deployments going wrong because they can affect so many different parts of your system.

Human Failure

Human beings get tired, make mistakes, and forget things. When deploying a complex system, there’s a lot of room for error. Even the most experienced engineers can make mistakes.

Environment Parity Issues

Sometimes, in the chaotic world of production support, things get changed in production and those changes don’t trickle down to the testing and development environments. This lack of environment parity can lead to problems when you go to deploy. Your application might not work in the new environment, or you might not have all the necessary files and configurations. Or there are the dreaded things in the environment that weren’t in testing, and no one knows why they’re there.

Software Issues

Finally, the software itself can go wrong. Software is complex, and it’s often hard to predict how it will behave in different environments. For example, you might test your software in a development environment, and it works fine. But when you deploy it to production, there might be unintended consequences. For example, if your code is “spaghetti” and is tightly coupled or is difficult to maintain, you probably have issues with deployments.

A lot can go wrong! But what if there were ways you could limit the blast radius?

How to Limit the Blast Radius

There are a few ways to limit the blast radius.

Plan and Schedule

Set your team up for success by establishing a regular cadence and process for deployments. Consistency will minimize the potential for human error.

When everyone knows when to expect deployments or what to expect when there’s a particular case, your deployments will go much smoother. You should also plan what to do if something goes wrong. Can you roll back? Are there backups?

Over Communicate

Part of planning the deployment is making sure that all responsible and affected parties are informed adequately and have time to review, reflect, and respond. Communicating includes sending an email to all users informing them of the upcoming deployment and telling them what to expect. In addition, it means communicating with the people who will do the deployment. Give them adequate time to prepare, review, practice, and clear their calendars. It’s better to err on the side of providing too much information rather than too little.

Understand Risks

The most important way to prepare is to understand your risks. First, you need to know what could go wrong and how it would affect your system. Only then can you take steps to prevent it from happening.

Automate Everything

Another way to limit the blast radius is to automate as much of the deployment process as possible. If there’s something that can be automated, then do it. For example, automating your deployments will help ensure consistency and accuracy. This way, you can be sure that everything is covered and done correctly, and that nothing is forgotten.

There are many different tools available to help you automate your deployments. Choose the one that best fits your needs and then automate as much of the process as possible.

Use an Internal Developer Portal

Finally, use an internal developer portal. An internal developer portal organizes much of the information you may need to help limit the blast radius.

For instance, it can help you understand the downstream services that depend on your service, identify the owners of those services, find related documentation, and visualize key metrics about those services all in one place. This enables you to know who to communicate with ahead of a deployment and how to get in touch, provides context on how those downstream services work, and offers a place to monitor those services during testing (assuming your internal developer portal is infrastructure-aware at the environment level).

An internal developer portal can also help you understand your risks by providing access to version control information that shows what was changed, when it was changed and by whom. This way, you can identify any changes that might have introduced risk and take steps to mitigate that risk.

One such deployment management tool is configure8. It will help you understand the blast radius of your deployments and limit the potential fallout, ensuring your deployments run more smoothly and reliably. It enables your team to answer the following questions:

  • Who owns the service?
  • Who’s on call for the service?
  • What was in the last deployment?
  • What applications depend on the service, and how mission-critical are they?
  • What’s the health of the service, including monitoring at the environment level?
  • When was the last time someone updated the service?

Learn more about how your engineering team can benefit from Configure8 and how it can help you limit the blast radius of your deployments. You can check it out here.

The post Limiting the Deployment Blast Radius appeared first on The New Stack.

]]>
Do … or Do Not: Why Yoda Never Used Microservices https://thenewstack.io/do-or-do-not-why-yoda-never-used-microservices/ Tue, 25 Oct 2022 13:00:44 +0000 https://thenewstack.io/?p=22689827

Microservices were meant to be a blessing, but for many, they’re a burden. Some developers have even moved away from them after negative

The post Do … or Do Not: Why Yoda Never Used Microservices appeared first on The New Stack.

]]>

Microservices were meant to be a blessing, but for many, they’re a burden. Some developers have even moved away from them after negative experiences. Operational complexity becomes a headache for this distributed, granular software model in production. Is it possible to solve microservices’ problems while retaining their advantages?

Microservices shorten development cycles. Changing a monolithic code base is a complex affair that risks unexpected ramifications. It’s like unraveling a sweater so that you could change its design. Breaking that monolith down into lots of smaller services managed by two-pizza teams can make software easier to develop, update, and fix. It’s what helped Amazon grow from a small e-commerce outfit to the beast it is today.

Microservices also introduce new challenges. Their distributed nature exposes developers to complex state management issues.

The Yoda Principle

Ideally, developers shouldn’t deal with state management at all. Instead, the platform should handle it as a core abstraction. Database transaction management is a good example; many database platforms support atomic transactions, which group a set of smaller operations into a single transaction and ensure that either all of them happen or none of them do. To achieve this behavior the database uses transaction isolation, which restricts the visibility of each operation in a transaction until the entire transaction completes. If an operation fails, the application using the database sees only the pre-transaction state, as though none of the operations happened.

This transactionality enables the developer to concentrate on their business logic while the database platform handles the underlying state. A database transaction doesn’t fail half-complete and then leave the developer to sort out what happened. An account won’t be debited without the corresponding party’s account being credited, for example. As Yoda said: “Do or do not. There is no try.” Appreciate ACID SQL databases, he would have.
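As a small, self-contained illustration of that guarantee, here is a toy example using Python’s built-in sqlite3 module; the account table and the simulated mid-transaction failure are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, source, target, amount, fail=False):
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, source))
        if fail:
            raise RuntimeError("credit service failed")  # simulate a mid-transaction failure
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, target))


try:
    transfer(conn, "alice", "bob", 50, fail=True)
except RuntimeError:
    pass

# Both operations were rolled back: the application sees only the pre-transaction state.
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100, 'bob': 0}
```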

“Phew,” you think. “Thank goodness I don’t have to write code to unravel half-completed operations just to work out the transaction state.” Unfortunately, microservices developers are still living in that era. This is why Yoda never used Kubernetes.

I’ve Got a Bad Feeling about This

In microservice architectures, a single business process interacts with multiple services, each of which operates and fails autonomously. There is no single monolithic engine to manage and maintain state in the event of a failure.

This lack of transactionality between independent services leaves developers holding the bag. Instead of just focusing on their own applications’ functionality, they must also handle application resilience by managing what happens when things go wrong. What was once abstracted is now their problem.

In practice, things can go wrong quickly in microservice architectures, with cascading failures that cause performance and reliability problems. For example, a service that one development team updates can cause other services to fail if they haven’t also been updated to handle those new errors.

The brittle complexity of microservices is a challenge, in part because of the weakest link effect. An application’s overall reliability is only as good as its least reliable microservice. The whole thing becomes a lot harder with asynchronous primitives. State management is more difficult if a microservice’s response time is uncertain.
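A back-of-the-envelope calculation shows how quickly this compounds: if a single request has to traverse ten services that are each 99.9% available, the whole path is only about 0.999^10 ≈ 99% available, and swapping in just one flakier dependency at 99% drags the path down to roughly 98%. A serial chain of calls is never better than its weakest link, and is usually worse.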

Look at the Size of That Thing

Another aspect of this problem is that managing state on your own doesn’t scale well. The more microservices a user has, the more time-consuming managing their state becomes. Companies often have thousands of microservices in production, outnumbering their developers. This is what we noticed as early developers at Uber. Uber had 4,000 microservices, even back in 2018. We spent most of our time writing code to manage the microservice state in this environment.

Developers have taken several homegrown approaches to state management. Some use Kafka event streams hidden behind an API to queue microservice-based messages, but the lack of diagnostics makes root cause analysis a nightmare. Others use databases and timers to keep track of the system state.

Monitoring and tracing can help, but only up to a point. Monitoring tools oversee platform services and infrastructure health while tracing makes it easier to troubleshoot bottlenecks and unexpected anomalies. There are many on offer. For example, Prometheus offers open-source monitoring that developers can query, while its sibling Grafana adds visualization capabilities to trace system behavior.

These solutions can be useful, providing at least some observability into microservices-based systems. However, monitoring tools don’t help with the task of state management, leaving that burden with the developer. That’s why developers spend way too much time writing state management code instead of highly differentiated business logic. In an ideal world, something else would abstract state management for them.

Use the Microservices State Management Platform, Luke

The answer to simplifying state management in microservices is to offer it as a core abstraction for distributed systems.

We worked on a state management platform after spending far too much time manually managing microservices state at Uber. We wanted a product that would enable us to define workflows that make calls to different microservices (in the language of the developer’s choosing), and then execute them without worrying about it afterwards.

In our solution, which we originally called Cadence, a workflow automatically maintains state while waiting for potentially long-running microservices to respond. Its concurrent nature also enables the workflow to continue with other non-dependent operations in the meantime.

The system manages disruption in state without requiring developer intervention. For example, in the event of a hardware failure the state management platform will restart a Workflow on another machine in the same state without the developer needing to do anything.

Do. Don’t Not Do.

A dedicated state management platform for microservices gives us the same kind of abstraction that we see in atomic database transactions. Developers can be certain that a Workflow will run once, to completion. Temporal takes care of any failures and restarts under the hood. Now, microservices-based applications can guarantee that a debit from one account automatically credits the other in just a couple of lines of code. They get the best of both worlds.
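To make that concrete, here is a minimal sketch of what such a workflow might look like, assuming Temporal’s Python SDK (temporalio). The activity names, bodies and account-service calls are hypothetical, and actually running it would also require a Temporal server, worker and client, which are omitted here:

```python
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def debit(account: str, amount: int) -> None:
    ...  # call the account service here


@activity.defn
async def credit(account: str, amount: int) -> None:
    ...  # call the account service here


@workflow.defn
class TransferWorkflow:
    @workflow.run
    async def run(self, source: str, target: str, amount: int) -> None:
        # Each activity call is durably recorded; if a worker or machine fails,
        # the workflow resumes from its last completed step instead of stopping half-done.
        await workflow.execute_activity(
            debit, args=[source, amount], start_to_close_timeout=timedelta(seconds=30)
        )
        await workflow.execute_activity(
            credit, args=[target, amount], start_to_close_timeout=timedelta(seconds=30)
        )
```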

This fixes a long-standing problem with microservices and supercharges developer productivity, especially now that they are typically responsible for the operation of their application in addition to the development. Finally, developers that want the benefits of microservices can enjoy them without having to go to the dark side.

The post Do … or Do Not: Why Yoda Never Used Microservices appeared first on The New Stack.

]]>
The Gateway API Is in the Firing Line of the Service Mesh Wars  https://thenewstack.io/the-gateway-api-is-in-the-firing-line-of-the-service-mesh-wars/ Mon, 17 Oct 2022 14:00:55 +0000 https://thenewstack.io/?p=22682717

It appears that the leading service mesh vendors are leaning towards the Kubernetes Gateway API, replacing Ingress with a single

The post The Gateway API Is in the Firing Line of the Service Mesh Wars  appeared first on The New Stack.

]]>

It appears that the leading service mesh vendors are leaning toward the Kubernetes Gateway API, replacing Ingress with a single API that can be shared for managing Kubernetes nodes and clusters through a service mesh. While the Gateway API, like service mesh itself, is designed for other infrastructure-management uses in addition to Kubernetes, it has been configured for Kubernetes specifically, created by Kubernetes creator Google.

“In general, if implementing the Gateway API for Kubernetes solves any operational friction that exists today between infrastructure providers, platform admins, and developers and can ease any friction developers experience in how they deploy North-South APIs and services, I think it makes sense to assess what that change will look like,” Nick Rago, field CTO for Salt Security, told The New Stack. This will put organizations in a good position down the road as the Gateway API specification matures and gateway controller providers support more of the spec, reducing the need for vendor- or platform-specific annotation knowledge and usage.

The Controversy

In that sense, this helps explain to some degree why projects such as Linkerd and Istio, and Google especially, are embracing the Gateway API as a standard to rely on. The push to move organizations to the Gateway API is not without controversy, though.

To wit, Linkerd’s August release of Linkerd 2.12 is what Buoyant CEO William Morgan describes as “a first step towards adopting the Gateway API as a core configuration mechanism.” However, Morgan is cautious and wary of standards in general, and of the risk that they lead to vendor lock-in and other issues (as he vociferously expresses below).

Wariness of standards is not unreasonable given that they may or may not be appropriate for certain use cases, often depending on the maturity of the project.

“Standards can be a gift and a curse depending on the lifecycle stage of the underlying domain/products. They can be an enabler to allow higher-order innovation, or they can be overly constraining,” Daniel Bryant, head of developer relations at Ambassador Labs, told The New Stack. “I believe that Kubernetes ingress and networking is a well enough explored and understood domain for standards to add a lot of value to support additional innovation. This is why we’re seeing not only Ingress projects like Emissary-ingress and Contour adopt the Gateway API spec, but also service mesh products like Linkerd and Istio.”

Needless to say, Google supports the Gateway API as far as standards go. In an emailed response, Louis Ryan, principal engineer at Google Cloud, offered this when asked to explain why projects such as Linkerd and Istio, and Google especially, are supporting the Gateway API:

“Kubernetes has proven itself an effective hub for standardizing APIs with broad cross-industry engagement and support. The Gateway API has benefitted from this and, as a result, is a very well-designed solution for a wide variety of traffic management use cases; ingress, egress, and intra-cluster,” Ryan wrote. “Applying the Gateway API to mesh traffic management is a very natural next step and should benefit users by creating a standard that is thorough, community-driven and durable.”

The Istio Steering Committee decided to offer its service mesh as an incubating project with the Cloud Native Computing Foundation (CNCF) in part to improve Istio’s integration with Kubernetes through the Gateway API (as well as with gRPC through proxyless mesh, and with Envoy). The Gateway API is also seen as a viable Ingress replacement.

“Donating the project to the CNCF offered reassurance that Istio is in good shape and that it’s not a Google project but a community project,” Idit Levine, founder and CEO of solo.io — the leading provider of tools for Istio — told The New Stack.

Istio’s move followed concerns by IBM — one of the original creators with Google and car-sharing provider Lyft — and other community members over the project’s governance, specifically Google’s advocacy of the creation of the Open Usage Commons (OUC) for the project in 2020.

In Linkerd’s case, Linkerd 2.12 provides a first step toward supporting the Kubernetes Gateway API, Morgan said. While the Gateway API was originally designed as a richer and more flexible alternative to the long-standing Ingress resource in Kubernetes, it “provides a great foundation for describing service mesh traffic and allows Linkerd to keep its added configuration machinery to a minimum,” Morgan wrote in a blog post.

“The value of the Gateway API for Linkerd is that it’s already on users’ clusters because it’s a part of Kubernetes. So to the extent that Linkerd can build on top of the Gateway API, that reduces the amount of novel configuration machinery we need to introduce,” Morgan told The New Stack. “Reducing configuration is part and parcel of our mission to deliver all the benefits of the service mesh without the complexity of other projects in the space.”

The portability promised by the Gateway API spec “is attractive to operators and platform engineers,” Bryant said. “Although many of them will choose a service mesh for the long haul, using the Gateway API does enable the ability to both standardize configuration across all service meshes deployed within an organization and also open the door to swapping out an implementation if the need arises (although, I’m sure this wouldn’t be an easy task),” Bryant said.

However, Linkerd implements only parts of the Gateway API (e.g. CRDs such as HTTPRoute) to configure Linkerd’s route-based policies. This approach allows Linkerd to start using Gateway API types without implementing the portions of the spec “that don’t make sense for Linkerd,” Morgan wrote in a blog post. As the Gateway API evolves to better fit Linkerd’s needs, Linkerd’s intention is to switch to the source types in a way that minimizes friction for its users.

“I think the biggest concern is that the Gateway API gets co-opted by one particular project or company and stops serving the needs of the community as a whole. While the Gateway API today is reasonably complete and stable for the ingress use case, making it amenable to service meshes is still an ongoing effort (the “GAMMA initiative”) and there’s plenty of room for that process to go south,” Morgan told The New Stack. “In particular, many of the participants in the Gateway API today are from Google and work on Istio; if the GW API develops in an Istio-specific way then it doesn’t actually help the end user because we’ll end up with projects (like Linkerd) just developing their own APIs rather than conforming to something that doesn’t make sense to them. (We saw this a little bit with SMI.)”

The Way to Go

Meanwhile, Linkerd is only providing parts of the Gateway API to remain in line with the service mesh provider’s vision of keeping the service mesh experience “light and simple,” Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack. “They will not want to adopt anything that creates admin overhead for their user base or could potentially introduce network latency that would take away from their high-performance claim,” Volk said. “They even advertise on their website, ‘as little YAML and as few CRDs as possible,’ meaning that they will want to critically evaluate any advanced features they might need to offer to fully support Gateway API. This would dilute simplicity and performance as Linkerd’s key differentiators against Istio.”

Istio and Linkerd, of course, represent competing service mesh alternatives. For some service mesh users, GKE support is critical, and since Istio is GKE’s service mesh of choice, Istio “might be the way to go,” Volk said. “However, most other vendors of proxies, ingress controllers and service mesh platforms have also indicated their support of Gateway API, at some point in the future,” Volk said. “Therefore, it might be wisest to trust your vendor of choice to ultimately support the critical elements of the Gateway API standard.”

HashiCorp, Kong, Ambassador and others are supporting the Gateway API, Bryant noted. Already, “the majority of Kubernetes API Gateway providers offer some level of support for the Gateway API spec,” Bryant said. “Both Emissary-ingress and Ambassador Edge Stack have offered this type of support for quite some time, and this will continue to evolve in the future.”

Ambassador Labs is also working with other founding contributors on the Envoy Gateway project, which will be the reference implementation of the Kubernetes Gateway API spec, Bryant said. They include Tetrate, VMware, Fidelity and others. “Our goal here is to collaborate on a standardized K8s API Gateway implementation that we can all build upon and innovate on top of,” Bryant said.

The post The Gateway API Is in the Firing Line of the Service Mesh Wars  appeared first on The New Stack.

]]>
AmeriSave Moved Its Microservices to the Cloud with Traefik’s Dynamic Reverse Proxy https://thenewstack.io/amerisave-moved-its-microservices-to-the-cloud-with-traefiks-dynamic-reverse-proxy/ Thu, 08 Sep 2022 21:02:23 +0000 https://thenewstack.io/?p=22682687

When AmeriSave Mortgage Corporation decided to make the shift to microservices, the financial services firm was taking the first step

The post AmeriSave Moved Its Microservices to the Cloud with Traefik’s Dynamic Reverse Proxy appeared first on The New Stack.

]]>

When AmeriSave Mortgage Corporation decided to make the shift to microservices, the financial services firm was taking the first step in modernizing a legacy technology stack that had been built over the previous decade. The entire project — migrating from on-prem to cloud native — would take longer.

Back in 2002, when company founder and CEO Patrick Markert started AmeriSave, only general guidelines for determining rates were available online. “At that time, finance was very old-school, with lots of paper and face-to-face visits,” said Shakeel Osmani, the company’s principal lead software engineer.

But Markert had a technology background, and AmeriSave became a pioneer in making customized rates available online. “That DNA of technology being the driver of our business has remained with us,” said Osmani.

Since then, AmeriSave has automated the creation and processing of loan applications, giving it lower overall operating costs. With six major loan centers in 49 states and over 5,000 employees, the company’s continued rapid growth demanded an efficient, flexible technology stack.

Steps to the Cloud

With many containerized environments on-prem, company management initially didn’t want to migrate to a cloud native architecture. “The financial industry was one of the verticals hesitant to adopt the cloud because the term ‘public’ associated with it prompted security concerns,” said Maciej Miechowicz, AmeriSave’s senior vice president of enterprise architecture.

Most of the engineers on his team came from companies that had already adopted microservices, so that’s where they started. First, they ported legacy applications into microservices deployed on-prem in Docker Swarm environments, while continuing to use the legacy reverse proxy solution NGINX for routing.

“We then started seeing some of the limitations of the more distributed Docker platform, mostly the way that networking operated, and also some of the bottlenecks in that environment due to increased internal network traffic,” said Miechowicz.

The team wanted to move to an enterprise-grade cloud environment for more flexibility and reliability, so the next step was migrating microservices to Microsoft’s Azure cloud platform. Azure Red Hat OpenShift, already available in the Azure environment, offered high performance and predictable cost.

The many interdependencies among AmeriSave’s hundreds of microservices required the ability to switch traffic easily and quickly between Docker Swarm and OpenShift environments, so the team wanted to use the same URL for both on-prem and in the cloud. Without that ability, extensive downtime would be required to update configurations of each microservice when its dependency microservice was being migrated. With over 100 services, that migration task would cause severe business interruptions.

First, the team tried out Azure Traffic Manager, an Azure-native, DNS-based traffic load balancer. But because it’s not automated, managing all those configurations through Azure natively would require a huge overhead of 300 to 500 lines of code for each service, said Miechowicz.

One of the lead engineers had used Traefik, a dynamic reverse proxy, at his prior company and liked it, so the team began discussions with Traefik Labs about its enterprise-grade Traefik Enterprise for cloud native networking.

Cloud and Microservices Adoption Simplified

Traefik Labs was founded to deliver a reverse proxy for microservices that can automatically reconfigure itself on the fly, without the need to go offline.

The open source Traefik Proxy handles all of the microservices applications networking in a company’s infrastructure, said Traefik Labs founder and CEO Emile Vauge. This includes all incoming traffic management: routing, load balancing, and security.

Traefik Enterprise is built on top of that. “Its additional features include high availability and scalability, and advanced security, as well as advanced options for routing traffic to applications,” he said. “It also integrates API gateway features, and connects to legacy environments.”

Vauge began work on Traefik as an open source side project while he was developing a Mesosphere-based microservices platform. “I wanted to automate 2,000 microservices on it,” he said. “But there wasn’t much in microservices available at that time, especially for edge routing.”

He founded Traefik Labs in 2016 and the software is now one of the top 10 downloaded packages on GitHub: it’s been downloaded more than 3 billion times.

“The whole cloud native movement is driven by open source, and we think everything should be open source-based,” he said. “We build everything with simplicity in mind: we want to simplify cloud and microservices adoption for all enterprises. We want to automate all the complexity of the networking stack.”

Multilayered Routing Eliminates Downtime

Working together, Traefik’s team and Miechowicz’s team brainstormed the idea of dynamic path-based routing of the same URL, between on-prem Docker Swarm and cloud-based OpenShift. This means a service doesn’t need to be updated while its dependency microservice is being migrated.

Any migration-related problem can be quickly fixed in Traefik Enterprise by redirecting routing from OpenShift back to on-prem Docker Swarm, correcting the issue, and redirecting back to OpenShift. Also, there’s no need to update configurations of any other services.

This is made possible by the way that Traefik Enterprise’s multilayered routing works. “Layer 1 of Traefik Enterprise dynamically collects path-based and host-based routing configured in Layer 2,” said Miechowicz. “In our case, we had two Layer 2 sources: on-prem Docker Swarm and cloud-based OpenShift. Layer 1 then directs the traffic to the source that matches the host/path criteria and has a higher priority defined. Rollback from OpenShift to Docker Swarm simply consists of lowering the priority on the OpenShift route. We did a proof-of-concept and it worked perfectly and fast.”

This contrasts with how NGINX works. “You may configure it to route to a hundred services, but if one service does not come up, NGINX will fail to start and cause routing outage of all the services,” said Osmani. But Traefik Enterprise will detect a service that’s failing and stop routing to it, while other services continue to work normally. Then, once the affected service comes back up, Traefik Enterprise automatically establishes routing again.

NGINX also doesn’t have Traefik’s other capabilities, like routing on the same URL, and it’s only suited for a smaller number of services, Osmani said. Both Azure Traffic Manager and Traefik must be maintained and managed, but that’s a lot easier to do with Traefik.

No More Service Interruptions

Osmani said adopting Traefik Enterprise was one of the best decisions the team has made in the past year because it’s removed many pain points.

“When we were on-prem, we were responsible for managing everything — we’ve often gotten up at midnight to fix something that someone broke,” he said. “But with Traefik you can only take down the service you’re affecting at that moment.”

From the business standpoint, the main thing that’s better is the migration, said Osmani. “Because we are a living, breathing system, customers are directly affected. In the online mortgage lending business, if a service is down people will just move on to the next mortgage lender’s site. Now we don’t experience service interruptions. There’s no other way we could have easily accomplished this.”

“For developers in our organization, the result works like magic,” said Miechowicz. “We just add a few labels and Traefik Enterprise routes to our services. As our developers move services to the cloud, none of them have seen a solution as streamlined and automated like this before.”

The post AmeriSave Moved Its Microservices to the Cloud with Traefik’s Dynamic Reverse Proxy appeared first on The New Stack.

]]>
Event Streaming and Event Sourcing: The Key Differences https://thenewstack.io/event-streaming-and-event-sourcing-the-key-differences/ Thu, 01 Sep 2022 13:43:09 +0000 https://thenewstack.io/?p=22682394

Customers like to be aware of events when they happen. After a customer orders a new pair of shoes and

The post Event Streaming and Event Sourcing: The Key Differences appeared first on The New Stack.

]]>

Customers like to be aware of events when they happen. After a customer orders a new pair of shoes and receives a notification that the purchase has been shipped, getting up-to-the-minute shipping status updates before it arrives improves the overall customer experience.

David Dieruf
David is the streaming developer advocate at DataStax. Over his career, he’s worked on enterprise sales, developer advocacy and open source support. He’s a father and husband surrounded by horses and bourbon in Louisville, Kentucky.

The updates about the order are events that trigger a response in an event-driven architecture (EDA). An EDA is a software design that reacts to changes of state (events) and transmits these events using a decoupled architecture.

This decoupled architecture can employ several design patterns like the publish-subscribe (pub-sub) pattern, where a producer publishes an event and a subscriber watches for events, but neither are dependent on the other.

Event streaming and event sourcing represent two ways that organizations can power their EDAs.

With event streaming, there’s a continuous stream of data flowing between systems, with the data representing a new state of events broadcast using the pub-sub pattern. Event sourcing, on the other hand, stores every new event in an append-only log. This serves as a source of truth containing a chronological order of events and contexts.

Event sourcing and event streaming are often used side by side in EDAs, but it’s important to distinguish the two as they work very differently. While event streams promote more accessible communication between systems, event sourcing provides event history by storing new events in an append-only log.

Here, we’ll discuss both event-coordinating methods and provide a few use cases for each.

Event Streaming: Decouple Your Services

Event streaming employs the pub-sub approach to enable more accessible communication between systems. In the pub-sub architectural pattern, consumers subscribe to a topic or event, and producers post to these topics for consumers’ consumption. The pub-sub design decouples the publisher and subscriber systems, making it easier to scale each system individually.

The publisher and subscriber systems communicate through a message broker like Apache Pulsar. When a state changes or an event occurs, the producer sends the data (data sources include web apps, social media and IoT devices) to the broker, after which the broker relays the event to the subscriber, who then consumes the event.
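
As a rough sketch of this flow using the pulsar-client Python library (the broker URL, topic name, subscription name and event payload below are assumptions for illustration, not a definitive implementation):

import pulsar

client = pulsar.Client("pulsar://localhost:6650")   # assumed local broker

# Subscriber: registers interest in the topic; it never talks to the producer directly.
consumer = client.subscribe("order-status", subscription_name="shipping-notifications")

# Producer: publishes a state-change event to the broker.
producer = client.create_producer("order-status")
producer.send(b'{"order_id": 42, "status": "shipped"}')

# The broker relays the event to every subscription on the topic.
msg = consumer.receive()
print(msg.data())            # b'{"order_id": 42, "status": "shipped"}'
consumer.acknowledge(msg)    # confirm the event was processed

client.close()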

Event streaming involves the continuous flow of data from sources like applications, databases, sensors and IoT devices. Event streams employ stream processing, in which data undergoes processing and analysis during generation. This quick processing translates to faster results, which is valuable for businesses with a limited time window for taking action, as with any real-time application.

Event streaming provides several advantages for businesses; here are a few:

Improved Customer Experience

Event streaming and processing offers organizations the ability to enrich their customers’ experience. For example, a customer placing a dinner order can get instant status updates, notifying them when the delivery vehicle is on its way to their location, or if it has arrived. This heightened customer experience translates into more trust, better reviews and improved revenue.

Risk Mitigation

Applications like PayPal and other financial technology applications can employ event streaming to provide online fraud detection to enhance security using real-time monitoring. Fraud algorithms test the circumstances of an event (purchase or transaction) using predictive analytics to detect a deviation from the norm (an outlier). If the system detects an outlier or unusual event, it stops the transaction or blocks the card from completing it.

Reduced Operational Costs

By analyzing event streams, industrial tools can log performance and health metrics to assess equipment health. This feature enables organizations to perform predictive maintenance on machines before a total breakdown, which costs more to repair. In manufacturing, for example, organizations can employ Pulsar streams to aggregate and process data from machine parameters, like temperature or pressure. Engineers could set a machine’s maximum temperature and set an alert that would be triggered if that temperature is exceeded. Machine operators could perform checks and maintenance before more costly problems occur.
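
As a tiny illustrative sketch of that kind of rule (the threshold, event shape and machine names are assumptions, and in practice the readings would arrive continuously from a stream rather than a list):

MAX_TEMP_C = 90  # assumed maximum safe operating temperature

def alert(machine_id, reading):
    # In a real deployment this might page an operator or open a maintenance ticket.
    print(f"ALERT: machine {machine_id} at {reading}°C exceeds {MAX_TEMP_C}°C")

def check_temperatures(readings):
    # Scan a stream of machine readings and alert on threshold breaches.
    for event in readings:
        if event["temperature_c"] > MAX_TEMP_C:
            alert(event["machine_id"], event["temperature_c"])

check_temperatures([
    {"machine_id": "press-1", "temperature_c": 72},
    {"machine_id": "press-2", "temperature_c": 95},   # triggers the alert
])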

How Is Event Streaming Used?

Event streaming is essential for businesses and applications that stream a high volume of data and depend on fast, actionable insights. These applications include e-commerce, financial trading and IoT devices.

Financial-trading applications employ event streaming to publish time-sensitive events where customers want to act immediately. For instance, users may subscribe to a backend service that sends updates on specific events, like a change in stock price, to enable timely decision-making.

Event streaming also has risk- and fraud-detection applications in financial systems that process payments and other transactions (and block fraudulent transactions). Defined fraud algorithms can block suspicious transactions by analyzing data immediately after it’s generated.

Event Sourcing: An Ordered History

Event sourcing stores data as events in append logs. The process captures every change to the state of an application in an event object and stores these event objects as logs in chronological order. With event sourcing, event stores compile the state of a business entity as an event in a sequence, and a change in the state, like new orders or the cancellation of an order, appends the latest state to the list of events.

For event sourcing to work efficiently and consume minimal resources, each event object should only contain the necessary details. That minimizes storage space and prevents the use of valuable resources in processing data that leads to nonactionable insights.

Event stores compile business events and context; appending long streams to the event logs consumes database storage quickly. Keeping only the necessary event contexts as part of the event object helps free up storage space for adding multiple event logs, which drive actionable insights.

Organizations may choose to use “snapshots” to optimize performance in such cases. A snapshot stores the current state of the entity at a point in time, so deriving the latest state only requires loading the most recent snapshot and replaying the events recorded after it.

Let’s illustrate this. Suppose we have a database that takes stock of recent items in an e-commerce store:

Most databases store only the current state. If we wanted to account for how we arrived at the final stock value of 91, the current-state record alone would give us no way to reconstruct that journey. Event sourcing records every state change in a log, making it possible to trace event history for root cause analysis and auditing.

The above image illustrates event sourcing and shows three events, each with the database’s date, quantity and type of item. In this case, we can trace how we arrived at the final amount of 91.

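A minimal sketch of that idea in Python is below; the three events and their quantities are assumptions chosen so the replay lands on the final stock value of 91, and the optional snapshot argument reflects the optimization described earlier.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class StockEvent:
    occurred_on: date
    item: str
    quantity_change: int  # positive = received, negative = sold

# Append-only log: events are only ever added, never updated or deleted.
event_log = [
    StockEvent(date(2022, 8, 1), "shoes", +100),  # assumed initial delivery
    StockEvent(date(2022, 8, 2), "shoes", -6),    # assumed sale
    StockEvent(date(2022, 8, 3), "shoes", -3),    # assumed sale
]

def current_stock(events, snapshot=0):
    # Replay events (optionally starting from a snapshot) to derive the current state.
    return snapshot + sum(e.quantity_change for e in events)

print(current_stock(event_log))  # 91 -- and the log explains exactly how we got there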

Health care is one of the most heavily regulated industries, with ever-changing regulations to protect customer information. Health-care organizations need a flexible storage solution that adapts to growing data needs while making it easy to migrate legacy systems to newer technologies.

By employing event stores as their single source of truth, health-care systems can rely on the immutable state of event logs for the actual state of their data and make valuable projections by employing real-time stream processing. Retail and e-commerce businesses could gain better knowledge of their customers by analyzing large, durable event stores, which helps them create more personalized customer experiences.

Differences Between Event Streaming and Event Sourcing

There are a few similarities between event streaming and event sourcing. For one, each event-coordination method employs a decoupled microservices architecture, which helps improve scalability and performance.

Although event stores and streams differ in state durability, they are essential in providing the current event states of applications for use in analysis and driving business decisions. Also, both event-coordination methods possess durable storage capabilities, although event stores usually offer longer-term storage than event streams.

Here, let’s delve more into some key differences between event streaming and event sourcing.

Optimization

Event streaming is optimal for more accessible communication between data in motion by decoupling publishers from subscribers and making it easy to publish millions of messages at high performance. On the other hand, event sourcing helps establish event history by storing every new state of an entity in an append-only log.

Data Movement

For event sourcing, data exists at rest because events are immutable. However, event streams involve data always in transit, passing between multiple storage systems like databases, sensors and applications.

Wrapping Up

Event streaming and event sourcing help coordinate events in an event-driven architecture. Although their use and value are different, they work well together to help build a durable and high-performance application.

Event streaming employs the decoupled pub-sub pattern to continuously stream data from various sources, which helps drive business decision-making. Unfortunately, although event-streaming tools may possess durable storage, they’re not designed to store messages for long; messages only persist long enough to make the system fault-tolerant and resilient.

One can view event sourcing as a subset or component of event streaming. Event sourcing appends each new event to the current list of events in an ordered manner. It can also act as a source of truth for reliable audits and for obtaining the current state of events at any time. Event sourcing is crucial in heavily regulated financial industries that need a reliable store for audits and for tracing and rebuilding the current state of events. In contrast, event streaming is crucial in financial-trading applications where actions have a time-bound window and require immediate action.

EDA isn’t necessarily a destination. It’s a path to follow, driving certain system performance and characteristics. For example, event streaming decouples a collection of microservices so they become less dependent on each other. This drives resilience and easier iteration, among other benefits. Combined with event sourcing, microservices gain the ability to replay events as well as a full log of changes for a given feature like a user’s profile. This kind of architecture opens new possibilities within existing systems.

The post Event Streaming and Event Sourcing: The Key Differences appeared first on The New Stack.

]]>
Lessons from Deploying Microservices for a Large Retailer https://thenewstack.io/lessons-from-deploying-microservices-for-a-large-retailer/ Tue, 30 Aug 2022 17:01:24 +0000 https://thenewstack.io/?p=22682236

Microservices “saved our butts” every day, said Heath Murphy, director of consulting services at the IT consultancy CGI. Murphy led

The post Lessons from Deploying Microservices for a Large Retailer appeared first on The New Stack.

]]>

Microservices “saved our butts” every day, said Heath Murphy, director of consulting services at the IT consultancy CGI.

Murphy led the design of a microservices architecture for an order fulfillment system for a US-based specialty retailer that needed to serve over 1,200 brick and mortar stores, 15 e-commerce sites and 5,000 B2B partner company stores. He shared his experience at CodePaLOUsa, a regional development conference held this month in Louisville, Ky.

Defining Microservices

Microservices has no single definition, Murphy said. He ultimately settled on a few key defining characteristics:

  • Services “are often processes that communicate over a network to fulfill a goal using technology-agnostic protocols such as HTTP.”
  • Services are organized around business capabilities.
  • Services can be implemented using different languages, databases, hardware and software environments.
  • Services are “small in size, messaging-enabled, bounded by contexts, autonomously developed, independently deployable, decentralized, and built and released with automated processes.”

CGI’s Microservices Architecture

The system CGI built for its client needed to be able to ingest EDI files, flat files, and “a million other types of files,” Murphy said. It needed to handle REST API service calls, direct database access calls, and the message queue as orders flowed through validation, stock reservations and the warehouse. Finally, at the end of the journey, the process would need to handle inbound customer phone calls and hardware streaming data.

The business events it needed to handle included everything from new inbound orders to cancellations, outbound status updates and inventory updates.

CGI built a microservices architecture around one very large legacy SQL Server database to handle it all — which he acknowledged was very much an anti-pattern in microservices, but what they were required to do. Since the client was a .Net shop, it relied on .Net and C#. All of the messaging was handled by RabbitMQ, which he noted is an “extremely important part of the architecture and one first dismissed.” Finally, it used an off-the-shelf product to handle EDI files, since they didn’t want to process EDI files in .Net.

What’s notable is what’s missing, he added. The microservices architecture did not use containers, MongoDB or a similar big data solution, or even the cloud.

The Importance of Logging

Logging proved to be a crucial part of the microservice app. Whereas logging is straightforward and sequential in a monolithic app, in a microservices architecture logging happens in dozens of processes and modules. This creates challenges and makes the logging process more important.

“You got logs happening everywhere, dozens of places,” Murphy said.

Murphy recommended that development teams pick a logging package (his example was Serilog) and use a boilerplate template. Check for good logging in PRs, and make it easy to log so you get better logs, he explained. The system also required correlation IDs in all inbound services; it rejected all calls without that attribute. Finally, he advised building or leveraging an aggregator platform. “Logs are helpful when searchable,” his presentation noted.

He recommended using the same correlation ID for all logging in the same business transaction. To correlate IDs: Generate the correlation ID in the HEAD-IN service. GUIDs make fine IDs, he noted. Then pass it along as a request header. Finally, cross-reference the log keys to link all external business identifiers — such as order number, external partner order, warehouse order number, shipment tracking number and customer ID — back to the log key.
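
The talk’s examples were in .NET with Serilog; purely as an illustration of the same idea, here is a minimal Python sketch that generates a GUID correlation ID at the edge, attaches it to every log line and passes it downstream as a header. The logger name, header name and downstream URL are assumptions.

import logging
import uuid

import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fulfillment")

def handle_inbound_order(order):
    # Generate the correlation ID once, at the inbound edge; a GUID is fine.
    correlation_id = str(uuid.uuid4())

    # Cross-reference external business identifiers back to the log key.
    log.info("correlation_id=%s received order_number=%s",
             correlation_id, order["order_number"])

    # Pass the ID along as a header so downstream services log the same key.
    requests.post(
        "https://inventory.example.internal/reserve",   # assumed downstream service
        json=order,
        headers={"X-Correlation-ID": correlation_id},   # assumed header name
    )
    log.info("correlation_id=%s stock reservation requested", correlation_id)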

Service Bus to Manage Microservices

One lesson learned by the consultancy: If you build microservices so they’re dependent on one another, it can crash the system when one connection goes down. To avoid this, CGI set up the process so that API Management (APIM) sends messages to an Azure Service Bus, which triggers events between the microservices.

In the end, the system was expected to handle about 100,000 fulfillment requests. What really happened was that it handled 500,000 requests, he said — thus “saving their butts.”

It was able to handle that because it relied on lightweight messages in JSON with no order details, which triggered the fulfillment request microservice. RabbitMQ was able to scale up and handle bursts of traffic. Microservices then would fetch the full order details, perform full validation, and leverage other microservices.
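
The system described here was built in .NET on RabbitMQ; as a rough illustration of the pattern, the sketch below uses the pika Python client. The queue name, the message shape and the two helper stubs are assumptions, not the team’s actual code.

import json

import pika

def fetch_order_details(order_id):
    # Assumed helper: in the real system this would call the order service or database.
    return {"order_id": order_id, "lines": []}

def validate(order):
    # Assumed helper: full validation would leverage other microservices.
    assert "order_id" in order

def on_fulfillment_request(channel, method, properties, body):
    # The queued message is deliberately lightweight: an ID only, no order details.
    request = json.loads(body)
    order = fetch_order_details(request["order_id"])
    validate(order)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="fulfillment-requests", durable=True)
channel.basic_consume(queue="fulfillment-requests", on_message_callback=on_fulfillment_request)
channel.start_consuming()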

Murphy’s Advice: Monitor Everything

Murphy recommended monitoring everything. The team ran health checks to verify connectivity. They monitored message queues and set a threshold for the number of errors that were acceptable within a certain time frame; the system triggered an alert when that threshold was exceeded. They also monitored what was expected to happen but hadn’t happened yet.

“Logs tell you what happened but not what hasn’t happened yet,” Murphy’s slides noted. “Build monitoring around when things should have occurred but have not.”

Finally, the team used synthetic messages, injecting fake requests into the correlated business transactions and monitoring the service level agreements on timing, errors, etc.
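
As a rough sketch of that idea — checking for events that should have occurred but have not — the snippet below scans accepted orders for a missing follow-up event past an assumed SLA; the field names and the 30-minute window are illustrative assumptions.

from datetime import datetime, timedelta

SLA = timedelta(minutes=30)  # assumed: the warehouse must confirm within 30 minutes

def find_overdue(orders, now=None):
    # Return orders that were accepted but have not produced the expected next event.
    now = now or datetime.utcnow()
    return [
        o for o in orders
        if o["warehouse_confirmed_at"] is None and now - o["accepted_at"] > SLA
    ]

orders = [
    {"order_number": "A-1001", "accepted_at": datetime.utcnow() - timedelta(hours=2),
     "warehouse_confirmed_at": None},   # overdue: should trigger an alert
    {"order_number": "A-1002", "accepted_at": datetime.utcnow() - timedelta(minutes=5),
     "warehouse_confirmed_at": None},   # still within the SLA
]

for order in find_overdue(orders):
    print(f"ALERT: {order['order_number']} exceeded the fulfillment SLA")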

The system does use humans to manage edge cases — which Murphy noted are guaranteed to pop up — but the DevOps culture also incorporates edge case fixes back into the sprint cycle.

The post Lessons from Deploying Microservices for a Large Retailer appeared first on The New Stack.

]]>
Why Distributed Testing Is the Best Way to Test Microservices https://thenewstack.io/why-distributed-testing-is-the-best-way-to-test-microservices/ Thu, 25 Aug 2022 17:00:02 +0000 https://thenewstack.io/?p=22681057

It’s been almost a decade since I started developing full time, and almost four of those years were spent developing

The post Why Distributed Testing Is the Best Way to Test Microservices appeared first on The New Stack.

]]>

Aviv Kerbel
Aviv is the developer advocate at Helios. An experienced developer and R&D leader, Aviv is passionate about development in modern environments and trending dev tools and technologies. Prior to Helios, Aviv was a senior software developer at multiple startups and an associate at Amiti, an Israel-based venture capital firm, where he was involved in due diligence and investment processes for tech startups.

It’s been almost a decade since I started developing full time, and almost four of those years were spent developing in a microservices-based environment. But while we were all educated about the importance of testing, over time I found it was getting more complicated to write tests for my code, until it became almost impossible. Or maybe just not worth it, because I found it took more time than writing the code itself!

After all, with so many asynchronous calls between services, it was easy to miss exceptions — so the preparations for the tests and building the infrastructure became tedious and time consuming, not to mention having to prepare data for those kinds of tests.

Think about a flow that includes Kafka topics, Postgres, DynamoDB, third-party APIs and multiple microservices, which all depend on each other somehow. Any change might affect many others. This is not monolith-land anymore, where we would get an HTTP 500 status code from the server. In this scenario, we might have 100 microservices and still miss an exception thrown in another microservice, because nothing notifies us about it.

When we tried to build backend E2E tests, here are the options that we tried and tested:

Option one: Log-based testing

The first option we tried was fetching the logs added by the developers of the feature. These logs are then used to validate the relevant service data.

For example:

import unittest
import requests

BASE_URL = 'http://localhost:8000'   # assumed address of the order service under test
TEST_CLIENT_ID = 'client-42'

class OrderTest(unittest.TestCase):
    def test_process_order_happy_flow(self):
        # Assumes the order service runs in-process for the test, so records
        # emitted by its 'foo' logger are visible to assertLogs.
        with self.assertLogs('foo', level='INFO') as cm:
            requests.get(f'{BASE_URL}/process_order/{TEST_CLIENT_ID}')
        self.assertEqual(cm.output, [
            f'INFO:foo:send email to {TEST_CLIENT_ID}',
            f'INFO:foo:Charge {TEST_CLIENT_ID} succeeded!',
        ])

Pros: This method is always available and doesn’t require additional tools.

Cons:

  1. Logs aren’t always reliable, and they depend on developers remembering to add them (which doesn’t always happen).
  2. Tests that are based on logs are limited by design, since they only test the logs and not the operation itself.

Option two: DB querying

If we save an indication of the operation in the DB, we can query the DB during the test and base our assertions on those DB updates:

import unittest
import requests
from models import Client  # assumed ORM model mapping to the clients table

BASE_URL = 'http://localhost:8000'
TEST_CLIENT_ID = 'client-42'

class OrderTest(unittest.TestCase):
    def test_process_order_happy_flow(self):
        requests.get(f'{BASE_URL}/process_order/{TEST_CLIENT_ID}')
        client = Client.get_by_id(TEST_CLIENT_ID)  # query the DB for the updated record
        self.assertTrue(client.charged_successfully)
        self.assertTrue(client.email_sent)

Pros: 

  1. It’s less flaky than log-based testing.
  2. Side effect — this data can later help us generate analytics in real environments.

Cons: 

  1. It’s not always easy to expose the DB to testing projects, since it may require redesigning DB schemas, which is wrong from a DB design perspective.
  2. Like the former option — it’s kind of “second-hand testing.” We test the DB object instead of the actual operation, so we might miss some issues. 

There are probably more options, but they all suffer from similar issues — they are not reliable enough and are too time consuming. 

This brings us to the bottom line — this is the reason why most of the developers I know don’t test their code properly.

Option three: Contract testing

Contract testing is a testing method based on contracts between two systems to ensure compatibility. The interactions between the two services are recorded in a contract, which both sides then verify against to ensure they adhere to it. Pact.io is a popular open source solution for contract testing.
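
Pact.io offers client libraries in several languages; as a minimal, non-authoritative sketch, a consumer-side contract test with the pact-python library might look roughly like this (the service names, endpoint, payload and mock port are assumptions):

import atexit

import requests
from pact import Consumer, Provider

# The consumer defines the contract it expects the provider to honor.
pact = Consumer("InventoryService").has_pact_with(Provider("PaymentService"), port=1234)
pact.start_service()
atexit.register(pact.stop_service)

(pact
 .given("client 42 has a valid payment method")
 .upon_receiving("a request to verify payment for client 42")
 .with_request("get", "/payments/42")
 .will_respond_with(200, body={"verified": True}))

with pact:
    # The consumer code runs against Pact's mock provider; the interaction is
    # recorded into a contract file that the real provider later verifies.
    response = requests.get("http://localhost:1234/payments/42")

assert response.json() == {"verified": True}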

Pros:

  1. It ensures consistency.
  2. It’s a single place for storing architectural changes.

Cons:

  1. It does not cover all use cases — tests are created based on actual consumer behavior.
  2. It requires mocking.

Option four: End-to-end testing

End-to-end (e2e) testing provides a comprehensive overview of the system components and how messages pass between them. With e2e testing, developers can make sure the application is working as expected and that no misconfigurations prevent it from performing in production.

Pros:

  1. It provides powerful tests.
  2. It ensures that microservices are really working for users.

Cons:

  1. It’s difficult to maintain and debug.
  2. It has high costs.

Option five: Trace-based testing — a new test paradigm

In the last few years, distributed tracing technologies have been gaining momentum. Standards like OpenTelemetry enable a different way to look at microservices: from a trace standpoint.

Distributed tracing lets us observe everything that happens when a single operation is triggered. By using this method, we can approach each operation and get a holistic view of everything that happens in the application as a result of that operation, instead of looking at each piece separately.

In OpenTelemetry terminology — the trace is built of spans. A span represents an operation — sending an HTTP request, making a DB query, handling an asynchronous event, etc. Ideally, using spans can help us review any attributes we want to validate for the test. Moreover, the trace also shows a continuous flow and the relation between operations, as opposed to a single operation, which can be powerful for test purposes.
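
As a minimal sketch of this terminology using the OpenTelemetry Python API (the span names and attribute below are assumptions for illustration):

from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def process_order(client_id):
    # Each operation becomes a span; nested calls become child spans on the same trace.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("client.id", client_id)
        with tracer.start_as_current_span("charge_client"):
            ...  # call the payment service
        with tracer.start_as_current_span("send_email"):
            ...  # publish the notification event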

How to Create Trace-Based Testing

When using traces for tests, you first need to deploy a tracing solution. As mentioned before, the common tracing standard in the industry is OpenTelemetry, or OTel. It also provides search capabilities.

Let’s review the things you should do if you want to succeed in testing with OTeL:

  1. Deploy the OTeL SDK.
  2. Build processor services that will take OTel data and transform it into assets you can work with.
  3. Create an ELK service that can save all the traces.
  4. Create a tests framework that will allow you to search for a specific span inside traces.

And that’s it. Note that we didn’t discuss any visualization tools that would help you investigate traces, such as Zipkin or Jaeger, but you can use them as well.
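
Step 4 — a test that searches for a specific span — could be sketched with the OpenTelemetry Python SDK’s in-memory exporter, as below. In a real setup you would query your trace backend (the ELK service above) instead; process_order stands in for the instrumented operation sketched earlier and is an assumption here.

import unittest

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Route finished spans to an in-memory exporter that tests can search.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

class OrderTraceTest(unittest.TestCase):
    def test_expected_spans_recorded(self):
        process_order("client-42")   # assumed: the instrumented operation under test
        span_names = {span.name for span in exporter.get_finished_spans()}
        self.assertIn("charge_client", span_names)
        self.assertIn("send_email", span_names)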

But at the end of the day — another tool was still missing.

Looking at the relevant open source tools in the domain that are based on OpenTelemetry, we observed that most of them require the implementation of OpenTelemetry as a prerequisite, which left us with the same headaches.

Helios

Helios is a free tool that instruments OpenTelemetry and allows developers to generate test code for each trace.

 Through spans, Helios collects all the payloads of each request and response. Using this capability, developers can generate tests without changing a single code line.

It’s also built in a smart way that enables the creation of a test from a bug. Even better, using the trace visualization tool, a developer can visually create a test directly from a trace.

Bottom line: The way developers test microservices today is broken. It’s time consuming and inefficient. Using distributed tracing testing is the right answer — it allows developers to properly test their microservices, since it provides a holistic view of all microservices. Whichever tool you choose, make sure you take advantage of its abilities. 

The post Why Distributed Testing Is the Best Way to Test Microservices appeared first on The New Stack.

]]>
Why Enterprises Need a Management Plane https://thenewstack.io/why-enterprises-need-a-management-plane/ Fri, 19 Aug 2022 17:03:01 +0000 https://thenewstack.io/?p=22681413

This is the third in a three-part series. Read Part 1 and Part 2. Working on the control plane of

The post Why Enterprises Need a Management Plane appeared first on The New Stack.

]]>

This is the third in a three-part series. Read Part 1 and Part 2.

Working on the control plane of cloud native infrastructure is not for the faint of heart.

Eric Braun
Eric is vice president of product management and commercial strategy for F5 NGINX. Eric is a 20-year SaaS, PaaS, and IaaS veteran with 10 years of experience in cloud — public, private (systems software), and hybrid at Joyent, Samsung, Oracle and Deutsche Telekom’s MobiledgeX (acquired by Google). Eric is an architect of communities and ecosystems with domain expertise in developer experiences as well as cloud’s intersection with telecommunications and emerging AI, ML, XR and other use cases and connected devices.

Light on menus and heavy on config files, the control plane is easy to mess up, especially when an inexperienced admin tries to set traffic control, security or access-control policies.

In part, this is because the traditional control plane for applications came about when there were fewer roles and specialized products – your enterprise might have had just a handful of load balancers, a couple of application delivery controller (ADC) instances, global firewall and a web application firewall (WAF).

Today, enterprises with large cloud native applications might need tens of thousands of load balancers, each with its own WAF. These distributed networking and traffic-management elements sit in front of distributed and constantly moving containers. Managing that fleet en masse via scripts and without abstractions is a tremendous accomplishment even for the most astute DevOps, NetOps or DevSecOps teams.

But the whole point of cloud native is to empower more people to be self-sufficient and shift capabilities left. At the same time, we need to make it possible for humans to manage this constantly morphing compute kaleidoscope, and at machine speed.

In this era of mass computation and containers, a layer of abstraction is needed above the various control planes to set and enforce policy, to deliver enterprisewide observability and to provide a security layer across everything.

That new layer is the management plane, the latest essential tool in cloud ops. The management plane is key to enabling cloud native systems to deliver promised efficiencies and capabilities at scale for the entire enterprise.

What Is the Management Plane?

In networking systems and architectures for larger organizations, we have long thought about management in terms of two planes:

  • Roughly speaking, the data plane is where data lives, moves and acts. It is where networking systems – load balancers, firewalls, API gateways, ingress controllers, caching systems – inspect inbound and outbound packets and decide what to do with them.
  • The control plane is the policy layer residing above the data plane, where the rules controlling how the data plane behaves are managed and configured.

A long-held tenet of NGINX product design is that the two planes must be cleanly separated for your architecture to be scalable and system- and architecture-agnostic.

With the rise of cloud native, shift left and modern apps, we are seeing an emerging need for a further layer of abstraction: the management plane. The management plane is a layer above the control plane that reduces management complexity on the control plane and makes it easier for different personas inside the enterprise to manage applications, networking and compute resources.

The Platform Ops team designs and configures the management plane. Users of the management plane likely include the usual suspects: application development teams, DevOps, SecDevOps, SecOps and NetOps teams.

Beyond these core networking, application management and security teams, many other teams can use the management plane to do their jobs more easily while staying within architectural guardrails designed to safely provide scalability, security, resilience and observability.

Other beneficiaries of the management plane include IT teams, procurement teams, API management teams, infrastructure management teams and even marketing teams. In other words, the management plane becomes a broad offering that benefits users well beyond traditional networking, security and application development teams.

With Cloud Native, Everyone’s an Infrastructure Engineer

As we transition to modern apps for the cloud native era, the entire concept of centralized control and specialization of roles has been turned on its head. Consider average developers building a microservice. In the past, they primarily worried about the quality of their application code.

But with modern apps, developers have more responsibilities. To ensure their microservice runs and scales, they need to set up ideal traffic-management policies and load-balancing rules, configure firewall rules for a WAF, set API rate limits and retry policies – and that is just at the traffic level! For the most part, performing these tasks requires detailed knowledge of the control plane and data plane combined with mad scripting chops.

It’s the same story for marketing technologists, a newer but fast-growing persona. They need to scale their systems up and down for marketing efforts and are usually one layer up from application developers. But they have the same concerns and needs around guaranteeing security, resiliency and observability of the infrastructure of the marketing stack.

Take this over to the NetOps experience, where they’re now tasked with managing networking security and reliability not just for global networks using a global set of appliances, but also for numerous types of networking instances across different clouds, vendors and more.

This is not what the networking team signed up for, and it’s not sustainable. Even going from one major system to two means that complexity grows astronomically. Add Kubernetes to the mix and it gets even more complicated; networking looks and feels completely different with commingled Layer 7 and Layer 4 traffic and new data routing and control methods.

The Management Plane Reduces Complexity, Increases Control and Enables Scaling

It’s not enough simply to expose control-plane management tools to new stakeholders; that is more likely to increase the complexity and the scope of their jobs. This is why enterprises must consider embracing the management plane. Let’s look at how the management plane helps specific teams.

  • Developers: The management plane presents a single interface in which microservice owners can pick from a basic list of settings and configurations for all the necessary infrastructure for their application – load balancer or ingress controller, API gateway, WAF, etc. The service owners can choose from a curated list of configurations and policies, allowing them to set parameters for their application runtime, networking and security. This enables shift left, agility and speed while minimizing cognitive load and requirements for specialized knowledge.
  • NetOps teams: The management plane enables quick changes and policy checks across all networking applications without requiring NetOps teams to learn multiple coding languages, like NGINX conf scripts, YAML and Terraform’s scripting language. The management plane also allows them to export observability and telemetry data for all networking systems into centralized and customizable dashboards and reports.
  • Line-of-business engineers: The management plane lets them procure instances to scale out their systems, confident that the instances already have all the properly configured security, traffic and restart settings.

Across all these teams, the management plane solves one common problem: too much stuff to think about. To ensure that shift left works and that an organization can convert to cloud native and run quickly to build modern apps, all teams will need to learn to do more with more. NGINX believes that a management plane layer tuned for productivity, resilience and security removes complexity at both the control plane and the data plane.

The post Why Enterprises Need a Management Plane appeared first on The New Stack.

]]>
Monoliths to Microservices: 4 Modernization Best Practices https://thenewstack.io/monoliths-to-microservices-4-modernization-best-practices-2/ Tue, 16 Aug 2022 19:29:28 +0000 https://thenewstack.io/?p=22680965

When it comes to refactoring monolithic applications into microservices, most engineering teams have no idea where to start. Additionally, a

The post Monoliths to Microservices: 4 Modernization Best Practices appeared first on The New Stack.

]]>

When it comes to refactoring monolithic applications into microservices, most engineering teams have no idea where to start. Additionally, a recent survey revealed that 79% of modernization projects fail, at an average cost of $1.5 million and 16 months of work.

Oliver J. White
Oliver J. White is director of community relations at vFunction. Since 2007, he has been helping companies and startups like ZeroTurnaround (acquired by Perforce) and Lightbend tell their technology stories and build communities with digital content.

In other articles, we discussed the necessity of developing competencies for assessing your application landscape in a data-driven way to help you prioritize your first big steps. Factors like technical debt accumulation, cost of innovation and ownership, complexity and risk are important to understand before blindly embarking on a modernization project.

Event storming exercises, domain-driven design (DDD), the Strangler Fig Pattern and others are all helpful concepts to follow here, but what do you as an architect or developer actually do to refactor a monolithic application into microservices?

There is a large spectrum of best practices for getting the job done, and in this post, we look at some specific actions for intelligently decomposing your monolith into microservices.

These actions include identifying service domains, merging two services into one, renaming services to something more accurate and removing services or classes as candidates for microservice extraction. The best part: Instead of trying to do any of this manually, we’ll be using artificial intelligence (AI) plus automation to achieve our objectives.

Best Practice #1: Automate the Identification of Services and Domains

Surveys have shown that manually analyzing a monolith with sticky notes on whiteboards takes too long, costs too much and rarely ends in success. Which architect or developer on your team has the time and ability to stop what they’re doing to review millions of lines of code and tens of thousands of classes by hand? Large monolithic applications need an automated, data-driven way to identify potential service boundaries.

The Real-World Approach

Let’s select a readily available, real-world application as the platform in which we’ll explore these best practices. As a tutorial example for Java developers, Oracle offers a medical records (MedRec) application — also known as the Avitek Medical Records application, which is a traditional monolith using WebLogic and Java EE.

Using vFunction, we will initiate a “learning” phase using dynamic analysis, static analysis and machine learning based on the call tree and system flows to identify ideal service domains.

Image 1: This services graph displays individual services identified for extraction

In Image 1, we see a services graph in which services are shown as spheres of different sizes and colors, as well as lines (edges) connecting them. Each sphere represents a service that vFunction has automatically identified as related to a specific domain. These services are named and detailed on the right side of the screen.

The size of the sphere represents the number of classes contained within the service. The colors represent the level of class “exclusivity” within each service, referring to the percentage of classes that exist only within that service, as opposed to classes shared across multiple services.

Red represents low exclusivity, blue medium exclusivity and green high exclusivity. Higher class exclusivity indicates better boundaries between services, fewer interdependencies and less code duplication. Taken together, these traits indicate that it will be less complex to refactor highly-exclusive services into microservices.

Images 2 and 3: Solid and dashed lines represent different relationships between services

The solid lines here represent common resources that are shared across the services (Image 2). Common resources include things like beans, synchronization objects, read-only DB transactions and tables, read-write DB transactions and tables, websockets, files and embedded files. The dashed lines represent method calls between the services (Image 3).

The black sphere in the middle represents classes still in the monolith, which contains classes and resources that are not specific to any particular domain, and thus have not been selected as candidates for extraction.

By using automation and AI to analyze and expose new service boundaries previously hidden in the black box of the monolith, you are now able to begin manipulating services inside a suggested reference architecture, clearing the way to make better decisions based on data-driven analysis.

Best Practice #2: Consolidate Functionality and Avoid Duplication

When everything was in the monolith, your visibility was somewhat limited. If you’re able to expose the suggested service boundaries, you can begin to make decisions and test design concepts — for example, identifying overlapping functionality in multiple services.

The Real-World Approach

When does it make sense to consolidate disparate services with similar functionality into a single microservice? The most basic example is that, as an architect, you may see an opportunity to combine two services that appear to overlap — and we can identify these services based on the class names and level of class exclusivity.

Image 4: Two similar services have been identified to be merged

In the services graph (Image 4), we see two similar chat services outlined with a white ring: PatientChatWebSocket and PhysicianChatWebSocket. We can see that the physician chat service (red) has 0% dynamic exclusivity and that the patient chat service (blue) has slightly higher exclusivity at 33%.

Neither of these services is using any shared resources, which indicates that we can merge these into a single service without entangling anything by our actions.

Image 5: Confirming the decision to merge services can be rolled back immediately with the push of a button

By merging two similar services, you are able to consolidate duplicate functionality as well as increase the exclusivity of classes in the newly merged service (Image 5). As we’re using vFunction Platform in this example, everything needed to logically bind these services is taken care of — classes, entry points and resources are intelligently updated.

Image 6: A newly merged single service now represents two previous chat services

Merging services is as simple as dragging and dropping one service onto the other, and after vFunction Platform recalculates the analysis of this action, we see that the sphere is now green, with a dynamic exclusivity of 75% (Image 6). This indicates that the newly-merged service is less interconnected at the class level and gives us the opportunity to extract this service with less complexity.

Best Practice #3: Create Accurate and Meaningful Names for Services

We all know that naming things is hard. When dealing with monolithic services, we can really only use the class names to figure out what is going on. With this information alone, it’s difficult to accurately identify which classes and functionality may belong to a particular domain.

The Real-World Approach

In our example, vFunction has automatically derived service domain names from the class names, shown on the right side of the screen in Image 7. As an architect, you need to be able to rename services according to your preferences and requirements.

Image 7: Rename a merged service to something more accurate

Let’s now go back to the two chat services we merged in the last section. Whereas previously we had a service for both the patient and physician chat, we now have a single service that represents both profiles, so the name PatientChatWebSocket is no longer accurate, and may cause misunderstandings for other developers working on this service in the future. We can decide to select a better name, such as ChatService (Image 7).

Image 8: Rename an automatically identified service to something more meaningful

In Image 8, we can see another service named JaxRSRecordFacadeBroker (+2). The (+2) part here indicates that we have entry points belonging to multiple classes. You may find this name unnecessarily descriptive, so you can change it simply to RecordBroker.

By renaming services in a more accurate and meaningful way, you can ensure that your engineering team can quickly identify and work with future microservices in a straightforward way.

Best Practice #4: Identify Functionality That Shouldn’t Be a Separate Microservice

What qualities suggest that functionality previously contained in a monolith deserves to be a microservice? Not everything should become a microservice, so when would you want to remove a service as a candidate for separation and extraction?

Well, you may decide that some services don’t actually belong in a separate domain, for example, a filter class that simply filters messages. Because this isn’t exclusive to any particular service, you can decide to move it to a common library or another service in the future.

The Real-World Approach

When removing functionality as a candidate for future extraction as a microservice, you are deciding not to treat this class as an individual entry point for receiving traffic. Let’s look at the AuthenticatingAdministrationController service (Image 9), which is a simple controller class.

Image 9: Removing a very simple, non-specific service

In Image 9, we can see that the selected class has low exclusivity by the red color, and also that it is a very small service, containing only one dynamic class, one static class and no resources. You can decide that this should not be a separate service by itself and remove it by dragging and dropping it onto the black sphere in the middle (Image 10).

By relocating this class back to the monolith, we have decided that this particular functionality does not meet the requirements to become an individual microservice.

In this post, we demonstrated some of the best practices that architects and developers can follow to make refactoring a monolithic application into bounded contexts and accurate domains for future microservice extraction.

By using the vFunction Platform, much of the heavy lifting and manual efforts have been automated using AI and data-driven analysis. This ensures that architects and development teams can spend time focusing on refining a reference architecture based on intelligent suggestions, instead of spending thousands of hours manually analyzing small chunks of code without the appropriate “big picture” context to be successful.

The post Monoliths to Microservices: 4 Modernization Best Practices appeared first on The New Stack.

]]>
eBPF or Not, Sidecars are the Future of the Service Mesh https://thenewstack.io/ebpf-or-not-sidecars-are-the-future-of-the-service-mesh/ Fri, 12 Aug 2022 17:00:00 +0000 https://thenewstack.io/?p=22680472

eBPF is a hot topic in the Kubernetes world, and the idea of using it to build a “sidecar-free service

The post eBPF or Not, Sidecars are the Future of the Service Mesh appeared first on The New Stack.

]]>

William Morgan
William is the co-founder and CEO of Buoyant, the creator of the open source service mesh projects Linkerd. Prior to Buoyant, he was an infrastructure engineer at Twitter, where he helped move Twitter from a failing monolithic Ruby on Rails app to a highly distributed, fault-tolerant microservice architecture. He was a software engineer at Powerset, Microsoft, and Adap.tv, a research scientist at MITRE, and holds an MS in computer science from Stanford University.

eBPF is a hot topic in the Kubernetes world, and the idea of using it to build a “sidecar-free service mesh” has generated recent buzz. Proponents of this idea claim that eBPF lets them reduce service mesh complexity by removing sidecars. What’s left unsaid is that this model simply replaces sidecar proxies with multitenant per-host proxies — a significant step backward for both security and operability that increases, not decreases, complexity.

The sidecar model represents a tremendous advancement for the industry. Sidecars allow the dynamic injection of functionality into the application at runtime, while — critically — retaining all the isolation guarantees achieved by containers. Moving from sidecars back to multitenant, shared proxies loses this critical isolation and results in significant regressions in security and operability.

In fact, the service mesh market has seen this first-hand: the first service mesh, Linkerd 1.0, offered a “sidecar-free” service mesh circa 2017 using the same per-host proxy model, and the resulting challenges in operations, management, and security led directly to Linkerd 2.0, which is based on sidecars.

eBPF and sidecars are not an either-or choice, and the assertion that eBPF needs to replace sidecars is a marketing construct, not an actual requirement. eBPF has a future in the service mesh world, but it will be as eBPF and sidecars, not eBPF or sidecars.

eBPF in a Nutshell

To understand why, we first need to understand eBPF. eBPF is a powerful Linux kernel feature that allows applications to dynamically load and execute code directly within the kernel. This can provide a substantial performance boost: rather than continually moving data between kernel and application space for processing, we can do the processing within the kernel itself. This boost in performance means that eBPF opens up an entire class of applications that were previously infeasible, especially in areas like network observability.

But eBPF is not a magic bullet. eBPF programs are very limited, and for good reason: running code in the kernel is dangerous. To prevent bad actors, the kernel must impose significant constraints on eBPF code, not the least of which is the “verifier.” Before an eBPF program is allowed to execute, the verifier performs a series of rigorous static analysis checks on the program.

Automatic verification of arbitrary code is hard, and the consequences of errors are asymmetric: rejecting a perfectly safe program may be an annoyance to developers, but allowing an unsafe program to run would be a major kernel security vulnerability. Because of this, eBPF programs are highly constrained. They can’t block, or have unbounded loops, or even exceed a predefined size. The verifier must evaluate all possible execution paths, which means the overall complexity of an eBPF program is limited.

Thus, eBPF is suitable for only certain types of work. For example, functions that require limited state, e.g., “count the number of network packets that match an IP address and port,” are relatively straightforward to implement in eBPF. Programs that require accumulating state in non-trivial ways, e.g., “parse this HTTP/2 stream and do a regular expression match against a user-supplied configuration”, or even “negotiate this TLS handshake,” are either outright impossible to implement or require Rube Goldberg levels of contortions to make use of eBPF.
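
As a loose illustration of that first kind of program — and very much a sketch, not production code — the bcc toolkit lets Python load a small, verifier-checked C program into the kernel. Here the kernel side counts TCP packets to an assumed destination of 10.0.0.1:8080; the interface name and match criteria are assumptions, and IP options are skipped for brevity. The orchestration is Python, but as eBPF requires, the in-kernel portion is restricted C.

import time

from bcc import BPF

program = r"""
#define KBUILD_MODNAME "xdp_count"
#include <uapi/linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>

BPF_ARRAY(pkt_count, u64, 1);

int count_matching(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    struct tcphdr *tcp = (void *)(ip + 1);   /* assumes no IP options, for brevity */
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    /* Assumed match criteria: destination 10.0.0.1, port 8080. */
    if (ip->daddr == htonl(0x0A000001) && tcp->dest == htons(8080)) {
        u32 key = 0;
        u64 *count = pkt_count.lookup(&key);
        if (count)
            __sync_fetch_and_add(count, 1);
    }
    return XDP_PASS;
}
"""

b = BPF(text=program)
b.attach_xdp("eth0", b.load_func("count_matching", BPF.XDP))   # assumed interface
try:
    while True:
        time.sleep(2)
        for _, value in b["pkt_count"].items():
            print("matching packets so far:", value.value)
except KeyboardInterrupt:
    b.remove_xdp("eth0")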

eBPF and the Service Mesh

Let’s turn now to service meshes. Can we replace our sidecars with eBPF?

As we might expect, given the limitations of eBPF, the answer is no — what the service mesh does is well beyond what pure eBPF is capable of. Service meshes handle all the complexities of modern cloud native networking. Linkerd, for example, initiates and terminates mutual TLS; retries requests in the event of transient failures; transparently upgrades connections from HTTP/1.x to HTTP/2; enforces authorization policies based on cryptographic workload identity; and much more.

eBPF and sidecars are not an either-or choice, and the assertion that eBPF needs to replace sidecars is a marketing construct, not an actual requirement.

Like most service meshes, Linkerd does this by inserting a proxy into each application pod — the proverbial sidecar. In Linkerd’s case, this is the ultralight Linkerd2-proxy “micro proxy,” written in Rust and designed to consume the least amount of system resources possible. This proxy intercepts and augments all TCP communication to and from the pod and is ultimately responsible for implementing the service mesh’s full feature set.

Some of the functionality in this proxy can be accomplished with eBPF. For example, occasionally, the sidecar’s job is simply to proxy a TCP connection to a destination, without L7 analysis or logic. This could be offloaded to the kernel using eBPF. But the majority of what the sidecar does requires significant state and is impossible or at best infeasible to implement in eBPF.

Thus, even with eBPF, the service mesh still needs user-space proxies.

The Case for the Sidecar

If we’re designing a service mesh, where we place the proxies is up to us. From an architectural level, we could place them at the sidecar level, at the host level, at the cluster level, or even elsewhere. But from the operational and security perspective, there’s really only one answer: compared to any of the alternatives, sidecars provide substantial and concrete benefits to security, maintainability, and operability.

A sidecar proxy handles all the traffic to a single application instance. In effect, it acts as part of the application. This results in some significant advantages:

  • Sidecar proxy resource consumption scales with the traffic load to the application, so Kubernetes resource limits and requests are directly applicable.
  • The “blast radius” of sidecar failure is limited to the pod so that Kubernetes’s pod lifecycle controls are directly applicable.
  • Upgrading sidecar proxies is handled the same way as upgrading application code, e.g. via rolling deployments.
  • The security boundary of a sidecar proxy is clearly delineated and tightly scoped: the sidecar proxy contains only the secret material pertaining to that pod, and acts as the enforcement point for the pod. This granular enforcement is central to zero trust approaches to network security.

By contrast, per-host proxies (and other forms of multitenancy, e.g. cluster-wide proxies) handle traffic for whichever arbitrary set of pods Kubernetes schedules on the host. This means all of the operational and security advantages of sidecars are lost:

  • Per-host proxy resource consumption is unpredictable. It is a function of Kubernetes’s dynamic scheduling decisions, meaning that resource limits and requests are no longer useful—you cannot tell ahead of time how much of the system the proxy requires.
  • Per-host proxies must ensure fairness and QoS, or the application risks starvation. This is a non-trivial requirement and no popular proxy is designed to handle this form of “contended multitenancy”.
  • The blast radius for per-host proxies is large and continuously changing. A failure in a per-host proxy will affect whichever arbitrary sets of pods from arbitrary applications were scheduled on the host. Similarly, upgrading a per-host proxy will impact arbitrary applications to arbitrary degrees depending on which pods were scheduled on the machine.
  • The security story is… messy. A per-host proxy must contain the key material for all pods scheduled on that host and must perform enforcement on behalf of all applications scheduled on this host. This turns the proxy into a new attack vector vulnerable to the confused deputy problem, and any CVE or flaw in the proxy now has a dramatically larger security impact.

In short, sidecar proxies build on top of the isolation guarantees gained through containers, allowing Kubernetes and the kernel to enforce security and fairness. Per-host proxies step outside of those guarantees entirely, introducing significant complexities to operations, security, and maintenance.

So Where Do We Go from Here?

eBPF is a big advancement for networking, and can optimize some work from the service mesh by moving it to the kernel. But eBPF will always require userspace proxies. Given that, the right approach is to combine eBPF and sidecars, not to avoid sidecars.

Proposing a sidecar-free service mesh with eBPF is putting the marketing cart before the engineering horse. Of course, “incrementally improving sidecars with eBPF” doesn’t have quite the same buzz factor as “goodbye sidecars,” but from the user perspective, it’s the right decision.

The sidecar model is a tremendous advancement for the industry. It is not without challenges, but it is the best approach we have, by far, to handle the full scope of cloud native networking while keeping the isolation guarantees achieved by adopting containers in the first place. eBPF can augment this model, but it cannot replace it.

The post eBPF or Not, Sidecars are the Future of the Service Mesh appeared first on The New Stack.

]]>
How Kafka and Redis Solve Stream-Processing Challenges https://thenewstack.io/how-kafka-and-redis-solve-stream-processing-challenges/ Mon, 08 Aug 2022 13:40:14 +0000 https://thenewstack.io/?p=22679990

Although streams can be an efficient way of processing huge volumes of data, they come with their own set of

The post How Kafka and Redis Solve Stream-Processing Challenges appeared first on The New Stack.

]]>

Raja Rao
Raja has been an engineer, developer advocate and a tech writer with nearly two decades in the software industry. He’s now transitioned into growth marketing and writes mostly about databases and big data.

Although streams can be an efficient way of processing huge volumes of data, they come with their own set of challenges. Let’s take a look at a few of them.

1. What happens if the consumer is unable to process the chunks as quickly as the producer creates them? Let’s look at an example: What if the consumer is 50% slower than the producer? If we’re starting out with a 10-gigabyte file, that means by the time the producer has processed all 10GB, the consumer would only have processed 5GB. What happens to the remaining 5GB while it’s waiting to be processed? Suddenly that 50 to 100 bytes allocated for data that still needs to be processed would have to be expanded to 5GB.

Picture 1: If the consumer is slower than the producer, you’ll need additional memory.
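
A toy Python sketch of this mismatch, with the sizes and timings scaled down as assumptions: a bounded queue gives the producer backpressure, because without a cap the backlog from a consumer that is 50% slower simply accumulates in memory.

import queue
import threading
import time

buffer = queue.Queue(maxsize=100)   # bounded buffer between producer and consumer

def producer():
    for i in range(1_000):
        buffer.put(f"chunk-{i}")    # blocks (backpressure) once the buffer is full
        time.sleep(0.001)           # producer handles a chunk every 1 ms

def consumer():
    while True:
        chunk = buffer.get()
        time.sleep(0.002)           # consumer is 50% slower: 2 ms per chunk
        buffer.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
buffer.join()   # without the bound, roughly half the chunks would pile up in memory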

2. And that’s just one nightmare scenario. There are others. For example, what happens if the consumer suddenly dies while it’s processing a line? You’d need a way of keeping track of the line that was being processed and a mechanism that would allow you to reread that line and all the lines that follow.

Picture 2: When the consumer fails.

3. Finally, what happens if you need to be able to process different events and send them to different consumers? And, to add an extra level of complexity, what if you have interdependent processing, when the process of one consumer depends on the actions of another? There’s a real risk that you’ll wind up with a complex, tightly coupled, monolithic system that’s very hard to manage. This is because these requirements will keep changing as you keep adding and removing different producers and consumers.

For example (Picture 3), let’s assume we have a large retail shop with thousands of servers that support shopping through web apps and mobile apps.

Imagine that we are processing three types of data related to payments, inventory and webserver logs, and that each has a corresponding consumer: a “payment processor,” an “inventory processor” and a “webserver events processor.” In addition, there is an important interdependency between two of the consumers: before you can process the inventory, you need to verify the payment. Finally, each type of data has a different destination. If it’s a payment event, you send the output to all the systems, such as the database, email system, CRM and so on. If it’s a webserver event, you send it just to the database. If it’s an inventory event, you send it to the database and the CRM.

As you can imagine, this can quickly become quite complicated and messy, and that’s before accounting for the slow-consumer and fault-tolerance issues we’ll need to handle for each consumer. (A naive dispatcher that illustrates this coupling follows Picture 3 below.)

Picture 3: The challenge of tight coupling because of multiple producers and consumers.
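
To see how quickly this routing logic becomes tangled, here is a deliberately naive, tightly coupled dispatcher based on the rules above; every service name is a hypothetical stand-in, and every new producer, consumer or routing rule means editing and redeploying this one function.

```python
# All routing rules and interdependencies live in one place.
def handle_event(event, services):
    kind = event["type"]
    if kind == "payment":
        services["payment_processor"].process(event)
        for target in ("database", "email", "crm"):   # payment events go everywhere
            services[target].send(event)
    elif kind == "inventory":
        # Hidden interdependency: inventory can't proceed until payment is verified.
        if not services["payment_processor"].is_verified(event["order_id"]):
            raise RuntimeError("payment not verified yet; retry later")
        services["inventory_processor"].process(event)
        for target in ("database", "crm"):
            services[target].send(event)
    elif kind == "webserver":
        services["webserver_events_processor"].process(event)
        services["database"].send(event)
```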

Of course, all of this assumes that you’re dealing with a monolithic architecture, that you have a single server receiving and processing all the events. How would you deal with a microservices architecture? In this case, numerous small servers — that is, microservices — would be processing the events, and they would all need to be able to talk to each other. Suddenly, you don’t just have multiple producers and consumers. You have them spread out over multiple servers.

A key benefit of microservices is that they solve the problem of scaling specific services based on changing needs. Unfortunately, although microservices solve some problems, they leave others unaddressed. We still have tight coupling between our producers and consumers, and we retain the dependency between the inventory microservice and the payment one. Finally, the problems we pinpointed in our original streaming example remain unsolved:

  1. We haven’t figured out what to do when a consumer crashes.
  2. We haven’t come up with a method for managing slow consumers that doesn’t force us to vastly inflate the size of the buffer.
  3. We don’t yet have a way to ensure that our data isn’t lost.

These are just some of the main challenges. Let’s take a look at how to address them.

Picture 4: The challenges of tight coupling in the microservices world

Specialized Stream-Processing Systems

As we’ve seen, streams can be great for processing large amounts of data, but they also introduce a set of challenges. New specialized systems such as Apache Kafka and Redis Streams were introduced to solve them. In the world of Kafka and Redis Streams, servers no longer lie at the center; the streams do, and everything else revolves around them.

Data engineers and data architects frequently share this stream-centered worldview. Perhaps it’s not surprising that when streams become the center of the world, everything is streamlined.

Picture 5 shows a direct mapping of the tightly coupled example you saw earlier. Let’s see how it works at a high level.

Picture 5: When we make streams the center of the world, everything becomes streamlined.

  1. Here the streams and the data (events) are first-class citizens, as opposed to the systems that are processing them.
  2. Any system that is interested in sending data (producer), receiving data (consumer) or both sending and receiving data (producer and consumer) connects to the stream-processing system.
  3. Because producers and consumers are decoupled, you can add additional consumers or producers at will. You can listen to any event you want. This makes it perfect for microservices architectures.
  4. If the consumer is slow, you can increase consumption by adding more consumers.
  5. If one consumer depends on another, you can simply listen to the output stream of that consumer and then do your processing. For example, in the picture above, the inventory service receives events from both the inventory stream (purple) and the output of the payment-processing stream (orange) before it processes the inventory event. This is how you solve the interdependency problem.
  6. The data in the streams is persistent (as in a database). Any system can access any data at any time, and if for some reason data wasn’t processed, you can reprocess it (see the sketch after this list).
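
As a rough illustration of those six points, here is a minimal Redis Streams sketch using the redis-py client; it assumes a locally running Redis 5.0+ instance, and the stream, group and consumer names are invented for the example.

```python
import redis

r = redis.Redis(decode_responses=True)

# Producer: any service can append payment events to the stream at any time.
r.xadd("payments", {"order_id": "1001", "amount": "42.50"})

# Consumer group: scale a slow consumer by simply adding more consumers to the group.
try:
    r.xgroup_create("payments", "payment-processors", id="0", mkstream=True)
except redis.ResponseError:
    pass  # the group already exists

# Consumer: read new entries, process them, then acknowledge.
entries = r.xreadgroup("payment-processors", "consumer-1",
                       {"payments": ">"}, count=10, block=5000)
for stream, messages in entries or []:
    for msg_id, fields in messages:
        print("processing", fields)
        # A dependent service (e.g. inventory) can listen to this output stream,
        # which is how the interdependency in point 5 is handled.
        r.xadd("payments-processed", fields)
        r.xack("payments", "payment-processors", msg_id)
```

Because entries stay in the stream after they are read, anything that was delivered but never acknowledged can be claimed and reprocessed later, which is what makes point 6 possible.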

A number of streaming challenges that once seemed formidable, even insurmountable, can readily be solved just by putting streams at the center of the world. This is why data engineers view streams as the center of the world, and why more and more teams are using Kafka and Redis Streams in their data layer.

Learn more about how Kafka and Redis deal with these complex data challenges. Download this free book that features 50+ illustrations to help you understand this complex topic in a fun and engaging way.

The post How Kafka and Redis Solve Stream-Processing Challenges appeared first on The New Stack.

Instacart Speeds ML Deployments with Hybrid MLOps Platform https://thenewstack.io/instacart-speeds-ml-deployments-with-hybrid-mlops-platform/ Fri, 08 Jul 2022 11:00:24 +0000 https://thenewstack.io/?p=22676786


Grocery delivery service Instacart recently launched a new machine learning platform, called Griffin, that tripled the number of ML applications the service deployed in a year.

Instacart began developing its machine learning infrastructure in 2016 with Lore, an open source framework. Years of rapid growth increased the amount, diversity and complexity of its ML applications, and Lore’s monolithic architecture increasingly became a bottleneck.

This bottleneck challenge led to the development of Griffin, a hybrid, extensible platform that supports diverse data management systems, and integrates with multiple ML tools and workflows. Sahil Khanna’s recent blog post goes into great detail about Griffin, including its benefits, components, and workflows.

Instacart relies heavily on machine learning for product and operation innovations. Such innovations don’t come easily, as multiple machine learning models often must work together to provide a service. Griffin, built by the machine learning infrastructure team, now plays a foundational role in supporting the following machine learning applications and empowering innovations.

In short, Griffin offers the following benefits to the service:

  • Aids customers with locating the correct item in a catalog of over 1 billion products.
  • Supports 600,000+ shoppers with the delivery of products to millions of customers in the US and Canada.
  • Incorporates AI into Instacart’s support of their 800+ retailers across 70,000+ stores in 5,000+ cities in the US and Canada.
  • Enables 5,000+ brand partners to connect their products to potential partners.

Griffin: Instacart’s MLOps Platform

To allow Instacart to stay current with innovations in the state of the art of ML operations (MLOps) while also deploying specialized and diverse solutions, Griffin was designed as a hybrid model. Griffin allows Machine Learning Engineers (MLEs) to use third-party solutions such as Snowflake, Amazon Web Services, Databricks and Ray to support diverse use cases, along with in-house abstraction layers that provide unified access to those solutions.

Griffin was created with the main goals of helping MLEs quickly iterate on machine learning models, effortlessly manage product releases, and closely track production applications. With that in mind, the system was built with these major considerations:

  • Scalability: It needs to support thousands of machine learning applications.
  • Extensibility: It needs to be flexible enough to extend and integrate with a number of data management and machine learning tools.
  • Generality: It needs to provide a unified workflow and consistent user experience despite broad integration with third-party solutions.

The diagram below illustrates Griffin’s system architecture.

These considerations are clearly illustrated in the diagram above. Griffin integrates multiple SaaS solutions, including Redis, Scylla and S3, demonstrating its extensibility, and that integration supports growth at Instacart, showing its scalability. The unified interface it presents to MLEs shows Griffin’s generality.

Instacart can develop specialized solutions for distinct use cases (such as real-time recommendations) as a result of the four foundational components introduced below.

  • MLCLI: The in-house machine learning command-line interface used to develop machine learning applications and manage the model lifecycle.
  • Workflow Manager and ML Launcher: The orchestrator that schedules and manages machine learning pipelines and containerizes task execution.
  • Feature Marketplace: Uses third-party platforms for real-time and batch feature engineering.
  • Training and Inference Platform: The framework-agnostic training and inference platform for adopting open source frameworks.

MLCLI

MLCLI allows MLEs to customize and execute tasks such as training, evaluation, and inference in their applications within containers (Docker for example). Containerization eliminates bugs caused by variations in execution environments and provides a unified interface.

The diagram below illustrates MLCLI features used by MLEs during ML application development.
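
MLCLI itself is internal to Instacart, so the snippet below is only a hypothetical illustration of the idea it describes: wrap every task in the same container image so that training, evaluation and inference all run in an identical environment. The image name, module name and task names are invented for the example.

```python
import os
import subprocess
import sys

IMAGE = "my-ml-app:latest"   # assumed image with the application and its ML dependencies

def run_task(task: str) -> None:
    # Running inside the image keeps the execution environment identical on every
    # machine, which is what eliminates environment-specific bugs.
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{os.getcwd()}:/app", "-w", "/app",
         IMAGE, "python", "-m", "ml_app", task],
        check=True,
    )

if __name__ == "__main__":
    # e.g. `python ml_cli.py train`, `python ml_cli.py evaluate`, `python ml_cli.py infer`
    run_task(sys.argv[1] if len(sys.argv) > 1 else "train")
```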

Workflow Manager and ML Launcher

Workflow Manager handles the scheduling and managing of the machine learning pipelines. It leverages Airflow to schedule containers and utilizes ML Launcher, an in-house abstraction, to containerize task execution.

ML Launcher integrates third-party compute backends such as SageMaker, Databricks, and Snowflake to perform container runs and meet the unique hardware requirements of ML workloads. Instacart chose this design because it allows scaling up to hundreds of Directed Acyclic Graphs (DAGs) with thousands of tasks in a short period without worrying about Airflow runtime limits.

The diagram below illustrates the Architecture Design of Workflow Manager and ML Launcher.
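
As a rough sketch of this pattern (not Instacart’s actual code), the Airflow DAG below keeps the scheduler’s tasks lightweight and hands the heavy ML work to container runs; the DAG name, image and scripts are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="train_ranking_model",          # hypothetical pipeline name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    featurize = BashOperator(
        task_id="featurize",
        bash_command="docker run --rm my-ml-image:latest python featurize.py",
    )
    train = BashOperator(
        task_id="train",
        bash_command="docker run --rm my-ml-image:latest python train.py",
    )
    featurize >> train   # train only after the features are materialized
```

Because each task is just a thin launcher, adding hundreds of DAGs mostly adds scheduling metadata rather than compute load on the Airflow workers.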

Feature Marketplace (FM)

With data at the center of any MLOps platform, Instacart developed its FM product to support both real-time and batch feature engineering. FM manages feature computation, provides feature storage, supports feature discoverability, eliminates offline/online feature drift and allows feature sharing. The product uses third-party platforms such as Snowflake, Spark, and Flink, and integrates multiple storage backends: Scylla, Redis, and S3.

The diagram below illustrates the Architecture Design of Feature Marketplace.
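
A minimal sketch of the “compute once, serve twice” idea behind such a feature store follows; it is not Instacart’s FM, and the Redis key layout and offline file are placeholders. Writing the same computed value to an online store (for low-latency inference) and an offline store (for training) is what keeps online and offline features from drifting apart.

```python
import json
import redis

def compute_features(order_history: list) -> dict:
    return {"order_count_30d": len(order_history)}   # toy feature computation

def publish(user_id: str, features: dict, r: redis.Redis,
            offline_path: str = "features.jsonl") -> None:
    # Online store: fast lookups at inference time.
    r.hset(f"features:{user_id}", mapping=features)
    # Offline store: an append-only log that training jobs can consume later.
    with open(offline_path, "a") as f:
        f.write(json.dumps({"user_id": user_id, **features}) + "\n")

r = redis.Redis(decode_responses=True)
publish("user-42", compute_features(["o1", "o2", "o3"]), r)
```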

Inference and Training Platform

The Inference and Training Platform allows MLEs to define the model architecture and inference routine that customize their applications, which is what allowed Instacart to triple the number of ML applications in one year. Instacart standardized package, metadata and code management to support diversity in frameworks and ensure reliable model deployment. Frameworks already adopted include TensorFlow, XGBoost, and Faiss.

The diagram below illustrates the Architecture Design of the Inference and Training Platform.
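
As a hypothetical sketch of what a framework-agnostic contract can look like (not Instacart’s actual interface), the snippet below lets each application define how to build its model and how to serve an inference request, so the platform depends only on the contract rather than on TensorFlow, XGBoost or Faiss specifics.

```python
from abc import ABC, abstractmethod
from typing import Any

class ModelBundle(ABC):
    @abstractmethod
    def build(self) -> Any:
        """Construct and return the underlying model object."""

    @abstractmethod
    def predict(self, model: Any, payload: dict) -> dict:
        """Run a single inference request against the model."""

class PopularityModel(ModelBundle):            # toy implementation of the contract
    def build(self) -> Any:
        return {"bananas": 0.9, "milk": 0.8}   # stand-in for a trained model

    def predict(self, model: Any, payload: dict) -> dict:
        return {"score": model.get(payload["item"], 0.0)}

bundle = PopularityModel()
model = bundle.build()
print(bundle.predict(model, {"item": "bananas"}))
```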

A Few Key Learnings

Some valuable lessons were learned during the development of Griffin.

  • Buy vs. Build: Utilizing third-party solutions is important for supporting a quickly growing feature set and avoiding reinventing the wheel. Careful platform integration is key to switching seamlessly between solutions while keeping migration overhead low.
  • Make Incremental Progress: Regular hands-on codelabs and onboarding sessions encouraged early feedback and collaboration, streamlined design decisions and kept the design simple. This environment prevented engineers from going down the rabbit hole of trying to design the “perfect” platform.

The post Instacart Speeds ML Deployments with Hybrid MLOps Platform appeared first on The New Stack.
