Managed Cloud Services Overview and News | The New Stack

Managing Kubernetes Complexity in Multicloud Environments

Hemanth Kavuluru — Thu, 15 Jun 2023 14:48:40 +0000

Kubernetes has become the ubiquitous choice as the container orchestration platform for building and deploying cloud native applications. As enterprises adopt Kubernetes, one of the key decisions they have to make is around adopting a multicloud strategy. It’s essential to understand the factors driving the need for a solution across public cloud providers such as Amazon Web Services (AWS) , Azure, GCP, Oracle, Alibaba, etc., and validate whether those factors are relevant currently or in the future. Some factors that influence multicloud strategy are:

Specialized cloud technology needs/requirements for particular applications
Multiple business units adopting separate clouds
GDPR and other locality considerations
Disaster recovery
Mergers and acquisitions of other businesses that have adopted different clouds
Dependency on a cloud-managed service

Specialized Cloud Technology Needs/Requirements for a Particular Application

Some applications require specialized cloud services only available on specific cloud platforms. For example, Google Big Table is a NoSQL database only available on Google Cloud. Similarly, Azure has specialized machine learning and AI services, such as Azure Cognitive Services.

In such scenarios, enterprises need to deploy their applications across multiple clouds to access the specialized services required for their applications. This approach can also help organizations optimize costs by choosing the most cost-effective cloud service for each application.

Multiple Business Units Adopting Separate Clouds

In large organizations, different business units may have unique requirements for their cloud services, leading to the adoption of separate cloud services. For example, one business unit may prefer Google Cloud for its machine learning capabilities, while another may prefer AWS for its breadth of services. As a result, the cloud environment becomes fragmented, and deploying applications across multiple clouds becomes complex.

GDPR and Other Locality Considerations

Regional regulations can also drive the need for a multicloud approach. For example, enterprises may need to store and process data in specific regions to comply with data residency regulations. For instance, Alibaba Cloud is China’s leading cloud provider and the preferred cloud in that region.

Deploying applications across multiple clouds in different regions can help enterprises meet their data residency and compliance requirements.

Disaster Recovery

Implementing disaster recovery in the right manner is essential for enterprises, as downtime can lead to significant revenue loss and reputational damage. A multicloud approach can help enterprises ensure business continuity by deploying applications across multiple clouds. In such scenarios, primary applications can run in one cloud while secondary applications can run in another for disaster recovery.

This approach can also help enterprises optimize their costs by choosing the most cost-effective cloud service for disaster recovery.

Mergers and Acquisitions

When organizations merge, they may have different cloud environments that must be integrated. Similarly, when organizations acquire other companies, they may need to integrate the acquired company’s cloud environment with their existing cloud environment, hence the need for a multicloud approach.

Dependency on a Particular Cloud Service

Enterprises may need to deploy applications in a particular cloud due to the dependency on a specific service that a specific cloud provider only offers. For example, an organization may require managed Oracle for its databases or SAP HANA for its ERP systems. In this case, deploying the applications in the same cloud is necessary to be closer to the database. Platform and site reliability engineering (SRE) teams must now acquire skills to manage Kubernetes infrastructure on a new public cloud. Platform teams must thoroughly understand all their application team requirements to see whether any of their applications will fall into this category.

How to Manage Multicloud Kubernetes Operations with a Platform Approach

Enterprises may want to invest in a true Kubernetes operations platform if the multicloud deployment is a critical requirement now or in the future. A true Kubernetes operations platform helps enterprises develop standardized automation across clouds while leveraging public cloud Kubernetes distributions such as AWS EKS, Azure AKS, Google GKE, etc. On the other hand, deploying and managing Kubernetes infrastructure on multiple clouds without a Kubernetes operations platform requires a lot of manual effort and can lead to substantial operational costs, operational inconsistencies, project delays, etc.

A Kubernetes operations platform can standardize the process for deploying and managing Kubernetes clusters across multiple clouds. Enterprises can use a unified interface to automate the deployment and management of Kubernetes clusters across multiple clouds. This automation helps improve consistency and reduce the risk of human error. It also reduces the need for specialized skills.
Enterprises also need to maintain a unified security posture across clouds. In a multicloud environment, each cloud provider has its own security policies, which makes it hard for enterprises to implement standard security policies across the clouds. A Kubernetes operations platform can provide consistent security policies across clouds, enforcing governance and compliance uniformly.
Consistent policy management and network security policies across clouds are critical for adopting multicloud Kubernetes deployments. A Kubernetes operations platform should provide standardized workflows for applying network security and Open Policy Agent (OPA) policies for Kubernetes clusters spanning clouds. Policies, including network policies, ingress and egress rules, can be defined in a centralized location and deployed to all Kubernetes clusters, ensuring consistency and reducing operational complexity.
A true Kubernetes operations platform should provide a unified bimodal multitenancy (cluster and namespace) across clouds. This platform should allow multiple teams and applications to share the same Kubernetes clusters without affecting each other, providing better resource utilization and cost efficiency. Similarly, for teams, applications or environments that require dedicated clusters, the Kubernetes platform should offer cluster-as-a-service where the individual teams can create their clusters in a self-serve manner adhering to the security, governance and compliance set by the platform and SRE teams.
Kubernetes access control, role-based access control (RBAC) and single sign-on (SSO) across all clouds are essential for a Kubernetes operations platform. However, access management becomes increasingly complex when deploying Kubernetes across multiple clouds. A unified access management solution can simplify the process and reduce the security risk.
Finally, a single pane of administration offering visibility for the entire infrastructure spanning multiple clouds is essential for a Kubernetes operations platform. A single management plane can provide centralized visibility into Kubernetes clusters across multiple clouds, allowing enterprises to monitor, manage and troubleshoot their Kubernetes clusters more efficiently.

Conclusion

A multicloud strategy may be an important consideration for enterprises that are adopting a Kubernetes operations platform for managing their Kubernetes infrastructure. Enterprises should carefully look at all factors that influence a multicloud deployment and decide whether multicloud is required for their organization. A true multicloud Kubernetes operations platform should provide standardized automation, consistent security policies, unified Kubernetes bimodal multitenancy, access management and a single administration pane, offering visibility for the entire infrastructure spanning multiple clouds.

The post Managing Kubernetes Complexity in Multicloud Environments appeared first on The New Stack.

3 AI Moves Smart Companies Get Right

Loraine Lawson — Wed, 14 Jun 2023 15:41:17 +0000

Artificial intelligence leaders get three moves right when it comes to creating outcomes: priorities, people and platforms. That’s according to Nick Elprin, co-founder and CEO of machine learning/AI platform Domino Data Lab at this month’s Rev4 conference.

Priorities may seem like an obvious one, but companies do get it wrong, he said.

“Too many companies make the mistake of starting with some interesting data set they have or some trendy or novel new technique or algorithm, and they ask what can I do with this?” Elprin said. “In contrast, AI leaders working backwards, they start from a strategic objective or a business goal and they ask how can AI help me achieve this.”

Surprisingly, many companies also don’t talk about KPIs or business goals, he added — instead, many seem to view it as a shiny new toy without having clarity around how it will help their businesses, he said.

People and Platforms

Once there’s clarity around priorities, AI leaders build their talent strategy around a core of professional data scientists.

“That doesn’t mean that everyone has to be a Ph.D. in computer science, but what it does mean is that you need people inside your organization who have the expertise and the knowledge and a sound fundamental understanding of the methods and techniques involved in this type of work,” Elprin told audiences.

He shared customer testimonies about Domino’s support for collaboration across people and — perhaps more importantly to programmers and data scientists — different programming languages, including Python and R. He also predicted that a new wave of advanced AI, with its more complex models, is going to be the death knell for “citizen data scientist experiments.”

“They have a wider range of unexpected failure modes and negative consequences for the model from unexpected model behavior,” he said. “So it’s going to be ineffective and risky to have citizens doing the heavy lifting and building operators models.”

The third step is to empower them with technology and platforms for operating AI, he added.

“It [AI] is unlike anything that most businesses have had to build or operate or manage in the past, and it has some important implications for the kinds of technology you need to empower enable this sort of work,” he said.

How Domino Data Lab Differentiates

Domino Lab has built a business model on the premise of a purpose-built system. It handles the infrastructure and integration pieces, allowing a data scientist to start with a smaller footprint and then scale up — whether that means more GPU, CPU or whatever — as needed, without rebuilding. That’s one way it differentiates itself from the big cloud providers, who focus on compute and use proprietary platforms. It primarily competes against these cloud providers, custom solutions and, to some extent, the SAS Institute.

The company announced a number of new capabilities at its Rev4 conference in New York, starting with Code Assist for hyperparameter tuning of foundation models. Ramanan Balakrishnan, vice president of product marketing demoed deploying a new chatbot. He shared how experiment managers can enable automatic login of key metrics and artifacts during normal training to monitor the progress of AI experiments, including model training and fine-tuning. Domino Data Lab has also added enterprise security to ensure only approved personnel can see the metrics, logs and artifacts.

The summer release, which will be available in August, also includes advanced cost management tools. Specifically, Domino introduced detailed controls for actionable cost management. Balakrishna also introduced Model Sentry, a responsible AI solution for in-house generative AI. One aspect of Model Sentry that will be of interest to international companies is that it supports the training of models using on-premise GPUs, so data isn’t moved across borders, he said.

Domino Cloud will now include Nexus support. Users can now use a fully-managed control plane in the cloud with single-pane access to private hybrid data planes, including NVIDIA DGX clusters. Finally, Domino has a new Domino Cloud for Life Sciences, which incorporates an audit-ready specialized AI cloud platform with a Statistical Computing Environment to address the unique needs of the pharmaceutical industry.

“It’s fair to say that now we live in a new era of AI,” Balakrishna said.

Domino Data Lab paid for The New Stack’s travel and accommodations to attend the Rev4 conference.

The post 3 AI Moves Smart Companies Get Right appeared first on The New Stack.

Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All

Scott M. Fulton III — Tue, 13 Jun 2023 13:00:28 +0000

In any organizational structure, once you break down regular jobs into overly granularized tasks and delegate them to too many individuals, their messaging soon becomes unmanageable, and the organization stops growing.

Last March 22, in a blog post that went unnoticed for several weeks, Amazon Prime Video’s engineers reported the service quality monitoring application they had originally built to determine quality-of-service (QoS) levels for streaming videos — an application they built on a microservices platform — was failing, even at levels below 10 percent of service capacity.

What’s more, they had already applied a remedy: a solution their post described as “a monolith application.”

The change came at least five years after Prime Video — home of on-demand favorites such as “Game of Thrones” and “The Marvelous Mrs. Maisel” — successfully outbid traditional broadcast outlets for the live-streaming rights to carry NFL Thursday Night Football.

One of the leaders in on-demand streaming now found itself in the broadcasting business, serving an average 16.6 million real-time viewers simultaneously. To keep up with live sports viewers’ expectations of their “networks” — in this case, CBS, NBC, or Fox — Prime Video’s evolution needed to accelerate.

It wasn’t happening. When the 2022 football season kicked off last September, too many of Prime Video’s tweets were prefaced with the phrase, “We’re sorry for the inconvenience.”

Prime Video engineers overcame these glitches, the engineers’ blog reported, by consolidating QoS monitoring operations that had been separated into isolated AWS Step Functions and Lambda functions, into a unified code module.

As initially reported, their results appeared to finally confirm many organizations’ suspicions, well-articulated over the last decade, that the costs incurred in maintaining system complexity and messaging overhead inevitably outweighed any benefits to be realized from having adopted microservices architecture.

Once that blog post awakened from its dormancy, several experts declared all of microservices architecture dead. “It’s clear that in practice, microservices pose perhaps the biggest siren song for needlessly complicating your system,” wrote Ruby on Rails creator David Heinemeier Hansson. “Are we seeing a resurgence of the majestic monolith?” asked .NET MVP Milan Jovanović on Twitter. “I hope so.”

“That’s great news for Amazon because it will save a ton of money,” declared Jeff Delaney on his enormously popular YouTube channel Fireship, “but bad news for Amazon because it just lost a great revenue source.”

Yet there were other experts, including CodeOpinion.com’s Derek Comartin, who compared Prime’s “before” and “after” architectural diagrams with one another, and noticed some glaring disconnects between those diagrams and their accompanying narrative.

As world-class experts speaking with the New Stack also noticed, and as a high-ranking Amazon Web Services engineer finally confirmed for us, the solution Prime Video adopted not only fails to fit the profile of a monolithic application. In every respect that truly matters, including scalability and functionality, it is a more evolved microservice than what Prime Video had before.

That Dear Perfection

“This definitely isn’t a microservices-to-monolith story,” remarked Adrian Cockcroft, the former vice president of cloud architecture strategy at AWS, now an advisor for Nubank, in an interview with The New Stack. “It’s a Step Functions-to-microservices story. And I think one of the problems is the wrong labeling.”

Cockcroft, as many regular New Stack readers will be familiar, is one of microservices architecture’s originators, and certainly its most outspoken champion. He has not been directly involved with Prime Video or AWS since becoming an advisor, but he’s familiar with what actually happened there, and he was an AWS executive when Prime’s stream quality monitoring project began. He described for us a kind of prototyping strategy where an organization utilizes AWS Step Functions, coupled with serverless orchestration, for visually modeling business processes.

With this adoption strategy, an architect can reorganize digital processes essentially at will, eventually discovering their best alignment with business processes. He’s intimately familiar with this methodology because it’s part of AWS’ best practices — advice which he himself co-authored. Speaking with us, Cockcroft praised the Prime Video team for having followed that advice.

As Cockcroft understands it, Step Functions was never intended to run processes at the scale of live NFL sports events. It’s not a staging system for processes whose eventual, production-ready state would need to become more algorithmic, more efficient, more consolidated. So the trick to making the Step Functions model workable for more than just prototyping is not just to make the model somewhat scalable, but also transitional.

“If you know you’re going to eventually do it at some scale,” said Cockcroft, “you may build it differently in the first place. So the question is, do you know how to do the thing, and do you know the scale you’re going to run it at? Those are two separate cases. If you don’t know either of those, or if you know it’s small-scale, complex, and you’re not exactly sure how it’s going to be built, then you want to build a prototype that’s going to be very fast to build.”

However, he suggested, if an organization knows from the outset its application will be very widely deployed and highly scalable, it should optimize for that situation by investing in more development time up-front. The Prime Video team did not have that luxury. In that case, Cockcroft said, the team was following best practices: building the best system they could, to accomplish the business objectives as they interpreted them at the time.

“A lot of workloads cost more to build than to run,” Cockcroft explained. “[For] a lot of internal corporate IT workloads, lots of things that are relatively small-scale, if you’re spending more on the developers than you are on the execution, then you want to optimize for saving developer time by building it super-quickly. And I think the first version… was optimized that way; it wasn’t intended to run at scale.”

As any Step Functions-based system becomes refined, according to those same best practices, the next stage of its evolution will be transitional. Part of that metamorphosis may involve, contrary to popular notions, service consolidation. Despite how Prime Video’s blog post described it, the result of consolidation is not a monolith. It’s now a fully-fledged microservice, capable of delivering those 90% cost reductions engineers touted.

“This is an independently scalable chunk of the overall Prime Video workload,” described Cockcroft. “If they’re not running a live stream at the moment, it would scale down or turn off — which is one reason to build it with Step Functions and Lambda functions to start with. And if there’s a live stream running, it scales up. That’s a microservice. The rest of Prime Video scales independently.”

The New Stack spoke with Ajay Nair, AWS’ general manager for Lambda and for its managed container service App Runner. Nair confirmed Cockcroft’s account in its entirety for how the project was initially framed in Step Functions, as well as how it ended up a scalable microservice.

Nair outlined for us a typical microservices development pattern. Here, the original application’s business processes may be too rigidly coupled together to allow for evolution and adaptation. So they’re decoupled and isolated. This decomposition enables developers to define the contracts that spell out each service’s expected inputs and outputs, requirements and outcomes. For the first time, business teams can directly observe the transactional activities that, in the application’s prior incarnations, had been entirely obscured by its complexity and unintended design constraints.

From there, Nair went on, software engineers may codify the isolated serverless functions as services. In so doing, they may further decompose some services — as AWS did for Amazon S3, which is now served by over 300 microservice classes. They may also consolidate other services. One possible reason: Observing their behavior may reveal they actually did not need to be scaled independently after all.

“It is a natural evolution of any architecture where services that are built get consolidated and redistributed,” said Nair. “The resulting capability still has a well-established contract, [and] has a single team managing and deploying it. So it technically meets the definition of a microservice.”

Breakdown

“I think the definition of a microservice is not necessarily crisp,” stated Brendan Burns, the co-creator of Kubernetes, now corporate vice president at Microsoft, in a note to The New Stack.

“I tend to think of it more in terms of capabilities around functionality, scaling, and team size,” Burns continued. “A microservice should be a consistent function or functions — this is like good object-oriented design. If your microservice is the CatAndDog() service, you might want to consider breaking that into Cat() and Dog() services. But if your microservice is ThatOneCatOnMyBlock(), it might be a sign that you have broken things down too far.”

“The level of granularity that you decompose to,” explained F5 Networks Distinguished Engineer Lori MacVittie, speaking with The New Stack, “is still limited by the laws of physics, by network speed, by how much [code] you’re actually wrapping around. Could you do it? Could you do everything as functions inside a containerized environment, and make it work? Yes. It’d be slow as heck. People would not use it.”

Adrian Cockcroft advises that the interpretability of each service’s core purpose, even by a non-developer, should be a tenet of microservice architecture itself. That fact alone should mitigate against poor design choices.

“It should be simple enough for one person to understand how it works,” Cockcroft advocated. “There are lots of definitions of microservices, but basically, you’ve partitioned your problem into multiple, independent chunks that are scaled independently.”

“Everything we’re describing,” remarked F5’s MacVittie, “is just SOA without the standards… We’re doing the same thing; it’s the same pattern. You can take a look at the frameworks, objects, and hierarchies, and you’d be like, ‘This is not that much different than what we’ve been doing since we started this.’ We can argue about that. Who wins? Does it matter? Is Amazon going to say, ‘You’re right, that’s a big microservice, thank you?’ Does it change anything? No. They have solved a problem that they had, by changing how they design things. If they happen to stumble on what they should have been doing in the first place, according to the experts on the Internet, great. It worked for them. They’re saving money, and they did expose one of those problems with decomposing something too far, on a set of networks on the Internet that is not designed to handle it yet.

“We are kinda stuck by physics, right?” she continued. “We’re unlikely to get any faster than we are right now, so we have to work around that.”

Perhaps you’ve noticed: Enterprise technology stories thrive on dichotomy. For any software architecture to be introduced to the reader as something of value, vendors and journalists frame it in opposition to some other architecture. When an equivalent system or methodology doesn’t yet exist, the new architecture may end up being portrayed as the harbinger of a revolution that overturns tradition.

One reason may be because the discussion online is being led either by vendors, or by journalists who tend to speak with vendors first.

“There is this ongoing disconnect between how software companies operate, and how the rest of the world operates,” remarked Platify Insights analyst Donnie Berkholz. “In a software company, you’ve got ten times the staffing and software engineering on a per capita basis across the company, as you do in many other companies. That gives you a lot of capacity and talent to do things that other people can’t keep up with.”

Maybe the big blazing “Amazon” brand obscured the fact — despite the business units’ proximity to one another — that Prime Video was a customer of AWS. With its engineers’ blog post, Prime joined an ongoing narrative that may have already spun out of control. Certain writers may have focused so intently upon selected facets of microservices architecture, that they let readers draw their own conclusions about what the alternatives to that architecture must look like. If microservices were, by definition, small (an aspect that one journalist in particular was guilty as hell of over-emphasizing), its evil counterpart must be big, or bigness itself.

Subsequently, in a similar confusion of scale, if Amazon Prime Video embraces a monolith, so must all of Amazon. Score one come-from-behind touchdown for monoliths in the fourth quarter, and cue the Thursday Night Football theme.

“We’ve seen the same thing happening over and over across the years,” mentioned Berkholz. “The leading-edge software companies, web companies, and startups encounter a problem because they’re operating at a different scale than most other companies. And a few years later, that problem starts to hit the masses.”

Buildup

The original “axis of evil” in the service-orientation dichotomy was 1999’s Big Ball of Mud. First put forth by Professors Brian Foote and Joseph Yoder of the University of Illinois at Urbana-Champaign, the Big Ball helped catalyze a resurgence in support for distributed systems architecture. It was seated at the discussion table where the monolith sits now, but not for the same reasons.

The Big Ball wasn’t a daunting tower of rigid, inflexible, tightly-coupled processes, but rather programs haphazardly heaped onto other programs, with data exchanged between them by means of file dumps onto floppy disks carried down office staircases in cardboard boxes. Amid the digital chaos of the 1990s and early 2000s, anything definable as not a Big Ball of Mud, was already halfway beautiful.

“Service Oriented Architecture was actually the same idea as microservices,” recalls Forrester senior analyst David Mooter. “The idea was, you create services that align with your business capabilities and your business operating model. Most organizations, what they heard was, ‘Just put stuff [places] and do a Web service,’ [the result being] you just make things SOAP. And when you create haphazard SOAP, you create Distributed Little Balls of Mud. SOA got a bad name because everyone was employing SOA worst practices.”

Mooter shared some of his latest opinions in a Forrester blog post entitled, “The Death of Microservices?” In an interview with us, he noted, “I think you’re seeing, with some of the reaction to this Amazon blog, when you do microservices worst practices, and you blame microservices rather than your poor architectural decisions, then everyone says microservices stink… Put aside microservices: Any buzzword tech trend cannot compensate for poor architectural decisions.”

The sheer fact that “Big Ball” is a nebulous, plastic metaphor has enabled almost any methodology or architecture that fell out of favor over the past quarter-century, to become associated with it. When microservices makes inroads with organizations, it’s the monolith that gets to wear the crown of thorns. More recently, with some clever phraseology, microservices has carried the moniker of shame.

“Our industry swings like a pendulum between innovation, experimentation, and growth (sometimes just called ‘peacetime’) and belt-tightening and pushing for efficiency (‘wartime’),” stated Laura Tacho, long-time friend of The New Stack, and a professional engineering coach. “Of course, most companies have both scenarios going on in different pockets, but it’s obvious that we’re in a period of belt-tightening now. This is when some of those choices — for example, breaking things into microservices — can no longer be justified against the efficiency losses.”

Berkholz has been observing the same trend: “There’s been this push back-and-forth within the industry — some sort of a pendulum happening, from monolith to microservices and back again. Years ago, it was SOA and back again.”

Defenders of microservices against the mud-throwing that happens when the pendulum swings back, say their architecture won’t be right for every case, or even every organization. That’s a problem. Whenever a market is perceived as being served by two or more equivalent, competing solutions, that market may correctly be portrayed as fragmented. Which is exactly the kind of market enterprises typically avoid participating in.

“Fragmentation implies that the problem hasn’t been well-solved for everybody yet,” Berkholz told us, “when there’s a lot of different solutions, and nobody’s consolidated on a single one that makes sense most of the time. That is something that companies watch. Is this a fragmented ecosystem, where it’s hard to make choices? Or is this an ecosystem where there’s a clear and obvious master?”

From time to time, Lori MacVittie told us, F5 Networks surveys its clients, asking them for the relative percentages of their applications portfolios they would describe as monoliths, microservices, mobile apps and middleware-infused client/server apps. “Most organizations were operating at some percentage of each of those,” she told us. When the question was adjusted, asking only whether their apps were “traditional” or “modern,” the split usually has been 60/40, respectively.

“They’re doing both,” she said. “And within those, they’re doing different styles. Is that a mess? I don’t think so. They had specific uses for them.”

“I kind of feel like microservice-vs.-monolith isn’t a great argument,” stated Microsoft’s Brendan Burns. “It’s like arguing about vectors vs. linked lists or garbage collection vs. memory management. These designs are all tools — what’s important is to understand the value that you get from each, and when you can take advantage of that value. If you insist on microservicing everything, you’re definitely going to microservice some monoliths that probably you should have just left alone. But if you say, ‘We don’t do microservices,’ you’re probably leaving some agility, reliability and efficiency on the table.”

The Big Ball of Mud metaphor’s creators cited, as the reason software architectures become bloated and unwieldy, Conway’s Law: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” Advocates of microservices over the years have taken this notion a few steps further, suggesting business structures and even org charts should be deliberately remodeled to align with software, systems, and services.

When the proverbial pendulum swings back, notes Tacho, companies start reconsidering this notion. “Perhaps it’s not only Conway’s Law coming home to roost,” she told us, “but also, ‘Do market conditions allow us to take a gamble on ignoring Conway’s Law for the time being, so we could trade efficiency for innovation?’”

Continuing her war-and-peace metaphor, Tacho went on: “Everything’s a tradeoff. Past decisions to potentially slow development down and make processes less efficient due to microservices might have been totally fine during peacetime, but having to continuously justify those inefficiencies, especially during a period of belt-tightening, is tiresome. What surprises me sometimes is that rearchitecting a large codebase is not something that most companies would invest in during wartime. They simply have to have other priorities with a better ROI for the business, but big fish like Amazon have more flexibility.”

“The first thing you should look at is your business,” advised Forrester’s Mooter, “and what is the right architecture for that? Don’t start with microservices. Start with, what are the business outcomes you’re trying to achieve? What Forrester calls, ‘Outcome-Driven Architecture.’ How do we align our IT systems and infrastructure and applications, to optimize your ability to deliver that? It will change over time.”

“It’s definitely the case,” remarked Microsoft’s Burns, “that one of the benefits of microservices design is that it enables small teams to behave autonomously because they own very specific APIs with crisp contracts between teams. If the rest of your development culture prevents your small teams from operating autonomously, then you’re never going to gain the agility benefits of microservices. Of course, there are other benefits too, like increased resiliency and potentially improved efficiency from more optimal scaling. It’s not an all-or-nothing, but it’s also the case that an engineering culture that is structured for independence and autonomy is going to do better when implementing microservices. I don’t think that this is that much different than the cultural changes that were associated with the DevOps movement a decade ago.”

Prime Video made a huge business gamble on NFL football rights, and the jury is still out as to whether, over time, that gamble will pay off. That move lit a fire under certain sensitive regions of Prime Video’s engineering team. The capabilities they may have planned to deliver three to five years hence, were suddenly needed now. So they made an architectural shift — perhaps the one they’d planned on anyway, or maybe an adaptation. Did they enable business flexibility down the road, as their best practices advised? Or have they just tied Prime Video down to a service contract, to which their business will be forced to adapt forever? Viewed from that perspective, one could easily forget which option was the monolith, and which was the microservice.

It’s a dilemma we put to AWS’ Ajay Nair, and his response bears close scrutiny, not just by software engineers: “Building an evolvable architectural software system is a strategy, not a religion.”

The post Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All appeared first on The New Stack.

Open Sourcing AWS Cedar Is a Game Changer for IAM

Or Weis — Mon, 12 Jun 2023 17:00:36 +0000

In today’s cloud native world, managing permissions and access control has become a critical challenge for many organizations. As applications and microservices become more distributed, it’s essential to ensure that only the right people and systems have access to the right resources.

However, managing this complexity can be difficult, especially as teams and organizations grow. That’s why the launch of Cedar, a new open source project from Amazon Web Services, is a tectonic shift in the identity and resource management (IAM) space, making it clear that the problem of in-app permissions has grown too big to ignore.

Traditionally, organizations have relied on access control lists (ACLs) and role-based access control (RBAC) to manage permissions. However, as the number of resources and users grows, it becomes difficult to manage and scale these policies. This is where policy as code emerges as a de facto standard. It enables developers to write policies as code, which can be versioned, tested and deployed like any other code. This approach is more scalable, flexible and auditable than traditional approaches.

The Advantages of Cedar

Aside from impressive performance, one of the most significant advantages of Cedar is its readability. The language is designed to be extremely readable, empowering even nontechnical stakeholders to read it (if not write it) for auditing purposes. This is critical in today’s world, where security and compliance are top priorities.

Cedar policies are written in a declarative language, which means they can be easily understood and audited. Cedar also offers features like policy testing and simulation, which make it easier to ensure that policies are enforced correctly.

Unlike some other policy languages, Cedar adheres to a more strict and structured syntax, which provides its aforementioned readability, emphasis on safety by default (i.e., deny by default), and more assurances on correctness/security thanks to verification-guided development.

Open Source Supporting Open Source

AWS has recognized the huge challenge that is application-level access control and strives to make Cedar easily consumed within its cloud via Amazon Verified Permissions (AVP). But what about on-premises deployments or other clouds? This is where other open source projects come into play.

With Cedar-Agent, developers can easily run Cedar as a standalone agent (just like Open Policy Agent) wherever they need it. And with OPAL, developers can manage Cedar-Agent at scale, from a unified event-driven control plane. OPAL makes sure that agents like OPA, AVP (Amazon Verified Permissions) and Cedar-Agent are loaded with the policy and data they need in real time.

Permit’s Unified Platform for Policy as Code

As developers, being polyglot and avoiding lock-in enables us to choose the right tool for the right job. With Permit’s SaaS platform, developers can choose between OPA’s Rego, AWS Cedar or any other tool as their policy engine of choice. And by leveraging Permit’s low code/no-code interfaces, RBAC and ABAC policy as code will be automatically generated so that users can take full advantage of policy as code without having to learn a new language.

Conclusion

The launch of AWS’ Cedar is a tectonic shift in the IAM space. It’s clear that the problem of in-app permissions has grown too big to ignore. Policy as code has emerged as a de facto standard, and tools like OPAL and Permit.io are making it easier for developers to write and manage policies at scale. Cedar’s readability and testing features make it an attractive choice for many organizations looking to manage permissions in a scalable, auditable and flexible way.

As the ecosystem continues to expand, we’ll likely see more tools and systems adopting policy as code as the preferred approach to managing permissions and access control in the cloud.

The post Open Sourcing AWS Cedar Is a Game Changer for IAM appeared first on The New Stack.

Dell Intros New Edge, Generative AI, Cloud, Zero Trust Prods

Chris J. Preimesberger — Wed, 31 May 2023 18:00:35 +0000

At its annual Dell Technologies World conference in Las Vegas last week, Dell joined a spate of companies this year in releasing new products that address emerging needs in edge management, secure generative AI development, zero trust security and multicloud workload management. It’s no surprise that these are the hottest sectors of enterprise IT products and services sales here in 2023.

New items Dell introduced include the NativeEdge platform; a partnership between itself and Nvidia to build Project Helix, a secure on-premises-based generative AI management package; Project Fort Zero, another new crack at providing an end-to-end zero trust security solution; and updates for its multicloud app-service management platform, Apex.

Dell NativeEdge

Dell had only a loosely coordinated edge strategy for a few years until last October when the company previewed a new development called Project Frontier, which was to result in a standalone edge operations software and hardware platform sometime in 2023.

“Sometime in 2023” is now here. Project Frontier is now closed, and Dell said its replacement, dubbed NativeEdge, is the industry’s “only” edge operations software platform delivering secure device onboarding at scale, remote management and multicloud application orchestration. Companies such as IBM, AWS, Google, HP, Microsoft, Hitachi, Oracle and several others may dispute the “only” claim, however. Here is Gartner’s most recent edge computing market report.

“This is the industry’s only edge operation platform that mixes up and delivers three things in parallel: secure device onboarding at scale; remote management of devices in those locations; and multicloud application orchestration to edge devices from data centers to make sure that the edge outcome end-to-end is deployed in the lifecycle,” Dell SVP of Edge Solutions Gil Schneorson told reporters.

Edge operations, which make use of automated processes, are generally found in locations ranging from manufacturing floors and retail stores to remote wind turbines and long-distance hospital operating rooms.

Use cases for edge computing are growing because more organizations want to manage and secure data at the source but often have limited IT support to do it.

For example, a large manufacturer may need to automate packaging and shipping across its numerous factory sites in various geographies. This means connecting multiple technologies, such as IoT, streaming data and machine vision, which requires dedicated devices to run multiple software applications across locations. Testing and deploying infrastructure to run the applications can take months. With NativeEdge, the manufacturer can consolidate its technology stacks using existing investments and reduce the time to deploy edge assets and applications from months to weeks, Schneorson said. The platform uses automation to streamline edge operations and helps the manufacturer securely roll out new applications to all sites from a central location.

NativeEdge is designed to run any enterprise edge use case with automated deployment, Schneorson said, integrating with various Dell data center hardware nodes. NativeEdge includes built-in zero trust internal use-confidential capabilities, Schneorson said.

The NativeEdge software platform will be available in 50 countries beginning in August, the company said.

Project Helix Partnership with Nvidia

Everybody needs “easier” when it comes to incorporating generative AI into an enterprise system, no matter where its use is being considered. Pre-built AI data models are one major step in that direction.

Dell and Nvidia revealed that Project Helix will result in a series of full-stack solutions with technical expertise and prebuilt tools based on their own infrastructure and software. This includes a complete blueprint to help enterprises use their proprietary data and more easily deploy generative AI responsibly and accurately, Dell said.

“Project Helix gives enterprises purpose-built AI models to more quickly and securely gain value from the immense amounts of data underused today,” Jeff Clarke, Dell Technologies’ vice chairman and co-chief operating officer, said in a media advisory. “With highly scalable and efficient infrastructure, enterprises can create a new wave of generative AI solutions that can reinvent their industries.”

Blueprints for on-premises generative AI Project Helix will support the complete generative AI lifecycle, from infrastructure provisioning, modeling, training, fine-tuning, application development and deployment, to deploying inference and streamlining results, Dell said. The validated designs will use Dell PowerEdge servers, such as the PowerEdge XE9680 and PowerEdge R760xa, which are optimized to deliver performance for generative AI training and AI inferencing, the company said. They will combine with Nvidia H100 Tensor Core GPUs and networking to form the infrastructure backbone for these workloads.

Dell did not indicate whether the Helix models could be run optimally on other brands of servers. Dell Validated Designs based on the Project Helix initiative will be available through traditional channels and Apex flexible consumption options beginning in July.

Dell Apex Multicloud Management

Dell Apex software, introduced in May 2021, is designed to work across all cloud platforms, public cloud storage software, client devices and computing environments. New additions to the company’s multicloud portfolio span data centers to public cloud and client devices include:

Dell, Microsoft, VMware and Red Hat platforms for specific cloud-based use cases:

Dell Apex Cloud Platform for Microsoft Azure features full-stack software integration and automated lifecycle management through Microsoft native management tools. The platform is designed for application modernization and is said to offer faster time to value of Azure based on Azure Arc-enabled infrastructure with consistent operations and governance across on-premises data centers, edge locations, and the Azure public cloud using Azure Arc.

Apex Cloud Platform for Red Hat OpenShift aims to simplify container-based application development and management, wherever applications are developed and deployed, through full stack software integration and automation with Kubernetes. Users can run containers and virtual machines side by side, natively within Kubernetes, for a wide variety of workloads — including AI/ML and analytics, with broad GPU support across any hybrid cloud footprint, Dell said.

Apex Cloud Platform for VMware gives users the flexibility to deploy vSphere on a fully integrated system with Dell software-defined storage. It joins Dell Apex Private Cloud and Dell Apex Hybrid Cloud within the broader Dell Apex portfolio to offer more choices for VMware users.

Zero Trust Security News

This is still on the drawing and testing boards for the next year until further notice. Project Fort Zero is designed to make zero trust more available for mid-range and smaller enterprises and is engineered to provide a hardened end-to-end zero trust security solution for global organizations to protect against cyberattacks.

Fort Zero is part of an ecosystem of more than 30 technology companies and will result in a validated, advanced maturity zero trust solution within the next 12 months, Dell said. “Zero trust is designed for decentralized environments, but integrating it across hundreds of point products from dozens of vendors is complex — making it out of reach for most organizations,” Herb Kelsey, industry CTO of government projects at Dell, told reporters. “We’re helping global organizations solve today’s security challenges by easing integration and accelerating adoption of Zero Trust.”

The fully configured Fort Zero solution will serve a variety of use cases, including on-premises data centers; remote or regional locations such as retail stores where secure, real-time analysis of customer data can deliver a competitive advantage; in the field, where a temporary implementation is needed in places with intermittent connectivity, such as airplanes or vehicles, for operational continuity.

The post Dell Intros New Edge, Generative AI, Cloud, Zero Trust Prods appeared first on The New Stack.

Cloud Dependencies Need to Stop F—ing Us When They Go Down

Jeff Martens — Thu, 25 May 2023 17:00:35 +0000

We are building software faster and with more functionality than ever, thanks to an abundance of third-party cloud infrastructure offerings, APIs and SaaS tools. They are allowing software developers like us to soar.

But if these cloud dependencies go down, we go down with them. And because most vendors refuse to provide visibility into their platforms, we’re left scrambling and asking ourselves, “Is it me or them?” In short, we’re f—ed.

That’s why when we talk about the promises of the powerful cloud services at our fingertips, we also need to talk about their problems — including how vendors can stop screwing us over when they go down — and how we can mitigate their lack of visibility in the meantime.

Cloud Dependencies Are Awesome … Until They’re Not

Upstream cloud dependencies — that is, software such as Amazon Web Services, Auth0, GitHub, Twilio, etc. — are becoming increasingly popular and important. That’s because building on and with third-party cloud dependencies makes our software better. So we are increasingly turning to third-party cloud apps to power our products and run our businesses.

For example, a typical digital product might rely directly on 50 cloud products, which represent just a portion of the 130 cloud products the average digital business uses to power its entire business.

However, there’s an important problem we need to address with this innovation: Our reliability is greatly affected by the reliability of our dependencies. Let’s look at a common example of how this plays out.

Has This Ever Happened to You?

You’re on call and PagerDuty starts going off — something is clearly wrong with the core functionality of your product. You assemble an eight-person team and get to work.

Immediately, someone suspects it’s a specific third party that you have a hard reliance on. You check the status page of this service, and it says everything is fine. You have to keep searching.

Ten minutes passes, and support tickets are pouring in. All the metrics point to a dependency outage, but the status page is still green. Twenty-five minutes in and the incident risks becoming a Service Level Agreement (SLA)-violating event with financial consequences. With no obvious solution and an “all-clear” vendor status page, the team debates various ideas and doesn’t take action to remediate the issue.

Twenty-nine minutes later, the cloud vendor’s status page updates: Your colleague was right! But as usual, the status page update lacks the details regarding the exact problem. Frustrated by the delayed and insufficient update, you initiate a failover plan that you wish you could have executed with confidence sooner.

Once everything returns to normal, you dismiss the team, feeling pissed off and gaslighted by the upstream cloud dependency. If you’re so dependent on these services, why can’t you have visibility into them like you do your own software?

More Cloud Dependencies = Less Reliability

You’d think that if you rely on multiple services with 99.99% uptime you’d have a product with 99.99% uptime. But that’s not the case. In fact, when you add more services, the uptime you can safely offer actually goes down.

That’s because with each product you introduce, you introduce the amount of unreliability that product has into your product’s reliability (even if it’s incredibly small). The math depends on a few factors. It’s a composite score which you can learn how to calculate here. But in the simplest terms, if you add a hard dependency with 99.99% uptime, you need to subtract about 0.01% from the best possible uptime that your app can achieve.

And while 0.01% might seem insignificant, it adds up with every service you use. According to research by the Uptime Institute, 70% of all major SaaS outages are connected to upstream cloud dependency issues. In fact, depending on the product or application, engineering teams may find that between 25% and 70% of all alert-able incidents come from third-party cloud dependency issues.

No Visibility = We’re F—ed During Outages

So it would seem like with our reliability on cloud vendors, visibility into them would be a high priority. Unfortunately, right now that visibility is opaque at best.

What’s more, current observability tools focus internally — on first- and second-party signals — forcing us to infer cloud-service health. As a result, we can’t answer the “us vs. them” question in an efficient, timely manner.

Vendor status pages are updated manually, if they are updated at all. Consequently, status pages are updated on average 29 minutes after issues start, and they are only updated for the most serious of issues. Most of us have to turn to Twitter and Hacker News for up-to-date information on critical reliability data.

This reality results in:

Unnecessary downtime due to prolonged MTTD, MTTI and MTTR
Inefficient, ill-informed incident response
Missed opportunities to avoid incidents

And we haven’t even gotten to holding vendors accountable to their SLAs.

Vendor SLA Accountability? Never Heard of Her.

Another consequence of having no visibility into upstream service health is that those vendors hold all the power around reliability data and SLA compliance. Let’s think about this:

They define SLA to their specs.
They communicate about issues if/when they want.
They don’t share metrics outside of status pages that are fraught with trust issues.
They require customers to carry the burden of proof when things do go wrong.

And if we don’t have the data to know whether they’re even staying reliable to their own definition of reliability (which could be different from yours), there’s an inherent inability to hold cloud vendors accountable.

It’s time this changed.

Software Makers Deserve Better

What do we need to achieve an appropriate level of visibility and avoid the pitfalls of dependency unreliability?

Timely, detailed cloud dependency health metrics. We deserve a similar level of visibility into our third-party dependencies as we have for our first- and second-party software. Cloud vendors are just as critical, if not more so, than the software we write and operate. We wouldn’t dream of operating our own services without proper monitoring, so why should we do any less for our upstream dependencies? Without cloud dependency health metrics, we are stuck trying to answer the “is it us or is it them” question without full context, despite cloud dependencies accounting for at least 25% of all incidents.
Service status specific to us and our use of a product. Outages rarely take down an entire product, for all customers, at the same time. It’s not enough to say “Some customers are experiencing elevated error rates.” Instead, we deserve to know what functionality is affected, where, for whom and, most importantly, if our account/resources are affected. Don’t make us scramble an incident response with a vague status page update when it’s not necessary.
A single place for all third-party service-health information. Time is critical during incident response, and we can’t waste time going back and forth between 15 different status pages, Twitter, Hacker News and our internal dashboards. We deserve a real-time and complete view of every third-party service that we depend on, in a single place, alongside our app metrics, logs and traces.
Control over what, where and how we get visibility. Vendors have all the power, and they force us to come to them and rely on an often sh–ty status-page experience. In addition to having access to service health metrics, rather than just status-page updates, we should also have access to cloud dependency metrics in the same ways we access and control our own application metrics. We deserve first-class Slack integrations, native Datadog plugins, PagerDuty integrations, webhooks and more.

And how can we hold vendors accountable for their promised reliability?

An independent SLA authority. Vendors hold all the power with SLAs, defining them how they want, reporting violations if they want, but forcing customers to prove violations if they want a refund. This is the fox guarding the henhouse, as the saying goes. With annual global SaaS spending topping $146 billion, we deserve an independent arbiter of truth for SLA compliance, written into our contracts.
A vendor/customer partnership built on data. Too many times, we’ve sat in QBRs where cloud vendors and their customers disagree on reliability and each uses their own data to support their belief. Most cloud vendors want to do right by their customers, and most customers want to help their vendors increase reliability. Yet everyone holds their data close to their chest, and we don’t treat each other as trusted partners. Sharing reliability data should become the norm, building a bridge toward better reliability for all.
Community benchmarks and baselines. Do you know if your traffic to a cloud dependency is being treated the same as every customer’s? You probably don’t. Do you know how many outages to expect from a vendor over the year, or how fast they typically resolve them? Don’t bother looking at the status page for that information; it will mislead you. We don’t just deserve metrics about our own experience with a cloud service, we deserve to know how those metrics compare to others, and to the expected.
Proactive incident communication. During incidents, vendors often know they have a problem but wait to update their status page until they have clear details of the incident, or worse, approval from a marketing executive. In these cases, a simple “we think we might have some sort of problem. Stay tuned for details” would go a long way, but it rarely happens. Similarly, vendors shouldn’t wait for customers to catch them with SLA violations rather than admitting to issues themselves.
A w3c standard for status pages. It is hard to imagine a world where status pages don’t play a part in creating greater and better visibility into cloud dependency health. As we turn to status pages, we deserve a consistent experience so we can find the data we need, clearly and quickly. Leveraging cloud dependencies to build software is as common as using an internet browser to access them, and status pages are a fundamental aspect of nearly every cloud product out there. A w3c standard for status pages can move the industry forward and make life better for both status-page publishers and users.

A Cloud Customer Bill of Rights would have a profound effect on our software reliability practices, but this reality is a long way off. Is it possible to get this kind of visibility today?

Filling the Void Ourselves

Visibility into third-party reliability should be a critical component of our own reliability efforts. The “is it us or is it them” question can and should be answered quickly and clearly. But until our cloud vendors give us what we deserve, there are a few things we can do ourselves to increase visibility and decrease risk.

We should be treating our third-party dependencies just like we treat our internal dependencies. Know exactly what dependencies you have, include them in your service catalogs and have runbooks for them. From there, we can begin to map out the actual availability we experience in order to understand which services introduce the most risk to your own reliability.

To get this visibility, you may need to employ a canary testing strategy, scripting continuous checks against critical third-party endpoints. Simple API pings won’t tell the full story of functionality, but multistep synthetic monitoring from an observability provider like Datadog can. Or, use a dedicated cloud dependency-monitoring solution such as Metrist or APImetrics. Regardless of the approach, it is our responsibility to collect the metrics we need in order to respond well-informed and partner with our vendors to improve their reliability.

Speaking of partnering with our vendors, now is the time to build your relationship with them. Just as we should treat third-party services like internal services, we should treat our vendors as an extension of our team. Most cloud vendors want to do right by their customers. Establishing rapport can result in hands-on support during an outage and proactive collaboration before issues happen. Knowing which public cloud provider and region your dependencies rely on, and ensuring they don’t host their status page in the same place, can help you identify and reduce the risk you take on.

With expanded visibility and strong relationships with our vendors, we can:

Evaluate and select vendors whose reliability meets our needs.
Manage vendors toward better reliability and hold them accountable to SLAs.
Avoid impact from third-party outages with warnings and automation.
Reduce MTTR through a better-informed incident response.
Reduce MTTD with direct monitoring and alerting on cloud-dependency service health.

Standing Up to Vendors

Our software is increasingly dependent on cloud vendors to function. We are in an interconnected world, and vendors need to act like it.

We pay handsomely for these services that become a critical part of our infrastructure — including the personnel who need to configure and troubleshoot it — but the transparency of these services has not kept pace with the advancements in observability tooling and reliability culture.

We should of course give these vendors some benefit of the doubt. None of them want to affect their customer’s reliability. Often, the software engineers who build the products we rely on have their hands tied. They want to programmatically update status pages, communicate quickly and share more details about their software’s performance, but they can’t because of their own observability limitations or a strict process around customer-facing reliability communication.

So we’re left asking friends and strangers on social media, “Is it me or them?” as our primary source of third-party visibility. It’s time for our cloud vendors to join us on this reliability journey so that we aren’t f—ed when they go down.

The post Cloud Dependencies Need to Stop F—ing Us When They Go Down appeared first on The New Stack.

Microsoft Fabric Defragments Analytics, Enters Public Preview

Andrew Brust — Tue, 23 May 2023 15:00:15 +0000

At Microsoft’s Build conference in Seattle today, the software and cloud giant announced the public preview of Microsoft Fabric, the company’s new end-to-end analytics platform. The product both evolves and unifies heretofore separate cloud services, including Azure Synapse Analytics, Azure Data Factory and Power BI. And though all of the precursor services on the Azure side have been available as separately-billed Platform as a Service (PaaS) offerings, Microsoft Fabric unifies them under a single Software as a Service (SaaS) capacity-based pricing structure, with a single pool of compute for all workloads within the scope of the platform.

Fabric’s functionality spans the full data lifecycle, including data ingest and integration, data engineering, real-time analytics, data warehousing, data science, business intelligence and data monitoring/alerting. All such workloads are data “lake-centric” and operate on top of a unified data store called OneLake, in which data is persisted in Apache Parquet/Delta Lake format, a combination that brings relational table-like functionality to data stored in open data lake formats. And since Fabric can also connect to data in Amazon S3 (and, in due time, to Google Cloud Storage, Microsoft promises) via so-called “shortcuts” in OneLake, there’s a multicloud dimension to all of this as well.

Less Assembly Required

The New Stack was briefed last week on Microsoft Fabric by Arun Ulagaratchagan, Microsoft’s Corporate Vice President of Azure Data, who provided important color on why Microsoft created Fabric and what its goals were with it. With the benefit of that insight, it seems the real innovation in Microsoft Fabric may be around the analytics integration it delivers, which has been sorely lacking in the industry at large, and even within Microsoft’s own collection of data and analytics services.

Ulagaratchagan got to the point pretty quickly with one key observation about the analytics landscape today: “if you put yourself in the customers’ shoes, there are literally hundreds, thousands of products out there that they have to figure out what makes sense for them, how do they use it, how do they wire it all up together to be able to take advantage the data, get it to the right shape that they need, and make it work for their business.”

Fabric was designed to solve this very real deficiency in the modern data stack, and the level of integration and rationalization that Fabric has already achieved is unprecedented. I say this as someone who participated in the early adopter’s program for Fabric and who has used Microsoft data and analytics technology for almost 30 years.

Must read: Data 2023 Outlook: Rethink the Modern Data Stack

The Microsoft analytics stack was fractured even in the enterprise software days, and the era of the cloud had only made it worse. What Microsoft has done with Fabric is to address these disconnects and arbitrary segregations of functionality, not just technologically, but also in terms of the pricing/billing model and the structure of the engineering organization behind it all. I will explain all of that in more detail later in this post, but first, let’s take inventory of Fabric’s components, capabilities and use cases.

What’s in the Box?

Of Fabric’s seven core workload-specific components, one, called Data Activator, which implements data monitoring/alerting, is built on new technology and is in private preview. The other six, which are based on technologies that existed previously, and are available today in the Fabric public preview, are as follows:

Data Factory, based on Azure Data Factory and Power Query technology, provides for visual authoring of data transformations and data pipelines.
Synapse Data Engineering, based on the same technology as the Spark pools in Azure Synapse Analytics, and related technologies for Apache Spark, including notebooks, PySpark and .NET for Apache Spark, provides for code-first data engineering. However, unlike the explicitly provisioned Spark pools in Azure Synapse Analytics, Spark resources in Fabric are provided on a serverless basis using “live-pools.”
Synapse Data Science, which provides for training, deployment and management of machine learning models. This component is also based largely on Spark, but incorporates elements of Azure Machine Learning, SQL Server Machine Learning Services and the open source MLflow project, as well.
Synapse Data Warehousing, based on an evolution of the original Azure SQL Data Warehouse (and, ultimately, SQL Server) technology provides a “converged” lakehouse and data warehouse platform.
Synapse Real-Time Analytics, which combines Azure Event Hubs, Azure Stream Analytics, Azure Data Explorer and even open source event streaming platforms like Apache Kafka, allows analytics on IoT, telemetry, log and other streaming data sources.
Power BI, Microsoft’s flagship business intelligence platform, but soon-to-be enhanced with a new large language model AI-based Copilot experience that can generate DAX (Data Analysis eXpressions — Power BI’s native query language). In many ways, Power BI is the “captain” of the Microsoft Fabric team as its Premium capacities and workspaces are the basis for their counterparts in Fabric.

The inclusion of Power BI in Microsoft Fabric goes beyond its own capabilities and provides integration with Microsoft 365 (read: Office) by extension. Tight integration between Power BI on the one hand, and Excel, PowerPoint, Teams, SharePoint and Dynamics 365, on the other, means the power of Fabric can be propagated outwards, to bona fide business users and not just business data analysts.

Check out: Microsoft Power BI Gets Low-Code Datamart Feature

All for One

Despite the varied branding, which may have been done for reasons of continuity or politics, Fabric is a single product with a single overarching user interface and user experience. As Ulagaratchagan explained, “even though it is seven workloads running on top of One Lake, it looks and feels and works from an architecture perspective as one integrated product. We wanted to make sure we conveyed how all of these experiences just flow. That’s why the Fabric name seemed appropriate.”

Although persona-specific UIs are provided for different workloads, they are more akin to “skins” or “views” than distinct products. In fact, The “Create” button in Fabric’s navigation bar presents a menu of all artifacts from all workloads, as an alternative to the compartmentalized experiences and, in the process, emphasizes the integrated nature of it all.

Whether the workloads are engaged from their respective user interfaces or the general one, the “artifacts” created in each are kept together in unified Fabric workspaces. And because the basis for each data artifact consists of Delta/Parquet data files stored in OneLake, many of the assets are just thin, workload-specific layers that sit atop those physical data files. For example, a collection of data in OneLake is directly readable and writeable by Spark, as with any data lake, but it can also manifest as tables in a relational data warehouse or a Power BI dataset.

Of course, each artifact type can contain its own unique assets; for example, a warehouse can have views and stored procedures, and a Power BI dataset can have measures and hierarchies. None of the components in Fabric gets dumbed down, but several (and, eventually, all, quite possibly) use Delta/Parquet as a native format, so the data doesn’t need to be replicated in a series of component-specific proprietary formats.

Cooperation, Not Duplication

This means that, in Fabric, a data engineer writing Python code running on Spark, a data scientist training a machine learning model, a business analyst creating sophisticated data visualizations, and an ETL engineer building a data pipeline are all working against the same physical data. And folks using other data platforms — including Databricks or non-Microsoft BI tools — can share this same physical data too, because OneLake is based on, and API-compatible with, Azure Data Lake Storage, to which most modern data stack technologies have connectivity.

In the case of Power BI, the ramifications get even more interesting. When working with data in Microsoft Fabric, a BI engineer doesn’t have to decide whether to import the data into a Power BI model or leave it in OneLake and query it on the fly. With the aid of something called Direct Lake mode, that distinction goes away, because the data in OneLake already is in a Power BI-native format. Power BI already supported composite models, where Import and DirectQuery access methods could be combined in a single model. But with Direct Lake mode, composite models aren’t necessary, as the need to import data from OneLake is simply eliminated.

Eliminating the segregation between services, artifacts and data formats means the economics get simpler, too. Fabric’s capacity-based compute model provides processing power that is fungible between, and usable from, all of its workloads. Ulagaratchagan had this to say on that subject: “We see this as an opportunity for customers to save a ton of money because today, often every analytics product has multiple subsystems. These subsystems typically require different classes of products, often coming from different vendors, and you’re provisioning multiple pools of compute across many different products, and weaving them all together to create one analytics project.”

While Ulagaratchagan identifies this as a problem with a multi-vendor approach, even a Microsoft-only solution has up until now suffered from the same issue. The combination of Power BI and the present-day Azure services needed to create an equivalent to Microsoft Fabric has up until now required separately provisioned compute. This sprawl can even be an issue within an individual service. For example, Azure Synapse requires the management of four different types of compute clusters (Dedicated SQL, Serverless SQL, Spark and Data Explorer), three of which invoke separate infrastructure lifecycles and billing. Fabric eliminates these redundancies and their accompanying complexity and expense.

Take Me to Your (Common) Leader

There’s corporate unification at play here too. Many of the teams at Microsoft that built Fabric’s forerunner technologies — like Azure Synapse, Data Factory and Power BI — worked together to build Fabric and are all part of the same organizational structure under the management of Ulagaratchagan and the technical direction of Technology Fellow Amir Netz, the duo that previously led the standalone Power BI organization. This alignment is rather unprecedented at Microsoft, a company infamous for its internal competition and the sometimes disjoint technologies that result. The challenges here involved geography, too: engineering teams in the US, India, Israel and China, each with their own culture and operating in their own time zones, worked together in a remarkably cohesive fashion to build Fabric.

Building a federated product team like this was a calculated but very big gamble. Frankly, it could have gone horribly wrong. But from my point of view, morale was high, hubris was practically non-existent and top talent at Microsoft was skillfully honed to build a very comprehensive platform that changes the analytics game immensely. All three cloud providers were guilty of creating numerous siloed services and putting the burden of implementing them in combination on the customer. That’s not just an issue of discourtesy or insensitivity — it’s one of expense too, as customers need either to allocate significant human resources to such projects, or else invest an enormous amount of capital in the consulting talent necessary to carry them off.

Some of the technologies that now work together within Fabric have literally decades of history. And those that are newer had to be integrated with the older ones, and each other. Much as PC “defrag” tools tidy and reorganize files on spinning hard drives that have become scattered across the disk, Microsoft has had to defrag itself and its analytics technology stack to get Fabric built. Even if much of the technology itself isn’t new, the harmonious unification of it is a huge breakthrough that will enable new analytics use cases because of new simplicity, ease of use, efficiencies and economies of scale.

How to Get Started

The public preview of Microsoft Fabric begins immediately. Microsoft says Power BI Premium customers can get access to Fabric today by turning on the Fabric tenant setting in the Power BI admin portal, which will have the effect of upgrading Premium capacities to support Fabric. Customers can also enable Fabric in specific capacities instead of their entire tenant. Microsoft says that using new Fabric functionality (versus capabilities that were already available under Power BI) will not incur capacity usage before Aug. 1, 2023, but customers can still use the Capacity Metrics app to monitor how Fabric would impact capacity usage were the meter running.

Non-Power BI Premium customers can get access to a free 60-day Fabric trial.

Next Frontier

There’s more work to be done. Not only does Fabric have to move from Public Preview to GA, but more functionality is required. Microsoft is promising Copilot experiences for all workloads, rather than just Power BI. The company says this is in private preview now, so general availability would seem a long way off. Likewise, Data Activator needs to move forward to public preview and full release. And data governance functionality of the type offered by Microsoft Purview will be needed in Fabric to make it a truly complete offering. For now, the lineage, impact analysis and asset endorsement capabilities of Power BI will have to do.

There’s lots of other work ahead, too. Just because the Fabric team was successful with the heavy lift required to get where they are now doesn’t mean the pressure’s off, by any means. As the product moves to public preview, the technology, the pricing model, and the very notion that customers will prefer the end-to-end approach to “a la carte” functionality and procurement will now all be put to the test. There’s more pivoting, more innovation and more risk management required; and if Fabric is successful, competition from AWS and/or Google Cloud is almost sure to follow.

But Microsoft can celebrate significant interim success with Fabric already and betting on its ultimate success seems prudent. In an era when everyone’s going gaga over AI, we need to keep in mind that AI’s models are only as good as their underlying data, and the engineering used to discover, shape and analyze it. AI may get a lot of the “oohs and ahs” at Build, but I’d argue Fabric is the real news.

Disclosure: Post author Andrew Brust is a Microsoft Data Platform MVP and member of Microsoft’s Regional Directors Program for independent influencers. His company, Blue Badge Insights [www.bluebadgeinsights.com], has done work for Microsoft, including the Power BI team.

The post Microsoft Fabric Defragments Analytics, Enters Public Preview appeared first on The New Stack.

A Boring Kubernetes Release

Alex Williams — Fri, 19 May 2023 19:23:49 +0000

Kubernetes release 1.27 is boring, says Xander Grzywinski, a senior product manager at Microsoft.

It’s a stable release, Grzywinski said on this episode of The New Stack Makers from KubeCon Europe in Amsterdam.

“It’s reached a level of stability at this point,” said Grzywinski. “The core feature set has become more fleshed out and fully realized.

The release has 60 total features, Grzywinski said. The features in 1.27 are solid refinements of features that have been around for a while. It’s helping Kubernetes be as stable as it can be.

Examples?

It has a better developer experience, Grzywinski said. Storage primitives and APIs are more stable.

“Storage primitives have been around in Kubernetes for a while, and people have debated whether you should store persistent data on Kubernetes,” he said. “But I think a lot of those primitives and APIs have become more stable. So one of the new ones that have gotten some excitement is the read-write-once access method. So there’s a feature now where you can restrict access of a storage volume. Only one pod at a time can read and write from it. Things like that. That’s like just general refinement that makes the developer experience a little bit better.”

It’s not all boring.

The Vertical Pod Autoscaler (VPA) is pretty cool, Grzywinski said. It’s in alpha this time, but it will allow pods to scale to larger resources on demand. It will enable users to scale up to a configured level without restarting.

According to its GitHub page, when configured, VPA sets the requests based on usage, “allowing proper scheduling onto nodes so that appropriate resource amount is available for each pod. It will also maintain ratios between limits and requests that were specified in initial containers configuration.”

Efforts will continue to communicate better without surprises.

For example, there’s a new depreciation process that came based on feedback from the community. Grzywinski noted the Dockershim removal that caught a lot of people by surprise in release 1.24.

The New Stack’s Joab Jackson reported in March of 2022 that Dockershim would no longer be supported.

The lesson learned: over-communicate so there are fewer surprises. For example, Grzywinski said a blog is launching dedicated to deprecations and removals that will get pushed out earlier than the regular release blog.

The post A Boring Kubernetes Release appeared first on The New Stack.

Optimizing Mastodon Performance with Sidekiq and Redis Enterprise

Martin Heller — Thu, 18 May 2023 17:30:45 +0000

In the last six months, the open source Mastodon platform has attracted millions of new users and made organizations contemplate creating their own servers (called instances, in Mastodon parlance). It’s not hard to set up a Mastodon instance to support a handful of users. However, it is hard to set up a Mastodon server that can handle a lot of traffic because the default configuration leaves much to be desired.

In my previous article, “How to Boost Mastodon Server Performance with Redis,” I noted that one chokepoint in Mastodon servers is its Sidekiq queues, which depend in turn on Redis queues.

The Mastodon tech stack is built using Redis open source (Redis OSS), and it works great for the purpose. The usual way to configure Mastodon is to run it and Redis OSS on the same machine, and to scale that setup with Redis Sentinel if needed.

“Free” is wonderful, as is the support of the open source community. Redis OSS makes sense for ordinary workloads (whatever “ordinary” means to you) on a basic Mastodon instance. But you might want to consider additional options.

Using Redis OSS with Mastodon is free if you only look at licensing costs. It may not be optimal in the larger context of application performance, or even in the context of total cost of ownership. Where should you put your resources — technical, financial and human?

In this article, we explore the practicalities of using Redis Enterprise Cloud to power the queues for Sidekiq. Redis Enterprise Cloud is a fully managed database-as-a-service and offers enterprise capabilities such as Active-Active clustering topology that scales linearly. Since both of us are benchmark writers, and Filipe is the performance guru at Redis, you’re about to see lots of numbers and graphs. Fear not: We explain everything as we go.

Since we learned from other Mastodon administrators’ experience that the job queues often are the bottleneck, exhibiting 100% CPU load in the Redis process during high-traffic periods, we theorized that we could improve the results by removing Redis from the Mastodon server. We realized that we needed to connect Mastodon and Sidekiq to an external Redis instance, most conveniently to a Redis Enterprise Cloud instance.

We discovered along the way that Mastodon wasn’t designed for that plan, although we found a pull request in its GitHub repository to fix the problem. We also discovered that conventional HTTP load testing wouldn’t help much with Mastodon, but we could adapt a Sidekiq benchmark to compare the performance of Redis OSS and Redis Enterprise Cloud.

Connecting Mastodon and Sidekiq to Redis Enterprise Cloud

Our first step was to demonstrate that we could connect Mastodon and its Sidekiq job queue to an external Redis Enterprise Cloud instance. And, of course, we needed to have such an instance to test.

Filipe created a four-shard Redis Enterprise Cloud cluster in AWS rated for 100,000 operations per second (ops/sec) and 10 GB deployment using Terraform.

The database cluster uses Redis on Flash in multiple availability zones, with an Active-Active configuration. The cluster has two c5.2xlarge instances, one m5.large instance and a 127 GB EBS volume.

At the time we performed these tests, Mastodon didn’t support connecting the Sidekiq queues to anything but a local Redis database. There was a pull request to enable this in the Mastodon GitHub repository, which we applied to our Mastodon instance.

The patch adds Ruby code after line 500 of mastodon.rake:

# With the environment now reloaded, update Sidekiq to use the Redis config that was provided earlier interactively, in case it differs from the default localhost:6379.
# When the admin user is created, User dispatches an 'account.created' event to Sidekiq, which connects to Redis.


Sidekiq.configure_client do |config|
    new_params = REDIS_SIDEKIQ_PARAMS.dup
    new_params['url'] = "redis://:#{env['REDIS_PASSWORD']}@#{env['REDIS_HOST']}:#{env['REDIS_PORT']}/0"
    new_params.freeze
    config.redis = new_params
end

We needed to verify that the Sidekiq queues were running against our Redis Enterprise Cloud instance. To do so, we monitored the database.

The following console log is running on the Redis Enterprise Cloud cluster. The “queue” entries prove that the Mastodon/Sidekiq job queues reach the correct database instance rather than running locally on the Mastodon server.

redis-11459.c24274.us-east-2-1.ec2.cloud.rlrcp.com:11459> monitor
OK
1680648013.303105 [0 18.118.184.123:46734] "brpop" "queue:default" "queue:pull" "queue:ingress" "queue:push" "queue:mailers" "queue:scheduler" "2"
1680648014.047110 [0 18.118.184.123:46898] "evalsha" "f4a8a5467f9f4697a26fdfb839476b9ee52e897c" "1" "retry" "1680648014.0486577"
1680648014.047110 [0 18.118.184.123:46898] "evalsha" "f4a8a5467f9f4697a26fdfb839476b9ee52e897c" "1" "schedule" "1680648014.0491111"
1680648014.047110 [0 18.118.184.123:46898] "scard" "processes"
1680648014.679114 [0 18.118.184.123:46910] "brpop" "queue:default" "queue:mailers" "queue:ingress" "queue:push" "queue:scheduler" "queue:pull" "2"
1680648014.679114 [0 18.118.184.123:46872] "brpop" "queue:mailers" "queue:default" "queue:pull" "queue:ingress" "queue:push" "queue:scheduler" "2"
1680648014.679114 [0 18.118.184.123:46816] "brpop" "queue:default" "queue:ingress" "queue:mailers" "queue:push" "queue:pull" "queue:scheduler" "2"
1680648014.879116 [0 18.118.184.123:46808] "brpop" "queue:mailers" "queue:push" "queue:default" "queue:ingress" "queue:pull" "queue:scheduler" "2"
1680648015.079117 [0 18.118.184.123:46854] "brpop" "queue:push" "queue:ingress" "queue:default" "queue:pull" "queue:mailers" "queue:scheduler" "2"
1680648015.083117 [0 18.118.184.123:46782] "brpop" "queue:pull" "queue:scheduler" "queue:default" "queue:ingress" "queue:mailers" "queue:push" "2"
1680648015.083117 [0 18.118.184.123:46862] "brpop" "queue:push" "queue:ingress" "queue:default" "queue:pull" "queue:mailers" "queue:scheduler" "2"
1680648015.083117 [0 18.118.184.123:45206] "brpop" "queue:ingress" "queue:default" "queue:mailers" "queue:push" "queue:scheduler" "queue:pull" "2"
…

The four-shard cluster seemed like overkill, so we tried again with a smaller, single-shard Redis Enterprise Cloud database (rated for 25,000 ops/sec and 5 GB deployment), which showed 15 to 30 ops/sec load from the Sidekiq queues coming from Mastodon.

The chart at the upper left shows the database load in operations/second; the chart at the upper right shows the database latency. Low latency is good. The bottom four charts break out the load into reads, writes and other operations.

Modifying the Sidekiq Load Testing Tool

Sidekiq has two benchmarking tools in its repository. We chose the simpler one, which resides at bin/sidekiqload.

The Sidekiq load test tool creates 100,000 no-op jobs and drains them as fast as possible. As the code is written, it also uses toxiproxy to simulate network latency against a local instance of Redis. Since we were testing against a remote Redis Enterprise Cloud cluster, we didn’t need toxiproxy; we commented out that code.

Then we added the following Ruby code to read the Redis password, port and host from the environment, and we used it to configure the Redis connection for the benchmark.

Sidekiq.configure_server do |config|
  config.options[:concurrency] = 10
  redis_pass = ENV['REDIS_PASSWORD'] || ''
  redis_port = ENV['REDIS_PORT'] || 6380
  redis_host = ENV['REDIS_HOST'] || "127.0.0.1"
  config.redis = { password: redis_pass, host: redis_host, port: redis_port}
  # config.redis = { db: 13, port: 6380, driver: :hiredis}
  config.options[:queues] << "default"
  config.logger.level = Logger::ERROR
  config.average_scheduled_poll_interval = 2
  config.reliable! if defined?(Sidekiq::Pro)
end

Performing Sidekiq Benchmarks against a Single Shard on Redis Enterprise Cloud

Running that (modified) Sidekiq load test showed about 13,000 ops/sec and a latency of 0.06 milliseconds.

The chart at the upper left shows the database load from the Sidekiq load test in operations per second; the chart at the upper right shows the database latency. Low latency is good. The bottom four charts break out the load into reads, writes and other operations. The benchmark tool reported running 100,000 jobs in 7.8 seconds, meaning that each job took 78 microseconds to complete. That isn’t shabby at all.

The single-shard 5 GB/25,000 Redis cluster used two m5.xlarge instances, one m5.large instance, and a 119 GB EBS volume.

In our experiments, we increased the number of jobs from 100,000 to 5 million. As the screenshot illustrates, the throughput is about the same (about 13,000 ops/sec) and the latency is about the same (about 0.06 ms), although the Redis memory usage increased to about 1.3 GB. The increased Redis memory usage from the larger queue is not a surprise.

These small charts show the detailed Redis performance during the load test. The most significant tests are highlighted: steady load that is well below the database capacity and very low latency.

The load test tool reports that processing 5 million jobs took 400 seconds, so each job took 80 microseconds to complete, very slightly higher than the smaller queue.

Clearly, the bottleneck for the Sidekiq queue is not the Redis Cloud shard, which never reached its 25,000 ops/sec capacity.

More Sidekiq benchmarks using Redis OSS and Redis Enterprise Cloud

In the previous tests, we load-tested Sidekiq against a single Redis Enterprise Cloud shard. What happens when we test against Redis OSS?

We set up a single Redis OSS database in an m5.large AWS instance. That should be roughly comparable to the single-shard Redis Enterprise Cloud even though it lacks the Redis on Flash and Active-Active features. We re-ran the Sidekiq load test with 5 million jobs. This time, the test was completed in 427 seconds, meaning that the average time to complete a job was 85 microseconds.

We also set up another four-shard Redis Enterprise Cloud database to the same specifications as the 10 GB, 100K ops/sec cluster configuration we first showed you. This configuration completed the 5M-job Sidekiq load test in 387 seconds, giving us an average time to complete a job of 77μs. It also showed lower latency.

To summarize: Redis OSS is a little slower and has higher latency than a similarly sized single-shard Redis Enterprise Cloud instance, while a four-shard Redis Enterprise Cloud instance is a little faster, has higher capacity and has lower latency.

Future Explorations

What all that tells us: Removing the Redis database from the virtual machine that runs Mastodon and Sidekiq should make Mastodon handle high loads more gracefully, with fewer stalls and posting failures.

To prove that conclusively, we plan to set up a production Mastodon node with an external Redis Enterprise Cloud cluster to handle the job queues and perhaps the PostgreSQL cache as well, and we will monitor how it scales with lots of users.

If you’d like to get ready to try all this yourself, you should start by exploring Redis Enterprise Cloud. A free tier instance might not be enough to use for your own high-capacity Mastodon server, but it certainly allows you to become familiar with setting up and using the database, and it will only cost you a little time.

The post Optimizing Mastodon Performance with Sidekiq and Redis Enterprise appeared first on The New Stack.

Datadog’s $65M Bill and Why Developers Should Care

Loraine Lawson — Wed, 17 May 2023 17:56:34 +0000

Sixty-five million dollars: That’s how much one customer was billed by Datadog in the first quarter of 2022, according to a May 4 earnings call for the SaaS observability and security vendor.

An unspecified financial services firm was faced with the bill, which did not reoccur, according to Olivier Pomel, co-founder, CEO and Director at Datadog.

“We had a large upfront bill for a client in Q1 ’22 that did not recur at the same level or timing in Q1 ’23,” he said, during the earnings call. “Pro forma for this client, billings growth was in the low 30s percent year-over-year.”

Pro forma refers to a financial statement that uses hypothetical data or assumptions about future values to project performance over a period that hasn’t yet occurred, according to the Harvard Business School. Generally, observability companies use this approach to predict what a bill may be — but as the $65 million bill shows, there can be surprises when the actual usage bill comes due.

Mark Ronald Murphy, an executive director and financial analyst with JPMorgan Chase’s research division, crunched the numbers and revealed that the upfront bill would have been about $65 million.

David Obstler, chief financial officer at Datadog, said the company changed the billing frequency.

“That customer’s bill will, one, be spread out more over time. That company — that was a crypto company and continues to be a customer of ours, but that was an early optimizer,” Obstler said, during the earnings call. “We will get that bill at a smaller size than was billed last year in a more of a chunked up billing way.”

He added that since becoming public, Datadog has pointed out when they have an unusual bill and that the problem isn’t common. When it does happen, they change the duration or timing or size of the bill to accommodate customers.

Was Coinbase the Customer?

Pomel added that the customer was in a vertical that was “pretty much decimated over the past year.”

“Their own business was cut in three or four in terms of their revenue and when that’s the case, that we really work with customers to restructure their contract with us,” Pomel said. “We want to be part of the solution for them, not part of the problem. And that’s what we did here, we restructured that contract. So we kept them as a happy customer for many more years and do deal that works for everyone with their business profile.”

While Datadog did not return an interview request by deadline, Gergely Orosz, a software developer who blogs as the Pragmatic Engineer, cites multiple unnamed engineering sources at Coinbase whom he said confirmed Coinbase was the company in question. Coinbase did not respond to The New Stack’s request to confirm or deny whether they were the company in question.

Observability Costs: The Impact on Developers

In Datadog’s case, the numbers are complicated by the fact the company offers more than observability solutions, including security bills. The earnings report did not clarify how many such SaaS services the unnamed company used.

While $65 million is a shocking amount — shocking enough that the news quickly circulated on Twitter — bills in the ten million range are not unusual for traditional observability companies, said Shahar Azulay, CEO of observability alternative provider Groundcover.

“Big companies, like Coinbase, have already proceeded to the $10 million per year, price tag a while ago,” Azulay said. “It’s not rare to hear companies paying Splunk, Dynatrace, Datadog — all like the big observability players — paying them over $10 million a year and even paying multiple vendors, each of them above like two figures per year.”

Part of it is how observability companies choose to price their offerings, he added. Observability solutions monitor three types of data: Logs, metrics and traces (which monitor the pathways for interactions, such as end-to-end transactions and what happens between services). It’s difficult if not impossible to predict how these data sources will grow, particularly when they might be spiked by events such as Black Friday, when customer usage peaks.

“It has tons of unpredictability and a lot of dependency on the amount of data you push to your log, and that’s the base root cause on these huge volumes of pricing points because you’re not being able to control that and you’re not being able to know how much you’re going to pay next month,” Azulay said.

What’s more, even if a contract is for one tier level, once a company exceeds that tier, it’s billed at the higher tier rate from that day forward, he added.

“That specific logline can be a critical part of the infrastructure, say, a search engine in Google or whatever that runs a million times a day — just customers using it a million times a day,” he said. “Suddenly, from an organization perspective, you could have just pushed like a million more log lines or data points into data without knowing that as a developer. It creates a cycle of developers creating applications, building business logic that supports what the organization should be doing as a product, and then R&D management, figuring out two months later, ‘Oh, that just spiked our prices by 50%.”

That may fall back on the developer for pushing too much information to the observability stack, he said.

“They’re causing the developers to cut down on the number of data points they push to monitor the production,” Azulay said. “It’s a weird, vicious cycle of developers wanting more data to troubleshoot, and management being put in a trade-off where they have to pay tons of money for that.”

Not all observability companies charge this way. Groundcover, which uses an eBPF Agent for observability, collects the data but stores only what matters, which it says is more cost effective, so it can charge by the number of servers running in production, Azulay said.

More Datadog Deals

Earnings reports provide an opportunity to learn about the inner workings of public companies. For instance, the Datadog first-quarter earnings report shows that revenue during that quarter was $482 million — an increase of 33% year over year despite a March service outage that reduced the quarter’s revenue by $5 million. That outage required three shifts of 500-600 engineers working on the outage, Pomel said. Billings were $511 million for the first quarter up 15% year over year, Pomel said.

The company ended the quarter with 25,500 customers, up 19,800 over the same quarter last year. Of those, about 2,910 have an annual reoccurring revenue (ARR) of $100,000 or more, up from about 2,250 last year. These customers generate about 85% of Datadog’s annual reoccurring revenue. The majority of customers — 81% — also use two or more Datadog products.

In the first quarter of the year, Datadog executives reported signing an eight-figure deal with a leading AI company; a seven-figure expansion with a Fortune 500 health care company; a seven-figure multiyear deal with a leading university in Australia that had historically relied on open source solutions; and an expansion deal to another eight-figure ARR deal with “one of the world’s largest fintech companies.

“This customer has expanded meaningfully over time, and today see Datadog platform used by thousands of users across dozens of business units,” Pomel reported. “With this expansion, this customer now uses 14 Datadog products and is consolidating multiple open source, homegrown and commercial tools across observability and security into the Datadog platform.”

It also provided a peek at why companies might be willing to shell out eight-figure fees to the company. Before Datadog, Pomel said the Fortune 500 health care company would need to mobilize upper to 150 employees for an average of 3-4 hours for the same function that now requires only 20 employees for about 30 minutes.

Correction: The story has been updated to reflect the correct capitalization of the second d in Datadog; also Groundcover does collect data, but doesn’t store all data.

The post Datadog’s $65M Bill and Why Developers Should Care appeared first on The New Stack.

Google Goes Gaga for AI Developer Tools

Loraine Lawson — Fri, 12 May 2023 15:12:53 +0000

At its I/O developer conference this week, Google demonstrated how its AI can support more automation in developer spaces. Among the announcements was the application of AI to mobile Android development and cloud application deployments.

Studio Bot for Mobile Android Development

Matthew McCullough, vice president of Product at Android Developer, introduced three new features — all reliant on AI — into the world of Android development. First, Google has added AI directly into the Android developer workflow. Second, it has added more support for developing in a multi-device world, including support for flippable or foldable phones, which Google also unveiled at I/O on Wednesday. Finally, it demonstrated how a new language toolkit and tool improvements “seamlessly converge in the modern Android development stack,” McCullough said.

Jamal Eason, director of product management, demonstrated how AI is being put to use in an Android Studio tool called Studio Bot that helps refine code and deploy with best practices based on the hardware your app will be deployed on. By using the assistant window found in the toolbar, developers can chat with the bot to create.

“Studio Bot is a tightly integrated AI-powered helper in Android Studio, designed to make you more productive,” Eason said. “What’s unique about this chat setup is that you don’t need to send your source code to Google — [it’s] just a chat dialogue between you and the bot.”

Studio Bot can generate code — of course, it can — but it does more, Eason explained.

“It’s the place to ask questions in context. So for instance, I don’t need this layout in XML but in Kotlin, since I’m doing this app in Jetpack Compose, so let’s ask studio bot: How do I do this in Jetpack Compose? And perfect, the code makes sense.”

It also provides additional guidance and documentation, he said. Developers can also ask the Studio Bot to create a unit test for the app as well.

“I’m just having a conversation with Studio Bot, but it remembers the context between question to question,” he said, demonstrating that Studio Bot created the unit tests in the right context.

The bot can even explain what caused a crash — in this case, Eason forgot to add internet permissions. With the click of a button, thanks to the AI, he generated the missing code and added it.

“Now once you build your app with the help of Studio Bot, you’re ready to publish to Google Play, and here, we’re also bringing the power of AI,” Eason said. “So today, we’re launching a new experiment of the Google Play console that regenerates custom store listings for different types of users. You will ultimately have control of what you submit, but Google Play is there to help you be more creative, from start to finish, develop to publish. We’re deploying AI to help you move faster and to be more creative.”

Cloud Development with Duet AI

Cloud development hasn’t been left out of the generative AI rush, either. Google introduced Duet AI as an AI-powered collaborator, with the ability to build code models trained directly by your own code.

“When it comes to cloud, generative AI is opening the door for professional developers with different skill levels to be productive,” said Chen Goldberg, general manager and vice president of Engineering for Kubernetes and Serverless at Google. “We believe to add AI fundamentally changes how developers of all skill levels can build cloud applications.”

With the new cloud capabilities, any developer can build enterprise-ready applications without expertise in security, scalability, sustainability and cost management, she said.

She demoed adding support for Hindi to a shopping website called Simple, which has a lot of customers in India. She used Cloud Workstation, a secure, fully managed development environment now available in GA.

“All I have to do is to create a function and add a comment, and now, thanks to Duet AI, I can see a code snippet for using Cloud Translation API [that] it suggested to me immediately,” Goldberg said. “Generate code is a good start, but good software engineering practice, like ensuring that my dependencies are up to date, is essential. So before I go to production, I can check these and ensure they work.”

The AI detected she was running an old version of a telemetry library and with a click, allowed her to upgrade it and then the website supported Hindi.

“What would have taken me a long time — not to mention I might not have been able to do it on my own — just took me minutes,” Goldberg said.

It can also be applied to existing code.

“One of my personal favorite ways to use Duet AI is making the work of maintaining large code bases simpler,” Goldberg said. “I’ve come across this code, which I’m not familiar with. Well, now instead of pinging the owner, searching for related code, and spending a long time reviewing it, I can just ask Duet AI to help me understand this piece of code.”

Duet AI currently is available only through Google’s Trusted Tester Program. VertexAI offers a similar experience for your own codebase, she added.

“You can tune and customize foundation models from Google with your own code base; no ML expertise required,” she said. “And you can call your custom code models directly from the Duet API.”

Vertex AI can tune and customize foundation models, but it can also be used to create new content, such as images, she said.

“In Vertex AI, you can easily access a full suite of foundation models from Google and open source partners with enterprise-grade data governance and security, and you don’t need to worry about all the work needed to set them up,” Goldberg said, demonstrating Vertex AI’s ability to take a handbag picture, add it to the image foundation model, and create multiple variations of the image.

“It will work regardless of the complexity of the image, giving me the freedom to easily iterate and explore different options without the complexity of hosting my own model or figuring out the hyperparameters,” she said. “With Vertex AI, I can quickly and easily upscale it [the image] so it looks consistent on high-resolution displays in my online store and in print, it’s almost ready to be added to my site. As I’m expanding globally, the power of Vertex AI will lead me to generate text captions for accessibility and localize them into more than 300 languages.”

Look, Ma, No Code

She then threw out all the stops and showed how Duet AI could be integrated with Google Workspace to create an app without even knowing how to code.

“I describe in natural language the travel approval app I want to build. Next, Duet AI walked me through the process step by step, asking a simple set of questions like ‘How would you like to be notified? What are the key sections of my app? And most importantly, what’s the name of the app?’ Let’s call it simple travel,” she said. “Once the questions are answered, Duet AI creates the app with travel requests from my team within Google Workspace.”

The new chat APIs are now available in Google Workspace and will be generally available in the coming weeks, she added. With those APIs, developers can build chat apps that let users perform actions such as creating or updating records. Atlassian used these APIs to build their Jira app for chat, she noted. Jira app allows teams to track issues, manage projects, and automate workflows.

Also, coming to preview in the next few weeks will be the new Google Meet APIs and two new SDKs that allow developers to bring Google Meet data and capabilities to apps, she added.

For a look at the AI models underlying these developments, check out, “Google’s New TensorFlow Tools and Approach to Fine-Tuning ML.”

The post Google Goes Gaga for AI Developer Tools appeared first on The New Stack.

Kafka on Kubernetes: Should You Adopt a Managed Solution?

Brianna Blacet — Thu, 11 May 2023 17:43:24 +0000

Companies are increasingly choosing to run Apache Kafka on Kubernetes, and for good reason. Kubernetes provides a highly scalable, resilient orchestration platform that simplifies the deployment and management of Kafka clusters, allowing DevOps to spend less time worrying about infrastructure and more time building applications and services. Experts expect this trend to accelerate as more organizations use Kubernetes to manage their data infrastructure.

If your company is just getting started with Kafka in your Kubernetes environment, you’ll have several decisions to make, beginning with whether to deploy Kafka yourself or to purchase a managed solution.

The right answer will depend on your specific environment and the regulations that govern your industry. In this article, we’ll walk through the various factors at play, so you can help your organization make an informed decision.

Costs and Benefits of Self-Managed Kafka

Self-managed or “do-it-yourself” (DIY) Kafka has some advantages. You’ll have more control over your deployment, including whether to extend it across multiple clouds. It may be easier to align with your internal security and operations policies, accommodate your specific data residency concerns, and better control costs.

In this scenario, your in-house staff must perform the following tasks:

Setting up the infrastructure and storage
Installing and configuring the Kafka software
Setting up Apache Zookeeper™, if necessary. (Zookeeper is now deprecated and will no longer be supported as of Kafka v. 4.0. After that point, Kafka will use KRaft, the Kafka Raft consensus protocol.)
Monitoring and troubleshooting your clusters
Security
Horizontal and vertical scaling
Replication (for disaster recovery and availability)

Is Managed Kafka a Better Fit?

“Managed” Kafka is a service you can purchase from some hyperscalers, such as Amazon Web Services, and other third-party vendors. While the initial cost of the service may give you sticker shock, you may save money on hosting and payroll.

That said, some managed solutions may still require your team to have some level of Kafka expertise on board, especially during the setup phase.

With managed Kafka, you’ll lose the ability to control your data residency. What’s more, if you’re not sure how much compute or storage space you’ll need, you may end up with some surprise hosting costs.

What’s Included in a Managed Solution?

While each Kafka vendor’s exact offering varies a bit, hosted solutions include setup of the cloud infrastructure necessary to run Kafka clusters, including virtual machines, network, storage, backups and security.

Most managed solutions (whether or not they include hosting), provide features that:

Install and manage the Kafka software, including upgrades, patches, and security fixes.
Monitor Kafka clusters for issues such as running out of memory or storage space and provide alerts and notifications when problems arise. These solutions usually also include tools for troubleshooting and resolving problems like the above.
Ensure that data stored in Kafka clusters is durable and available by replicating data across multiple nodes and data centers.
Perform a variety of additional functions, depending on the solution. For example, they may include features that easily install additional functionality — such as schema management, connectors, and ksqlDB—which allow you to easily integrate with other data systems, transform data and build real-time applications.

The Decision-Making Process: Where to Start

Installing, configuring, and maintaining Kafka isn’t merely a matter of opening the manual and diving in. Every organization is different. Your Kafka implementation will vary, depending on your cloud provider, the size of your deployment, the applications you’re running and the size of your company, among other factors. So you’ll need a team with the specific skills required to perform the tasks in your unique environment.

In some companies, there may be more than one department involved—one to install the clusters and set up the infrastructure and another to “administer” Kafka, meaning set up topics, configure the producers and consumers, and connect it all to the rest of your application(s). Even if you have folks on board with some Kafka experience, they may not have the knowledge they need to set it up in a cloud or Kubernetes environment. So you may have to hire in this skill set or get training for your existing staff. It may take them a while to come up to speed. This indirect cost may not be trivial, especially if you work for a smaller organization.

To muddy the waters further, there are many “flavors” of technical staff who work with Kafka. Many don’t have the word “Kafka” in their titles. However, a quick search on LinkedIn turned up a few of the job titles that do:

Kafka site reliability engineer (SRE)
Staff software engineer, Kafka
Kafka admin
Kafka developer
Kafka engineer
Kafka support engineer
Java developer with Kafka

Depending on your location, as well as on the seniority and the specific job responsibilities, the cost to hire staff members to work on Kafka can vary tremendously.

In some cases, you may wish to split the job into two (the infrastructure responsibilities and the development responsibilities). In others, you may want to hire folks who will have responsibilities beyond just your Kafka deployment. Either way, this is one of the major costs associated with DIY Kafka.

If you choose a managed Kafka solution, you won’t need as much Kafka expertise on your team, since your provider will take care of most of the operational tasks involved.

However, as mentioned earlier, some solutions may still require you to perform a significant number of setup tasks. You’ll still need staff to build your Kafka-based applications and/or integrate them into your application ecosystem.

Consider Your Cloud Provider

Depending on the Kafka solution you’re considering, you’ll need to think about hosting. While this is obvious in the DIY scenario, there are still decisions to make with managed Kafka. Some providers, such as Confluent and Amazon Managed Streaming for Apache Kafka (MSK), include cloud hosting as part of their solutions. Others, such as Aiven and Cisco Calisti, are not hosted solutions. Still others, such as Instaclustr, give you the option to run your Kafka deployment in their cloud environment or use your own. So you’ll need to factor in cloud cost and convenience as you make your choices.

Open Source: A Hybrid Option

If you’d like the idea of using some of the features available in a managed Kafka solution, but would still prefer to retain control over your data and cloud compute and storage, consider using an open source solution.

An example is Koperator, a Kubernetes operator that automates provisioning, management, autoscaling, and operations for Kafka clusters deployed to Kubernetes.

Koperator provisions secure, production-ready Kafka clusters, and providing fine-grained configuration and advanced topic and user management through custom resources. Have a look at Koperator’s readme.md file and feel free to contribute to the project.

Learn more about Cisco Open Source and join our Slack community to be part of the conversation.

The post Kafka on Kubernetes: Should You Adopt a Managed Solution? appeared first on The New Stack.

Nutanix Adds 3 New Parts to Its Multicloud Data Platform

Chris J. Preimesberger — Wed, 10 May 2023 17:28:44 +0000

Nutanix, which two years ago with Red Hat launched Nutanix Cloud Platform, its open multicloud data management platform, revealed several substantial additions to that product at its .NEXT 2023 Conference in Chicago.

The Nutanix Cloud Platform enables enterprises to build, scale and manage cloud native applications on-premises and in hybrid cloud environments. It serves as the core platform for all Nutanix users that provides a unified environment for virtualized workloads, containers and bare-metal applications across private, public and hybrid clouds. NCP is built on Nutanix’s hyperconverged infrastructure (HCI), which combines storage, compute and virtualization into a single appliance.

The new additions announced on May 9 are:

Nutanix Central
Data Services for Kubernetes
Project Beacon, a group of data-centric platforms as a service (PaaS) level services

Nutanix Central

Nutanix Central is a cloud-delivered software package that serves as a single console for visibility, monitoring and management across all IT environments: public cloud, on-premises, hosted or edge infrastructure. This aims to extend the universal cloud operating model of the Nutanix Cloud Platform to break down silos and simplify the management of apps and data anywhere, Nutanix SVP of Product and Solutions Marketing Lee Caswell told a group of reporters.

“Central is a service management model to manage federated endpoints,” Caswell said. “So from an observability and manageability standpoint, this now allows us to consolidate different clusters, if you will, across the hybrid multicloud environment. Nutanix Central then becomes our mechanism for how we go and help customers with a single pane of glass across all of these endpoints.”

From the Nutanix Central dashboard, customers will access domain and cluster-level metrics, including capacity utilization and alert summary statistics, to get a quick overview of the state of each domain, Caswell said. This functionality will also provide seamless navigation to individual domains, based on individual user role-based access control (RBAC), across all domains registered, he said.

Nutanix Central also will support multidomain use cases, including federated identity and access management (IAM), global projects and categories, and global fleet management, Caswell said. This all enables IT teams to deliver self-service infrastructure at scale while remaining in control of governance and security, he said.

Data Services for Kubernetes

Nutanix Data Services for Kubernetes is designed to give users control over cloud native apps and data at scale, Thomas Cornely, SVP of Product Management at Nutanix, said in a media advisory.

Initially conceived as part of Nutanix Cloud Infrastructure, NDK brings management of Nutanix’s enterprise-class storage, snapshots and disaster recovery to Kubernetes. This helps accelerate containerized application development for stateful workloads by introducing storage provisioning, snapshots and disaster recovery operations to Kubernetes pods and application namespaces, Cornely said.

NDK will give Kubernetes developers self-service capabilities to manage storage and data services, while also enabling IT with visibility and governance over consumption, Cornely said.

Project Beacon

Project Beacon is a multiyear Nutanix initiative designed to deliver a portfolio of data-centric platform as a service (PaaS) level services available natively anywhere — including on Nutanix or on a native public cloud. With a goal of decoupling the application and its data from the underlying infrastructure, Project Beacon aims to enable developers to build applications once and run them anywhere. Sort of the original aim of Java 30 years ago — long before the cloud and the edge.

“What Project Beacon does is it says, ‘We’re gonna move beyond the infrastructure we control today,'” Caswell said. “You can take portable licenses and move them from an AI-enabled edge to the multidata site to a data center and into our cloud partners.

“The issue there though, is that that’s about running our full stack. Now, what we’re saying is we’re going to go and expand into hyper-scalar infrastructure that’s running its own compute and storage constructs. And that’s very interesting because what Project Beacon then says is, ‘We’re going to show this NDB (network database) as a service — the basis of all modern and performance applications,” Caswell said.

Caswell said Nutanix did databases first with Project Beacon to show that “now we can take our database-as-a-service offering and allow that to go and run directly on AWS, without requiring the customer-controlled advertising. That’s a very interesting way to start thinking about our intent over time.”

All three of the platform additions will become available for users later this year, Nutanix said.

Nutanix’s .NEXT 2023 Conference continues through May 10.

The post Nutanix Adds 3 New Parts to Its Multicloud Data Platform appeared first on The New Stack.

5 Things for CISOs to Know to Secure Data and Apps in the Cloud

Crowdstrike Staff — Thu, 04 May 2023 15:22:34 +0000

Digital transformation has pushed organizations to adopt a hybrid IT approach and created a mix of on-premises and cloud infrastructure that has to be supported and protected.

Unfortunately, while hybrid IT holds significant promise for businesses when it comes to creating efficiencies and speeding the delivery of applications and services, it also introduces a new set of challenges.

As cloud environments become more complex and distributed, stitching together a comprehensive view of cloud activity is a vital part of enterprise security. To embrace the cloud with confidence, there are five things every CISO will need to know.

Keeping up with the New Normal

Poor visibility can lead to all manner of security risks, from data loss to credential abuse to cloud misconfigurations. It is one of the biggest challenges facing CISOs today as they look to adopt cloud technologies. In a survey from Enterprise Strategy Group, 33% of respondents reported they felt a lack of visibility into the activity of the infrastructure hosting their cloud native applications was the biggest challenge involved in securing those apps.

That should come as no surprise. Some of the difficulties businesses are facing can be traced to the rapid changes to the environment that DevOps introduces in the name of speed and scalability. From microservices to containers, modernizing your operation with cloud native applications can come at a cost to security. For example, the short lifespan of microservices means they are being spun up and down frequently, which challenges organizations’ ability to maintain a clear view of their cloud environments. Containers face a similar challenge, as many are also short-lived. While this approach effectively reduces the attack surface, it also makes obtaining full visibility more complex.

Another challenge to visibility is shadow IT. As DevOps teams push back against anything that slows them down, they often increase their use of shadow IT. This is not something done out of malice but out of necessity. If IT cannot respond to requests to provision resources fast enough — or developers prefer unapproved applications they believe will increase their productivity — IT may find itself out of the loop.

By definition, shadow IT is outside the view of IT security, which increases the probability that vulnerabilities, misconfigurations and policy violations will go undetected. In a similar vein, though the growth of user self-provisioning may be good for speed, it is not without its drawbacks when it comes to security. By making the power to provision resources more decentralized, organizations can create an environment that allows for increased agility but does so at the expense of visibility.

Meeting the Challenge: 5 Things You Need to Know

Embracing the cloud requires a comprehensive approach to security that emphasizes both monitoring and real-time workload protection.

Defending the multicloud environments that organizations have to protect today requires keeping track of what is going on across any number of cloud instances. While cloud providers often have their own tools, those solutions are typically designed for the provider’s own infrastructure and not others, leaving many organizations needing more advanced capabilities that can cover multiple clouds so they can maintain security and compliance.

The following are the capabilities CISOs should consider as they look to embrace the cloud securely:

A solution that scales: As your organization grows, your security needs will grow as well. An effective solution must be able to scale up or down as needed to provide the protection your organization requires across containers, multicloud environments, virtual machines and more.
Portability: Businesses should not need to redo security every time they deploy a new cloud instance or use different cloud providers; security should be automated and extend to new cloud instances as they are deployed.
Unified security: Integrated security reduces complexity. CISOs should be looking for a cloud native security platform that can offer a cloud security posture management solution, cloud workload protection and container security in a single, unified solution instead of relying on multiple tools and consoles.
Always on: When it comes to cloud security, simplicity should be the rule. DevOps teams need to be able to turn on automated security through their normal workflows to keep pace with the speed of app delivery and ensure that they can meet security and compliance requirements.
Comprehensive and actionable: The security solution should monitor the environment and provide a complete view of the organization’s security posture. With high levels of automation, the right security solution can speed the time to remediation and reduce the noise for security teams dealing with alert fatigue. Bolstered by threat intelligence, these capabilities will empower security teams to take more effective actions.

A Way Forward

Whether on-premises or in the cloud, protecting data, systems and applications begins with having a clear view of what is happening in the environment. As organizations look to expand their footprint in the cloud, they must choose a solution that supports security and compliance across their entire IT environment. As a CISO you need to make it your mission that you have the ability to gain the visibility you need to continuously monitor threats and ensure compliance in the cloud. Doing so will help minimize risk in the new cloud-driven ecosystem while enabling DevOps to deploy applications with greater speed and efficiency.

Learn more about CrowdStrike Cloud Security.

The post 5 Things for CISOs to Know to Secure Data and Apps in the Cloud appeared first on The New Stack.

Return of the Monolith: Amazon Dumps Microservices for Video Monitoring

Joab Jackson — Thu, 04 May 2023 14:23:21 +0000

A blog post from the engineering team at Amazon Prime Video has been roiling the cloud native computing community with its explanation that, at least in the case of video monitoring, a monolithic architecture has produced superior performance than a microservices and serverless-led approach.

For a generation of engineers and architects raised on the superiority of microservices, the assertion is shocking indeed. In a microservices architecture, an application is broken into individual components, which then can be worked on and scaled independently.

“This post is an absolute embarrassment for Amazon as a company. Complete inability to build internal alignment or coordinated communications,” wrote analyst Donnie Berkholz, who recently started his own industry-analyst firm Platify.

“What makes this story unique is that Amazon was the original poster child for service-oriented architectures,” weighed in Ruby-on-Rails creator and Basecamp co-founder David Heinemeier Hansson, in a blog item post Thursday. “Now the real-world results of all this theory are finally in, and it’s clear that in practice, microservices pose perhaps the biggest siren song for needlessly complicating your system. And serverless only makes it worse.”

In the original post, dated March 22, Amazon Prime Senior Software Development Engineer Marcin Kolny explained how moving the video streaming to a monolithic architecture reduced costs by 90%. It turns out that components from Amazon Web Services hampered scalability and skyrocketed costs.

The Video Quality Analysis (VQA) team at Prime Video initiated the work.

The task as to monitor the thousands of video streams that the Prime delivered to customers. Originally this work was done by a set of distributed components orchestrated by AWS Step Functions, a serverless orchestration service, AWS Lambda serverless service.

In theory, the use of serverless would allow the team to scale each service independently. It turned out, however, that at least for how the team implemented the components, they hit a hard scaling limit at only 5% of the expected load. The costs of scaling up to monitor thousands of video streams would also be unduly expensive, due to the need to send data across multiple components.

Initially, the team tried to optimize individual components, but this did not bring about significant improvements. So, the team moved all the components into a single process, hosting them on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS).

Takeaway

Kolny was careful to mention that the architectural decisions made by the video quality team may not work in all instances.

“Microservices and serverless components are tools that do work at high scale, but whether to use them over monolith has to be made on a case-by-case basis,” he wrote.

To be fair, the industry has been looking to temper the enthusiasm of microservices over the past decade, stressing it is only good in some cases.

“As with many good ideas, this pattern turned toxic as soon as it was adopted outside its original context, and wreaked havoc once it got pushed into the internals of single-application architectures,” Hansson wrote. “In many ways, microservices is a zombie architecture. Another strain of an intellectual contagion that just refuses to die.”

The IT world is nothing but cyclical, where an architectural trend is derided as hopelessly archaic one year can be the new hot thing the following year. Certainly, over the past decade when microservices ruled (and the decade before when web services did), we’ve heard more than one joke in the newsroom about “monoliths being the next big thing.” Now it may actually come to pass.

The post Return of the Monolith: Amazon Dumps Microservices for Video Monitoring appeared first on The New Stack.

Setting the Record Straight on Cloud Computing ROI

Kate Turchin Wang — Tue, 02 May 2023 17:00:44 +0000

There have been some misleading headlines floating around the mainstream media as of late regarding cloud computing. Articles from the Wall Street Journal (“CIOs Still Waiting for Cloud Investments to Pay Off”) and InfoWorld (“Companies are Still Waiting on their Cloud ROI”) have been stirring up misinformation around the claim that cloud computing has not delivered the ROI it initially promised and that companies have been pulling out, seeking greater cost saving opportunities on premise.

At Cloudticity, we’ve been openly critical of this narrative these past few months, sharing our insight on where these articles went wrong and how to properly engage in a discussion around cloud computing ROI by looking beyond cost savings to other more important measures.

But a new article by David Linthicum in InfoWorld called “Making a New Business Case for Cloud Computing” finally gets it right. In the article he states:

“The most significant value of cloud computing is rarely found in cost savings, although they sometimes do occur; it’s about delivering the more critical business values of agility and speed to innovation.”

To that we say, bravo! We’re glad to hear that the industry is finally getting some clarity here. Looking at the cloud as merely a cost saving solution is to miss the point entirely of what the cloud is and what it delivers.

Five Ways to Assess the Value of Public Cloud

As the InfoWorld article mentions, it was never about cost savings. There are soft values attributed to cloud that are difficult to quantify and measure. But if you take the time to look, that’s where you’ll find the gold.

When evaluating the ROI of your cloud computing program, consider the following areas of transformation.

Speed of innovation and agility

Cloud computing was designed to make it easier for companies to innovate, without having to worry about managing hardware and maintenance of the physical space.

For one, you can test out a new idea upon its conception in the cloud. You can say, “Hmm, I wonder if this would be a good project.” And you can spin up a virtual server for pennies on the dollar to test out your new idea. Then, you can spin the server down right away if it’s not working out how you’d hoped.

Imagine trying to accomplish that same scenario on-premise. You’d have to create a project plan and budgeting proposal. You’d have to get buy-in and approval from the top. Then, you’d have to order hardware, wait for it to be delivered, wait for the infrastructure team to rack it and stack it and get it tuned up.

If you didn’t plan your infrastructure needs perfectly you could end up needing more than you initially anticipated, causing the whole project to come to a screeching halt. Or, perhaps you ordered too much and the project never ended up taking off how you envisioned. The business takes a loss, and you’re the one at fault!

An incentive structure like that just doesn’t foster innovation. But using the cloud, experimentation is very low risk making it an ideal environment for companies aiming to be on the cutting edge with their tech to stay ahead of an innovative market.

The cloud is allowing companies to deliver 3x more features per year, or perhaps even more. While it’s hard to measure the ROI of that, we know anecdotally that it’s significant.

Adoption of new technologies

In an on-premises model, it can be prohibitively costly to keep up with technology advances. Organizations typically replace on-premises servers every three to five years, which is an eternity in the fast-paced world of technology.

Cloud Service Providers (CSPs) like Amazon Web Services, Microsoft Azure, and Google take the opposite approach. They adopt new technologies as they are introduced and quickly make them available to their customers as a competitive differentiator.

This means that companies using the public cloud can readily adopt the latest and greatest technologies as soon as they’re introduced, test them out and integrate them into their solutions without a large capital investment or even a commitment.

Let’s take a closer look at what this means for something like a machine learning (ML) instance.

Historically, artificial intelligence (AI) and ML frameworks were incredibly complex and expensive to implement and manage. Only well-funded organizations with a large amount of capital to throw around were able to invest in this technology. Today, cloud providers have made this technology available to nearly anyone with an internet connection.

In an industry like healthcare, the ability to run ML models on large sets of patient data is practically invaluable. By scanning terabytes, petabytes, or exabytes of historical data for trends and anomalies, healthcare organizations are able to identify more effective treatments, predict outcomes with incredible accuracy and even diagnose life-threatening problems earlier in order to increase the chances of survival.

When discussing the ROI of cloud in a space like healthcare, it’s necessary to flip the question around. What is the cost of not putting these new tools to work? How many people will die that could have been saved?

In other industries, like financial services, the answer will be less emotionally charged, but meaningful nonetheless.

Ability to scale

A telehealth provider client of ours experienced a 30-fold increase in demand for their application during March 2020. If this was an on-premise solution the company would have had to throw in the towel and accept the dreary fate of not being able to support all the potential business. They’d have missed out on customers while also failing to perform their business mission of providing critical healthcare.

Luckily, this was a cloud-native application. It required some re-architecture to account for the change, but within a matter of hours it was up and running in tip-top shape, reliable and performant and able to handle the influx of patients who needed care.

Being able to serve all potential customers without ever turning anyone away is a huge value that may be difficult for businesses to measure. But, again, it’s worth asking the question: what’s the cost of not having this capability?

Time spent on value-adding activities

Infrastructure management is an undifferentiated activity, meaning that it’s administrative and doesn’t increase your application’s competitive advantage in the eyes of customers in the way that a new feature or solution would. It’s more like a utility. You just need it to work!

Compare it to something like managing an electrical plant. Would you rather recruit and hire electrical experts, build out a power plant on your premises, and manage that going forward, or would you rather simply plug into the system and pay the provider for what you use?

The cloud allows companies the opportunity to allocate more internal resources toward differentiated activities that require creativity, innovation, and industry knowledge to do well. Things like developing applications, delivering healthcare or creating and executing strategic business plans can all receive greater focus when you retire your data center and start using the public cloud.

In a discussion around cloud ROI, take a look at what percentage of internal resources and individual employee time is dedicated to differentiated vs. undifferentiated activities, and compare that to pre-cloud transition.

Security posture

Most people these days understand the shared responsibility model and are bought into the idea that the cloud has the potential to be more secure than traditional infrastructure. That’s because the big CSPs can hire the best cybersecurity talent in the world and invest in best-in-breed tools that allow them to secure their end of the cloud more effectively than most organizations could if they owned the infrastructure themselves.

Another benefit of the shared responsibility arrangement is that security certifications and compliance attestations become easier to achieve in the public cloud. CSPs like AWS, Microsoft and Google all offer services that are HITRUST, SOC 2 and FedRAMP compliant, to name just a few of the supported compliance frameworks. Just by using those services, you can inherit attestation to various controls (or policies) that your CSP has already met. You can shave dozens or hundreds of hours off your certification or audit timeline by inheriting from your CSP. In fact, companies that work with Cloudticity are reducing their HITRUST certification timeline by 40–60%.

Global cyber attacks increased 38% in 2022. It goes without saying the value of good security is high, and these certifications offer other benefits such as increased marketability in highly regulated markets, which can directly translate to financial ROI.

Wrap-Up

You wouldn’t measure the value of a car based on how much cheaper it is than a horse. Similarly, you can’t measure the overall value cloud delivers by the size of your bill.

There are cost-saving opportunities based on how you architect and manage your cloud. You should absolutely focus on cloud cost optimization, but cost savings shouldn’t be your number one goal.

Cloud is about enabling organizations to solve problems that previously could not be solved. It’s about freeing companies to get out of the data center business so they can be pioneers and focus on what makes them very, very special.

It’s gratifying to see the industry finally coming around to understand this. It would be a shame for the world to miss out on all the potential value cloud has to offer because of some uninformed headlines.

The post Setting the Record Straight on Cloud Computing ROI appeared first on The New Stack.

Infrastructure as Code or Cloud Platforms — You Decide!

Venkat Thiruvengadam — Tue, 02 May 2023 16:34:07 +0000

Let’s compare two prevalent approaches to cloud infrastructure management. First is what we broadly classify as Infrastructure as Code (IaC), where engineers use programming\scripting languages to build a set of scripts to achieve the desired topology on a cloud platform. Terraform, Cloud Formation, Chef, Puppet and Ansible are some popular ones.

This technology consists of a language to write scripts, plus a controller that can run the scripts. Once satisfied with the result, the user would save the scripts in a code repository. Subsequently, if a change is to be made, then the files would be edited and the same process repeated.

The second category would be a cloud orchestrator or platform. This would typically be a thin abstraction over native cloud APIs that would interface with the user as a web service, and the user would connect to the service (via UI or API) and build the cloud topology within that web service itself.

The topology built will be applied by the orchestrator and saved in its own database. The user does not need to explicitly save the configuration. When an update has to be made, the user will again log in to the system and make changes.

For smaller-scale use cases, a platform may be too heavy. But at scale, the IaC approach tends to morph into an in-house platform. A better strategy, in this case, is to use an off-the-shelf platform that can be enhanced with IaC scripts when customization is required. Megascale data centers like those belonging to Facebook and Netflix are a different ballgame and are not considered in this context.

‘Long-Running Context’

The fundamental value that a platform-based approach provides is what we call “long-running context.” People may also call this a “project” or a “tenant.” A context could map to, say, an application or an environment like demo, test, prod or a developer sandbox. When making updates to the topology, the user always operates in this context. The platform would save the updates in its own database within this context before applying the same to the cloud. In short: You are always guaranteed that what is present in this database is what is applied to the cloud.

In the IaC approach, such a context is not provided natively and is left to the user. Typically this would translate to something like “Which scripts need to be run for which context?” or maybe a folder in the code base that represents a configuration for a given tenant or project. Defining the context as a collection of code is harder because many of the scripts might be common across tenants. So most likely it comes down to the developers’ understanding of the code base.

A platform is a more declarative approach to the problem, as it requires little or no coding, as the system would generate the code based on the intent, without requiring knowledge of low-level implementation details. Meanwhile, in the case of IaC, any changes require a good understanding of the code base, especially when operating at scale. In the platform approach, a user can come back and log in to the same context a few days later and continue where they left off without having to dig deep into the code to understand what was done before.

Difference Between the Code Base and What Is Applied to the Cloud

The second fundamental difference between the two is that IaC is a multistep process (write the script, run it and merge it in the repo), while a platform is a one-step process (log in to the context and make the change). With IaC, it is possible that the user might update a script, but may also forget or postpone saving it in the repository. Meanwhile, another engineer could have made changes to the code base for their own side of topology and merged it. Now, since many pieces of code are shared for the two use cases, the first developer might find themselves in a conflict which, even if resolved by merging the code, lands them in a situation where what was run in the cloud is not what is in the repo. Now the developer has to re-run the merged code to validate, notwithstanding the possibility of causing regression. To avoid this risk, we need to now test the script in a QA environment.

All the ‘Other’ Stuff

IaC tools will enable deployments, but there is so much more to running infrastructure for cloud software. We need an application-provisioning mechanism, a way to collect and segregate logs and metrics per application, monitor health and raise alerts, create an audit trail, and an authentication system to manage user access to the infrastructure. Several tools are available to solve these individual problems, but they need to be stitched together and integrated into an application context. Kubernetes, Splunk, CloudWatch, Signalfx, Sentry, Elk and Oauth providers are all examples of these tools. But the developer needs a coherent “platform” to bring all this together if they want to operate at a reasonable scale. This brings us to our next point.

Much of IaC Is Basically a Homegrown Cloud Platform

When talking to many engineers we hear the argument that Infrastructure as Code combined with BASH scripts of even regular programming languages like Go, Java and Python provide all the hooks necessary to overcome the above challenges. Of course, I agree. With this sort of code, you can build anything. However, you might be building the same kind of platform that already exists. Why not start from an existing platform and add customization through scripts?

The second argument I have heard is that Infrastructure as Code is more flexible and allows for deep customization, while in a platform, you might have to wait for the vendor to provide the same support. I think as we are progressing in technology to the point where cars are driving themselves — once thought to be little more than pure fantasy! — platforms are far more advanced than they are given credit for and have great machine-generation techniques to satisfy most, if not all, use cases. Plus, a good platform would not block a user from customizing the part that is beyond its own scope via scripting tools. A well-designed platform should provide the right hooks to consume scripts written outside the platform itself. Hence this argument does not justify building a code base for the majority of the tasks that are standard.

‘There Is No Platform That Fits Our Needs’

This is also a common argument. And I agree: A good platform should strive to solve this prevalent problem. At DuploCloud, we believe we have built a platform that addresses the majority of the use cases while giving developers the ability to integrate policies created and managed outside the system.

‘The San Mateo Line!’

A somewhat surprising argument in favor of building homegrown platforms is that it is simply a very cool project for an engineer to tackle — especially if those engineers are from a systems background. I live in Silicon Valley and have found a very interesting trend while talking to customers specifically in this area.

When we talk to infrastructure engineers, we find that they have a stronger urge to build platforms in-house, and they are quite clear that they are building a “platform” for their respective organizations and are not, as they would consider it, “scripting.” For such companies, customization is the common argument against off-the-shelf tools, while hybrid cloud and on-premises are important use cases. Open source components like Kubernetes, Consul, etc., are common, and thus I frequently hear the assertion that the wheel need not be reinvented. Yet the size of the team and time allocated for the solution is substantial. In some cases, the focus on building the platform overshadows the core business product that the company is supposed to sell. While not entirely scientific, I tend to see these companies south of San Mateo.

Meanwhile, the engineering talent at companies north of San Mateo building purely software as service applications is full stack. The applications use so much native cloud software — S3, Dynamo, Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS) — that it’s hard to be hybrid. They are happy to give the container to Amazon Elastic Container Service (Amazon ECS) via API or UI to deploy it. They find no joy in either deploying or learning about Kubernetes. Hence, the trend and depth of in-house customizations is much less.

How many times and how many people will write the same code to achieve the same use? Time to market will eventually prevail.

The post Infrastructure as Code or Cloud Platforms — You Decide! appeared first on The New Stack.

A Quick Guide to Designing Application Architecture on AWS

Chris Yommer — Fri, 28 Apr 2023 20:02:54 +0000

An architecture design is a blueprint for your application and an essential component of the design process. It outlines the infrastructure, components and services required to build and operate the application. When you create a robust architecture design, you know whether the app is scalable, secure and cost-effective.

In this article, we’ll discuss the steps to follow when designing your application architecture on Amazon Web Services (AWS), how to deploy and test applications, and how to monitor applications. But first, let’s start with why you should build your applications on AWS.

Why You Should Build on AWS

AWS is an excellent platform for application development from an engineering perspective. With AWS, you do not have to worry about manually hosting applications. Being able to use services like ECS or EC2 to host an application helps facilitate what can sometimes be the most painful part of developing an application: putting it somewhere that’s accessible for your target audience. AWS makes it easier to get an application from development complete to deployed and usable. And to top it all off, after setting it up once, you can reuse that setup as a template for future applications.

Additionally, AWS offers a comprehensive suite of machine learning services, such as Amazon SageMaker and Amazon Rekognition, which can be used to build intelligent applications with ease. Overall, AWS provides a powerful and flexible infrastructure for application development, with a broad range of services that cater to various use cases, making it the way to go for many engineering teams.

Next, let’s look at the steps you should follow when designing your application on AWS.

Determine Your Workload Type

The first step is to determine the workload type. Ask yourself, “Will my application be CPU-intensive, memory-intensive or I/O-intensive?” Knowing the answer will help you choose the right instance types, storage solutions and other resources to support your workload.

Plan for Scaling

Scaling is crucial for making sure that your application can handle increased levels of demand. Determine how your application will scale and which services you’ll use to support that. Consider factors such as whether the app requires horizontal or vertical scaling. Look to AWS services such as Elastic Load Balancing, Auto Scaling and AWS Lambda.

Prepare for Resilience

No application is immune to failure, so look for potential failure points in your architecture design. Consider how your application will handle failures and which backup and recovery procedures you’ll put in place. Services such as AWS Backup and AWS Disaster Recovery help you prepare for potential failures.

Implement Security Measures

Security should be an ongoing priority when designing applications. What security measures will you implement? How will you protect your application and data? AWS offers a range of tools and services, such as AWS Identity and Access Management (IAM), AWS Shield and AWS WAF.

Develop a Monitoring Process

Monitoring your applications is critical for identifying issues and ensuring strong performance. Plan for how you’ll monitor your application and what metrics you’ll track. Amazon CloudWatch, AWS X-Ray and AWS Config are just a few examples of the available monitoring tools.

Pro Tip

Yum is a package manager specific to Amazon Linux, CentOS and other Fedora-based operating systems. Regularly running the yum update command is an important step in ensuring the security and stability of your Amazon EC2 instances created with CloudFormation. This triggers an update to the RPM package and makes sure you get the latest security patches, bug fixes and features.

How to Develop and Test AWS Apps

Development and testing are essential stages if you want your application to be functioning, high-performing, scalable and reliable. AWS provides many developer tools and testing options, such as AWS CodeCommit, AWS CodeBuild, and AWS CodeDeploy, to help you develop, test and deploy your applications on AWS.

Here are the fundamental steps in developing and testing your application.

Write and Test Your Application’s Code

The first step in developing and testing your application is to write the code. You can use any programming language and development environment that’s compatible with AWS. AWS provides a range of software development kits (SDKs) and APIs that make it easy to integrate your application.

Use tools such as AWS CodeCommit, which is a fully managed source control service. This collaborative tool enables you to track changes and collaborate with other developers throughout development and debugging processes.

After writing code, test it. You can use AWS CodeBuild, a fully managed build service, to build, test and deploy your application code. You can also use AWS Lambda, a serverless compute service, to test your application code without deploying it to a server.

Resolve Errors and Debug Issues

AWS CodeBuild is where most of the work is done to resolve errors and debug issues from an automated testing perspective. CodePipeline can pick up changes pushed to the git repo, which then kicks off a CodeBuild step that runs any automated testing.

CodeBuild can run a series of predefined tests and checks, or you can define custom tests and checks to run against your code. If any errors or issues are detected during the testing phase, CodeBuild will alert you and provide detailed information about the problem.

Assuming the automated tests pass, CodeDeploy can then deploy your application code to the appropriate environment for user acceptance testing. CodeDeploy is an automated deployment service that makes it easy to deploy your code to production or staging environments. It allows you to deploy your code to multiple instances at once, rolling back the deployment if any issues arise.

Complete Testing

Once you’ve written and smoothed out your application code, you’ll need to complete the testing phase. You can use AWS Device Farm, a service that provides real mobile devices for testing, to test your application on a range of devices and operating systems. You can also use AWS Elastic Load Balancing, a service that distributes incoming application traffic across multiple targets. This test measures whether your application can handle a high volume of traffic.

Monitor Performance and Security

Performance and security should be a priority throughout the development process to ensure that your AWS app is secure, optimized and can handle what’s thrown at it. Consider load testing to identify performance bottlenecks, along with caching and content delivery to improve application speed. Security auditing and encryption help protect sensitive data from unauthorized access.

How to Deploy and Monitor AWS Apps

After developing and testing your AWS application, it’s time to deploy it. Here are some steps to follow for a successful deployment.

Choose Your Deployment Method

Deployment of AWS applications isn’t one-size-fits-all. There are multiple deployment methods available, including AWS CodeDeploy or using a third-party deployment service. You can also select a combination of methods.

Your choice of deployment method will depend on factors including the complexity of your application, the size of your team and your level of AWS expertise.

Configure Deployment Settings

Begin by setting up source controls using AWS CodeCommit, where you can also upload your application code. This provides the added benefit of version control, making it easier to manage changes over time. Next, create a deployment pipeline that will run automated tests and deploy the code to the appropriate environment.

Upload Your Application Code

Now, it’s time to upload your application code to the repository. This can be done either manually or through an automated process such as a build script. Once your code is uploaded, the pipeline will run automated tests to ensure that the application is functioning as expected. Any issues or errors will be flagged for your review and resolution.

Test Your Deployment

There’s one more testing phase to go through before you release your application to the public. This phase of testing can involve automated or manual tests or a combination. Automated testing is ideal for catching errors and bugs early in the deployment process, while manual testing is useful later on for verifying that your application is working as intended.

Monitor Your Application

After your application is deployed, keep track of its performance. Performance monitoring metrics include request latency and error. AWS provides monitoring tools such as CloudWatch, which tracks metrics, logs and other events related to your application.

Implement Security Measures

Security measures after deployment include encryption algorithms, third-party tools and security protocols such as HTTPS or SSH. This process helps protect your application from malicious attacks, data breaches and other threats.

Set up Backups

Regular backups are critical for preventing data loss in the event of a disaster or system failure. Your backups should cover ‌application code, databases and other related files. AWS provides backup and disaster-recovery tools such as Amazon S3 and Amazon Glacier.

Pro Tip

Leveraging automation tools can help save time, reduce errors and make your deployment process more reliable. CodePipeline is a fully managed continuous delivery service that can automate the entire delivery process, from picking up code changes to testing to deployment. CodePipeline allows you to define your entire software release process in one place, including building, testing and deploying your application code. You can create a pipeline that automatically deploys code changes to production, staging or test environments, with the ability to customize each stage to suit your needs.

Partnering with AWS Experts

When you partner with AWS experts like Mission Cloud,, your business gains access to experienced developers with deep knowledge of AWS services and tools. You’ll also learn about all the best practices for developing and deploying applications on AWS.

Connect with one of our cloud advisors and discuss the next steps in your application development project.

The post A Quick Guide to Designing Application Architecture on AWS appeared first on The New Stack.

Paris Is Drowning: GCP’s Region Failure in Age of Operational Resilience

Michelle Gienow — Thu, 27 Apr 2023 20:15:06 +0000

Google Cloud Platform’s europe-west9 region outage is precisely the type of service failure that keeps the world’s government officials up at night. Their deepest concern is for the potentially catastrophic impact a major cloud provider failure could have on financial institutions — and the very real world problems and pain this would cause for their economies.

This concern is increasingly turning to action as different countries begin proposing technical requirements aimed at ensuring operational resilience for their financial institutions, and eventually other critical services like utilities, healthcare and transportation.

Trouble in Paris

The first Google Cloud Platform incident message advising trouble in GCP’s europe-west9 region went out on April 25 at 19:00 PDT: “We are investigating an issue affecting multiple Cloud services in the europe-west9-a zone…Customers may be unable to access Cloud resources in europe-west9-a.”

Initially, GCP advised customers to fail over to other zones within its Paris-based europe-west9 region while its engineering team investigated the issue. Over the next few hours, updates indicated that “water intrusion” in europe-west9-a led to an emergency shutdown of some hardware in that zoneand continued advising failover to other zones in the europe-west9 region.

That is, until 23:05 PDT: “A multi-cluster failure has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region. There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage. Customers are advised to failover to other regions if they are impacted.”

Cloud provider service failures are not all that uncommon. They are also generally limited in scope and over with before most users even notice. Yesterday’s GCP event, though, is a textbook-definition worst-case scenario: It was not just a quickie zone outage, but an entire cloud provider region failing.

The failure of an entire region for GCP (or AWS, orAzure) means that all of the data centers in a cloud service provider’s particular geographic region, in all of its availability zones, have gone offline. (Basically, regions and zones are cloud service provider terminology for the underlying physical resources provided in one or more physical data centers).

When a full region failure occurs, all of the services hosted in that region become unavailable to users, including that cloud provider’s own platform services. Businesses may find their applications, services and websites are unavailable to their customers, leading to lost revenue and grumpy customers. Businesses that rely on a cloud provider for critical infrastructure services, such as data processing, machine learning or storage experience disruption in their operations that may have delayed projects and decreased productivity, at the very least.

The Danger Is in the Data

“When cloud regions fail, like we saw with the europe-west9 region, if you haven’t thought ahead of time about how you’re replicating your data, it’s so easy to end up in a situation where your application is hard-down,” said Jordan Lewis, senior director of engineering at Cockroach Labs. “In that scenario, there’s really nothing you can do besides wait for the cloud provider to do their best to pick up the pieces.”

Beyond the initial crisis of downtime, though, there is also the long-tail potential damage to data integrity. A whole region going dark can lead to data loss or corruption, particularly when appropriate backup and recovery processes are not in place. This happens because data is usually replicated across multiple zones (remember that zones are logical representations of physical data centers within a region). So if all the data centers in a region fail simultaneously, there may not be a backup available to restore the data.

In addition, if the outage results in data loss or corruption, which might not be immediately recognized, businesses can face the risk of legal liability, data breaches and compliance violations, to name but a few potential negative consequences. And of course, any of these could result in significant financial penalties or damages.

Operational Resilience as Mandate

Today’s GCP europe-west9 region outage is precisely the type of service failure that increasingly is turning concern into action as different countries begin proposing technical requirements aimed at ensuring operational resilience for their financial institutions.

The UK is leading the way in holding financial firms responsible and accountable for their operational resiliency.

“Financial market infrastructure firms are becoming increasingly dependent on third-party technology providers for services that could impact the financial stability of the UK if they were to fail or experience disruption,” said UK Deputy Governor for Financial Stability John Cunliffe in a joint announcement made by Bank of England, Prudential Regulation Authority (PRA) and Financial Conduct Authority (FCA) describing potential resilience measures for critical third-party services.

One of the keystone requirements: Regulators have instructed financial firms to meet operational resilience requirements, inserting governmental oversight into what used to be internal decision-making. So long as the results meet the required minimum level of operational resilience, CIOs are able to choose from scenarios that best suit their needs. Hybrid cloud (operating an additional physical data center to supplement their primary cloud infrastructure) and multicloud (running on multiple cloud provider platforms) are two of the options for satisfying these requirements.

Similarly, the European Union’s Digital Operational Resilience Act (DORA) seeks to establish technical requirements that ensure operational resilience in the face of critical service failures. As such, it is expected to apply to all digital service providers, including cloud service providers, search engines, e-commerce platforms and online marketplaces, regardless of whether they are based within or outside the EU. DORA entered into force in January 2023; with an implementation period of two years, financial entities will be expected to be compliant with the regulation by early 2025.

Ultimately, the full impact of GCP’s europe-west9 region outage will depend on the severity of the outage, its duration and the impact on critical services and data. Time will tell. But no matter what the fallout from this particular region failure, it is vivid validation of the reality that, while serious cloud provider outages are uncommon, they are also basically inevitable.

Operational resilience used to be viewed as part of business continuity planning, to be handled privately by individual companies. The time is coming, and maybe sooner than we think, when legislators and regulators will act to standardize the way individual companies approach operational resilience, all in the name of public good. Organizations need to start re-evaluating their tech infrastructure to ensure operational resilience is hardwired into their application architecture. If the countless cloud outages that have occurred over the years have taught us anything, it’s that this should no longer be a consideration, but a requirement.

Postscript: 36 hours after the initial outage and incident report, details are beginning to emerge. Apparently, a cooling system water pump failure caused water to accumulate and leak. That in turn is said to have flooded the data center’s battery room and caused a fire. It’s not immediately clear whether the data center’s fire suppression system then caused the “water incident” that took the entire europe-west9 region offline yesterday, or the actions of firefighters to contain the blaze. It looks like GPC now has two out of three zones back up, but europe-west9a is out for the foreseeable future.

The post Paris Is Drowning: GCP’s Region Failure in Age of Operational Resilience appeared first on The New Stack.

Bobsled Offers Platform-Neutral Data Sharing Service

Andrew Brust — Thu, 27 Apr 2023 18:25:21 +0000

Bobsled, a new venture based in Los Angeles, this week introduced its Data as a Service (DaaS) sharing platform and highlighted the data economy business requirements that are its underpinnings. The company also announced its Series A funding round, led by Greycroft and Madrona Venture Group.

Bobsled is working to make data sharing a universal prospect, providing connectivity to the major public cloud and data cloud platforms (including Microsoft Azure, Amazon Web Services, Google Cloud, Snowflake and Databricks) and providing the full plumbing necessary for dataset trial, subscription, version control, distribution and telemetry. Bobsled also supports delivery to numerous levels of the modern data stack, including data warehouse, data lake, notebook and business intelligence (BI) platforms.

Boblsed’s co-founder and CEO, Jake Graham, previously led the development of Microsoft’s Azure Data Share service, providing Bobsled with concrete knowledge of how to provide a sharing service, as well as a contextual understanding of what’s lacking in existing data sharing platforms. Additionally, as a result of the financing round, Bobsled is adding Greycroft Principal Brentt Baltimore, Madrona Managing Director Soma Somasegar, and .406 Ventures Partner Graham Brooks to its Board of Directors. Madrona was one of Snowflake’s early investors and Somasegar ran Microsoft’s Developer Division for several years, bringing even more domain expertise to the table.

A Contemporary and Legacy Need

For about as long as data has been around, there has been a need to share it, within teams, across business units and between organizations. The effort around doing so goes back decades and has engendered its own slate of acronyms and technologies. Approaches have included the generation of CSV (comma-separated values) “flat files” from the mainframe, more formal use of EDI (Electronic Data Interchange) across VANs (value-added networks), and erecting FTP (file transfer protocol) sites to allow parties to download data feeds. In addition to CSV and similar delimited text formats, data sharing provided the inspiration for longstanding formats like XML (Extensible Markup Language) and JSON (JavaScript Object Notation), as well as newer formats like Parquet that have given rise to the modern data lake.

But providing these technology nuts and bolts doesn’t constitute a true sharing service, any more than the provision of phone lines and email provides for customer service. Instead, these technologies have had to be cobbled together by technology teams, typically in an ad hoc manner, in order for data-sharing workflows to be operationalized. As a result, until recently, these efforts have been bespoke, at best.

Meanwhile, with the rise of the data mesh approach to data management, and its constituent innovation of “data products,” the demand for data sharing has grown, and budgets have materialized to make that demand a concrete business opportunity rather than a mere enthusiast wish-list item.

Efforts to Date

Critical mass has started to form in the last few years around more modern data-sharing protocols. Microsoft introduced Azure Data Share back in 2019, and has morphed it into Microsoft Purview Data Sharing. Databricks built the Delta Sharing protocol an open sourced it under the auspices of The Linux Foundation. Snowflake introduced Snowflake Data Sharing a while back, allowing a virtual warehouse interface to be built around shared data. Numerous vendors have also established data marketplaces to help organizations get to and use relevant third-party data sources.

But these protocols tend to be limited in scope and/or versatility. Purview Data Sharing works only between Azure Data Lake Storage (ADLS) Gen 2 or Azure Blob Storage accounts. Snowflake Data Sharing is, as the name would imply, something that gets implemented within Snowflake’s “data cloud.” Delta Sharing is more flexible, for a few reasons: it’s open source, it’s a protocol rather than a feature, and it can be integrated at the data lake, data governance, data developer, or BI tool layer of the modern data stack. Nevertheless, the protocol’s most concrete implementation as a full service is Databricks Delta Sharing.

Sharing Is Caring; Integration Brings Agitation

While data sharing has historically facilitated point-to-point data transfer, the modern crop of data-sharing platforms looks to provide discovery and connectivity, too. Bobsled is all-in on that reimagining of the space and it is explicitly looking to help eliminate the creation, maintenance and customization of data ingestion pipelines, in order to minimize complexity and (earnestly) compress time-to-value.

Sometimes the best new technology categories aren’t the ones that bring about some raw, novel innovation, but instead create adoptable platforms around things that numerous organizations have been building themselves, repeatedly and informally, often driven by urgent requirements that don’t leave time or budget for fit, finish and resiliency. A solid data-sharing platform looks like it falls into this latter category. And, as a data-sharing provider that operates independently of any particular cloud or data platform, Bobsled may bring the neutrality necessary to provide cross-platform versatility, in addition to base functionality.

The post Bobsled Offers Platform-Neutral Data Sharing Service appeared first on The New Stack.

The Complex Relationship Between Cloud Providers and Open Source

Michael Guarino — Thu, 27 Apr 2023 17:00:49 +0000

Cloud providers like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure have long had a frenemy relationship with the open source community.

On one hand, cloud providers’ Platform-as-a-Service (PaaS) layer has successfully redirected most of the value from software maintainers to themselves. However, big tech companies have also significantly contributed to the open source ecosystem.

One notable example is the Kubernetes project, which originated at Google. Other examples include LinkedIn’s Apache Kafka, Facebook’s Presto, and Airbnb’s Apache Airflow and Superset.

The tension between cloud providers and open source maintainers is on the verge of changing dramatically, but it is worth keeping up with what is happening right now.

The Platform as a Service Model

The most obvious source of tension is the establishment of PaaS as a major revenue model for cloud providers. In the early days of AWS, its core business was compute and storage.

Still, there’s a major technical gap between a raw virtual machine and a fully distributed architecture needed to host a significant web service. Over time, AWS realized it needed another layer to ensure its users were successful, largely around the provisioning and lifecycle management of complex software like databases, queues, and job orchestrators.

The clouds relied upon open source for battle-tested implementations of this software and exploited their permissive licensing to create extremely profitable lines of business on top of them. This seems nakedly exploitative of the open source developers’ effort, but it’s worth understanding why this worked.

There are two main pain points cloud PaaS solves:

The operational economy of scale
Distribution of the service

As an engineer, I can immediately see the need for an operational economy of scale. I have seen firsthand how difficult operating these systems can be. It requires a deep experience with engineering fundamentals as well as the idiosyncrasies of the software itself to provision and maintain them reliably.

Most organizations will not have the in-house expertise, and will instead have to pay someone who has that expertise.

In theory, the likes of Amazon or Google are not the only organizations that can offer this sort of service, and it turns out many open source communities have commercialized their products by creating PaaS offerings of their own, like Elastic for ElasticSearch, Confluent for Kafka, MongoDB, and others. This business model has provided vital sustaining capital to those open source projects and is a huge reason for the success they’ve had.

Distribution is the second huge competitive advantage the clouds have over really any other source of software. The harsh reality of most enterprises is that almost all action is severely bureaucratically constrained, and any vendor or open source package has to survive tedious scrutiny to begin being used.

Cloud providers have the advantage of an established commercial relationship with virtually every large business on the planet, so the friction of trialing a new service through one is orders of magnitude less, creating a massively powerful sales machine. That competitive risk has proven so great that some open source projects have considered it existential and entirely changed their licensing to counteract the threat, most notably Elastic and MongoDB.

Another interesting side effect of the emphasis on this business model has been on the codebases themselves. The spat between AWS and ElasticSearch has caused the project to become forked, with AWS maintaining OpenSearch and Elastic maintaining control over the legacy Elasticsearch codebase.

This happens on a more subterranean level as well, with services like AWS RDS effectively rewriting major relational databases like Postgres for its Aurora service ( a major advancement in database design) but not contributing that innovation back to the open source community. This works because the software is never expected to leave the walled garden of the managed service, but that ultimately neglects the wider ecosystem of open source as a result.

Cloud Providers Take a Step Forward

When you break down the problem, there are two main issues at play between the cloud providers and open source relationships.

First, there is the problem of finding a more equitable distribution of value between the infrastructure providers and the open source maintainers that doesn’t overcompensate the ability to overcome corporate bureaucratic inefficiency and allow the software ecosystem to be more self-sustaining.

There has been progress towards acknowledging this issue, with AWS leading the way again by establishing a partnership with Grafana to distribute software with a clear revenue-sharing agreement. That’s a win-win-win; Grafana gets appropriate compensation for its product, corporates get to cut through their red tape, and AWS gets another service in its catalog.

But, there’s also a technical challenge that stands at the root of all of this: open source software is frequently so complex that a third-party infrastructure provider is needed to provide a decent user experience. If that were no longer the case, this issue would change materially.

Organizations would not need to carefully parse who they outsource their infrastructure to, in order to utilize standard open source components, making distribution advantages, not game-breaking.

Developers could monetize the value of the software itself and not its “management.” And, users would not have to make compromises around data tenancy in order to rely on the best of the open source ecosystem.

Plural Is Here to Help

At Plural, we believe this is much more achievable than most people realize.

We are not in the world of the early 2010s where the cloud was new and unproven, and managing software on it was a wild west of duct-taped half-solutions.

There’s now an incredibly powerful ecosystem of tools like Kubernetes and Terraform that can virtually automate the entire problem of distributed system management, but people have simply not exploited their full potential.

We’ve already packaged over 50 major open source solutions like Airbyte, Airflow, Prefect, ElasticSearch, Kafka, PostHog, Grafana, and Argo CD to enable developers to deploy them using DevOps best practices on top of Kubernetes. Our platform provides engineers with all the operational tools they would get in a managed offering plus a verified stream of upgrades, all deployed in your own cloud for maximum control and security.

The post The Complex Relationship Between Cloud Providers and Open Source appeared first on The New Stack.

Why We Need an Inter-Cloud Data Standard

Aaron Schneider — Thu, 27 Apr 2023 15:19:18 +0000

The cloud has completely changed the world of software. Everyone from startup to enterprise has access to vast amounts of compute hardware whenever they want it. Need to test the latest version of that innovative feature you were working on? Go ahead. Spin up a virtual machine and set it free. Does your company need to deploy a disaster recovery cluster? Sure, if you can fit the bill.

No longer are we limited by having to procure expensive physical servers and maintain on premises data centers. Instead, we are free to experiment and innovate. Each cloud service provider has a litany of tools and systems to help accelerate your modernization efforts. The customer is spoiled with choices.

Except that they aren’t. Sure, a new customer can weigh the fees and compare the offerings. But that might be their last opportunity, because once you check in, you can’t check out.

The Hotel California Problem

Data is the lifeblood of every app and product, as all software, at its core, is simply manipulating data to create an experience for the user. And thankfully, cloud providers will happily take your precious data and keep it safe and secure for you. The cost to upload your data is minimal. However, when you want to take your ever-growing database and move it somewhere else, you will be in for a surprise: the worst toll road in history. Heavy bills and slow speeds.

Is there a technical reason? Let’s break it down. Just like you and me, the big cloud providers pay for their internet infrastructure and networking, and that must be factored into their business model. Since the costs can’t be absorbed, it is worth subsidizing the import fees and taxing the exports.

Additionally, the bandwidth of their internet networks is also limited. It makes sense to deprioritize the large data exports so that production application workloads are not affected.

Combine these factors and you can see why Amazon Web Services (AWS) offers a service where it sends a shipping container full of servers and data drives so you can migrate data to its services. It is often cheaper and faster to mail a hard drive than it is to download its contents from the web.

It helps that all these factors align with the interests of the company. When a large data export is detected, it probably is a strong indicator that the customer wants to lower their fees or take their business elsewhere. It is not to the cloud provider’s benefit to make it easy for customers to move their data out.

Except that it is in the cloud provider’s interest. It’s in everyone’s interest.

It Really Matters

The situation is not unlike the recent revolution in the smart home industry. Since its inception, it has been a niche enthusiast hobby. But in 2023, it is poised to explode.

Amazon, Google and Apple have ruthlessly pursued this market for years, releasing products designed to coordinate your home. They have tried to sell the vision of a world where your doors are unlocked by Alexa, where Siri watches your cameras for intruders and where Google sets your air conditioning to the perfect temperature. But you were only allowed one assistant. Alexa, Siri or Google.

By design, there was no cross compatibility; you had to go all in. This meant that companies who wanted to develop smart home products also had to choose an ecosystem, as developing a product that works with and is certified for all three platforms was prohibitively expensive. Product boxes had a ridiculous number of logos on them indicating which systems they work with and what protocol they operate on.

Source: Google Images

It was a minefield for consumers. The complexity of finding products that work with your system was unbearably high and required serious planning and research. It was likely you would walk out of the shop with a product that wouldn’t integrate with your smart home assistant.

This will change. In late 2022, products certified on the new industry standard, named Matter, started hitting shelves, and they work with all three ecosystems. No questions asked, and only one logo to look for. This reduces consumer complexity. It makes developing products easier, and it means that the smart home market can grow beyond a niche hobby for technology enthusiasts and into the mass market. By 2022, only 14% of households had experimented with smart technology. However, in the next four years, adoption of smart home technology is set to double, with another $100 billion of revenue being generated by the industry.

Bringing It Back to Cloud

We must look at it from the platform vendor’s perspective. Before Matter, users had to choose, and if they chose your ecosystem, it was a big win! Yet the story isn’t that simple, as the customers were left unfulfilled, limited to a small selection of products that they could use. Worse, the friction that this caused limited the size of the market and ensured that even if the vendor improved its offering, it was unlikely to cause a competitor’s customers to switch.

In this case, lock-in was so incredibly detrimental to the platform owners that all the players in the space acknowledged the existential threats to the budding market, driving traditionally bitter rivals to rethink, reorganize and build a new, open ecosystem.

The cloud service providers (CSPs) are in a similar position. The value proposition of the cloud was originally abundantly clear, and adoption exploded. Today, sentiment is shifting, and the cloud backlash has begun. After 10 years of growing cloud adoption, organizations are seeing their massive cloud bills continue to grow, with an expected $100 billion increase in spending in 2023 alone and cloud lock-in is limiting agility.

With so much friction in moving cloud data around, it might be easier for customers to never move data there and just manage the infrastructure themselves.

The value still exists for sporadic workloads, or development and innovation, as purchasing and procurement is a pain for these sorts of use cases. Yet, even these bleeding-edge use cases can be debilitated by lock-in. Consider that there may be technology offered by AWS and another on Google Cloud that together could solve a key technical challenge that a development team faces. This would be a win-win-win. Both CSPs would gain valuable revenue, and the customer would be able to build their technology. Unfortunately, today this is impossible as the data transfer costs make this unreasonably expensive.

There are second-order effects as well. Each CSP currently must invest in hundreds of varying services for their customers. As for each technology category, the cloud provider must offer a solution to its locked-in customers. This spreads development thin, perhaps limiting the excellence of each individual service since many of them need to be developed and supported. As thousands of employees are let go by Amazon (27,000), Google (12,000) and Microsoft (10,000), can these companies really keep up the pace? Wouldn’t quality and innovation go up if these companies could focus their efforts on their differentiators and best-in-class solutions? Customers could shop at multiple providers and always get the best tools for their money.

High availability is another key victim to the current system. Copies of the data must be stored and replicated in a set of discrete locations to avoid data loss. Yet, data transfer costs means that the cost of replicating data between availability zones internally within a single cloud region already drives up the bill. Forget replicating any serious amount of data between cloud providers as that becomes infeasible due to cost and latency. This places real limits on how well customers can protect their data from disasters or failures, artificially capping risk mitigations.

An Industry Standard

So many of today’s cloud woes come down to the data-transfer cost paradigm. The industry needs a rethink. Just like the smart home companies came together to build a single protocol called Matter, perhaps the CSPs could build a simple, transparent and unified system for data transfer fees.

The CSPs could invest in building an inter-cloud super highway: an industry-owned and -operated infrastructure designed solely for moving data between CSPs with the required throughput. Costs would go down as the public internet would no longer be a factor.

A schema could be developed to ensure interoperability between technologies and reduce friction for users looking to migrate their data and applications. An encryption standard could be enforced to ensure security and compliance and use of the aforementioned cross-cloud network would reduce risk of interception by malicious actors. For critical multicloud applications, customers could pay a premium to access faster inter-cloud rates.

Cloud providers would be able to further differentiate their best product offerings knowing that if they build it, the customers will come, no longer locked into their legacy cloud provider.

Customers could avoid lengthy due diligence when moving to the cloud, as they could simply search for the best service for their requirements, no longer buying the store when they just need one product. Customers would benefit from transparent and possibly reduced costs with the ability to move their business when and if they want to. Overall agility would increase, allowing strategic migration on and off the cloud as requirements change.

And of course, a new level of data resilience could be unlocked as data could be realistically replicated back and forth between different cloud providers, ensuring the integrity of the world’s most important data.

This is a future where everyone wins. The industry players could ensure the survival and growth of their offerings in the face of cloud skepticism. Customers get access to the multitudes of benefits listed above. Yes, it would require historic humility and cooperation from some of the largest companies in the world, but together they could usher in a new generation of technology innovation.

We need an inter-cloud data mobility standard.

In the Meantime

Today there is no standard, and all the opposite is true. The risks of cloud lock-in are high, and customers must mitigate them by leveraging the cloud carefully and intelligently. Data transfer fees cannot be avoided, but there are other ways to lower your exposure.

That’s why Couchbase enables its NoSQL cloud database to be used in a multitude of different ways. You can manage it yourself, or use the Autonomous Operator to deploy it on any Kubernetes infrastructure (on premises or in the cloud). We also built our database-as-service, Capella, to natively run on all three major cloud platforms.

Couchbase natively leverages multiple availability zones and its internal replication technology to ensure high availability alongside multiple availability zones. With Cross Datacenter Replication (XDCR), you can asynchronously replicate your data across regions and even cloud platforms themselves to ensure your data is safe even in the worst-case scenarios.

Try Couchbase Capella today with a free trial and no commitment.

The post Why We Need an Inter-Cloud Data Standard appeared first on The New Stack.

KubeCon Panel Offers Cloud Cost Cutting Advice

Loraine Lawson — Thu, 27 Apr 2023 15:00:22 +0000

Back in the days of on-premise compute, reducing costs meant cutting capital expenditures. But with the cloud’s pay-as-you-go model, how can companies realize efficiencies in light of the current economic climate?

“It’s really becoming an … operational expense and impacting companies greatly,” said Aparna Subramanian, director of product engineering infrastructure at Shopify, during a Friday session at KubeCon+CloudNativeConEurope 2023 conference in Amsterdam. “That’s the reason why we have this increased focus on optimizing and doing more with less is the mantra these days.”

Subramanian joined Phillip Wittrock, an Apple software engineer; Nagu Chinnakaveti Thulasiraman, engineering manager in the car infrastructure department at Zalando SE; and Todd Ekenstam, principal software engineer at Intuit for Friday session on, “Cloud Computing’s First Economic Recession? Let’s Talk Platform Efficiency.” The panel looked at three broad categories of reducing costs: Culture, operations and design.

Culture: Measure at App and Service Level to Find Costs

When it comes to reducing costs, the first step is creating a culture of measurement, said Wittrock.

“One thing I think it’s helpful to start with is start out measuring where your big wins are, where do you want to focus? What’s going to move the needle a lot, what’s going to take a long time to do, what’s maybe not going to move it as much but is very easy to get done?” Wittrock said. “Then from there, figure out who the right folks to engage with are, what are the right teams, so you can start looking forward.”

It can also be hard to figure out whose problem it is to increase efficiencies and cut costs, added Subramanian. That’s why it should be a cross-team effort with a financial practice or center of excellence component to it, she said.

“Often we run into the situation where it’s everybody’s problem, but it’s nobody’s problem,” she said. “Having the central team is really important but it’s also important to understand that it now suddenly doesn’t become only the central teams responsibility for making sure the platform is efficient. It has to be a collaboration between engineering, finance, procurement — the team that is negotiating contracts with your cloud vendor or other vendors.”

Ekenstam asked the packed audience for a show of hands to determine who knows what their cloud bill is. He then asked for a show of hands from those who know how much their individual services or applications costs. Not surprisingly, the number was smaller although not insubstantial.

“That’s, to me, the first step you need to know — what you’re spending,” Ekenstam said. “That’s the big challenge, taking that cloud costs, that big bill, and actually breaking it into individual teams, individual applications, because only then when you have that visibility will you know where you have the opportunities to improve.”

Intuit runs a developer portal where it tracks all of its different software assets, whether it’s a service or application, he said. Each has an asset ID that is propagated and tagged to all the resources required to support that service or application, he said. Then, IT aggregates all the billing data attributed based on the service or application, and it provides a number for the development teams. Those numbers also are rolled up and provided to various directors and vice presidents.

“It’s not enough to give a top-level CTO or the CEO the bill — you need to get that visibility to people who can actually make decisions and make changes to how the system operates,” Ekenstam said.

“That level of visibility is really the first starting point when we started looking into things more closely at Shopify,” Subramanian added. “We were able to see clearly from the cloud bill what are the different projects, what are the different clusters, but it’s not exactly helpful, right? Because if you have a multitenant platform, you want to know how much is App A costing and how much is App B costing.”

Identifying application cost can enable the platform team to go to the team or leader and hold them responsible for making the changes necessary to improve the efficiency, she added.

Don’t Automatically Cut Where CPU Is Idle

It may seem like the best plan of action would be to cut wherever there are idle resources, but that’s actually not a great idea because it could interrupt a workload that’s trying to complete, warned Wittrock.

“The idle resources may be an artifact of what are the capabilities of the platform you’re running on? What does it offer and maybe that slack just needs to be there for your availability,” he said.

That’s why it’s important to view the efficiency and waste for each application across a variety of stakeholders.

“Shopify is an e-commerce platform and sometimes we have to reserve and scale up all the way because there’s a big flash sale coming up and that time, you don’t want to be scaled all the way down and you don’t want your Cluster Autoscaler to be kicking in and doing all of these things,” Subramanian said. “There are times when you want to protect your reputation, and it’s not about efficiency.”

That’s where a central finance team can come into play, ensuring that the platform returns to normal load after big peak events like Christmas for Shopify, she added.

“That’s why you need that central finance team because there’s somebody looking at this every day and reaching out to the appropriate teams to take action,” she said.

Operations: Focus on Business Need

Photo by Loraine Lawson

Intuit has a number of different patterns to its workload. TurboTax is busiest during tax season, for instance, while QuickBooks is very busy during the traditional 9-to-5 work day, Ekenstam said.

“CPU, memory and compute resources is a big component of cost,” he said. “You need to really see how can you make your clusters and applications run most efficiently to minimize costs, but at the same time, provide the services that you need to.”

Shopify actually prepares for Black Friday and Cyber Monday by disabling autoscaling and scaling all the way up to projected traffic because then the goal is to protect Shopify’s reputation on those high volume days, said Subramanian.

“But at other times, we do leverage autoscaling,” she added. “We use VPA [Vertical Pod Autoscaler] to recommend what the right memory and CPU should be and we make that suggestion to the respective application team using a Slack channel.”

The application team knows the specific nature of their workload, so it’s up to them to review the recommendation and make the appropriate changes, she added.

Autoscaling is a key capability for reducing cloud costs, Ekenstam said, but it isn’t a panacea.

“If we can autoscale, not only your application, but also your cluster up and down, that’s for cost,” he said. “That’s obviously the best, but it does come with some disruption. So how can you minimize that disruption? I think a lot of it starts with making sure the apps can be disruptive.”

Design: Kill Kubernetes Pods to Best Utilize Resources

You can’t launch a pod in Kubernetes and expect that pod to live forever, Ekenstam said. At Intuit, they rotate their clusters every 25 days — a de-scheduler automatically de-schedules pods and reschedules them on another node to ensure it takes full advantage of the node resources, as well as so Intuit can update security patches and Amazon machine image (AMI) on the nodes, he explained.

“It also has a side effect of forcing all those applications to get rescheduled and trains our developers that, ‘Hey, I can’t count on these pods running forever. It’s okay that they terminate. It’s okay that they come back up,’” said Ekenstam. “By doing that, we’ve helped build this culture of understanding how Kubernetes works for the developers.”

Intuit is investigating developing a system that takes the recommendations from the vertical pod autoscaling, the historical metrics from each application, and then using that to make decisions and recommendations for both the VPA and the horizontal pod autoscaling (HPA). The system would integrate those recommendations and then apply them the pipeline using GitOps, he explained.

“If you change the resources of a pod, the change in resource will start back in your pre-pod environment, get tested, validate that it does work in pre-prod and work its way through the pipeline to your production environment,” Ekenstam said. “We don’t want to just suddenly change the resources in production without being able to test it first.”

Profiling Apps for Efficiency

Another step to reducing cloud spend is to ensure applications are cloud native and can run on Kubernetes, Ekenstam said. But he asked the panel what can be done beyond that.

It takes a partnership between the platform or infrastructure team and the applications team, said Subramanian.

“Something that Shopify has been working on recently is continuous profiling of applications because you don’t want to just tell application developers … make sure it’s efficient and optimal at all times,” said “In order to reduce the friction, we have rolled up this continuous profiling feature that every application is getting profiled continuously at a certain sample rate.”

That’s made it easy for developers to look at their app profile and make decisions about CPU usage, processes running, and so on, she added.

“Being able to create such tools and enable the application developers to make the right decision is also a key part of efficiency and optimization,” Subramanian noted.

At Intuit, whenever there is a new release of their platform, they run it through Failure Modes and Effects Analysis (FMEA) testing, which includes a load test, Ekenstam said.

“Then we measure how many nodes did it take to do that workload and that helps us identify some kind of performance regression, and performance regressions are also quite often cost regressions, because if you’re suddenly needing to use more nodes to do the same workload, it costs you more, so that’s another technique that we’ve used to identify and to compare different releases,” he said.

CNCF paid for travel and accommodations for The New Stack to attend the KubeCon+CloudNativeConEurope 2023 conference.

The post KubeCon Panel Offers Cloud Cost Cutting Advice appeared first on The New Stack.

Google Cloud Services Hit by Outage in Paris

Chris J. Preimesberger — Thu, 27 Apr 2023 14:40:56 +0000

Google Cloud was hit by an outage of a high number of services yesterday when a data center in Paris caught fire around midnight Tuesday. The fire caused the fire department to soak the building with water, precipitating a multi-cluster failure which shut down more than 90 cloud services.

Google Cloud’s status page, calling the cause of the failure “water intrusion,” indicated that multiple cloud services in the “europe-west9-a” zone were affected starting at 1900 PDT on April 25.

The company’s status page explained that “water intrusion in europe-west9-a led to an emergency shutdown of some hardware in that zone,” and that there was no current ETA for recovery of operations. Customers were advised to fail over to other zones in europe-west9 if they were impacted.

The incident was later described as “a multicluster failure and has led to an emergency shutdown of multiple zones.” The outage impacted more than 90 Google Cloud services for europe-west9-a customers and was ongoing at the time this article was filed.

Google declined to comment beyond its status page statements and stated that details would be provided in its incident report next week. The company apologized to all who were affected by the disruption.

The Disruption

Service disruption details are as follows:

Services that were curtailed temporarily included:

Vertex AI AutoML Image, Cloud Debugger, Text-to-Speech, Vertex AI Matching Engine, AI Platform Training, Cloud Monitoring, Vertex AI AutoML Tabular, Pub/Sub Lite, Speech-to-Text, Hybrid Connectivity, Cloud Key Management Service, Cloud Natural Language API, Cloud Run, VMWare engine, Vertex AI TensorBoard, Apigee, Cloud Developer Tools, Virtual Private Cloud (VPC), reCAPTCHA Enterprise, Cloud Workflows, Cloud Firestore, Anthos Service Mesh, Operations, Cloud Spanner, Vertex AI Explainable AI, Cloud Profiler, Cloud External Key Manager, Vertex AI Workbench User Managed Notebooks, VPC Service Controls, Cloud Armor, Recommender, Google Compute Engine, Google Kubernetes Engine, Identity Platform, Cloud Memorystore, Google Cloud Bigtable, Resource Manager API, Google Cloud Datastore, Traffic Director, Cloud Logging, Web Risk, Artifact Registry, Cloud HSM, Retail API, Vertex AI Vizier, Persistent Disk, Vertex AI Data Labeling, Google Cloud Dataflow, Data Catalog, Google Cloud DNS, Vertex AI Model Registry, BigQuery Data Transfer Service, Google Cloud Storage, Google Cloud Networking, API Gateway, Google Cloud Console, Dataplex, Google Cloud Scheduler, Eventarc, Google Cloud Composer, Identity and Access Management, Vertex AI Training, Cloud CDN, Google Cloud Pub/Sub, Access Approval, AI Platform Prediction, AlloyDB for PostgreSQL, Cloud Build, Vertex AI AutoML Video, Vertex AI AutoML Text, Cloud NAT, Google Cloud SQL, Assured Workloads, Cloud Load Balancing, Recommendation AI, Vertex AI Pipelines, Cloud Filestore, Google App Engine, Secret Manager, Managed Service for Microsoft Active Directory (AD), Cloud IDS, Cloud Domains, Access Transparency, Cloud Billing, Google Cloud Functions, Access Context Manager, Vertex AI Feature Store, Cloud Asset Inventory, Cloud Data Fusion, Storage Transfer Service, Vertex AI ML Metadata, Vertex AI Online Prediction, Vertex AI Model Monitoring, Google Cloud Identity-Aware Proxy, Video Intelligence API, Database Migration Service, Service Directory, Transcoder API, Cloud Endpoints, Vertex AI Batch Prediction, Google Cloud Dataproc, Cloud Machine Learning.

Cloud Console: Experienced a global outage, which has been mitigated. Management tasks should be operational again for operations outside the affected region (europe-west9). Primary impact was observed from 2023-04-25 23:15:30 PDT to 2023-04-26 03:38:40 PDT.

GCE Global Control Plane: Experienced a global outage, which has been mitigated. Primary impact was observed from 2023-04-25 23:15:20 PDT to 2023-04-26 03:45:30 PDT and impacted customers utilizing Global DNS (gDNS). A secondary global impact for aggregated list operation failures for customers with resources in europe-west9 has also been mitigated. Please see the migration guide for gDNS to Zonal DNS for more information.

Cloud Pub/Sub: For information related to ongoing Cloud Pub/Sub impact, please see the latest status here.

BigQuery: For information related to ongoing BigQuery impact, please see the latest status here.

Workaround: Customers can failover to zones in other regions.

The post Google Cloud Services Hit by Outage in Paris appeared first on The New Stack.

Moving to the Cloud Won’t Solve Your Security Woes

Victoria Wright — Tue, 25 Apr 2023 13:16:52 +0000

Not all press is good press. Case in point: security breaches. One local government CTO we work with proclaimed, “Part of my job is to make sure we stay out of the news.”

It’s critical that technology leaders ensure the utmost security of their systems to avoid ending up the subject of the wrong kind of headline, like Norton LifeLock, Mailchimp, Equifax and countless other companies have.

Many organizations believe that simply migrating to the cloud will solve their security problems. Although adopting a cloud model for app development and delivery has many benefits, it’s essential to address security vulnerabilities before executing a cloud migration strategy. In our work with customers from all around the world, we at VMware Tanzu Labs have found hundreds, if not thousands, of common vulnerabilities and exposures (CVEs) that should be addressed before any modernization work is done.

No one wants to admit that they were breached as a result of unremediated CVEs, but in our practice, we see it happen all the time. Even within large, prestigious organizations. In fact, according to Contrast Security co-founder and CTO Jeff Williams, “The average web application or API has 26.7 serious vulnerabilities…and organizations often have hundreds, thousands or even tens of thousands of applications.” That’s a scary high number!

While most people know that they need to address CVEs as part of their app modernization initiatives, it’s often put off. Let’s face it: It’s a lot more fun to talk about doing cool “new” things than fixing old ones, not to mention that leadership teams expect constant innovation. For large enterprises, who likely have thousands of unattended CVEs, it can be a daunting task to remediate them. As a result, many companies attempt to adopt a cloud model despite having numerous CVEs in their data centers. This can not only lead to malicious attacks, but it can set your migration progress back and waste a lot of time.

Here’s why you should remediate CVEs before migrating to cloud:

Security risk: By migrating to the cloud, there is an increased potential for breaches, as potentially more people will have access to your cloud environment compared to your data center. Mitigating vulnerabilities before you migrate will help to eliminate some of that security risk.
Compliance: In highly regulated industries, it’s critical to maintain certain security requirements. Organizations that don’t comply are met with consequences such as fines and/or reputational damage. Even if you’re not in a highly regulated industry, it’s likely that your organization has its own levels of security expectations.
Migration complexity: If you put “mess” in, it’s likely you’ll get “mess” out. If you migrate a bunch of CVEs to the cloud, you’re still going to have to deal with them once they’re there. Having a lot of vulnerabilities prior to your migration can lead to additional complexity once you’re in the cloud.
Higher cost: It’s typically easier and cheaper to mitigate CVEs while they’re still in a known environment like your data center compared to the additional cost of remediating in a new, and somewhat unknown environment like the cloud.

So, we’ve established why you should remediate CVEs before migration to cloud, but what about the “how?” We know how painstaking it can be to address CVEs, especially at a large scale, due to the time and effort necessary to mitigate them. But keeping the following three essentials in mind will help teams create an effective CVE remediation practice at scale.

No. 1: Know What You Have

There are countless tools, including many open source options, for teams to gather important information about their data centers and the vulnerabilities within them. Here are some of the open source tools that we recommend to build a full understanding of the current state of your portfolio:

Open source scanning tools help to determine your starting point

Not only can these tools inform you of what kind of vulnerabilities you have, but they can also help you prioritize them based on factors such as risk, urgency and the amount of cost and time it will take to mitigate them.

No. 2: Take Action on Vulnerabilities with Intention

You’ll need to build a solid decision-making framework to inform what you need to address now versus later. For example, critical CVEs might exist in internal-facing applications that you can live with (provided you have tight network security), but the same critical CVE in a public-facing application should be addressed immediately. We’re not suggesting every single CVE needs to be fixed; we’re suggesting you make informed choices on which ones to address and when.

Now, the act of actually mitigating your list of CVEs is the most arduous step. Depending on the size of your organization and its existing vulnerability management strategy, you may have hundreds, if not thousands, of CVEs to address. Not to worry. Getting through these will prove valuable in the long run by protecting your organization and its data.

If your organization has not done a CVE overhaul in a long time, it’s likely that this step will take a while to accomplish, so you might need to get scrappy. Get your leadership’s support, find opportunities to collaborate with other teams on this and find resources to help your team patch these vulnerabilities routinely.

No. 3: Build Vulnerability Scanning into Every Step, Every Time

This step is critical: Build automation into your vulnerability management strategy. Automation is the most efficient way to scale and maintain your CVE mitigation strategy. Every team has a number of forces competing for their time. By incorporating automation into your workstream, you’ll be better able to take the guesswork out of which CVEs should be addressed and when, while also being able to focus on your day job (which is likely not just CVE patching).

CVEs are always evolving, so it’s important to constantly monitor for new vulnerabilities, and automation helps immensely with that process.

Ultimately, as tedious and painstaking as it can feel to address CVEs, we’ve seen firsthand how critical it can be to the security of your business. By having a solid understanding of what CVEs you have, building a strategy to mitigate them and folding automation into the mix, you’ll be in a much better position to scale your vulnerability management strategy with ease. Plus, once you no longer have vulnerabilities bogging you down, your cloud migration initiative will likely be much more successful.

To learn more about security best practices, check out a preview of the ebook “Securing Cloud Apps,” come and find us at RSA this week and catch Sumit Dhawan’s keynote about how infrastructure can help establish a new ground truth for today’s cyber professionals.

The post Moving to the Cloud Won’t Solve Your Security Woes appeared first on The New Stack.

How to Find and Eliminate Blind Spots in the Cloud

Guilherme (Gui) Alvarenga — Thu, 20 Apr 2023 18:10:18 +0000

Visibility in the cloud is an important but difficult problem to tackle. Poor visibility can lead to all manner of security risks, including data loss, credential abuse and cloud misconfigurations. It is one of the biggest challenges facing security managers today as they look to adopt cloud technologies.

In an April 2020 survey from Enterprise Strategy Group, researchers revealed that 33% of respondents felt that a lack of visibility into the activity of the infrastructure hosting their cloud native applications was the biggest challenge in securing those apps.

Depth of visibility differs among cloud providers, and each one has its own positive and negative aspects. Let’s look at some of the logging and visibility options that Amazon Web Services (AWS) and Google Cloud Platform (GCP) offer, as well as their blind spots and how to eliminate them.

Know and Understand Your Visibility Options

Cloud providers typically offer some sort of default logging or monitoring at no extra cost, but it is never enough to gain sufficient visibility into what’s going on across your organization. They also provide additional paid services that allow you to gain more visibility into your environments for a variety of use cases. Because each cloud provider does things slightly differently, blind spots and the lack of visibility differ across providers.

The All-Important Control Plane

The “control plane” offered by a cloud provider is essentially what handles “cloud operations,” or API calls. The control plane must be accessible from anywhere on the internet and is what allows users to interact with the cloud and the resources in it.

For example, API calls made through the AWS Command Line Interface (CLI) are routed through the AWS control plane, allowing you to take actions such as launching a new Amazon Elastic Compute Cloud (EC2) instance. In GCP, the control plane handles things such as calls made through the “gcloud” CLI.

Visibility into which API calls are being made in your environment is incredibly important, which is why many of the free, default logging services are provided. These include CloudTrail Event History in AWS (a 90-day history of API calls made in your account) or something like activity logging in GCP, which provides a broad overview of activity in a project. These default offerings have shortcomings, however. Without additional configuration, you will miss critical elements.

To monitor the cloud control plane, you need to use built-in services provided by your cloud provider. But don’t stop there. Those logs should then be exported to an external security information and event management (SIEM) solution, such as Splunk, for further analysis and alerting.

These services include things like configuring AWS CloudTrail for every region and for every event type across your organization, or applying GCP audit logs to all supported services at the organization level. Doing this will help give you the additional insight you need to make more informed security decisions.

But What about Network and Hosts?

When it comes to activity within your cloud network or the hosts within that network, there typically is no free default offering to gain visibility. There may be default logging on your hosts, such as bash history, but there is nothing aggregating those logs and providing you access to all of your hosts through a unified interface.

There are many built-in and third-party offerings available to gain visibility into activity across your network and on your hosts. Some cloud workload protection platforms provide host-based monitoring and visibility that allows you to detect and prevent threats in real time.

Other offerings, such as AWS VPC (Virtual Private Cloud) Traffic Mirroring or GCP Packet Mirroring, can help with full packet capture within your cloud network. VPC flow logs are another useful tool for network visibility and offerings, such as AWS GuardDuty which can monitor those flow logs, as well as DNS logs and CloudTrail logs, to detect threats and unusual activity within your environment.

Pain Points and Blind Spots

No matter which monitoring and visibility options are available, even if you use all of your cloud provider’s built-in services (and even some third-party services), there are likely pieces missing from the puzzle.

The Ins and Outs of Data-Level Events

Data-level events differ from control plane or management events because they are actions performed on specific data rather than on a cloud resource. For example, AWS Simple Storage Service (S3) data events log activity for objects in an S3 bucket, providing insight into who’s interacting with which objects. Without additional configuration (and therefore additional cost), these event types are not logged. Some services offer data-level event logging, such as S3 bucket access logs, but others, such as AWS EBS Direct APIs, do not.

To ensure you are not losing insight into data-level events, you should enable them where possible. Where it is not possible, we recommend that you deny all users access to those data-level APIs.

Undocumented APIs

Every cloud provider publishes documentation for their supported APIs, which allows you to identify what is possible in their specific cloud. The problem is that there are often undocumented APIs supported by each cloud provider that may be exposed when they should not be. Often these undocumented APIs are used to enhance the user experience, such as providing a good UX through the web console the provider supplies.

Because these are undocumented and generally not for direct access, they are not logged where other documented APIs are logged. This lack of logging could allow an attacker to perform certain actions in your environment with no way for you to know they did.

To prevent undocumented APIs from being used maliciously in your environment, it is important to grant permissions on a granular level. That is, don’t grant permissions using wildcards (such as using a “*” in AWS), don’t use “managed” permissions sets as they are often overly permissive, and regularly review the permissions you grant your users to ensure they are properly restricted. By allowing listing permissions on a granular level, you can avoid much of the risk that undocumented APIs expose.

Pre-GA Services

Another potential blind spot in the cloud is pre-GA (general availability) services. This could include alpha, beta, gamma, pre-production or similar services. Such services are often blocked from the public and only permit access to listed users, but many times they are made available for public use while still in pre-GA status.

One example of how to access pre-GA services are the commands available through the “beta” and “alpha” groups of the “gcloud” CLI, which can be accessed with “gcloud beta ” and “gcloud alpha ,” respectively.

Just because they are pre-GA doesn’t mean they are not logged, but this is where you may encounter services that do not yet support logging and have a potential for abuse. You can prevent access to many of these beta and alpha services in GCP by denying users access to enabling new APIs in their projects and only providing them access to a list of trusted services.

The Principle of Least Privilege

Permissions must be set at a granular level so they only grant access to necessary services and APIs. These risks often arise when wildcards or managed permission sets are applied in an environment where you can’t be 100% sure of every service and action you granted to your user. When setting the permissions granularly, you can ensure that nothing unexpected can be accessed by your users.

Without taking advantage of logging and visibility offerings provided by cloud providers, you can quickly lose insight into what’s actually going on in your environment. Visibility gaps can still exist even when the proper services are enabled, as we’ve discussed with data-level events, undocumented APIs and pre-GA services, so it’s necessary to follow the principle of least privilege to avoid granting access to services and APIs that escape your monitoring capabilities.

You Have a Role to Play

It may seem like it’s the cloud provider’s responsibility to not release undocumented APIs or pre-GA services, but until they do, it is your responsibility to delegate permissions and access in a way that these APIs do not expose your environment to undue risk.

The post How to Find and Eliminate Blind Spots in the Cloud appeared first on The New Stack.