Pulumi: New Features for Infrastructure as Code Automation https://thenewstack.io/pulumi-new-features-for-infrastructure-as-code-automation/ (Thu, 15 Jun 2023)

Given the enormous complexity involved, orchestrating cloud infrastructure manually, even with Infrastructure as Code (IaC), is time-consuming and tough. Enterprises often have dozens and sometimes hundreds of public cloud accounts, with new ones popping up all the time.

Without a unified control plane that keeps track of application stacks across clouds and cloud accounts, achieving operational consistency, cost efficiency and resiliency becomes near impossible.

Additionally, enterprises are missing out on the opportunity to learn from what worked and what didn’t work in the past when creating new app stacks, Torsten Volk, an analyst at Enterprise Management Associates, told The New Stack.

He added, “Ideally, developers will be able to define their infrastructure requirements straight from within code functions, without having to specify the exact resources needed, while the IaC platform analyzes the new app, compares it to existing apps that are similar in character, and automatically derives the optimal infrastructure resources.”

Pulumi, an IaC provider, is seeking to simplify and automate IaC for complex cloud environments (Amazon Web Services, for instance, has more than 300 infrastructure resources alone). As part of that mission, it announced new product features during its PulumiUP virtual conference on Thursday.

For organizations that have cloud native ambitions but struggle just to get started, Pulumi’s new AI-enhanced capabilities, its other new features and its existing API are designed for the task.

Other newly introduced features include the ability to convert an entire stack’s infrastructure from an alternative tool such as Terraform using accessible IaC commands.

AI and Insights

When managing thousands of resources across multiple clouds, manual errors can be devastating. A proper IaC platform must prevent those errors and streamline operations, providing a single source of truth, which becomes a necessity at the scale of cloud native environments.

For serverless architectures and Kubernetes applications, for example, managing infrastructure with a programming language of your choice, a capability Pulumi provides, is also critical as IaC becomes the default choice in the cloud native world.
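
To make that concrete, here is a minimal sketch of a Pulumi program written in Python. It assumes the pulumi and pulumi_aws packages are installed and AWS credentials are configured; the resource names are illustrative, not taken from the article.

```python
"""Minimal sketch of a Pulumi program in Python (assumes pulumi and pulumi_aws)."""
import pulumi
import pulumi_aws as aws

# Declare a storage bucket as a resource; Pulumi tracks it in the stack's state.
bucket = aws.s3.Bucket(
    "app-artifacts",  # illustrative resource name
    tags={"team": "platform", "env": pulumi.get_stack()},
)

# Export an output so other stacks or tooling can consume it.
pulumi.export("bucket_name", bucket.id)
```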

“Pulumi is more suitable for this new world, where infrastructure plays a different role,” Aaron Kao, Pulumi’s vice president for marketing, told The New Stack.

Pulumi’s new features are designed to increase developer productivity and operational scalability by leveraging metrics from past projects to automatically compile an optimal application stack for new projects, Volk said.

For example, he said, the analytics engine might find that choosing SQL databases over NoSQL ones leads to fewer weekly deployments, and that those deployments also show higher failure rates and a longer mean time to recovery.

The new features Pulumi announced at its conference include:

An On-Ramp from Terraform

New feature in Pulumi that makes converting Terraform infrastructure as code easier.

Tf2pulumi, which converts Terraform projects to Pulumi programs, is now part of the Pulumi CLI. The new Terraform conversion support includes support for Terraform modules, all core features of Terraform 1.4 and the majority of Terraform built-in functions.

The tf2pulumi feature previously converted snippets of Terraform to Pulumi, and now supports conversion of most complete Terraform projects. It is now integrated with the pulumi convert command in the CLI, which can also be used to convert Pulumi YAML to other Pulumi languages.

A Deeper Dive into Cloud Resources

Pulumi Insights now lets engineers ask questions about cloud resource property values, in addition to resource types, packages, projects and stacks. This property search capability allows teams to perform deeper analysis on their resources.

The Insights feature also now allows search filtering by teams. This allows organizations to analyze resources under management per team and better estimate usage and cost.

Pulumi Insights is where Pulumi’s AI capabilities particularly shine, with a heavy implementation of ChatGPT functionality. Information is retrieved by issuing commands in conversational English, and Pulumi Insights offers actionable analysis and information about how to accomplish infrastructure orchestration-related tasks.

On-Demand Infrastructure Stacks for Testing

Review Stacks, a new feature of Pulumi Deployments, are temporary, on-demand infrastructure environments created for each pull request in a repository. They allow engineers to review and test IaC changes in an isolated setting before merging them into the main branch.

The feature streamlines the development process by maintaining a separation between testing and production environments and catching potential issues before they reach production. With Review Stacks, organizations can enhance resource efficiency by spinning up a test stack only when needed, which is intended to accelerate deployment cadence.
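
The article does not describe how Review Stacks are configured, but the underlying pattern of ephemeral, per-pull-request environments can be sketched with Pulumi’s Automation API. The snippet below is a hypothetical illustration of that pattern, not the managed Review Stacks feature; the project and stack names are invented.

```python
"""Sketch: spin up and tear down an ephemeral per-PR stack with Pulumi's Automation API.
This illustrates the pattern behind review environments; it is not the managed
Review Stacks feature itself, and the project/stack names are invented."""
from pulumi import automation as auto


def pulumi_program():
    # The same resources the main stack would create would be declared here.
    import pulumi
    pulumi.export("greeting", "review environment up")


def deploy_review_stack(pr_number: int) -> None:
    stack = auto.create_or_select_stack(
        stack_name=f"pr-{pr_number}",
        project_name="sample-app",      # hypothetical project name
        program=pulumi_program,
    )
    stack.up(on_output=print)           # provision the temporary environment


def destroy_review_stack(pr_number: int) -> None:
    stack = auto.select_stack(
        stack_name=f"pr-{pr_number}",
        project_name="sample-app",
        program=pulumi_program,
    )
    stack.destroy(on_output=print)      # tear it down once the PR is merged or closed
    stack.workspace.remove_stack(f"pr-{pr_number}")
```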

Managing Kubernetes Complexity in Multicloud Environments https://thenewstack.io/managing-kubernetes-complexity-in-multicloud-environments/ (Thu, 15 Jun 2023)

Kubernetes has become the ubiquitous choice as the container orchestration platform for building and deploying cloud native applications. As enterprises adopt Kubernetes, one of the key decisions they have to make is whether to adopt a multicloud strategy. It’s essential to understand the factors driving the need for a solution across public cloud providers such as Amazon Web Services (AWS), Azure, GCP, Oracle, Alibaba, etc., and to validate whether those factors are relevant now or in the future. Some factors that influence a multicloud strategy are:

  • Specialized cloud technology needs/requirements for particular applications
  • Multiple business units adopting separate clouds
  • GDPR and other locality considerations
  • Disaster recovery
  • Mergers and acquisitions of other businesses that have adopted different clouds
  • Dependency on a cloud-managed service

Specialized Cloud Technology Needs/Requirements for a Particular Application

Some applications require specialized cloud services only available on specific cloud platforms. For example, Google Big Table is a NoSQL database only available on Google Cloud. Similarly, Azure has specialized machine learning and AI services, such as Azure Cognitive Services.

In such scenarios, enterprises need to deploy their applications across multiple clouds to access the specialized services required for their applications. This approach can also help organizations optimize costs by choosing the most cost-effective cloud service for each application.

Multiple Business Units Adopting Separate Clouds

In large organizations, different business units may have unique requirements for their cloud services, leading to the adoption of separate cloud services. For example, one business unit may prefer Google Cloud for its machine learning capabilities, while another may prefer AWS for its breadth of services. As a result, the cloud environment becomes fragmented, and deploying applications across multiple clouds becomes complex.

GDPR and Other Locality Considerations

Regional regulations can also drive the need for a multicloud approach. For example, enterprises may need to store and process data in specific regions to comply with data residency regulations. For instance, Alibaba Cloud is China’s leading cloud provider and the preferred cloud in that region.

Deploying applications across multiple clouds in different regions can help enterprises meet their data residency and compliance requirements.

Disaster Recovery

Implementing disaster recovery in the right manner is essential for enterprises, as downtime can lead to significant revenue loss and reputational damage. A multicloud approach can help enterprises ensure business continuity by deploying applications across multiple clouds. In such scenarios, primary applications can run in one cloud while secondary applications can run in another for disaster recovery.

This approach can also help enterprises optimize their costs by choosing the most cost-effective cloud service for disaster recovery.

Mergers and Acquisitions

When organizations merge, they may have different cloud environments that must be integrated. Similarly, when organizations acquire other companies, they may need to integrate the acquired company’s cloud environment with their existing cloud environment, hence the need for a multicloud approach.

Dependency on a Particular Cloud Service

Enterprises may need to deploy applications in a particular cloud due to a dependency on a service that only one cloud provider offers. For example, an organization may require managed Oracle for its databases or SAP HANA for its ERP systems. In this case, deploying the applications in the same cloud is necessary to be closer to the database. Platform and site reliability engineering (SRE) teams must then acquire the skills to manage Kubernetes infrastructure on a new public cloud. Platform teams must thoroughly understand all their application team requirements to see whether any of their applications fall into this category.

How to Manage Multicloud Kubernetes Operations with a Platform Approach

Enterprises may want to invest in a true Kubernetes operations platform if the multicloud deployment is a critical requirement now or in the future. A true Kubernetes operations platform helps enterprises develop standardized automation across clouds while leveraging public cloud Kubernetes distributions such as AWS EKS, Azure AKS, Google GKE, etc. On the other hand, deploying and managing Kubernetes infrastructure on multiple clouds without a Kubernetes operations platform requires a lot of manual effort and can lead to substantial operational costs, operational inconsistencies, project delays, etc.

  • A Kubernetes operations platform can standardize the process for deploying and managing Kubernetes clusters across multiple clouds. Enterprises can use a unified interface to automate the deployment and management of Kubernetes clusters across multiple clouds. This automation helps improve consistency and reduce the risk of human error. It also reduces the need for specialized skills.
  • Enterprises also need to maintain a unified security posture across clouds. In a multicloud environment, each cloud provider has its own security policies, which makes it hard for enterprises to implement standard security policies across the clouds. A Kubernetes operations platform can provide consistent security policies across clouds, enforcing governance and compliance uniformly.
  • Consistent policy management and network security policies across clouds are critical for adopting multicloud Kubernetes deployments. A Kubernetes operations platform should provide standardized workflows for applying network security and Open Policy Agent (OPA) policies for Kubernetes clusters spanning clouds. Policies, including network policies, ingress and egress rules, can be defined in a centralized location and deployed to all Kubernetes clusters, ensuring consistency and reducing operational complexity (see the sketch after this list).
  • A true Kubernetes operations platform should provide a unified bimodal multitenancy (cluster and namespace) across clouds. This platform should allow multiple teams and applications to share the same Kubernetes clusters without affecting each other, providing better resource utilization and cost efficiency. Similarly, for teams, applications or environments that require dedicated clusters, the Kubernetes platform should offer cluster-as-a-service where the individual teams can create their clusters in a self-serve manner adhering to the security, governance and compliance set by the platform and SRE teams.
  • Kubernetes access control, role-based access control (RBAC) and single sign-on (SSO) across all clouds are essential for a Kubernetes operations platform. However, access management becomes increasingly complex when deploying Kubernetes across multiple clouds. A unified access management solution can simplify the process and reduce the security risk.
  • Finally, a single pane of administration offering visibility for the entire infrastructure spanning multiple clouds is essential for a Kubernetes operations platform. A single management plane can provide centralized visibility into Kubernetes clusters across multiple clouds, allowing enterprises to monitor, manage and troubleshoot their Kubernetes clusters more efficiently.
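
To illustrate the centralized policy idea from the list above, here is a minimal sketch that uses the official Kubernetes Python client to push the same default-deny ingress NetworkPolicy to clusters in several clouds. The kubeconfig context names and namespace are hypothetical, and a real operations platform would wrap this in its own workflow rather than a simple loop.

```python
"""Sketch: apply one NetworkPolicy definition to clusters in several clouds.
Assumes the `kubernetes` Python client and the kubeconfig contexts named below exist."""
from kubernetes import client, config

CONTEXTS = ["eks-prod", "aks-prod", "gke-prod"]   # hypothetical kubeconfig contexts


def default_deny_ingress(namespace: str) -> client.V1NetworkPolicy:
    # An empty pod selector matches every pod in the namespace;
    # listing "Ingress" with no rules denies all inbound traffic by default.
    return client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="default-deny-ingress", namespace=namespace),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(),
            policy_types=["Ingress"],
        ),
    )


for context in CONTEXTS:
    config.load_kube_config(context=context)
    api = client.NetworkingV1Api()
    api.create_namespaced_network_policy(
        namespace="payments",                      # hypothetical namespace
        body=default_deny_ingress("payments"),
    )
    print(f"Applied default-deny ingress policy to {context}")
```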

Conclusion

A multicloud strategy may be an important consideration for enterprises that are adopting a Kubernetes operations platform for managing their Kubernetes infrastructure. Enterprises should carefully look at all factors that influence a multicloud deployment and decide whether multicloud is required for their organization. A true multicloud Kubernetes operations platform should provide standardized automation, consistent security policies, unified Kubernetes bimodal multitenancy, access management and a single administration pane, offering visibility for the entire infrastructure spanning multiple clouds.

The Risks of Decomposing Software Components https://thenewstack.io/the-risks-of-decomposing-software-components/ (Wed, 14 Jun 2023)

Software components decompose. They are not like gold bricks in a safe. They are more like lettuce in a warehouse. They have to be used and replaced. It’s not apples to apples, per se, but the point is that components require updating.

If the software is not updated, problems arise, and we face security vulnerabilities like Log4J.

But getting components updated in a timely way is a universal problem, and a challenge being tackled by the Linux Foundation’s Open Source Security Foundation (OpenSSF), said Omkhar Arasaratnam, general manager of the OpenSSF, and Brian Behlendorf, CTO of the OpenSSF, in an interview at the Open Source Summit North America in Vancouver, BC.

With a component fix, how do you get from upstream to downstream as quickly and efficiently as possible? As Log4j illustrated, many people are still relying on outdated and vulnerable versions of software components.

“It’s a very classical case of a security issue, it’s not something novel,” Arasaratnam said. “I’d like to ensure that we start by making our software secure by construction so the issues like that don’t exist at all: through education, through using different techniques, hardened libraries, well-vetted patterns for addressing those kinds of issues. Now, when issues like that do occur, then you’re right, we do have to jump into rapid response mode. We have to have not only, as you pointed out, a good mechanism of traversing stuff from upstream all the way back down to what’s running in prod. But that’s where artifacts like SBOMs come in.”

An SBOM is a software bill of materials. The SBOM tells you the software components and, hopefully, even more.

According to the Linux Foundation, an SBOM is “a complete, formally structured list of components, libraries, and modules required to build (i.e., compile and link) a given piece of software and the supply chain relationships between them. These components can be open source or proprietary, free or paid, and widely available or restricted access.”
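
As an illustration of how an SBOM can be put to work, the sketch below reads a CycloneDX-style JSON SBOM and lists each component and version, flagging an outdated log4j-core as an example. The file name and the version check are invented for illustration, and the comparison is deliberately naive.

```python
"""Sketch: list components from a CycloneDX-style SBOM (JSON) and flag old log4j versions.
The file name and version threshold are illustrative, not from the article."""
import json

with open("sbom.json") as f:           # hypothetical SBOM file
    sbom = json.load(f)

for component in sbom.get("components", []):
    name = component.get("name", "unknown")
    version = component.get("version", "unknown")
    print(f"{name} == {version}")
    # Naive example check: flag log4j-core versions before 2.17.1 (Log4Shell-era fixes).
    if name == "log4j-core" and version < "2.17.1":
        print(f"  WARNING: {name} {version} may be vulnerable; upgrade recommended")
```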

Arasaratnam, who recently joined OpenSSF as general manager, said SBOMs provide telemetry, a point Behlendorf also noted. They provide data that can be reasoned over when making some of these decisions.

“Wouldn’t it be wonderful if we could also provide reputation data on a particular repo you’ve decided to link against? Wouldn’t it be great if you had that full inventory of the time that you use that GCC compiler flag that could have caused some kind of regression? All of this data is extremely valuable. And I think for a long time, we, in enterprise in general and production environments, have been fumbling around with imprecise data, and have been unable to really leverage all the telemetry we could be using.”

The discussion also covers the issues with package managers and how we may quantify the risks of software vulnerabilities.

No More FOMO: Efficiency in SLO-Driven Monitoring https://thenewstack.io/no-more-fomo-efficiency-in-slo-driven-monitoring/ (Wed, 14 Jun 2023)

Observability is a concept that has been defined in various ways by different experts and practitioners. However, the core idea that underlies all these definitions is efficiency.

Efficiency means using the available resources in the best possible way to achieve the desired outcomes. In the current scenario, where every business is facing fierce competition and changing customer demands, efficiency is crucial for survival and growth. Resources include not only money, but also time, productivity, quality and strategy.

IT spending is often a reflection of the market conditions. When the market is booming, companies tend to spend more on IT projects and tools, without being too concerned about the value they are getting from them. This can create some problems, such as having too many tools that are not integrated or aligned with the business goals, wasting resources on unnecessary or redundant tasks, and losing visibility and control over the IT environment.

IT spend always correlates to market temperature.

Even the companies that spend heavily on cloud services are reconsidering their big decisions that involve significant, long-term investments. Companies are reassessing their existing substantial spend to ensure their investments can be aligned with revenues or future revenue potential.

Observability tools are also subject to the same review. It is essential that the total operating cost of observability tools can also be directly linked to revenue, customer satisfaction, growth in business innovation and operational efficiency.

Why Do We Need Monitoring?

  • If we had a system that would absolutely never fail, we wouldn’t need to monitor that system.
  • If we had a system for which we never have to worry about being performant, reliable or functional, we wouldn’t need to monitor that system.
  • If we had a system that self-corrects itself and auto-recovers from failures, we wouldn’t need to monitor that system.

None of the aforementioned points are true today, so it is obvious that we need to set up monitoring for our infrastructure and applications no matter what scale we operate at.

What Is FOMO-Driven Monitoring?

When you are responsible for operating a critical production system, it is natural to want to collect as much monitoring data as possible. After all, the more data you have, the better equipped you will be to identify and troubleshoot problems. However, there are a number of challenges associated with collecting too much monitoring data.

Data Overload

One of the biggest challenges of collecting too much monitoring data is data overload. When you have too much data, it can be difficult to know what to look at and how to prioritize your time. This can lead to missed problems and delayed troubleshooting.

Storage Costs

Another challenge of collecting too much monitoring data is storage costs. Monitoring data can be very large, and storing it can be expensive. If you are not careful, you can quickly rack up a large bill for storage.

Reduced Visibility

When there is too much data, it can be difficult to see the big picture. This can make it difficult to identify trends and patterns that could indicate potential problems.

Increased Noise

More data also means more noise. This can make it difficult to identify important events and trends.

Security Concerns

Collecting too much monitoring data can also raise security concerns. If your monitoring data is not properly secured, it could be vulnerable to attack. This could lead to theft of sensitive data or disruption of your production systems.

FOMO-driven monitoring

Ultimately, an approach driven by the fear of missing out does not result in an optimal observability setup and, in fact, can contribute to plenty of chaos, increased expenses, ambiguity between teams and an overall decrease in efficiency.

You can address this situation by being intentional in making decisions on all aspects of the observability pipeline, including signal collection, dashboarding and alerting. Using service-level objectives (SLOs) is one strategy that offers plenty of benefits.

What Are SLOs?

An SLO is a target or goal for a specific service or system. A good SLO will define the level of performance your application needs, but not any higher than necessary.

SLOs help us set a target performance level for a system and measure the performance over a period of time.

Example SLO: An API’s p95 will not exceed 300ms response time
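
A minimal sketch of how such an SLO could be checked against measured latencies, using only Python’s standard library (the sample numbers are made up):

```python
"""Sketch: check an SLO of "p95 latency must not exceed 300 ms" against sample data.
The latencies below are made-up illustrative values."""
from statistics import quantiles

latency_ms = [120, 180, 95, 240, 310, 150, 280, 200, 170, 260,
              190, 130, 220, 305, 160, 175, 140, 250, 210, 185]

# quantiles(..., n=100) returns the 1st..99th percentiles; index 94 is the 95th.
p95 = quantiles(latency_ms, n=100)[94]

SLO_P95_MS = 300
print(f"p95 = {p95:.1f} ms -> {'SLO met' if p95 <= SLO_P95_MS else 'SLO violated'}")
```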

How Do You Set SLOs?

SLOs are ultimately driven by customers; they are the final authority on acceptable performance. However, customers do not literally set SLOs themselves, as you can imagine. It is up to the business teams to tell the IT operations and development teams the expected performance and availability of a system.

For example, the business teams operating a marketing lead sign-up page can tell the IT teams that they want the page to load within 200ms at least 90% of the time. They would derive this conclusion by looking at the customer behavior already captured.

Now the IT teams can set the SLO for tracking by identifying SLIs (service-level indicators) in order to measure the SLOs over a period of time. SLIs are the specific metrics, and the query details of those metrics, used to keep track of the SLO’s progression.
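
Continuing the sign-up page example, here is a small sketch of the SLI computation, the share of page loads under 200 ms, and the error budget it implies. The sample data is invented.

```python
"""Sketch: compute an SLI and error budget for the sign-up page example.
SLO: 90% of page loads complete within 200 ms. The sample data is invented."""
page_load_ms = [150, 180, 210, 170, 190, 160, 220, 140, 175, 185,
                195, 205, 165, 155, 188, 230, 172, 168, 182, 158]

SLO_TARGET = 0.90          # 90% of requests must be under the threshold
THRESHOLD_MS = 200

good = sum(1 for ms in page_load_ms if ms <= THRESHOLD_MS)
sli = good / len(page_load_ms)                      # the measured indicator
error_budget = 1.0 - SLO_TARGET                     # 10% of requests may be slow
budget_consumed = (1.0 - sli) / error_budget        # fraction of the budget spent

print(f"SLI: {sli:.1%} of requests under {THRESHOLD_MS} ms (target {SLO_TARGET:.0%})")
print(f"Error budget consumed so far: {budget_consumed:.0%}")
```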

Here is what your observability life cycle looks like implementing an SLO-driven strategy.

SLO-driven strategy

There is an intentional loopback mechanism that is set in taking the SLO-driven strategy. Observability is never a settled problem. Organizations that do not continue reinventing their observability strategy fall behind very quickly, resulting in ambiguous tools, outdated processes and practices, which in the end increases overall operational cost while decreasing efficiency.

With this approach, you get the ability to scientifically measure your infrastructure and application performance over a period of time. The data collected as a result can be used to influence important decisions on infrastructure spend, which in turn helps further improve efficiency.

What Does This Tell Us?

Taking an SLO-first approach allows us to be intentional about the metrics we collect to meet our commitments to the business.

These are some of the benefits that organizations can achieve by following an SLO-based observability strategy:

  • Improved signal-to-noise ratio
  • Reduced tool proliferation
  • Enriched monitoring data, resulting in reduced MTTR/MTTI
  • A feedback loop that provides continuous improvement opportunities
  • Monitoring costs connected to business outcomes, making it easier to justify spend to management

Use SLOs to drive your monitoring decisions:

  • Measure, revisit and review SLOs periodically based on outcomes
  • Improve observability posture through
    • Lower cost
    • Reduced issue resolution time
    • Increased team efficiency and innovation

Conclusion

We live in an era where efficiency is critical for organizational success. Observability costs can become uncontrollable if you do not have a proper strategy in place. An SLO-driven observability strategy can help you set guardrails, track performance goals and business metrics, and measure impact in a consistent manner while increasing operational efficiency and innovation.

Can Companies Really Self-Host at Scale? https://thenewstack.io/can-companies-really-self-host-at-scale/ (Wed, 14 Jun 2023)

There’s no such thing as a free lunch, or in this case, free software. It’s a myth. Paul Vixie, vice president of security at Amazon Web Services and a longtime steward of the Domain Name System (DNS) who authored much of the BIND software, gave a compelling presentation at Open Source Summit Europe 2022 on this topic. His presentation included a comprehensive list of “dos and don’ts” for consumers of free software. Vixie’s docket included labor-intensive, often expensive engineering work that ran the gamut from small routine upgrades to locally maintaining orphaned dependencies.

To sum the “dos and don’ts” up in one sentence: engineers must always be working, monitoring, watching and ready for action. This “ready for action” engineer must have high-level expertise so that they can handle anything that comes their way. Free software isn’t inherently bad, and it definitely works. Identifying the hidden costs of selecting software also applies to the decision to self-host a database. Self-hosting is effective for many companies. But when is it time to let go and try the easier way?

What Is a Self-Hosted Database?

Self-hosted databases come in many forms. Locally hosted open source databases are the most obvious example. However, many commercial database products have tiered packages that include self-managed options. On-premises hosting comes with pros and cons: low security risk, the ability to work directly beside the data and complete control over the database are a few advantages. There is, of course, the problem with scaling. Self-hosting creates challenges for any business or developer team with spiky or unreliable traffic because on-demand scaling is impossible. Database engineers must always account for the highest amount of traffic with on-premises servers or otherwise risk an outage in the event of a traffic spike.

For businesses that want to self-host and scale on demand, self-hosting in the cloud is another option. This option allows businesses with spiky or less predictable traffic to scale alongside their needs. When self-hosting in the cloud, the organization installs and hosts its database on a virtual machine from a cloud provider in a traditional deployment model. When you host a commercial database in the cloud this way, support for both the cloud and the database is minimal, because self-hosted always means your engineering resources helm the project. This extends to emergencies like outages and even security breaches.

The Skills Gap

There are many skilled professionals with experience managing databases at scale on-premises and in the cloud. SQL databases were the de facto database for decades. Now, with the rise of more purpose-built databases geared toward deriving maximum value from the data points they’re storing, the marketplace is shifting. Newer database types that are gaining a foothold within the community are columnar databases, search engine databases, graph databases and time series databases. Now developers familiar with these technologies can choose what they want to do with their expertise.

Time Series Data

Gradient Flow expects the global market for time series analysis software will grow at a compound annual rate of 11.5% from 2020 to 2027. Time series data is a vast category and includes any data with a timestamp. Businesses collect time series data from the physical world through items like consumer Internet of Things (IoT), industrial IoT and factory equipment. Time series data originating from online sources include observability metrics, logs, traces, security monitoring and DevOps performance monitoring. Time series data powers real-time dashboards, decision-making and statistical and machine learning models that heavily influence many artificial intelligence applications.

Bridging the Skills Gap

InfluxDB 3.0 is a purpose-built time series database that ingests, stores and analyzes all types of time series data in a single datastore, including metrics, events and traces. It’s built on top of Apache Arrow and optimized for scale and performance, which allows for real-time query responses. InfluxDB has native SQL support and open source extensibility and interoperability with data science tools.
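
As a rough sketch of what that native SQL support can look like from Python, assuming the influxdb3-python client package; the host, token, database and table names below are placeholders rather than official InfluxData sample code.

```python
"""Sketch: query InfluxDB 3.0 with SQL from Python.
Assumes the influxdb3-python package; host, token, database and table are placeholders."""
from influxdb_client_3 import InfluxDBClient3

client = InfluxDBClient3(
    host="https://example-cluster.influxdb.example.com",  # placeholder host
    token="MY_TOKEN",                                      # placeholder token
    database="sensor_data",                                # placeholder database
)

# Native SQL over time series data; results come back as an Apache Arrow table.
table = client.query(
    "SELECT time, sensor_id, temperature "
    "FROM readings WHERE time >= now() - INTERVAL '1 hour'"
)
print(table.to_pandas().head())
```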

InfluxDB Cloud Dedicated is a fully managed, single-tenant instance of InfluxDB created for customers who require privacy and customization without the challenges of self-hosting. The dedicated infrastructure is resilient and scalable with built-in, multi-tier data durability with 2x data replication. Managed services mean around-the-clock support, automated patches and version updates. A higher level of customization is also a characteristic of InfluxDB Cloud Dedicated. Customers choose the cluster tier that best matches their data and workloads for their dedicated private cloud resources. Increased query timeouts and in-memory caching are two of the many customizable characteristics.

Conclusion

It’s up to every organization to decide whether to self-manage or choose a managed database. Decision-makers and engineers must have a deep understanding of the organization’s needs, traffic flow patterns, engineering skills and resources and characteristics of the data before reaching the best decision.

To get started, check out this demo of InfluxDB Cloud Dedicated, contact our sales team or sign up for your free cloud account today.

What’s Up with OpenStack in 2023 https://thenewstack.io/whats-up-with-openstack-in-2023/ (Wed, 14 Jun 2023)

The OpenStack community has released its 27th version of the software, circling all the way back to the beginning of the alphabet. Due to its passionate and active contributor base, OpenStack continues to be one of the top five most active open source projects. Organizations around the globe, spanning almost every industry, have embraced OpenStack, reaching 40 million cores of compute in production. Within this footprint, adoption specifically among OpenStack-powered public clouds now spans over 300 data centers worldwide.

In addition to OpenStack, the OpenInfra Foundation has replicated its model for hosting open source projects including Kata Containers, StarlingX and Zuul. This model is now readily available for any organization that wants to leverage the Four Opens and three forces to build a sustainable open source project within the infrastructure layer.

The OpenInfra Summit Vancouver, on June 13-15, is a great opportunity to get involved in the OpenStack community while collaborating more closely with other OpenInfra projects and learning from the world’s biggest users.

OpenStack Is More Reliable and Stable Than Ever

As the OpenStack software platform has matured, there has been a notable emphasis on reliability and stability. Many features and enhancements have been introduced to ensure a smoother and more robust experience. These improvements include the implementation of a new “skip level upgrade release process” cadence, which began with the Antelope release in March 2023.

One significant aspect of OpenStack’s evolution is the increased emphasis on thorough testing. More extensive testing procedures are now in place, ensuring that the platform is attentively examined for potential issues and vulnerabilities.

Another recent focus for the upstream community has been removing under-maintained services and features to allow for a more focused and efficient system, eliminating unnecessary components that may hinder the reliability and stability of OpenStack.

OpenStack also places a strong emphasis on interoperability. Integration and collaboration efforts with other popular open source components such as GNU/Linux, Kubernetes, Open vSwitch, Ceph and Ansible have been prioritized. These initiatives promote compatibility and interaction between different software systems, which has enhanced overall reliability.

In contrast to the past focus on vendor-specific drivers and niche features, OpenStack now prioritizes contributions to general functionality. For example, developing a unified client/SDK offers a standardized and consistent experience across the platform. This shift promotes stability and reliability by focusing on core functionalities that benefit all users.

As OpenStack continues to mature, these various measures and initiatives demonstrate a strong commitment to reliability, stability and long-term success.

Flexible Support for a Variety of Workload Models

OpenStack is a powerful and versatile cloud computing platform that offers flexible support for various workload models. One of its standout features is the ability to work closely with hardware, enabling users to harness the full potential of their systems. For instance, the Ironic bare metal systems deployment and lifecycle management tool and the Nova bare metal driver provide seamless integration of on-demand physical server access into a full-featured OpenStack deployment.
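
As a hedged sketch of driving that kind of on-demand access programmatically, the snippet below uses the openstacksdk Python library against a cloud defined in clouds.yaml; the cloud, image, flavor and network names are placeholders and would differ in any real deployment.

```python
"""Sketch: list bare metal nodes and boot a server with openstacksdk.
Assumes a cloud named "mycloud" in clouds.yaml; all names below are placeholders."""
import openstack

conn = openstack.connect(cloud="mycloud")   # placeholder cloud name from clouds.yaml

# Inventory the Ironic-managed bare metal nodes.
for node in conn.baremetal.nodes():
    print(node.name, node.provision_state)

# Boot a server through Nova using placeholder image, flavor and network names.
server = conn.compute.create_server(
    name="demo-server",
    image_id=conn.compute.find_image("ubuntu-22.04").id,
    flavor_id=conn.compute.find_flavor("m1.small").id,
    networks=[{"uuid": conn.network.find_network("private").id}],
)
conn.compute.wait_for_server(server)
print("Server is up:", server.name)
```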

To ensure long-term sustainability, OpenStack’s capabilities are continuously tested on architectures like ARM/AArch64 and have included support for other unconventional processor architectures like PowerPC. It also offers advanced scheduling capabilities like PCI passthrough, CPU pinning, and coordination and life cycle management of peripherals like accelerators, graphics processing units (GPUs), data processing units (DPUs) and field-programmable gate arrays (FPGAs). Moreover, OpenStack has tighter integration with the container ecosystem, with Magnum as a Kubernetes-certified distribution, Zun enabling individual application containers to be provisioned and manipulated as first-class server objects and Kuryr delivering advanced Neutron network features directly to container processes.

OpenStack also offers solutions for running its services as container workloads, with Kolla and OpenStack-Helm. It has fostered close collaboration with the Kubernetes community, with current and former leadership cross-over between the two projects. OpenStack provides services to facilitate long-lived processes and precious data, such as scheduling policies, data retention, backups and high availability/disaster recovery. Its services facilitate ephemeral and distributed applications with load distribution and multi and hybrid cloud, along with cloud-bursting features. Overall, this is an ideal platform for organizations looking to achieve maximum flexibility and efficiency in their cloud computing environments, with a broad range of tools and features that can support a wide variety of workloads and use cases.

What’s the Story with Security?

Security is a major concern in any computing platform, and the OpenStack community takes this issue very seriously. Over time, the OpenStack contributors have made significant strides in enhancing security through long-term improvement initiatives. Community goals have been set to tackle critical security aspects, such as role-based access control, privilege separation for services, image encryption and Federal Information Processing Standards (FIPS) compliance testing. These efforts demonstrate the community’s commitment to continuously enhancing security features and mitigating potential risks.

One notable achievement is the steady reduction in the volume of reported security vulnerabilities. By actively identifying and addressing security concerns, the community has created a safer environment for cloud deployments.

Additionally, OpenStack has implemented new vulnerability coordination policies that promote transparency and collaboration. These policies not only provide open access to more projects but also mandate clearer publication timelines. By ensuring that vulnerabilities are promptly disclosed and addressed, OpenStack enables users to stay informed and take appropriate actions to protect their systems.

OpenStack’s commitment to security extends beyond its own ecosystem. The platform has been a pioneer in establishing a sustainable vulnerability management process, which has served as a model for many other open source communities. This recognition highlights the effectiveness of OpenStack’s security practices and reinforces its position as a leader in the open source world.

How Can People Contribute and Add to This Project?

The OpenStack community welcomes all individuals and organizations to actively participate and enhance the community by adhering to OpenInfra’s “Four Opens” principles. If you’re interested in joining this collaborative effort, various avenues are available to guide you through the process.

To begin contributing, the project offers Contributor Guides that serve as valuable resources for both individuals and organizations. These guides not only assist with upstream code contributions but also provide insights into non-code contributions. Additionally, they outline opportunities for users and operators to contribute their expertise and insights to the project’s growth.

One way to make a meaningful impact is by volunteering as a mentor for university interns. Sharing your knowledge and experience can help shape the next generation of contributors. Moreover, you can propose efforts for sponsorship through programs like Outreachy, which provides opportunities for individuals from underrepresented backgrounds to contribute to open source projects. Additionally, you can support events such as Open Source Day at the Grace Hopper conference.

For someone who is new and seeking information and advice, the First Contact SIG (Special Interest Group) within the OpenStack community is an excellent starting point. This group’s mission is to provide a place for new contributors, making it a welcoming and inclusive space for those who are just beginning their journey in the project.

If you’re looking to make a more significant impact, consider exploring Upstream Investment Opportunities. These opportunities offer a curated set of suggested investment areas based on the current needs in the OpenStack community along with contact points for who can help you get started.

Overall, the OpenStack project offers a range of avenues for individuals and organizations to contribute and add value to the community. Whether it’s through code or non-code contributions, mentoring, sponsorships or investment opportunities, there are numerous ways to engage and actively participate in the growth and success of the project.

Google’s DeepMind Extends AI with Faster Sort Algorithms https://thenewstack.io/googles-deepmind-extends-ai-with-faster-sort-algorithms/ (Tue, 13 Jun 2023)

Computing pioneer Grace Hopper once quipped that the most dangerous phrase in data processing is “We’ve always done it this way.” In that spirit, Google’s DeepMind searched for a faster sorting algorithm using an AI system — and the company’s researchers are now claiming the new algorithms they’ve found “will transform the foundations of computing.”

Google emphasized that sorting algorithms affect billions of people every day — from how online search results get ranked to how data gets processed. But “Making further improvements on the efficiency of these routines has proved challenging,” notes a recent paper from DeepMind, “for both human scientists and computational approaches.” DeepMind focused on the algorithms for sorting short sequences — with between three and five elements — because they’re the most commonly used (often called when sorting even larger sequences).

And for short sequences of numbers, their results were up to 70% faster.

But even for longer sequences with over 250,000 elements, the results were still 1.7% faster. And this isn’t just an abstract exercise. Google has already made the code open source, uploading it into LLVM’s main library for standard C++ functions — the first change to its sorting algorithm in over a decade. Google proudly points out that “millions of developers and companies around the world now use it on AI applications across industries from cloud computing and online shopping to supply chain management.”

In announcing their results, DeepMind offered more examples where they’d applied AI to real-world problems, trying to demonstrate that beyond all the hype, some truly impactful improvements are waiting to be discovered. It’s interesting to see how they approached the problem, but the exercise also raises the possibility that some long-hidden secrets may finally be unlocked with our new and powerful AI systems.

How They Did It

To hunt for improvements, DeepMind drilled down to one of the lowest levels of programming: assembly language (a human-readable representation of machine code).

Their blog post calls this “looking where most humans don’t” (or “starting from scratch”). “We believe many improvements exist at this lower level that may be difficult to discover in a higher-level coding language,” argues DeepMind’s blog. “Computer storage and operations are more flexible at this level, which means there are significantly more potential improvements that could have a larger impact on speed and energy usage.”

For their search, the researchers created a program based on DeepMind’s AlphaZero program, which beat the world’s best players in chess and Go. That program trained solely by playing games against itself, getting better and better using a kind of massively automated trial and error that eventually determines the most optimal approach. DeepMind’s researchers modified it into a new coding-oriented program called AlphaDev, calling this an important next step. “With AlphaDev, we show how this model can transfer from games to scientific challenges, and from simulations to real-world applications,” they write on the DeepMind blog.

The breakthrough came when AlphaDev transformed coding into a new kind of game, in which it continually adds single instructions to its algorithm and assesses the results. (“Winning a game” is replaced here by rewards for correct and speedy results.) The researchers called it “AssemblyGame,” and the blog points out that the number of possible combinations of instructions “is similar to the number of particles in the universe.” But the paper also clearly quantifies the game’s stakes.

“Winning the game corresponds to generating a correct, low-latency algorithm using assembly instructions.”

DeepMind’s blog post reports the newly-discovered sorting algorithms “contain new sequences of instructions that save a single instruction each time they’re applied.” (It then envisions this performance savings multiplied by the trillions of times a day that this code is run.) “AlphaDev skips over a step to connect items in a way that looks like a mistake but is actually a shortcut.” (DeepMind’s blog argues this is similar to an AlphaZero’s Go move which looked like a mistake, but ultimately led it to victory — and believes the discovery “shows AlphaDev’s ability to uncover original solutions and challenges the way we think about how to improve computer science algorithms.”)

Their paper says it shows “how artificial intelligence can go beyond the current state of the art,” because ultimately AlphaDev’s sorts use fewer lines of code for sorting sequences with between three elements and eight elements — for every number of elements except four. And these shorter algorithms “do indeed lead to lower latency,” the paper points out, “as the algorithm length and latency are correlated.”

The current (human-generated) sorting for up to four numbers first checks the length of the sequence, then calls an algorithm optimized for that length. (Unless the length is one, meaning no sorting is required.) But AlphaDev realized that with four-element sequences, it’s faster to just sort the first three elements and then use a simpler algorithm to find the fourth element’s position among the three already sorted. This approach eliminates much of the overhead of “branching” into an entirely different set of code for every other possible sequence length. Instead, AlphaDev can handle most sequence lengths as part of its first check (for how the length relates to the number two), as outlined below and sketched in code after the list.

  • Is length < 2 (If there’s one element, just return its value)
  • Is length = 2 (If there’s two elements, sort them and return them.)
  • Is length > 2 (Sort the first three elements. If there were only three elements, return them.)
  • If there are four elements, find the position of the fourth element among the already-sorted three.
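
Translated out of assembly and into plain Python for readability, the idea looks roughly like this (a sketch of the approach described above, not AlphaDev’s generated code):

```python
"""Sketch in Python of the "sort three, then insert the fourth" idea described above.
AlphaDev's actual output is branchless assembly; this is only a readable illustration."""
def sort3(a, b, c):
    # Sort three values with simple compare-and-swap steps.
    if a > b:
        a, b = b, a
    if b > c:
        b, c = c, b
    if a > b:
        a, b = b, a
    return [a, b, c]


def sort_up_to_4(seq):
    n = len(seq)
    if n < 2:                      # one element: nothing to do
        return list(seq)
    if n == 2:                     # two elements: one compare-and-swap
        a, b = seq
        return [a, b] if a <= b else [b, a]
    first_three = sort3(seq[0], seq[1], seq[2])
    if n == 3:                     # exactly three: we're done
        return first_three
    # Four elements: insert the fourth into the already-sorted three.
    x = seq[3]
    for i, v in enumerate(first_three):
        if x <= v:
            return first_three[:i] + [x] + first_three[i:]
    return first_three + [x]


print(sort_up_to_4([3, 1, 2]))      # [1, 2, 3]
print(sort_up_to_4([4, 1, 3, 2]))   # [1, 2, 3, 4]
```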

Beyond Coding

Their paper applauds the results as “both new and more efficient than the state-of-the-art human benchmarks.” But that was just the beginning. DeepMind moved on, discovering a new hashing algorithm that was 30% faster in the 9-16 bytes range (adding it to Google’s Abseil library of C++ functions in January).

Google also sicced AlphaZero on its data center to optimize workload distribution, according to another post, ultimately resulting in a 19% drop in underused hardware. It also improved the compression of videos on YouTube, reducing the bitrate by 4%.

DeepMind now argues that AlphaDev’s success at coding represents a step toward general-purpose AI tools that solve problems to the benefit of society — including helping to optimize more of our code. And while better hardware has “kept pace” for the last half century, “as microchips approach their physical limits, it’s critical to improve the code that runs on them to make computing more powerful and sustainable.”

The paper points out this isn’t the first use of reinforcement learning for optimizing code — and even some that tried to optimize sorting algorithms.

So maybe the ultimate affirming message is its reminder that no single corporation is driving the progress. Instead, the results announced this month are just part of a larger, broad-based human effort to deliver real and tangible benefits using our newest tools.

And as society acknowledges potential dystopian futures and the possible dangers of AI systems, maybe that is balanced by the prospect that AI systems could also deliver a very different, far more beneficial outcome.

3 Ways to Drive Open Source Software Maturity https://thenewstack.io/3-ways-to-drive-open-source-software-maturity/ (Tue, 13 Jun 2023)

Open source software (OSS) is taking over the world. It’s a faster, more collaborative and flexible way of driving software innovation than proprietary code. This flexibility appeals to developers and can help organizational leadership drive down costs while supporting digital transformation goals. The figures speak for themselves: 80% of organizations increased their OSS use in 2022, especially those operating in critical infrastructure sectors such as oil and gas, telecommunications and energy.

However, open source is not a panacea. There can be challenges around governance, security and the balance between contributing to OSS development and preserving a commercial advantage. These each need careful consideration if developers want to maximize the impact of their work on open source projects.

Open Source Software Saves Time and Drives Innovation

There’s no one-size-fits-all approach with OSS. Projects could range from relatively small software components, such as general-purpose Java class libraries, to major systems, such as Kubernetes for container management or Apache’s HTTP server for modern operating systems. Those projects receiving regular contributions from reputable sources are likely to be most widely adopted and frequently updated. But there is already a range of proven benefits across them all.

Open source can save time and resources, as developers don’t have to expend their own energies to produce code. The top four OSS ecosystems are estimated to have recorded over 3 trillion requests for components last year. That’s a great deal of effort potentially saved. It also means those same developer teams can focus more fully on proprietary functionality that advances the baseline functionality available through OSS to boost revenue streams. It’s estimated just $1.1 billion invested in OSS in the EU back in 2018 generated $71 billion to $104 billion for the regional economy.

OSS also encourages experts from across the globe — whether individual hobbyists or DevOps teams from multinational companies — to contribute their coding skills and industry knowledge. The idea is projects will benefit from a large and diverse pool of developers, driving up the quality of the final product. In contributing to these projects, businesses and individuals can stake a claim to the future direction of a particular product or field of technology, helping to shape it in a way that advances their own solutions. Companies also benefit from being at the leading edge of any new discoveries and leaps in innovation as they emerge, so they can steal a march on the competition by being first to market.

This, in turn, can help to drive a culture of innovation at organizations that contribute regularly to OSS. Alongside a company’s track record on patents, their commitment to OSS projects can be a useful indicator to prospective new hires of their level of ambition, helping attract the brightest and best talent going forward.

Three Ways to Drive OSS Maturity

To maximize the benefit of their contributions to the OSS community, DevOps leaders should ensure their organization has a clear, mature approach. There are three key points to consider in these efforts:

1. Define the Scope of the Organization’s Contribution

OSS is built on the expertise of a potentially wide range of individuals and organizations, many of whom are otherwise competitors. This “wisdom of the crowd” can ultimately help to create better-quality products more quickly. However, it can also raise difficult questions about how to keep proprietary secrets under wraps when there is pressure from the community to share certain code bases or functionality that could benefit others. By defining at the outset what they want to keep private, contributors can draw a clear line between commercial advantage and community spirit to avoid such headaches later.

2. Contribute to Open Standards

Open standards are the foundation on which OSS contributors can collaborate. By getting involved in these initiatives, organizations have a fantastic opportunity to shape the future direction of OSS, helping to solve common problems in a manner that will enhance the value of their commercial products. OpenTelemetry is one such success story. This collection of tools, application programming interfaces and software development kits simplifies the capture and export of telemetry data from applications to make tracing more seamless across boundaries and systems. As a result, OpenTelemetry has become a de facto industry standard for the way organizations capture and process observability data, bringing them closer to achieving a unified view of hybrid technology stacks in a single platform.

3. Build Robust Security Practices

Despite the benefits of OSS, there’s always a risk of vulnerabilities slipping into production if they’re not detected and remediated quickly and effectively in development environments. Three-quarters (75%) of chief information security officers (CISOs) worry the prevalence of team silos and point solutions throughout the software development lifecycle makes it easier for vulnerabilities to fly below the radar. Their concerns are valid. The average application development project contains 49 vulnerabilities, according to one estimate. These risks will only grow as ChatGPT-like tools are increasingly used to support software development by compiling code snippets from open source libraries.

Given the dynamic, fast-changing nature of cloud native environments and the sheer scale of open source use, automation is the only way DevOps teams can take control of the situation. To support this, they should converge security data with real-time, end-to-end observability to create a unified source of insights. By combining this with trustworthy AI that can understand the full context behind that observability and security data, teams can unlock precise, real-time answers about vulnerabilities in their environment. Armed with those answers, they can implement security gates throughout the delivery pipeline so bugs are automatically resolved as soon as they are detected.
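
As a generic sketch of such a gate, the snippet below reads scanner output and fails the pipeline stage when severe findings are present. The results file name and JSON schema are invented for illustration; a real pipeline would use its scanner’s actual output format.

```python
"""Sketch: a CI security gate that fails the build when severe findings are present.
The scan-results file name and JSON schema are invented for illustration."""
import json
import sys

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def gate(results_path: str) -> int:
    with open(results_path) as f:
        findings = json.load(f)   # assumed shape: a list of {"id", "severity", "package"}
    blocking = [item for item in findings
                if item.get("severity", "").upper() in BLOCKING_SEVERITIES]
    for finding in blocking:
        print(f"BLOCKED: {finding.get('id')} ({finding.get('severity')}) "
              f"in {finding.get('package')}")
    return 1 if blocking else 0   # non-zero exit fails the pipeline stage


if __name__ == "__main__":
    sys.exit(gate("scan-results.json"))  # invented file name
```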

OSS is increasingly important to long-term success, even for commercially motivated organizations. How effectively they’re able to harness and contribute to its development will define the winners and losers of the next decade. If they put careful consideration into these three key points, DevOps leaders will bring their organizations much closer to being recognized as a leading innovator in their industries.

Survey Says: Cloud Maturity Matters https://thenewstack.io/survey-says-cloud-maturity-matters/ (Tue, 13 Jun 2023)

The third annual State of Cloud Strategy Survey, commissioned by HashiCorp and conducted by Forrester Consulting, focuses on operational cloud maturity — defined not by the amount of cloud usage but by adoption of a combination of technology and organizational best practices at scale.

The results were unambiguous: The organizations using operational best practices are deriving the biggest benefits from their cloud efforts, in everything from security and compliance to availability and the ability to cope with the ongoing shortage of critical cloud skills. High-maturity companies were more likely to report increases in cloud spending and less likely to say they were wasting money on avoidable cloud spending.

The seven headline numbers below capture many of the survey’s most important findings, and you can view the interactive State of Cloud Strategy Survey microsite for detailed results and methodology. Read on to learn more about our cloud maturity model and some of the key differences we found between high and low cloud-maturity organizations.

Source: A commissioned study conducted by Forrester Consulting on behalf of HashiCorp, February 2023

Our Cloud Maturity Model

To fully understand the survey results you need to know something about the cloud maturity model developed by HashiCorp and Forrester to describe where organizations are in their cloud adoption journey. HashiCorp commissioned Forrester Consulting to survey almost 1,000 technology practitioners and decision-makers from companies in a variety of industries around the world, primarily those with more than 1,000 employees.

Forrester asked about their use of best practices across technology layers including infrastructure, security, networking and applications, as well as their use of platform teams, and used that data to separate respondents into three tiers:

  • Low-maturity organizations, the lowest 25% of respondents, are experimenting with these practices.
  • Medium-maturity companies, the middle 50%, are standardizing their use of these practices.
  • High-maturity respondents, the top 25%, are scaling these practices across the entire organization.

How High-Maturity Organizations Are Different

Multicloud works better for highly mature companies. More than three quarters (76%) of high-cloud-maturity organizations say multicloud is helping them achieve their business goals, and another 17% expect it to within the next 12 months. That compares to just 60% of low-maturity respondents who say multicloud is working for them, while another 22% expect it to do so in the next year.

The Great Cloud Skills Shortage

“Skills shortages” is the most commonly cited barrier to operationalizing multicloud, and almost three quarters (74%) of high-maturity respondents say multicloud helps them attract, motivate and retain talent. That compares to less than half (48%) of low-maturity organizations who can say the same. Other large differences between the benefits experienced by high- and low-maturity respondents showed up in the areas of compliance and risk (80% to 56%), infrastructure visibility/insight (82% to 59%) and speed (76% to 59%). Also significant, 79% of high-maturity organizations report that their multicloud efforts have resulted in a stronger security posture, perhaps because working in multiple cloud environments can help organizations keep their security professionals engaged, and also be a forcing function toward more intentional oversight of their security operations.

Cloud Spending and Cloud Waste

Despite macroeconomic uncertainty, 62% of highly mature companies boosted their cloud spending in the last year. That compares to 56% of respondents overall and just 38% of low-maturity organizations. Yet even as they increased cloud spending, more than half (53%) of high-maturity respondents used multicloud to cut costs, compared to just 42% of low-maturity respondents.

Avoidable cloud spending remains high, with 94% of respondents reporting some degree of cloud waste (about the same as in last year’s survey). But the factors contributing to that waste differ notably: Low cloud-maturity firms, in particular, struggle with over-provisioning resources (53%, compared to 47% for high maturity firms), idle or underused resources (55% compared to 51%) and lack of needed skills (47% vs. 43%).

Multicloud Drivers

High- and low-maturity organizations also differ on what drives their multicloud efforts. For example, along with cost reductions, reliability, scalability, security and governance, digital transformation and, especially, portability of data and applications are much more commonly cited by high-maturity organizations. On the other hand, factors such as remote working, shopping for best-fit cloud service, desire for operational efficiency, backup/disaster recovery and avoiding vendor lock-in were relatively similar across all levels of maturity.

What are the business and technology factors driving your multicloud adoption?

Base: 963 respondents who are application development and delivery practitioners and decision-makers with budget authority for new investments. Source: A commissioned study conducted by Forrester Consulting on behalf of HashiCorp, February 2023.

When it comes to security threats, 49% of both high- and low-maturity respondents worry about data theft (the top-ranking choice), and roughly equal percentages are concerned about phishing and social engineering attacks. Notably, though, while 61% of low-maturity companies rank password/credential/secrets leaks as a big concern, only 47% of high-maturity respondents agree. Similarly, ransomware is an issue for 47% of low-maturity respondents but just 39% of their high-maturity counterparts.

What are the biggest threats your organization faces when it comes to cloud security?

Base: 957 respondents who are application development and delivery practitioners and decision-makers with budget authority for new investments. Source: A commissioned study conducted by Forrester Consulting on behalf of HashiCorp, February 2023.

Find out More

You can explore the full results of the survey on HashiCorp’s interactive State of Cloud Strategy Survey microsite, where you can also download Forrester Consulting’s “Operational Maturity Optimizes Multicloud” study, which presents the firm’s key survey findings, analysis and recommendations for enterprises.

The post Survey Says: Cloud Maturity Matters appeared first on The New Stack.

]]>
Open Sourcing AWS Cedar Is a Game Changer for IAM https://thenewstack.io/open-sourcing-aws-cedar-is-a-game-changer-for-iam/ Mon, 12 Jun 2023 17:00:36 +0000 https://thenewstack.io/?p=22709912

In today’s cloud native world, managing permissions and access control has become a critical challenge for many organizations. As applications

The post Open Sourcing AWS Cedar Is a Game Changer for IAM appeared first on The New Stack.

]]>

In today’s cloud native world, managing permissions and access control has become a critical challenge for many organizations. As applications and microservices become more distributed, it’s essential to ensure that only the right people and systems have access to the right resources.

However, managing this complexity can be difficult, especially as teams and organizations grow. That’s why the launch of Cedar, a new open source project from Amazon Web Services, is a tectonic shift in the identity and access management (IAM) space, making it clear that the problem of in-app permissions has grown too big to ignore.

Traditionally, organizations have relied on access control lists (ACLs) and role-based access control (RBAC) to manage permissions. However, as the number of resources and users grows, it becomes difficult to manage and scale these policies. This is where policy as code emerges as a de facto standard. It enables developers to write policies as code, which can be versioned, tested and deployed like any other code. This approach is more scalable, flexible and auditable than traditional approaches.

The Advantages of Cedar

Aside from impressive performance, one of the most significant advantages of Cedar is its readability. The language is designed to be extremely readable, empowering even nontechnical stakeholders to read it (if not write it) for auditing purposes. This is critical in today’s world, where security and compliance are top priorities.

Cedar policies are written in a declarative language, which means they can be easily understood and audited. Cedar also offers features like policy testing and simulation, which make it easier to ensure that policies are enforced correctly.

Unlike some other policy languages, Cedar adheres to a stricter, more structured syntax, which provides its aforementioned readability, an emphasis on safety by default (i.e., deny by default) and stronger assurances of correctness and security thanks to verification-guided development.
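To make that concrete, here is a minimal sketch in Rust using the open source cedar-policy crate; the policy text and the User/Action/Photo entity names are hypothetical examples, not anything from AWS’ documentation, and exact crate APIs may vary by version.

```rust
use cedar_policy::PolicySet;

fn main() {
    // Hypothetical policy text. Cedar is deny-by-default: unless a `permit`
    // statement matches the principal, action and resource of a request,
    // access is refused.
    let src = r#"
        permit(
            principal == User::"alice",
            action == Action::"viewPhoto",
            resource == Photo::"vacation.jpg"
        );
    "#;

    // Policies are plain text, so they can be versioned, reviewed and tested
    // like any other code before an agent ever evaluates them.
    let policies: PolicySet = src.parse().expect("policy should parse");
    println!("parsed policy set: {} statement(s)", policies.policies().count());
}
```

Because the policies are just text, the same snippet can sit in version control and run through automated tests as part of a policy-as-code workflow.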

Open Source Supporting Open Source

AWS has recognized the huge challenge that is application-level access control and strives to make Cedar easily consumed within its cloud via Amazon Verified Permissions (AVP). But what about on-premises deployments or other clouds? This is where other open source projects come into play.

With Cedar-Agent, developers can easily run Cedar as a standalone agent (just like Open Policy Agent) wherever they need it. And with OPAL, developers can manage Cedar-Agent at scale, from a unified event-driven control plane. OPAL makes sure that agents like OPA, AVP (Amazon Verified Permissions) and Cedar-Agent are loaded with the policy and data they need in real time.

Permit’s Unified Platform for Policy as Code

As developers, being polyglot and avoiding lock-in enables us to choose the right tool for the right job. With Permit’s SaaS platform, developers can choose between OPA’s Rego, AWS Cedar or any other tool as their policy engine of choice. And by leveraging Permit’s low code/no-code interfaces, RBAC and ABAC policy as code will be automatically generated so that users can take full advantage of policy as code without having to learn a new language.

Conclusion

The launch of AWS’ Cedar is a tectonic shift in the IAM space. It’s clear that the problem of in-app permissions has grown too big to ignore. Policy as code has emerged as a de facto standard, and tools like OPAL and Permit.io are making it easier for developers to write and manage policies at scale. Cedar’s readability and testing features make it an attractive choice for many organizations looking to manage permissions in a scalable, auditable and flexible way.

As the ecosystem continues to expand, we’ll likely see more tools and systems adopting policy as code as the preferred approach to managing permissions and access control in the cloud.

The post Open Sourcing AWS Cedar Is a Game Changer for IAM appeared first on The New Stack.

]]>
The Rise of the Cloud Native Cloud https://thenewstack.io/the-rise-of-the-cloud-native-cloud/ Mon, 12 Jun 2023 16:38:31 +0000 https://thenewstack.io/?p=22710768

Kubefirst delivers instant GitOps platforms made from popular free and open source cloud native tools. We’ve supported the cloud for

The post The Rise of the Cloud Native Cloud appeared first on The New Stack.

]]>

Kubefirst delivers instant GitOps platforms made from popular free and open source cloud native tools. We’ve supported the Amazon Web Services (AWS) cloud for years and love how well our platform runs on Elastic Kubernetes Service (EKS). We recently announced our expanded support for the new Civo cloud, a cloud native cloud that runs all of its infrastructure on Kubernetes. There are some pretty staggering differences between the two clouds, yet some things remain virtually identical, and it got me thinking about the wild journey of how we got here as an industry.

Remember the Original Public Clouds?

Remember when public clouds were new? Think back to computing just 10 years ago. In 2013, AWS was tightening its hold on the new cloud computing space with the self-service public cloud infrastructure model of Elastic Compute Cloud (EC2). Google Cloud Platform and Microsoft Azure were just a couple of years removed from announcing their own offerings, further solidifying the architectural shift away from self-managed data centers.

Despite the higher compute cost of public cloud infrastructure compared to its on-premises equivalents, the overall time and money saved by leveraging repeatable on-demand cloud infrastructure prompted companies to begin tearing down their rack space and moving their infrastructure to the public clouds. The self-service model gave more power to the developer, fewer handoffs in the DevOps space and more autonomy to engineering teams. The public cloud era was here to stay.

The IaC Revolution

Although the days of sluggish infrastructure IT tickets were now a thing of the past, the potential of the cloud still remained untapped for many organizations. True to Tesler’s Law, the shift toward the public cloud hadn’t exactly removed system complexity — the complexity had just found a new home.

To tackle that complexity, we needed new automated ways to manage our infrastructure and the era of Infrastructure as Code (IaC) did its best to rise to this challenge. New technologies like CloudFormation, Ansible, Chef, Puppet and Terraform all did their best to step up to the need, but the infrastructure story from company to company was generally still a rather complex and bespoke operation.

The Container Revolution

Around the same time another movement was sweeping through the application space: containerization. Largely Docker-based at the time, containerizing your apps was a new way to create a consistent application runtime environment, isolating the application from the infrastructure that it runs upon.

With containerization, we were suddenly able to run an app the same way on different operating systems or distributions, whether running on your laptop, on on-premises infrastructure or in the public cloud. This solved a lot of problems that companies suddenly had as their infrastructure requirements began to dramatically shift in new directions.

Organizations with classic monolithic applications began exploring how container-based microservices could be leveraged to streamline software development and ease their scaling woes. As the containerized world evolved and teams started building containerized microfrontends making calls to containerized microbackends, the sprawl of micro products became a lot to manage. This was felt particularly in the management of applications, infrastructure, secrets and observability at scale.

The Orchestration Battle

With the motion to put applications into containers and the resulting explosion of microservices and containerized micro products came a new challenge: managing all of them.

HashiCorp Nomad, Docker Swarm and Google’s Kubernetes (now part of CNCF) swiftly found their way to the conference keynote stages.

Each had its distinct advantages, but Kubernetes rose to the top with its declarative design, operating system and cloud portability, in addition to an unprecedentedly strong user community. The YAML-based system made it easy to organize your desired state into simple files that represent everything an application needs to work. It could be run in any cloud, on your on-premises infrastructure or even on your laptop, and it boasts a bustling community of cloud native engineers who share a uniform vision for modern solutions.

To Kubernetes Goes the Spoils

Cloud native engineers were quick to identify that all the software running inside Kubernetes was much easier to manage than the software that ran outside of Kubernetes. Opinions were beginning to form such that if your product didn’t have a Helm chart (the Kubernetes package manager), then it probably wasn’t very desirable to the cloud native engineers who were responsible for platform technology choices. After all, if you need to install complex third-party software, your choices are a Helm install command that takes seconds to run or pages upon pages of installation guides and error-prone instructions.

Opportunistic software vendors were quick to pick up on this trend and feverishly began rearchitecting their systems to be installed by Helm and operational on Kubernetes. Delivering complex, multicomponent software packages that are still easily installable in any cloud environment has long been the dream of software delivery teams, and it has finally become reality with Kubernetes at the helm.

How Complex Does Your Cloud Need to Be?

We first built kubefirst to provision instant cloud native (Kubernetes) platforms in the world’s largest public cloud, AWS, and it runs very well there. The maturity of the AWS cloud is largely unparalleled. If you need to accommodate large swaths of Fortune 500 complexities, Federal Information Processing Standards (FIPS)-compliant endpoints from all angles, extreme scales with enormous data volume or some other nightmare of this type, choosing one of the “big 3” (AWS, Google Cloud or Microsoft Azure) is a pretty easy instinct to follow.

If you’re working in this type of environment, kubefirst is lightning-fast and can turn 12 months of platform building into a single 30-minute command (kubefirst aws create).

We still love the big clouds. However, when we asked our community what clouds we should expand into, we weren’t too surprised to find a clamoring of interest for a simpler cloud option that focused on managed Kubernetes. The newer cloud providers like Civo, Vultr, DigitalOcean and others of this ilk are boasting blazing-fast cluster provisioning times with significantly reduced complexity. With fewer resources to manage than the cloud pioneers offer, they get you into that new cluster much faster.

Let’s break this down in terms of Terraform cloud resources, the code objects in your infrastructure as code. To create a new kubefirst instant platform from scratch in AWS, our kubefirst CLI needs to provision 95 AWS cloud resources. This includes everything — the VPC, subnets, key management service keys, state store buckets, backends, identity and access management (IAM) roles, policy bindings, security groups and the EKS cluster itself. Many of these resources are abstracted behind Terraform modules within the kubefirst platform, so the complexity of the cloud has been heavily reduced from the platform engineer’s perspective, but there’s still quite a bit of “cloud going on.” It’s also a very sophisticated and enterprise-ready setup if that’s what your organization requires.

But there’s a cost for this sophistication. To provision these 95 resources and get you into your cluster, you’ll have to wait about 25 fully automated minutes, and a lot of that is waiting on cluster provision time. It takes about 15 minutes to provision the master control plane and another 10 to provision and attach the node groups to it. If you need to destroy all these resources, it will take another 20 minutes of (automated) waiting.

But to have the same kubefirst platform in Civo, you only need to manage three Terraform resources instead of 95, and instead of the 45 minutes it takes to provision and destroy, you could do the same in about four minutes. When infrastructure is part of what you’re changing and testing, this is an enormously consequential detail for a platform team.

The Rise of Platform Engineering and the Cloud Native Cloud

Platform engineering is an emerging practice that allows organizations to modernize software delivery by establishing a platform team to build a self-service developer platform as their product. The practice requires that platform teams iterate regularly on the provisioning of infrastructure, cloud native application suites, application CI/CD, and Day 2 observability and monitoring. With entire software development ecosystems being provisioned over and over becoming the new normal, spending 45 minutes between iterations instead of four can be a costly detail for your platform team’s productivity.

If you fear that you will eventually need the complexities of “the big 3” clouds, that doesn’t mean that you need to borrow that cloud complexity today. Kubefirst is able to abstract the cloud from the platform so you can build your platform on kubefirst civo today and move it to kubefirst aws tomorrow with all of the same cloud native platform tools working in all the same ways.

The Kubefirst Platform on the Cloud Native Clouds

Kubefirst provisions instant, fully automated open source cloud native platforms on AWS, Civo, Vultr (beta) and DigitalOcean (beta), as well as locally with k3d Kubernetes. Civo Cloud is offering a one-month $250 free credit so you can try our instant platform on its cloud for free.

To create a new Civo account, add a domain, configure the nameserver records at your domain registrar, then run kubefirst civo create (full instructions when using Civo with GitHub, and with GitLab).

Within a few minutes you’ll have:

  • A gitops repo added to your GitHub/GitLab that powers your new platform so you can add your favorite tools and extend the platform as you need.
  • A Civo cloud and Kubernetes cluster provisioned with and configured by Terraform IaC.
  • A GitOps registry of cloud native platform application configurations, preconfigured to work well with each other.
  • HashiCorp Vault secrets management with all the platform secrets preconfigured and bound to their respective tools.
  • A user management platform with single sign-on (SSO) for your admins and engineers and an OpenID Connect (OIDC) provider preconfigured to work with all of your platform tools.
  • An example React microservice with source code that demonstrates GitOps pipelines and delivery to your new Kubernetes development, staging and production environments.
  • An Argo Workflows library of templates that conduct GitOps CI and integrate the Kubernetes native CI with GitHub Actions/GitLab pipelines.
  • Atlantis to integrate any Terraform changes with your pull or merge request workflow so that infrastructure changes are automated and auditable to your team.
  • Self-hosted GitLab/GitHub runners to keep your workloads cost-free and unlimited in use.

And with kubefirst you can throw away your production cluster with the next iteration available just a couple of minutes later.

The rise of the cloud native cloud is here.

The post The Rise of the Cloud Native Cloud appeared first on The New Stack.

]]>
Cloud-Focused Attacks Growing More Frequent, More Brazen https://thenewstack.io/cloud-focused-attacks-growing-more-frequent-more-brazen/ Mon, 12 Jun 2023 13:00:21 +0000 https://thenewstack.io/?p=22710520

Cloud-focused attacks have soared in recent years, with attackers growing more sophisticated, brazen and determined in cloud exploitation, according to

The post Cloud-Focused Attacks Growing More Frequent, More Brazen appeared first on The New Stack.

]]>

Cloud-focused attacks have soared in recent years, with attackers growing more sophisticated, brazen and determined in cloud exploitation, according to a new report.

Exploitation of cloud infrastructure increased 95% from 2021 to 2022, and cases of adversaries targeting cloud environments nearly tripled over the same timeframe, as noted in the CrowdStrike 2023 Cloud Risk Report.

This report by the cybersecurity platform company shares in rich detail how attackers are going after enterprise cloud environments, as well as how those threat actors use the same cloud platforms to support their own malicious campaigns.

One key finding is that hackers are becoming more adept — and more motivated — in targeting enterprise cloud environments through a growing range of tactics, techniques and procedures. These include deploying command-and-control channels on top of existing cloud services, achieving privilege escalation, and moving laterally within an environment after gaining initial access.

Many cloud-focused campaigns begin with a single set of compromised account credentials, which attackers use to gain a back door into a customer’s cloud environment. “One of the big things a lot of customers don’t realize is that the adversary will use their initial access to gain access to their identity system,” said James Perry, CrowdStrike’s senior director, incident response services, at the CrowdStrike Cloud Threat Summit, a virtual event held this past Tuesday and Wednesday. (Video presentations from the event are now available on demand.)

“That allows them to use single sign-on to access many other applications, including their cloud – all they need is one password,” Perry said. “That allows them to pivot from an on-prem identity into the cloud and gain that more destructive access.”

Hackers are also getting better at avoiding detection once they’ve breached an environment: In 28% of incidents during the period when CrowdStrike collected data for this report, an attacker had manually deleted a cloud instance to hide evidence and avoid detection. Threat actors also commonly deactivate security tools running inside virtual machines once they’ve gained access, the report noted, another maneuver to evade detection.

Cloud Misconfigurations Drive Risk

But the cloud isn’t just a target for adversaries — it’s a tool, too. Attackers will use cloud infrastructure to host tools, such as phishing lure documents and malware payloads, that support their attacks.

The CrowdStrike 2023 Cloud Risk Report offers a deep dive into the various methods and attack vectors modern adversary groups are deploying today, noting the ephemeral nature of some cloud instances is pushing attackers to become even more tenacious in their pursuit of cloud compromise.

Moreover, the relative infancy of many cloud-centric paradigms and technologies, such as containers and orchestration, expands the threat surface as well. Teams may simply not know all they need to know in order to keep their cloud infrastructure and workloads safe.

Among the report’s findings:

  • Sixty percent of container workloads lack properly configured security protections, and nearly one in four are running with root-like capabilities.
  • Kubernetes (K8s) misconfigurations can create similar risks at the orchestration layer: 26% of K8s Service Account Tokens are automounted, according to CrowdStrike, which can enable unauthorized access and communication with the Kubernetes API.

While attack vectors and methods are increasingly varied, they often rely on some common denominators, including the oldest one around: human error. For example, 38% of observed cloud environments were running with insecure default settings from the cloud service provider.

Indeed, cloud misconfigurations are one of the major sources of breaches.

Similarly, identity and access management (IAM) is another huge area of risk rife with human error. In two out of three cloud security incidents observed by CrowdStrike, IAM credentials were found to be over-permissioned, meaning the user had higher levels of privileges than necessary.

This is inextricably linked with a broader misconfiguration problem: CrowdStrike found nearly half of all detected cloud misconfigurations considered critical were the result of ineffective identity and entitlement hygiene, such as excessive permissions.

“Threat actors have become very adept at pivoting from on-prem enterprises to directly into the cloud leveraging stolen identities,” said Adam Meyers, CrowdStrike’s senior vice president of intelligence. “Identity security has become a major concern across all of our enterprise customers, as they understand that there’s not a single hack that’s taking place that doesn’t involve a compromised credential.”

Creating a Stronger Security Posture

Misconfiguration and identity challenges are highly preventable when organizations invest in the people, tooling and processes needed to get it right.

“CrowdStrike is consistently called in to investigate cloud breaches that could have been detected earlier or prevented if cloud security settings had been correctly configured,” the report said.

That speaks to a broader point: The report isn’t a doomsday story. It’s more of a call to arms, offering a blueprint for how enterprises can fight back and best protect their cloud environments from malicious actors. Since so many cloud security incidents begin with leaky credentials or oversized permissions, for example, shoring up identity and entitlement management is table stakes for a strong cloud security posture.

CrowdStrike identifies four pillars of a cloud-focused security posture that makes life difficult for even the most sophisticated adversaries.

  1. Cloud workload protection (CWP): A product that provides continuous threat monitoring and detection for cloud workloads across modern cloud environments.
  2. Cloud security posture management (CSPM): A set of processes and capabilities that detects, prevents and remediates the misconfigurations adversaries exploit.
  3. Cloud infrastructure entitlement management (CIEM): A set of features that secure cloud identities and permissions across multi-cloud environments, detects account compromises, and prevents identity misconfigurations, stolen access keys, insider threats and other malicious activity.
  4. Container security: A set of tools that perform detection, investigation and threat-hunting tasks on containers, even those that have been decommissioned.

This multi-layered approach, starting at the workload level, is crucial in today’s security landscape, said CrowdStrike president Michael Sentonas.

“If you’re not on the workload, you can’t stop an attack,” Sentonas said. “At best, you’re detecting it without the ability to do anything about it.”

The multi-pronged approach is what’s needed to protect and mitigate against both active attacks and the persistent reality of human error, he said: “Organizations need the tight native integration of an agent and an agentless solution that spans runtime to CSPM to CIEM to stop breaches from both adversaries and human error.”

Read the full report to boost your cloud security awareness and strategy.

The post Cloud-Focused Attacks Growing More Frequent, More Brazen appeared first on The New Stack.

]]>
Generative AI: What’s Ahead for Enterprises? https://thenewstack.io/generative-ai-what-opportunities-are-ahead-for-enterprises/ Thu, 08 Jun 2023 18:13:38 +0000 https://thenewstack.io/?p=22710276

There’s been a lot of speculation and hand-wringing about what impact ChatGPT and other generative AI tools will have on

The post Generative AI: What’s Ahead for Enterprises? appeared first on The New Stack.

]]>

There’s been a lot of speculation and hand-wringing about what impact ChatGPT and other generative AI tools will have on employment in the tech industry, especially for developers. But what is its potential for organizations and businesses? What new opportunities lie ahead?

In this episode of The New Stack Makers podcast, Nima Negahban, CEO of Kinetica, spoke to Heather Joslyn, features editor of The New Stack, about what could come next for companies, especially when generative AI is paired with data analytics.

The conversation was sponsored by Kinetica, an analytic database.

There’s an obvious use case, Negahban told us, one that will result in a transformative “killer app”: “An Alexa for all your data in your ecosystem in real time. Where you can ask, ‘What store is performing best today?’ Or, ‘What products underperform when it’s raining?’ Things like that, that’s within the purview, in a very short order of what we can do today.”

The result, he said, could be “a whole new level of visibility into how your enterprise is running.”

An Expectation of Efficiency

Two big challenges loom in the generative AI space, Negahban said. One, security, especially when using internal data to help train an AI model: “Is it OK to send the necessary information that you need to a large language model?”

And two, accuracy — making sure that the AI outputs aren’t riddled with hallucinations. “If my CEO is asking a question, and [the tool] generates that analytic on the fly and gives them an answer, how do we make sure that it’s right? How do we make sure that the information that we’re going to give that person is correct, and it’s not going to put them down a false path?”

For developers, generative AI — including tools like GitHub Copilot — will bring a new expectation of efficiency and innovation, Negahban said.

For both devs and product managers, he said, it can spur creativity; for instance, it can enable them “to make new features that previously you wouldn’t have been able to think of?”

The Future: Orchestration and Vector Search

Much remains to be discovered about using generative AI in the enterprise. For starters, the current models are basically “text completion engines,” Negahban noted. “How do you orchestrate that in a way that can actually accomplish something? That’s a multistep process.”

In addition, organizations are just starting to grapple with how to leverage the new technology with their data. “That is part of the reason why the vector search, vector database and vector search capability world is exploding right now,” he said. “Because people want to generate embeddings, and then do embedding search.”

Kinetica’s processing engine handles ad hoc queries in a performant way, without users needing to do a lot of pre-planning, indexing or data engineering. “We’re coupling that engine with the ability to generate SQL on the fly against natural language,” he said, with Open AI technology trained on Kinetica’s own Large Language Model.

The idea, Negahban said, is “if you can take that killer app and marry it with an engine that can do the querying, in a way that’s performing in a way that doesn’t require whole teams of people to prepare the data, that can be exceptionally powerful for an enterprise.”

Check out the entire episode to get more insight.

The post Generative AI: What’s Ahead for Enterprises? appeared first on The New Stack.

]]>
Security as Code Protects Rapidly Developing Cloud Native Architectures https://thenewstack.io/security-as-code-protects-rapidly-developing-cloud-native-architectures/ Thu, 08 Jun 2023 17:00:23 +0000 https://thenewstack.io/?p=22709691

Enterprises are increasingly going beyond lift-and-shift migrations to adopt cloud native strategies — the approach of developing, releasing and maintaining

The post Security as Code Protects Rapidly Developing Cloud Native Architectures appeared first on The New Stack.

]]>

Enterprises are increasingly going beyond lift-and-shift migrations to adopt cloud native strategies — the approach of developing, releasing and maintaining applications, all within cloud environments. According to Gartner, more than 95% of new digital initiatives will be conducted on cloud native platforms by 2025.

As enterprises dial up the focus on cloud native functionality, they’re moving away from manual ClickOps approaches to adopt automation that enables higher velocity and better manages increasing cloud complexity and scale. HashiCorp’s State of Cloud Strategy Survey shows that 81% of enterprises are already multicloud or plan to be within a year. Of those who have adopted multicloud, 90% say it works for them.

There’s a familiar problem amid all this adoption, one that’s plagued the entire industry for years: Traditional security workflows can’t keep up. They were never designed to support a paradigm where the architecture is represented as code that can change several times a day. The velocity and scope of change of today’s cloud native architectures cause security teams to struggle.

Embracing automation is the only viable approach for security teams to support this new paradigm. Developers have leaned on Infrastructure as Code (IaC) to build these cloud native applications on a large scale, even in complex environments. Security as Code (SaC) also leverages automation to intelligently analyze and remediate security and compliance design gaps, even as context changes. It’s the missing piece that completes an enterprise cloud environment.

HashiCorp’s survey shows a whopping 89% of respondents see security as a key driver of cloud success. Cloud-service providers recognize their customers’ challenges and are making investments in security to mitigate them.

Infrastructure automation tools are a catalyst for boosting operational efficiency in development, and the same is true for security. Automation helps optimize cost and scale operations. SaC ensures these applications are built right the first time, rather than security teams rushing to put out fires after they’re deployed. Empowering security teams to codify security best practices, and enforce them autonomously, allows them to focus on the strategic work of building standards that provide the necessary guardrails for developers to move with velocity. The future of SaC should be a corollary of IaC adoption, which is growing.

SaC helps both security and development teams operate autonomously, share responsibility, and collaborate more effectively in delivering secure products and features at the speed required by today’s business landscape. SaC is the only way that we can ensure security keeps up with the rapid pace of cloud native development.

How We Got Here

Modern application architectures have increased scale and complexity, completely outpacing traditional security methods, which can’t offer adequate protection in today’s landscape. These architectures are defined in IaC languages like Terraform and often span more than 100,000 lines of code that change frequently. This has allowed development teams to rapidly evolve the architecture, deliver infrastructure in an agile manner, and build architectures at an unparalleled scale and velocity.

These developers are increasingly empowered to choose their cloud providers, feature capabilities and tech stacks to rapidly deliver on customer needs. With all the choices developers are empowered to make, applications live in heterogeneous environments that are difficult to manage. If we were to measure the average entropy of an application architecture based on the interconnectedness of components, the curve would be exponential. Add in the false positives and lack of actionable, achievable and applicable feedback, and the impact on developer productivity is huge. This is especially detrimental at scale, and when time to market is a critical business objective.

Now consider the breadth of the community that’s creating these complex architectures. The Cloud Native Computing Foundation (CNCF) reports that there are 7.1 million cloud native developers — more than two and a half times the population of Chicago.

Multicloud strategies, diversity of cloud-feature capabilities, disparate tech stacks and an enormous base of developers combine to make security an incredibly complex undertaking. Functionality is prioritized, and often the security guardrails we need are not calculated in that developer freedom.

Why SaC

Traditional security measures simply can’t match the scale of today’s cloud native architectures, and enterprises recognize this issue. One report shows that nearly 70% of organizations believe their cloud security is only “somewhat mature,” and 79% aren’t sure they enforce policies consistently.

The answer is SaC, because it solves the most-pressing business challenges.

Say you need to deliver a unique solution for a fleeting business opportunity. Often, security considerations slow down the time to market. With SaC, instead of being an inhibitor, security becomes an accelerator. SaC provides the developers with flexible guardrails that let them operate autonomously with velocity. Developers can evolve their feature capabilities without having to slow down for security and potentially miss the window of opportunity.

SaC comes to the rescue when technology needs change, like modernizing your tech stack to pay off tech debt and adopt new capabilities. It also allows you to rapidly evolve security practices when your threat landscape changes because your business is increasingly being targeted. Enterprises struggling with compliance at scale can alleviate those challenges by leveraging SaC to automate compliance workflows to reduce the time and cost of becoming compliant.

McKinsey saw the promise of SaC as the best “and maybe only” path to securing cloud native architectures more than a year ago. In addition to being the next logical step of IaC and operating at the scale and pace of innovation with security baked in, SaC creates transparency in security design, and consistent, repeatable and reusable representations of the security architecture.

What SaC Enables

We’re already seeing the payoff. Opening up our SaC framework is the feature our customers ask for the most. It’s allowed resource-constrained security teams to stop putting out fires and elevate their strategy, leveraging automation to do the tedious work. Our customers have reported a 70% reduction in security design review time and 40% reduction of cost in delivering security design by automating design validation using SaC.

SaC is also the key to unlocking collaboration, autonomy and shared responsibility across development and security teams, enabling the DevOps and DevSecOps cultures that organizations want to adopt.

This is increasingly a priority, as 62% of organizations have a DevSecOps plan or are evaluating use cases, and 84% believe getting the right data and tools to developers is key to enabling DevSecOps, according to ESG Research. As modern application development evolves, SaC is the accelerator that allows security to keep pace with everything else.

Envisioning a Modern Security Practice

Developers have been unleashed to innovate as fast as possible, using whatever tools and cloud environments they wish. The only way to have security keep up with them is to identify best practices at the policy level, agnostic to the technology stacks these developers choose. Automation, powered by SaC, turns that from a dream to reality.

We can use SaC to fit into developers’ workflows and democratize security for them. This completely changes the dynamic of how developers and security interact. Ten years from now, the traditional workflows that rely on Word documents, Excel spreadsheets and Visio diagrams will be a thing of the past. Developers will have an increased responsibility for security, with collaboration making those efforts stronger. When security is defined as code, developers can easily change a security architecture to better meet their requirements.

Shifting to SaC allows enterprises to make security a driver of their velocity and agility. Automation improves security from reducing human error, to eliminating scaling challenges so security can keep pace with development, to providing richer security policies.

With SaC, we finally have a way to quickly make changes that deliver repeatable outcomes at the same speed as application development. As cloud native architectures become more prominent, this is the only way security can keep pace.

The post Security as Code Protects Rapidly Developing Cloud Native Architectures appeared first on The New Stack.

]]>
Sundeck Launches Query Engineering Platform for Snowflake https://thenewstack.io/sundeck-launches-query-engineering-platform-for-snowflake/ Thu, 08 Jun 2023 12:00:06 +0000 https://thenewstack.io/?p=22709576

Sundeck, a new company led by one of the co-founders of Dremio, recently launched a public preview of its eponymous

The post Sundeck Launches Query Engineering Platform for Snowflake appeared first on The New Stack.

]]>

Sundeck, a new company led by one of the co-founders of Dremio, recently launched a public preview of its eponymous SaaS “query engineering” platform. The platform, which Sundeck says is built for data engineers, analysts and database administrators (DBAs), will initially work with Snowflake‘s cloud data platform. Sundeck will be available free of charge during the public preview; afterward, the company says it will offer “simple” pricing, including both free and premium tiers.

Sundeck (the product) is built atop an Apache-licensed open source project called Substrait, though it offers much additional functionality and value. Sundeck (the company) has already closed a $20M seed funding round, with participation from venture capital firms Coatue, NEA and Factory.

What Does It Do?

Jacques Nadeau, formerly CTO at Dremio and one of its co-founders, briefed The New Stack and explained in depth how Sundeck query engineering works. Nadeau also described a number of Sundeck’s practical applications.

Basically, Sundeck sits between business intelligence (BI)/query tools on the one hand, and data sources (again, just Snowflake, to start) on the other. It hooks into the queries and can dynamically rewrite them. It can also hook into and rewrite query results.

Sundeck “hooks” (bottom, center) insinuate themselves in the query path between data tools (on the left) and the data source (on the right). Credit: Sundeck

One immediate benefit of the query hook approach is that it lets customers optimize the queries with better SQL than the tools might generate. By inspecting queries and looking for specific patterns, Sundeck can find inefficiencies and optimize them on-the-fly, without requiring users, or indeed BI tools, to do so themselves.

Beyond Query Optimization

More generally, though, Sundeck lets customers evaluate rules and take actions. The rules can be based on the database table(s) being queried, the user persona submitting the query or even properties of the underlying system being queried. This lets Sundeck do anything from imposing usage quotas (and thus controlling Snowflake spend) to redirecting queries to different tables or a different data warehouse, rejecting certain high-cost queries outright, reducing or reshaping a result set, or kicking off arbitrary processes.
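As a purely illustrative sketch (not Sundeck’s actual API or rule syntax), the rule-and-action idea could be modeled roughly like this in Rust, with hypothetical personas, table names and rewrite targets:

```rust
// Purely illustrative: a toy rule-and-action model, not Sundeck's API.
#[derive(Debug)]
enum Action {
    Rewrite(String), // replace the inbound SQL before it reaches Snowflake
    Reject(String),  // return an error instead of running the query
    PassThrough,     // forward the query unchanged
}

struct QueryEvent<'a> {
    sql: &'a str,
    persona: &'a str,
    table: &'a str,
}

// Evaluate an inbound query against a few hypothetical rules.
fn evaluate(event: &QueryEvent) -> Action {
    // Rule 1: analysts hitting the raw events table get redirected to a
    // cheaper, pre-aggregated table.
    if event.persona == "analyst" && event.table == "raw_events" {
        return Action::Rewrite(event.sql.replace("raw_events", "events_daily_agg"));
    }
    // Rule 2: reject unbounded dashboard scans to control spend.
    if event.persona == "dashboard" && !event.sql.to_lowercase().contains("limit") {
        return Action::Reject("unbounded dashboard queries are not allowed".into());
    }
    Action::PassThrough
}

fn main() {
    let event = QueryEvent {
        sql: "SELECT * FROM raw_events WHERE ts > '2023-06-01'",
        persona: "analyst",
        table: "raw_events",
    };
    println!("{:?}", evaluate(&event));
}
```

A real deployment would match on much richer context, such as the load on the underlying warehouse, but the shape is the same: inspect the inbound query, evaluate rules and return an action.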

In effect, Sundeck takes the call-and-response pipeline between client and database and turns it into an event-driven service platform, with a limitless array of triggers and automated outcomes. But that’s not to say Sundeck does this in some generic compute platform-like fashion. Instead, it’s completely contextual to databases, using Snowflake’s native API.

With that in mind, we could imagine other applications for Sundeck, including observability/telemetry analytics, sophisticated data replication schemes and even training of machine learning models, using queries and/or result sets as training or inferencing data. Data regulation compliance, data exfiltration prevention, and responsible AI processes are other interesting applications for Sundeck. Apropos of that, Sundeck says its private result path technology ensures data privacy and that its platform is already SOC 2-certified.

In the Weeds

If all of this sounds a bit geeky, that would genuinely seem to be by design. Sundeck’s purpose here was to provide a user base — that already works at a technical level — access to the query pipeline, which heretofore has largely been a black box. This user audience is already authoring sophisticated data transformation pipelines with platforms like dbt, so why not let them transform queries as well?

It’s no surprise that Sundeck is a product that lives deep in the technology stack. After all, Nadeau previously led similarly infrastructural open source projects like Apache Arrow, which provides a unified standard for storing columnar data in memory (and which Nadeau says is an important building block in Snowflake’s platform), and Apache Drill, which acts as a SQL federated query broker. The rest of the fifteen-person Sundeck team has bona fides similar to Nadeau’s, counting 10 Apache project management committee (PMC) leaders, and even co-founders of Apache projects, like Calcite and Phoenix, among its ranks.

Sunny Forecast on Deck?

If data is the lifeblood of business, then query pathways are critical arteries in a business’ operation. As such, being able to observe and intercept queries, then optimize them or automate processes in response to them, seems like common sense. If Sundeck can expand to support the full array of major cloud data warehouse and lakehouse platforms, query engineering could catch on and an ecosystem could emerge.

The post Sundeck Launches Query Engineering Platform for Snowflake appeared first on The New Stack.

]]>
How WASM (and Rust) Unlocks the Mysteries of Quantum Computing https://thenewstack.io/how-wasm-and-rust-unlocks-the-mysteries-of-quantum-computing/ Thu, 08 Jun 2023 10:00:40 +0000 https://thenewstack.io/?p=22709920

WebAssembly has come a long way from the browser; it can be used for building high-performance web applications, for serverless

The post How WASM (and Rust) Unlocks the Mysteries of Quantum Computing appeared first on The New Stack.

]]>

WebAssembly has come a long way from the browser; it can be used for building high-performance web applications, for serverless applications, and for many other uses.

Recently, we also spotted it as a key technology used in creating and controlling a previously theoretical state of matter that could unlock reliable quantum computing — for the same reasons that make it an appealing choice for cloud computing.

Quantum Needs Traditional Computing

Quantum computing uses exotic hardware (large, expensive and very, very cold) to model complex systems and problems that need more memory than the largest supercomputer: it stores information in equally exotic quantum states of matter and runs computations on it by controlling the interactions of subatomic particles.

But alongside that futuristic quantum computer, you need traditional computing resources to feed data into the quantum system, to get the results back from it — and to manage the state of the qubits to deal with errors in those fragile quantum states.

As Dr. Krysta Svore, the researcher heading the team building the software stack for Microsoft’s quantum computing project, put it in a recent discussion of hybrid quantum computing, “We need 10 to 100 terabytes a second bandwidth to keep the quantum machine alive in conjunction with a classical petascale supercomputer operating alongside the quantum computer: it needs to have this very regular 10 microsecond back and forth feedback loop to keep the quantum computer yielding a reliable solution.”

Qubits can be affected by what’s around them and lose their state in microseconds, so the control system has to be fast enough to measure the quantum circuit while it’s operating (that’s called a mid-circuit measurement), find any errors and decide how to fix them — and send that information back to control the quantum system.

“Those qubits may need to remain alive and remain coherent while you go do classical compute,” Svore explained. “The longer that delay, the more they’re decohering, the more noise that is getting applied to them and thus the more work you might have to do to keep them stable and alive.”

Fixing Quantum Errors with WASM

There are different kinds of exotic hardware in quantum computers and you have a little more time to work with a trapped-ion quantum computer like the Quantinuum System Model H2, which will be available through the Azure Quantum service in June.

That extra time means the algorithms that handle the quantum error correction can be more sophisticated, and WebAssembly is the ideal choice for building them, Pete Campora, a quantum compiler engineer at Quantinuum, told The New Stack.

Over the last few years, Quantinuum has used WebAssembly (WASM) as part of the control system for increasingly powerful quantum computers, going from just demonstrating that real-time quantum error correction is possible to experimenting with different error correction approaches and, most recently, creating and manipulating for the first time the exotic entangled quantum states (called non-Abelian anyons) that could be the basis of fault-tolerant quantum computing.

Move one of these quasiparticles around another — like braiding strings — and they store that sequence of movements in their internal state, forming what’s called a topological qubit that’s much more error resistant than other types of qubit.

At least, that’s the theory: and WebAssembly is proving to be a key part of proving it will work — which still needs error correction on today’s quantum computers.

“We’re using WebAssembly in the middle of quantum circuit execution,” Campora explained. The control system software is “preparing quantum states, doing some mid-circuit measurements, taking those mid-circuit measurements, maybe doing a little bit of classical calculation in the control system software and then passing those values to the WebAssembly environment.”

Controlling Quantum Circuits

In the cloud, developers are used to picking the virtual machine with the right specs or choosing the right accelerator for a workload.

Rather than picking from fixed specs, quantum programming can require you to define the setup of your quantum hardware with a language like OpenQASM (Open Quantum Assembly Language): describing the quantum circuit that will be formed by the qubits, as well as the algorithm that will run on it, and error-correcting the qubits while the job is running. That’s rather like controlling an FPGA with a hardware description language like Verilog.

You can’t measure a qubit to check for errors directly while it’s working or you’d end the computation too soon, but you can measure an extra qubit (called an “ancilla” because it’s used to store partial results) and extrapolate the state of the working qubit from that.

What you get is a pattern of measurements called a syndrome. In medicine, a syndrome is a pattern of symptoms used to diagnose a complicated medical condition like fibromyalgia. In quantum computing, you have to “diagnose” or decode qubit errors from the pattern of measurements, using an algorithm that can also decide what needs to be done to reverse the errors and stop the quantum information in the qubits from decohering before the quantum computer finishes running the program.

OpenQASM is good for basic integer calculation, but it requires a lot of expertise to write that code: “There’s a lot more boilerplate than if you just call out to a nice function in WASM.”

Writing the algorithmic decoder that uses those qubit measurements to work out the most likely error, and how to correct it, in C, C++ or Rust and then compiling it to WebAssembly makes the work more accessible. It lets quantum engineers use more complex data structures, such as vectors, arrays and tuples, and other ways to pass data between functions, so they can write more sophisticated algorithms that deliver more effective quantum error correction.

“An algorithmic decoder is going to require data structures beyond what you would reasonably try to represent with just integers in the control system: it just doesn’t make sense,” Campora said. “The WASM environment does a lot of the heavy lifting of mutating data structures and doing these more complex algorithms. It even does things like dynamic allocation that normally you’d want to avoid in control system software due to timing requirements and being real time. So, the Rust programmer can take advantage of Rust crates for representing graphs and doing graph algorithms and dynamically adding these nodes into a graph.”

The first algorithmic decoder the Quantinuum team created in Rust and compiled to WASM was fairly simple: “You had global arrays or dictionaries that mapped your sequence of syndromes to a result.” The data structures used in the most recent paper are more complex and quantum engineers are using much more sophisticated algorithms like graph traversal and Dijkstra’s [shortest path] algorithm. “It’s really interesting to see our quantum error correction researchers push the kinds of things that they can write using this environment.”
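To illustrate that early lookup-table style, here is a toy Rust decoder for a textbook three-qubit bit-flip repetition code; it is not Quantinuum’s actual code, just a sketch of the idea. Built as a cdylib for the wasm32-unknown-unknown target, the exported function could be called by control software between mid-circuit measurements.

```rust
// Toy lookup-table decoder for the three-qubit bit-flip repetition code.
// Index = s01 + 2 * s12, where s01 and s12 are the two ancilla (parity)
// measurements. Value = bitmask of data qubits to flip (bit 0 = qubit 0).
static CORRECTIONS: [u32; 4] = [
    0b000, // 00: no error detected
    0b001, // only s01 fired: flip qubit 0
    0b100, // only s12 fired: flip qubit 2
    0b010, // both fired: flip the middle qubit, qubit 1
];

// Exported so a (hypothetical) control system can call it from WASM.
#[no_mangle]
pub extern "C" fn decode(s01: u32, s12: u32) -> u32 {
    let idx = (s01 & 1) + 2 * (s12 & 1);
    CORRECTIONS[idx as usize]
}
```

The decoders described in the more recent paper replace this flat table with graph-based data structures and shortest-path searches, but the contract is the same: syndrome measurements in, correction out.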

Enabling software that’s powerful enough to handle different approaches to quantum error correction makes it much faster and more accessible for researchers to experiment than if they had to make custom hardware each time, or even reprogram an FPGA, especially for those with a background in theoretical physics (with the support of the quantum compiler team if necessary). “It’s portable, and you can generate it from different languages, so that frees people up to pick whatever language and software that can compile to WASM that’s good for their application.”

“It’s definitely a much easier time for them to get spun up trying to think about compiling Rust to WebAssembly versus them having to try and program an FPGA or work with someone else and describe their algorithms. This really allows them to just go and think about how they’re going to do it themselves,” Campora said.

Sandboxes and System Interfaces

With researchers writing their own code to control a complex — and expensive — quantum system, protecting that system from potentially problematic code is important and that’s a key strength of WebAssembly, Campora noted. “We don’t have to worry about the security concerns of people submitting relatively arbitrary code, because the sandbox enforces memory safety guarantees and basically isolates you from certain OS processes as well.”

Developing quantum computing takes the expertise of multiple disciplines and both commercial and academic researchers, so there are the usual security questions around code from different sources. “One of the goals with this environment is that, because it’s software, external researchers that we’re collaborating with can write their algorithms for doing things like decoders for quantum error correction and can easily tweak them in their programming language and resubmit and keep re-evaluating the data.”

A language like portable C could do the computation, “but then you lose all of those safety guarantees,” Campora pointed out. “A lot of the compilation tooling is really good about letting you know that you’re doing something that would require you to break out of the sandbox.”

WebAssembly restricts what a potentially malicious or inexpert user could do that might damage the system but also allows system owners to offer more capabilities to users who need them, using WASI — the WebAssembly System Interface that standardizes access to features and services that aren’t in the WASM sandbox.

“I like the way WASI can allow you, in a more fine-grained way, to opt into a few more things that would normally be considered breaking the sandbox. It gives you control. If somebody comes up to you with a reasonable request that would be useful, say, random number generation, we can look into adding WASI support so that we can unblock them, but by default, they’re sandboxed away from OS things.”
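
As a rough illustration of that opt-in model, here is a minimal host-side sketch using the wasmtime Python bindings, in which a guest module is granted nothing beyond standard output. The module name and its exported decode function are hypothetical, and this is not Quantinuum's actual control-system stack.

```python
# Minimal sketch, not Quantinuum's stack: a host embeds a WASM module and opts
# it into only the WASI capabilities it needs. "decoder.wasm" and its exported
# "decode" function are hypothetical.
from wasmtime import Engine, Store, Module, Linker, WasiConfig

engine = Engine()
store = Store(engine)

wasi = WasiConfig()
wasi.inherit_stdout()   # allow printing for debugging; nothing else is granted
store.set_wasi(wasi)

linker = Linker(engine)
linker.define_wasi()    # expose the (restricted) WASI imports to the guest

module = Module.from_file(engine, "decoder.wasm")
instance = linker.instantiate(store, module)

decode = instance.exports(store)["decode"]
print(decode(store, 0b0101))  # pass a packed syndrome word, get a correction code
```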

In the end, esoteric as the work is, the appeal of WebAssembly for quantum computing error correction rests on the same qualities that make it useful in so many other areas.

“The web part of the name is almost unfortunate in certain ways,” Campora noted, “because it’s really this generic virtual machine-stack machine-sandbox, so it can be used for a variety of domains. If you have those sandboxing needs, it’s really a great target for you to get some safety guarantees and still allows people to submit code to it.”

The post How WASM (and Rust) Unlocks the Mysteries of Quantum Computing appeared first on The New Stack.

]]>
Enhance Kubernetes Scheduling for GPU-Heavy Apps with Node Templates https://thenewstack.io/enhance-kubernetes-scheduling-for-gpu-heavy-apps-with-node-templates/ Wed, 07 Jun 2023 17:00:45 +0000 https://thenewstack.io/?p=22708536

Kubernetes scheduling ensures that pods are matched to the right nodes so that the Kubelet can run them. The whole

The post Enhance Kubernetes Scheduling for GPU-Heavy Apps with Node Templates appeared first on The New Stack.

]]>

Kubernetes scheduling ensures that pods are matched to the right nodes so that the Kubelet can run them.

The whole mechanism promotes availability and performance, often with great results. However, the default behavior is an anti-pattern from a cost perspective. Pods running on half-empty nodes equal higher cloud bills. This problem becomes even more acute with GPU-intensive workloads.

Perfect for parallel processing of multiple data sets, GPU instances have become a preferred option for training AI models, neural networks, and deep learning operations. They perform these tasks faster, but also tend to be costly and lead to massive bills when combined with inefficient scheduling.

This issue challenged one of CAST AI’s users — a company developing an AI-driven security intelligence product. Their team overcame it with our platform’s node templates, an autoscaling feature that boosted the provisioning and performance of workloads requiring GPU-enabled instances.

Learn how node templates can enhance Kubernetes scheduling for GPU-intensive workloads.

The Challenge of K8s Scheduling for GPU Workloads

Kube-scheduler is Kubernetes’ default scheduler running as part of the control plane. It selects nodes for newly created and yet unscheduled pods. By default, the scheduler tries to spread these pods evenly.

Containers within pods can have different requirements, so the scheduler filters out any nodes that don’t meet the pod’s specific needs.

It identifies and scores all feasible nodes for your pod, then picks the one with the highest score and notifies the API server about this decision. Several factors impact this process, for example, resource requirements, hardware and software constraints, affinity specs, etc.

Fig. 1: An overview of Kubernetes scheduling
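
As a rough sketch of that filter-and-score flow (the real kube-scheduler is written in Go and uses many pluggable filters and weighted scores), consider the following toy Python version, where the node shapes and the scoring rule are simplified assumptions:

```python
# Toy illustration of the filter-then-score idea described above; the real
# kube-scheduler uses many more plugins, weights and constraints.
def filter_nodes(nodes, pod):
    """Drop nodes that cannot satisfy the pod's hard requirements."""
    return [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"]
        and n["free_mem"] >= pod["mem"]
        and (not pod.get("needs_gpu") or n.get("gpus", 0) > 0)
    ]

def score_node(node, pod):
    """Higher score = better fit; here we simply prefer emptier nodes."""
    return (node["free_cpu"] - pod["cpu"]) + (node["free_mem"] - pod["mem"])

def schedule(nodes, pod):
    feasible = filter_nodes(nodes, pod)
    if not feasible:
        return None  # pod stays Pending until the autoscaler adds capacity
    return max(feasible, key=lambda n: score_node(n, pod))

nodes = [
    {"name": "node-a", "free_cpu": 4, "free_mem": 16, "gpus": 0},
    {"name": "node-b", "free_cpu": 8, "free_mem": 32, "gpus": 1},
]
pod = {"cpu": 2, "mem": 8, "needs_gpu": True}
print(schedule(nodes, pod)["name"])  # node-b
```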

The scheduler automates the decision process and delivers results fast. However, it can be costly, as its generic approach may leave you paying for resources that are suboptimal for your environment.

Kubernetes doesn’t care about the cost. Sorting out expenses — determining, tracking and reducing them — is up to engineers, and this is particularly acute in GPU-intensive applications, as their rates are steep.

Costly Scheduling Decisions

To better understand their price tag, let’s look at Amazon EC2 P4d instances, designed for machine learning and high-performance computing apps in the cloud.

Powered by NVIDIA A100 Tensor Core GPUs, it delivers top throughput and low latency, with support for 400 Gbps instance networking. P4d promises to lower the cost of training ML models by 60% and provide 2.5x better performance for deep learning than earlier P3 instance generations.

While it sounds impressive, it also comes at an hourly on-demand price exceeding the cost of a popular instance type like C6a several hundred times. That’s why it’s essential to control the scheduler’s generic decisions precisely.

Fig. 2 Price comparison of p4d and c6a

Unfortunately, when running Kubernetes on GKE, AKS or Amazon Web Services’ Elastic Kubernetes Service (EKS), you have little ability to adjust scheduler settings without using components such as MutatingAdmissionControllers.

That’s still not a bulletproof solution, as you need to proceed with caution when authoring and installing webhooks.

Node Templates to the Rescue

This was precisely the challenge one of CAST AI users faced. The company develops an AI-powered intelligence solution for the real-time detection of threats from social and news media. Its engine analyzes millions of documents simultaneously to catch emerging narratives, but it also enables the automation of unique Natural Language Processing (NLP) models for intelligence and defense.

The volumes of classified and public data that the product uses are ever-growing. That means its workloads often require GPU-enabled instances, which incur extra costs and work.

Much of that effort can be saved using node pools (Auto Scaling groups). But while helping streamline the provisioning process, node pools can also be highly cost-ineffective, leading you to pay for the capacity you don’t need.

CAST AI’s autoscaler and node templates improve on that by providing you with tools for better cost control and reduction. In addition, thanks to the fallback feature, node templates let you benefit from spot instance savings and guarantee capacity even when spots become temporarily unavailable.

Node Templates in Action

The workloads of the CAST AI client now run on predefined groups of instances. Instead of having to select specific instances manually, the team can broadly define their characteristics, for example “CPU-optimized,” “Memory-optimized” and “GPU VMs,” then the autoscaler does the rest.

This feature has given them far more flexibility, as they can use different instances more freely. As AWS adds new, highly performant instance families, CAST AI automatically enrolls you for them, so you don’t need to enable them separately. This isn’t the case with node pools, which require you to keep track of new instance types and update your configs accordingly.

By creating a node template, our client could specify general requirements — instance types, the lifecycle of the new nodes to add, and provisioning configs. They additionally identified constraints such as the instance families they didn’t wish to use (p4d, p3d, p2) and the GPU manufacturer (in this case, NVIDIA).

For these particular requirements, CAST AI found five matching instances. The autoscaler now follows these constraints when adding new nodes.

Fig. 3 Node template example with GPU-enabled instances

Once the GPU jobs are done, the autoscaler decommissions GPU-enabled instances automatically.

Moreover, thanks to spot instance automation, our client can save up to 90% of hefty GPU VM costs without the negative consequences of spot interruptions.

As spot prices can vary dramatically for GPUs, it’s essential to pick the most cost-effective ones at any given time. CAST AI’s spot instance automation takes care of this. It can also strike the right balance between instance diversity and price.

And on-demand fallback can be a blessing during mass spot interruptions or periods of low spot availability. For example, an interrupted, improperly saved training process in deep learning workflows can lead to severe data loss. If AWS happens to withdraw all the EC2 G3 or p4d spot instances your workloads have been using at once, an automated fallback can save you a lot of hassle.

How to Create a Node Template for Your Workload

Creating a node template is relatively quick, and you can do it in three different ways.

First, by using CAST AI’s UI. It’s easy if you have already connected and onboarded a cluster. Enter your product account and follow the on-screen instructions.

After naming the template, you need to select whether you wish to taint the new nodes to avoid assigning unrelated pods to them. You can also specify a custom label for the nodes you create using the template.

Fig. 4 Node template from CAST AI

You can then link the template to a relevant node configuration, and you can also specify whether you wish your template to use only spot or only on-demand nodes.

You also get a choice of processor architecture and the option to use GPU-enabled instances. If you select this preference, CAST AI will automatically run your workloads on relevant instances, including any new families added by your cloud provider.

Finally, you can also use restrictions such as:

  • Compute-optimized: helps to pick instances for apps requiring high-performance CPUs.
  • Storage-optimized: selects instances for apps that benefit from high IOPS.
  • Additional constraints, such as Instance Family, minimum and maximum CPU and memory limits.

But the hard fact is that the fewer constraints you add, the better the matches and the higher the cost savings you will get. CAST AI’s engine will take care of that.

You can also create node templates with Terraform (you can find all the details on GitHub) or use the API (check the documentation).

Summary

Kubernetes scheduling can be challenging, especially when it comes to GPU-heavy applications. Although the scheduler automates the provisioning process and delivers fast results, it can often prove too generic and expensive for your application’s needs.

With node templates, you get better performance and flexibility for GPU-intensive workloads. The feature also ensures that once a GPU instance is no longer necessary, the autoscaler decommissions it and gets a cheaper option for your workload’s new requirements.

We found that this quality helps build AI apps faster and more reliably — and we hope it will support your efforts, too.

The post Enhance Kubernetes Scheduling for GPU-Heavy Apps with Node Templates appeared first on The New Stack.

]]>
Building GPT Applications on Open Source Stack LangChain https://thenewstack.io/building-gpt-applications-on-open-source-stack-langchain/ Wed, 07 Jun 2023 13:58:49 +0000 https://thenewstack.io/?p=22710146

This is the first of two articles. Today, we see great eagerness to harness the power of generative pre-trained transformer

The post Building GPT Applications on Open Source Stack LangChain appeared first on The New Stack.

]]>

This is the first of two articles.

Today, we see great eagerness to harness the power of generative pre-trained transformer (GPT) models and build intelligent and interactive applications. Fortunately, with the availability of open source tools and frameworks, like LangChain, developers can now leverage the benefits of GPT models in their projects. LangChain is a software development framework designed to simplify the creation of applications using large language models (LLMs). In this first article, we’ll explore three essential points that developers should consider when building GPT applications on the open source stack provided by LangChain. In the second article, we’ll work through a code example using LangChain to demonstrate its power and ease of use.

Quality Data and Diverse Training

Building successful GPT applications depends upon the quality and diversity of the training data. GPT models rely heavily on large-scale datasets to learn patterns, understand context and generate meaningful outputs. When working with LangChain, developers must therefore prioritize the data they use for training. Consider the following three points to ensure data quality and diversity.

Data Collection Strategy

Define a comprehensive data collection strategy tailored to the application’s specific domain and use case. Evaluate available datasets, explore domain-specific sources and consider incorporating user-generated data for a more diverse and contextual training experience.

Data Pre-Processing

Dedicate time and resources to pre-process the data. This will improve its quality, which, in turn, enhances the model’s performance. Cleaning the data, removing noise, handling duplicates and normalizing the format are essential, well-known pre-processing tasks. Use utilities for data pre-processing, simplifying the transformation of raw data into a suitable format for GPT model training.
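
For instance, a minimal sketch of those steps on a list of raw text records might look like this (the thresholds and regexes here are arbitrary assumptions, not LangChain APIs):

```python
# Minimal sketch of the pre-processing steps described above: clean up noise,
# drop duplicates and normalize formatting before the text is used for training.
import re
import unicodedata

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)         # unify unicode forms
    text = re.sub(r"<[^>]+>", " ", text)                # strip stray HTML tags
    text = re.sub(r"\s+", " ", text).strip().lower()    # collapse whitespace
    return text

def preprocess(records: list[str]) -> list[str]:
    seen, cleaned = set(), []
    for record in records:
        text = normalize(record)
        if len(text) < 20:        # drop fragments too short to be useful
            continue
        if text in seen:          # drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = ["<p>Hello   World</p>", "hello world",
       "A longer, genuinely useful training example about GPT applications."]
print(preprocess(raw))  # only the third record survives
```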

Ethical Considerations

There may be potential biases and ethical concerns within the data. GPT models have been known to amplify existing biases present in the training data. Therefore, regularly evaluate and address biases to ensure the GPT application is fair, inclusive and respects user diversity.

Fine-Tuning and Model Optimization

A pre-trained GPT model provides a powerful starting point, but fine-tuning is crucial to make it more contextually relevant and tailored to specific applications. Developers can employ various techniques to optimize GPT models and improve their performance. Consider the following three points for fine-tuning and model optimization.

Task-Specific Data

Gather task-specific data that aligns with the application’s objectives. Fine-tuning GPT models on relevant data helps them understand the specific nuances and vocabulary of the application’s domain, leading to more accurate and meaningful outputs.

Hyperparameter Tuning

Experiment with different hyperparameter settings during the fine-tuning process. Adjusting hyperparameters such as learning rates, batch sizes and regularization techniques can significantly affect the model’s performance. Use tuning capabilities to iterate and find the optimal set of hyperparameters for the GPT application.
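
A simple way to structure that experimentation is a small grid search. The sketch below is generic Python, and fine_tune_and_evaluate is a hypothetical placeholder for whatever training and validation routine your stack provides:

```python
# Hedged sketch of a simple grid search over fine-tuning hyperparameters.
# fine_tune_and_evaluate() is a hypothetical placeholder for your own
# training and validation logic; only the looping structure is the point here.
import itertools

def fine_tune_and_evaluate(learning_rate, batch_size, weight_decay):
    # Placeholder: swap in real fine-tuning and validation here. Returning a
    # dummy score keeps the sketch runnable end to end.
    return -abs(learning_rate - 3e-5) - 0.001 * batch_size - weight_decay

grid = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [8, 16],
    "weight_decay": [0.0, 0.01],
}

best_score, best_params = float("-inf"), None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = fine_tune_and_evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```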

Iterative Feedback Loop

Continuously evaluate and refine the GPT application through an iterative feedback loop. This can include collecting user feedback, monitoring the application’s performance and incorporating improvements based on user interactions. Over time, this iterative approach helps maintain and enhance the application’s accuracy, relevance and user satisfaction.

User Experience and Deployment Considerations

Developers should not only focus on the underlying GPT models, but also on creating a seamless and engaging user experience for their applications. Additionally, deployment considerations play a vital role in ensuring smooth and efficient operation. Consider the following three points for user experience and deployment.

Prompt Design and Context Management

Craft meaningful and contextually appropriate prompts to guide user interactions with the GPT application. Provide clear instructions, set user expectations and enable users to customize and control the generated outputs. Effective prompt design contributes to a better user experience.
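
As a minimal sketch of what that can look like with LangChain (using the library's mid-2023 Python API, and assuming an OpenAI API key is configured in the environment), a prompt template can carry explicit instructions plus a user-controllable parameter:

```python
# Minimal LangChain sketch (API as of mid-2023 releases): a prompt template
# with explicit instructions and a user-controllable "tone" parameter.
# Assumes OPENAI_API_KEY is set in the environment.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["question", "tone"],
    template=(
        "You are a support assistant for an internal developer platform.\n"
        "Answer the question below in a {tone} tone, in at most three sentences.\n"
        "If you are not sure, say so instead of guessing.\n\n"
        "Question: {question}"
    ),
)

chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run(question="How do I roll back a failed deployment?", tone="concise"))
```

Keeping the instructions in the template, rather than scattered through application code, makes it easier to iterate on prompt wording without touching the rest of the app.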

Scalable Deployment

Consider deployment strategies that ensure the scalability and efficiency of the GPT application. Use cloud services, containerization and serverless architectures to effectively handle varying workloads and user demands.

Continuous Monitoring

Implement a robust monitoring system to track the performance and usage patterns of the GPT application. Monitor resource utilization, response times and user feedback to identify potential bottlenecks and areas for improvement.

Summary

By considering these three key aspects — quality data and diverse training, fine-tuning and model optimization, and user experience and deployment considerations — developers can build powerful GPT applications on the open source stack provided by LangChain. In an upcoming article, I’ll start exploring the potential of GPT models and LangChain through a worked example. I will also host a workshop on June 22 during which I will go through building a ChatGPT application using LangChain. You can sign up here.

The post Building GPT Applications on Open Source Stack LangChain appeared first on The New Stack.

]]>
Can DevEx Metrics Drive Developer Productivity? https://thenewstack.io/can-devex-metrics-drive-developer-productivity/ Wed, 07 Jun 2023 10:00:00 +0000 https://thenewstack.io/?p=22710223

Developer experience, as it centers on human beings, is inherently sociotechnical. Yet, much of the work of “DevEx” and developer

The post Can DevEx Metrics Drive Developer Productivity? appeared first on The New Stack.

]]>

Developer experience, as it centers on human beings, is inherently sociotechnical. Yet, much of the work of “DevEx” and developer productivity focuses solely on the technical — despite the long-held truth that happy workers are more productive. Technical leadership typically concentrates on measuring the output of developers or the time it takes for them to complete tasks — which makes for a whole lot of story points and not a lot of real change.

Last month, a research paper entitled “DevEx: What Actually Drives Productivity” went viral around the software consultancy world. It outlines an approach to understanding DevEx, and builds on a previously published actionable framework that combines developer feedback with data from engineering systems.

Neither paper provides a secret formula, but they aim to offer organizations potential areas to focus their measurements and improvements on. After all, developer experience and software delivery as a whole are dependent on factors at the individual, team and organizational levels.

Especially during a time of trying to do more with less, gaining insights into getting more out of the significant engineering cost center is a valuable endeavor. Here’s how.

What Is DevEx and How Can You Measure It?

“Developer productivity is more important than ever. I mean, everyone has been saying that forever, but companies right now are really focused on efficiency and doing more with the developers they have,” Abi Noda, CEO and co-founder of DX developer insights platform, told The New Stack.

At the same time, software development is ever more complex so that, “with all the different tools and technologies that developers use today, just doing your job is getting harder and harder,” he continued. “And then there’s also the shift to hybrid and remote work. People are trying to understand how does that affect developers and/or the productivity of their workforces.” This trifecta makes it the perfect time to dive into developer productivity and improving developer experience.

To the authors of this white paper, “Developer experience focuses on the lived experience of developers and the points of friction they encounter in their everyday work.” It’s not just about productivity, but increased efficiency, product quality and employee retention. DevEx has also been defined as encompassing how developers feel about, think about and value their work — not exactly easily measurable subjects, which may be why, unfortunately, most companies aren’t looking to measure them.

The authors coalesce around three dimensions of developer experience:

  • Feedback loops – the time developers spend waiting to get their work done, and how streamlined teams can shorten that time
  • Cognitive load – in the ever-growing complexity of the cloud native world, organizations should look to limit hurdles to delivering value to customers
  • Flow state – when developers “get in the zone,” with limited distractions — meetings, unplanned work, ad-hoc requests for help — they feel energized by a greater sense of purpose

These form the three angles of a triangle, all feeding into each other.

Early on, the paper cites a 2020 McKinsey study which revealed that companies with better work environments for their developers boasted dramatically increased developer velocity, which in turn correlated to four to five times the revenue of their competitors. It’s therefore presumed that the above three dimensions are highly influential to velocity.

What influences that developer experience comes down to 25 sociotechnical factors — including interruptions and friction from tools or processes — which are evaluated by survey responses. This data is then combined with existing data from tools, like issue trackers and CI/CD pipelines, as well as the traditional KPIs and OKRs. Another powerful DevEx metric, particularly during these leaner times, is Knowledge Discovery Efficiency or KEDE, which leverages repositories to identify developer knowledge gaps.

No matter which measurements work for your organization, it should be a sociotechnical blend of perceptual measurements — like how developers feel, as ascertained via semi-frequent surveys — and more concrete developer workflows.

Focus on the Person or the Team?

Developer experience is highly personal and contextually dependent, Noda said, which is why the framework is unique in focusing heavily on the individual. But that creates a challenge around how to measure the individual but work to improve the collective experience.

Indeed, the paper calls out surveys as an important “fast and accurate” measurement tool. After these carefully designed surveys — asking things like “Based on X, how likely are you to…” — are regularly run, break down results and Net Promoter Scores (NPS) by team and developer persona, advises the paper. Noda clarified in our interview that these surveys should be anonymized or aggregated. It remains unclear how anonymous it can be on the average “two-pizza team” of five to nine people, and if it really can be individually actionable to aggregate results.

A key measurement of developer experience is how good you perceive you are at your job — because feeling good at your job is highly motivational and signals both a lessened cognitive load and an optimized flow state. However this measurement brings its own slew of implicit biases that increase with intersections across demographic, role and experience.

After all, imposter syndrome is more likely if you are new to the tech industry or role and/or if you are from a marginalized group. Both of those circumstances would also make you feel less safe to reply honestly about flaws or hurdles. Add to all this, we are still in a time of tech layoffs, where morale may be down, but people may feel less safe to speak up. On the other hand, optimization, particularly for the individual’s flow state, would likely increase inclusion of neurodivergent developers.

All of these concerns should be considered within your DevEx survey design. The 2021 paper of the same authors “Actionable Framework for Understanding and Improving Developer Experience” is a more in-depth work following interviews with 21 developers and developer leads — though, it notes, despite efforts, it included only one woman. This paper cites psychological safety as the single most important factor affecting developer experience.

Psychological safety in this instance could be defined as feeling safe to speak frankly about your experience. “On teams with good psychological safety and culture, developers are more willing to voice and tackle problems in order to continuously improve developer experience,” the paper reads, while unsafer culture discourages developers from speaking up or trying to make proactive improvements.

Focus on Flow

Embracing your flow doesn’t just mean going fast — it’s as much about reducing friction and frustration for more focused and happy developers.

“Flow metrics is about finding a sustainable pace that allows you to keep going infinitely,” Sophia Ashley, scrum master at AND Digital, told The New Stack. Pointing to how flow metrics are often misunderstood, she said, “It’s not necessarily about speeding up. Yes, they can help you increase your velocity, but it’s about finding a pace that works for you,” making lead time also an individual metric. Once you’ve reached repeated consistency, she explained, you can then look to increase your pace, but at your own level of sustainability — endless growth is simply unsustainable.

In the capitalistic world of move fast and break things, she said that this more controlled growth can be a tough pill to swallow, but it falls in line with the change in an industry that’s embracing responsibility and environmental, social and governance or ESG goals. And it helps reduce the developer burnout that’s rampant in this industry.

Following the DevOps philosophy, for Ashley, flow metrics are teaching your team how to deliver sustainably. “A lot of companies want to do big bang releases,” and things break, she said. It’s more sustainable to do small releases to “teach teams to undo.”

Prior to joining the tech industry in 2018, Ashley was a physical therapist, from which she draws a lot of comparison, including post-injury training. “If they don’t exercise consistently, they will be stuck with their broken hip forever.” On tech teams, she continued, “Whatever we do, we stay flexible and we make changes that we can revert if needed, and that allows us ultimately to have this flow enabled that we don’t add damage to our company or environment.”

Progressive delivery is a series of technological solutions to help decrease cognitive load, allowing teams to roll back changes more easily. Observability and monitoring are also essential so bugs and causes of outages can be uncovered much faster.

Reflecting on the DevEx metrics triangle, Ashley said that it all comes back to that flow state. “Just being able to utilize your time well and keep working. That’s what developers want. Not being interrupted — context switching wastes a lot of time,” especially when developers are responsible for knowing several layers of the stack or are halted waiting for pull requests to be approved. “Work with users to understand the problems,” she said to shorten your feedback loops. And make sure you’re managing the developer cognitive load because “context switching like multitasking is not as efficient as doing one thing at a time.”

With her work as a consultant, she continuously runs some of the pulse surveys mentioned in the paper, asking:

  • Are you happy within the team?
  • Are you satisfied in your role?
  • Do you think you provide value?
  • Do you feel you are productive?

Is DevEx Just DevOps for the Individual?

It’s hard not to compare this DevEx approach to other widespread practices in the tech industry like DevOps and platform engineering. In part that’s because Nicole Forsgren is a prominent co-author of both these papers and of Accelerate, which is considered an essential DevOps text. But this DevEx paper also echoes the three goals of DevOps:

  • Shortening feedback loops with customers
  • Systems thinking and flow
  • Continuous experimentation and improvement

The difference is that, while they both aim to increase the velocity of the software development lifecycle, DevOps focuses on the team while DevEx focuses on the individual. But, of course, optimizing for more developers to reach their flow states in turn should reduce the time to deliver value to customers. And by delivering value to customers faster, this in turn tightens feedback loops, reduces developer frustration and more regularly offers that dopamine boost of doing work that matters.

As established in Accelerate, DORA metrics — deployment frequency, lead time for changes, mean time to recovery, and change failure rate — are as important as ever. DevEx just focuses on the individual’s contribution to these team, department or division metrics.
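
For teams that want to compute these from their own delivery data, a minimal sketch might look like the following; the record shapes and the 30-day window are illustrative assumptions, not a standard:

```python
# Sketch of computing the four DORA metrics from simple in-memory records.
# Field names and the 30-day window are hypothetical, purely for illustration.
from datetime import datetime, timedelta

now = datetime(2023, 6, 1)
window = timedelta(days=30)

deployments = [
    {"at": now - timedelta(days=d), "lead_time_hours": lt, "failed": f}
    for d, lt, f in [(1, 6, False), (3, 20, True), (7, 4, False), (12, 9, False)]
]
incidents = [{"opened": now - timedelta(days=3), "resolved": now - timedelta(days=2, hours=20)}]

recent = [d for d in deployments if now - d["at"] <= window]

deployment_frequency = len(recent) / 30                                  # deploys per day
lead_time_for_changes = sum(d["lead_time_hours"] for d in recent) / len(recent)
change_failure_rate = sum(d["failed"] for d in recent) / len(recent)
mttr_hours = sum(
    (i["resolved"] - i["opened"]).total_seconds() / 3600 for i in incidents
) / len(incidents)

print(deployment_frequency, lead_time_for_changes, change_failure_rate, mttr_hours)
```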

And then if you look at the next level up, the discipline of platform engineering observes and learns from the work of different teams to find behavioral patterns and especially blockers to the value flow chain. It aims to reduce, abstract and automate any demotivating, repetitive and non-differential work. It also further reduces context switching so developers stay focused on delivering value to the end users.

“Platform teams have to actually be understanding where the organization is at and what’s holding back productivity and make sure that they’re tackling those things and showing the impact of them by measuring and tying that back to the goals of the business,” Noda said. This is what distinguishes the platform teams that are adding value during economic downturn and the old-fashioned ones that just toss the platform over and are more likely to be cut right now.

Also, whether it’s borrowing developers, embedding within the app teams, or running lunch-and-learns and regular surveys, we know the biggest factor in the success of platform teams is shortening that feedback loop with internal developers, prioritizing them as your internal customers.

So as organizations look to increase developer productivity, at a time of likely reduced headcount, there could be a strong argument to examine the developer experience at three levels — individual, team and company-wide — to truly unlock the power of developer experience. And to run regular surveys that look to measure psychological safety, so the presence of problems is surfaced early and often at each tier.

The post Can DevEx Metrics Drive Developer Productivity? appeared first on The New Stack.

]]>
The Need to Roll up Your Sleeves for WebAssembly https://thenewstack.io/the-need-to-roll-up-your-sleeves-for-webassembly/ Mon, 05 Jun 2023 13:00:41 +0000 https://thenewstack.io/?p=22706865

We already know how putting applications in WebAssembly modules can improve runtime performance and latency speeds and compatibility when deployed.

The post The Need to Roll up Your Sleeves for WebAssembly appeared first on The New Stack.

]]>

We already know how putting applications in WebAssembly modules can improve runtime performance, latency and compatibility when deployed. We also know that WebAssembly has been used to improve application performance both in the browser and on the backend. But the day when developers can create applications in the language of their choice for distribution across any environment simultaneously, whether it’s on Kubernetes clusters, servers, edge devices, etc., remains a work in progress.

This status quo became that much more apparent from the talks and impromptu meetings I had during KubeCon + CloudNativeCon in April. In addition to a growing number of WebAssembly module and service providers and startups offering support for WebAssembly, it’s hard to find any organization that is not getting down to work to at least see how it works as a sandbox project, in anticipation of customers asking for or requiring it.

Many startups, established players and tool and platform providers are actively contributing to the common pool of knowledge by contributing to or maintaining open source projects, taking part in efforts such as the Bytecode Alliance or sharing their knowledge and experiences at conferences, such as the KubeCon + CloudNativeCon Europe co-located event Cloud Native Wasm Day. This collective effort will very likely serve as a catalyst so that WebAssembly soon moves past its current status as just a very promising new technology and begins to be used for what it’s intended for on a massive industry scale.

Indeed, WebAssembly is the logical next step in the evolution from running applications on specific hardware, running them on virtual machines, to running them in containers on Kubernetes, Torsten Volk, an analyst at Enterprise Management Associates (EMA), said. “The payout in terms of increased developer productivity alone justifies the initial investments that come with achieving this ultimate level of abstraction between code and infrastructure. No more library hell: No more debugging app-specific infrastructure. No more refactoring of app code for edge deployments. In general, no more wasting developer time on stuff other than writing code,” Volk said. “This will get us to a state where we can truly compose new applications from existing components without having to worry about compatibility.”

 

Work to Be Done

But until we get that point of developer-productivity nirvana, work needs to be done. “Now we need all-popular Python libraries to work on WebAssembly and integrations with key components of modern distributed apps, such as NoSQL storage, asynchronous messaging, distributed tracing, caching, etc.,” Volk said. “Luckily there’s a growing number of startups completing the ‘grunt work’ for us to make 2024 the year when WebAssembly really takes off in production.”

Collaboration, alliances and harmony in the community, especially in the realm of open source, will be critical. “The one thing I’ve learned from the container wars is that we were fighting each other too early in the process. There was this mindset that the winner would take all, but the truth is the winner takes all the burden,” Kelsey Hightower, principal developer advocate, Google Cloud, said during the opening remarks at KubeCon + CloudNativeCon Europe’s Cloud Native Wasm Day. “You will be stuck trying to maintain the standards on behalf of everyone else. Remember collaboration is going to be super important — because the price for this has to be this invisible layer underneath that’s just doing all of this hard work.”

At the end of the day, those writing software probably just want to use their favorite language and framework in order to do it, Hightower said. “How compatible will you be with that? Or will we require them to rewrite all the software?” Hightower said. “My guess is anything that requires people to rewrite everything is doomed to fail, almost guaranteed, and that there is no way that the world is going to stop innovating at the pace we’re on, where the world will stop and implement all the lower levels. So, it is a time to be excited, but understand what the goal is and make sure that this thing is usable and has tangible results along the way.”

During the sidelines of the conference, Peter Smails, senior vice president and general manager, enterprise container management, at SUSE, discussed how internal teams at SUSE shared an interest in Wasm without going into details about SUSE’s involvement. “WebAssembly has an incredibly exciting future and we see practical application of WebAssembly. I personally think of it as similar to being next-generation Java: it is a small, lightweight, fast development platform and, arguably, is an infrastructure that lets you write code and deploy it where you want and that’s pretty cool,” Smails told The New Stack.

In many ways, WebAssembly proponents face a chicken-and-egg challenge. After all, what developer would not want to be able to use the programming language of their choice to deploy applications for an environment or device without having to worry about configuration issues? What operations and security team would not appreciate a single path of deployment from finalized application code to any device or environment (including Kubernetes), securely and without the hassles of reconfiguring the application for each endpoint? But we are not there yet, and many risks must be taken and investments made before wide-scale adoption really does happen the way it should in theory.

“We have a lot of people internally very excited about it, but practically speaking, we don’t have customers coming to talk about this asking for the requirements — that’s why it’s in the future,” Smails said. “We see it more as a potentially exciting space because we’re all about infrastructure.”

Get the Job Done

Meanwhile, there is huge momentum to create, test and standardize the Wasm infrastructure to pave the way for mass adoption. This is thanks largely to the work of the open source community working on projects sponsored in-house or among the new tool provider startups that continue to sprout up, as mentioned above. Among the more promising projects discussed during the KubeCon + CloudNativeCon co-located event Cloud Native Wasm Day, Saúl Cabrera, a staff developer for Shopify, described how he is leading the development of Winch during his talk “The Road to Winch.” Winch is a compiler in Wasmtime created to improve application performance beyond what Wasm already provides. Offering an alternative to overcome the limitations of a baseline compiler, the WebAssembly Intentionally Non-Optimizing Compiler and Host (Winch) improves startup times of WebAssembly applications, Cabrera said. Benchmark results that demonstrate the touted performance metrics will be available in the near future, Cabrera said.

The post The Need to Roll up Your Sleeves for WebAssembly appeared first on The New Stack.

]]>
VeeamON 2023: When Your Nightmare Comes True https://thenewstack.io/veeamon-2023-when-your-nightmare-comes-true/ Fri, 02 Jun 2023 14:47:43 +0000 https://thenewstack.io/?p=22709684

Conferences can run the gamut from being poorly organized with a product focus on one end of the spectrum to

The post VeeamON 2023: When Your Nightmare Comes True appeared first on The New Stack.

]]>

Conferences can run the gamut from being poorly organized with a product focus on one end of the spectrum to offering both deep-dive and accessible talks chock full of information to solve real-world problems. Veeam’s annual user conference, VeeamON 2023, squarely falls into the latter category.

The key takeaway: By becoming more digitized, the amount of data organizations must manage and the number of security holes and attacks continues to explode. So, when, and not if, a ransomware or another attack shuts down your organization’s operations, you had better have a working disaster recovery system in place.

“The explosion of devices and sensors connected to IoT has increased massively the endpoints that must be managed, protected and made secure,” Anand Eswaran, CEO of Veeam, said during his keynote.

All told, the massive amount of new connections means the sheer volume of data being generated will skyrocket worldwide from 79 zettabytes today to 175 zettabytes by 2025, according to IDC numbers Eswaran discussed. “Digital transformation is happening in every single business and data is the key to covering digital transformation,” Eswaran said. “So, protecting data becomes life itself. It’s not a surprise then that cybercrime and ransomware targeting the data is exponentially on the rise.”

Much appreciated is how data and security trends were broken down into key data points and analyzed in function of how organizations are struggling and overcoming security threats, especially ransomware attacks. To wit, VeeamON marked the release of its annual Ransomware Trends Report which covered around 1,200 organizations that were victims of ransomware attacks. The insights Eswaran shared included how:

The majority of organizations seek “higher reliability and improved recoverability. The data says that four out of five companies felt that there was a gap between how quickly you need to recover versus how quickly you can have a big gap in reliability,” Eswaran said. With concerns about reliability, four out of five companies in the survey faced widening gaps between the amount of data businesses can afford to lose and how frequently data is protected.

Ransomware remains the top threat. In the survey, a staggering 85% of the respondents reported an attack during the past 12 months, and 17 respondents reported four or more attacks in the last couple of months. And 60% believe that significant improvement is needed in how the cyber and backup teams come together, given that backups are the first target of the attack almost 93% of the time.

Cyber insurance remains necessary, but finding viable plans for coverage is becoming more challenging — and more expensive. Premiums and deductibles are increasing, while coverage benefits become skimpier.

You don’t necessarily get your money back when you pay the ransom. “Paying ransomware does not ensure recoverability,” Eswaran said. According to the study, 21% of the respondents said their organization could not recover the data, while only 16% of the respondents reported that they were able to recover their data without paying the ransom (compared to 19% in the previous-year survey).

To recover without paying, your backups must survive. As Eswaran noted, 75% of organizations in the study lost some of their backup repositories during a data attack, and when that happened, 39% of backup repositories were lost. “Imagine two out of five files gone. Two out of five hard drives — gone. Two out of five of your family pictures — gone,” Eswaran said. “That’s a huge impact.”

The secret to survivable backups is immutability. “Most of you use immutable repositories in some way, but you are actually still unable to recover your backups without paying the ransom. And why is that?” Eswaran said. “It actually means that you need to pay a little more attention to the architecture of the platform… There is clearly [often] a gap between the promise and execution of when companies say they offer immutable storage.”

The secret to recoverability is portability. “While many large organizations have multiple data centers, which helps them do this better, many do not,” Eswaran said. “A hybrid approach and data portability are supercritical. It allows you to backup to and from anywhere and recover to and from anywhere.”

It is critical to not reinfect during the recovery process. “More than half the organizations run the risk of infection because they do not have the means to ensure they have clean data during recovery,” Eswaran said. “You need immutable and air-gapped backups. You need Hybrid IT architectures, which allows you to create data portability and you need a staged recovery to prevent reinfection.”

The Ransomware Elephant in the Room

Security attacks are not the only thing that can cause an organization to lose data, especially if proper disaster recovery is not in place. If your organization is running a data center, conceivable and real threats still include floods, fires and other natural disasters. Human error and sabotage are always a threat for data in the cloud or in data centers. But during the past few years, ransomware remains the mother of all threats. “For the last several years, we have asked the question, what’s the most common cause of outages?” Jason Buffington, vice president, market strategy, Veeam, said.

“Three years running, ransomware was the cause of the most impactful events and, for the last two years, the most common cause of outages as well,” said Buffington while discussing the report with Dave Russell, vice president, enterprise strategy, Veeam, in the eponymous talk “Ransomware Trends Report for 2023.”

But when it comes to investing in resiliency for proper backups and other ways to protect data against ransomware and other attacks, CTOs, CxOs and other stakeholders with purchasing power are seemingly investing more in protection, but the growth in spending does not seem to be exponential.

Citing Gartner and IDC data, Russell said security budgets in general are up this year by about three to four percent and are being “positively influenced.”

“There has been a lot of talk lately about how security budgets are getting positively influenced because of the cycle of trends to invest more and more in those areas. But in fact, on the recovery side, we’re seeing similar kinds of activities,” Russell said. “So, there is recognition that recovery plays a role in overall cyber resiliency.”

But when it comes to resiliency, money will eventually be spent regardless. “In cyber resiliency, you are either going to pay in advance or you’re gonna pay after the fact,” Buffington said. “So, if you don’t want to pay after the fact, i.e. ransomware or in downtime, then you better pay upfront.”

The post VeeamON 2023: When Your Nightmare Comes True appeared first on The New Stack.

]]>
The Cedar Programming Language: Authorization Simplified https://thenewstack.io/the-cedar-programming-language-authorization-simplified/ Fri, 02 Jun 2023 13:07:21 +0000 https://thenewstack.io/?p=22709583

Amazon Web Services open sourced Cedar this Spring, a language for helping developers control access to resources such as data,

The post The Cedar Programming Language: Authorization Simplified appeared first on The New Stack.

]]>

Amazon Web Services open sourced Cedar this spring, a language for helping developers control access to resources such as data, compute nodes in a cluster, or workflow automation components.

Mike Hicks, a senior principal applied scientist with Amazon Web Services, demoed Cedar’s core features for The New Stack at the Open Source Summit North America last month in Vancouver, BC.

“Basically, to write a permission system for your application, what you might do normally is to write a bunch of code to implement your permission system,” Hicks said. “But instead with Cedar, you can write Cedar policies, and you can delegate access requests to the Cedar authorization engine. There’s a bunch of reasons why you might want to do that.”

The authorization engine uses automated reasoning and intensive testing to ensure it’s correct, and its policies are ergonomic and easy to read and write, Hicks said. The language has deterministic low latencies; a developer’s policy set is analyzable, and it provides tools to help users find bugs.
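
To give a feel for that readability, here is a minimal sketch of what Cedar policies look like; the entity types and attributes (User, Action, Photo, tags) are hypothetical examples rather than anything from Hicks’ demo:

```
// Hypothetical example policies, just to show the shape of the language.
permit(
  principal == User::"alice",
  action == Action::"view",
  resource == Photo::"VacationPhoto94.jpg"
);

// A forbid always wins over a permit: block anything tagged confidential.
forbid(principal, action, resource)
when { resource.tags.contains("confidential") };
```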

Automated reasoning and intensive testing work in some respects as a way to improve the developer experience. Automated reasoning takes the burden off the developer to verify the correctness of software systems. Intensive testing looks at the robustness of software systems. With these integrations, such capabilities as authorization become more automated and reliable.

Opening Cedar means the community can start contributing features to it, such as bindings for multiple programming languages.

Cedar started its life as the policy language for Amazon Verified Permissions (AVP), now in private preview, Hicks said. AVP is a service for fine-grained permissions and authorizations within custom applications. So instead of writing authorizations inside Rust code, the developer may run the authorizations stored in that service.

Hicks said this is great when many applications want to share the same policy. It allows the developer to co-locate all the logging and auditing inside the cloud service.

But not everyone can use a cloud service. Some applications require the authorization engine to be local to the application, so they don’t have to pay for that round trip. Customers may also have lighter-weight use cases that they want to customize, for example, for different data models.

“And so we felt like open sourcing it is going to make those customer applications possible. And it’s going to allow us to take in community contributions and ideas to continue to make the language better.”

According to AWS, “Cedar is open-sourced under the Apache License 2.0 and includes the Cedar language specification and software development kit (SDK). The SDK provides libraries for authoring and validating policies, and authorizing access requests.”

Want to see another demo from AWS?

Check out: Amazon Web Services Open Sources a KVM-Based Fuzzing Framework

The post The Cedar Programming Language: Authorization Simplified appeared first on The New Stack.

]]>
How to Improve Operational Maturity in an Economic Downturn https://thenewstack.io/how-to-improve-operational-maturity-in-an-economic-downturn/ Wed, 31 May 2023 15:30:20 +0000 https://thenewstack.io/?p=22709506

Times are changing. The buoyant technology markets of the past few years could be set to take a dip. High

The post How to Improve Operational Maturity in an Economic Downturn appeared first on The New Stack.

]]>

Times are changing. The buoyant technology markets of the past few years could be set to take a dip. High inflation, rising interest rates and skills shortages are adding to the economic pain for many businesses. But this doesn’t mean they should go into survival mode as the storm clouds gather. For one thing, their customers won’t let them.

Instead, they must do more with less. When it comes to the digital systems on which so many organizations now depend, efficiency is the name of the game. And improving digital operations maturity is how organizations can win.

What Is the Digital Operational Maturity Model?

The pandemic has done much to change the way we live and work. It’s also taught businesses some hard lessons about the need for operational maturity and addressing mission-critical, urgent work. The hybrid cloud and microservices-based architectures that enabled so many to adapt with agility during the pandemic have introduced more IT complexity. That’s made it all but impossible to manage everything using traditional, centralized tools and processes. As has the trend toward decentralizing teams by different lines of business, each with their own toolchains and workflows.

The result is that when incidents inevitably happen, teams can be slow to coordinate and respond, potentially hurting the customer experience and bottom line. And as systems become more complex and interdependent, and users put more pressure on these systems, more failures will occur.

Operational maturity is the key to improving the speed at which organizations respond to mission-critical work. It defines how prepared ITOps teams are to detect, triage, mobilize, respond and resolve outages or system failures, and other critical, unplanned work. The lowest state of operational maturity is manual — where teams deal with incidents using slow, manual processes — and reactive, which sees teams constantly in firefighting mode.

Next up is “responsive.” Half of organizations are at this stage, with many in a state where they resolve issues as they occur. Few organizations are further along in the “proactive” and “preventative” states. While research shows that teams are improving, with most agreeing they’re better at resolving incidents now than they were 6 to 12 months previously, there’s room for growth. Doing so will not only help protect revenue and reputation by enabling more efficient incident response. Digital operations maturity is also linked to improved workday consistency and more even distribution of work among team members.

Why Does This Matter Now?

Improving operational maturity matters for several reasons. Firstly, when teams work more efficiently, they spend less time on firefighting incidents and more time on value-adding tasks such as innovation. You don’t want your technical teams wasting time on incidents they aren’t needed for or exerting energy on manual tasks that could be automated.

Second, digital Ops maturity helps to prevent burnout and resignations. Research shows that across all sectors, 54% of responders are interrupted outside of normal working hours — rising to 62% in retail. And 42% of responders worked more hours in 2021 than the previous year. That’s due in part to the fact that over 60% of responders had to respond to off-hours alerts at least once a week. These are all red flags for possible team burnout, and with the effects of the Great Resignation still being felt, organizations can ill afford to spend more time and money on recruitment and onboarding. If budgets are frozen, they will come under even more pressure to do more with less.

Finally, digital maturity means teams fix incidents more quickly, reducing impact on customer experience and bottom line. In preventative organizations, they will even be able to predict and address issues before customers become aware of them. That matters today more than ever in a world where a third of consumers would stop doing business with a brand they love after just one bad experience.

How Can Organizations Improve?

The first step toward improving digital operations maturity is understanding what level your organization is currently at. That will help benchmark against best practices, identify goals and metrics, and build a strategic roadmap for future success.

If your organization is at the most basic level of the maturity curve, “manual,” digital operations will be characterized by manual incident response and queued workflows such as ticket-based systems. There’ll be virtually no way to reach subject-matter experts (SMEs) in an urgent and timely manner. To become a “reactive” organization, you’ll need to establish better communications and notifications so that there are clear lines of escalation to SMEs that don’t require an intermediary. Also important is having visibility into what your current incident response process looks like. Empower your teams to codify tribal knowledge and, to address knowledge gaps, encourage certifications to make teams more efficient and effective.

To become a “responsive” organization, the focus should switch to standardization. That means consolidating onto a single digital operations platform to handle mission-critical work and ensuring all stakeholders can use it. This platform should help to provide a single source of truth for responders to gather information and communicate.

Moving from a “responsive” to a “proactive” organization requires optimization. This means identifying real-time work that can be automated so teams can work as efficiently as possible, reducing incident response times, and ensuring SMEs are free for high-value work.

How does an organization do this? A good place to start is by automating the event stream via enrichment techniques using machine learning and event rules, and connecting it to runbook automation. Then it’s time to map business services and dependencies, and standardize priority service responses. Understand which technical services roll up into critical customer-facing functionality and develop processes and protocols to manage these when something goes wrong.
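
In practice, even a simple event rule captures the idea: deduplicate noisy alerts, enrich them with ownership data and hand the critical ones to a runbook. The sketch below is generic Python with a hypothetical service catalog and runbook hook, not any particular vendor's API:

```python
# Generic sketch of the kind of event rule described above: deduplicate noisy
# alerts, enrich them with service ownership, and hand critical ones to a
# runbook. The service catalog and runbook hook are hypothetical.
SERVICE_CATALOG = {"checkout-api": {"team": "payments", "tier": "customer-facing"}}
seen_fingerprints = set()

def run_runbook(event):
    print(f"triggering runbook for {event['service']}: {event['summary']}")

def handle_event(event):
    fingerprint = (event["service"], event["check"])
    if fingerprint in seen_fingerprints:      # suppress duplicate alerts
        return None
    seen_fingerprints.add(fingerprint)

    event.update(SERVICE_CATALOG.get(event["service"], {}))  # enrich with ownership
    if event.get("tier") == "customer-facing" and event["severity"] == "critical":
        run_runbook(event)                    # automate the first response
    return event

handle_event({"service": "checkout-api", "check": "latency", "severity": "critical",
              "summary": "p99 latency above 2s"})
```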

The ultimate goal is mastering digital operations to become a “preventative” organization. To get there, you’ll have to be able to predict team health to prevent burnout and attrition. The key to doing so is collecting and analyzing data on which teams and services are under the most and least strain so you can better use in-house resources.

Finally, align technical metrics to business objectives, which should help you connect business goals to the way your technology operates. In doing so, it will be easier to encourage all stakeholders to be part of the improvement and goal-setting process.

Not Surviving but Thriving

In uncertain economic times, it can be tempting to do the bare minimum, to wait things out until better times arrive. But that’s a risky strategy, especially if your competitors use that time to improve their digital operations.

That’s why smart businesses plan for success during economic downturns. By enhancing your organization’s digital operations now, you will be best placed to capitalize when the market returns to growth. To learn more about digital operations maturity, check out what analysts from 451 Research have to say about PagerDuty.

The post How to Improve Operational Maturity in an Economic Downturn appeared first on The New Stack.

]]>
MongoDB vs. PostgreSQL vs. ScyllaDB: Tractian’s Experience https://thenewstack.io/mongodb-vs-postgresql-vs-scylladb-tractians-experience/ Wed, 31 May 2023 13:10:04 +0000 https://thenewstack.io/?p=22709469

Tractian is a machine intelligence company that provides industrial monitoring systems. Last year, we faced the challenge of upgrading our

The post MongoDB vs. PostgreSQL vs. ScyllaDB: Tractian’s Experience appeared first on The New Stack.

]]>

Tractian is a machine intelligence company that provides industrial monitoring systems. Last year, we faced the challenge of upgrading our real-time machine learning (ML) environment and analytical dashboards to support an aggressive increase in our data throughput, as we managed to expand our customer base and data volume by 10 times.

We recognized that to stay ahead in the fast-paced world of real-time machine learning, we needed a data infrastructure that was flexible, scalable and highly performant. We believed that ScyllaDB would provide us with the capabilities we lacked, enabling us to push our product and algorithms to the next level.

But you are probably wondering why ScyllaDB was the best fit. We’d like to show you how we transformed our engineering process to focus on improving our product’s performance. We’ll cover why we decided to use ScyllaDB, the positive outcomes we’ve seen as a result and the obstacles we encountered during the transition.

How We Compared NoSQL Databases

When talking about databases, many options come to mind. However, we started by deciding to focus on those with the largest communities and applications. This left three direct options: two market giants and a newcomer that has been surprising competitors. We looked at four characteristics of those databases — data model, query language, sharding and replication — and used these characteristics as decision criteria for our next steps.

First off, let’s give you a deeper understanding of the three databases using the defined criteria:

MongoDB NoSQL

  • Data model: MongoDB uses a document-oriented data model where data is stored in BSON (Binary JSON) format. Documents in a collection can have different fields and structures, providing a high degree of flexibility. The document-oriented model can accommodate essentially any data or relationship modeling.
  • Query language: MongoDB uses a custom query language called MongoDB Query Language (MQL), which is inspired by SQL but with some differences to match the document-oriented data model. MQL supports a variety of query operations, including filtering, grouping and aggregation (a brief sketch follows this list).
  • Sharding: MongoDB supports sharding, which is the process of dividing a large database into smaller parts and distributing the parts across multiple servers. Sharding is performed at the collection level, allowing for fine-grained control over data placement. MongoDB uses a config server to store metadata about the cluster, including information about the shard key and shard distribution.
  • Replication: MongoDB provides automatic replication, allowing for data to be automatically synchronized between multiple servers for high availability and disaster recovery. Replication is performed using a replica set, where one server is designated as the primary member and the others as secondary members. Secondary members can take over as the primary member in case of a failure, providing automatic failover.
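
To make the document model and MQL concrete, here is a minimal sketch using the pymongo driver; the connection string, database, collection and field names are illustrative assumptions, not Tractian's actual schema:

```python
from pymongo import MongoClient

# Connection string, database, collection and field names are assumptions.
client = MongoClient("mongodb://localhost:27017")
readings = client["monitoring"]["sensor_readings"]

# Documents in the same collection can have different shapes.
readings.insert_one({"sensor_id": "s-42", "ts": 1700000000, "rms": 0.18})
readings.insert_one({"sensor_id": "s-42", "ts": 1700000060, "rms": 0.21, "temp_c": 71.3})

# MQL filtering and aggregation: average RMS per sensor.
pipeline = [
    {"$match": {"sensor_id": "s-42"}},
    {"$group": {"_id": "$sensor_id", "avg_rms": {"$avg": "$rms"}}},
]
for doc in readings.aggregate(pipeline):
    print(doc)
```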

ScyllaDB NoSQL

  • Data model: ScyllaDB uses a wide column-family data model, which is similar to Apache Cassandra. Data is organized into rows and columns, and each row can carry its own set of columns. This model is designed to handle large amounts of data with high write and read performance.
  • Query language: ScyllaDB uses the Cassandra Query Language (CQL), which is similar to SQL but with some differences to match the wide column-family data model. CQL supports a variety of query operations, including filtering, grouping and aggregation (see the sketch after this list).
  • Sharding: ScyllaDB uses sharding, which is the process of dividing a large database into smaller parts and distributing the parts across multiple nodes (and down to individual cores). The sharding is performed automatically, allowing for seamless scaling as the data grows. ScyllaDB uses a consistent hashing algorithm to distribute data across the nodes (and cores), ensuring an even distribution of data and load balancing.
  • Replication: ScyllaDB provides automatic replication, allowing for data to be automatically synchronized between multiple nodes for high availability and disaster recovery. Replication is performed using a replicated database cluster, where each node has a copy of the data. The replication factor can be configured, allowing for control over the number of copies of the data stored in the cluster.
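
A comparable sketch for ScyllaDB, using the Python cassandra-driver, which speaks CQL to ScyllaDB as well as Cassandra; again, the keyspace, table and column names are assumptions:

```python
from datetime import datetime, timezone

from cassandra.cluster import Cluster

# The Python cassandra-driver speaks CQL, which ScyllaDB also implements.
cluster = Cluster(["127.0.0.1"])  # contact point is an assumption
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS monitoring
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Wide-column table: sensor_id is the partition key (it drives sharding),
# ts is the clustering column that orders rows inside each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS monitoring.sensor_readings (
        sensor_id text,
        ts        timestamp,
        rms       double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

session.execute(
    "INSERT INTO monitoring.sensor_readings (sensor_id, ts, rms) VALUES (%s, %s, %s)",
    ("s-42", datetime.now(timezone.utc), 0.18),
)
rows = session.execute(
    "SELECT ts, rms FROM monitoring.sensor_readings WHERE sensor_id = %s LIMIT 10",
    ("s-42",),
)
for row in rows:
    print(row.ts, row.rms)
```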

PostgreSQL

  • Data model: PostgreSQL uses a relational data model, which organizes data into tables with rows and columns. The relational model provides strong support for data consistency and integrity through constraints and transactions.
  • Query language: PostgreSQL uses structured query language (SQL), which is the standard language for interacting with relational databases. SQL supports a wide range of query operations, including filtering, grouping and aggregation (a short example follows this list).
  • Sharding: PostgreSQL does not natively support sharding, but it can be achieved through extensions and third-party tools. Sharding in PostgreSQL can be performed at the database, table or even row level, allowing for fine-grained control over data placement.
  • Replication: PostgreSQL provides synchronous and asynchronous replication, allowing data to be synchronized between multiple servers for high availability and disaster recovery. Replication can be performed using a variety of methods, including streaming replication, logical replication and file-based replication.
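
And for completeness, a relational sketch of the same idea with the psycopg2 driver; connection details, table and column names are assumptions:

```python
import psycopg2

# Connection parameters, table and column names are assumptions.
conn = psycopg2.connect("dbname=monitoring user=postgres host=localhost")

with conn, conn.cursor() as cur:
    # Relational model: a fixed schema, with a primary key constraint enforcing integrity.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sensor_readings (
            sensor_id text NOT NULL,
            ts        timestamptz NOT NULL,
            rms       double precision,
            PRIMARY KEY (sensor_id, ts)
        )
    """)
    cur.execute(
        "INSERT INTO sensor_readings (sensor_id, ts, rms) VALUES (%s, now(), %s)",
        ("s-42", 0.18),
    )
    # Standard SQL aggregation.
    cur.execute("SELECT sensor_id, avg(rms) FROM sensor_readings GROUP BY sensor_id")
    for sensor_id, avg_rms in cur.fetchall():
        print(sensor_id, avg_rms)
```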

What Were Our Conclusions of the Benchmark?

In terms of performance, ScyllaDB is optimized for high performance and low latency, using a shared-nothing architecture and multithreading to provide high throughput and low latencies.

MongoDB is optimized for ease of use and flexibility, offering a more accessible and developer-friendly experience and has a huge community to help with future issues.

PostgreSQL, on the other hand, is optimized for data integrity and consistency, with a strong emphasis on transactional consistency and ACID (atomicity, consistency, isolation, durability) compliance. It is a popular choice for applications that require strong data reliability and security. It also supports various data types and advanced features such as stored procedures, triggers and views.

When choosing between PostgreSQL, MongoDB and ScyllaDB, it is essential to consider your specific use case and requirements. If you need a powerful and reliable relational database with advanced data management features, then PostgreSQL may be the better choice. However, if you need a flexible and easy-to-use NoSQL database with a large ecosystem, then MongoDB may be the better choice.

But we were looking for something really specific: a highly scalable and high-performance NoSQL database. The answer was simple: ScyllaDB is a better fit for our use case.

MongoDB vs. ScyllaDB vs. PostgreSQL: Comparing Performance

After the research process, our team was skeptical about using just written information to make a decision that would shape the future of our product. We started digging to be sure about our decision in practical terms.

First, we built an environment to replicate our data acquisition pipeline, but we did it aggressively. We created a script to simulate a data flow bigger than the current one. At the time, our throughput was around 16,000 operations per second, and we tested the database with 160,000 operations per second (so basically 10x).

To be sure, we also tested the write and read response times for different formats and data structures; some were similar to the ones we were already using at the time.
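
The harness itself isn't shown here, but as a rough illustration of how a load test can estimate p90 latency and throughput, a sketch along these lines could be used; the fake_write stand-in and operation counts are assumptions, and a real test would drive the actual database driver from many concurrent workers:

```python
import random
import statistics
import time

def fake_write(payload: dict) -> None:
    """Stand-in for a real database write; replace with the driver call under test."""
    time.sleep(random.uniform(0.001, 0.004))

def run_load_test(ops: int = 10_000) -> None:
    latencies = []
    for i in range(ops):
        start = time.perf_counter()
        fake_write({"sensor_id": f"s-{i % 100}", "rms": random.random()})
        latencies.append(time.perf_counter() - start)

    p90 = statistics.quantiles(latencies, n=10)[8]  # 90th percentile cut point
    print(f"p90 latency: {p90 * 1000:.2f} ms")
    print(f"single-worker throughput: {len(latencies) / sum(latencies):.0f} ops/s")

if __name__ == "__main__":
    run_load_test()
```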

You can see our results below, comparing the new, optimized configuration on ScyllaDB against our old MongoDB setup, applying the tests mentioned above:

MongoDB vs. ScyllaDB P90 Latency (Lower Is Better)

MongoDB vs. ScyllaDB Request Rate/Throughput (Higher Is Better)

The results were overwhelming. With similar infrastructure costs, we achieved much better latency and capacity; the decision was clear and validated. We had a massive database migration ahead of us.

Migrating from MongoDB to ScyllaDB NoSQL

As soon as we decided to start the implementation, we faced real-world difficulties. Some things are important to mention.

In this migration, we added new information and formats, which affected all production services that consume this data directly or indirectly. They would have to be refactored by adding adapters in the pipeline or recreating part of the processing and manipulation logic.

During the migration journey, both services and databases had to be duplicated, since we could not use an outage window to swap between the old and new versions and validate our pipeline. That is one of the realities of critical real-time systems: an outage is never acceptable, even when you are fixing or updating the system.

The reconstruction also had to extend to our data science models, so that they could take advantage of the new format, increasing accuracy and computational performance.

Given these guidelines, we created two groups. One was responsible for administering and maintaining the old database and architecture. The other group performed a massive reprocessing of our data lake and refactored the models and services to handle the new architecture.

The complete process, from designing the structure to the final deployment and swap of the production environment, took six months. During this period, adjustments and significant corrections were necessary. You never know what lessons you’ll learn along the way.

NoSQL Migration Challenges

ScyllaDB can achieve this kind of performance because it is designed to take advantage of high-end hardware and very specific data modeling. The final results were astonishing, but it took some time to achieve them. Hardware has a significant impact on performance. ScyllaDB is optimized for modern multicore processors and uses all available CPU cores to process data. It uses hardware acceleration technologies such as AVX2 (Advanced Vector Extensions 2) and AES-NI (Advanced Encryption Standard New Instructions); it also depends on the type and speed of storage devices, including solid-state disks and NVMe (nonvolatile memory express) drives.

In our early testing, we messed up some hardware configurations, leading to performance degradation. When those problems were fixed, we stumbled upon another problem: data modeling.

ScyllaDB uses the Cassandra data model, which heavily dictates the performance of your queries. If you make incorrect assumptions about the data structures, queries or the data volume, as we did at the beginning, the performance will suffer.

In practice, the first proposed data format ended up exceeding the maximum size recommended for a ScyllaDB partition in some cases, which made the database perform poorly.

Our main difficulty was understanding how to translate our old data modeling to one that would perform on ScyllaDB. We had to restructure the data into multiple tables and partitions, sometimes duplicating data to achieve better performance.
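
As an illustration of that restructuring technique in general terms, the sketch below adds a coarse time bucket to the partition key so that no single partition grows without bound; it uses the Python cassandra-driver, and the keyspace, table and column names are assumptions rather than our actual schema:

```python
from datetime import datetime, timezone

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # contact point is an assumption
session = cluster.connect("monitoring")

# Before: one partition per sensor grows without bound and can exceed the
# recommended partition size for long-lived, high-frequency sensors.
# After: add a coarse time bucket (e.g. one day) to the partition key so each
# partition stays bounded; readers query a handful of buckets instead of one
# huge partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings_by_day (
        sensor_id text,
        day       date,
        ts        timestamp,
        rms       double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Writes now include the bucket; reads fan out over the days they need.
now = datetime.now(timezone.utc)
session.execute(
    "INSERT INTO sensor_readings_by_day (sensor_id, day, ts, rms) VALUES (%s, %s, %s, %s)",
    ("s-42", now.date(), now, 0.21),
)
```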

Lessons Learned: Comparing and Migrating NoSQL Databases

In short, we learned three lessons during this process: Some came from our successes and others from our mistakes.

When researching and benchmarking the databases, it became clear that many of the specifications and functionalities present in the different databases have specific applications. Your specific use case will dictate the best database for your application. And that truth is only discovered by carrying out practical tests and simulations of the production environment in stressful situations. We invested a lot of time, and our choice to use the most appropriate database paid off.

When starting a large project, it is crucial to be prepared for a change of route in the middle of the journey. If you developed a project that did not change after its conception, you probably didn’t learn anything during the construction process, or you didn’t care about the unexpected twists. Planning cannot completely predict all real-world problems, so be ready to adjust your decisions and beliefs along the way.

You shouldn’t be afraid of big changes. Many people were against the changes we were proposing because of the risk they carried and the inconvenience they caused developers (swapping a tool the team already knew well for one that was completely unfamiliar).

Ultimately, the decision was driven by its impact on our product improvements, not by its impact on our engineering team, even though it was one of the most significant engineering changes we have made to date.

It doesn’t matter what architecture or system you are using. The real concern is whether it will be able to take your product into a bright future.

This is, in a nutshell, our journey in building one of the bridges for the future of Tractian’s product. If you have any questions or comments, feel free to contact us.

The post MongoDB vs. PostgreSQL vs. ScyllaDB: Tractian’s Experience appeared first on The New Stack.

]]>
How to Protect Containerized Workloads at Runtime https://thenewstack.io/how-to-protect-containerized-workloads-at-runtime/ Tue, 30 May 2023 11:00:22 +0000 https://thenewstack.io/?p=22709118

Security is (finally) getting its due in the enterprise. Witness trends such as DevSecOps and the “shift left” approach — meaning to move security as early as possible into development pipelines. But the work is never finished.

Shift left and similar strategies are generally good things. They begin to address the long-overdue problem of treating security as a checkbox or a final step before deployment. But in many cases they are still not quite enough for the realities of running modern software applications. The shift-left approach might cover only the build and deploy phases, for example, and not apply enough security focus to another critical phase for today’s workloads: runtime.

Runtime security “is about securing the environment in which an application is running and the application itself when the code is being executed,” said Yugal Joshi, partner at the technology research firm Everest Group.

This emerging class of tools and practices aims to address three essential security challenges in the age of containerized workloads, Kubernetes and heavily automated CI/CD pipelines, according to Utpal Bhatt, CMO at Tigera, a security platform company.

First, the speed and automation intrinsic to modern software development pipelines create more threat vectors and opportunities for vulnerabilities to enter a codebase.

Second, the orchestration layer itself, like Kubernetes, also heavily automates the deployment of container images and introduces new risks.

Third, the dynamic nature of running container-based workloads, especially when those workloads are decomposed into hundreds or thousands of microservices that might be talking to one another, creates a very large and ever-changing attack surface.

“The threat vectors increase with these types of applications,” Bhatt told The New Stack. “It’s virtually impossible to eliminate these threats when focusing on just one part of your supply chain.”

Runtime Security: Prevention First

Runtime security might sound like a super-specific requirement or approach, but Bhatt and other experts note that, done right, holistic approaches to runtime security can bolster the security posture of the entire environment and organization.

The overarching need for strong runtime security is to shift from a defensive or detection-focused approach to a prevention-focused approach.

“Given the large attack surface of containerized workloads, it’s impossible to scale a detection-centric approach to security,” said Mikheil Kardenakhishvili, CEO and co-founder of Techseed, one of Tigera’s partners. “Instead, focusing on prevention will help to reduce attacks and subsequently the burden on security teams.”

Instead of a purely detection-based approach, one that often burns out security teams and leaves them seen as bottlenecks or inhibitors by the rest of the business, the best runtime security tools and practices, according to Bhatt, implement a prevention-first approach backed by traditional detection and response.

“Runtime security done right means you’re blocking known attacks rather than waiting for them to happen,” Bhatt said.

Runtime security can provide common services as a platform offering that any application can use for secure execution, noted Joshi, the Everest Group analyst.

“Therefore, things like identity, monitoring, logging, permissions, and control will fall under this runtime security remit,” he said. “In general, it should also provide an incident-response mechanism through prioritization of vulnerabilities based on criticality and frequency. Runtime security should also ideally secure the environment, storage, network and related libraries that the application needs to use to run.”

A SaaS Solution for Runtime Security

Put in more colloquial terms: Runtime security means securing all of the things commonly found in modern software applications and environments.

The prevention-first, holistic approach is part of the DNA of Calico Open Source, an open source networking and network security project for containers, virtual machines, and native host-based workloads, as well as Calico Cloud and Calico Enterprise, the latter of which is Tigera’s commercial platform built on the open source project it created.

Calico Cloud, a Software as a service (SaaS) solution focused on cloud native apps running in containers with Kubernetes, offers security posture management, robust runtime security for identifying known threats, and threat-hunting capabilities for discovering Zero Day attacks and other previously unknown threats.

These four components of Calico — securing your posture in a Kubernetes-centric way, protecting your environment from known attackers, detecting Zero Day attacks, and incident response/risk mitigation — also speak to four fundamentals for any high-performing runtime security program, according to Bhatt.

Following are the four principles for protecting your runtime.

4 Keys to Doing Runtime Security Right

1. Protect your applications from known threats. This is core to the prevention-first mindset, and focuses on ingesting reliable threat feeds that your tool(s) continuously check against — not just during build and deploy but during runtime as well.
Examples of popular, industry-standard feeds include network addresses of known malicious servers, process file hashes of known malware, and the OWASP Top 10 project. (A generic sketch of this kind of feed checking follows this list.)

2. Protect your workloads from vulnerabilities in the containers. In addition to checking against known, active attack methods, runtime security should proactively protect against vulnerabilities in the container itself and in everything the container needs to run, including the environment.

This isn’t a “check once” type of test, but a virtuous feedback loop that should include enabling security policies that protect workloads from any vulnerabilities, including limiting communication or traffic between services that aren’t known and trusted, or when a risk is detected (see the policy sketch after this list).

3. Detect and protect against container and network anomalous behaviors. This is “the glamorous part” of runtime security, according to Bhatt, because it enables security teams to find and mitigate suspicious behavior in the environment even when it’s not associated with a known threat, such as with Zero Day attacks.

Runtime security tools should be able to detect anomalous behavior in container or network activity and alert security operations teams (via integration with security information and event management, or SIEM, tools) to investigate and mitigate as needed.

4. Assume breaches have occurred; be ready with incident response and risk mitigation. Lastly, even while shifting to a prevention-first, detection-second approach, Bhatt said runtime security done right requires a fundamental assumption that your runtime has already been compromised (and will occur again). This means your organization is ready to act quickly in the event of an incident and minimize the potential fallout in the process.
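
To ground the first principle, here is a generic, vendor-neutral sketch of checking runtime observations against known-bad threat feeds; the feed URLs, data formats and inputs are placeholders, not any specific product’s API:

```python
import hashlib
import json
from pathlib import Path
from urllib.request import urlopen

# Hypothetical feed endpoints; real deployments would use curated, authenticated feeds.
MALICIOUS_IPS_FEED = "https://threat-feeds.example.com/bad-ips.json"
MALWARE_HASHES_FEED = "https://threat-feeds.example.com/bad-sha256.json"

def load_feed(url: str) -> set:
    """Download a JSON list of indicators and return it as a set for fast lookups."""
    with urlopen(url, timeout=10) as resp:
        return set(json.load(resp))

def sha256_of(path: Path) -> str:
    """Hash a binary on disk so it can be compared against known-malware hashes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_runtime(observed_connections, running_binaries):
    """Flag outbound destinations and process binaries that match known-bad indicators."""
    bad_ips = load_feed(MALICIOUS_IPS_FEED)
    bad_hashes = load_feed(MALWARE_HASHES_FEED)
    findings = []
    for ip in observed_connections:
        if ip in bad_ips:
            findings.append(f"connection to known-malicious address {ip}")
    for binary in running_binaries:
        if sha256_of(binary) in bad_hashes:
            findings.append(f"known-malware hash for {binary}")
    return findings
```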
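
And for the second principle’s idea of limiting traffic between services that aren’t trusted, one generic way to express that in plain Kubernetes is a NetworkPolicy. The sketch below uses the official Kubernetes Python client; the namespace and labels are assumptions, and platforms such as Calico layer richer policy types on top of this primitive:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
networking = client.NetworkingV1Api()

# Only allow ingress to the payments API from the front-end pods; all other
# ingress to the selected pods is denied. Namespace and labels are assumptions.
allow_frontend = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="allow-frontend-only", namespace="payments"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(match_labels={"app": "payments-api"}),
        policy_types=["Ingress"],
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[
                    client.V1NetworkPolicyPeer(
                        pod_selector=client.V1LabelSelector(match_labels={"app": "frontend"})
                    )
                ]
            )
        ],
    ),
)
networking.create_namespaced_network_policy(namespace="payments", body=allow_frontend)
```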

Zero trust is also considered a foundational strategy for runtime security tools and policies, according to Bhatt.

The bottom line: The perimeter-centric, detect-and-defend mindset is no longer enough, even if some of its practices are still plenty valid. As Bhatt told The New Stack: “The world of containers and Kubernetes requires a different kind of security posture.”

Runtime security tools and practices exist to address the much larger and more dynamic threat surface created by containerized environments. Bhatt loosely compared today’s software environments to large houses with lots of doors and windows. Legacy security approaches might only focus on the front and back door. Runtime security attempts to protect the whole house.

Bhatt finished the metaphor: “Would you rather have 10 locks on one door, or one lock on every door?”

The post How to Protect Containerized Workloads at Runtime appeared first on The New Stack.

]]>
Meet The Hobbyists Building Their Own DIY Cyberpunk Devices https://thenewstack.io/meet-the-hobbyists-building-their-own-diy-cyberpunk-devices/ Mon, 29 May 2023 13:00:29 +0000 https://thenewstack.io/?p=22708968

Back in the 1980s, William Gibson’s science fiction novels envisioned a coming dystopian future where cyberspace was accessed with head-mounted interfaces. In his 1982 story “Burning Chrome,” two hackers use them for “casing mankind’s extended electronic nervous system, rustling data and credit in the crowded matrix.” (While “high above it all burn corporate galaxies and the cold spiral arms of military systems…”)

“Burning Chrome” book cover (includes Red Star Winter Orbit)

But here in our own real-world future, enthusiastic hobbyists are now trying to make it all come true — or at least, jerry-rigging their own home-brewed “cyberdecks” for accessing the internet.

It’s the ultimate project for cyberpunk fans: cobbling together their own gear using repurposed leftovers and cheap surplus parts, plus all the right components from suppliers catering to makers.

But instead of cracking corporate data silos with a tricked-up Ono-Sendai “Cyberspace VII” (as William Gibson imagined), these enthusiasts are just sharing their creations on social media for bragging rights, and to celebrate their own maker successes. And like any home project, they also always seem to be learning an awful lot about technology.

It’s inspiring and it’s exciting. And it also looks like it’s a lot of fun…

Sunglasses at Night

For a head-mounted solution, some cutting-edge makers are now experimenting with the newly released Nreal Air (renamed Xreal) sunglasses, which come equipped with a small built-in (micro-OLED) screen. A USB-C cable connects them to your computer or smartphone.

Marketed as “AR glasses,” they display output from the company’s “spatial internet” app (currently available on “select” Android devices). But the glasses can also function as a head-mounted display, according to their website, transforming a laptop or monitor into what’s essentially a “cinema-sized 201-inch screen.”

And UK-based futurist Martin Hamilton calls new products like these “the real breakthrough” for finally jerry-rigging your own cyberdeck. Hamilton says in an email interview that Nreal’s micro-OLED screens can give cyberpunk makers a full HD display “with a decent field of view.”

UK-based Martin Hamilton made a cyberdeck with Nreal Air sunglasses powered by an old ThinkPad

“If you’ve used a VR headset then you’re probably expecting something similar — like strapping a phone to your face. These are different because the glasses weigh very little (79 grams, or around three ounces), due to all the clever stuff happening on your phone or computer. In particular, there’s no battery, as the glasses are powered by the same USB-C cable which feeds the video from your device.”

To create his own home-brewed cyberdeck, Hamilton bought a pair of the Nreal Air glasses, then hooked them up to a five-year-old ThinkPad laptop with a broken screen. “Right now this really feels like a hacker’s device,” he said in an email (which he composed using his home-brewed cyberdeck).

“ThinkPads are pretty good for this kind of thing because they’re designed to be repairable,” Hamilton wrote. After unscrewing the screen’s hinges to remove it — and detaching its cables — it’s a self-contained unit “without any unsightly gaps.”

Instead of wearing the sunglasses over his prescription eyeglasses, he was even able to purchase prescription lens inserts from Nreal’s official partner.

Hamilton shared his adventure with other DIY-cyberpunk enthusiasts in Reddit’s Cyberdeck subreddit. (“The era of virtual reality is coming,” says the subreddit’s description, “so it is also time for cyberdecks to come…”)

He’s calling his new ThinkPad-derived cyberdeck a “ThinkDeck,” telling the forum that he’s been “using the glasses as a big head mount display,” for everything from coding and sys-admin work to sending email, surfing the web, and watching videos. (“You wouldn’t want to wear the glasses for more than about an hour at a time, but then you should probably be getting a screen/movement break at this point anyway.”)

ThinkPad + Nreal Air = ThinkDeck
by u/martin_hamilton in cyberDeck

There are also practical considerations. Hamilton wonders whether governments and corporations will demand that their staff start using these eyeglass-based interfaces (with no screens) for the extra privacy. In a world where biometric fingerprint scanners already control access to data on encrypted partitions, wouldn’t this be the next logical step?

“You can just plug in a screen when you want one,” Hamilton said — for example, by connecting a projector for “a wall-sized display that other people can also see.”

And yes, he told me, it does feel like something out of a William Gibson story. Writing code in Linux, “my field of view is full of terminal windows and debug output,” Hamilton writes, adding that this “seems appropriately cyberpunk.”

In William Gibson’s novel Neuromancer, the protagonist named his cyberdeck “Hosaka” — so Hamilton has done the same.

“It was either that or Ono-Sendai Cyberspace VII, but that’s a bit of a mouthful…”

A Yearned-for Future

Hamilton isn’t the only one home-brewing their own technology. Belgium-based Ken Van Hoeylandt has built his own tiny handheld PC by crafting a custom 3D-printed case for his Raspberry Pi CM4 (and a “Raspberry Pad” screen from BIGTREETECH), hooking everything up to a modified Bluetooth mini keyboard.

Decktility – An open source/hardware handheld PC
by u/ByteWelder in cyberDeck

In Detroit, a turntablist and music producer named “DJ Vulchre” has been uploading videos of their own home-brewed cyberdecks, the latest made with a GOLE1 Pro pocket-sized PC and a lens to magnify the interface for their music software.

And Hong Kong-based YouTuber Winder Sun has built his own small pocket PC. He started with an 8-inch touchscreen display from component vendor Elecrow, then mounted it in a hollowed-out portable radio case, along with a keyboard and a small portable charger.

Sun is now proudly using it to write code, including mods for the space exploration game No Man’s Sky. In the video, he jokes that it’s “a Cyberdeck That Should Go Straight to E-Waste” but adds that it “feels like a very cyberpunk thing to do…”

“This thing is the jankiest thing ever, and I love it. Working with this in a dilapidated concrete jungle was a delight… At least I look like a cyberpunk now, and in my sick twisted mind it’s worth everything.”

“I think the cyberpunk community yearns for this future we never got,” tinkerer Brendan Charles told GameSpot last month, “and making these kinds of projects allows us to make it a reality.”

Charles built a battery-powered micro-PC out of a 1990s-era “Talking Whiz Kid” toy, learning everything he needed to know along the way about soldering, sanding, painting and 3D printing, and even some basic electronics. “You can find premade modules and connectors to do almost anything you want, from LCD displays, to controllers, to battery packs,” Charles tells GameSpot.

My quarantine project: The Ceres 1, a battery powered portable PC
by u/ThisIsTheNewSleeve in cyberDeck

Hackaday Prize promotional poster

GameSpot described Charles as part of “an entire community of talented builders using tiny computers like the Raspberry Pi to build the cyberpunk setups of their dreams.” And the tech-projects site Hackaday even has its own section dedicated to homemade cyberdecks. It notes a popular feeling that a true cyberdeck should be “a custom rig built up of whatever high-tech detritus the intrepid hacker can get their hands on.”

And along those lines, the site recently featured a maker who created a mobile satellite-monitoring station from a touch-screen computer from a police cruiser in the early 2000s.

A home-brewed pocket computer also placed in an early round of the Hackaday prize competition (which culminates with a $50,000 prize in September) when maker Spider Jerusalem wrapped a 3D-printed case around a Raspberry Pi (4) board connected to an LCD screen and a full QWERTY keyboard. “It’s a useful tool if you need to interface with a server on the go or do some low-level network diagnostics without carrying a whole laptop around,” Hackaday suggested.

When you’re brewing your own technology, the possibilities are endless. A maker named “Frumthe Fewtcha” even built a ChatGPT-enabled smartwatch that could answer any question, according to their recent video on YouTube. The answers appear as text on the watch’s display — and are also piped as audio into earbuds.

Writing from his home-grown cyberdeck, Hamilton said he felt like we’ve finally achieved a piece of that future that we were always promised. “In the almost 40 years since Neuromancer was published it feels like the world has caught up with William Gibson’s imagination, from mRNA-based gene editing to Large Language Models that seem almost sentient.”

But he also believes there are some practical advantages to a world where you can build your own head-mounted cyberdeck. “I’ve also spent a lot of those 40 years hunched over laptop screens, and it’s really liberating to be able to move your head around to wherever is comfortable!”

The post Meet The Hobbyists Building Their Own DIY Cyberpunk Devices appeared first on The New Stack.

]]>